web-crawler

scraping and parsing Google data like PageRank and more for a domain

Hi, I need to scrape/parse some search-engine-related data for a given domain name (site). I need: Google PageRank (only for the domain name, not each page), number of indexed results/pages (Google, Bing), number of backlinks (Google, Bing, Yahoo), traffic rank (Alexa), and a site thumbnail. Could you provide me some pointers on where...

any suggestions for getting the data for the web portal?

I have launched a new movie-based web portal; I have completed the programming part and made the site dynamic. My question: how can I get the data about movies in different languages? For example, www.imdb.com has a huge database collection. Is there any web crawling methodology where we can get it? Or a dirty method of complete data ...

Intelligently extracting tags from blogs and other web pages

I'm not talking about HTML tags, but tags used to describe blog posts, or YouTube videos, or questions on this site. If I was crawling just a single website, I'd just use an XPath to extract the tag, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed. I...
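One heuristic, not a general solution, is to look for the markup conventions many blog platforms share: <a rel="tag"> links, with the keywords meta element as a fallback. A minimal PHP sketch of that idea, reusing the extract_tags() name from the question:

    <?php
    // Heuristic tag extraction: look for <a rel="tag"> links (a convention
    // many blog platforms use) and fall back to <meta name="keywords">.
    // Covers a lot of blogs but is not a universal solution.
    function extract_tags($html) {
        $doc = new DOMDocument();
        @$doc->loadHTML($html);              // tolerate messy real-world markup
        $xpath = new DOMXPath($doc);

        $tags = array();
        foreach ($xpath->query('//a[@rel="tag"]') as $a) {
            $tags[] = trim($a->textContent);
        }
        if (empty($tags)) {
            foreach ($xpath->query('//meta[@name="keywords"]/@content') as $attr) {
                $tags = array_map('trim', explode(',', $attr->nodeValue));
            }
        }
        return array_values(array_unique(array_filter($tags)));
    }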

Mod Rewrite Producing 404 When Crawled (works fine when viewed in browser)

I have the following code in my .htaccess: RewriteEngine On RewriteBase / RewriteRule ^index\.php$ - [L] RewriteCond %{REQUEST_FILENAME} !-f RewriteCond %{REQUEST_FILENAME} !-d RewriteRule . /index.php [L] My pretty-link converting is done on my index.php. If the user is looking for something that doesn't exist, a 404 Header is produc...
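For readability, here are the same .htaccess rules from the excerpt laid out one per line; the comments are my reading of each rule:

    # Leave requests for index.php itself alone
    RewriteEngine On
    RewriteBase /
    RewriteRule ^index\.php$ - [L]
    # If the requested path is not an existing file or directory...
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    # ...route the request to index.php, which does the pretty-link handling
    RewriteRule . /index.php [L]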

how to crawl a 403 forbidden SNS

I'm crawling an SNS with a crawler written in Python. It worked for a long time, but a few days ago the webpages fetched by my servers started coming back ERROR 403 FORBIDDEN. I tried changing the cookie, changing the browser, and changing the account, but all failed. It seems that the forbidden servers are in the same network segment. What can I do? Steal...

In robots.txt, what will Disallow: /?s block?

What will this line block when the search engine crawls the website? Disallow: /?s ...
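Assuming the standard prefix matching that the major crawlers implement, the rule blocks any URL whose path-plus-query begins with /?s; the example URLs below are mine:

    Disallow: /?s

    http://example.com/?s=keyword         blocked      (begins with /?s)
    http://example.com/?sort=date         blocked      (also begins with /?s)
    http://example.com/page/?s=keyword    not blocked  (begins with /page/, not /?s)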

Perl Module to get all pages of a website?

Is there a module out there that can give me links to all the pages a website has? Why I need it: I want to crawl some sites and search for tags in them; searching only the main page is not enough. Thanks, ...

Protecting email addresses on HTML pages from email spiders

How do you prevent emails being gathered from HTML pages by email spiders? Does mailto: linking them increase the likelihood of them being picked up? I think I read somewhere about a built-in Ruby function that confuses email spiders by decimal-encoding the email address - can anyone link me to some documentation or tell me how effective...
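One common trick, sketched here in PHP rather than Ruby, is to emit every character of the address as an HTML decimal entity: browsers render it normally, while naive pattern-matching harvesters often miss it. The helper name below is mine:

    <?php
    // Encode each character of an email address as an HTML decimal entity.
    // The browser still renders user@example.com, but the raw source
    // contains only &#117;&#115;... which simple regex harvesters miss.
    function obfuscate_email($email) {
        $encoded = '';
        foreach (str_split($email) as $char) {
            $encoded .= '&#' . ord($char) . ';';
        }
        return $encoded;
    }

    // Both the mailto: target and the visible text are encoded.
    $addr = obfuscate_email('user@example.com');
    echo '<a href="mailto:' . $addr . '">' . $addr . '</a>';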

What languages are good for writing a web crawler?

I have substantial PHP experience, although I realize that PHP probably isn't the best language for a large-scale web crawler because a process can't run indefinitely. What languages do people suggest? ...

Robots.txt not working

Hi, I have used robots.txt to restrict one of the folders on my site. The folder contains the sites that are under construction. Google has indexed all those sites, which are in the testing phase, so I used robots.txt. I first submitted the site, and robots.txt is enabled. Now the status is success for www.mysite.com/robots.txt. But the goog...
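For reference, a robots.txt that blocks crawling of a single folder looks like this; the folder name /under-construction/ is a placeholder, not taken from the question. Note that robots.txt only stops future crawling: pages Google has already indexed typically need a removal request in Webmaster Tools as well.

    User-agent: *
    Disallow: /under-construction/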

get div around searched keyword (file_get_contents('url'))

So I'm creating a web crawler and everything works; I've only got one problem. With file_get_contents($page_data["url"]); I get the content of a webpage. This webpage is scanned when one of my keywords exists on the webpage. $find = $keywords; $str = file_get_contents($page_data["url"]); if(strpos($str, $find) == true) When I want to insert...
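Two details worth sketching here: strpos() should be compared with !== false (a keyword found at position 0 would otherwise count as "not found"), and the surrounding div can be pulled out with DOMXPath. The variable names follow the excerpt; the rest is an assumption about what the insert step should do:

    <?php
    $str  = file_get_contents($page_data["url"]);
    $find = $keywords;

    // Strict comparison: a match at position 0 is still a match.
    if (strpos($str, $find) !== false) {
        $doc = new DOMDocument();
        @$doc->loadHTML($str);                 // tolerate messy real-world HTML
        $xpath = new DOMXPath($doc);

        // Outermost <div> whose text contains the keyword
        // (assumes $find contains no double quotes).
        $divs = $xpath->query('//div[contains(., "' . $find . '")]');
        if ($divs->length > 0) {
            echo $doc->saveHTML($divs->item(0));
        }
    }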

Visit Half Million Pages with Perl

Currently I'm using Mechanize and the get() method to get each site, and checking each main page with the content() method for something. I have a very fast computer and a 10Mbit connection, and still it took 9 hours to check 11K sites, which is not acceptable. The problem is the speed of the get() function, which, obviously, needs to get the pag...
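Serial get() calls are the bottleneck: most of those 9 hours is time spent waiting on the network one request at a time, so the usual fix is to fetch many pages concurrently (in Perl, for example, with Parallel::ForkManager or LWP::Parallel). Since most of this page leans on PHP, here is the same idea sketched with PHP's curl_multi; treat it as an illustration of the pattern rather than the Perl answer:

    <?php
    // Fetch a batch of URLs concurrently instead of one at a time.
    // For half a million pages you would call this in chunks of, say, 50 URLs.
    function fetch_all(array $urls) {
        $mh = curl_multi_init();
        $handles = array();

        foreach ($urls as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 15);
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }

        // Drive all transfers until every handle has finished.
        do {
            curl_multi_exec($mh, $running);
            curl_multi_select($mh);
        } while ($running > 0);

        $results = array();
        foreach ($handles as $url => $ch) {
            $results[$url] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
        return $results;
    }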

robots.txt: how to stop engines from crawling URLs with “http: in the URL

Disallow: /*“http: is what I've been using - my guess is I may need to escape the quotation mark somehow. In Google Webmaster Tools, it's not even reading that quotation mark (where it allows you to see the robots.txt file and test it on a few URLs). On Google Webmaster Tools, it displays the robots.txt file without the quotes for ...
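One thing worth checking, hedged since crawlers differ in how they normalise URLs: the pattern is matched against the URL as it is actually requested, where the curly quote is percent-encoded, so a rule written with the raw “ character may never match. The encoded form would look like this; %E2%80%9C is the UTF-8 percent-encoding of the left curly quote:

    User-agent: *
    # “ appears as %E2%80%9C in the requested URL
    Disallow: /*%E2%80%9Chttp: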

Which information is stored by Google crawler?

.. and how does the web crawler infer the semantics of the information on the website? List the ranking signals in a separate answer. ...

Libraries/Tools for Website Parsing

I would like to start working on parsing large numbers of raw HTML pages into semantic data structures. Just interested in the community's opinion on the various tools available for such a task, particularly useful libraries in any language. So far I'm planning on using Hadoop to manage a lot of the processing, but curious about alter...

Which web language can be used for data mining or web crawling

If I want to build a complex website like Google News, which gathers data from other websites (data mining, crawling), in which language should I build the website? Currently I know only PHP. Can I do that in PHP ...

What tools or languages do I need, or how can I build a site like Google News

I have to build a website where I need to crawl, to filter, or you can say read, 50 websites. Then after reading those websites I need to filter the news, e.g. news related to Mercedes-Benz, and then I need to display it on the website with reference to the original source. Basically, what Google News is doing. Currently I know PHP and can b...
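Yes, PHP can do this. The simplest version does not crawl raw HTML at all: it reads each site's RSS/Atom feed and filters headlines by keyword, keeping the link back to the original source. A minimal sketch; the feed URLs are placeholders:

    <?php
    // Read a list of RSS feeds and keep only items whose title mentions the
    // keyword, remembering the link so the original source can be credited.
    $feeds = array(
        'http://example.com/news/rss',       // placeholder feed URLs
        'http://example.org/autos/feed',
    );
    $keyword = 'Mercedes-Benz';

    $matches = array();
    foreach ($feeds as $feed) {
        $xml = @simplexml_load_file($feed);
        if ($xml === false) {
            continue;                        // skip feeds that fail to load or parse
        }
        foreach ($xml->channel->item as $item) {
            if (stripos((string)$item->title, $keyword) !== false) {
                $matches[] = array(
                    'title'  => (string)$item->title,
                    'link'   => (string)$item->link,   // link back to the source
                    'source' => $feed,
                );
            }
        }
    }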

web crawler performance

I am interested to know, in a very general situation (a home-brew amateur web crawler), what the performance of such a crawler will be. More specifically, how many pages can a crawler process? When I say home-brew, take that in all senses: a 2.4GHz Core 2 processor, written in Java, a 50Mbit internet connection, etc., etc. Any resources you may share in this r...

pcntl_fork() function question

Good morning Stack Overflow, I'm still busy with my web crawler and I just need some last help. Because crawling the web can cost a lot of time, I want to let pcntl_fork() help me by creating multiple children to split my code into parts. Master - crawling the domain. Child - when receiving a link, the child must crawl the link found on the domain ...
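A minimal sketch of the master/child split described above, using pcntl_fork() and pcntl_waitpid(); crawl_link() and the list of links are placeholders for the crawler's own code:

    <?php
    // Master finds links on the domain; each link is handed to a forked child.
    $links = array('http://example.com/a', 'http://example.com/b');
    $children = array();

    foreach ($links as $link) {
        $pid = pcntl_fork();
        if ($pid == -1) {
            die("could not fork\n");
        } elseif ($pid == 0) {
            // Child: crawl one link, then exit so it never falls back
            // into the master's loop.
            crawl_link($link);              // placeholder for the real work
            exit(0);
        } else {
            // Master: remember the child's PID and keep handing out links.
            $children[] = $pid;
        }
    }

    // Master waits for every child to finish.
    foreach ($children as $pid) {
        pcntl_waitpid($pid, $status);
    }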

can someone suggest a web spider?

Is there a web spider which can grab the contents of forums? My company does not provide an internet connection, so I want to grab the threads of a forum, then I can have a look at the contents at the company. I have tried WebLech, but it can only grab static pages. ...