web-crawler

How to force a page to be removed from the search engine index?

Situation: Google has indexed a page in a forum. The thread is now deleted. Can I make Google and other search engines delete the cached copy, and if so, how? I doubt they would have anything against that, since the linked page does not exist anymore, and keeping the index updated and valid should be in their best interests. Is this possib...
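For context on the mechanics: if the deleted thread returns a proper 404 or, better, 410 status, crawlers eventually drop it on their own, and Google's webmaster console offers a URL removal tool to speed that up. A minimal sketch of answering 410 Gone for removed threads, where the set of deleted paths is a hypothetical stand-in for however the forum tracks them:

    # Sketch: answer 410 Gone for deleted forum threads so search
    # engines drop them from the index. DELETED_PATHS is hypothetical.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DELETED_PATHS = {"/forum/thread/12345"}

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path in DELETED_PATHS:
                # 410 signals permanent removal more strongly than 404.
                self.send_response(410)
                self.end_headers()
                self.wfile.write(b"This thread has been removed.")
            else:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"OK")

    HTTPServer(("", 8000), Handler).serve_forever()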

Detecting honest web crawlers

I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like 'bot'. But that seems awkward, incomplete, and unmaintainable. So does anyo...
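One common refinement over bare keyword matching is to verify the big crawlers by reverse DNS, the method Google documents for Googlebot. A rough sketch; the keyword list and hostname suffixes are illustrative assumptions, not a complete registry:

    # Sketch: flag likely crawlers by User-Agent keywords, then verify
    # the claim with a reverse/forward DNS check. Lists are examples.
    import socket

    BOT_KEYWORDS = ("bot", "crawler", "spider", "slurp")
    VERIFIED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

    def looks_like_bot(user_agent):
        ua = user_agent.lower()
        return any(k in ua for k in BOT_KEYWORDS)

    def verified_crawler(ip):
        try:
            host = socket.gethostbyaddr(ip)[0]    # reverse lookup
            forward = socket.gethostbyname(host)  # forward-confirm it
        except (socket.herror, socket.gaierror):
            return False
        return forward == ip and host.endswith(VERIFIED_SUFFIXES)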

How to prevent robots.txt from being carried over from staging to production?

It has happened in the past that one of our IT specialists accidentally moved the robots.txt from staging to production, blocking Google and others from indexing our customers' site in production. Is there a good way of managing this situation? Thanks in advance. ...
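One way to sidestep the copy-the-wrong-file problem entirely is to generate robots.txt at request time based on the environment, so there is nothing to move between staging and production. A minimal sketch, assuming a Flask app and an APP_ENV environment variable (both assumptions, not part of the question):

    # Sketch: serve robots.txt dynamically so staging always blocks
    # crawlers and production always allows them. APP_ENV is a
    # hypothetical variable set by the deployment.
    import os
    from flask import Flask, Response

    app = Flask(__name__)

    @app.route("/robots.txt")
    def robots():
        if os.environ.get("APP_ENV") == "production":
            body = "User-agent: *\nDisallow:\n"    # allow everything
        else:
            body = "User-agent: *\nDisallow: /\n"  # block everything
        return Response(body, mimetype="text/plain")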

Detecting CacheBuster querystrings when crawling a page

I've put together a fairly simple crawling engine that works quite well and for the most part avoids getting stuck in circular loop traps (i.e., Page A links to Page B and Page B links back to Page A). The only time it gets stuck in this loop is when both pages link to each other with a cachebuster querystring; basically, it is a unique querys...
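A common mitigation is to canonicalize URLs before checking the visited set, dropping query parameters that look volatile. A sketch, where the notion of which parameter names count as cachebusters is a heuristic assumption:

    # Sketch: canonicalize URLs so cachebuster querystrings do not
    # defeat the visited-set check. The key list is an assumption.
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    CACHEBUSTER_KEYS = {"cb", "cachebuster", "ts", "timestamp", "rand", "_"}

    def canonicalize(url):
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query)
                if k.lower() not in CACHEBUSTER_KEYS]
        kept.sort()  # parameter order should not create "new" URLs
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(kept), ""))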

Web crawlers and Google App Engine Hosted applications

Is it even possible to run a web crawler on GAE alongside my app, considering that I am running the free startup version? ...

Saving / mirroring / crawling web pages that use JavaScript to generate content

I want to download web pages that use JavaScript to output the data. Wget can do everything else but run JavaScript. Even something like firefox -remote "saveURL(www.mozilla.org, myfile.html)" would be great (unfortunately, that kind of command does not exist). ...
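Nothing in wget itself executes JavaScript, but driving a real browser gets close to the firefox -remote idea. A minimal sketch using Selenium (an assumption; the question does not mention it) to save the rendered DOM:

    # Sketch: render JavaScript in a real browser via Selenium, then
    # save the resulting DOM. Selenium plus a local Firefox/geckodriver
    # install are assumptions.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get("https://www.mozilla.org")
        with open("myfile.html", "w", encoding="utf-8") as f:
            f.write(driver.page_source)  # DOM after scripts have run
    finally:
        driver.quit()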

How can I make this recursive crawl function iterative?

For the sake of learning and performance, given this recursive web-crawling function (which crawls only within the given domain), what would be the best approach to make it run iteratively? Currently, by the time it finishes, Python has climbed to using over 1 GB of memory, which isn't acceptable for running in a shared environme...
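The usual transformation is to replace the call stack with an explicit frontier (a queue or stack) plus a visited set; memory then scales with the frontier rather than recursion depth. A generic sketch, since the original function isn't shown in the excerpt; fetch() and extract_links() stand in for whatever it used:

    # Sketch: iterative crawl with an explicit frontier instead of
    # recursion, restricted to the start URL's domain.
    from collections import deque
    from urllib.parse import urlsplit

    def crawl(start_url, fetch, extract_links):
        domain = urlsplit(start_url).netloc
        frontier = deque([start_url])
        visited = set()
        while frontier:
            url = frontier.popleft()   # popleft() = BFS; pop() = DFS
            if url in visited:
                continue
            visited.add(url)
            html = fetch(url)
            for link in extract_links(html, url):
                if urlsplit(link).netloc == domain and link not in visited:
                    frontier.append(link)
        return visited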

Crawling The Internet

Hi all, I want to crawl for specific things: events that are taking place, like concerts, movies, art gallery openings, etc. Anything that one might spend time going to. How do I implement a crawler? I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/). Are there others? What opinions do...

Best Way to automatically find links to your content?

So, here is the task I've found myself thinking of. Pretend for a moment that I have a large body of content. I want to see which websites are linking to my content. I know that I could look into TrackBack or PingBack, but what about sites that aren't using tools capable of dealing with those? It would seem that some form of web crawler t...
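Short of crawling the whole web, the cheapest signal is usually your own access logs: anyone who follows a link to you sends a Referer header. A rough sketch over a combined-format log file; the log path and format are assumptions:

    # Sketch: mine referring sites out of an access log instead of
    # crawling for backlinks. In the combined log format the referrer
    # is the second-to-last quoted field.
    import re
    from collections import Counter
    from urllib.parse import urlsplit

    referrers = Counter()
    with open("/var/log/apache2/access.log") as log:
        for line in log:
            quoted = re.findall(r'"([^"]*)"', line)
            if len(quoted) >= 2 and quoted[-2] not in ("-", ""):
                referrers[urlsplit(quoted[-2]).netloc] += 1

    for site, hits in referrers.most_common(20):
        print(site, hits)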

Best open source library or application to crawl and data mine web sites

I would like to know what is the best open-source library for crawling and analyzing websites. One example would be a crawler for property agencies, where I would like to grab information from a number of sites and aggregate it into my own site. For this I need to crawl the sites and extract the property ads. ...
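For orientation, Scrapy is one frequently recommended open-source option for exactly this fetch-and-extract pattern. A minimal spider sketch; the start URL and CSS selectors are placeholders, since real property sites differ:

    # Sketch: a minimal Scrapy spider for pulling property ads.
    # Start URL and selectors are hypothetical placeholders.
    import scrapy

    class PropertySpider(scrapy.Spider):
        name = "property"
        start_urls = ["https://example-agency.test/listings"]

        def parse(self, response):
            for ad in response.css("div.listing"):
                yield {
                    "title": ad.css("h2::text").get(),
                    "price": ad.css(".price::text").get(),
                }
            # Follow pagination links, if any.
            for href in response.css("a.next::attr(href)").getall():
                yield response.follow(href, self.parse)

Run with something like scrapy runspider property_spider.py -o ads.json to get the aggregated items as JSON.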

How do web spiders differ from Wget's spider?

This sentence in Wget's manual caught my eye, describing wget --spider --force-html -i bookmarks.html: "This feature needs much more work for Wget to get close to the functionality of real web spiders." I find the following lines of code relevant for the spider option in Wget: src/ftp.c 780: /* If we're in spider mode, don't really retrie...

Web Crawling and Link Evaluation

Hi all, I know that cURL will download a complete file. What I really want is to take all the links on a page, evaluate each against my specific criteria (location of the link, etc.), and decide whether I should grab that page and parse it for information. More specifically, I want to find links that pertain to entertainment events and parse the d...
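The fetch-then-filter step is straightforward with the standard library: pull the page, extract the anchors, and apply criteria before deciding what to enqueue. A sketch; the keyword filter standing in for "entertainment events" is an assumption:

    # Sketch: fetch a page, pull out its links, and keep only those
    # passing a simple relevance test. The keyword list is a stand-in
    # for the asker's real criteria.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    def interesting_links(url, keywords=("concert", "event", "show")):
        html = urlopen(url).read().decode("utf-8", errors="replace")
        parser = LinkParser()
        parser.feed(html)
        return [urljoin(url, h) for h in parser.links
                if any(k in h.lower() for k in keywords)]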

WebBrowser.Refresh problem in VB.Net

I'm working on a web crawler in VB.Net and using the System.Forms.WebBrowser object for handling navigation on sites that use JavaScript or form posts, but I'm having a problem. When I navigate backwards (WebBrowser.GoBack()) to a page that was loaded with a form post, the page has expired and I have to do a refresh to resend the reques...

Online tool for crawling a website and retrieving all meta information for every page

Does anyone know of a free online tool that can crawl any given website and return just the Meta Keywords and Meta Description information? ...
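If no hosted tool turns up, the extraction itself is small enough to script. A sketch that pulls just those two meta tags from a single page with the standard library; wrapping it in a crawl loop over every page is left out:

    # Sketch: extract meta keywords and description from one page.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class MetaParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.meta = {}
        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                a = dict(attrs)
                if a.get("name", "").lower() in ("keywords", "description"):
                    self.meta[a["name"].lower()] = a.get("content", "")

    def page_meta(url):
        parser = MetaParser()
        parser.feed(urlopen(url).read().decode("utf-8", errors="replace"))
        return parser.meta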

Proper etiquette for a web crawler's HTTP requests

I have a simple web crawler that requests all the pages from a website's sitemap, which I need to cache and index. After several requests, the website begins serving blank pages. There is nothing in their robots.txt except the link to their sitemap, so I assume I am not breaking their "rules". I have a descriptive header that links to ex...
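Blank pages after a burst of requests usually mean rate limiting, even when robots.txt says nothing about it. The polite defaults are a fixed delay between requests and honoring Crawl-delay when declared. A sketch using the standard robotparser; the one-second fallback and the URLs are assumptions:

    # Sketch: throttle sitemap fetches, honoring Crawl-delay when the
    # site declares one. The 1-second default is an assumption.
    import time
    from urllib.request import urlopen
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "mycrawler/1.0 (+https://example.test/about-crawler)"

    rp = RobotFileParser("https://example.test/robots.txt")
    rp.read()
    delay = rp.crawl_delay(USER_AGENT) or 1.0

    # Placeholder list; normally parsed out of the sitemap itself.
    sitemap_urls = ["https://example.test/page1",
                    "https://example.test/page2"]

    for url in sitemap_urls:
        time.sleep(delay)  # space out requests instead of bursting
        page = urlopen(url).read()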

Information on web crawling techniques

Hello, I am building a small web crawler and I was wondering if anybody had some interesting info on the actual implementation (just crawling, no searching, no ranking, no classification, just crawling, KISS :). For the record, I already have O'Reilly's "Spidering Hacks" and No Starch Press's "Webbots, Spiders, and Screen Scrapers"...

How to migrate resources from proprietary CMS?

I need to migrate our website from a proprietary CMS that uses active server pages. Is there a tool or technique that will help download the resources from the existing site? I guess I'm looking for a tool that will crawl and scrape the entire site. An additional challenge is that the site uses SSL and is protected with forms-based au...
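For the forms-auth obstacle, one approach is to log in programmatically, keep the session cookie, and mirror behind it. A sketch with the requests library; the login URL and form field names are placeholders, since every CMS differs:

    # Sketch: authenticate once, then crawl behind forms-based auth
    # with the session cookie. Login URL and fields are hypothetical.
    import requests

    session = requests.Session()
    session.post(
        "https://oldcms.example.test/login.asp",
        data={"username": "migrator", "password": "secret"},
        verify=True,  # SSL is no obstacle; certs are verified by default
    )

    # Any subsequent GET reuses the authenticated session.
    page = session.get("https://oldcms.example.test/some/resource.asp")
    with open("resource.html", "wb") as f:
        f.write(page.content)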

What is the best language for a web crawler?

I would like to know what is the best language for a lightweight and fast web crawler. Is it better to do it in C99, C++ or any other language? ...

C# HttpWebResponse + StreamReader very slow

Hi all. I'm trying to implement a limited web crawler in C# (for a few hundred sites only) using HttpWebResponse.GetResponse() and StreamReader.ReadToEnd(); I have also tried using StreamReader.Read() and a loop to build my HTML string. I'm only downloading pages which are about 5-10 KB. It's all very slow! For example, the average GetRespons...

How to download Google search results?

Apologies if this is too ignorant a question or has been asked before. A cursory look did not find anything matching this exactly. The question is: how can I download all Word documents that Google has indexed? It would be a daunting task indeed to do it by hand... Thanks for all pointers. ...
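For scale: Google exposes indexed file types through the filetype: operator, so the manual version of this is a query like filetype:doc <terms>; scripting the downloads is the part worth automating (and bulk-scraping the results page itself may run against Google's terms of service). A sketch that fetches a list of already-collected result URLs; the list is a placeholder:

    # Sketch: download documents from result URLs gathered via
    # "filetype:doc" searches (by hand or through an API). The URL
    # list is a placeholder.
    import os
    from urllib.parse import urlsplit
    from urllib.request import urlretrieve

    result_urls = [
        "https://example.test/papers/report.doc",
    ]

    for url in result_urls:
        name = os.path.basename(urlsplit(url).path) or "unnamed.doc"
        urlretrieve(url, name)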