Situation: Google has indexed a page in a forum. The thread has since been deleted. How (and whether) can I get Google and other search engines to delete the cached copy? I doubt they would have anything against that, since the linked page no longer exists and keeping the index updated and valid should be in their best interest.
Is this possib...
I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like 'bot'. But that seems awkward, incomplete, and unmaintainable. So does anyo...
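One low-tech approach that stays server-side is checking the User-Agent header against the tokens the well-behaved crawlers advertise. A minimal Python sketch; the token list is illustrative, not a complete registry, so a real deployment would sync it from a maintained bot list:

KNOWN_BOT_TOKENS = (
    "googlebot",
    "bingbot",
    "slurp",        # Yahoo's crawler
    "duckduckbot",
    "baiduspider",
    "yandexbot",
)

def is_known_bot(user_agent):
    # True if the User-Agent contains a known crawler token.
    if not user_agent:
        return False
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)

print(is_known_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))

For bots you actually need to trust, a reverse DNS lookup on the requesting IP is harder to spoof than the header; the major engines document that verification procedure.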
In the past, one of our IT specialists has accidentally moved the robots.txt from staging to production, blocking Google and others from indexing our customers' sites in production. Is there a good way of managing this situation?
Thanks in advance.
...
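One way to manage it is to treat robots.txt as a deployment artifact and smoke-test it after every release. A minimal Python sketch; the production URL is a placeholder, and the check assumes the staging file blocks everything with a blanket Disallow: /:

import sys
import urllib.request

PRODUCTION_ROBOTS = "https://www.example.com/robots.txt"  # placeholder URL

def blocks_everything(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    # Naive check: any bare "Disallow: /" line in the file.
    return any(line.strip().lower() == "disallow: /" for line in body.splitlines())

if blocks_everything(PRODUCTION_ROBOTS):
    sys.exit("production robots.txt blocks all crawlers -- failing the deploy")

Hooked into the deployment pipeline, this turns the accidental swap into a loud failure instead of a silent de-indexing.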
I've put together a fairly simple crawling engine that works quite well and for the most part avoids getting stuck in circular loop traps (i.e., Page A links to Page B and Page B links to Page A).
The only time it gets stuck in this loop is when both pages link to each other with a cachebuster querystring; basically, it is a unique querys...
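A common fix for this trap is to canonicalize URLs before consulting the visited set, stripping query parameters known to be volatile. A Python sketch; the parameter names in VOLATILE_PARAMS are assumptions to adapt per site:

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

VOLATILE_PARAMS = {"cachebuster", "cb", "ts", "rnd", "_"}  # assumed names

def canonical_url(url):
    # Rebuild the URL without volatile query parameters or fragment.
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in VOLATILE_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))

visited = set()
url = "http://example.com/a?cachebuster=1234567890"
if canonical_url(url) not in visited:
    visited.add(canonical_url(url))

Both pages then collapse to one canonical entry in the visited set, no matter what cachebuster value they link with.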
Is it impossible to run a web crawler on GAE alongside my app, considering that I am running the free startup version?
...
I want to download web pages that use JavaScript to output their data. Wget can do everything else, but it can't run JavaScript.
Even something like: firefox -remote "saveURL(www.mozilla.org, myfile.html)"
would be great (unfortunately that kind of command does not exist).
...
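Since wget never executes JavaScript, the usual workaround is to drive a real browser and save the DOM after the scripts have run. A minimal Python sketch using Selenium, assuming the selenium package plus Firefox and geckodriver are installed:

from selenium import webdriver

driver = webdriver.Firefox()
try:
    driver.get("https://www.mozilla.org/")
    html = driver.page_source  # the DOM after JavaScript has run
    with open("myfile.html", "w", encoding="utf-8") as f:
        f.write(html)
finally:
    driver.quit()

Pages that load their data asynchronously may need an explicit wait before page_source is read, or the script-generated content won't be there yet.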
For the sake of learning and performance, given this recursive web-crawling function (which crawls only within the given domain), what would be the best approach to make it run iteratively? Currently, by the time it finishes running, Python has climbed to using over 1GB of memory, which isn't acceptable for running in a shared environme...
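The standard transformation is to replace the call stack with an explicit work queue plus a visited set, which also caps duplicate fetches. A self-contained Python sketch of that shape; LinkParser is just a stand-in for whatever link extraction the recursive version used:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collects the href of every anchor tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(start_url):
    # Breadth-first, same-domain crawl with no recursion.
    domain = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    while queue:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # drop fragments
            if urlsplit(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)

The memory footprint is then dominated by the seen set and the queue rather than by stacked frames and their locals.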
Hi All,
I want to crawl for specific things: events that are taking place, like concerts, movies, art gallery openings, etc. Anything that one might spend time going to.
How do I implement a crawler?
I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/).
Are there others?
What opinions do...
So, here is the task I've found myself thinking of. Pretend for a moment that I have a large body of content. I want to see which websites are linking to my content. I know that I could look into TrackBack or PingBack, but what about those that aren't using tools capable of dealing with that?
It would seem that some form of Web Crawler t...
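A crawler-free first pass is to test candidate pages directly for links back to your domain. A crude Python sketch; MY_DOMAIN is a placeholder and the href regex is deliberately simplistic:

import re
from urllib.request import urlopen

MY_DOMAIN = "mycontentsite.example"  # placeholder for your domain

def links_to_me(url):
    # True if any href on the page points at MY_DOMAIN.
    html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
    hrefs = re.findall(r'href\s*=\s*["\']([^"\']+)', html, re.IGNORECASE)
    return any(MY_DOMAIN in href for href in hrefs)

The part a crawler would still have to solve is discovering which candidate URLs to feed into this check in the first place.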
I would like to know what the best open-source library for crawling and analyzing websites is. One example would be a crawler for property agencies, where I would like to grab information from a number of sites and aggregate them into my own site. For this I need to crawl the sites and extract the property ads.
...
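Scrapy is one widely used open-source Python framework for exactly this crawl-and-extract pattern. A minimal spider sketch; the start URL and the CSS selectors are placeholders to replace with each agency site's real markup:

import scrapy

class PropertySpider(scrapy.Spider):
    name = "properties"
    start_urls = ["https://agency.example.com/listings"]  # placeholder

    def parse(self, response):
        # Selector names below are assumptions, not a real site's markup.
        for ad in response.css("div.listing"):
            yield {
                "title": ad.css("h2::text").get(),
                "price": ad.css(".price::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run with scrapy runspider spider.py -o ads.json and the yielded items land in a file ready for aggregation.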
The following sentence in Wget's manual caught my eye:
wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the functionality of real web spiders.
I find the following lines of code relevant to the spider option in wget:
src/ftp.c
780: /* If we're in spider mode, don't really retrie...
Hi All,
I know that cURL will download a complete file.
What I really want is to take all the links on a page, evaluate them against my specific criteria (location of the link, etc.), and decide if I should grab that page and parse it for information.
More specifically, I want to find links that pertain to entertainment events and parse the d...
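A sketch of that evaluate-then-decide step using only the Python standard library; the keyword criteria are illustrative stand-ins for a real entertainment-events filter:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

KEYWORDS = ("concert", "event", "theatre", "festival")  # assumed criteria

class AnchorParser(HTMLParser):
    # Collects (href, anchor text) pairs from a page.
    def __init__(self):
        super().__init__()
        self.pairs, self._href, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.pairs.append((self._href, " ".join(self._text).strip()))
            self._href = None

def interesting_links(url):
    html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
    parser = AnchorParser()
    parser.feed(html)
    return [urljoin(url, href) for href, text in parser.pairs
            if any(k in (href + " " + text).lower() for k in KEYWORDS)]

Each surviving link is then a candidate for the follow-up fetch and parse.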
I'm working on a web crawler in VB.NET, using the System.Forms.WebBrowser object to handle navigation on sites that use JavaScript or form posts, but I'm having a problem. When I navigate backwards (WebBrowser.GoBack()) to a page that was loaded with a form post, the page has expired and I have to do a refresh to resend the reques...
Does anyone know of a free online tool that can crawl any given website and return just the Meta Keywords and Meta Description information?
...
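If no hosted tool fits, the extraction itself is only a few lines. A Python sketch using the third-party requests and BeautifulSoup packages; the URL is whatever page you want to inspect:

import requests
from bs4 import BeautifulSoup

def meta_info(url):
    # Returns the content of the keywords and description meta tags.
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    out = {}
    for name in ("keywords", "description"):
        tag = soup.find("meta", attrs={"name": name})
        out[name] = tag.get("content") if tag else None
    return out

print(meta_info("https://www.example.com/"))

Crawling a whole site is then a matter of feeding this every URL from the sitemap or a link walk.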
I have a simple web crawler to request all the pages from a website's sitemap that I need to cache and index. After several requests, the website begins serving blank pages.
There is nothing in their robots.txt except the link to their sitemap, so I assume I am not breaking their "rules". I have a descriptive header that links to ex...
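Blank pages after a burst of requests usually point to rate limiting, and the standard remedy is a politeness delay plus backoff-and-retry. A Python sketch; the delay and retry numbers are guesses to tune against the site's actual tolerance:

import time
from urllib.request import urlopen

def fetch_all(urls, delay=2.0, max_retries=3):
    pages = {}
    for url in urls:
        for attempt in range(max_retries):
            body = urlopen(url, timeout=10).read()
            if body.strip():                        # real content: keep it
                pages[url] = body
                break
            time.sleep(delay * (2 ** attempt))      # blank page: back off
        time.sleep(delay)                           # base delay between URLs
    return pages

URLs that stay blank through every retry are simply skipped, which keeps one stubborn page from stalling the whole run.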
Hello,
I am building a small web crawler and I was wondering if anybody had some interesting info on the actual implementation (just crawling, no searching, no ranking, no classification, just crawling, KISS :).
For the record, I already have the O'Reilly "Spidering hacks" and the No Starch Press "Webbots, spiders, and screen scrapers"...
I need to migrate our website from a proprietary CMS that uses Active Server Pages. Is there a tool or technique that will help download the resources from the existing site? I guess I'm looking for a tool that will crawl and scrape the entire site.
An additional challenge is that the site uses SSL and is protected with forms-based au...
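For the forms-based authentication, one approach is to script the login once and reuse the session cookie for every download. A Python sketch with the third-party requests package; the login URL and form field names are placeholders to lift from the real login form's HTML:

import requests

session = requests.Session()
session.post(
    "https://cms.example.com/login",                # placeholder endpoint
    data={"username": "me", "password": "secret"},  # assumed field names
    timeout=10,
)
# The session now carries the auth cookie, so further GETs are authenticated.
resp = session.get("https://cms.example.com/some/page.asp", timeout=10)
with open("page.html", "wb") as f:
    f.write(resp.content)

From there, a same-domain link walk over HTTPS with this session can pull down the rest of the site.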
I would like to know what the best language is for a lightweight and fast web crawler.
Is it better to do it in C99, C++, or some other language?
...
Hi all.
I'm trying to implement a limited web crawler in C# (for a few hundred sites only)
using HttpWebResponse.GetResponse() and StreamReader.ReadToEnd(); I also tried using StreamReader.Read() and a loop to build my HTML string.
I'm only downloading pages which are about 5-10K.
It's all very slow! For example, the average GetRespons...
Apologies if this is too ignorant a question or has been asked before; a cursory look did not find anything matching this exactly. The question is: how can I download all the Word documents that Google has indexed? It would be a daunting task indeed to do it by hand... Thanks for any pointers.
...