Hello,
I am building a small web crawler and was wondering if anybody has some interesting info on the actual implementation (just crawling: no searching, no ranking, no classification, just crawling, KISS :).
For the record, I already have O'Reilly's "Spidering Hacks" and No Starch Press's "Webbots, Spiders, and Screen Scrapers". These books are excellent, but they tend to keep things simple and don't elaborate much on scaling, storing data, parallelism, and other more advanced topics. Of course, I could review the code of an existing open-source crawler, but that would be going to the other extreme (the C++ crawlers seem complicated...). I am looking for some interesting additional information.
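To give an idea of the scope I mean by "just crawling", here is a minimal sketch of the kind of crawl loop I have in mind: fetch a page, pull out the links, queue the unseen ones, repeat. It uses only the Python standard library; the start URL and page limit are placeholders, and there is no robots.txt handling, politeness delay, or persistent storage yet, which is exactly the kind of thing I would like pointers on.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=50):
    frontier = deque([start_url])   # URLs waiting to be fetched
    seen = {start_url}              # URLs already queued (simple dedup)
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue  # skip images, PDFs, etc.
                html = resp.read().decode("utf-8", errors="replace")
        except Exception as exc:
            print(f"skip {url}: {exc}")
            continue
        print(f"fetched {url} ({len(html)} bytes)")
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute, _ = urldefrag(urljoin(url, href))  # resolve relative URL, drop #fragment
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)


if __name__ == "__main__":
    crawl("http://example.com")  # placeholder start URL
```

That works fine for a handful of pages, but it is single-threaded and keeps everything in memory, which is why I am asking about scaling, storage, and parallelism.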
Any help is welcome, thanks in advance.