Hello,

I am building a small web crawler and I was wondering if anybody had some interesting info on the actual implementation (just crawling, no searching, no ranking, no classification, just crawling, KISS :).

For the record, I already have O'Reilly's "Spidering Hacks" and No Starch Press's "Webbots, Spiders, and Screen Scrapers". These books are excellent, but they tend to keep things simple and don't elaborate much on scaling, storing data, parallelism and other more advanced topics. Of course, I could review the code of an existing open source crawler, but that would be going to the other extreme (C++ crawlers seem complicated...). I am looking for some interesting/additional information.
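To make it concrete, here is roughly what I mean by "just crawling": a frontier queue plus a "seen" set, fetch a page, extract links, repeat. The Python below is only an illustrative sketch (the seed URL is a placeholder and the regex link extraction is deliberately naive); it does none of the scaling/storage/parallel things I'm asking about.

    # A minimal breadth-first crawl loop: a frontier queue plus a "seen" set.
    # Only a sketch; it ignores robots.txt, politeness delays, retries and
    # real storage, which are exactly the advanced parts I'm asking about.
    import re
    import urllib.parse
    import urllib.request
    from collections import deque

    LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

    def crawl(seed, max_pages=50):
        frontier = deque([seed])   # URLs waiting to be fetched
        seen = {seed}              # URLs already queued, to avoid re-crawling
        while frontier and max_pages > 0:
            url = frontier.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue           # skip pages that fail to download
            max_pages -= 1
            print(url)             # "storing" a page is just printing for now
            for link in LINK_RE.findall(html):
                absolute = urllib.parse.urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)

    if __name__ == "__main__":
        crawl("http://example.com/")   # placeholder seed URL

What I'd like to understand is how real crawlers grow this into something that handles millions of URLs: how the frontier and the seen set are stored, how fetching is parallelized, and how the downloaded pages are persisted.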

Any help is welcome, thanks in advance.

+1  A: 

If you are interested in the implementation details of a web crawler, you may study existing open source implementations. Here is a list of Open Source Crawlers in Java. Most of those projects are inactive, but the Internet Archive's crawler Heritrix and Apache Nutch are mature, active projects with lots to learn from.

Palimondo
+1  A: 

http://arachnode.net is active. Ask whatever questions you'd like!

arachnode dot net