Hello,
I am building a small web crawler and was wondering if anybody has some interesting info on the actual implementation (just crawling: no searching, no ranking, no classification, just crawling, KISS :).
For the record, I already have O'Reilly's "Spidering Hacks" and No Starch Press's "Webbots, Spiders, and Screen Scrapers". These books are excellent, but they tend to keep things simple and don't elaborate much on scaling, storing data, parallelism, and other more advanced topics. Of course, I could review the code of an existing open-source crawler, but that would be going to the other extreme (the C++ crawlers seem complicated...). I am looking for some interesting additional information.
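To give an idea of the scope I mean by "just crawling", here is a minimal sketch of the kind of crawl loop I have in mind: fetch a page, pull out the links, queue the unseen ones, repeat. It uses only the Python standard library; the start URL and page limit are placeholders, and there is no robots.txt handling, politeness delay, or persistent storage yet, which is exactly the kind of thing I would like pointers on.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=50):
    frontier = deque([start_url])   # URLs waiting to be fetched
    seen = {start_url}              # URLs already queued (simple dedup)
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue  # skip images, PDFs, etc.
                html = resp.read().decode("utf-8", errors="replace")
        except Exception as exc:
            print(f"skip {url}: {exc}")
            continue
        print(f"fetched {url} ({len(html)} bytes)")
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute, _ = urldefrag(urljoin(url, href))  # resolve relative URL, drop #fragment
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)


if __name__ == "__main__":
    crawl("http://example.com")  # placeholder start URL
```

That works fine for a handful of pages, but it is single-threaded and keeps everything in memory, which is why I am asking about scaling, storage, and parallelism.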
Any help is welcome, thanks in advance.