views: 665

answers: 3
I'm trying to crawl about a thousand web sites, from which I'm interested in the HTML content only.

I then transform the HTML into XML, which is parsed with XPath to extract the specific content I'm interested in.

I've been using the Heritrix 2.0 crawler for a few months, but I ran into huge performance, memory, and stability problems (Heritrix crashes about every day, and no attempt to limit memory usage via JVM parameters was successful).

From your experience in the field, which crawler would you use to extract and parse content from a thousand sources?

+2  A: 

I would suggest writing your own crawler in Python using Scrapy with either lxml or BeautifulSoup. You should find a few good tutorials on Google for those. I use Scrapy + lxml at work to spider ~600 websites checking for broken links.
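For example, a minimal spider that hands each page to lxml for XPath extraction might look roughly like this (a sketch assuming a recent Scrapy release; the spider name, start URL, and XPath expression are placeholders, not a real configuration):

    # sketch of a Scrapy spider that extracts content with lxml + XPath
    import scrapy
    from lxml import html

    class ContentSpider(scrapy.Spider):
        name = "content"
        start_urls = ["http://example.com/"]  # one entry per site to crawl

        def parse(self, response):
            # parse the raw HTML with lxml, then pull out the parts of interest
            tree = html.fromstring(response.body)
            for title in tree.xpath("//h1/text()"):  # placeholder XPath
                yield {"url": response.url, "title": title.strip()}

(Run it with scrapy runspider and an output file, e.g. scrapy runspider content_spider.py -o items.json.)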

digitala
Did you build your own control system on top of Scrapy, or does Scrapy provide methods for that purpose? (For example, I want each crawl to run for a maximum of 6 hours and then restart; I developed a dedicated Java program that controls the crawl and restarts it when Heritrix hangs.)
Enrico Detoma
Not sure what you're asking; Scrapy is a framework for scraping, so you build on top of it. It seems odd that you'd want to scrape for a set "time"; wouldn't it be better to set a maximum "level" and then have it simply finish when it's done?
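That said, if a hard stop is really needed, recent Scrapy releases have settings for both approaches (a hypothetical settings.py excerpt; the values are only examples):

    # excerpt from a Scrapy project's settings.py
    DEPTH_LIMIT = 3                 # stop following links more than 3 hops from a seed
    CLOSESPIDER_TIMEOUT = 6 * 3600  # close the spider after 6 hours (value in seconds)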
digitala
A: 

Wow. State-of-the-art crawlers like the ones search engines use can crawl and index 1 million URLs a day on a single box. Sure, the HTML-to-XML rendering step takes a bit, but I agree with you on the performance. I've only used private crawlers, so I can't recommend one you'll be able to use, but I hope these performance numbers help in your evaluation.

Epsilon Prime
We were able to write a custom crawler that can extract ~2 million pages a day. The hardest thing about scaling it was ensuring that the frontier lookup (pages we have already visited) stayed fast as the number of pages harvested grew.
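Very roughly, the idea looks like the toy sketch below (not our actual code; fetch and extract_links stand in for the HTTP and link-parsing layers). At tens of millions of URLs you would swap the in-memory set for a Bloom filter or an on-disk store, but the shape of the loop stays the same:

    # toy frontier loop with an O(1) "have we already seen this URL?" check
    from collections import deque

    def crawl(seed_urls, fetch, extract_links, max_pages=1000):
        frontier = deque(seed_urls)   # URLs waiting to be fetched
        visited = set(seed_urls)      # fast membership test is the scaling bottleneck
        pages = 0
        while frontier and pages < max_pages:
            url = frontier.popleft()
            body = fetch(url)                      # caller-supplied HTTP fetch
            pages += 1
            for link in extract_links(url, body):  # caller-supplied link extraction
                if link not in visited:            # skip anything already seen or queued
                    visited.add(link)
                    frontier.append(link)
        return pages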
Joel
+1  A: 

I would not use the 2.x branch (which has been discontinued) or the 3.x branch (currently in development) for any 'serious' crawling unless you want to help improve Heritrix or just like being on the bleeding edge.

Heritrix 1.14.3 is the most recent stable release, and it really is stable; it's used by many institutions for both small- and large-scale crawling. I'm using it to run crawls against tens of thousands of domains, collecting tens of millions of URLs in under a week.

The 3.x branch is getting closer to a stable release, but even then I'd wait a bit, until general use at the Internet Archive and elsewhere has improved its performance and stability.

Kris
Thank you, I had my suspicions about the 2.0 version, but since we did some custom development I was a bit afraid of converting it to 1.14 only to discover that 1.14 doesn't work either. Now I'm more confident about doing the downgrade to 1.14.
Enrico Detoma
H2 is about the worst choice possible. H3 is now in beta and getting better while H1 has been stable for years.
Kris
I detected a pattern here with Heritrix 1.14 => 2.0 => 3.0. We could name it "the XP => Vista => 7" pattern, after a famous contemporary example :-)
Enrico Detoma
Thank you! You saved my day. Heritrix 1.14.3 works much better than the awful 2.0.
Enrico Detoma