I'm trying to crawl about a thousand web sites, from which I'm interested in the HTML content only.
I then transform the HTML into XML and parse it with XPath to extract the specific content I'm after.
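For context, the extraction step looks roughly like this (a minimal sketch assuming JTidy for HTML cleanup plus the standard `javax.xml.xpath` API; the URL and XPath expression are just placeholders):

```java
import java.io.InputStream;
import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class PageExtractor {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; each crawled page goes through the same steps
        URL url = new URL("https://example.com/");
        try (InputStream in = url.openStream()) {
            // JTidy cleans up real-world HTML into a well-formed DOM
            Tidy tidy = new Tidy();
            tidy.setXHTML(true);
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            Document doc = tidy.parseDOM(in, null);

            // Query the cleaned document with a (placeholder) XPath expression
            XPath xpath = XPathFactory.newInstance().newXPath();
            String title = xpath.evaluate("//title/text()", doc);
            System.out.println("Title: " + title);
        }
    }
}
```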
I've been using the Heritrix 2.0 crawler for a few months, but I've run into serious performance, memory, and stability problems: Heritrix crashes about once a day, and no attempts to limit memory usage via JVM parameters have been successful.
From your experience in the field, which crawler would you use to extract and parse content from a thousand sources?