The crawler needs to have an extendable architecture to allow changing the internal process, like implementing new steps (pre-parser, parser, etc...)
I found the Heritrix Project (http://crawler.archive.org/).
But there are other nice projects like that?