I'm looking to use Heritrix to crawl websites. What tools are Heritrix users using to extract text from the crawled files before indexing them with Lucene?