I am using Lucene.NET for full-text search. Until now I have been indexing PDF documents, but I now have a few web pages that I need to index as well. What is the best/easiest way to index HTML documents and add them to my Lucene index? I am using .NET/C#.
A:
I am currently working on this problem. The best answer I have found to date is to use the HTML Agility Pack to extract the plain-text content from the HTML.
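As a rough illustration of that approach, here is a minimal sketch, assuming the HtmlAgilityPack package is referenced. The class and method names (`HtmlTextExtractor`, `ExtractText`) are made up for this example, and stripping `script` and `style` elements is my own assumption about what you would not want in the index.

```csharp
using System.Linq;
using HtmlAgilityPack;

static class HtmlTextExtractor
{
    // Hypothetical helper: returns the text content of an HTML string,
    // with <script> and <style> elements removed first.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style");
        if (unwanted != null)
        {
            // Copy to a list so the tree can be modified while iterating.
            foreach (var node in unwanted.ToList())
            {
                node.Remove();
            }
        }

        // InnerText concatenates the remaining text nodes;
        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

The resulting string can then presumably be added to the Lucene document as an analyzed field, in the same way as the text extracted from the PDFs.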
Adam Pope
2010-03-23 09:57:31