How can I index HTML documents? | ansaurus

Q:

How can I index HTML documents?

I am using Lucene.NET for full-text search. Until now I have been indexing PDF documents, but I now have a few web pages that I need to index as well. What is the best or easiest way to index HTML documents and add them to my Lucene index? I am using .NET/C#.

A:

Google can index your content for you.

Pierreten 2009-12-17 02:01:33

A:

I am currently working on this problem. The best answer I have found to date is to use the HTML Agility Pack to extract the plain-text content from the HTML.
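
A minimal sketch of that approach, assuming the HtmlAgilityPack package is referenced in the project. The class and method names (`HtmlTextExtractor`, `ExtractText`) are hypothetical, and dropping `script` and `style` elements before reading the text is my own assumption about what you would want kept out of the index:

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // third-party package, assumed to be installed

public static class HtmlTextExtractor
{
    // Hypothetical helper: returns the readable text of an HTML string.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Drop script and style elements so their contents are not indexed.
        // SelectNodes returns null when nothing matches.
        var noise = doc.DocumentNode.SelectNodes("//script|//style");
        if (noise != null)
        {
            foreach (var node in noise)
            {
                node.Remove();
            }
        }

        // InnerText concatenates the remaining text nodes;
        // DeEntitize decodes entities such as &amp;.
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);

        // Collapse the runs of whitespace left behind by the markup.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

The returned string can then be added to a Lucene document as an analyzed field, in the same way as the text you already extract from your PDFs. The exact field constructor depends on the Lucene.NET version you are using.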

Adam Pope 2010-03-23 09:57:31
