Hello All,

I'm writing a small algorithm that fetches text from web sites and then searches it for answers (I'll post the script once it's finished).

To do that, I need to convert all the HTML code into plain, readable English text.

I've manually removed all the HTML tags, but some CSS entries are hard to get rid of. Any simple ideas on how to convert HTML to plain English text?

Thanks.

+2  A: 

Someone has already done all the work for you.

balexandre
Wouldn't parsing to a DOM and using InnerText be better?
Dmitriy Matveev
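
To illustrate the parse-then-read-the-text idea from the comment above, here is a minimal sketch. It is written in Python with the standard library's html.parser rather than .NET, so it is only an analogue of DOM parsing plus InnerText, not the same API. The names TextExtractor and html_to_text are made up for this example, and the list of block-level tags is an assumption you would probably want to extend. It also drops the contents of style and script elements, which is where leftover CSS usually comes from.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <style>/<script>."""

    SKIPPED = {"style", "script"}
    # Assumed set of tags that should separate words; extend as needed.
    BLOCK = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp; and friends
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the whitespace runs left behind by removed markup.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling html_to_text on a page string returns the text with tags, CSS and scripts removed and whitespace collapsed. Real-world pages with badly broken markup may need a more forgiving parser than this.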
A: 

I developed something similar that avoids the performance penalty of regular expressions: a strip_tags equivalent for ASP.NET (it can be used in desktop .NET assemblies too).

Andrei Rinea
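
For readers wondering what a regex-free tag stripper can look like, here is a rough sketch of the general idea: a single pass over the characters that tracks whether we are inside a tag. This is not the code from the answer above, which is not shown in this thread. It is a Python illustration under simplifying assumptions, and the function name strip_tags is borrowed only for familiarity.

```python
def strip_tags(markup):
    """Single-pass tag stripper: no regex, just an in-tag flag.

    Assumptions: every '<' starts a tag, and attribute values never
    contain '>'. Text inside <style>/<script> is NOT removed here.
    """
    out = []
    in_tag = False
    for ch in markup:
        if ch == "<":
            in_tag = True          # start of a tag: stop copying
        elif ch == ">" and in_tag:
            in_tag = False         # end of the tag: resume copying
        elif not in_tag:
            out.append(ch)         # ordinary text outside any tag
    return "".join(out)
```

Because it only drops what sits between angle brackets, CSS rules inside a style element would still come through as text, so this kind of scanner fits best when style and script blocks have already been removed.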
+1 for a good subject :)
Andrei Rinea