I have taken over a code base and have to read in HTML files that were generated by Microsoft Word (I think), so they have all kinds of wacky inline formatting.

Is there any way to strip out all of the bad inline formatting and just get the text from this stream? I basically want to purify the HTML programmatically so I can then apply some sensible CSS.

A: 

Give the HTML Agility Pack a shot.
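
To show the general idea behind a parser-based cleanup, here is a minimal sketch. It is in Python, not .NET, and uses the standard library's `html.parser` rather than the HTML Agility Pack, so treat it as an illustration of the technique and not of that library's API. The `KEEP` whitelist, the `WordHtmlCleaner` class and the `clean_word_html` function are all made-up names for this example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: structural tags worth keeping for later CSS styling.
KEEP = {"p", "h1", "h2", "h3", "ul", "ol", "li", "table", "tr", "td", "th",
        "b", "i", "em", "strong", "br"}
# Elements whose content is dropped entirely (Word stylesheets, head metadata).
SKIP_CONTENT = {"style", "script", "head", "title"}
# Tags that never get a closing tag in the output.
VOID = {"br"}


class WordHtmlCleaner(HTMLParser):
    """Re-emits whitelisted tags without attributes; keeps text, drops the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP:
            # Attributes (class, style, lang, ...) are deliberately discarded.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Character references were decoded by the parser; re-escape them.
            self.out.append(escape(data, quote=False))


def clean_word_html(html):
    parser = WordHtmlCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Tags outside the whitelist (`span`, `font`, `o:p` and so on) lose their markup but keep their text, which is usually what you want before applying your own stylesheet.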

Chris Doggett
A: 

You should use HTML Tidy. It's ubiquitous when it comes to cleaning up HTML. There's an article on DevX that describes how to use it from .NET.

Dan Diplo
A: 

This site seems to do the trick:

http://textism.com/wordcleaner/

Sir David of Lee
A: 

In the end I just wrote a small class that did a bunch of find-and-replaces. Not pretty, but it worked.
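
For anyone curious what that kind of approach can look like, here is a rough sketch. It is written in Python with the standard `re` module, not the actual class from this project, and the patterns are guesses at typical Word output (comments, embedded `<style>` blocks, `span`/`font`/`o:p` wrappers, inline `class`/`style`/`lang`/`align` attributes). The function name `strip_word_formatting` is made up for the example.

```python
import re

# Each pattern below is one "find and replace" step. They are illustrative
# assumptions about Word-generated markup, not an exhaustive rule set.
COMMENT = re.compile(r"<!--.*?-->", re.S)                      # incl. conditional comments
STYLE_BLOCK = re.compile(r"<style\b.*?</style\s*>", re.S | re.I)
WRAPPER_TAG = re.compile(r"</?(?:span|font|o:p)\b[^>]*>", re.I)
OPEN_TAG = re.compile(r"<[a-zA-Z][^>]*>")
INLINE_ATTR = re.compile(
    r"""\s+(?:class|style|lang|align)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.I,
)


def strip_word_formatting(html):
    html = COMMENT.sub("", html)       # drop comments first
    html = STYLE_BLOCK.sub("", html)   # drop embedded stylesheets
    html = WRAPPER_TAG.sub("", html)   # unwrap presentational tags, keep their text
    # Strip the inline attributes, but only inside opening tags so that
    # ordinary body text is never touched.
    html = OPEN_TAG.sub(lambda m: INLINE_ATTR.sub("", m.group(0)), html)
    return html
```

Regexes like these are fragile (an attribute value containing `>` will confuse them, for instance), so this only makes sense when the input comes from one known source and you can eyeball the results.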

ooo