I have taken over a code base and have to read in HTML files that, I think, were generated by Microsoft Word, so they have all kinds of wacky inline formatting.
Is there any way to parse out all of the bad inline formatting and just get the text from this stream? I basically want a programmatic purifier so I can then apply some sensible CSS.
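
To make the goal concrete, here is a rough sketch of the kind of purifier I have in mind. It is written in Python only as an assumption, since the language is not fixed, and it uses nothing beyond the standard library's `html.parser`. The names `TextExtractor`, `extract_text`, `SKIP_TAGS` and `BLOCK_TAGS` are made up for illustration. It drops every tag and attribute, ignores the contents of `<head>`, `<style>` and `<script>`, and keeps one line of text per block element. I have not tested it against real Word output.

```python
import re
from html.parser import HTMLParser

# Hypothetical tag sets, chosen for illustration; adjust for the real files.
SKIP_TAGS = {"head", "style", "script"}
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects text nodes and discards all tags and inline attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside head/style/script
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")  # block boundary becomes a line break

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse source line wraps and non-breaking spaces to one space.
            self._chunks.append(re.sub(r"\s+", " ", data))

    def text(self):
        lines = (line.strip() for line in "".join(self._chunks).split("\n"))
        return "\n".join(line for line in lines if line)


def extract_text(html):
    """Return the plain text of an HTML string, one line per block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Calling `extract_text(html_string)` would then give plain text that can be wrapped in clean markup and styled with CSS. If some structure (headings, lists) needs to survive, the same parser could emit a whitelist of bare tags instead of newlines.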