In my current company, we have this decade-old... let's call it a "Hello World" application.
We want to build a newer version of it while preserving the older entries. Those entries contain hideous Word-generated HTML that was never filtered.
If and when we move to a newer system, I'd prefer to have that HTML cleaned and filtered in order to have the site comply with HTML standards as much as possible.
However, simply stripping that markup, whether the way Jeff Atwood described on his blog or in any other way I know of, would also ruin the style and formatting.
Now, that just might cause our users to revolt and then all hell will break loose - not a very good idea.
So the question is: can Word's HTML be cleaned while preserving basic formatting (e.g. colors, italics, bold text and so on)?
Preferably using publicly available code or a library such as HTML Tidy. Examples in C# would be much appreciated.