Is there a .NET utility out there to take an HTML snippet (not a whole document) and output a compliant standard HTML equivalent?

I.e., <b>die Bundesliga Mannschaften</b> and <span style="font-weight:bold">die Bundesliga Mannschaften</span> should both resolve to the same thing.

I'm not trying to repair anything, just standardize some well-formed, albeit outdated, description texts so that the final output has a consistent format.

Thanks

+1  A: 

I'm not aware of any HTML normalization tools in .NET; however, a good place to start is Tidy (or the fork of the original). Once the markup has been tidied, you stand a chance of being able to load your HTML in .NET as a DOM document and then transform various pieces based on rules you set forth. If you are given XHTML, your job may be a lot easier, requiring just a CSS interpreter to handle style attributes as part of your normalization code.

Alternatively, you could work on porting HtmlCleaner from Java to .NET.
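
The style-attribute handling mentioned above can be sketched roughly as follows. This is only an illustration, written in Python rather than .NET for brevity; the function names `parse_style` and `canonical_style` are made up for this example, and the naive split on `;` is an assumption that breaks on values containing semicolons (quoted strings, `url(data:...)`), which a real CSS parser would handle.

```python
def parse_style(style):
    """Split an inline style attribute into a dict of property -> value.

    Assumption: no declaration value contains ';' itself.
    Property names are lower-cased; values are only trimmed.
    """
    declarations = {}
    for chunk in style.split(";"):
        if ":" not in chunk:
            continue  # skip empty or malformed pieces
        prop, value = chunk.split(":", 1)
        prop = prop.strip().lower()
        if prop:
            declarations[prop] = value.strip()  # a later duplicate wins
    return declarations


def canonical_style(style):
    """Re-serialize declarations sorted by property name,
    so equivalent style attributes produce the same string."""
    decls = parse_style(style)
    return ";".join(f"{prop}:{value}" for prop, value in sorted(decls.items()))


print(canonical_style("font-weight: bold; COLOR : red;"))
# prints: color:red;font-weight:bold
```

Sorting by property name is just one possible canonical order; the point is that two differently written but equivalent style attributes end up comparing equal.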

sixlettervariables
This is exactly how I envisioned the solution: parse the HTML into a DOM or pseudo-DOM memory structure that holds the CSS formatting attributes, then output the HTML string... looks like I'll end up writing it myself.
Paul
I would suggest the Tidy fork as a starting point though. It does things like merging nested spans/divs, cleaning up irrelevant markup, etc. It'll at least get you clean, reliable HTML to turn into a DOM. Next is that CSS parser, then making it all Linq-to-XML...
sixlettervariables
+1  A: 

Note that both of the strings you provide are valid, standards-compliant HTML. What you probably want is to transform equivalent presentational markup into a canonical format. I don't know of a tool that does this automatically, but you can use XSLT to solve it.

Edit: sixlettervariables points out that you cannot parse CSS in XSLT. So the trick would be to transform <b> into <span style="font-weight:bold"> rather than the other way around :-)
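
To illustrate that direction (presentational tag to inline style), here is a minimal sketch using Python's standard `html.parser` instead of XSLT. The `PRESENTATIONAL` mapping, the `SnippetNormalizer` class and the `normalize` function are hypothetical names for this example only. It assumes well-formed snippets, drops comments, and discards any attributes on the replaced tags.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical mapping from presentational tags to inline-style equivalents.
PRESENTATIONAL = {
    "b": "font-weight:bold",
    "i": "font-style:italic",
    "u": "text-decoration:underline",
}


def render_attrs(attrs):
    """Serialize (name, value) pairs back into attribute text."""
    return "".join(
        f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
        for name, value in attrs
    )


class SnippetNormalizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in PRESENTATIONAL:
            # Attributes on the original tag are discarded in this sketch.
            self.out.append(f'<span style="{PRESENTATIONAL[tag]}">')
        else:
            self.out.append(f"<{tag}{render_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing input such as <br/> is written back self-closed.
        self.out.append(f"<{tag}{render_attrs(attrs)} />")

    def handle_endtag(self, tag):
        self.out.append("</span>" if tag in PRESENTATIONAL else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def normalize(snippet):
    parser = SnippetNormalizer()
    parser.feed(snippet)
    parser.close()  # flush any trailing text
    return "".join(parser.out)


print(normalize("<b>die Bundesliga Mannschaften</b>"))
# prints: <span style="font-weight:bold">die Bundesliga Mannschaften</span>
```

With this mapping, both strings from the question come out identical, since the `<span style="font-weight:bold">` form passes through unchanged.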

JacquesB
The trick would be having the XSLT handle CSS! Imagine a second CSS statement in the same style attribute. Not a fun problem.
sixlettervariables
A: 

I think I found what I needed in the Microsoft.mshtml namespace.

Paul