Does anybody know of an HTML cleaner for .NET that can parse HTML and (for instance) convert it to a more machine-friendly format such as XHTML?

I've tried the HTML Agility Pack, but that fails to correctly parse even fairly simple examples.

To give an example of HTML that should be parsed correctly:

<html><body>
    <ul><li>TestElem1
        <li>TestElem2
        <li>TestElem3 List:
            <ul><li>Nested1
                <li>Nested2</li>
                <li>Nested3
            </ul>
        <li>TestElem4
    </ul>
    <p>paragraph 1
    <p>paragraph 2
    <p>paragraph 3
</body></html>

End tags are optional for li elements (see spec), and likewise for p elements. In other words, the above sample should be parsed as:

<html><body>
    <ul><li>TestElem1</li>
        <li>TestElem2</li>
        <li>TestElem3 List:
            <ul><li>Nested1</li>
                <li>Nested2</li>
                <li>Nested3</li>
            </ul></li>
        <li>TestElem4</li>
    </ul>
    <p>paragraph 1</p>
    <p>paragraph 2</p>
    <p>paragraph 3</p>
</body></html>
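To make the expected behaviour concrete, here is a minimal sketch of the implied-end-tag logic, written in Python purely because its standard library ships a tolerant HTML tokenizer. It is not a .NET solution and not a full HTML 4 parser: the `VOID` and `CLOSES_P` sets are illustrative subsets I picked, and the names `ImpliedEndTagCloser` and `close_implied_tags` are hypothetical.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Illustrative subsets only (assumption); the HTML 4 DTD defines the full rules.
VOID = {"br", "hr", "img", "input", "meta", "link"}
CLOSES_P = {"p", "ul", "ol", "li", "div", "table", "pre", "hr",
            "h1", "h2", "h3", "h4", "h5", "h6"}
LISTS = {"ul", "ol"}


class ImpliedEndTagCloser(HTMLParser):
    """Re-emits HTML with omitted </li> and </p> end tags made explicit.

    Comments, doctypes and processing instructions are silently dropped.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # names of currently open elements
        self.out = []    # output fragments

    def _pop(self):
        self.out.append("</%s>" % self.stack.pop())

    def handle_starttag(self, tag, attrs):
        # A block-level start tag ends an open paragraph.
        if tag in CLOSES_P and self.stack and self.stack[-1] == "p":
            self._pop()
        if tag == "li":
            # A new item ends the previous item of the same list, but must
            # not reach past the enclosing <ul>/<ol> into an outer list.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i] in LISTS:
                    break
                if self.stack[i] == "li":
                    while len(self.stack) > i:
                        self._pop()
                    break
        attr_text = "".join(" %s=%s" % (k, quoteattr(v or "")) for k, v in attrs)
        if tag in VOID:
            self.out.append("<%s%s />" % (tag, attr_text))
        else:
            self.stack.append(tag)
            self.out.append("<%s%s>" % (tag, attr_text))

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignore it
        # Close everything still open inside the element, then the element.
        while self.stack:
            top = self.stack[-1]
            self._pop()
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def close_implied_tags(html):
    parser = ImpliedEndTagCloser()
    parser.feed(html)
    parser.close()
    while parser.stack:  # close whatever the input left open
        parser._pop()
    return "".join(parser.out)
```

Feeding the first sample through `close_implied_tags` should produce markup with the same nesting as the second sample (whitespace aside), which an XML parser can then load. Real-world HTML needs far more rules than this (tables, misnested inline tags, and so on), which is exactly why a ready-made managed library would be preferable.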

Since the aim is to use the library on various machines, falling back to native code (such as a wrapper around HTML Tidy) is a big disadvantage: it would mean extra deployment hassle, sacrifice platform independence, and be impossible in sandboxed scenarios.

Any suggestions? To recap, I'm looking for:

  • An HTML cleaner à la HTML Tidy
  • Must be able to deal with real-world HTML, not just XHTML, at the very least correctly reading valid HTML 4
  • Must be able to convert to a more easily processable XML format
  • Should be purely managed code.
+1  A: 

Saw your post when starting to post the same question...

Have you found anything?

I noticed that there's a version for Java (http://jtidy.sourceforge.net/).

And I'm sure you've seen the one that calls out to HtmlTidy (http://schneegans.de/asp.net/tidy/), but I haven't found any that are managed/"modern".

Rob
I haven't found anything. In the meantime I'm using a few regex-based hacks to "hopefully" turn the HTML into valid XHTML; nothing fancy enough to parse my example above. If the result can be parsed as XHTML (which is usually the case, since most HTML is actually syntactically pretty clean), I use LINQ to XML to extract the elements and attributes that are in a whitelisted, known-safe set and trash the rest. That works well enough for now, particularly since browsers generate pretty parseable markup, so TinyMCE and CKEditor end up sending fairly clean things over the wire.
Eamon Nerbonne
@Eamon: thanks for the info!
Rob
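For anyone curious about the whitelist approach Eamon describes in the comments, it can be sketched roughly as follows. This is Python with `xml.etree.ElementTree` standing in for LINQ to XML, not his actual code. The `ALLOWED` table and the `sanitize` name are hypothetical, the input is assumed to carry no XML namespace declaration, and "trash the rest" is interpreted here as dropping disallowed elements together with their content.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist: element name -> attribute names allowed on it.
ALLOWED = {
    "div": set(), "p": set(), "ul": set(), "ol": set(), "li": set(),
    "em": set(), "strong": set(), "a": {"href"},
}


def _clean(elem):
    # Strip attributes that are not whitelisted for this element.
    for name in list(elem.attrib):
        if name not in ALLOWED[elem.tag]:
            del elem.attrib[name]
    previous = None  # last child that was kept
    for child in list(elem):
        if child.tag in ALLOWED:
            _clean(child)
            previous = child
        else:
            # Drop the element and its subtree, but keep the text that
            # followed it by attaching it to the preceding node.
            tail = child.tail or ""
            if previous is None:
                elem.text = (elem.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            elem.remove(child)


def sanitize(xhtml_fragment):
    """Return the cleaned markup, or None if the input is not
    well-formed XML or its root element is not whitelisted."""
    try:
        root = ET.fromstring(xhtml_fragment)
    except ET.ParseError:
        return None
    if root.tag not in ALLOWED:
        return None
    _clean(root)
    return ET.tostring(root, encoding="unicode")
```

Note that whitelisting attribute names alone is not enough for real sanitizing: an allowed `href` can still carry a `javascript:` URL, so attribute values need checking as well.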