Just looking for a really easy way to clean up some HTML (possibly with embedded JavaScript). Tried two different HtmlTidy .NET ports and both are throwing exceptions...

Sorry, by "clean" I mean "indent". The HTML is not malformed, at all. It's XHTML strict.


Finally got something working with SGML, but this is seriously the most ridiculous chunk of code ever to indent some HTML.

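// Requires: using System.IO; using System.Xml; using Sgml; (the SgmlReader library)
private static string FormatHtml(string input)
{
    var sgml = new SgmlReader { DocType = "HTML", InputStream = new StringReader(input) };
    using (var sw = new StringWriter())
    {
        using (var xw = new XmlTextWriter(sw) { Formatting = Formatting.Indented, Indentation = 2 })
        {
            // Position the reader on the first node, then copy every node to the indenting writer.
            sgml.Read();
            while (!sgml.EOF)
                xw.WriteNode(sgml, true);
        }

        // Read the result only after the XmlTextWriter has been closed (and flushed),
        // and while sw is still in scope.
        return sw.ToString();
    }
}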
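
Since the input is described as well-formed XHTML, a shorter route might be LINQ to XML, which indents by default when an element is written back out as a string. The following is only a sketch under assumptions not stated in the thread: the input is a single well-formed element with no DOCTYPE and no named HTML entities such as &nbsp; (both would need extra handling), and the names XhtmlIndenter and IndentXhtml are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class XhtmlIndenter
{
    // Hypothetical helper (not from the thread): re-serializes well-formed XHTML with indentation.
    // Assumes a single root element, no DOCTYPE, and no named HTML entities.
    public static string IndentXhtml(string input)
    {
        // XElement.Parse discards insignificant whitespace by default,
        // and ToString() writes the tree back out indented.
        // Malformed input throws System.Xml.XmlException.
        return XElement.Parse(input).ToString();
    }

    public static void Main()
    {
        Console.WriteLine(IndentXhtml("<div><p>Hello</p><ul><li>One</li><li>Two</li></ul></div>"));
    }
}
```

Elements with mixed content (text next to child elements) are left unindented by the writer so that their text is not altered, which may or may not be what you want for inline markup.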
+3  A: 

UPDATE:

Check HtmlTextWriter or XhtmlTextWriter (usage: "Formatting Html Output with HtmlTextWriter"). Maybe constructing the HTML via HtmlTextWriter in the first place would work better?

Also check: LINQ & Lambda, Part 3: Html Agility Pack to LINQ to XML Converter

See also http://www.manoli.net/csharpformat/ (source code is available, in case you missed it).


Maybe you want to do it yourself? This project can be helpful: Html Agility Pack

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

Html Agility Pack now supports Linq to Objects (via a LINQ to Xml Like interface). Check out the new beta to play with this feature

Sample applications:

  • Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it.

  • Web scanners. You can easily get to img/src or a/hrefs with a bunch XPATH queries.

  • Web scrapers. You can easily scrap any existing web page into an RSS feed for example, with just an XSLT file serving as the binding. An example of this is provided.


Also you can try this implementation: A managed wrapper for the HTML Tidy library

Nick Martyshchenko
I've used HtmlAgilityPack a lot in the past... but can it tidy up HTML?
Mark
HAP is not a replacement for Tidy; rather, it builds a DOM for you, which you can then process as needed. Also, I'm not sure it is smart enough to parse malformed HTML (if you have to process something weird). BTW, can you define a bit better what you mean by "clean": which rules have to be applied? You can also use the original HTML Tidy (http://bit.ly/aahXs8) without relying on a wrapper if you just need to clean some files and not on a regular basis.
Nick Martyshchenko
I don't need to process the DOM, I just want to indent it. I specifically want a C# version because I need to use it in my C# project. I'm generating some HTML as a string, I want to take that string, have it indented, and output another string. No more, no less. Thought it would be easy to find a library to do that.
Mark
That codeproject looks nice, but it doesn't compile either. DLL linker errors.
Mark
Also, what DLL do I need to reference to access HtmlTextWriter? I can't find it anywhere in VS2010. System.Web.UI doesn't exist.
Mark
Your app probably targets the Client Profile? You have to switch it to the full framework and reference System.Web.dll.
Nick Martyshchenko
There is also the XhtmlTextWriter class (http://bit.ly/9VlCND), since you have to output XHTML.
Nick Martyshchenko
Ahh... good call with the client profile. I'm going to look at the HtmlWriters.
Mark
+1  A: 

I've used SGML Reader to convert HTML to XHTML in the past. Might be worth looking into...

I never had any problems with it when I was using it.

Abe Miessler
I did look into it. I can't figure out how to get a string back...
Mark
Take a look at this link: http://www.eggheadcafe.com/articles/20030317.asp
Abe Miessler
A bit ridiculous to format some HTML, but it does work. Thanks :)
Mark