ansaurus

Question

C#: Is there a LINQ to HTML, or some other good .NET HTML manipulation API?

Answer 1

A:

HTML is rarely well-formed enough that you could reliably use LINQ to XML. It's conceivable that you might find an HTML "cleaner" that could fix the formatting well enough for it to be read, but there's no telling how robust it would be.

I assume this is a screen scraper that reads from an HTML table over which you have no control. Don't stress over robustness in this case; screen scraping is inherently brittle. If your requirements are set in stone, design the scraper to be easily updatable if/when the HTML you are scraping changes.
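To make the well-formedness point concrete, here is a minimal sketch of reading table rows with LINQ to XML. It assumes the markup has no default XML namespace; the class and method names (XhtmlTableSketch, TryReadRows) are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XhtmlTableSketch
{
    // Returns the trimmed cell text of every row, or null when the markup
    // is not well-formed XML. Assumption: the markup declares no default
    // namespace, so plain element names like "tr" and "td" match.
    public static List<List<string>> TryReadRows(string markup)
    {
        try
        {
            XElement table = XElement.Parse(markup);
            return table.Descendants("tr")
                .Select(tr => tr.Elements("td").Select(td => td.Value.Trim()).ToList())
                .ToList();
        }
        catch (XmlException)
        {
            // Typical "tag soup" (unclosed tags, bare ampersands) ends up here.
            return null;
        }
    }
}
```

Calling TryReadRows on `<table><tr><td>a</td><td>b</td></tr></table>` yields a single row, while the same table with unclosed `<td>` tags makes XElement.Parse throw, so the method returns null. That failure mode is why a tolerant HTML parser or cleaner is usually needed first.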

Dave Swersky 2009-02-12 16:51:31

Answer 2

+1 A:

I had to do this in a recent project and I used LINQ to XML. If you know the input will always be clean XHTML, you can probably copy the DOM recursively without much trouble. I instead used the DevComponents HTMLDocument class library (http://www.devcomponents.com/htmldoc/) to convert the HTML to XML, then loaded that into an XElement. This reduces the challenge to getting your HTML into an XElement hierarchy. The one caveat is that it chokes on script elements, so I deleted those by brute force.

    using System.Collections.Generic;
    using System.Xml;
    using System.Xml.Linq;

    /// <summary>
    /// Extracts an HtmlDocument DOM to an XElement DOM that can be queried using LINQ to XML.
    /// </summary>
    /// <param name="htmlDocument">DevComponents HtmlDocument containing the DOM of the page to extract.</param>
    /// <returns>HTML content as <see cref="XElement" /> for consumption by LINQ to XML.</returns>
    public XElement ExtractXml(HtmlDocument htmlDocument) {
        XmlDocument xmlDoc = htmlDocument.ToXMLDocument();

        // Remove all script elements from the XML DOM; otherwise
        // XElement.Parse chokes on the serialized document.
        // Copy the matches into a list first: GetElementsByTagName returns a
        // live list, so removing nodes while enumerating it is unsafe.
        IList<XmlNode> nodes = new List<XmlNode>();
        foreach (XmlNode node in xmlDoc.GetElementsByTagName("script"))
            nodes.Add(node);
        foreach (XmlNode node in nodes)
            node.ParentNode.RemoveChild(node);

        // XElement.Parse takes a string, so serialize the cleaned document.
        return XElement.Parse(xmlDoc.OuterXml);
    }

AndyM 2009-02-12 16:55:28

Answer 3

+3 A:

~~Even though it's not LINQ based,~~ I suggest researching the HTML Agility Pack from CodePlex.

Note: HTML Agility Pack now supports LINQ to Objects (via a LINQ to XML-like interface).

From the HTML Agility Pack page:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
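As a rough sketch of what that LINQ-style interface looks like in practice, the following pulls link targets out of deliberately sloppy markup. It assumes the HtmlAgilityPack package is referenced; the sample HTML and the class name AgilityPackSketch are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is referenced

public static class AgilityPackSketch
{
    public static void Main()
    {
        // Hypothetical input: no root element, unclosed <p> and <br>,
        // and one unquoted attribute value.
        const string html =
            "<p>First<br><a href=\"/docs\">docs</a><p>Second <a href=/faq>FAQ</a>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants() returns an IEnumerable<HtmlNode>, so ordinary
        // LINQ to Objects operators apply.
        var hrefs = doc.DocumentNode
            .Descendants("a")
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.Length > 0);

        foreach (string href in hrefs)
            Console.WriteLine(href);
    }
}
```

The same query shape (Descendants, then Select and Where) mirrors how an XElement would be queried, which is what makes the library feel like "LINQ to HTML" even though it is plain LINQ to Objects over the parsed node tree.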

LaptopHeaven 2009-02-12 16:57:21

Have you used this product with success?

Peter J 2009-02-12 17:24:27

What does its complex license mean?

Ian Ringrose 2009-03-09 16:37:18

Answer 4

A:

I've posted some code providing "LINQ to HTML" functionality here:

http://stackoverflow.com/questions/100358/looking-for-c-html-parser/624410#624410

Frank Schwieterman 2009-03-08 22:17:27

Answer 5

+2 A:

There's a LINQ to HTML library here:

http://www.justagile.com/linq-to-html.aspx

keith 2009-12-03 07:52:31
