I want to know how to use the HTML Agility Pack, as I am totally new to it. My XHTML document is not completely valid, which is why I want to use it. Can anyone tell me how to use it in my project? My project is in C#.

+2  A: 

If you're just trying to parse your document into something easy to process, this is a handy extension method that uses the HTML Agility Pack under the hood:

http://vijay.screamingpens.com/archive/2008/05/26/linq-amp-lambda-part-3-html-agility-pack-to-linq.aspx

SeanG
+26  A: 
  1. Download and build the HTMLAgilityPack solution.

  2. In your application, add a reference to HTMLAgilityPack.dll in the HTMLAgilityPack\Debug (or Release) \bin folder.

Then, as an example:

    // Requires "using System.Linq;" for the Any() call below
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    // There are various options, set as needed
    htmlDoc.OptionFixNestedTags = true;

    // filePath is a path to a file containing the HTML
    htmlDoc.Load(filePath);

    // Use:  htmlDoc.LoadHtml(htmlString);  to load from a string

    // ParseErrors is a collection of any errors from the Load statement.
    // It was an ArrayList in older releases (use Count > 0 there) and is an
    // IEnumerable<HtmlParseError> in current ones.
    if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Any())
    {
        // Handle any parse errors as required
    }
    else
    {
        if (htmlDoc.DocumentNode != null)
        {
            HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");

            if (bodyNode != null)
            {
                // Do something with bodyNode
            }
        }
    }

(NB: This code is an example only and not necessarily the best/only approach. Do not use it blindly in your own application.)

The HtmlDocument.Load() method also accepts a stream, which is very useful for integrating with other stream-oriented classes in the .NET Framework. HtmlEntity.DeEntitize() is another useful method, for processing HTML entities correctly. (Thanks, Matthew.)
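As a rough sketch of both points, the snippet below loads markup from a stream and then decodes the entities in the extracted text. The `MemoryStream` and the sample markup are stand-ins (assumptions for illustration) for a real file or HTTP response stream, and the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

static class StreamExample
{
    // Hypothetical helper: loads HTML from any stream and returns the
    // entity-decoded text of the first <p> element, or null if there is none.
    public static string FirstParagraphText(Stream input)
    {
        var doc = new HtmlDocument();
        doc.Load(input, Encoding.UTF8); // assumes the stream is UTF-8 encoded

        HtmlNode p = doc.DocumentNode.SelectSingleNode("//p");
        if (p == null)
        {
            return null;
        }

        // InnerText leaves entities such as &amp; as written;
        // DeEntitize converts them to the characters they stand for.
        return HtmlEntity.DeEntitize(p.InnerText);
    }

    static void Main()
    {
        // Sample markup standing in for a file or HTTP response body.
        byte[] bytes = Encoding.UTF8.GetBytes(
            "<html><body><p>Fish &amp; chips</p></body></html>");

        using (var stream = new MemoryStream(bytes))
        {
            Console.WriteLine(FirstParagraphText(stream));
        }
    }
}
```

Because the helper takes a plain `Stream`, the same call should work with a `FileStream` or a network stream in place of the `MemoryStream`.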

HtmlDocument and HtmlNode are the classes you'll use most. Similar to an XML parser, HtmlNode provides the SelectSingleNode and SelectNodes methods, which accept XPath expressions.
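To illustrate the XPath methods, here is a minimal sketch that collects the `href` value of every link in a document. The class name, method name and sample markup are made up for the example, and it assumes a build of the library that includes XPath support. Note that `SelectNodes` returns null rather than an empty collection when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkExample
{
    // Hypothetical helper: returns the href of every <a> element that has one.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // XPath: all <a> elements anywhere in the document with an href attribute.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links; // no matches: SelectNodes gives null, not an empty list
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }

        return links;
    }

    static void Main()
    {
        // Sample markup; the unclosed <p> is deliberate.
        string html = "<body><p><a href=\"/one\">One</a> <a href=\"/two\">Two</a> <a>No link</a></body>";

        foreach (string link in GetLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```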

Pay attention to the HtmlDocument.Option* boolean properties (OptionFixNestedTags, for example). These control how the Load and LoadHtml methods will process your HTML/XHTML.
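A small sketch of setting a few of these options before loading is shown below. The choice of options and the sample markup are only assumptions for illustration, the helper name is hypothetical, and the effect of each option is best confirmed against the library's reference for your version.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class OptionsExample
{
    // Hypothetical helper: parses markup with a few lenient options switched on.
    public static HtmlDocument LoadLenient(string html)
    {
        var doc = new HtmlDocument();

        // Set options before loading, since they influence parsing.
        doc.OptionFixNestedTags = true;  // partially fix LI, TR, TH, TD nesting errors
        doc.OptionAutoCloseOnEnd = true; // close unclosed nodes at the end of the document
        doc.OptionCheckSyntax = true;    // check for unclosed nodes when parsing finishes

        doc.LoadHtml(html);
        return doc;
    }

    static void Main()
    {
        // Sample markup with unclosed <li> elements.
        HtmlDocument doc = LoadLenient("<ul><li>first<li>second</ul>");

        // Count() assumes a release where ParseErrors is IEnumerable<HtmlParseError>.
        Console.WriteLine("Parse errors: " + doc.ParseErrors.Count());
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Running this with the options toggled on and off is a quick way to see how each one changes the resulting tree for your own documents.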

There is also a compiled help file called HtmlAgilityPack.chm that has a complete reference for each of the objects. This is normally in the base folder of the solution.

Ash
Also note that Load accepts a Stream parameter, which is convenient in many situations. I used it for an HTTP stream (WebResponse.GetResponseStream). Another good method to be aware of is HtmlEntity.DeEntitize (part of HTML Agility Pack). This is needed to process entities manually in some cases.
Matthew Flaschen
Note: in the latest beta of Html Agility Pack (1.4.0 Beta 2, released Oct 3 2009), the help file has been moved out into a separate download because of dependencies on Sandcastle, DocProject and the Visual Studio 2008 SDK.
rtpHarry
`SelectSingleNode()` seems to have been removed a while ago
Chris S
+4  A: 

I don't know if this will be of any help to you, but I have written a couple of articles which introduce the basics.

The next article is 95% complete, I just have to write up explanations of the last few parts of the code I have written. If you are interested then I will try to remember to post here when I publish it.

rtpHarry