I have some HTML, e.g.:

<%@ Page Title="About Us" Language="C#" MasterPageFile="~/Site.master" AutoEventWireup="true"
    CodeBehind="ContentManagedTargetPage.aspx.cs" Inherits="xxx.ContentManagedTargetPage" %>
<%@ Register TagPrefix="CxCMS" Namespace="xxx.ContentManagement.ASPNET.UI" Assembly="xxx.ContentManagement.ASPNET" %>
<asp:Content ID="HeaderContent" runat="server" ContentPlaceHolderID="HeadContent">
</asp:Content>
<asp:Content ID="BodyContent" runat="server" ContentPlaceHolderID="MainContent">
    <h2>
        Content Managed
    </h2>
    <p>
        Put content here.
        [<CxCMS:ContentManagedPlaceHolder Key="keyThingy" runat="server" />]
    </p>
</asp:Content>

And I want to find all the instances of the CxCMS:ContentManagedPlaceHolder element.

I'm using HTML Agility Pack, which seems the best fit.

However, despite looking at the [meagre] documentation, I can't get my code to work.

I would expect the following to work:

string searchForElement = "CxCMS:ContentManagedPlaceHolder";
IEnumerable<HtmlNode> contentPlaceHolderHtmlNodes = HtmlDocument.DocumentNode.Descendants(searchForElement);
int count = contentPlaceHolderHtmlNodes.Count();                

But I get nothing back.

If I change to DescendantsOrSelf, I get the document node back, "#document" - which is incorrect:

string searchForElement = "CxCMS:ContentManagedPlaceHolder";
IEnumerable<HtmlNode> contentPlaceHolderHtmlNodes = HtmlDocument.DocumentNode.DescendantsOrSelf(searchForElement);
int count = contentPlaceHolderHtmlNodes.Count();                

I also tried using LINQ:

string searchForElement = "CxCMS:ContentManagedPlaceHolder";
IEnumerable<HtmlNode> contentPlaceHolderHtmlNodes = HtmlDocument.DocumentNode.DescendantsOrSelf().Where(q=>q.Name==searchForElement);
int count = contentPlaceHolderHtmlNodes.Count();                

As neither of these approaches works, I moved on to using SelectNodes instead:

string searchForElement = "CxCMS:ContentManagedPlaceHolder";
string xPath = "//" + searchForElement; // "//CxCMS:ContentManagedPlaceHolder"
var nodes= HtmlDocument.DocumentNode.SelectNodes(xPath);

This just throws the exception: "Namespace Manager or XsltContext needed. This query has a prefix, variable, or user-defined function." I can't find any way of adding namespace management to the HtmlDocument object.

What am I missing, here? The DescendantsOrSelf() method works if using a "standard" HTML tag, such as "p", but not the one I have. Surely it should work? (It needs to!)

A: 

As usual I spend an hour or so playing, I ask the question, and I figure it out seconds after.

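When searching with Descendants() or DescendantsOrSelf(), the node name must be given in lower case: "cxcms:contentmanagedplaceholder" rather than "CxCMS:ContentManagedPlaceHolder".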
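
A minimal sketch of that fix, assuming the HtmlAgilityPack package is referenced by the project. FindByName is a hypothetical helper name of my own, not part of the library; it only lower-cases the tag name before passing it to Descendants().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class PlaceholderSearch
{
    // Hypothetical helper (not an HtmlAgilityPack API): lower-case the
    // requested tag name, then let Descendants() do the matching.
    public static List<HtmlNode> FindByName(HtmlDocument doc, string tagName)
    {
        return doc.DocumentNode
                  .Descendants(tagName.ToLowerInvariant())
                  .ToList();
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // A cut-down fragment of the ASPX source from the question.
        doc.LoadHtml("<p>[<CxCMS:ContentManagedPlaceHolder Key=\"keyThingy\" runat=\"server\" />]</p>");

        // The caller can keep the mixed-case name; the helper normalises it.
        List<HtmlNode> nodes = FindByName(doc, "CxCMS:ContentManagedPlaceHolder");
        Console.WriteLine(nodes.Count);
    }
}
```

Lower-casing at the call site keeps the mixed-case constant readable elsewhere in the code. ToLowerInvariant() is used here so the result does not depend on the current culture, which should be safe for plain ASCII tag names like this one.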

Program.X
A: 

Your example is actually ASPX. If you're parsing the output of that page, it's doubtful that <CxCMS:ContentManagedPlaceHolder Key="keyThingy" runat="server" /> actually renders as that on the client side. Look at the HTML source on the client, find the output tags that correspond to the <CxCMS:ContentManagedPlaceHolder Key="keyThingy" runat="server" />, and then use those in HtmlDocument.DocumentNode.Descendants.

On the other hand, if you're parsing the ASPX source, you may need to tweak your input to HtmlDocument.DocumentNode.Descendants so that HtmlAgilityPack recognizes it. Keep in mind, however, that ASPX != HTML, and I don't think HtmlAgilityPack is built to parse it.

Edit: Looking through HtmlNode.cs in the HtmlAgilityPack source code, it looks like you're right about it needing to be lowercase due to the following two sections:

    /// <summary>
    /// Gets or sets this node's name.
    /// </summary>
    public string Name
    {
        get
        {
            if (_name == null)
            {
                Name = _ownerdocument._text
                                     .Substring(_namestartindex, _namelength);
            }
            return _name != null ? _name.ToLower() : string.Empty;
        }
        set { _name = value; }
    }

and

    /// <summary>
    /// Get all descendant nodes with matching name
    /// </summary>
    /// <param name="name"></param>
    /// <returns></returns>
    public IEnumerable<HtmlNode> Descendants(string name)
    {
        foreach (HtmlNode node in Descendants())
            if (node.Name == name)
                yield return node;
    }

Note the _name.ToLower() in the getter for Name, and the case-sensitive if (node.Name == name) in the Descendants method. The same check is used by the DescendantsAndSelf, Element and Elements methods.
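
The interaction between those two pieces can be reproduced without the library. The sketch below is only an illustration: FakeNode and Demo.Filter are hypothetical stand-ins I made up to mirror the quoted getter and filter, not HtmlAgilityPack types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for HtmlNode: the stored name keeps its original
// casing, but the getter always hands back a lower-cased copy.
class FakeNode
{
    private readonly string _name;
    public FakeNode(string name) { _name = name; }
    public string Name => _name != null ? _name.ToLower() : string.Empty;
}

static class Demo
{
    // Same shape as the quoted Descendants(name) filter:
    // a plain, case-sensitive == comparison against the getter.
    public static IEnumerable<FakeNode> Filter(IEnumerable<FakeNode> nodes, string name)
    {
        foreach (FakeNode node in nodes)
            if (node.Name == name)
                yield return node;
    }

    static void Main()
    {
        var nodes = new List<FakeNode>
        {
            new FakeNode("p"),
            new FakeNode("CxCMS:ContentManagedPlaceHolder"),
        };

        // Mixed-case search term never equals the lower-cased getter value.
        Console.WriteLine(Filter(nodes, "CxCMS:ContentManagedPlaceHolder").Count()); // prints 0
        // Lower-case search term matches.
        Console.WriteLine(Filter(nodes, "cxcms:contentmanagedplaceholder").Count()); // prints 1
    }
}
```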

jball
Yes, I am working with ASPX source. It seems to work in the tests I have done so far, after figuring the lower-case thing out! Thanks.
Program.X