This might sound a bit complicated, but I want to find all <a>s that contain <img>s, ordered so that the images sharing a node with the greatest number of other images are chosen first.

For example, if my page looks like this:

If the blue squares are <div>s and the pink squares are <img>s, then the middle div contains the most images, so those images are chosen first. Since they aren't nested any deeper than that, they just appear in the order that they are on the page. Next, the first div is chosen (it contains the second-most images), and so forth. Does that make sense?

We can think of it sort of recursively. First the body would be chosen since that will always contain the most images, then each of the direct children are examined to see which contains the most image descendants (not necessarily direct), then we go into that node, and repeat...

+1  A: 

You could try looking at the count of images for every node.

    // Requires: using System.Xml;
    public static XmlNode FindNodeWithMostImages(XmlNodeList nodes)
    {
        var greatestImageCount = 0;
        XmlNode nodeWithMostImages = null;

        foreach (XmlNode node in nodes)
        {
            // Count every <img> descendant of this node, at any depth.
            var currentNodeImageCount = node.SelectNodes(".//img").Count;

            if (currentNodeImageCount > greatestImageCount)
            {
                greatestImageCount = currentNodeImageCount;
                nodeWithMostImages = node;
            }
        }

        // Null when none of the nodes contains an image.
        return nodeWithMostImages;
    }
Jason Rowe
I guess that's the only way, huh? Little more elegant with LINQ I think, but I guess that's on the right track.
Mark
A while back I looked into recursive LINQ and found this extension. You might be able to do something like this example: http://codepaste.net/gf3q5a
Jason Rowe
+1  A: 

XPath 1.0 does not provide a way to sort a node-set. You will need to combine XPath with something else.

Here is an example XSLT solution that finds all elements that contain descendant <img> elements, and then sorts them by the count of their descendant <img> elements in descending order.

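    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="text" />

        <xsl:template match="/">
            <!-- To select only <a> elements, use //a[descendant::img] -->
            <xsl:for-each select="//*[descendant::img]">
                <!-- data-type="number" is needed; the default text sort would put 9 before 10 -->
                <xsl:sort select="count(descendant::img)" data-type="number" order="descending" />

                <!-- Example output to demonstrate which elements have been selected -->
                <xsl:value-of select="name()" /><xsl:text> has </xsl:text>
                <xsl:value-of select="count(descendant::img)" />
                <xsl:text> descendant images&#10;</xsl:text>

            </xsl:for-each>
        </xsl:template>

    </xsl:stylesheet>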

It wasn't clear to me from your question and examples whether you want to find any element with descendant <img> elements or just <a> elements with descendant <img> elements.

If you want to find only <a> elements with descendant <img> elements, adjust the XPath in the for-each to select: //a[descendant::img]
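Since the question is tagged c#, the same select-then-sort idea can also be sketched without XSLT: let XPath do the selection and let LINQ do the sorting. This is only a rough sketch under assumptions: it uses System.Xml.Linq rather than HtmlAgilityPack, so the input has to be well-formed XML, and the names `ImageLinkSorter` and `LinksByImageCount` are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper, not part of any library.
public static class ImageLinkSorter
{
    // Selects every <a> that has a descendant <img> with XPath, then sorts
    // the matches by their descendant <img> count, largest first.
    // OrderByDescending is stable, so ties stay in document order.
    public static List<XElement> LinksByImageCount(XDocument doc)
    {
        return doc.XPathSelectElements("//a[descendant::img]")
                  .OrderByDescending(a => a.Descendants("img").Count())
                  .ToList();
    }
}
```

Calling `ImageLinkSorter.LinksByImageCount(XDocument.Parse(markup))` returns the matching <a> elements as a flat sorted list. Like the XSLT, it ranks each element by its own image count and does not do the nested, container-by-container ordering described in the question.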

Mads Hansen
Oh, sorry. I made a few changes to the question and it became more and more apparent that XPath wasn't quite sufficient. I was hoping that the tags `c#` and `htmlagilitypack` would hint that I prefer using those technologies, as that's what the rest of my app is written in. This is kind of neat though ;) Hopefully the comments below the Q clear up your other questions.
Mark
A: 

Current solution:

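    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;

    // SelectNodes returns null rather than an empty collection when nothing matches.
    private static int Count(HtmlNodeCollection nc) {
        return nc == null ? 0 : nc.Count;
    }

    // Appends links to the list, visiting the children that contain the most
    // image links (<a href> with an <img> child) first.
    private static void BuildList(HtmlNode node, ref List<HtmlNode> list) {
        var sortedNodes = from n in node.ChildNodes
                          orderby Count(n.SelectNodes(".//a[@href and img]")) descending
                          select n;
        foreach (var n in sortedNodes) {
            // Note: every <a> reached here is added, even one without an <img> child.
            if (n.Name == "a") list.Add(n);
            else if (n.HasChildNodes) BuildList(n, ref list);
        }
    }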

Example usage:

    private static void ProcessDocument(HtmlDocument doc, Uri baseUri) {
        var linkNodes = new List<HtmlNode>(100);
        BuildList(doc.DocumentNode, ref linkNodes);
        // ...

It's a bit inefficient, though, because every level of the recursion recounts the links in each subtree, but oh well.
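One way to avoid the recounting might be to count each subtree once in a bottom-up pass, cache the totals, and then do the ordered walk against the cache. The following is only a sketch under assumptions: it uses System.Xml.Linq as a stand-in for HtmlAgilityPack (so it needs well-formed markup), the names `LinkOrdering`, `IsImageLink`, `CountLinks` and `Visit` are hypothetical, and unlike the code above it only collects links that pass the same `a[@href and img]` test used for counting.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class LinkOrdering
{
    // Returns <a href> elements that directly contain an <img>, visiting
    // sibling subtrees in order of how many such links they hold.
    public static List<XElement> BuildList(XElement root)
    {
        var counts = new Dictionary<XElement, int>();
        CountLinks(root, counts);
        var result = new List<XElement>();
        Visit(root, counts, result);
        return result;
    }

    private static bool IsImageLink(XElement e)
    {
        return e.Name.LocalName == "a"
            && e.Attribute("href") != null
            && e.Elements("img").Any();
    }

    // Post-order pass: an element's total is its own match plus the totals
    // of its children, so every element is counted exactly once.
    private static int CountLinks(XElement e, Dictionary<XElement, int> counts)
    {
        int total = IsImageLink(e) ? 1 : 0;
        foreach (var child in e.Elements())
            total += CountLinks(child, counts);
        counts[e] = total;
        return total;
    }

    private static void Visit(XElement e, Dictionary<XElement, int> counts, List<XElement> result)
    {
        // OrderByDescending is a stable sort, so ties keep document order.
        foreach (var child in e.Elements().OrderByDescending(c => counts[c]))
        {
            if (IsImageLink(child)) result.Add(child);
            else Visit(child, counts, result);
        }
    }
}
```

The trade-off is one dictionary entry per element in exchange for replacing the repeated subtree queries with a lookup. The same shape should carry over to HtmlNode, but that port is untested here.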

Mark