views: 335 · answers: 4

When I parse HTML, I want to obtain only the innermost tags in the entire document. My intention is to extract data semantically from the HTML document.

So if I have some HTML like this:

<html>
     <table>
           <tr><td>X</td></tr>
           <tr><td>Y</td></tr>
     </table>
</html>

I want <td>X</td> and <td>Y</td> alone. Is this possible using Beautiful Soup or lxml?
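For illustration, a minimal Beautiful Soup sketch of the desired result (assuming the `bs4` package is installed) could look like this - a tag is "innermost" when `tag.find()` returns `None`, i.e. it has no child tags:

```python
from bs4 import BeautifulSoup

html_doc = """<html>
     <table>
           <tr><td>X</td></tr>
           <tr><td>Y</td></tr>
     </table>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only tags with no tag children (the leaf elements).
leaves = [tag for tag in soup.find_all() if tag.find() is None]

for tag in leaves:
    print(tag)
```

This prints `<td>X</td>` and `<td>Y</td>`, one per line.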

+1  A: 

After you made sure your document is well-formed (by parsing it using lxml, for example), you could use XPath to query for all nodes that have no further child elements.

//*[count(*) = 0]
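A minimal sketch of this in Python with lxml (assuming the `lxml` package is installed):

```python
from lxml import html

doc = html.fromstring("""<html>
     <table>
           <tr><td>X</td></tr>
           <tr><td>Y</td></tr>
     </table>
</html>""")

# Select every element that has no child elements (the leaf nodes).
for leaf in doc.xpath('//*[count(*) = 0]'):
    print(html.tostring(leaf).decode())
```

For the sample document above, the two `<td>` elements are the only leaves selected.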
Tomalak
It's HTML, not XHTML - so it won't work, because of `<img>` etc. not being well-formed.
Dead account
I was referring to lxml, where XPath should work.
Tomalak
+3  A: 

In .NET I've used the HtmlAgilityPack library to make all HTML parsing easy. It loads the DOM, and you can select nodes - in your case, nodes with no children. Maybe that helps.

Paul G.
A: 

That's one of the few situations where you could actually use a Regular Expression to parse the HTML string.

<(\w+)[^>]*>[^<]*</\1\s*>
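In Python that pattern (the backslashes before `<` are unnecessary in `re`) can be applied like this - a sketch that only works for elements containing no nested tags or stray `<` characters:

```python
import re

html_doc = """<html>
     <table>
           <tr><td>X</td></tr>
           <tr><td>Y</td></tr>
     </table>
</html>"""

# Opening tag, text containing no further '<', then the matching closing tag.
pattern = re.compile(r'<(\w+)[^>]*>[^<]*</\1\s*>')

innermost = [m.group(0) for m in pattern.finditer(html_doc)]
print(innermost)  # ['<td>X</td>', '<td>Y</td>']
```

Outer elements such as `<table>` fail to match because their content contains `<` before the matching close tag.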
Lucero
A: 

If you can use DOM handling (i.e. in a browser), you can work with the parentNode attribute of every tag, recursively count its ancestors, and keep the element with the largest count.

In JavaScript pseudocode (tested in Firefox):

function recursiveCountParentNodeOn(element) {
    // Walk up the parentNode chain, counting how deep the element sits.
    var count = 0;
    while (element.parentNode) {
        count++;
        element = element.parentNode;
    }
    return count;
}

var allElements = document.getElementsByTagName("*");
var maxElementReference, maxParentNodeCount = 0;
var i;

for (i = 0; i < allElements.length; i++) {

    var count = recursiveCountParentNodeOn(allElements[i]);

    if (maxParentNodeCount < count) {
        maxElementReference = allElements[i];
        maxParentNodeCount = count;
    }
}
ATorras