views: 335 · answers: 4

When I parse HTML, I want to obtain only the innermost tags in the entire document. My intention is to extract data semantically from the HTML document.

So if I have some HTML like this:

<html>
     <table>
           <tr><td>X</td></tr>
           <tr><td>Y</td></tr>
     </table>
</html>

I want <td>X</td> and <td>Y</td> alone. Is this possible using Beautiful Soup or lxml?
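For illustration, a minimal Beautiful Soup sketch of the desired result (assuming the `bs4` package is installed) could look like this - a tag is "innermost" when `tag.find()` returns `None`, i.e. it has no child tags:

```python
from bs4 import BeautifulSoup

html_doc = """<html>
     <table>
           <tr><td>X</td></tr>
           <tr><td>Y</td></tr>
     </table>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only tags with no tag children (the leaf elements).
leaves = [tag for tag in soup.find_all() if tag.find() is None]

for tag in leaves:
    print(tag)
```

This prints `<td>X</td>` and `<td>Y</td>`, one per line.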

+1  A: 

After you made sure your document is well-formed (by parsing it using lxml, for example), you could use XPath to query for all nodes that have no further child elements.

//*[count(*) = 0]
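A minimal sketch of this in Python with lxml (assuming the `lxml` package is installed):

```python
from lxml import html

doc = html.fromstring("""<html>
     <table>
           <tr><td>X</td></tr>
           <tr><td>Y</td></tr>
     </table>
</html>""")

# Select every element that has no child elements (the leaf nodes).
for leaf in doc.xpath('//*[count(*) = 0]'):
    print(html.tostring(leaf).decode())
```

For the sample document above, the two `<td>` elements are the only leaves selected.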
Tomalak
It's HTML, not XHTML - so it won't work, because of `<img>` etc. not being well-formed.
Dead account
I was referring to lxml, where XPath should work.
Tomalak
+3  A: 

In .NET I've used the HtmlAgilityPack library to make all HTML parsing easy. It loads the DOM, and you can select nodes - in your case, nodes with no children. Maybe that helps.

Paul G.
A: 

That's one of the few situations where you could actually use a Regular Expression to parse the HTML string.

<(\w+)[^>]*>[^<]*</\1\s*>
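In Python that pattern (the backslashes before `<` are unnecessary in `re`) can be applied like this - a sketch that only works for elements containing no nested tags or stray `<` characters:

```python
import re

html_doc = """<html>
     <table>
           <tr><td>X</td></tr>
           <tr><td>Y</td></tr>
     </table>
</html>"""

# Opening tag, text containing no further '<', then the matching closing tag.
pattern = re.compile(r'<(\w+)[^>]*>[^<]*</\1\s*>')

innermost = [m.group(0) for m in pattern.finditer(html_doc)]
print(innermost)  # ['<td>X</td>', '<td>Y</td>']
```

Outer elements such as `<table>` fail to match because their content contains `<` before the matching close tag.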
Lucero
A: 

If you can use DOM handling (i.e. in a browser), you can work with the parentNode attribute of every tag, recursively count its ancestors, and keep the element with the largest count.

In JavaScript pseudocode (tested in Firefox):

function recursiveCountParentNodeOn(element) {
    // Walk up the parentNode chain, counting how deep the element sits.
    var count = 0;
    while (element.parentNode) {
        count++;
        element = element.parentNode;
    }
    return count;
}

var allElements = document.getElementsByTagName("*");
var maxElementReference, maxParentNodeCount = 0;
var i;

for (i = 0; i < allElements.length; i++) {

    var count = recursiveCountParentNodeOn(allElements[i]);

    if (maxParentNodeCount < count) {
        maxElementReference = allElements[i];
        maxParentNodeCount = count;
    }
}
ATorras