Hi,

I'm trying to put together an XPath expression that will give me all the descendant elements of a node that match a filter (e.g. [contains(@class,"interesting")]) but which don't have a specific ancestor (e.g. [contains(@class,"frame")]). Probably best explained by example:

    <div class="frame">
        <p class="interesting">alice</p>
        <p class="interesting">bob</p>
        <p class="interesting">carol>/p>

        <div> 
            <div>
                <h3 class="interesting">david</h3>
            </div>
        </div>

        <div class="frame">
            <p class="interesting">drevil</p>
        </div>
    </div>

So in this example, I want to match all the "interesting" elements that are descendants of the first div with class="frame", but not the "interesting" elements underneath the nested "frame" div.

Ideally I'd have a single XPath expression that would give me those elements with content alice, bob, carol and david. But not drevil.

It is like the presence of the nested frame occludes that branch of the tree from the search.

Any ideas? All responses much appreciated.


In response to Robert, I have this Python code (though I will ultimately do it browser side):

from lxml import etree

from StringIO import StringIO

testxml = """
<div>
    <div class="frame">
        <p class="interesting">alice</p>
        <p class="interesting">bob</p>
        <p class="interesting">carol</p>

        <div> 
            <div>
                <h3 class="interesting">david</h3>
            </div>
        </div>

        <div class="frame">
            <p class="interesting">drevil</p>
        </div>
    </div>    
</div>
"""

xsl = """
<xsl:stylesheet version="1.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:template match="/">
        <output>
           <xsl:apply-templates select="//div[@class='frame'][1]/*"/>
        </output>
    </xsl:template>

    <xsl:template match="*">
       <xsl:apply-templates select="*"/>
    </xsl:template>

    <xsl:template match="*[@class='frame']"/>

    <xsl:template match="*[@class='interesting']">
       <xsl:copy-of select="."/>
    </xsl:template>

</xsl:stylesheet>
"""


def test_xsl():
    xslt_doc = etree.parse(StringIO(xsl))
    transform = etree.XSLT(xslt_doc)
    doc = etree.parse(StringIO(testxml))
    result = transform(doc)
    print result

if __name__=="__main__":
    test_xsl()

This gives the following result:

<?xml version="1.0"?>
<output>
    <p class="interesting">alice</p>
    <p class="interesting">bob</p>
    <p class="interesting">carol</p>
    <h3 class="interesting">david</h3>
    <p class="interesting">drevil</p>
</output>

As you can see drevil is lurking.

Note, Tomalak is correct in that the 2nd match on * has no effect (other than to remove spaces from the output which is a bit odd!).

It just twigged, though, that I might not be able to go with the XSLT approach. The whole point of doing an XPath query in the first place was to gain references to nodes within the original HTML document. If I do a transform, the nodes in the result document will be copies, not the original ones I'm looking for, and thus no use!

This might be the dumbest question ever, but is there a way to maintain references from nodes in the transformed document to nodes in the original?

Thanks Tomalak, Robert and mykhal for your help so far. I think I just need to buy a book on XSLT...

+1  A: 

You can use a selector that limits the number of ancestor div[@class="frame"] elements to 1:

//div[@class="frame"][1]//*[@class="interesting" and count(ancestor::div[@class="frame"])=1]

it worked:

>>> import lxml.html
>>> data = """
        <div class="frame">
            <p class="interesting">alice</p>
            <p class="interesting">bob</p>
            <p class="interesting">carol</p>

            <div> 
                <div>
                    <h3 class="interesting">david</h3>
                </div>
            </div>

            <div class="frame">
                <p class="interesting">drevil</p>
            </div>
        </div>
    """
>>> tree = lxml.html.fromstring(data)
>>> tree.xpath('//div[@class="frame"][1]//*[@class="interesting" and count(ancestor::div[@class="frame"])=1]/text()')
['alice', 'bob', 'carol', 'david']
mykhal
.. in human language: for the first frame div, select all its descendants with interesting class, but only those having exactly one frame div ancestor
mykhal
by the way, please notice that your HTML code example is invalid, you should flip the angle bracket after carol :)
mykhal
Got this to work. I knew I wasn't being creative enough in my use of filters, but I couldn't find any sufficiently complicated examples on the web.
andre_b
Thanks for your help BTW!
andre_b
this was really not a simple one :)
mykhal
A: 

mykhal's answer is probably the best you can do in XPath, at least as you've defined the problem.

The trouble with it is that it could be punishingly inefficient when used on large documents with many potentially interesting elements. For every potentially interesting element it finds, it has to examine every node in its ancestor axis.
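
To make that cost concrete, here is a minimal Python sketch that spells out what the `count(ancestor::...)=1` filter has to do for each candidate. It uses the standard library's `xml.etree.ElementTree` rather than lxml, matches `class` by exact equality, and the names `SAMPLE`, `parent_of` and `frame_ancestor_count` are made up for illustration; a real XPath engine may well be smarter about this.

```python
import xml.etree.ElementTree as ET

# The sample document from the question (assumed to be well-formed XML).
SAMPLE = """
<div>
    <div class="frame">
        <p class="interesting">alice</p>
        <p class="interesting">bob</p>
        <p class="interesting">carol</p>
        <div>
            <div>
                <h3 class="interesting">david</h3>
            </div>
        </div>
        <div class="frame">
            <p class="interesting">drevil</p>
        </div>
    </div>
</div>
"""


def frame_ancestor_count(el, parent_of):
    """Walk the whole ancestor chain of el and count the frame divs on it."""
    count = 0
    node = parent_of.get(el)
    while node is not None:
        if node.tag == "div" and node.get("class") == "frame":
            count += 1
        node = parent_of.get(node)
    return count


root = ET.fromstring(SAMPLE)
# ElementTree has no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Every interesting element triggers a full walk up to the root.
matches = [
    el
    for el in root.iter()
    if el.get("class") == "interesting"
    and frame_ancestor_count(el, parent_of) == 1
]
print([el.text for el in matches])  # ['alice', 'bob', 'carol', 'david']
```

The work grows with the number of interesting elements times their depth, which is the inefficiency described above.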

In XSLT, you can implement a series of templates that find only the elements you're looking for, and that not only visit each element at most once but also don't visit any elements that they don't have to:

<xsl:template match="/">
    <output>
       <xsl:apply-templates select="/descendant::*[@class='frame'][1]/*"/>
    </output>
</xsl:template>

<xsl:template match="*[@class='frame']"/>

<xsl:template match="*[@class='interesting']">
   <xsl:copy-of select="."/>
</xsl:template>

The built-in template behavior for elements, which is used whenever templates are applied to an element and no higher-ranking template is found, is to apply templates to its children.

The first template finds the ancestor element you're interested in, and applies templates to its child elements.

The second template says, basically, "If you're recursing down the elements and hit an element with a class attribute of 'frame', don't examine its descendants." This keeps the transform from ever even examining an uninteresting element.

And finally, the last template defines what to do when you hit an interesting element - in this case, it copies it to the output in its entirety.
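
The same pruned walk can be sketched outside XSLT. The following is a rough Python equivalent of the three templates, assuming the standard library's `xml.etree.ElementTree` and exact matches on `class`; `SAMPLE`, `first_frame` and `collect_interesting` are hypothetical names, not part of any library. Unlike a transform, it returns the original element objects rather than copies.

```python
import xml.etree.ElementTree as ET

# The sample document from the question (assumed to be well-formed XML).
SAMPLE = """
<div>
    <div class="frame">
        <p class="interesting">alice</p>
        <p class="interesting">bob</p>
        <p class="interesting">carol</p>
        <div>
            <div>
                <h3 class="interesting">david</h3>
            </div>
        </div>
        <div class="frame">
            <p class="interesting">drevil</p>
        </div>
    </div>
</div>
"""


def is_frame(el):
    return el.get("class") == "frame"


def first_frame(root):
    """First frame element in document order (the job of the first template)."""
    for el in root.iter():
        if is_frame(el):
            return el
    return None


def collect_interesting(node, found):
    """Recurse through children, never entering a nested frame."""
    for child in node:
        if is_frame(child):
            continue  # empty template: the whole subtree is skipped
        if child.get("class") == "interesting":
            found.append(child)  # like copy-of: take it, don't descend
        else:
            collect_interesting(child, found)  # built-in rule: recurse
    return found


root = ET.fromstring(SAMPLE)
outer = first_frame(root)
hits = collect_interesting(outer, [])
print([el.text for el in hits])  # ['alice', 'bob', 'carol', 'david']
```

Each element under the outer frame is looked at no more than once, and nothing inside the nested frame is touched at all.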

Robert Rossney
Your `<xsl:template match="*">` effectively is the built-in default template for elements. For all I know, you could remove it entirely without changing the output of your XSLT.
Tomalak
I can't actually get this to work, that pesky drevil keeps spoiling the party! But I _do_ get the principle. I was aware of the performance problems with finding the ancestry; indeed, the reason I wanted an XPath expression was so that I could avoid doing it in JavaScript, which would have been even slower. I will need to brush up on my XSLT, but will post my code at the top of the page anyway.
andre_b
Fixed both issues alluded to in the comments. `//div[@class='frame'][1]` really means `/descendant-or-self::node()/child::div[@class='frame'][1]` (see the note at the end of section 2.5 of the XPath recommendation). So it was selecting every `div` with a `class` attribute of "frame" that was the first such `div` among its parent's children, i.e. all of them.
Robert Rossney
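
The difference between the two readings of `[1]` can be emulated in plain Python. This is only a sketch of the semantics, not an XPath engine: it uses the standard library's `xml.etree.ElementTree`, a cut-down document, and `id` attributes that are not in the original example and were added purely to label the results. The function names are hypothetical too.

```python
import xml.etree.ElementTree as ET

# Cut-down sample; the id attributes are added here only for labelling.
doc = ET.fromstring("""
<div>
    <div class="frame" id="outer">
        <p class="interesting">alice</p>
        <div class="frame" id="nested">
            <p class="interesting">drevil</p>
        </div>
    </div>
</div>
""")


def is_frame(el):
    return el.tag == "div" and el.get("class") == "frame"


def first_per_parent(root):
    """Emulates //div[@class='frame'][1]: [1] is applied once per parent."""
    # The document node's only element child is the root element.
    hits = [root] if is_frame(root) else []
    for parent in root.iter():  # every element, in document order
        frames = [child for child in parent if is_frame(child)]
        if frames:
            hits.append(frames[0])
    return hits


def first_in_document(root):
    """Emulates /descendant::div[@class='frame'][1]: [1] is applied once overall."""
    for el in root.iter():
        if is_frame(el):
            return [el]
    return []


print([el.get("id") for el in first_per_parent(doc)])   # ['outer', 'nested']
print([el.get("id") for el in first_in_document(doc)])  # ['outer']
```

With the per-parent reading the nested frame is selected as well, which is how drevil ends up in the output.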