I have a small problem with XPath's contains() when using dom4j.

Let's say my XML is

<Home>
    <Addr>
        <Street>ABC</Street>
        <Number>5</Number>
        <Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment>
    </Addr>
</Home>

Let's say I want to find all the nodes that have ABC in their text, given the root Element.

So the XPath that I would need to write would be

//*[contains(text(),'ABC')]

However, this is not what dom4j returns. Is this a dom4j problem, or a gap in my understanding of how XPath works? That query returns only the Street element and not the Comment element.

The DOM makes the Comment element a composite element with four child nodes, two of them text nodes:

[Text = 'BLAH BLAH BLAH'][BR][BR][Text = 'ABC'] 

I would assume that the query should still return the element, since it should find the element and run contains on it, but it doesn't.

The following query returns the element, but it returns far more than just the element: it returns the parent elements as well, which is undesirable for this problem.

//*[contains(.,'ABC')]

Does anyone know the XPath query that would return just the <Street/> and <Comment/> elements?

+2  A: 

The <Comment> tag contains two text nodes and two <br> nodes as children.
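
To see that structure for yourself, here is a minimal sketch that lists the children of <Comment>. It assumes the JDK's built-in DOM parser rather than dom4j, and the class and method names (CommentChildren, describeChildren) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Assumption: uses the JDK's DOM parser (javax.xml.parsers), not dom4j.
class CommentChildren {
    // The question's document, written without indentation.
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Describes each child of <Comment> as "text:<value>" or "element:<name>".
    static List<String> describeChildren() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            Node comment = doc.getElementsByTagName("Comment").item(0);
            NodeList children = comment.getChildNodes();
            List<String> out = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.TEXT_NODE) {
                    out.add("text:" + child.getNodeValue());
                } else {
                    out.add("element:" + child.getNodeName());
                }
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Two text nodes with two <br> elements between them.
        System.out.println(describeChildren());
    }
}
```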

Your XPath expression was

//*[contains(text(),'ABC')]

To break this down,

  1. * is a selector that matches any element (i.e. tag) -- it returns a node-set.
  2. The [] is a predicate that is evaluated for each individual node in that node-set. A node is kept only if the condition inside the brackets is true for it.
  3. text() is a selector that matches all of the text nodes that are children of the context node -- it returns a node-set.
  4. contains is a function that operates on strings. If it is passed a node-set, XPath 1.0 converts the node-set into a string by taking the string-value of the node that is first in document order. Hence, it only ever looks at the first text node in your <Comment> element -- namely BLAH BLAH BLAH. Since that doesn't contain ABC, you don't get a <Comment> in your results.
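
The first-node rule in step 4 can be checked with a small sketch. It assumes the JDK's XPath 1.0 engine (javax.xml.xpath) instead of dom4j, so dom4j's own behaviour is not verified here, and the helper names (FirstTextNodeOnly, matchNames, asString) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Assumption: uses the JDK's XPath 1.0 engine (javax.xml.xpath), not dom4j.
class FirstTextNodeOnly {
    // The question's document, written without indentation.
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Returns the names of the elements selected by expr, in document order.
    static List<String> matchNames(String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Evaluates expr and converts the result to a string,
    // the same conversion contains() applies to a node-set argument.
    static String asString(String expr) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expr, new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Only the first text node of <Comment> survives the conversion.
        System.out.println("[" + asString("//Comment/text()") + "]"); // prints [BLAH BLAH BLAH ]
        System.out.println(matchNames("//*[contains(text(),'ABC')]")); // prints [Street]
    }
}
```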

You need to change this to

//*[text()[contains(.,'ABC')]]

To break this down,

  1. * is a selector that matches any element (i.e. tag) -- it returns a node-set.
  2. The outer [] is a predicate that is evaluated for each individual node in that node-set -- here it is evaluated for each element in the document.
  3. text() is a selector that matches all of the text nodes that are children of the context node -- it returns a node-set.
  4. The inner [] is a predicate that is evaluated for each node in that node-set -- here each individual text node. Each individual text node is the starting point for any path in the brackets, and can also be referred to explicitly as . within the brackets. The outer predicate is satisfied if at least one text node passes the inner condition.
  5. contains is a function that operates on strings. Here it is passed an individual text node (.). Since it is passed the second text node in the <Comment> element individually, it will see the 'ABC' string and be able to match it.
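
A minimal sketch to try the corrected expression, again assuming the JDK's javax.xml.xpath engine rather than dom4j, with a hypothetical helper named matchNames. For comparison it also runs contains(.,'ABC'), which tests each element's whole string-value and therefore also matches the ancestors.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Assumption: uses the JDK's XPath 1.0 engine (javax.xml.xpath), not dom4j.
class PerTextNodeMatch {
    // The question's document, written without indentation.
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Returns the names of the elements selected by expr, in document order.
    static List<String> matchNames(String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Each text node is tested on its own, so the second one in <Comment> matches.
        System.out.println(matchNames("//*[text()[contains(.,'ABC')]]")); // prints [Street, Comment]
        // The whole string-value of each element is tested, so ancestors match too.
        System.out.println(matchNames("//*[contains(.,'ABC')]")); // prints [Home, Addr, Street, Comment]
    }
}
```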
Ken Bloom
Awesome, I'm a bit of an XPath noob, so let me get this straight: text() is a function that takes the expression contains(.,'ABC')? Is there a chance you can explain, so I don't do this kind of stupid stuff again? ;)
I've edited my answer to provide a long explanation. I don't really know that much about XPath myself -- I just experimented a bit until I stumbled on that combination. Once I had a working combination, I made a guess what was going on and looked in the [XPath standard](http://www.w3.org/TR/xpath/) to confirm what I thought was going on and write the explanation.
Ken Bloom
@Ken Bloom: +1 Good answer and explanation.
Alejandro