Hi,

I am using XPath to query HTML sites, which has worked pretty well so far, but now I've hit a (brick) wall and can't find a solution :-)

The html looks like this:

<ul>
<li><a href="">Text1<span>AnotherText1</span></a></li>
<li><a href="">Text2<span>AnotherText2</span></a></li>
<li><a href="">Text3<span>AnotherText3</span></a></li>
</ul>

I want to select the "TextX" part, but NOT the "AnotherTextX" part in the <span></span>. So far I couldn't come up with any (pure) XPath solution to do that (and in my setup I unfortunately need a pure XPath solution).

This selects kind of what I want, but it results in "TextXAnotherTextX" and I only need "TextX".

/ul/li/a

Any hints? :-)

+2  A: 

This gets you the first direct text node child of <a>:

/ul/li/a/text()[1]

and this would get you any direct text node child (separately):

/ul/li/a/text()

Both of the above return "TextX", but if you had:

<li><a href="">Text4<span>AnotherText3</span>TrailingText</a></li>

then the latter would return: ["Text4", "TrailingText"], while the former would return "Text4" only.
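To see the difference between the two expressions without an XPath engine at hand, here is a rough sketch in Python. Note the assumptions: the standard library's ElementTree does not support text() in its limited XPath subset, so the helper direct_text_nodes (a made-up name) only emulates what text() would select, and the sample markup is trimmed to two list items.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, plus the "TrailingText" case.
HTML = """<ul>
<li><a href="">Text1<span>AnotherText1</span></a></li>
<li><a href="">Text4<span>AnotherText3</span>TrailingText</a></li>
</ul>"""


def direct_text_nodes(elem):
    """Emulate XPath text() on elem: its direct text node children, in order.

    Hypothetical helper, not part of ElementTree. ElementTree stores the
    leading text in elem.text and any text following a child in child.tail.
    """
    nodes = []
    if elem.text:
        nodes.append(elem.text)
    for child in elem:
        if child.tail:
            nodes.append(child.tail)
    return nodes


root = ET.fromstring(HTML)          # root is the <ul> element
anchors = root.findall("./li/a")    # roughly /ul/li/a

# Emulates /ul/li/a/text(): every direct text node, per <a>.
all_text = [direct_text_nodes(a) for a in anchors]

# Emulates /ul/li/a/text()[1]: only the first direct text node, per <a>.
first_text = [nodes[0] for nodes in all_text if nodes]

print(all_text)    # [['Text1'], ['Text4', 'TrailingText']]
print(first_text)  # ['Text1', 'Text4']
```

With a full XPath 1.0 engine you would pass the expressions themselves instead of emulating them.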

Your expression /ul/li/a selects the <a> elements themselves. The string value of an element is defined as the concatenation of the string values of all its descendant text nodes, in document order, so you get "TextXAnotherTextX".
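The string-value behaviour can be reproduced in a few lines. This is only a sketch using Python's standard ElementTree, where itertext() walks all text inside an element in document order; it stands in for what an XPath engine does when it converts <a> to a string.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<a href="">Text1<span>AnotherText1</span></a>')

# String value of <a>: all text inside it, nested elements included.
string_value = "".join(a.itertext())

# Only the text that sits directly in <a>, before the <span>.
direct_text = a.text

print(string_value)  # Text1AnotherText1
print(direct_text)   # Text1
```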

Tomalak
Thanks for your very helpful response! I've been searching the web for days! :-) Stack Overflow is really VERY good!