The divs below appear in that order in the HTML I am parsing.

//div[contains(@class,'top-container')]//font/text()

I'm using the XPath expression above to try to get the data in the first div below, where a hyphen is used to delimit the data:

Wednesday - Chess at Higgins Stadium
Thursday - Cook-off

The problem is I am getting data from the second div below such as:

Monday 10:00 - 11:00
Tuesday 10:00 - 11:00

How do I retrieve only the data from the first div? I also want to exclude any elements in the first div that do not contain this hyphenated data.

<div class="top-container"> 
<div dir="ltr"> 
<div dir="ltr"><font face="Arial" color="#000000" size="2">Wednesday - Chess at Higgins Stadium</font></div> 
<div dir="ltr"><font face="Arial" size="2">Thursday - Cook-off</font></div> 
<div dir="ltr"><font face="Arial" size="2"></font>&nbsp;</div> 
<div dir="ltr">&nbsp;</div> 
<div dir="ltr"><font face="Arial" color="#000000" size="2"></font>&nbsp;</div>
</div> 

<div dir="ltr"> 
<div RE><font face="Arial"> 
<div dir="ltr"> 
<div RE><font face="Arial" size="2"><strong>Alex Dawkin </strong></font></div> 
<div RE><font face="Arial" size="2">Monday 10:00 - 11:00 </font></div> 
<div RE><font size="2">Tuesday 10:00 - 11:00 </font></div> 
<div RE> 
<div RE><font face="Arial" size="2"></font></div><font face="Arial" size="2"></font></div> 
<div RE>&nbsp;</div> 
<div RE>&nbsp;</div> 
+2  A: 

Your XPath matches every font element that is a descendant of <div class="top-container">, including those in the second inner div.

div[1] addresses the first div child element of the "top-container" element. Adding that step to your XPath restricts the match to the first div and returns the desired results.

//div[contains(concat(' ',@class,' '),' top-container ')]/div[1]//font/text()

If you want to ensure that only text() nodes that contain "-" are addressed, then you should also add a predicate filter to the text().

//div[contains(concat(' ',@class,' '),' top-container ')]/div[1]//font/text()[contains(.,'-')]
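
As a rough way to see the effect of the div[1] step and the hyphen predicate, here is a minimal Python sketch using only the standard library. It rests on some assumptions: the markup is a trimmed, well-formed copy of the question's HTML, the "Schedule" line is a hypothetical addition so the hyphen filter has something to reject, and because xml.etree.ElementTree implements only a subset of XPath (no text() or contains()), those two parts are reproduced in plain Python rather than evaluated as XPath.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the markup from the question
# (&nbsp; is written as &#160; so the XML parser accepts it).
SAMPLE = """
<div class="top-container">
  <div dir="ltr">
    <div dir="ltr"><font size="2">Wednesday - Chess at Higgins Stadium</font></div>
    <div dir="ltr"><font size="2">Thursday - Cook-off</font></div>
    <div dir="ltr"><font size="2">Schedule</font></div><!-- hypothetical line, not in the question -->
    <div dir="ltr"><font size="2"></font>&#160;</div>
  </div>
  <div dir="ltr">
    <div><font size="2"><strong>Alex Dawkin </strong></font></div>
    <div><font size="2">Monday 10:00 - 11:00 </font></div>
    <div><font size="2">Tuesday 10:00 - 11:00 </font></div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)  # root is the top-container div itself


def font_texts(path):
    """Return the direct, non-empty text of every <font> matched by an ElementTree path."""
    return [font.text for font in root.findall(path) if font.text]


# Roughly //font/text(): fonts under top-container, in both inner divs.
all_fonts = font_texts(".//font")

# Roughly /div[1]//font/text(): only fonts inside the first child div.
first_div = font_texts("./div[1]//font")

# The [contains(., '-')] predicate, applied in Python instead of XPath.
hyphenated = [text for text in first_div if "-" in text]

print(all_fonts)
print(hyphenated)
```

The first list still contains the Monday and Tuesday entries from the second div; the second list keeps only the hyphenated lines from the first div. A full XPath 1.0 engine should be able to evaluate the expressions above directly, without the Python-side filtering.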

Instead of checking only for nodes that contain "-", how would you modify the last expression to just check for non-empty strings?

If you want to return any text() node with a value, then the predicate filter on text() is not necessary. An empty text node does not exist in the XPath data model, so there is nothing empty to select in the first place.

However, if you only want to select text() nodes that contain text other than whitespace, you could use this expression:

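//div[contains(concat(' ',@class,' '),' top-container ')]/div[1]//font/text()[normalize-space()]

normalize-space() strips leading and trailing whitespace and collapses internal runs of whitespace to a single space. If a text() node contains only whitespace, the result is an empty string, which evaluates to false() in the predicate filter, so only text() nodes containing something other than whitespace are selected. Note that XPath whitespace means space, tab, carriage return, and line feed only. A non-breaking space (&nbsp;, U+00A0) is not stripped, so a node containing only &nbsp; would still be selected unless you first map it to a regular space with translate(). In the markup above the &nbsp; characters sit outside the font elements, so they do not affect this expression.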
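
To make the whitespace rule concrete, the following Python sketch approximates how the text()[normalize-space()] predicate decides whether to keep a text node. It is an emulation under stated assumptions, not an XPath engine: normalize_space and keeps_text_node are hypothetical helper names, and the treat_nbsp_as_space flag stands in for wrapping the argument in translate() to turn U+00A0 into a regular space.

```python
import re

# The only characters XPath 1.0 treats as whitespace.
XPATH_WHITESPACE = " \t\r\n"


def normalize_space(value):
    """Approximate XPath 1.0 normalize-space(): collapse whitespace runs, then trim."""
    collapsed = re.sub(r"[ \t\r\n]+", " ", value)
    return collapsed.strip(XPATH_WHITESPACE)


def keeps_text_node(value, treat_nbsp_as_space=False):
    """Mimic the truth value of text()[normalize-space()] for one text node."""
    if treat_nbsp_as_space:
        # Stand-in for translate(): map U+00A0 to a regular space first.
        value = value.replace("\u00a0", " ")
    # A non-empty string is true in an XPath predicate; an empty one is false.
    return normalize_space(value) != ""


print(keeps_text_node("Thursday - Cook-off"))
print(keeps_text_node(" \n\t "))
print(keeps_text_node("\u00a0"))
print(keeps_text_node("\u00a0", treat_nbsp_as_space=True))
```

Python's own str.strip() and str.split() with no arguments treat U+00A0 as whitespace, which is why the sketch spells out the four XPath whitespace characters explicitly instead of relying on them.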

Mads Hansen
Thanks. This is great. Instead of checking only for nodes that contain "-", how would you modify the last expression to just check for non-empty strings?
August
Great answer. This is awesome! Thanks so much!
August
The boolean value of a string is true only if it is not an empty string. So `text()[normalize-space()]` is enough to select text nodes that are not whitespace-only. Also, if the `font` elements contain only a text node, then `font[contains(.,'-')]` is enough to select `font` elements that have a `-` character in their string value. Lastly, if you really want to test a `@class`, use `contains(concat(' ',@class,' '),' class-to-test ')`.
Alejandro
Good points @Alejandro. I've updated the answer to use the simplified predicate filter for non-whitespace `text()`, and a safer match for the `@class` value.
Mads Hansen
Also +1 for a good answer.
Alejandro