I'm trying to retrieve a select number of elements that don't contain the value &nbsp; (a non-breaking space) using the HtmlAgilityPack in C#. Here's my XPath expression:

"(td)[(position() >= 10 and position() <= last()) and not(.='&nbsp;')]"

but it is still giving me these nodes. I've tried using a literal space, &#160;, and ALT + 1060; nothing seems to work. Here is what I'm parsing:

 <tr height=20 style='mso-height-source:userset;height:15.0pt'>
  <td height=20 class=xl96 style='height:15.0pt'>&nbsp;</td>
  <td class=xl97>&nbsp;</td>
  <td class=xl106 style='border-top:none'>JIM COCKS</td>
  <td class=xl107 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl107 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl107 style='border-top:none;border-left:none'>HOL</td>
  <td class=xl76>&nbsp;</td>
  <td class=xl103 style='border-left:none'>&nbsp;</td>
  <td class=xl97>&nbsp;</td>
  <td class=xl104 style='border-top:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>09:30</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td> 
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>17:00</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl104 style='border-top:none;border-left:none'>&nbsp;</td>
  <td class=xl76>&nbsp;</td>
 </tr>

The items with the class 'xl104' are what I want to grab (I've selected them by position because their classes change), but I only want nodes that contain something other than &nbsp;, e.g. the 09:30 and 17:00 you see above.

+1  A: 
"(td)[(position() >= 10 and position() <= last()) and not(.='&nbsp;')]" 

not(.='&nbsp;')

tests whether the whole string value of the td is anything other than exactly the string '&nbsp;'.

You want to use the XPath contains() function:

not(contains(., '&#xA0;'))
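
The difference between the two tests can be modelled outside XPath. What follows is a minimal Python sketch, not HtmlAgilityPack code: it assumes the parser has already turned the entity into the single character U+00A0, and the "padded" cell with a stray newline is a hypothetical input, not something taken from the markup in the question.

```python
NBSP = "\u00a0"  # the character that &nbsp;, &#160; and &#xA0; all denote


def is_exactly_nbsp(value: str) -> bool:
    """Model of the XPath test  . = NBSP  (the whole string value must match)."""
    return value == NBSP


def contains_nbsp(value: str) -> bool:
    """Model of the XPath test  contains(., NBSP)."""
    return NBSP in value


clean = NBSP          # a cell holding only the no-break space
padded = NBSP + "\n"  # hypothetical: the same cell plus a stray newline

print(is_exactly_nbsp(clean), contains_nbsp(clean))
print(is_exactly_nbsp(padded), contains_nbsp(padded))
```

For the padded cell the equality test fails while the contains test still succeeds, so not(.=...) would keep that cell and not(contains(...)) would drop it. Note that contains() also drops a cell that mixes real text with a no-break space.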
Dimitre Novatchev
Yeah, I was tempted to say that too. However his `td` elements appear to have &nbsp; as their whole text node value... nothing else. So it would be puzzling if this were indeed the problem.
LarsH
That did the trick! I knew there was a contains() fn but it never crossed my mind to use it because, as LarsH said, all of the td elements _just_ have &nbsp; in them. Thanks anyway! :-)
eth0
@eth0, in light of this, I suspect your input XML is not what you think it is (i.e. not what you showed above). Extra whitespace may have crept in. What happens when you select `string-length(td[10])`?
LarsH
A: 

I'm trying to retrieve a select amount of elements that doesn't contain the value &nbsp;

I believe @Dimitre has answered for that specification of the task.

I only want nodes that contain something other than &nbsp;

A slightly different specification. Does this work? (Edited; thanks to Alejandro.)

"td[position() >= 10 and translate(., '&#xA0;', '') != '']" 

This is equivalent and shorter, but less readable:

"td[position() >= 10 and translate(., '&#xA0;', '')]" 

Anyway, you found the problem so we won't go further with this.
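
For anyone who wants to check the logic of the two translate() predicates above, here is a rough Python model of them. It is only a sketch: the row is a hypothetical, shortened version of the one in the question, the cells are assumed to hold the decoded character U+00A0, and str.replace stands in for translate(., NBSP, '').

```python
NBSP = "\u00a0"


def keep_cell(position: int, value: str) -> bool:
    """Model of  position() >= 10 and translate(., NBSP, '') != ''  ."""
    return position >= 10 and value.replace(NBSP, "") != ""


def keep_cell_short(position: int, value: str) -> bool:
    """Shorter form: a non-empty string counts as true, as in XPath 1.0."""
    return position >= 10 and bool(value.replace(NBSP, ""))


# Hypothetical shortened row: 9 leading cells, then the cells of interest.
row = [NBSP, NBSP, "JIM COCKS"] + [NBSP] * 6 + [NBSP, "09:30", NBSP, "17:00", NBSP]

# XPath positions are 1-based, hence start=1.
kept = [v for pos, v in enumerate(row, start=1) if keep_cell(pos, v)]
kept_short = [v for pos, v in enumerate(row, start=1) if keep_cell_short(pos, v)]
print(kept)
```

Both forms keep only the two time cells. "JIM COCKS" is excluded by the position test, not by the translate() test.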

Do note, though, that using &nbsp; literally in XPath won't normally work unless you define it. This character entity is predefined in HTML but not in XML. That's why &#160; or &#xA0; is more reliable. However, it's possible that the HtmlAgilityPack defines &nbsp; for you.
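
To illustrate that the three spellings name the same character, here is a small Python sketch. It uses html.unescape from the standard library as a stand-in for an HTML parser's entity decoding; it says nothing about how the HtmlAgilityPack itself treats entities, which is still an open question in this thread.

```python
import html

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

# Decode the named entity and both numeric character references.
references = ("&nbsp;", "&#160;", "&#xA0;")
decoded = {ref: html.unescape(ref) for ref in references}

for ref, char in decoded.items():
    print(ref, "->", f"U+{ord(char):04X}")

# The undecoded text is six ordinary characters, not one no-break space,
# and the no-break space is a different character from an ordinary space.
print(len("&nbsp;"), NBSP == " ")
```

If the text being compared still holds the six undecoded characters, or the literal in the expression does, a comparison against the single character cannot match.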

LarsH
@LarsH: `fn:position()` result should always be less than or equal to `fn:last()`. Also, the boolean value of a string should be false if it's empty, and true otherwise. So, `td[position() >= 10 and translate(.,'&#xA0;','')]`
Alejandro
@Alej: Thanks... to be honest I only looked at the part of the predicate related to nbsp. I'll edit my answer.
LarsH