I have this HTML/XML:

\t\t\t\t\t    \r\n\t\t
<a href="/test.aspx">
  <span class=test>
    <b>blabla</b>
  </span>
</a>
<br/>
this is the text I want
<br/>
<span class="test">
  <b>code: 123</b>
</span>
<br/>
<span class="test"></span>
\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t

In C# 4 I use the HtmlAgilityPack library to select the node with XPath and read its InnerText property. That returns all the text inside the node. How can I get only the text "this is the text I want"?

/text() only returns \t\t\t\t\t \r\n\t\t

+3  A: 
/div/text()

From the example given, this XPath selects every text node that is a direct child of the div element, in this case test2.

If you could elaborate on the question, we might be better able to help. The div contains 3 children: a span element, a text node and a b element. The span and b each have a text node child. Using XPath you could select elements only (/div/*), text nodes only (/div/text()) or all node types (/div/node()).
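
To illustrate the three selections, here is a minimal sketch. It is not HtmlAgilityPack code: it uses Python's standard `xml.dom.minidom`, which has no XPath engine, so each selection is emulated by filtering the div's child nodes by type. The fragment itself is hypothetical, built only to match the description of three children (a span, a text node and a b).

```python
from xml.dom import Node, minidom

# Hypothetical fragment matching the description: a span, a text node, a b.
doc = minidom.parseString("<div><span>first</span>middle text<b>last</b></div>")
div = doc.documentElement

# Rough equivalent of /div/* : element children only.
elements = [n.tagName for n in div.childNodes if n.nodeType == Node.ELEMENT_NODE]

# Rough equivalent of /div/text() : text-node children only.
texts = [n.data for n in div.childNodes if n.nodeType == Node.TEXT_NODE]

# Rough equivalent of /div/node() : every child, whatever its type.
all_nodes = list(div.childNodes)

print(elements)        # ['span', 'b']
print(texts)           # ['middle text']
print(len(all_nodes))  # 3
```

Note that the text inside the span and the b does not show up in `texts`, because those text nodes are children of the span and the b, not of the div.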

EDIT: /text() will only return root-level text nodes. In this case I would expect it to return a node list containing 3 text nodes:

\t\t\t\t\t    \r\n\t\t 
this is the text I want
\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t

Are you perhaps selecting only the first node in the resulting node list? There are also a few well-formedness issues; for example, your <br> should probably be <br/>.
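
To make the first-node-versus-node-list point concrete, here is a small sketch. It is a stand-in, not HtmlAgilityPack: it uses Python's standard `xml.dom.minidom` and emulates `text()` by filtering child nodes. The wrapping `div`, the quoted `class` attributes and the simplified whitespace are assumptions made so the fragment parses as XML.

```python
from xml.dom import Node, minidom

# Simplified copy of the fragment; the <div> wrapper and the exact
# whitespace are assumptions, not the original markup.
XML = (
    '<div>\n\t\t'
    '<a href="/test.aspx"><span class="test"><b>blabla</b></span></a>'
    '<br/>this is the text I want<br/>'
    '<span class="test"><b>code: 123</b></span>'
    '<br/><span class="test"></span>'
    '\n\t\t\t</div>'
)

div = minidom.parseString(XML).documentElement

# All text-node children of the div, in document order.
text_nodes = [n.data for n in div.childNodes if n.nodeType == Node.TEXT_NODE]

# Taking only the first node yields nothing but leading whitespace.
first_only = text_nodes[0]

# Walking the whole list and dropping whitespace-only nodes finds the wanted text.
wanted = [t for t in text_nodes if t.strip()]

print(repr(first_only))  # '\n\t\t'
print(len(text_nodes))   # 3
print(wanted)            # ['this is the text I want']
```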

ChrisCM
Hi, please see my edit. Do you have any idea why it does not return all the text?
peter
Hi, I was using SelectSingleNode, which is why it was returning only \t\t\t\t\t. I should have used SelectNodes... doh. Thanks
peter
No probs, glad you got to the bottom of it :)
ChrisCM
How does this answer relate to the question?
Alejandro
Oh! Sorry. @peter: don't change the question. Good practice is to ask a new question, otherwise other people will not benefit from the answer.
Alejandro
Incorrect. As mentioned in another comment, the original OP was vague and warranted my original answer. The OP was updated with a more complete fragment. My answer was updated to take this into account and my suggestion (which I have now made bold) that he was not getting the entire nodelist (containing the text he wanted) turned out to be the solution. Thus my answer was accepted.
ChrisCM
@ChrisCM: Sorry, I'll correct the name in my answer :(
Dimitre Novatchev
A: 

How can I get only the text "this is the text I want"?

text()[preceding-sibling::node()[1][self::br]]
      [following-sibling::node()[1][self::br]]

Meaning: the text node between two br elements.
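
A sketch of the same condition checked by hand, for readers who want to see what the two predicates test. This is an illustration in Python's standard `xml.dom.minidom` (which cannot evaluate XPath), not HtmlAgilityPack; the `div` wrapper, the quoted attributes, the whitespace and the helper name `is_br` are all assumptions.

```python
from xml.dom import Node, minidom

# Simplified, well-formed copy of the fragment (wrapper and whitespace assumed).
XML = (
    '<div>\n'
    '<a href="/test.aspx"><span class="test"><b>blabla</b></span></a>'
    '<br/>\nthis is the text I want\n<br/>'
    '<span class="test"><b>code: 123</b></span>'
    '<br/><span class="test"></span>\n'
    '</div>'
)


def is_br(node):
    """Hypothetical helper: True when node is a <br> element (self::br)."""
    return (
        node is not None
        and node.nodeType == Node.ELEMENT_NODE
        and node.tagName == "br"
    )


div = minidom.parseString(XML).documentElement

# Keep text nodes whose nearest preceding and following siblings are both <br>.
between_brs = [
    n.data
    for n in div.childNodes
    if n.nodeType == Node.TEXT_NODE
    and is_br(n.previousSibling)
    and is_br(n.nextSibling)
]

print([t.strip() for t in between_brs])  # ['this is the text I want']
```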

Alejandro
+1  A: 

@peter: You should not edit your question in a way that hides how the accepted answer relates to it!!!

The answer to your new question:

/br[1]/following-sibling::text()[1]

selects the wanted text node (the quotes are mine):

"   
this is the text I want   
"
Dimitre Novatchev
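
For comparison, a hand-rolled sketch of what `br[1]/following-sibling::text()[1]` picks out: find the first br, then walk forward to the nearest text node. It uses Python's standard `xml.dom.minidom` rather than HtmlAgilityPack, and the `div` wrapper, quoted attributes and whitespace in the fragment are assumptions.

```python
from xml.dom import Node, minidom

# Simplified, well-formed copy of the fragment (wrapper and whitespace assumed).
XML = (
    '<div>\n'
    '<a href="/test.aspx"><span class="test"><b>blabla</b></span></a>'
    '<br/>\nthis is the text I want\n<br/>'
    '<span class="test"><b>code: 123</b></span>'
    '<br/><span class="test"></span>\n'
    '</div>'
)

div = minidom.parseString(XML).documentElement

# br[1]: the first <br> element child.
first_br = next(
    n for n in div.childNodes
    if n.nodeType == Node.ELEMENT_NODE and n.tagName == "br"
)

# following-sibling::text()[1]: the nearest text node after that <br>.
node = first_br.nextSibling
while node is not None and node.nodeType != Node.TEXT_NODE:
    node = node.nextSibling

selected = node.data if node is not None else None
print(repr(selected))  # '\nthis is the text I want\n'
```

The selected node keeps its surrounding whitespace, so trimming it (here with `selected.strip()`) is still needed to get the bare text.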
+1 This is more schema related.
Alejandro
What question? I'm not the OP. I suggested an answer to the original (vague) question. The OP updated his question with a more complete fragment of HTML, and I updated my answer (see the EDIT: section) to cover the new example. In the end, it wasn't even the XPath that was incorrect; he was picking a single node (the first from the list) instead of the entire node list in C#.
ChrisCM