I am working on extracting text from HTML documents and storing it in a database. I am using the Web-Harvest tool to extract the content, but I am stuck at one point. Inside Web-Harvest I use an XQuery expression to extract the data. The HTML fragment that I am parsing is as follows:

              <td><a name="hw">HELLOWORLD</a>Hello world</td>

I need to extract the text "Hello world" from the HTML above.

I have tried extracting the text in this fashion:

     $hw :=data($item//a[@name='hw']/text())

However, what I always get is "HELLOWORLD" instead of "Hello world".

Is there a way to extract "Hello world"? Please help.

What if I want to do it this way:

<td>
 <a name="hw1">HELLOWORLD1</a>Hello world1
 <a name="hw2">HELLOWORLD2</a>Hello world2
 <a name="hw3">HELLOWORLD3</a>Hello world3
</td>

I would like to extract the text "Hello world2" that sits between hw2 and hw3. I would rather not use text()[3]. Is there some way to extract the text between /a[@name='hw2'] and /a[@name='hw3']?

+3  A: 

Your XPath selects the text inside the a elements, not the text of the enclosing td element:

$item//a[@name='hw']/text()

Change it to this:

$item[a/@name='hw']/text()
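
To see why the original expression returns the link text, it can help to look at how the fragment parses. The sketch below is not Web-Harvest or XQuery. It uses Python's standard xml.etree.ElementTree purely as an illustration, and the variable names are my own. ElementTree has no sibling axes, so it exposes the text that follows an element's end tag as that element's tail, which here stands in for the td-level text node:

```python
import xml.etree.ElementTree as ET

# The fragment from the question, parsed as XML.
td = ET.fromstring('<td><a name="hw">HELLOWORLD</a>Hello world</td>')

# ElementTree's limited XPath subset supports tag[@attr='value'].
a = td.find("a[@name='hw']")

link_text = a.text      # text inside <a>...</a>
trailing_text = a.tail  # text after </a>, which belongs to <td>, not <a>

print(link_text)
print(trailing_text)
```

The wanted text is a child of td that happens to follow the a element, which is why a/text() never reaches it.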

Update (following comments and update to question):

This XPath selects the second text node under an $item that has an a child whose name attribute is set to hw:

$item[a/@name='hw']//text()[2]
Oded
This isn't working for me. FYI, $item contains the entire HTML page as XML.
Technocrat
What if the HTML contains: <td> <a name="hw1">HELLOWORLD1</a>Hello world1 <a name="hw2">HELLOWORLD2</a>Hello world2 <a name="hw3">HELLOWORLD3</a>Hello world3</td> and I want to extract only Hello world2 without using text()[2]? Is there a way to specify "extract the text after /a[@name="hw1"] and before /a[@name="hw2"]"?
Technocrat
@Technocrat - answer updated, following your expanded explanation.
Oded
You seem not to look at your answers!
Dimitre Novatchev
A: 

This handles your expanded case, while letting you select by attribute value rather than position:

let $item := 
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>

return $item//node()[./preceding-sibling::a/@name = "hw2"][1]

This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".
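
The same idea of selecting by attribute value rather than position can be sketched outside XQuery. This is only an illustration in Python's standard xml.etree.ElementTree, not Web-Harvest code. The helper name text_after_anchor and the surrounding page markup are assumptions of mine. Because ElementTree lacks the preceding-sibling axis, the sketch reads the anchor's tail instead, and it searches the whole document since $item was said to hold the entire page:

```python
import xml.etree.ElementTree as ET

# Hypothetical page wrapping the <td> from the question.
PAGE = """<html><body><table><tr><td>
 <a name="hw1">HELLOWORLD1</a>Hello world1
 <a name="hw2">HELLOWORLD2</a>Hello world2
 <a name="hw3">HELLOWORLD3</a>Hello world3
</td></tr></table></body></html>"""


def text_after_anchor(root, name):
    """Return the stripped text directly following <a name=...>, or None."""
    # .// searches all descendants, so the anchor may be nested at any depth.
    anchor = root.find(f".//a[@name='{name}']")
    if anchor is None or anchor.tail is None:
        return None
    return anchor.tail.strip()


root = ET.fromstring(PAGE)
result = text_after_anchor(root, "hw2")
print(result)
```

Note that the tail covers only the text up to the next element, so this matches the "first following node" behaviour only when that node is a text node.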

Dave Cassel
A: 

I would not want to use text()[3] but is there some way I could extract the text out between /a[@name='hw2'] and /a[@name='hw3'].

If there is just one text node between the two <a> elements, then the following would be quite simple:

/a[@name='hw3']/preceding::text()[1]

If there is more than one text node between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (known as the Kaysian method of intersection) is:

$ns1[count(.|$ns2) = count($ns2)]

So, in the expression above, just replace $ns1 with:

/a[@name='hw2']/following-sibling::text()

and $ns2 with:

/a[@name='hw3']/preceding-sibling::text()

Lastly, if you really have XQuery (or XPath 2.0), then this is simply:

   /a[@name='hw2']/following-sibling::text()
     intersect
   /a[@name='hw3']/preceding-sibling::text()
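
For readers who want to see the intersection spelled out, here is a rough sketch in Python's standard xml.etree.ElementTree. It is not XPath and not Web-Harvest code. The function name sibling_text_between and the extra <b> element in the sample are my own assumptions, added so that more than one text node lies between the two anchors. ElementTree stores a sibling text node as the tail of the child element before it, so each set below holds child indexes rather than nodes:

```python
import xml.etree.ElementTree as ET

# Sample with two sibling text nodes between hw2 and hw3 (the <b> is assumed).
HTML = """<td><a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello <b>big</b> world2
<a name="hw3">HELLOWORLD3</a>Hello world3</td>"""


def sibling_text_between(parent, start_name, end_name):
    """Sibling text nodes after <a name=start_name> and before <a name=end_name>."""
    children = list(parent)
    names = [child.get("name") for child in children]
    start = names.index(start_name)  # raises ValueError if the anchor is absent
    end = names.index(end_name)
    # The tail of child i is the text node right after child i.
    after_start = {i for i in range(len(children)) if i >= start}
    before_end = {i for i in range(len(children)) if i < end}
    between = sorted(after_start & before_end)  # the intersection step
    return [children[i].tail for i in between if children[i].tail]


td = ET.fromstring(HTML)
parts = sibling_text_between(td, "hw2", "hw3")
print([part.strip() for part in parts])
```

As with following-sibling::text(), the text inside the nested <b> element is not included, because it is not a sibling of the anchors.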
Dimitre Novatchev