I am working on extracting text from HTML documents and storing it in a database. I am using the Web-Harvest tool to extract the content, but I am stuck at one point. Inside Web-Harvest I use an XQuery expression to extract the data. The HTML document that I am parsing looks like this:

 <td><a name="hw">HELLOWORLD</a>Hello world</td>

I need to extract the "Hello world" text from the above HTML snippet.

I have tried extracting the text in this fashion:

  $hw := data($item//a[@name='hw']/text())

However, I always get "HELLOWORLD" instead of "Hello world".

Is there a way to extract "Hello world"? Please help.

What if I want to do it this way:

     <td>
       <a name="hw1">HELLOWORLD1</a>Hello world1
       <a name="hw2">HELLOWORLD2</a>Hello world2
       <a name="hw3">HELLOWORLD3</a>Hello world3
     </td>

I would like to extract the text "Hello world2" that sits between hw2 and hw3. I would rather not use text()[3]; is there some way to extract the text between /a[@name='hw2'] and /a[@name='hw3']?

A: 

First of all, you are looking for the a nodes whose name attributes start with 'hw'. This can be achieved with the following path:

$item//a[starts-with(@name,'hw')]

Once you have found the a nodes, you want the first text node that follows each of them. This can be done like so:

$item//a[starts-with(@name,'hw')]/following-sibling::text()[1]
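
For the second case in the question, where the text after one specific anchor is wanted, the same axis should work with an exact match on the name attribute. The following is a minimal sketch rather than Web-Harvest configuration: it binds $item to the sample markup from the question purely for illustration, and the normalize-space call is an addition to trim the whitespace around the text.

```xquery
(: Sample markup from the question, bound to $item for illustration only. :)
let $item :=
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>
(: First text node after the hw2 anchor; it ends where the hw3 anchor begins. :)
let $hw2 := $item//a[@name = 'hw2']/following-sibling::text()[1]
(: Trim the leading and trailing whitespace left over from the markup layout. :)
return normalize-space($hw2)
```

Note that following-sibling::text()[1] picks the nearest following text node, so it would reach past the next anchor if hw2 had no text directly after it. If that can happen in your documents, following-sibling::node()[1][self::text()] restricts the match to a text node that immediately follows the anchor.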
Oliver Hallam
Thank you so much, problemo solved.
Technocrat