The HTML you see below is text I have scraped from a remote site, as-is, into a local variable.

Now I need to parse the authorName and bookTitle from the HTML tags into their own variables, given the following consistent format of the scraped text:

<p>
  William Faulkner - 'Light In August'
  <br/>
  William Faulkner - 'Sanctuary'
  <br/>
  William Faulkner - 'The Sound and the Fury'
</p>

Is it possible to do this in XPath?

+2  A: 

Yes. And easy, too:

//p/text()

Will give you three separate text nodes:

"
  William Faulkner - 'Light In August'
  ",
"
  William Faulkner - 'Sanctuary'
  ",
"
  William Faulkner - 'The Sound and the Fury'
"

Remember that leading and trailing whitespace (including any line breaks) is always part of the text node, so trim each result.

I take it that you do not need help with splitting the resulting strings into author and title.
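For anyone who wants to see the node structure outside of an XPath engine, here is a minimal sketch in Python. It assumes the scraped text is well-formed enough to parse as XML, and it uses the standard library's `xml.dom.minidom` rather than a real XPath evaluator, so the list comprehension only imitates what `//p/text()` selects. The variable names are made up for the illustration.

```python
from xml.dom import minidom

# The scraped snippet from the question, held in a local variable.
html = """<p>
  William Faulkner - 'Light In August'
  <br/>
  William Faulkner - 'Sanctuary'
  <br/>
  William Faulkner - 'The Sound and the Fury'
</p>"""

doc = minidom.parseString(html)
p = doc.documentElement  # the <p> element

# Keep only the text-node children of <p>; the <br/> elements are skipped.
text_nodes = [n.data for n in p.childNodes if n.nodeType == n.TEXT_NODE]

# Each raw node still carries its line breaks and indentation, so trim it.
lines = [t.strip() for t in text_nodes]

for line in lines:
    print(line)
```

The untrimmed `text_nodes` list shows the whitespace issue directly: every entry starts and ends with a line break plus indentation.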

Tomalak
Interesting, can you provide some documentation that shows this is how text() works?
Abe Miessler
@Abe: Before I start, can you indicate why you think that this is how `text()` does *not* work?
Tomalak
Relax tough guy... I didn't say I don't think it works, I just didn't know it could work like this and would like to be able to read up on it. Not challenging you, just seeking knowledge...
Abe Miessler
@Abe: I'm perfectly relaxed. `text()` selects all text node children of a node. And the `<p>` has five children here, three of type text interspersed with two of type element (`<br/>`).
Tomalak
Awesome thanks!
Abe Miessler
+1  A: 

You can get the $N-th author name with this XPath expression:

substring-before(normalize-space(p/text()[$N]), ' -')

You can get the $N-th title with this XPath expression:

substring-after(normalize-space(p/text()[$N]), ' - ')

You can first get the count of all text nodes with:

count(p/text())

then you can evaluate the first two XPath expressions, substituting $N with each number in the interval

[1,count(p/text())]
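If the splitting ends up happening in the host language instead of in XPath, the same steps can be sketched in Python. This is only an approximation under stated assumptions: `normalize_space` and `split_entry` are hypothetical helper names, `str.split()` treats a few more characters as whitespace than XPath's `normalize-space()` does, and `raw_nodes` stands in for the untrimmed text nodes that `p/text()` would return.

```python
from typing import List, Tuple


def normalize_space(s: str) -> str:
    # Roughly what XPath normalize-space() does: trim both ends and
    # collapse every internal run of whitespace to a single space.
    return " ".join(s.split())


def split_entry(raw: str) -> Tuple[str, str]:
    # Split on the first " - ", like substring-before / substring-after.
    author, _, title = normalize_space(raw).partition(" - ")
    return author, title


# Stand-ins for the three raw text nodes, whitespace included.
raw_nodes: List[str] = [
    "\n  William Faulkner - 'Light In August'\n  ",
    "\n  William Faulkner - 'Sanctuary'\n  ",
    "\n  William Faulkner - 'The Sound and the Fury'\n",
]

# Equivalent of looping $N over [1, count(p/text())].
books = [split_entry(raw) for raw in raw_nodes]

for author, title in books:
    print(author, "|", title)
```

As with the XPath expressions, the title keeps its surrounding single quotes; calling `title.strip("'")` would remove them if they are not wanted.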
Dimitre Novatchev
@Tomalak: Thanks, I saw this immediately (before your comment) and it was already corrected when I read it. Monday morning, you know... :(
Dimitre Novatchev
Comment already deleted, right after I saw you fixed it. There seems to be some caching involved with the comments: if I change them, you do not seem to see the change right away.
Tomalak
+2  A: 

In XPath 1.0 you can select the text node children of p:

/p/text()

You can also get the string before (author) and after (title) the '-' character for each text node:

substring-before(/p/text()[1],'-')

Result:

  William Faulkner 

substring-after(/p/text()[1],'-')

Result:

 'Light In August'       

In XPath 2.0:

/p/text()/substring-before(.,'-')

This results in a sequence of 3 items:

William Faulkner William Faulkner William Faulkner 

And

/p/text()/substring-after(.,'-')

This also results in a sequence of 3 items:

'Light In August' 'Sanctuary' 'The Sound and the Fury'
Alejandro