ansaurus

Question

How to parse author name and book title from scraped HTML using XPath?

Answer 1

+2 A:

Yes. And easy, too:

//p/text()

Will give you three separate text nodes:

"
  William Faulkner - 'Light In August'
  ",
"
  William Faulkner - 'Sanctuary'
  ",
"
  William Faulkner - 'The Sound and the Fury'
"

Remember that leading and trailing whitespace (including any line breaks) is always part of the text node, so trim the result.
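As a rough illustration of the trimming step, here is a minimal Python sketch using only the standard library. The markup is an assumption reconstructed from the output above, since the original HTML is not shown here. Note that `xml.etree.ElementTree` does not support `text()` in its XPath subset, so the hypothetical helper `text_nodes` emulates `//p/text()` by collecting the element's `.text` and each child's `.tail`:

```python
import xml.etree.ElementTree as ET

# Assumed markup, reconstructed from the three text nodes listed above.
SAMPLE = """<div><p>
  William Faulkner - 'Light In August'
  <br/>
  William Faulkner - 'Sanctuary'
  <br/>
  William Faulkner - 'The Sound and the Fury'
</p></div>"""


def text_nodes(elem):
    """Return the direct text-node children of elem, like elem/text() in XPath."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [n for n in nodes if n is not None]


root = ET.fromstring(SAMPLE)
raw = [n for p in root.iter("p") for n in text_nodes(p)]

# Each raw node still carries the surrounding line breaks and indentation.
trimmed = [n.strip() for n in raw]

for line in trimmed:
    print(line)
```

With a full XPath engine the expression `//p/text()` returns these nodes directly; the trimming step stays the same.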

I take it that you do not need help with splitting the resulting strings into author and title.

Tomalak 2010-10-18 16:02:36

Interesting, can you provide some documentation that shows this is how text() works?

Abe Miessler 2010-10-18 16:04:52

@Abe: Before I start, can you indicate why you think that this is how `text()` does *not* work?

Tomalak 2010-10-18 16:06:05

Relax tough guy... I didn't say I don't think it works, I just didn't know it could work like this and would like to be able to read up on it. Not challenging you, just seeking knowledge...

Abe Miessler 2010-10-18 16:07:44

@Abe: I'm perfectly relaxed. `text()` selects all text node children of a node. And the `<p>` has five children here, three of type text interspersed with two of type element (`<br/>`).

Tomalak 2010-10-18 16:12:55
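The five-children description in the comment above can be checked with a small Python sketch using the standard library's `xml.dom.minidom`. The markup is an assumption (the original HTML is not shown in this thread); it only mirrors the structure described, three text runs separated by two `<br/>` elements:

```python
from xml.dom import minidom

# Assumed markup; the original HTML is not shown in this thread.
doc = minidom.parseString(
    "<p>\n  William Faulkner - 'Light In August'\n  <br/>"
    "\n  William Faulkner - 'Sanctuary'\n  <br/>"
    "\n  William Faulkner - 'The Sound and the Fury'\n</p>"
)
p = doc.documentElement

# Label each child of <p> as a text node or by its element name.
kinds = [
    "text" if child.nodeType == child.TEXT_NODE else child.nodeName
    for child in p.childNodes
]
print(kinds)  # ['text', 'br', 'text', 'br', 'text']
```

The three entries labelled `text` are the nodes that `text()` selects.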

Awesome thanks!

Abe Miessler 2010-10-18 16:20:08

Answer 2

+1 A:

You can get the $N-th author name with this XPath expression:

substring-before(normalize-space(p/text()[$N]), ' -')

You can get the $N-th title with this XPath expression:

substring-after(normalize-space(p/text()[$N]), ' - ')

You can first get the count of all text nodes with:

count(p/text())

Then evaluate the first two XPath expressions, substituting for $N each integer in the interval

[1,count(p/text())]
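The loop over $N can be sketched in plain Python. This is only an approximation under stated assumptions: the three helper functions are hypothetical stand-ins that mimic the XPath string functions of the same name, and the `text_nodes` list is an assumed stand-in for the node-set `p/text()`, since the original markup is not shown here.

```python
def normalize_space(s):
    """Roughly mimic XPath normalize-space(): trim and collapse whitespace runs."""
    return " ".join(s.split())


def substring_before(s, sep):
    """Mimic XPath substring-before(): '' when sep is absent."""
    head, found, _ = s.partition(sep)
    return head if found else ""


def substring_after(s, sep):
    """Mimic XPath substring-after(): '' when sep is absent."""
    _, found, tail = s.partition(sep)
    return tail if found else ""


# Assumed stand-in for the node-set p/text(); whitespace left untrimmed on purpose.
text_nodes = [
    "\n  William Faulkner - 'Light In August'\n  ",
    "\n  William Faulkner - 'Sanctuary'\n  ",
    "\n  William Faulkner - 'The Sound and the Fury'\n",
]

books = []
# XPath positions are 1-based, so $N runs over [1, count(p/text())].
for n in range(1, len(text_nodes) + 1):
    node = normalize_space(text_nodes[n - 1])
    books.append((substring_before(node, " -"), substring_after(node, " - ")))

for author, title in books:
    print(author, "|", title)
```

Because `normalize-space` runs first, neither the author nor the title needs a separate trimming step.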

Dimitre Novatchev 2010-10-18 16:06:08

@Tomalak: Thanks, I saw this immediately (before your comment) and it was already corrected when I read it. Monday morning, you know... :(

Dimitre Novatchev 2010-10-18 16:11:20

Comment already deleted, right after I saw you fixed it. Seems there is some caching involved with the comments: if I change them, you do not seem to see the change right away.

Tomalak 2010-10-18 16:35:12

Answer 3

+2 A:

In XPath 1.0 you can select the text node children of p:

/p/text()

You can also get the string before (author) and after (title) the - character for each text node:

substring-before(/p/text()[1],'-')

Result:

  William Faulkner 

substring-after(/p/text()[1],'-')

Result:

 'Light In August'

In XPath 2.0:

/p/text()/substring-before(.,'-')

This results in a sequence of 3 items:

William Faulkner William Faulkner William Faulkner

And

/p/text()/substring-after(.,'-')

This also results in a sequence of 3 items:

'Light In August' 'Sanctuary' 'The Sound and the Fury'

Alejandro 2010-10-18 16:08:43
