I'm trying to scrape customer reviews from a site and ran into an interesting setup.

<div class="Review">
  <img class="stars" etc>
  <b>ReviewerName</b>
  - yyyy-mm-dd
  <br/>
  <p>Review</p>
  <a>was this helpful links</a>
  <hr/>
  <br/>
  <!-- Repeat above for additional reviews. -->
</div>

For the life of me I can't come up with an XPath that will capture the date (- yyyy-mm-dd), as there is no HTML element wrapping it. Anyone have a solution?

Jon

+2  A: 

Assuming a structure like this:

<div class="Review">
    <img class="stars"/><b>ReviewerName</b> - yyyy-mm-dd<br/>
</div>

The following XPath selects the date yyyy-mm-dd:

substring-after(/div/b/following-sibling::text()[1],' - ')
Jörn Horstmann
The `substring-after()` call assumes the white-space around the dash has exactly the expected form, which may not be the case. I would probably remove the dash in the "outside" programming language, especially since many programming languages only let you access node-sets, not XPath function results.
Tomalak
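
For what it's worth, here is a minimal sketch of the "do it in the outside language" approach from the comment above, using only Python's standard library. Everything in it is an assumption for illustration: the `SAMPLE` markup is made up to match the structure in the question, `extract_review_dates` is a hypothetical helper name, and the page is assumed to be well-formed XML (real-world HTML often is not, in which case an HTML-tolerant parser would be needed). ElementTree exposes the text node that follows an element as that element's `.tail`, which plays the role of `following-sibling::text()[1]`; a regular expression then picks out the date regardless of the white-space around the dash.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample markup, modeled on the structure in the question.
SAMPLE = """<div class="Review">
  <img class="stars"/>
  <b>ReviewerName</b>
  - 2010-08-24
  <br/>
  <p>Review</p>
  <hr/>
  <br/>
  <img class="stars"/>
  <b>OtherReviewer</b>
  - 2010-09-01
  <br/>
  <p>Another review</p>
  <hr/>
  <br/>
</div>"""

# Matches a yyyy-mm-dd date anywhere in the text, ignoring the dash and spacing before it.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def extract_review_dates(markup):
    """Return (reviewer, date) pairs found in the given review markup."""
    root = ET.fromstring(markup)
    results = []
    for b in root.iter("b"):
        # b.tail is the bare text between </b> and the next element (<br/>).
        match = DATE_RE.search(b.tail or "")
        if match:
            results.append((b.text, match.group(0)))
    return results


if __name__ == "__main__":
    for reviewer, date in extract_review_dates(SAMPLE):
        print(reviewer, date)
```

Because the regular expression only looks for the date pattern itself, it does not matter whether the separator is " - ", "-", or spread across line breaks, which is the weakness of `substring-after()` pointed out above.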