I'm developing web scraping software that relies on XPath to extract information from web pages.

One application of the software is to scrape reviews of shows from websites. One page I'm trying to scrape is the Guardian's latest Edinburgh festival reviews: http://www.guardian.co.uk/culture/edinburghfestival+tone/reviews

The section I want is at the bottom, titled "Most recent". The XPath expression for the list of review items (that is, the picture, the stars, the date, the blurb, etc.) is

//ul[@id='auto-trail-block']

which selects the ul whose li children each correspond to one review item.

If I want to refer to only the blurb, the closest I can get is to say

//ul[@id='auto-trail-block']//div[@class='trailtext']

but when I collect the text content from each item of the list, it includes lots of JavaScript and other text I don't need. I can't refer to the blurb itself because it is not inside a p element. It sits directly inside a div element that also contains script elements (holding JavaScript) and strong elements (holding unrelated text).

In the debugger the DOM looks like this:

<ul id="auto-trail-block" ...>
  <li ...>
    <div ...>
    <div ...>
      <div ...>
      <div class="trailtext">
        <script ...>
        <div ...>
        <span ...>
        <strong .../>
        <br/>
        The Text I want to copy!
        <strong .../>
        <a .../>
        <div .../>
      </div>
    </div>
  </li>
  <li ...>
    ...
  </li>
  ...
</ul>

Is there any way to refer to the text content contained in just the div and not any of its subelements?

+1  A: 

My approach would be to select the trailtext div, remove the script tags with their content and all HTML tags. What's left would be the content you want.

Just wondering: what does the inner text node of //ul[@id='auto-trail-block']//div[@class='trailtext'] return? I would guess mostly the blurb, so clearing out the script tags should almost get you there.
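
As a rough sketch of that idea, the snippet below walks a trailtext div, skips the content of any script element, and drops all tags while keeping the remaining text. It uses Python's standard xml.etree.ElementTree on a hypothetical, well-formed stand-in for the markup (the sample content and the helper names are assumptions, not taken from the real page, which would likely need a lenient HTML parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one trailtext div.
SAMPLE_DIV = """
<div class="trailtext">
  <script>var tracking = 1;</script>
  <strong>4 stars</strong>
  <br/>
  The Text I want to copy!
  <a href="#">More</a>
</div>
"""

def collect_text(element, parts):
    """Append every text fragment under element, ignoring script content."""
    if element.tag == "script":
        return
    if element.text:
        parts.append(element.text)
    for child in element:
        collect_text(child, parts)
        # The tail is the text that follows the child inside its parent,
        # so it is kept even when the child itself is a script element.
        if child.tail:
            parts.append(child.tail)

def strip_scripts_and_tags(element):
    """Return the element's text with scripts removed and whitespace collapsed."""
    parts = []
    collect_text(element, parts)
    return " ".join(" ".join(parts).split())

div = ET.fromstring(SAMPLE_DIV)
cleaned = strip_scripts_and_tags(div)
print(cleaned)
```

On this sample the text of the strong and a elements is still present around the blurb, which is why this only almost gets you there.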

Oded
+1  A: 

If you only want the text node children of div[@class='trailtext'], and not the text of its child elements, then use text():

//ul[@id='auto-trail-block']//div[@class='trailtext']/text()
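
For illustration, here is a minimal sketch of what text() selects, written with Python's standard xml.etree.ElementTree. That module does not support text() in its XPath subset, so the sketch emulates it by reading the element's own text and the tail of each child. The sample markup and the direct_text helper are hypothetical; an XPath engine with full text() support would make the helper unnecessary.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one review item on the page.
SAMPLE = """
<ul id="auto-trail-block">
  <li>
    <div>
      <div class="trailtext">
        <script>var tracking = 1;</script>
        <strong>4 stars</strong>
        <br/>
        The Text I want to copy!
        <strong>Until 30 August</strong>
        <a href="#">More</a>
      </div>
    </div>
  </li>
</ul>
"""

def direct_text(element):
    """Return the non-blank text nodes that are direct children of element."""
    parts = [element.text or ""]
    # Text that follows a child element is stored as that child's tail.
    parts.extend(child.tail or "" for child in element)
    return [p.strip() for p in parts if p.strip()]

root = ET.fromstring(SAMPLE)
blurbs = [
    " ".join(direct_text(div))
    for div in root.findall(".//div[@class='trailtext']")
]
print(blurbs)
```

Text inside the script, strong and a elements belongs to those elements rather than to the div, so it is left out.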
Mads Hansen