I'm developing web scraping software that relies on XPath to extract information from web pages.
One application of the software is to scrape reviews of shows from websites. One page I'm trying to scrape is the Guardian's latest Edinburgh festival reviews: http://www.guardian.co.uk/culture/edinburghfestival+tone/reviews
The section I want is at the bottom, titled "Most recent". The XPath expression for the list of review items (that is, the picture, the stars, the date, the blurb, etc.) is
//ul[@id='auto-trail-block']
which selects the ul element whose li children each correspond to one review item.
If I want to refer to only the blurb, the closest I can get is to say
//ul[@id='auto-trail-block']//div[@class='trailtext']
but when I collect the text content from each item of the list, it includes a lot of JavaScript and other junk I don't need. I can't refer to the blurb itself because it is not inside a p element; it sits directly inside a div that also contains script elements (holding JavaScript) and strong elements (holding unrelated text).
In the debugger, the DOM looks like this:
<ul id="auto-trail-block" ...>
<li ...>
<div ...>
<div ...>
<div ...>
<div class="trailtext">
<script ...>
<div ...>
<span ...>
<strong .../>
<br/>
The Text I want to copy!
<strong .../>
<a .../>
<div .../>
</div>
</div>
</li>
<li ...>
...
</li>
...
</ul>
Is there any way to refer to the text content contained in just the div and not any of its subelements?
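To make the problem concrete, here is a minimal sketch of the behaviour, assuming Python's standard xml.etree.ElementTree and a simplified, well-formed stand-in for the markup above. The element contents (the script body, the strong text, the link) are invented for illustration and are not the real page. It contrasts the full text content of the div, which pulls in the script and strong text, with only the text nodes that are direct children of the div. In full XPath 1.0 the latter would correspond to appending /text() to the expression; ElementTree's limited XPath does not support that step, so the helper below collects those nodes by hand.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page markup.
# All element contents here are invented for illustration.
SAMPLE = """
<ul id="auto-trail-block">
  <li>
    <div>
      <div class="trailtext">
        <script>var tracking = 1;</script>
        <strong>Comedy</strong>
        <br/>
        The Text I want to copy!
        <strong>4 stars</strong>
        <a href="#">More</a>
      </div>
    </div>
  </li>
</ul>
"""


def all_text(elem):
    """Text of elem and all of its descendants, whitespace-normalized."""
    return " ".join("".join(elem.itertext()).split())


def own_text(elem):
    """Only the text nodes that are direct children of elem."""
    # In ElementTree, text before the first child is elem.text and
    # text following each child is stored on that child's .tail.
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    return " ".join("".join(parts).split())


root = ET.fromstring(SAMPLE)
blurbs = root.findall(".//div[@class='trailtext']")

for div in blurbs:
    print("all text:", all_text(div))
    print("own text:", own_text(div))
```

With this sample, the first line printed includes the script body and the strong text, while the second is just the blurb. What I'm after is an XPath expression that gives me the second result directly.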