I've been tasked with getting all the SMS updates from this page and putting them into a JSON feed using Yahoo Pipes. I'm not sure how to pick out each update, because the updates are not wrapped in individual elements; the page is just a flat run of title, post time, category and message elements. Any shared wisdom would be much appreciated!

+2  A: 
<h1 id="blogtitle">SMS Update</h1> 
<div class="blogposttime blogdetail">Left at 2nd January 2010 at 01:12</div> 
<div class="blogcategories blogdetail">Recieved by SMS (Location: Pokhara - Nepal)</div> 
<p class="blogpostmessage"> 
RACE DAY! We took the extra day off to pimp the rick some more, including a huge Australian flag. Quiet night at a pub with 6 other teams. Time for brekkie and then we're off to the rickshaw grounds for 8:30 for 10am start.
</p> 

That seems a fairly easy job for a DOM/XML parser.

Since the blocks are not enclosed in a wrapping element, you could look for an element that is present in each block and treat it as a delimiter. For example, the <h1 id="blogtitle">SMS Update</h1> marks the start of a new block.

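Use your DOM parser to find all the elements with the id blogtitle. From each one, use the DOM's nextSibling property to walk forward. All you need are the three element siblings after the blogtitle element: the post time, the category and the message. Keep in mind that nextSibling can return whitespace text nodes between elements, so skip anything that is not an element.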

With a little work you can easily use this logic to build your JSON object.
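
As a rough illustration of the sibling-walking idea above, here is a minimal sketch in Python using only the standard library. It makes several assumptions: the markup is well-formed enough for xml.dom.minidom (a real page would probably need a more lenient HTML parser), the sample is wrapped in a made-up root element, the message is abbreviated, the second update is invented for the example, and the JSON key names (title, posted, category, message) are hypothetical.

```python
import json
from xml.dom import minidom

# Hypothetical sample: wrapped in a made-up root <div> so it parses as XML.
# The first message is abbreviated and the second update is invented.
SAMPLE = """<div id="page">
<h1 id="blogtitle">SMS Update</h1>
<div class="blogposttime blogdetail">Left at 2nd January 2010 at 01:12</div>
<div class="blogcategories blogdetail">Recieved by SMS (Location: Pokhara - Nepal)</div>
<p class="blogpostmessage">
RACE DAY! Time for brekkie and then we're off.
</p>
<h1 id="blogtitle">SMS Update</h1>
<div class="blogposttime blogdetail">Left at 3rd January 2010 at 09:30</div>
<div class="blogcategories blogdetail">Recieved by SMS (Location: Example Town)</div>
<p class="blogpostmessage">
Second update, made up for this example.
</p>
</div>"""


def text_of(node):
    """Concatenate all text beneath a node and collapse runs of whitespace."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return " ".join("".join(parts).split())


def next_elements(node, count):
    """Return up to `count` following element siblings, skipping text nodes."""
    found = []
    sibling = node.nextSibling
    while sibling is not None and len(found) < count:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            found.append(sibling)
        sibling = sibling.nextSibling
    return found


def extract_updates(markup):
    """Treat every <h1 id="blogtitle"> as the start of one update."""
    doc = minidom.parseString(markup)
    updates = []
    for h1 in doc.getElementsByTagName("h1"):
        if h1.getAttribute("id") != "blogtitle":
            continue
        siblings = next_elements(h1, 3)
        if len(siblings) < 3:
            continue  # incomplete block at the end of the page
        time_el, category_el, message_el = siblings
        updates.append({
            "title": text_of(h1),
            "posted": text_of(time_el),
            "category": text_of(category_el),
            "message": text_of(message_el),
        })
    return updates


if __name__ == "__main__":
    print(json.dumps(extract_updates(SAMPLE), indent=2))
```

Walking siblings from each title keeps the fields of one update together, so a missing element affects only that update rather than shifting every later one.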

Luca Matteis
Thanks, but I "solved" this by finding all the blogtitle elements on the page, as well as the posttime elements, etc., and just iterating over them using Nokogiri (Ruby), since they're always in the right order. Seems to be working swimmingly. Thank you for your answer, however.
Ryan Bigg
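
For comparison, the approach described in the comment above (collect every title, post time, category and message into separate lists, then pair them up by position) might look roughly like the sketch below. The comment used Nokogiri in Ruby; this version swaps in Python's standard-library html.parser instead, so it is not the commenter's actual code. The class name FieldCollector, the function collect_updates, the output key names and the second sample update are all made up for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample; the second update is invented for this example.
SAMPLE_HTML = """
<h1 id="blogtitle">SMS Update</h1>
<div class="blogposttime blogdetail">Left at 2nd January 2010 at 01:12</div>
<div class="blogcategories blogdetail">Recieved by SMS (Location: Pokhara - Nepal)</div>
<p class="blogpostmessage">
RACE DAY! Time for brekkie and then we're off.
</p>
<h1 id="blogtitle">SMS Update</h1>
<div class="blogposttime blogdetail">Left at 3rd January 2010 at 09:30</div>
<div class="blogcategories blogdetail">Recieved by SMS (Location: Example Town)</div>
<p class="blogpostmessage">
Second update, made up for this example.
</p>
"""


class FieldCollector(HTMLParser):
    """Collect the text of each field type into its own list, in page order."""

    # CSS class on the page -> hypothetical output key
    FIELDS = {
        "blogposttime": "posted",
        "blogcategories": "category",
        "blogpostmessage": "message",
    }

    def __init__(self):
        super().__init__()
        self.values = {"title": [], "posted": [], "category": [], "message": []}
        self._current = None      # field currently being captured
        self._current_tag = None  # tag that opened it
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            return  # simplification: nested tags inside a field are ignored
        attrs = dict(attrs)
        if attrs.get("id") == "blogtitle":
            field = "title"
        else:
            classes = (attrs.get("class") or "").split()
            field = next((self.FIELDS[c] for c in classes if c in self.FIELDS), None)
        if field:
            self._current, self._current_tag, self._buffer = field, tag, []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current_tag:
            text = " ".join("".join(self._buffer).split())
            self.values[self._current].append(text)
            self._current = None


def collect_updates(markup):
    """Pair up the n-th title, time, category and message by position."""
    parser = FieldCollector()
    parser.feed(markup)
    parser.close()
    keys = ("title", "posted", "category", "message")
    rows = zip(*(parser.values[key] for key in keys))
    return [dict(zip(keys, row)) for row in rows]


if __name__ == "__main__":
    print(json.dumps(collect_updates(SAMPLE_HTML), indent=2))
```

Pairing by position relies on every update having all four elements in the same order. If one update lacked, say, a category, zip would silently misalign the later entries, so it may be worth checking that the four lists have equal length first.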