How could I use Ruby to extract information from a table consisting of these rows? Is it possible to detect the comments using Nokogiri?

A: 

I have an HTML page that I want to parse to extract "EXTRACT_THIS" and "EXTRACT_THIS_TOO". The page contains over 100 nested tables with no ids. Using XPath is difficult because the page does not always validate. The only reliable markers I can see for the data are the HTML comments at the start of each row. Can I do this using Ruby? The HTML below is simplified to aid understanding.

<!-- begin TOPIC 1-->
<tr>
<td>EXTRACT_THIS</td><td>EXTRACT_THIS_TOO</td>
</tr>
<!-- end topic 1-->
<!-- begin TOPIC 2-->
<tr>
<td>EXTRACT_THIS</td><td>EXTRACT_THIS_TOO</td>
</tr>
<!-- end topic 2-->
A: 

You could implement a Nokogiri SAX parser. It takes less work than it might seem at first sight. The parser calls back into your handler for elements (together with their attributes) and for comments.

Within your parser, you should remember the state, for example with a flag like @currently_interested = true, so that you know which parts to keep and which to skip.
