A: 

In case there isn't a library to do that for ruby, here's some code to get you started writing this yourself:

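require 'nokogiri'

html = "<table><tr><th>la</th><th><b>lu</b></th></tr><tr><td>lala</td><td>lulu</td></tr><tr><td><b>lila</b></td><td>lolu</td></tr></table>"
doc = Nokogiri::HTML(html)

# The first row holds the headers; the remaining rows hold the data.
# Select only th/td cells so whitespace between tags is not treated as a cell.
header, *rest = (doc/"tr").map do |row|
  (row/"th, td").map { |cell| cell.text }
end

# Each header becomes a Struct member, so cells can be read by name.
item_struct = Struct.new(*header.map { |str| str.to_sym })
table = rest.map { |row| item_struct.new(*row) }

table[1].lu #=> "lolu"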

This code is far from perfect, obviously, but it should get you started.
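
One of the rough edges is that real header text ("Departure Time", say) doesn't make a convenient Struct member name. Here's a minimal sketch of one way around that, in plain Ruby, working on cell text that has already been extracted. The helper names header_to_member and rows_to_structs are made up for illustration, and the normalisation rules are just an assumption about what you'd want:

```ruby
# Hypothetical helper: turn header text into a symbol that works as a
# Struct member, e.g. "Departure Time" becomes :departure_time.
def header_to_member(text)
  name = text.strip.downcase.gsub(/[^a-z0-9]+/, "_").gsub(/\A_+|_+\z/, "")
  # Prefix names that are empty or start with a digit (an arbitrary choice).
  name = "col_#{name}" if name.empty? || name.match?(/\A\d/)
  name.to_sym
end

# Hypothetical helper: build one Struct instance per data row.
# header is an array of strings, rows is an array of arrays of strings.
def rows_to_structs(header, rows)
  members = header.map { |h| header_to_member(h) }
  row_struct = Struct.new(*members)
  rows.map do |row|
    # Struct.new raises if given more values than members, so drop extras;
    # missing trailing values are simply left as nil.
    row_struct.new(*row.first(members.size).map(&:strip))
  end
end

table = rows_to_structs(["la", " Lu "], [["lala", "lulu"], ["lila", "lolu"]])
puts table[1].lu  # prints "lolu"
```

This doesn't deal with duplicate header names (Struct.new will raise on those), so you'd still need to handle that if your table has them.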

sepp2k
A: 

You might like to try Hpricot (gem install hpricot; prepend the usual sudo on *nix systems).

I placed your HTML into input.html, then ran this:

require 'hpricot'

doc = Hpricot.XML(open('input.html'))

table = doc/:table

(table/:tr).each do |row|
  (row/:td).each do |cell|
    puts cell.inner_html
  end
end

which, for the first row, gives me

<span class="black">12:17AM </span>
<span class="black">
    <a href="http://www.mta.info/mnr/html/planning/schedules/ref.htm"></a></span>
<span class="black">1:22AM  </span>
<span class="black">
    <a href="http://www.mta.info/mnr/html/planning/schedules/ref.htm"></a></span>
<span class="black">65</span>
<span class="black">TRANSFER AT STAMFORD (AR 1:01AM & LV 1:05AM)                                                                            </span>
<span class="black">

 N


</span>

So already we're down to the content of the TD tags. A little more work and you're about there.
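
Part of that remaining work is tidying up whitespace once you have the plain text of each cell. A rough sketch in plain Ruby, assuming the tags have already been stripped; the helper names clean_cell and clean_row are made up, and raw_row is only my guess at what the text of the first row above would look like:

```ruby
# Hypothetical helper: collapse runs of whitespace and trim both ends.
def clean_cell(text)
  text.gsub(/\s+/, " ").strip
end

# Hypothetical helper: clean every cell in a row and drop the empty ones.
def clean_row(cells)
  cells.map { |c| clean_cell(c) }.reject(&:empty?)
end

# Assumed cell text for the first row, with the markup removed.
raw_row = [
  "12:17AM ",
  "\n    ",
  "1:22AM  ",
  "\n    ",
  "65",
  "TRANSFER AT STAMFORD (AR 1:01AM & LV 1:05AM)      ",
  "\n\n N\n\n\n"
]

p clean_row(raw_row)
# prints ["12:17AM", "1:22AM", "65", "TRANSFER AT STAMFORD (AR 1:01AM & LV 1:05AM)", "N"]
```

Dropping the empty cells shifts the later columns to the left, so if you need the cells to line up with the header row, skip the reject step and keep the blanks.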

(BTW, the HTML looks a little oddly structured: you have <th> tags in <tbody>, which seems a bit perverse, since <tbody> is fairly pointless if it's just going to be another level within <table>. It makes much more sense if your <tr><th>...</th></tr> stuff is in a separate <thead> section within the table. But it may not be "your" HTML, of course!)

Mike Woodhouse