Hi all

I have just installed Ruby and mechanize. It seems that what I want to do is possible with Nokogiri, but I do not know how to do it.

What about this table? It is part of the HTML of a vBulletin forum page. I tried to keep the HTML structure but deleted some text and tag attributes. I want to get some details per thread: Title, Author, Date, Time, Replies, Views.

Please note that there are a few tables in the HTML document. I am after one particular table, the one containing <tbody id="threadbits_forum_251">. The id will always be the same (I hope). Can I use the tbody and its id in the code?

Thank you

R

<table >
    <tbody>
        <tr>    <!-- table header --> </tr>
    </tbody>
    <!-- show threads -->
    <tbody id="threadbits_forum_251">
        <tr>
            <td></td>
            <td></td>
            <td>
                <div>
                    <a href="showthread.php?t=230708" >Vb4 Gold Released</a>
                </div>
                <div>
                    <span><a>Paul M</a></span>
                </div>
            </td>
            <td>
                06 Jan 2010 <span class="time">23:35</span><br />
                by <a href="member.php?find=lastposter&amp;t=230708">shane943</a>
            </td>
            <td><a href="#">24</a></td>
            <td>1,320</td>
        </tr>

    </tbody>
</table>
+3  A: 
#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
  (The HTML from the question goes here)
EOS

doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
  detail = {}
  [
    [:title, 'td[3]/div[1]/a/text()'],
    [:name, 'td[3]/div[2]/span/a/text()'],
    [:date, 'td[4]/text()'],
    [:time, 'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views, 'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp details

# => [{:time=>"23:35",
# =>   :title=>"Vb4 Gold Released",
# =>   :number=>"24",
# =>   :date=>"06 Jan 2010",
# =>   :views=>"1,320",
# =>   :name=>"Paul M"}]
Wayne Conrad
I think the css equivalent would be `doc.css('tbody#threadbits_forum_251 tr')`, but I haven't actually tested that in code...
kejadlen
wow, I have to try and let you know!!! thank you so much. R
Radek
@Kejadlen, I replaced the doc.xpath(...) call with your doc.css call, and it worked great.
Wayne Conrad
Could somebody explain the syntax to me? Thank you in advance.
Radek
What's got you stumped? Is it the Ruby syntax, the xpath syntax, or both?
Wayne Conrad
Radek
And why does it start with // ? I cannot find any good (good enough for ME) documentation on that...
Radek
Yes, you already have nokogiri. See http://stackoverflow.com/questions/2060247/how-to-read-someone-elses-forum/2060983#2060983 for an example using mechanize. That example doesn't directly use nokogiri, except on the commented-out line to print the fetched html. But nokogiri is there inside mechanize if you need it (just call page.parser). The xpath you quoted means "get me any table, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251."
Wayne Conrad
@Wayne, thank you so much. I updated the code following your other example and it is working very nicely now. I still have a few questions. The most important is whether you could suggest any documentation for me. The next one is why there is /tr at the end of the xpath you nicely explained to me. I also want to extract the URL of the post. I tried [:url, 'td[3]/div[1]/a'], [:url, 'td[3]/div[1]/a href/text()'], [:url, 'td[3]/div[1]/a/href/text()'], and [:url, 'td[3]/div[1]/a/href'], and nothing worked. Where can I learn how to extract href, id, alt, src, etc.? Thank you
Radek
@Wayne, another question: I want to add some info from the post itself, so I have to follow the thread link and add the info to the detail object. Where in your code can I add that? I hope I am not asking too much. Could you also explain the code after details? Thank you
Radek
The forum I use to learn mechanize/nokogiri/parsing is http://www.vbulletin.org/forum/forumdisplay.php?f=251
Radek
Radek, These are all great questions. What would you say to creating more SO questions? That way you'll get more people's answers.
Wayne Conrad
@Wayne thank you so much for your help!!!
Radek
@Wayne Conrad: can I ask why you use an array of hashes to store the data? Why not a hash of hashes or an object? Thank you
Radek
Mostly, because an array of hashes was the simplest thing that could possibly work, making for a clearer example. Also, and I don't know if this matters for you, in Ruby < 1.9, hashes don't have a well-defined order so you lose the original order of the rows.
Wayne Conrad
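As a rough illustration of that trade-off, the sketch below keeps the rows in an array, which preserves page order in any Ruby version, and builds a lookup hash on the side for access by key. The `:id` field and the second row are invented for the example; the answer's code does not extract a thread id.

```ruby
# Rows as they might come out of the scraping loop (second row is made up).
details = [
  { :id => '230708', :title => 'Vb4 Gold Released' },
  { :id => '230001', :title => 'Hypothetical second thread' },
]

# The array keeps the rows in the order they appeared on the page.
titles_in_order = details.map { |detail| detail[:title] }

# A lookup hash can be built next to the array when access by key is needed.
by_id = {}
details.each { |detail| by_id[detail[:id]] = detail }

puts titles_in_order.inspect
puts by_id['230708'][:title]
```

From Ruby 1.9 on, hashes also remember insertion order, so a hash of hashes would keep the row order there as well.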
yes, the order matters. Thank you.
Radek