views:

1443

answers:

2

I am extracting data from a forum. My script is working fine so far. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post, and I cannot get it to work. I used FireXPath to determine the xpath.

Sample code:

    require 'rubygems'
    require 'mechanize'

    post_agent = WWW::Mechanize.new
    post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
    puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
    puts post_page.parser.at_xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
    puts post_page.parser.xpath('//[@id="post1960370"]/tbody/tr[1]/td/div[2]/text()')

All my attempts end with an empty string or an error.


I cannot find any documentation on using Nokogiri within Mechanize. The Mechanize documentation says at the bottom of the page: "After you have used Mechanize to navigate to the page that you need to scrape, then scrape it using nokogiri methods." But which methods? Where can I read about them, with samples and explained syntax? I did not find anything on the Nokogiri website either.

+3  A: 

Hi again, Radek. I'm going to show you how to fish.

When you call Mechanize::Page::parser, it's giving you the Nokogiri document. So your "xpath" and "at_xpath" calls are invoking Nokogiri. The problem is in your xpaths. In general, start out with the most general xpath you can get to work, and then narrow it down. So, for example, instead of this:

    puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip

start with this:

    puts post_page.parser.xpath('//table').to_html

This gets all tables, anywhere in the document, and then prints them as HTML. Examine the HTML to see what tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has CSS class "userdata", then try this:

    puts post_page.parser.xpath("//table[@class='userdata']").to_html

Any time you get back an empty result, you goofed up the xpath, so fix it before proceeding. Once you're getting the table you want, then try to get the rows:

    puts post_page.parser.xpath("//table[@class='userdata']//tr").to_html

If that worked, then take off the "to_html" and you now have an array of Nokogiri nodes, each one a table row.

And that's how you do it.
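The narrowing steps above can be sketched end to end. This is only an illustration under stated assumptions: the markup is a made-up stand-in for a forum page, and it uses REXML (bundled with Ruby) rather than Nokogiri so that it runs without installing any gems. The XPath expressions themselves are the same ones you would pass to Nokogiri's `xpath`.

```ruby
require 'rexml/document'

# Hypothetical stand-in for a forum page: two tables, only one of interest.
PAGE = REXML::Document.new(<<~XML)
  <html>
    <body>
      <table class="nav"><tr><td>Home</td></tr></table>
      <table class="userdata">
        <tr><td>Radek</td></tr>
        <tr><td>21 Dec 2009, 20:39</td></tr>
      </table>
    </body>
  </html>
XML

# Step 1: the most general query; it matches every table in the document.
all_tables = REXML::XPath.match(PAGE, '//table')
puts "tables found: #{all_tables.size}"

# Step 2: narrow to the one table, identified by its class attribute.
user_tables = REXML::XPath.match(PAGE, "//table[@class='userdata']")
puts "userdata tables found: #{user_tables.size}"

# Step 3: narrow again to the rows of that table and print each first cell.
rows = REXML::XPath.match(PAGE, "//table[@class='userdata']//tr")
rows.each { |row| puts row.elements['td'].text }
```

If any step suddenly returns nothing, the part of the expression added in that step is the part to look at.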

Wayne Conrad
PS: This is a general tutorial showing how you figure out the correct xpath: You don't start with a fully specified xpath, because then you've got no idea what's wrong if it returns nothing. Start with something so general that it's guaranteed to return something, and then keep making it more specific until you have the one thing you want. By doing it in steps, when it doesn't work you know it's the last thing you added to the xpath.
Wayne Conrad
@Wayne Conrad: Hi Wayne, thank you for the nice tutorial. I will try what you say, but I thought that since I want only the first instance of the element, it would be easy and fast to use an absolute xpath, and that it would give me the first item from the array.
Radek
So you would follow all these steps even if you only wanted the number of times this question was viewed?
Radek
Yes, I always figure out my xpaths iteratively. Someone who is good at xpath might be able to get it right the first time. That someone is not me. It's not the xpath that decides whether you get one thing or many. It's whether you call "xpath" or "at_xpath". If you call "at_xpath", you'll always get one thing (or nil if nothing matched); if multiple elements matched, you'll only get the first one. If you call "xpath", you'll always get an array, even if you matched just one thing.
Wayne Conrad
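To make the difference concrete: in Nokogiri, `xpath` returns a collection of every match (possibly empty), while `at_xpath` returns only the first match, or nil when nothing matched. The sketch below shows the same many-versus-first distinction with REXML, which is bundled with Ruby, so it runs without extra gems. Here `REXML::XPath.match` stands in for `xpath` and `REXML::XPath.first` for `at_xpath`; the markup is invented for illustration.

```ruby
require 'rexml/document'

# Hypothetical markup with three cells that all match '//td'.
ROWS_DOC = REXML::Document.new(<<~XML)
  <table>
    <tr><td>first</td></tr>
    <tr><td>second</td></tr>
    <tr><td>third</td></tr>
  </table>
XML

# Analogous to Nokogiri's `xpath`: every match, returned as a collection.
all_cells = REXML::XPath.match(ROWS_DOC, '//td')
puts all_cells.map(&:text).inspect

# Analogous to Nokogiri's `at_xpath`: only the first match, as a single node.
first_cell = REXML::XPath.first(ROWS_DOC, '//td')
puts first_cell.text

# When nothing matches, the collection is empty and the single-node form is nil.
puts REXML::XPath.match(ROWS_DOC, '//th').empty?
puts REXML::XPath.first(ROWS_DOC, '//th').nil?
```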
Wow, this is what I was looking for: the difference between 'xpath' and 'at_xpath'. Great! Thank you for that. How did you learn that?
Radek
I cannot understand why the full xpath doesn't work!? Full xpath + 'at_xpath' would give me the first match and I would be happy :-)
Radek
Did you try what I said? Start with '//table', then get it to pick out just the one table that has the data you want.
Wayne Conrad
I am almost there. I have an array of 15 tables (= 15 posts), where the first table has the data that I want. The xpath is "//div[@id='posts']/div/table". If I add tbody to be more specific, it gives me nil.
Radek
The one-line solution is: puts post_page.parser.xpath("//div[@id='posts']/div/table/tr/td/div[2]")[0].xpath('text()').to_s.strip
Radek
What happens if you use two slashes before tbody instead of one? What does that tell you?
Wayne Conrad
"//div[@id='posts']/div/table//tbody/tr/td" gives nil too
Radek
When I used .at_xpath anywhere in this exercise, I got no results.
Radek
Even when you use at_xpath('//table')?
Wayne Conrad
Yes, at_xpath('//table') gives me something. Even puts post_page.parser.at_xpath("//div[@id='posts']/div/table/tr/td/div[2]") gives me what I want. But to extract the final piece I have to use xpath; at_xpath gives me an empty string.
Radek
Go back to the other answers I've given you. You'll see something different at the end of the final xpaths.
Wayne Conrad
Not sure what you mean. .to_s.strip? Or text()? I used both.
Radek
I apologize for my inability to communicate clearly. I honestly don't know where to go now--this is now an individual tutoring session, which I'm not sure SO is for. That's not what bothers me, though. I'm bothered that I haven't figured out how to communicate the key concepts I want to get across. There are general problem solving principles in programming that I want to communicate that will help you solve not just this problem but any problem. Sadly, I am not up to the task.
Wayne Conrad
@Wayne Conrad: you did a good job. I can now do the fishing by myself :-) I will post separate question(s) for clarification. Let's close it here. Thank you so much.
Radek
A: 

Hi, I had the same problem: with the absolute xpath, I got only an empty string. I found out that the xpath did not like the "tbody".

This path showed me the whole table:

'/html/body/div/table'

This path showed me nothing:

'/html/body/div/table/tbody/tr'

This path showed me the row and everything was fine:

'/html/body/div/table//tr'

Can anybody explain that?

tzzzpfff
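A likely explanation for the "tbody" behaviour: browsers insert a tbody element into the DOM of every table even when the page source has none, so browser tools such as FireXPath report paths that contain tbody. A parser that reads the raw source, which as far as I know is what the Nokogiri HTML parser behind Mechanize does, never sees that element, so a path that steps through tbody matches nothing, while "//tr" matches the rows regardless of what lies in between. The sketch below reproduces the three results on made-up markup that, like the page source, has no tbody. It uses REXML (bundled with Ruby) as a stand-in so it runs without gems; `count_matches` is a hypothetical helper.

```ruby
require 'rexml/document'

# Hypothetical page source: note that the markup contains no <tbody>.
TBODY_DOC = REXML::Document.new(<<~XML)
  <html><body><div>
    <table>
      <tr><td>21 Dec 2009, 20:39</td></tr>
    </table>
  </div></body></html>
XML

# Hypothetical helper: how many nodes does this XPath match in the document?
def count_matches(path)
  REXML::XPath.match(TBODY_DOC, path).size
end

# The table itself is found.
puts count_matches('/html/body/div/table')

# A path copied from a browser tool steps through tbody, which is not in the source.
puts count_matches('/html/body/div/table/tbody/tr')

# The descendant step (//) finds the row whether or not a tbody sits in between.
puts count_matches('/html/body/div/table//tr')
```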