ansaurus

Question

extract single string from html using ruby/mechanize (and nokogiri)

Answer 1

+3 A:

Hi again, Radek. I'm going to show you how to fish.

When you call Mechanize::Page#parser, it gives you the Nokogiri document. So your "xpath" and "at_xpath" calls are invoking Nokogiri. The problem is in your xpaths. In general, start out with the most general xpath you can get to work, and then narrow it down. So, for example, instead of this:

puts  post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip

start with this:

puts post_page.parser.xpath('//table').to_html

This gets all the tables, anywhere in the document, and then prints them as HTML. Examine the HTML to see what tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has CSS class "userdata", then try this:

puts post_page.parser.xpath("//table[@class='userdata']").to_html

Any time you get back nothing (an empty result), you goofed up the xpath, so fix it before proceeding. Once you're getting the table you want, then try to get the rows:

puts post_page.parser.xpath("//table[@class='userdata']//tr").to_html

If that worked, then take off the "to_html" and you now have an array of Nokogiri nodes, each one a table row.

And that's how you do it.

Wayne Conrad 2010-01-22 03:29:20

PS: This is a general tutorial showing how you figure out the correct xpath: You don't start with a fully specified xpath, because then you've got no idea what's wrong if it returns nothing. Start with something so general that it's guaranteed to return something, and then keep making it more specific until you have the one thing you want. By doing it in steps, when it doesn't work you know it's the last thing you added to the xpath.

Wayne Conrad 2010-01-22 03:57:18

@Wayne Conrad: Hi Wayne, thank you for the nice tutorial. I will try what you say, but I thought that since I want only the first instance of the element, it would be easy and fast to use an absolute xpath, and it would give me the first item from the array.

Radek 2010-01-22 03:58:53

So you would follow all these steps even if you only wanted to get the number of times this question was viewed?

Radek 2010-01-22 04:03:42

Yes, I always figure out my xpaths iteratively. Someone who is good at xpath might be able to get it right the first time. That someone is not me. It's not the xpath that decides whether you get one thing or many. It's whether you call "xpath" or "at_xpath". If you call "at_xpath", you'll always get one thing; if multiple elements matched, you'll only get the first one (and nil if nothing matched). If you call "xpath", you'll always get an array-like NodeSet, even if you matched just one thing.

Wayne Conrad 2010-01-22 04:17:20

Wow, this is something I was looking for: the difference between 'xpath' and 'at_xpath'. Great! Thank you for that. How did you learn that?

Radek 2010-01-22 04:18:59

I cannot understand why the full xpath doesn't work!? Full xpath + 'at_xpath' would give me the first match and I would be happy :-)

Radek 2010-01-22 04:19:35

Did you try what I said? Start with '//table', then get it to pick out just the one table that has the data you want.

Wayne Conrad 2010-01-22 04:31:45

I am almost there. I have an array of 15 tables (= 15 posts), where the first table has the data that I want. The xpath is "//div[@id='posts']/div/table". If I add tbody to be more specific, it gives me nil.

Radek 2010-01-22 04:42:01

The one-line solution is: puts post_page.parser.xpath("//div[@id='posts']/div/table/tr/td/div[2]")[0].xpath('text()').to_s.strip

Radek 2010-01-22 05:14:34

What happens if you use two slashes before tbody instead of one? What does that tell you?

Wayne Conrad 2010-01-22 05:14:59

"//div[@id='posts']/div/table//tbody/tr/td" gives nil too

Radek 2010-01-22 05:18:45

When I used .at_xpath anywhere in this exercise, I got no results.

Radek 2010-01-22 05:24:23

Even when you use at_xpath('//table')?

Wayne Conrad 2010-01-22 05:37:38

Yes, at_xpath('//table') gives me something. Even puts post_page.parser.at_xpath("//div[@id='posts']/div/table/tr/td/div[2]") gives me what I want. But to extract the final piece I have to use xpath; at_xpath gives me an empty string.

Radek 2010-01-22 06:07:49

Go back to the other answers I've given you. You'll see something different at the end of the final xpaths.

Wayne Conrad 2010-01-22 07:56:11

Not sure what you mean. .to_s.strip? Or text()? I used both.

Radek 2010-01-22 09:50:08

I apologize for my inability to communicate clearly. I honestly don't know where to go now--this is now an individual tutoring session, which I'm not sure SO is for. That's not what bothers me, though. I'm bothered that I haven't figured out how to communicate the key concepts I want to get across. There are general problem solving principles in programming that I want to communicate that will help you solve not just this problem but any problem. Sadly, I am not up to the task.

Wayne Conrad 2010-01-22 17:14:42

@Wayne Conrad: you did a good job. I can now do the fishing by myself :-) I will post separate question(s) for clarification. Let's close it here. Thank you so much.

Radek 2010-01-22 18:10:06

Answer 2

A:

Hi, I had the same problem: with the absolute xpath, I got only an empty string. I found out that the xpath did not like the "tbody".

This path showed me the whole table:

'/html/body/div/table'

This path showed me nothing:

'/html/body/div/table/tbody/tr'

This path showed me the row and everything was fine:

'/html/body/div/table//tr'

Can anybody explain that?

tzzzpfff 2010-02-25 20:10:27
