I have not found any documentation or tutorial for that. Does anything like that exist?


doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')

The code above will get me any table, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251. But why does it start with a double //? Why is there /tr at the end?


Could anybody tell me how to extract href, id, alt, src etc. using Nokogiri?

'td[3]/div[1]/a/text()' <--- extracts text

but how can I extract other things?

+4  A: 

It seems you need to read an XPath tutorial.

Your //table/tbody[@id="threadbits_forum_251"]/tr expression means:

  • // - Anywhere in your XML document
  • table/tbody - take a table element with a tbody child
  • [@id="threadbits_forum_251"] - where that tbody's id attribute equals "threadbits_forum_251"
  • tr - and take its tr elements

So, basically, you need to know:

  • attribute names begin with @
  • conditions go inside [] brackets

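If I understood that API correctly, you can go with doc.at_xpath("td[3]/div[1]/a")["href"] (xpath returns a set of nodes, at_xpath returns the first match), or select the attribute directly with td[3]/div[1]/a/@href if there is just one <a> element.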

Rubens Farias
@Rubens thank you. And you're right, I need to read the XPath tutorial. I thought it was the Nokogiri docs I needed to read... Would you know if there is any tool that would give me the full XPath if I click an object on an HTML page?
Radek
I don't know, but XPath isn't that hard; consider your filesystem, and let's assume every folder is an XML element; so, when you select your `system32` folder, you'll get the `\windows\system32` path; just replace that `\` with `/`, remember that attributes begin with `@` and conditions go in `[]`, and you're good to go
Rubens Farias
A: 

Your XPath is correct, and you seem to have (almost) answered the first part of your own question:

doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')

"the code above will get me any table's tr, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251"


// means the following element can appear anywhere in the document

/tr at the end means: get the tr child nodes of the matching element.

You don't need to extract each attribute one by one. Just get the entire node containing all 4 attributes in Nokogiri, and get the attributes using:

theNode['href']
theNode['src']

where theNode is your Nokogiri Node object.

Edit: Sorry, I haven't used these libraries, but I think the XPath evaluation and parsing is being done by mechanize. So here's how you would get the entire element and its attributes in one go.

# Visit every matching <a> element and print the attributes of interest
doc.xpath("td[3]/div[1]/a").each do |anchor|
  puts anchor['href']
  puts anchor['src']
  # ...
end
Anurag
@Anurag thank you for the nice explanation. I am using mechanize, not pure Nokogiri. Can I use theNode['href'] somehow in [:title, 'td[3]/div[1]/a/text()']? I want to extract the href instead of the text
Radek
`[:address, 'td[3]/div[1]/a/@href']` ?
Rubens Farias
+1  A: 

FireXPath is highly recommended for quickly testing out/tweaking xpaths like this.

Dave Sims
thank you for that suggestion, I will explore it
Radek
+1  A: 

XPather for Firefox is by far the easiest tool to extract the xpath from an element.

[https://addons.mozilla.org/en-US/firefox/addon/1192]

ryan