ansaurus

Question

Answer 1

A:

You are probably better off using Hpricot's CSS selectors instead of XPath. _why was talking about possibly deprecating XPath at one point.

Do you have a better example of the data? Does it use ids or classes that are easy to reference?

It's much easier to search like:

doc.search("#id_tag > table > tr.class_tag > td").each do |aaa|
    aaa.search("blah > blah").each do |bbb|
        puts bbb.inner_html
    end
end

There was an older page on _why's website (which I can't seem to find now) that discussed Hpricot, and some of the comments hinted that the CSS syntax was a better choice than XPath for nested searches like the one you are doing.

Wish I could give a better answer, but I seriously recommend giving the CSS method a shot and seeing how it goes before tearing your hair out with XPath.

bojo 2009-04-10 06:43:09

Answer 2

+1 A:

I'm now using CSS selectors, and I work them out with this great tool: www.selectorgadget.com

2009-04-30 10:44:11

Answer 3

+1 A:

It's probably worth noting that Nokogiri uses the same API as Hpricot, but also supports XPath expressions.

Mike Dalessio 2009-05-11 05:21:50

Answer 4

+3 A:

Your problem is in XPather (or Firebug's XPath). I think Firefox internally fixes up badly formatted tables by adding a tbody element even when the HTML has none. Nokogiri does not do that; it allows a tr tag to sit directly inside table.

So there's a good chance that, to Nokogiri, your path looks like this:

/html/body/div/table/tr/td/table/tr[2]/td/table/tr/td[2]/table/tr[3]/td/table[3]/tr

That is the form Nokogiri will accept :)
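If the path was copied from Firebug or XPather, the tbody steps can be removed mechanically. Below is a minimal sketch in plain Ruby; `strip_tbody` is a hypothetical helper name, not part of Nokogiri or Hpricot, and the sample path is made up for illustration. It assumes the original HTML really has no tbody elements. If the page does contain explicit tbody tags, stripping them would break the path instead of fixing it.

```ruby
# Hypothetical helper: drop every tbody step (with or without a positional
# predicate such as tbody[1]) from an absolute XPath copied out of Firebug.
def strip_tbody(xpath)
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

# Illustrative path only; substitute the one your browser tool produced.
firebug_path = '/html/body/div/table/tbody/tr/td/table/tbody/tr[2]/td'
puts strip_tbody(firebug_path)
```

The cleaned string can then be handed to `doc.xpath` in the usual way.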

You might also want to check out this:

require 'open-uri'
require 'nokogiri'

class String
  def relative_to(base)
    (base == self[0..base.length-1]) &&
      self[base.length..-1]
  end
end

module Importer
  module XUtils
    module_function

    def match(text, source)
      case text
      when String
        source.include? text
      when Regexp
        text.match(source)
      when Array
        text.all? {|tt| source.include?(tt)}
      else
        false
      end
    end

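    # Walks down from +start+, repeatedly stepping into the first child whose
    # HTML or text matches every entry in +texts+. Returns the deepest matching
    # XPath, or false if nothing below +start+ matched.
    def find_xpath(doc, start, texts)
      texts = Array(texts) # accept a single String or Regexp as well as an Array
      xpath = start
      found = true

      while found
        found = [:inner_html, :inner_text].any? do |m|
          doc.xpath(xpath + "/*").any? do |tag|
            # Normalize non-breaking spaces to plain spaces before matching
            tag_text = tag.send(m).strip.gsub(/\u00A0+/, ' ')
            if texts.all? { |text| match(text, tag_text) }
              xpath = tag.path.to_s
            end
          end
        end
      end

      (xpath != start) && xpath
    end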

    def fetch(url)
      # URI.open comes from open-uri; plain Kernel#open no longer accepts URLs in Ruby 3
      Nokogiri::HTML(URI.open(url).read)
    end
  end
end

I wrote this little module to help me work with Nokogiri when web scraping and data mining.

Basic usage:

 include Importer::XUtils
 doc = fetch("http://some.url.here") # http:// is important!

 # When you provide an array, it finds the element containing ALL of the strings
 base = find_xpath(doc, '/html/body', ["what to find1", "What to find 2"])

 precise = find_xpath(doc, base, ["what to find1"])
 precise.relative_to base
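To make the last line of the usage concrete, here is a small standalone sketch of what the relative_to step yields. `relative_path` is a hypothetical plain-method stand-in for the String#relative_to patch above, and both paths are invented for illustration rather than taken from a real page.

```ruby
# Hypothetical standalone version of the String#relative_to idea:
# returns the part of +full+ that follows +base+, or false when
# +full+ does not start with +base+.
def relative_path(full, base)
  full.start_with?(base) && full[base.length..-1]
end

# Illustrative values standing in for the results of find_xpath.
base    = '/html/body/div/table'
precise = '/html/body/div/table/tr[2]/td'

puts relative_path(precise, base)
```

Keeping the base and the relative part separate lets the same relative path be reused under other base nodes that share the same layout.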

Good luck

2009-06-09 12:21:38

Answer 5

A:

There is no TBODY tag in your HTML code. Firebug shows one because the browser adds it automatically when it builds the DOM.

Kom2002 2010-08-22 00:43:44


hpricot with firebug's XPath
