I'm trying to fill the variables parent_element_h1 and parent_element_h2. Can anyone help me use the Nokogiri gem to get the information I need into those variables?

require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='para-1'>A</p>
      <div class='block' id='X1'>
        <h1>Foo</h1>
        <p id='para-2'>B</p>
      </div>
      <p id='para-3'>C</p>
      <h2>Bar</h2>
      <p id='para-4'>D</p>
      <p id='para-5'>E</p>
      <div class='block' id='X2'>
        <p id='para-6'>F</p>
      </div>
    </body>
  </html>
HTML_END

parent = value.css('body').first

# start_here is given: a Nokogiri::XML::Element of the <div> with the id 'X2'
start_here = parent.at('div.block#X2')

# this should be a Nokogiri::XML::Element of the nearest, previous h1.
# in this example it's the one with the value 'Foo'
parent_element_h1 = 

# this should be a Nokogiri::XML::Element of the nearest, previous h2. 
# in this example it's the one with the value 'Bar'
parent_element_h2 =


PLEASE NOTE: The start_here element could be anywhere inside the document. The HTML shown here is just an example. In other words, the headers (<h1> and <h2>) could be siblings of start_here or children of a sibling of start_here.

The following recursive method is a good starting point, but it doesn't work on <h1> because it's a child of a sibling of start_here.

def search_element(_block,_style)
  unless _block.nil?
    if _block.name == _style
      return _block
    else
      search_element(_block.previous,_style)
    end
  else
    return false
  end
end

parent_element_h1 = search_element(start_here,'h1')
parent_element_h2 = search_element(start_here,'h2')

UPDATE: After accepting an answer, I came up with my own solution. Check it out (somewhere below), it works like a charm and I think it's pretty cool. :-)

+4  A: 

The approach I would take (if I understand your problem correctly) is to use XPath or CSS to find your start_here element and the parent element that you want to search under. Then recursively walk the tree starting at the parent, stopping when you hit the start_here element and holding onto the last element that matches your style along the way.

Something like:

parent = value.search("//body").first
div = value.search("//div[@id = 'X2']").first

find = FindPriorTo.new(div)

assert_equal('Foo', find.find_from(parent, 'h1').text)
assert_equal('Bar', find.find_from(parent, 'h2').text)

Where FindPriorTo is a simple class to handle the recursion.

class FindPriorTo
  def initialize(stop_element)
    @stop_element = stop_element
  end

  def find_from(parent, style)
    @should_stop = nil
    @last_style  = nil

    recursive_search(parent, style)
  end

  def recursive_search(parent, style)
    parent.children.each do |ch|
      recursive_search(ch, style)
      return @last_style if @should_stop

      @should_stop = (ch == @stop_element)
      @last_style = ch if ch.name == style
    end

    @last_style    
  end

end

If this approach isn't scalable enough, you might be able to optimize it by rewriting recursive_search to not use recursion. You could also pass in both of the styles you are looking for and keep track of the last match for each, so you don't have to traverse the tree a second time.

I'd also suggest monkey patching Node to hook in while the document is being parsed, but it looks like all of that is written in C. You might be better served by something other than Nokogiri that has a native Ruby SAX parser (maybe REXML), or, if speed is your real concern, by doing the search portion in C/C++ using Xerces or similar. I don't know how well these deal with parsing HTML, though.

Aaron Hinni
The problem is that I don't know whether the header is a sibling or a child of a sibling. Your solution assumes that I know which one it is. Besides that, my example data is much shorter than my real data: 'my_tag' can be anywhere inside the document.
Javier
You can use '//' instead of '/html/body/' or even '/html/body//div' in XPath when you're not sure of the sibling/child relationship. http://www.w3schools.com/Xpath/
Jweede
I think my question wasn't specific enough. I've edited it and hope it's now clear what I'm looking for (check the comments above the variables I'm trying to fill with data).
Javier
Thanks for your submission. It's a bit of a hack, but it works for this example. It won't work, though, if start_here is inside another div block. What I'm looking for is a way to fetch the nearest previous header, ignoring its position in the document hierarchy.
Javier
Yes, the prune was a bit of a hack. See if my edited answer is what you are looking for.
Aaron Hinni
Your current version is more or less equivalent to the hack I currently have. The problem is that this variant doesn't scale well to very large HTML files (I'm working with >50MB legal texts...), e.g. a large file where start_here is at the end.
Javier
You should mention the scalability requirement in your question, since you say there that the code snippet that uses recursion is a good start. I've updated my response with some other suggestions.
Aaron Hinni
A: 

If you don't know the relationship between elements, you can search for them this way (anywhere in the document):


# html code
text = "insert your html here"
# get doc object
doc = Nokogiri::HTML(text)
# get elements with the specified tag
elements = doc.search("//your_tag")

If, however, you need to submit a form, you should use mechanize:


# create mech object
mech = Mechanize.new # Mechanize releases before 1.0 used WWW::Mechanize.new
# load site
mech.get("address")
# select a form, in this case, I select the first form. You can select the one you need 
# from the array
form = mech.page.forms.first
# you fill the fields like this: form.name_of_the_field
form.element_name  = value
form.other_element = other_value
Geo
This doesn't solve my problem, but I've edited my question to be more specific. Please note the comment above the two variables I'm trying to fill.
Javier
In short: This wouldn't work because it would match more than just the nearest, previous h1 or h2.
Javier
A: 

You can search the descendants of a Nokogiri::XML::Element using CSS selectors. You can move up to its parent with the .parent method.

parent_element_h1 = value.css("h1").first.parent
parent_element_h2 = value.css("h2").first.parent
Sam C
This doesn't return the result I'm looking for. Please read the question again.
Javier
+1  A: 

Maybe this will do it. I'm not sure about the performance and if there might be some cases that I haven't thought of.

def find(root, start, tag)
  ps, res = start, nil
  until res or (ps == root)
    ps  = ps.previous || ps.parent
    res = ps.css(tag).last
    res ||= ps.name == tag ? ps : nil
  end
  res || "Not found!"
end

parent_element_h1 =  find(parent, start_here, 'h1')
sris
A: 

This is my own solution (kudos to my co-worker for helping me on this one!). It uses a recursive method to check all elements, regardless of whether they are siblings or children of a sibling.

require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='para-1'>A</p>
      <div class='block' id='X1'>
        <h1>Foo</h1>
        <p id='para-2'>B</p>
      </div>
      <p id='para-3'>C</p>
      <h2>Bar</h2>
      <p id='para-4'>D</p>
      <p id='para-5'>E</p>
      <div class='block' id='X2'>
        <p id='para-6'>F</p>
      </div>
    </body>
  </html>
HTML_END

parent = value.css('body').first

# start_here is given: a Nokogiri::XML::Element of the <div> with the id 'X2'
@start_here = parent.at('div.block#X2')

# Search for parent elements of kind "_style" starting from _start_element
def search_for_parent_element(_start_element, _style)
  unless _start_element.nil?
    # have we already found what we're looking for?
    if _start_element.name == _style
      return _start_element
    end
    # _start_element is a div.block, but not the @start_here element itself
    if _start_element[:class] == "block" && _start_element[:id] != @start_here[:id]
      # begin recursion with last child inside div.block
      from_child = search_for_parent_element(_start_element.children.last, _style)
      if(from_child)
        return from_child
      end
    end
    # begin recursion with previous element
    from_child = search_for_parent_element(_start_element.previous, _style) 
    return from_child ? from_child : false
  else
    return false
  end
end

# this should be a Nokogiri::XML::Element of the nearest, previous h1.
# in this example it's the one with the value 'Foo'
puts parent_element_h1 = search_for_parent_element(@start_here,"h1")

# this should be a Nokogiri::XML::Element of the nearest, previous h2. 
# in this example it's the one with the value 'Bar'
puts parent_element_h2 = search_for_parent_element(@start_here,"h2")

You can copy/paste it and run it as is as a Ruby script.

Javier