ansaurus

Question

Searching all elements before an h2 element in hpricot/nokogiri

Answer 1

+2 A:

You can use the visitor pattern with Nokogiri. The code below removes a given h2 (here, the heading of the next language section) and everything after it in the document, so that only the content before that h2 remains:

require 'nokogiri'
require 'open-uri'

class Visitor
  def initialize(node)
    @node = node
  end

  def visit(node)
    # Once the target node is reached, remove it and every node visited afterwards.
    if @remove || @node == node
      node.remove
      @remove = true
      return
    end
    node.children.each do |child|
      child.accept(self)
    end
  end
end

# Parse the page as HTML. URI.open is used because Kernel#open stopped accepting URLs in Ruby 3.0.
doc = Nokogiri::HTML(URI.open('http://en.wiktionary.org/wiki/pony'))
node = doc.search("h2")[2]  # In this case, the Italian h2 is at index 2. Your page may differ.

doc.root.accept(Visitor.new(node))  # Removes all page contents starting from node

Pesto 2009-09-22 14:53:23

Answer 2

+1 A:

The following code uses Hpricot.
It returns the markup between the header (h2) for the English language and the next header (h2), or the footer if there are no further languages:

require 'hpricot'
require 'open-uri'

def get_english_definition(url)
  doc = Hpricot(open(url))

  span = doc.at('h2/span[@class="mw-headline"][text()=English]')
  english_header = span && span.parent
  return nil unless english_header

  next_header_or_footer =
    Hpricot::Elements[*english_header.following_siblings].at('h2') ||
    doc.at('[@class="printfooter"]')

  Hpricot::Elements.expand(english_header.next_node,
                           next_header_or_footer.previous_node).to_s
end

Example:

get_english_definition "http://en.wiktionary.org/wiki/gift"

andre-r 2009-09-23 01:56:42
