I am trying to parse a Wiktionary entry to retrieve all English definitions. I can retrieve all definitions, but some of them belong to other languages. What I would like to do is retrieve only the HTML block that contains the English definitions. I have found that, when there are entries for other languages, the header that follows the English definitions can be retrieved with:

header = (doc/"h2")[3]

So I would like to search only the elements before this header element. I thought that might be possible with header.preceding_siblings(), but that does not seem to work. Any suggestions?

+2  A: 

You can use the visitor pattern with Nokogiri. This code removes everything from the next language's h2 onward, in document order:

require 'nokogiri'
require 'open-uri'

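# Walks the tree in document order and removes the target node
# together with every node visited after it.
class Visitor
  def initialize(node)
    @node = node  # first node to remove
  end

  def visit(node)
    # Once the target has been reached, every later node is removed as well.
    if @remove || @node == node
      node.remove
      @remove = true
      return
    end
    # Descend into the children; accept calls visit(child) on this visitor.
    node.children.each do |child|
      child.accept(self)
    end
  end
end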

# URI.open is needed on Ruby 3.0+, where open-uri no longer patches Kernel#open.
doc = Nokogiri::XML.parse(URI.open('http://en.wiktionary.org/wiki/pony'))
node = doc.search("h2")[2]  # In this case, the Italian h2 is at index 2. Your page may differ.

doc.root.accept(Visitor.new(node))  # Removes all page contents starting from node
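
To see what the traversal keeps and drops without fetching a page, here is a minimal sketch of the same "cut everything from this node onward" idea on a toy tree. The Node struct, the prune_from helper and the tag names are hypothetical stand-ins for illustration, not part of Nokogiri, and the sketch builds a pruned copy instead of removing nodes in place:

```ruby
# Hypothetical stand-in for a parsed node: a name plus child nodes.
Node = Struct.new(:name, :children)

# Returns a pruned copy of the tree that keeps only what precedes
# `target` in document order. Ancestors of `target` are kept.
def prune_from(node, target, state = { cut: false })
  # equal? compares identity, so two look-alike nodes are not confused.
  if state[:cut] || node.equal?(target)
    state[:cut] = true
    return nil
  end
  kept = node.children.map { |child| prune_from(child, target, state) }.compact
  Node.new(node.name, kept)
end

italian = Node.new("h2-italian", [])
body = Node.new("body", [
  Node.new("h2-english", []),
  Node.new("ol", [Node.new("li", [])]),
  italian,
  Node.new("ol", [Node.new("li", [])])
])

pruned = prune_from(body, italian)
puts pruned.children.map(&:name).inspect
```

The English header and its list survive, while the Italian header and everything after it are dropped, which is the effect the visitor above aims for on the real document.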
Pesto
+1  A: 

The following code uses Hpricot.
It gets the text from the English language header (h2) up to the next header (h2), or up to the footer if there are no further languages:

require 'hpricot'
require 'open-uri'

def get_english_definition(url)
  doc = Hpricot(open(url))

  span = doc.at('h2/span[@class="mw-headline"][text()=English]')
  english_header = span && span.parent
  return nil unless english_header

  next_header_or_footer =
    Hpricot::Elements[*english_header.following_siblings].at('h2') ||
    doc.at('[@class="printfooter"]')

  Hpricot::Elements.expand(english_header.next_node,
                           next_header_or_footer.previous_node).to_s
end

Example:

get_english_definition "http://en.wiktionary.org/wiki/gift"
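
The slicing logic itself does not depend on the parser. As a rough sketch, assuming the page's top-level elements have already been flattened into a list of siblings, it could look like the following. The Sibling struct, the section_for helper and the sample data are made up for illustration:

```ruby
# Hypothetical flat list entry: one top-level element of the page.
Sibling = Struct.new(:tag, :text)

# Returns the siblings between the h2 for `language` and the next h2,
# or up to the end of the list if no further h2 follows.
def section_for(siblings, language)
  start = siblings.index { |s| s.tag == "h2" && s.text == language }
  return nil unless start
  siblings.drop(start + 1).take_while { |s| s.tag != "h2" }
end

page = [
  Sibling.new("h2", "English"),
  Sibling.new("h3", "Noun"),
  Sibling.new("ol", "a present"),
  Sibling.new("h2", "German"),
  Sibling.new("ol", "poison")
]

puts section_for(page, "English").map(&:text).inspect
```

Running to the end of the list when no later h2 exists plays the same role as the printfooter fallback in the Hpricot version.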
andre-r