I have a huge XML file (>400 MB) containing products. A DOM parser is therefore out of the question, so I tried to parse and process it with a pull parser. Below is a snippet from the each_product(&block) method, where I iterate over the product list.

Basically, using a stack, I transform each <product> ... </product> node into a hash and process it.

while reader.read
  case reader.node_type
  # start element
  when Nokogiri::XML::Node::ELEMENT_NODE
    elem_name = reader.name.to_s
    stack.push([elem_name, {}])

  # text or CDATA content
  when Nokogiri::XML::Node::TEXT_NODE, Nokogiri::XML::Node::CDATA_SECTION_NODE
    stack.last[1] = reader.value

  # end element (ELEMENT_DECL has the same value, 15, as Nokogiri::XML::Reader::TYPE_END_ELEMENT)
  when Nokogiri::XML::Node::ELEMENT_DECL
    return if stack.empty?

    elem = stack.pop
    parent = stack.last
    if parent.nil?
      # no parent left: the element is complete, hand it to the block
      yield(elem[1])
      elem = nil
      next
    end

    key = elem[0]
    parent_childs = parent[1]
    # ...
    parent_childs[key] = elem[1]
  end
end

The problem is self-closing tags (e.g. <country/>): I cannot tell a 'normal' tag from a 'self-closing' one. Both are reported as Nokogiri::XML::Node::ELEMENT_NODE, and I cannot find any other discriminator in the documentation. Since a self-closing tag produces no end-element event, it is pushed onto the stack but never popped.

Any ideas on how to solve this issue?

A: 

There is a feature request on the project page regarding this issue (with the corresponding failing test).

Until it is fixed and released in the current version, we'll stick with the good ol'

input_text.gsub! /<([^<>]+)\/>/, '<\1></\1>'
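
One caveat with the one-liner above: if a self-closing tag carries attributes, the whole tag content (attributes included) ends up in the generated closing tag, which is not well-formed. Below is a minimal sketch of an attribute-aware variant. The helper name expand_self_closing is hypothetical, and the pattern assumes that attribute values contain no '<' or '>' and that no '/>' sequence appears inside comments or CDATA sections.

```ruby
# Hypothetical helper (not part of Nokogiri): expand self-closing tags into
# explicit open/close pairs, keeping attributes out of the closing tag.
def expand_self_closing(xml)
  # \1 = tag name, \2 = attributes (possibly empty)
  xml.gsub(%r{<([^\s<>/]+)([^<>]*?)\s*/>}, '<\1\2></\1>')
end

puts expand_self_closing('<product><country/><price currency="EUR" /></product>')
# prints: <product><country></country><price currency="EUR"></price></product>
```

Like the original substitution, this still needs the whole document as one string in memory, so for a file of this size it would have to be applied chunk by chunk.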

Vlad Zloteanu

A: 

Hey Vlad, I am not a Nokogiri expert, but I ran a test and saw that the self_closing?() method correctly detects self-closing tags. Give it a try.

P.S. : I know this is an old post :P / the documentation is HERE

Gerasimos