I have a huge XML(>400MB) containing products. Using a DOM parser is therefore excluded, so i tried to parse and process it using a pull parser. Below is a snippet from the each_product(&block)
method where i iterate over the product list.
Basically, using a stack, i transform each <product> ... </product>
node into a hash and process it.
while (reader.read)
case reader.node_type
#start element
when Nokogiri::XML::Node::ELEMENT_NODE
elem_name = reader.name.to_s
stack.push([elem_name, {}])
#text element
when Nokogiri::XML::Node::TEXT_NODE, Nokogiri::XML::Node::CDATA_SECTION_NODE
stack.last[1] = reader.value
#end element
when Nokogiri::XML::Node::ELEMENT_DECL
return if stack.empty?
elem = stack.pop
parent = stack.last
if parent.nil?
yield(elem[1])
elem = nil
next
end
key = elem[0]
parent_childs = parent[1]
# ...
parent_childs[key] = elem[1]
end
The issue is on self-closing tags (EG <country/>
), as i can not make the difference between a 'normal' and a 'self-closing' tag. They both are of type Nokogiri::XML::Node::ELEMENT_NODE
and i am not able to find any other discriminator in the documentation.
Any ideas on how to solve this issue?