views: 164
answers: 3

We need to import a large amount of data (about 5 million records) into the PostgreSQL database under a Rails application. The data will be provided in XML format with images inside it encoded in Base64.

The estimated size of the XML file is 40GB. What XML parser can handle such an amount of data in Ruby?

Thanks.

+3  A: 

You'll want to use some kind of SAX parser. SAX parsers do not load everything into memory at once.

I don't know about Ruby parsers, but some quick googling turned up this blog post. You could start digging from there.

You could also try to split the XML file into smaller pieces to make it more manageable.
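
To make the idea concrete, here is a rough sketch of a streaming parse using Nokogiri's SAX API; the <record> and <image> element names and the ImportQueue sink are placeholders, since the real feed layout isn't shown in the question:

    require 'nokogiri'
    require 'base64'

    # SAX handler that streams one <record> at a time instead of building
    # a 40GB DOM in memory. Element names are assumptions about the feed.
    class RecordHandler < Nokogiri::XML::SAX::Document
      def initialize(&block)
        @on_record = block
      end

      def start_element(name, attrs = [])
        if name == 'record'
          @record = {}
        elsif @record
          @buffer = +''              # start collecting text for a child element
        end
      end

      def characters(string)
        @buffer << string if @buffer
      end

      def end_element(name)
        return unless @record
        if name == 'record'
          @on_record.call(@record)   # hand the finished record to the caller
          @record = nil
        elsif name == 'image'
          @record['image'] = Base64.decode64(@buffer)
        else
          @record[name] = @buffer
        end
        @buffer = nil
      end
    end

    handler = RecordHandler.new { |record| ImportQueue.push(record) } # ImportQueue is hypothetical
    Nokogiri::XML::SAX::Parser.new(handler).parse_file('huge_dump.xml')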

Juha Syrjälä
+1 for SAX parser. REXML works as a SAX parser; however, you might want to use a more performant library such as Nokogiri's SAX parser: http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/SAX/Parser.html
Simone Carletti
+1  A: 

You should use an XML SAX parser, as Juha said. LibXML is the fastest XML library for Ruby, I think.
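
For reference, the libxml-ruby SAX interface looks roughly like this (the callback names follow the XML::SaxParser::Callbacks module as far as I recall, so check the gem's docs; the element names and ImportQueue are again placeholders):

    require 'libxml'

    class RecordCallbacks
      include LibXML::XML::SaxParser::Callbacks

      def on_start_element(name, attributes)
        if name == 'record'           # '<record>' is an assumed element name
          @record = {}
        elsif @record
          @buffer = +''
        end
      end

      def on_characters(chars)
        @buffer << chars if @buffer
      end

      def on_end_element(name)
        return unless @record
        if name == 'record'
          ImportQueue.push(@record)   # hypothetical downstream processing
          @record = nil
        else
          @record[name] = @buffer
        end
        @buffer = nil
      end
    end

    parser = LibXML::XML::SaxParser.file('huge_dump.xml')
    parser.callbacks = RecordCallbacks.new
    parser.parse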

Sebastian
+1  A: 

You could convert the data to CSV and then load it into your database using your DBMS's CSV loading capabilities (LOAD DATA INFILE for MySQL, COPY for PostgreSQL). I would not use anything built in Ruby to load a 40GB file; it's not too good with memory. Best left to the "professionals".
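
If the CSV ends up being generated from Ruby anyway (see the comments below), the pg gem can stream rows straight into PostgreSQL's COPY without writing an intermediate file. A rough sketch, with the table, columns and the each_parsed_record helper invented for the example:

    require 'pg'
    require 'csv'

    conn = PG.connect(dbname: 'app_production')

    # COPY ... FROM STDIN keeps the bulk load inside PostgreSQL instead of
    # issuing millions of individual INSERTs. Table/column names are made up.
    conn.copy_data('COPY records (name, image) FROM STDIN WITH (FORMAT csv)') do
      each_parsed_record do |record|   # hypothetical: yields hashes from the SAX parser
        conn.put_copy_data(CSV.generate_line([record['name'], record['image']]))
      end
    end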

Ryan Bigg
You still need an XML parser for the XML->CSV conversion.
Juha Syrjälä
Unfortunately, I have to use Ruby for that because each record should pass through the application logic: e.g. validation, counter updates, Solr indexing, and other application-specific callbacks.
Bogdan Gusiev
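
In that case, one option is to keep the streaming parser but push each record through ActiveRecord in batched transactions, so validations and callbacks still run without paying for five million single-row transactions. A sketch inside the Rails app, with a placeholder Record model and the hypothetical each_parsed_record helper from above:

    BATCH_SIZE = 1_000
    batch = []

    each_parsed_record do |attrs|             # hypothetical: yields hashes from the SAX parser
      batch << attrs
      next if batch.size < BATCH_SIZE

      # one transaction per batch: validations, counter updates and other
      # callbacks still fire via create!, but commits are far less frequent
      ActiveRecord::Base.transaction do
        batch.each { |a| Record.create!(a) }  # Record is a placeholder model name
      end
      batch.clear
    end

    unless batch.empty?                       # flush the final partial batch
      ActiveRecord::Base.transaction do
        batch.each { |a| Record.create!(a) }
      end
    end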