views: 309
answers: 1
Currently, I'm using LIBXML::SAXParser::Callbacks to parse a large XML file containing data for 140,000 products. I'm using a rake task to import the data for these products into my Rails app.

My last import took just under 10 hours to complete:

rake asi:import_products --trace  26815.23s user 1393.03s system 80% cpu 9:47:34.09 total

The problem with the current implementation is that the complex dependency structure in the XML means I need to keep track of the entire product node to know how to parse it properly.

Ideally, I'd like a way to process each product node by itself and have the ability to use XPath on it. The file size prevents us from using any method that requires loading the entire XML file into memory. I cannot control the format or size of the original XML, and I have at most 3 GB of memory I can use on the process.
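For illustration, here's roughly the shape of what I'm after, sketched with Nokogiri's pull parser (element names like Product and Name are stand-ins, since my actual snippet isn't shown here): the file is streamed, but each product is expanded one at a time into a small in-memory document where full XPath is available.

    require 'nokogiri'

    File.open('products.xml') do |io|
      Nokogiri::XML::Reader(io).each do |node|
        # Only react to the opening tag of each product element.
        next unless node.name == 'Product' &&
                    node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

        # Expand just this product into its own small document.
        # Full XPath works here, but memory stays bounded because
        # only one product is held at a time.
        product = Nokogiri::XML(node.outer_xml)
        name  = product.at_xpath('/Product/Name')&.text
        price = product.at_xpath('/Product/Pricing/Base')&.text
        # ... hand off to the existing import logic ...
      end
    end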

Is there a better way than my current implementation?

Current Rake Task code:

Snippet of the XML file:

+1  A: 

Can you fetch the whole file first? If so, then I'd suggest splitting the XML file into smaller chunks (say, 512 MB or so) so you could parse several chunks at one time (one per core), since I believe you have a modern multi-core CPU. Regarding the invalid or malformed XML at the chunk boundaries - just append or prepend the missing XML with simple string manipulation.
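Something like this rough sketch (the tag names, file names, chunk size, and one-element-per-line layout are all placeholders for your actual schema):

    # Split the big file on product boundaries and repair each chunk
    # by re-adding the XML declaration and root wrapper it lost.
    HEADER    = %(<?xml version="1.0"?>\n<Products>\n)
    FOOTER    = %(</Products>\n)
    MAX_BYTES = 512 * 1024 * 1024  # ~512 MB per chunk

    index, bytes = 0, 0
    out = File.open("chunk_#{index}.xml", 'w')
    out.write(HEADER)

    File.foreach('products.xml') do |line|
      next if line =~ /<\?xml|<\/?Products>/  # skip the original wrapper
      out.write(line)
      bytes += line.bytesize
      # Only cut a chunk once the current product element is closed.
      if bytes >= MAX_BYTES && line.include?('</Product>')
        out.write(FOOTER)
        out.close
        index += 1
        bytes = 0
        out = File.open("chunk_#{index}.xml", 'w')
        out.write(HEADER)
      end
    end

    out.write(FOOTER)
    out.close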

You can also try profiling your callback method. It's a big chunk of code, and I'm pretty sure there is at least one bottleneck in there whose removal could save you a few minutes.
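For example, with the ruby-prof gem (ImportProducts.parse_file is a placeholder for your real entry point; profile a small sample file, not all 140k products):

    require 'ruby-prof'

    result = RubyProf.profile do
      ImportProducts.parse_file('sample_products.xml')
    end

    # Flat report of where the time actually goes in the callbacks.
    printer = RubyProf::FlatPrinter.new(result)
    printer.print($stdout, min_percent: 1)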

Eimantas
Yes, the code isn't pretty, but the speed is only a minor issue. The big issue is handling the dependencies within some of the pricing and criteria in the XML. Since it is just a big list of independent products, though, I could potentially split the file up a bit and process multiple files at a time (see the sketch below). That isn't a bad idea.
DBruns
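A rough sketch of the "one per core" processing over the split files, with import_file standing in for the existing per-file import logic and the worker count chosen by assumption:

    # Run one child process per chunk, a few at a time.
    CONCURRENCY = 4  # roughly one worker per core

    Dir.glob('chunk_*.xml').each_slice(CONCURRENCY) do |batch|
      pids = batch.map do |path|
        Process.fork { import_file(path) }
      end
      pids.each { |pid| Process.waitpid(pid) }
    end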