Currently, I'm using LIBXML::SAXParser::Callbacks to parse a large XML file containing data 140,000 products. I'm using a task to import the data for these products into my rails app.
My last import took just under 10 hours to complete:
rake asi:import_products --trace 26815.23s user 1393.03s system 80% cpu 9:47:34.09 total
The problem with the current implementation is that the complex dependency structure in the XML means, I need to keep track of the entire product node to know how to parse it properly.
Ideally, I'd like a way that I could process each product node by itself and have the ability to use XPATH, the file size restricts us from using a method that requires loading the entire XML file into memory. I cannot control the format or size of original XML. I have at most, 3GB worth of memory I can use on the process.
Is there a better way than this?