Does anyone know of a memory efficient way to generate very large xml files (e.g. 100-500 MiB) in Python?
I've been utilizing lxml, but memory usage is through the roof.
The only sane way to generate a file that large is line by line, which means printing as you go while running a state machine, plus lots of testing.
How complicated is the output document? Where are you getting the input data from? How complicated are the input-to-output transformations?
Obviously, you've got to avoid having to build the entire tree (whether DOM or etree or whatever) in memory. But the best way depends on the source of your data and how complicated and interlinked the structure of your output is.
If it's big because it's got thousands of instances of fairly independent items, then you can generate the outer wrapper, and then build trees for each item and then serialize each fragment to the output.
If the fragments aren't so independent, then you'll need to do some extra bookkeeping -- like maybe manage a database of generated ids & idrefs.
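As a minimal sketch of the fragment-at-a-time approach, using the standard library's xml.etree.ElementTree (the file name, element names, and record fields here are made up for illustration): write the outer wrapper by hand, then build and serialize one small tree per item.

```python
import xml.etree.ElementTree as ET

def write_records(path, records):
    # Write the outer wrapper manually, then serialize each record as
    # an independent fragment, so only one small tree exists at a time.
    with open(path, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="utf-8"?>\n<records>\n')
        for rec in records:
            item = ET.Element("item", id=str(rec["id"]))
            ET.SubElement(item, "name").text = rec["name"]
            f.write(ET.tostring(item) + b"\n")
        f.write(b"</records>\n")

# records can be any lazy iterable (e.g. a database cursor), so the
# full data set never has to fit in memory at once.
write_records("records.xml", ({"id": i, "name": "row %d" % i} for i in range(3)))
```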
I would break it into 2 or 3 parts: a sax event producer, an output serializer eating sax events, and optionally, if it seems easier to work with some independent pieces as objects or trees, something to build those objects and then turn them into sax events for the serializer.
Maybe you could just manage it all as direct text output, instead of dealing with sax events: that depends on how complicated it is.
This may also be a good place to use python generators as a way of streaming the output without having to build large structures in memory.
If your document is very regular (such as a bunch of database records, all in the same format) you could use my own "xe" library.
http://home.avvanta.com/~steveha/xe.html
The xe library was designed for generating syndication feeds (Atom, RSS, etc.) and I think it is easy to use. I need to update it for Python 2.6, and I haven't yet, sorry about that.
Perhaps you could use a templating engine instead of generating/building the xml yourself?
Genshi for example is xml-based and supports streaming output. A very basic example:
from genshi.template import MarkupTemplate

tpl_xml = '''
<doc xmlns:py="http://genshi.edgewall.org/">
  <p py:for="i in data">${i}</p>
</doc>
'''

tpl = MarkupTemplate(tpl_xml)
stream = tpl.generate(data=xrange(10000000))  # use range() on Python 3

with open('output.xml', 'w') as f:
    stream.render(out=f)
It might take a while, but memory usage remains low.
The same example for the Mako templating engine (not "natively" xml), but a lot faster:
from mako.template import Template
from mako.runtime import Context

tpl_xml = '''
<doc>
% for i in data:
  <p>${i}</p>
% endfor
</doc>
'''

tpl = Template(tpl_xml)

with open('output.xml', 'w') as f:
    ctx = Context(f, data=xrange(10000000))  # use range() on Python 3
    tpl.render_context(ctx)
The last example ran on my laptop for about 20 seconds, producing an (admittedly very simple) 151 MB XML file with no memory problems at all (according to the Windows task manager it remained constant at about 10 MB).
Depending on your needs, this might be a friendlier and faster way of generating XML than using SAX etc. Check out the docs to see what you can do with these engines (there are others as well; I just picked these two as examples).