tags:
views: 107
answers: 5

Does anyone know of a memory-efficient way to generate very large XML files (e.g. 100-500 MiB) in Python?

I've been using lxml, but memory usage is through the roof.

+1  A: 

The only sane way to generate so large an XML file is line by line, which means printing while running a state machine, and lots of testing.
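For illustration, a minimal sketch of that line-by-line style, assuming the records are simple dicts of host metadata (the field names here are invented):

from xml.sax.saxutils import escape, quoteattr

def write_hosts(records, path):
    """Stream one <host> element per record; no tree is ever built."""
    with open(path, 'w') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<hosts>\n')
        for rec in records:
            # quoteattr()/escape() guard against markup characters in the data
            f.write('  <host name=%s>%s</host>\n'
                    % (quoteattr(rec['name']), escape(rec['value'])))
        f.write('</hosts>\n')

Only one record is formatted at a time, so memory stays flat regardless of the output size.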

Ignacio Vazquez-Abrams
Thanks for the answer. SAX ftl.. argh...
Kenneth Reitz
A: 

How complicated is the output document? Where are you getting the input data from? How complicated are the input-to-output transformations?

John Machin
It's relatively simple data inside. There's just a LOT of it. Host metadata.
Kenneth Reitz
+1  A: 

Obviously, you've got to avoid having to build the entire tree (whether DOM or etree or whatever) in memory. But the best way depends on the source of your data and how complicated and interlinked the structure of your output is.

If it's big because it's got thousands of instances of fairly independent items, then you can generate the outer wrapper, and then build trees for each item and then serialize each fragment to the output.
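A rough sketch of that wrapper-plus-fragments idea using lxml itself (the <host> fields are invented for illustration); only one small per-item tree exists in memory at a time:

from lxml import etree

def write_items(items, path):
    with open(path, 'wb') as f:
        f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n<hosts>\n')
        for item in items:
            # Build a tiny tree for one item, serialize it, then let it go.
            host = etree.Element('host', name=item['name'])
            addr = etree.SubElement(host, 'address')
            addr.text = item['address']
            f.write(etree.tostring(host) + b'\n')
        f.write(b'</hosts>\n')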

If the fragments aren't so independent, then you'll need to do some extra bookkeeping -- like maybe manage a database of generated ids & idrefs.

I would break it into 2 or 3 parts: a SAX event producer, an output serializer eating SAX events, and optionally, if it seems easier to work with some independent pieces as objects or trees, something to build those objects and then turn them into SAX events for the serializer.
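If you go the SAX route, the standard library's xml.sax.saxutils.XMLGenerator can play the role of the serializer that eats the events; a hedged sketch, with placeholder element names and made-up item fields:

from xml.sax.saxutils import XMLGenerator

def produce_events(items, handler):
    # The producer: turn each item into SAX events on the handler.
    handler.startDocument()
    handler.startElement('hosts', {})
    for item in items:
        handler.startElement('host', {'name': item['name']})
        handler.characters(item['value'])
        handler.endElement('host')
    handler.endElement('hosts')
    handler.endDocument()

items = ({'name': 'host%d' % i, 'value': 'up'} for i in range(5))
with open('output.xml', 'w') as f:
    # The serializer: XMLGenerator writes each event straight to the file.
    produce_events(items, XMLGenerator(f, encoding='utf-8'))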

Maybe you could just manage it all as direct text output, instead of dealing with SAX events: that depends on how complicated it is.

This may also be a good place to use python generators as a way of streaming the output without having to build large structures in memory.

Steven D. Majewski
Hmm, yes it will have thousands of instances of fairly independent items. I think I might use that approach. Thanks.
Kenneth Reitz
The templating approach suggested above is probably easier -- especially if the separate objects/chunks are small enough to handle. Running the SAX producer/consumer pair as co-routines using Python generators would be the alternative if things didn't easily break into small enough chunks.
Steven D. Majewski
A: 

If your document is very regular (such as a bunch of database records, all in the same format) you could use my own "xe" library.

http://home.avvanta.com/~steveha/xe.html

The xe library was designed for generating syndication feeds (Atom, RSS, etc.) and I think it is easy to use. I need to update it for Python 2.6, and I haven't yet, sorry about that.

steveha
Very nice. Is this more memory efficient than using lxml?
Kenneth Reitz
Not sure. I haven't looked at lxml yet. If you are holding all your data at once, then you are holding at least 500 MiB no matter what. But if you are writing stuff in a loop, you could re-use the same xe data structures, which would save memory. But maybe you could do the same thing with lxml?
steveha
+2  A: 

Perhaps you could use a templating engine instead of generating/building the XML yourself?

Genshi, for example, is XML-based and supports streaming output. A very basic example:

from genshi.template import MarkupTemplate

tpl_xml = '''
<doc xmlns:py="http://genshi.edgewall.org/">
<p py:for="i in data">${i}</p>
</doc>
'''

tpl = MarkupTemplate(tpl_xml)
stream = tpl.generate(data=xrange(10000000))

with open('output.xml', 'w') as f:
    stream.render(out=f)

It might take a while, but memory usage remains low.

The same example for the Mako templating engine (not "natively" XML), but a lot faster:

from mako.template import Template
from mako.runtime import Context

tpl_xml = '''
<doc>
% for i in data:
<p>${i}</p>
% endfor
</doc>
'''

tpl = Template(tpl_xml)

with open('output.xml', 'w') as f:
    ctx = Context(f, data=xrange(10000000))
    tpl.render_context(ctx)

The last example ran on my laptop in about 20 seconds, producing an (admittedly very simple) 151 MB XML file with no memory problems at all (according to Windows Task Manager, usage remained constant at about 10 MB).

Depending on your needs, this might be a friendlier and faster way of generating XML than using SAX, etc. Check out the docs to see what you can do with these engines (there are others as well; I just picked these two as examples).

Steven
Fantastic idea! This is definitely one worth considering.
Kenneth Reitz