tags:
views: 74
answers: 3

I'm trying to parse a large file (> 2 GB) of structured markup data, and there isn't enough memory to do it. Which XML parsing class is best suited for this situation? More details, please.

A: 

Most DOM libraries, like ElementTree, build the entire document model in memory. Traditionally, when your model is too large to fit into memory at once, you need to use a more stream-oriented parser such as xml.sax.

This is often harder than you expect it to be, especially if you're used to higher-order operations like dealing with the entire DOM at once.

Is it possible that your XML document is rather simple, like

<entries>
  <entry>...</entry>
  <entry>...</entry>
</entries>

which would allow you to work on subsets of the data in a more ElementTree-friendly manner?
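
If so, a SAX handler only ever has to hold one entry in memory at a time. A minimal sketch, assuming the <entries>/<entry> layout above; the EntryHandler class, the process_entry helper, and the file name big.xml are all illustrative:

import xml.sax

class EntryHandler(xml.sax.ContentHandler):
    """Collect the character data of each <entry> and hand it off as soon
    as the element closes, so only one entry is ever held in memory."""

    def __init__(self):
        super().__init__()
        self._in_entry = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "entry":
            self._in_entry = True
            self._chunks = []

    def characters(self, content):
        if self._in_entry:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "entry":
            self._in_entry = False
            process_entry("".join(self._chunks))

def process_entry(text):
    # Placeholder: do whatever you need with one entry's text.
    print(text[:40])

xml.sax.parse("big.xml", EntryHandler())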

msw
thank you very much.
zhangwf
A: 

The only API I've seen that can handle this sort of thing at all is pulldom:

http://docs.python.org/library/xml.dom.pulldom.html

Pulldom uses the SAX API to build partial DOM nodes; by pulling in specific sub-trees as a group and then discarding them when you're done, you get the memory efficiency of SAX with the sanity of the DOM.
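
A minimal sketch of that pull pattern, assuming the same <entry> records as in the first answer (the tag name and the file name big.xml are placeholders):

from xml.dom import pulldom

# Stream parse events; nothing is turned into a tree up front.
events = pulldom.parse("big.xml")

for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "entry":
        # Expand just this sub-tree into an ordinary DOM node,
        # use it, then let it be garbage-collected.
        events.expandNode(node)
        print(node.toxml()[:40])  # placeholder for real processing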

It's an incomplete API; when I used it I had to modify it to make it fully usable, but it works as a foundation. I don't use it anymore, so I don't recall what I had to add; just an advance warning.

It's very slow.

XML is a very poor format for handling large data sets. If you have any control over the source data, and if it makes sense for the data set, you're much better off breaking the data apart into smaller chunks that you can parse entirely into memory.

The other option is using SAX APIs, but they're a serious pain to do anything nontrivial with directly.

Glenn Maynard
A: 

Check out the iterparse() function. A description of how you can use it to parse very large documents can be found here.
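
A minimal sketch of the usual incremental pattern with iterparse(); the tag name entry and the file name big.xml are assumptions about your data:

import xml.etree.ElementTree as ET

def process(elem):
    # Placeholder: handle one complete record.
    print(elem.tag, len(elem))

context = iter(ET.iterparse("big.xml", events=("start", "end")))
_, root = next(context)            # the first "start" event gives us the root element

for event, elem in context:
    if event == "end" and elem.tag == "entry":
        process(elem)
        root.clear()               # discard finished children so memory stays flat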

Steven