tags:
views: 74
answers: 3

I'm trying to parse a large file (> 2 GB) of structured markup data, and there isn't enough memory to do it. Which XML parsing class is best suited for this situation? More details, please.

A: 

Most DOM libraries, like ElementTree, build the entire document model in memory. Traditionally, when your model is too large to fit into memory at once, you need to use a more stream-oriented parser such as xml.sax.

This is often harder than you expect it to be, especially if you're used to higher-order operations like dealing with the entire DOM at once.

Is it possible that your XML document is rather simple, like

<entries>
  <entry>...</entry>
  <entry>...</entry>
</entries>

which would allow you to work on subsets of the data in a more ElementTree-friendly manner?
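
If so, a SAX handler only ever has to hold one entry in memory at a time. A minimal sketch, assuming the <entries>/<entry> layout above; the EntryHandler class, the process_entry helper, and the file name big.xml are all illustrative:

import xml.sax

class EntryHandler(xml.sax.ContentHandler):
    """Collect the character data of each <entry> and hand it off as soon
    as the element closes, so only one entry is ever held in memory."""

    def __init__(self):
        super().__init__()
        self._in_entry = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "entry":
            self._in_entry = True
            self._chunks = []

    def characters(self, content):
        if self._in_entry:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "entry":
            self._in_entry = False
            process_entry("".join(self._chunks))

def process_entry(text):
    # Placeholder: do whatever you need with one entry's text.
    print(text[:40])

xml.sax.parse("big.xml", EntryHandler())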

msw
thank you very much.
zhangwf
A: 

The only API I've seen that can handle this sort of thing at all is pulldom:

http://docs.python.org/library/xml.dom.pulldom.html

Pulldom uses the SAX API to build partial DOM nodes; by pulling in specific sub-trees as a group and then discarding them when you're done, you get the memory efficiency of SAX with the sanity of the DOM.
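
A minimal sketch of that pull pattern, assuming the same <entry> records as in the first answer (the tag name and the file name big.xml are placeholders):

from xml.dom import pulldom

# Stream parse events; nothing is turned into a tree up front.
events = pulldom.parse("big.xml")

for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "entry":
        # Expand just this sub-tree into an ordinary DOM node,
        # use it, then let it be garbage-collected.
        events.expandNode(node)
        print(node.toxml()[:40])  # placeholder for real processing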

It's an incomplete API; when I used it I had to modify it to make it fully usable, but it works as a foundation. I don't use it anymore, so I don't recall what I had to add; just an advance warning.

It's very slow.

XML is a very poor format for handling large data sets. If you have any control over the source data, and if it makes sense for the data set, you're much better off breaking the data apart into smaller chunks that you can parse entirely into memory.

The other option is using SAX APIs, but they're a serious pain to do anything nontrivial with directly.

Glenn Maynard
A: 

Check out the iterparse() function. A description of how you can use it to parse very large documents can be found here.
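
A minimal sketch of the usual incremental pattern with iterparse(); the tag name entry and the file name big.xml are assumptions about your data:

import xml.etree.ElementTree as ET

def process(elem):
    # Placeholder: handle one complete record.
    print(elem.tag, len(elem))

context = iter(ET.iterparse("big.xml", events=("start", "end")))
_, root = next(context)            # the first "start" event gives us the root element

for event, elem in context:
    if event == "end" and elem.tag == "entry":
        process(elem)
        root.clear()               # discard finished children so memory stays flat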

Steven