views:

81

answers:

1

Hey,

I am searching for a way to handle overloading the RAM and CPU in a high-memory program... I need to process a LARGE amount of data contained in files: I read the files and process the data therein. The problem is that there are many nested for loops, and a root XML file is being created from all the processed data. The program easily consumes a couple of gigabytes of RAM after half an hour or so of run-time. Is there something I can do to keep RAM usage from growing so large, or a way to work around it?

+3  A: 

Do you really need to keep all the data from the XML file in memory at once?

Most (all?) XML libraries out there allow you to do iterative parsing, meaning that you keep in memory just a few nodes of the XML file at a time, not the whole file. That is, unless you are building a string containing the XML yourself without any library, which is a bit insane. If that is the case, use a library ASAP.
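As a concrete sketch of iterative parsing with the stdlib ElementTree (cElementTree's fast parser is built into it in modern Python) — the file name `records.xml` and the repeating element name `record` are assumptions; adjust them to your data:

```python
# Iterative parsing sketch: process one <record> at a time and discard it,
# so memory stays roughly flat regardless of file size.
import xml.etree.ElementTree as ET

def process_records(path):
    count = 0
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "record":
            count += 1      # do your real per-record work here
            elem.clear()    # free the element's children once processed
    return count
```

The key is `elem.clear()` after each record: without it, iterparse still builds the whole tree behind your back.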

The specific code samples presented here might not apply to your project, but consider a few principles—borne out by testing and the lxml documentation—when faced with XML data measured in gigabytes or more:

  • Use an iterative parsing strategy to incrementally process large documents.
  • If searching the entire document in random order is required, move to an indexed XML database.
  • Be extremely conservative in the data that you select. If you are only interested in particular nodes, use methods that select by those names. If you require predicate syntax, try one of the XPath classes and methods available.
  • Consider the task at hand and the comfort level of the developer. Object models such as lxml's objectify or Amara might be more natural for Python developers when speed is not a consideration. cElementTree is faster when only parsing is required.
  • Take the time to do even simple benchmarking. When processing millions of records, small differences add up, and it is not always obvious which methods are the most efficient.
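The last point can be as simple as `timeit` from the standard library; the two functions below are placeholders for whichever approaches you are actually comparing:

```python
# Minimal benchmarking sketch using the stdlib timeit module.
# parse_with_a / parse_with_b stand in for two candidate approaches.
import timeit

def parse_with_a(data):
    return data.count("<record>")              # placeholder workload

def parse_with_b(data):
    return len(data.split("<record>")) - 1     # equivalent placeholder workload

data = "<record>x</record>" * 10_000

for fn in (parse_with_a, parse_with_b):
    elapsed = timeit.timeit(lambda: fn(data), number=100)
    print(f"{fn.__name__}: {elapsed:.3f}s")
```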

If you need to do complex operations on the data, why don't you just put it in a relational database and operate on it from there? That will give better performance.
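For instance, a rough sketch with the stdlib sqlite3 module (the table and column names are invented for illustration):

```python
# Offload records into SQLite instead of holding them in Python data structures.
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real data sets
conn.execute("CREATE TABLE records (name TEXT, value INTEGER)")

# Insert rows as you read each input file -- nothing accumulates in Python.
rows = [("a", 1), ("b", 2), ("a", 3)]
conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
conn.commit()

# Complex operations become SQL queries instead of nested for loops.
total_by_name = dict(
    conn.execute("SELECT name, SUM(value) FROM records GROUP BY name")
)
print(total_by_name)
```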

voyager
Well, I'm not exactly reading the XML all at once; I'm taking data found in text files and creating an XML file from that data... So the generation of the XML is what takes place in memory.
@developerjay: It's the same for creation, you can write to disk iteratively every now and then to avoid having the full file in memory at all times. It will be a bit slower, but you'll use much less memory.
voyager
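A minimal sketch of that incremental-writing idea using only the standard library — the file name, tag names, and attribute names are placeholders:

```python
# Write the XML to disk incrementally: open the root element, serialize one
# record at a time, then close the root. Only the current record is in memory.
import xml.etree.ElementTree as ET

def write_records(path, records):
    with open(path, "wb") as out:
        out.write(b"<?xml version='1.0' encoding='utf-8'?>\n<root>")
        for name, value in records:
            elem = ET.Element("record", name=name)
            elem.text = str(value)
            out.write(ET.tostring(elem, encoding="unicode").encode("utf-8"))
        out.write(b"</root>")

write_records("out.xml", [("a", 1), ("b", 2)])
```

lxml also ships a dedicated incremental serializer (`etree.xmlfile`) that handles the nesting for you, if you are already depending on lxml.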
What library would you recommend for this: lxml or cElementTree? I figure it would be a task to write only some parts to the file iteratively, since I need to know where the data belongs within the XML schema. Basically, how can you write some data at a time and still maintain the XML schema? Do you recommend a SAX handler/generator that would simply use a string instead?
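For reference, the SAX route mentioned here can be sketched with the stdlib `XMLGenerator`, which streams events straight to any file-like object so the document never exists whole in memory (the element names below are made up):

```python
# Streaming XML generation with SAX: emit start/characters/end events as you
# process each input file; the generator writes them out immediately.
from xml.sax.saxutils import XMLGenerator
import io

buf = io.StringIO()  # use an open file object for real output
gen = XMLGenerator(buf, encoding="utf-8")
gen.startDocument()
gen.startElement("root", {})
for name, value in [("a", "1"), ("b", "2")]:  # stand-ins for your parsed data
    gen.startElement("record", {"name": name})
    gen.characters(value)
    gen.endElement("record")
gen.endElement("root")
gen.endDocument()

xml_text = buf.getvalue()
print(xml_text)
```

The trade-off is that you must emit elements in document order yourself; if you need to place data out of order in the schema, an incremental tree-building approach is easier.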