Do you really need to keep all the data from the XML file in memory at once?
Most (if not all) XML libraries support iterative parsing, meaning you keep only a few nodes of the XML file in memory at a time instead of the whole document. The exception is if you are building the XML string yourself without any library, but that is a bit insane; if that is your situation, switch to a library ASAP.
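As a minimal sketch, here is what iterative parsing looks like with lxml's `iterparse`. The file name `data.xml` and the `<item>` record tag are assumptions; substitute your own:

```python
from lxml import etree

def iter_items(path):
    # iterparse yields each element as its closing tag is seen,
    # so only a small window of the tree is in memory at once.
    for _, elem in etree.iterparse(path, events=("end",), tag="item"):
        yield elem.text
        # Free the element and any already-processed siblings the
        # parser still holds, keeping memory usage roughly constant.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

for text in iter_items("data.xml"):
    print(text)
```

The `clear()` and sibling-deletion step is what keeps memory flat: `iterparse` still builds the tree as it goes, so you have to discard the parts you no longer need.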
The specific code samples here might not apply to your project, but when faced with XML data measured in gigabytes or more, consider a few principles borne out by testing and the lxml documentation:
- Use an iterative parsing strategy to incrementally process large documents.
- If searching the entire document in random order is required, move to an indexed XML database.
- Be extremely conservative in the data that you select. If you are only interested in particular nodes, parse selectively by tag name (as in the iterparse sketch above). If you require predicate syntax, try one of the XPath classes and methods available.
- Consider the task at hand and the comfort level of the developer. Object models such as lxml's objectify or Amara can be more natural for Python developers when speed is not a consideration. cElementTree is faster when only parsing is required (on Python 3 its C accelerator ships as the default `xml.etree.ElementTree`).
- Take the time to do even simple benchmarking; a sketch follows this list. When processing millions of records, small differences add up, and it is not always obvious which methods are the most efficient.
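As a rough example of that kind of benchmarking, the sketch below times two candidate parsers with the standard library's `timeit`. The file name and the two strategies compared are assumptions; swap in whatever approaches you are actually weighing:

```python
import timeit
import xml.etree.ElementTree as ET
from lxml import etree

def stdlib_iterparse(path):
    # Full streaming pass with the standard library parser.
    for _, elem in ET.iterparse(path):
        elem.clear()

def lxml_iterparse(path):
    # Same pass with lxml's parser for comparison.
    for _, elem in etree.iterparse(path):
        elem.clear()

for func in (stdlib_iterparse, lxml_iterparse):
    seconds = timeit.timeit(lambda: func("data.xml"), number=3)
    print(f"{func.__name__}: {seconds / 3:.2f}s per run")
```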
If you need to do complex operations on the data, why not put it in a relational database and operate on it from there? A database engine is built for exactly that kind of random-access querying, so it will usually perform far better than repeatedly walking the XML.
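For instance, here is a minimal sketch that streams the hypothetical `<item>` records from above into SQLite using the standard library's `sqlite3`; the single-column table is a deliberate simplification:

```python
import sqlite3
from lxml import etree

def iter_items(path):
    # Same streaming generator as above, assuming <item> records.
    for _, elem in etree.iterparse(path, tag="item"):
        yield elem.text
        elem.clear()

conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS items (value TEXT)")
# executemany accepts a generator, so the XML is never fully in memory.
conn.executemany("INSERT INTO items (value) VALUES (?)",
                 ((text,) for text in iter_items("data.xml")))
conn.commit()

# Complex operations now run in SQL rather than over the XML tree.
for value, n in conn.execute(
        "SELECT value, COUNT(*) FROM items GROUP BY value"):
    print(value, n)
conn.close()
```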