views:

81

answers:

1

Hey,

I am searching for a way to handle overloading the RAM and CPU in a high-memory program... I need to process a LARGE amount of data contained in files: I read the files and process the data therein. The problem is that there are many nested for loops, and a root XML file is being created from all the processed data. The program easily consumes a couple of gigabytes of RAM after half an hour or so of run-time. Is there something I can do to keep RAM usage from growing so large, or a way to work around it?

+3  A: 

Do you really need to keep all the data from the XML file in memory at once?

Most (all?) XML libraries out there allow you to do iterative parsing, meaning that you keep in memory just a few nodes of the XML file at a time, not the whole file. That is, unless you are building a string containing the XML yourself without any library, which is a bit insane. If that is the case, use a library ASAP.
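As a concrete sketch of iterative parsing with the stdlib ElementTree (cElementTree's fast parser is built into it in modern Python) — the file name `records.xml` and the repeating element name `record` are assumptions; adjust them to your data:

```python
# Iterative parsing sketch: process one <record> at a time and discard it,
# so memory stays roughly flat regardless of file size.
import xml.etree.ElementTree as ET

def process_records(path):
    count = 0
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "record":
            count += 1      # do your real per-record work here
            elem.clear()    # free the element's children once processed
    return count
```

The key is `elem.clear()` after each record: without it, iterparse still builds the whole tree behind your back.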

The specific code samples presented here might not apply to your project, but consider a few principles—borne out by testing and the lxml documentation—when faced with XML data measured in gigabytes or more:

  • Use an iterative parsing strategy to incrementally process large documents.
  • If searching the entire document in random order is required, move to an indexed XML database.
  • Be extremely conservative in the data that you select. If you are only interested in particular nodes, use methods that select by those names. If you require predicate syntax, try one of the XPath classes and methods available.
  • Consider the task at hand and the comfort level of the developer. Object models such as lxml's objectify or Amara might be more natural for Python developers when speed is not a consideration. cElementTree is faster when only parsing is required.
  • Take the time to do even simple benchmarking. When processing millions of records, small differences add up, and it is not always obvious which methods are the most efficient.
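The last point can be as simple as `timeit` from the standard library; the two functions below are placeholders for whichever approaches you are actually comparing:

```python
# Minimal benchmarking sketch using the stdlib timeit module.
# parse_with_a / parse_with_b stand in for two candidate approaches.
import timeit

def parse_with_a(data):
    return data.count("<record>")              # placeholder workload

def parse_with_b(data):
    return len(data.split("<record>")) - 1     # equivalent placeholder workload

data = "<record>x</record>" * 10_000

for fn in (parse_with_a, parse_with_b):
    elapsed = timeit.timeit(lambda: fn(data), number=100)
    print(f"{fn.__name__}: {elapsed:.3f}s")
```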

If you need to do complex operations on the data, why don't you just put it in a relational database and operate on it from there? That will give better performance.
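For instance, a rough sketch with the stdlib sqlite3 module (the table and column names are invented for illustration):

```python
# Offload records into SQLite instead of holding them in Python data structures.
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real data sets
conn.execute("CREATE TABLE records (name TEXT, value INTEGER)")

# Insert rows as you read each input file -- nothing accumulates in Python.
rows = [("a", 1), ("b", 2), ("a", 3)]
conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
conn.commit()

# Complex operations become SQL queries instead of nested for loops.
total_by_name = dict(
    conn.execute("SELECT name, SUM(value) FROM records GROUP BY name")
)
print(total_by_name)
```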

voyager
Well, I'm not exactly reading the XML all at once; I'm taking data found in text files and creating an XML file from that data... So the generation of the XML is what takes place in memory.
@developerjay: It's the same for creation, you can write to disk iteratively every now and then to avoid having the full file in memory at all times. It will be a bit slower, but you'll use much less memory.
voyager
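A minimal sketch of that incremental-writing idea using only the standard library — the file name, tag names, and attribute names are placeholders:

```python
# Write the XML to disk incrementally: open the root element, serialize one
# record at a time, then close the root. Only the current record is in memory.
import xml.etree.ElementTree as ET

def write_records(path, records):
    with open(path, "wb") as out:
        out.write(b"<?xml version='1.0' encoding='utf-8'?>\n<root>")
        for name, value in records:
            elem = ET.Element("record", name=name)
            elem.text = str(value)
            out.write(ET.tostring(elem, encoding="unicode").encode("utf-8"))
        out.write(b"</root>")

write_records("out.xml", [("a", 1), ("b", 2)])
```

lxml also ships a dedicated incremental serializer (`etree.xmlfile`) that handles the nesting for you, if you are already depending on lxml.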
What library would you recommend for this: lxml or cElementTree? I figure it would be a task to write only some parts to the file iteratively, since I need to know where the data belongs within the XML schema. Basically, how can you write some data at a time and still maintain the XML schema? Do you recommend a SAX handler/generator that would simply use a string instead?
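For reference, the SAX route mentioned here can be sketched with the stdlib `XMLGenerator`, which streams events straight to any file-like object so the document never exists whole in memory (the element names below are made up):

```python
# Streaming XML generation with SAX: emit start/characters/end events as you
# process each input file; the generator writes them out immediately.
from xml.sax.saxutils import XMLGenerator
import io

buf = io.StringIO()  # use an open file object for real output
gen = XMLGenerator(buf, encoding="utf-8")
gen.startDocument()
gen.startElement("root", {})
for name, value in [("a", "1"), ("b", "2")]:  # stand-ins for your parsed data
    gen.startElement("record", {"name": name})
    gen.characters(value)
    gen.endElement("record")
gen.endElement("root")
gen.endDocument()

xml_text = buf.getvalue()
print(xml_text)
```

The trade-off is that you must emit elements in document order yourself; if you need to place data out of order in the schema, an incremental tree-building approach is easier.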