ansaurus

Question

Splitting a large XML file in Python

Answer 1

+4 A:

You might consider using the ElementTree iterparse function for this situation.
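As a rough illustration of that suggestion, here is a minimal Python 3 sketch. The helper name `iter_elements` and the sample data are made up for the example; only `ElementTree.iterparse` and `Element.clear` come from the standard library.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(source, tag):
    """Yield each completed <tag> element from source, one at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Once the caller is done with it, drop the element's text and children.
            elem.clear()


# Tiny in-memory document standing in for a large file (hypothetical data).
sample = io.BytesIO(b"<root><item>a</item><item>b</item><other>x</other></root>")
texts = [elem.text for elem in iter_elements(sample, "item")]
print(texts)
```

Note that the emptied elements stay attached to the root, so for a really huge file you would probably also clear the root from time to time.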

Jeff Bauer 2009-01-25 00:32:07

ElementTree is in stdlib

J.F. Sebastian 2009-01-25 07:40:29

Answer 2

+8 A:

There are two common ways to handle XML data.

One is called DOM, which stands for Document Object Model. This is probably the style of XML parsing you have seen when looking at documentation. It reads the entire XML document into memory to create the object model, which makes it a poor fit for very large files.

The second is called SAX, which is a streaming method. The parser starts reading the XML and sends signals to your code about certain events, e.g. when a new start tag is found.

So SAX is clearly what you need for your situation. Event-driven parsers can be found in the Python standard library under xml.sax (the SAX API) and xml.parsers.expat (the lower-level Expat bindings).
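To make the event idea concrete, here is a small Python 3 sketch using xml.sax. The `TagCounter` class and the sample data are invented for illustration; a real handler would do whatever per-element work the split needs.

```python
import io
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Count start tags by name as the parser streams through the document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the parser each time it reads an opening tag.
        self.counts[name] = self.counts.get(name, 0) + 1


# Small in-memory document standing in for a large file (hypothetical data).
sample = io.BytesIO(b"<root><item>a</item><item>b</item></root>")
handler = TagCounter()
xml.sax.parse(sample, handler)
print(handler.counts)
```

The parser never builds a tree here; the only state held in memory is whatever the handler chooses to keep.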

Van Gale 2009-01-25 00:49:08

+1: SAX decomposition of large XML docs.

S.Lott 2009-01-25 01:28:23

Answer 3

A:

How serendipitous! Will Larson just made a good post about Handling Very Large CSV and XML Files in Python.

The main takeaways seem to be to use the xml.sax module, as Van mentioned, and to make some macro-functions to abstract away the details of the low-level SAX API.
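This is not the code from that post, just one guess at what such a wrapper might look like in Python 3. The helper `for_each_element` is hypothetical, and it assumes the elements of interest are not nested inside each other.

```python
import io
import xml.sax


def for_each_element(source, tag, callback):
    """Call callback(attrs_dict, text) for every <tag> element in source.

    Hypothetical helper: it hides the ContentHandler subclass from the caller.
    Assumes <tag> elements are not nested inside each other.
    """

    class _Handler(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.inside = False
            self.attrs = {}
            self.chunks = []

        def startElement(self, name, attrs):
            if name == tag:
                self.inside = True
                self.attrs = dict(attrs.items())
                self.chunks = []

        def characters(self, content):
            # Character data may arrive in several pieces, so collect them all.
            if self.inside:
                self.chunks.append(content)

        def endElement(self, name):
            if name == tag:
                self.inside = False
                callback(self.attrs, "".join(self.chunks))

    xml.sax.parse(source, _Handler())


rows = []
sample = io.BytesIO(b'<root><row id="1">alpha</row><row id="2">beta</row></root>')
for_each_element(sample, "row", lambda attrs, text: rows.append((attrs["id"], text)))
print(rows)
```

The caller only supplies a tag name and a function, and never touches the SAX handler methods directly.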

Alabaster Codify 2009-01-25 01:53:15

Answer 4

+5 A:

I have had success using cElementTree.iterparse for a similar task.

I had a large XML document with repeated entries tagged 'resFrame', and I wanted to keep only the entries with a specific id.

The source document had this structure:

<snapDoc>
  <bucket>....</bucket>
  <bucket>....</bucket>
  <bucket>....</bucket>
  ...
  <resFrame><id>234234</id>.....</resFrame>
  <frame><id>344234</id>.....</frame>
  <resFrame>...</resFrame>
  <frame>...</frame>
</snapDoc>

I used the following script to create a smaller document with the same structure: all of the bucket entries, but only the resFrame entries with a specific id.

#!/usr/bin/env python2.6

import xml.etree.cElementTree as cElementTree

start = '''<?xml version="1.0" encoding="UTF-8"?>
<snapDoc>'''

def main():
    print start
    context = cElementTree.iterparse('snap.xml', events=("start", "end"))
    context = iter(context)
    event, root = context.next()  # get the root element of the XML doc

    for event, elem in context:
        if event == "end":
            if elem.tag == 'bucket':  # write out all <bucket> entries
                elem.tail = None
                print cElementTree.tostring(elem)
            if elem.tag == 'resFrame':
                # only write out resFrame entries with this id;
                # findtext returns None (instead of raising) if <id> is missing
                if elem.findtext("id") == ":4:39644:482:-1:1":
                    elem.tail = None
                    print cElementTree.tostring(elem)
            if elem.tag in ['bucket', 'frame', 'resFrame']:
                root.clear()  # when done with a section, clear the tree to save memory
    print "</snapDoc>"

main()
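The script above targets Python 2.6. For anyone on Python 3, where cElementTree is gone and print is a function, a rough port might look like the following. The function name `filter_snapdoc`, its parameters, and the sample ids are my own inventions for the sketch. It writes to a file-like object rather than stdout so it is easier to try out.

```python
import io
import xml.etree.ElementTree as ET


def filter_snapdoc(source, wanted_id, out):
    """Copy every <bucket> and the <resFrame> entries whose <id> matches."""
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n<snapDoc>\n')
    context = iter(ET.iterparse(source, events=("start", "end")))
    event, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end":
            continue
        keep = elem.tag == "bucket" or (
            elem.tag == "resFrame" and elem.findtext("id") == wanted_id
        )
        if keep:
            elem.tail = None
            out.write(ET.tostring(elem, encoding="unicode") + "\n")
        if elem.tag in ("bucket", "frame", "resFrame"):
            root.clear()  # discard finished entries so memory use stays low


    out.write("</snapDoc>\n")


# Small in-memory stand-in for snap.xml (hypothetical data).
sample = io.BytesIO(
    b"<snapDoc><bucket>b1</bucket>"
    b"<resFrame><id>1</id></resFrame>"
    b"<frame><id>2</id></frame>"
    b"<resFrame><id>3</id></resFrame></snapDoc>"
)
result = io.StringIO()
filter_snapdoc(sample, "3", result)
print(result.getvalue())
```

For a real file you would pass the path (or an open binary file) as `source` and `sys.stdout` or an open text file as `out`.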

James Dean 2009-01-28 19:17:39

Answer 5

A:

This is an old but very good article from Uche Ogbuji's (also very good) Python & XML column: "Decomposition, Process, Recomposition". It covers your exact question and uses the standard library's sax module, as the other answers suggested.

prayfomojo 2009-07-02 16:42:26
