I'm looking to split a huge XML file into smaller pieces. I'd like to scan through the file looking for a specific tag, grab everything between its opening and closing tags, save that to a file, and then continue through the rest of the file.

My issue is finding a clean way to note where the tags start and end, so that I can grab the text inside as I scan through the file with "for line in f".

I'd rather not use sentinel variables. Is there a Pythonic way to get this done?

The file is too big to read into memory.

+4  A: 

You might consider using the ElementTree iterparse function for this situation.
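A minimal sketch of how iterparse could drive the split described in the question, written for Python 3. The names split_by_tag, write_chunk, and write_to_file and the chunk file naming are hypothetical. It assumes the elements of interest are direct children of the root element.

```python
import xml.etree.ElementTree as ET


def split_by_tag(source, tag, write_chunk):
    """Stream `source` and pass each complete <tag> element to `write_chunk`.

    `source` is a filename or a binary file object. Returns the number of
    elements handed to `write_chunk`.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            elem.tail = None  # drop the whitespace that follows the element
            write_chunk(count, ET.tostring(elem, encoding="unicode"))
            count += 1
            root.clear()  # detach processed children so memory stays bounded
    return count


def write_to_file(index, xml_text):
    """Hypothetical sink: one numbered file per extracted element."""
    with open("chunk_%04d.xml" % index, "w", encoding="utf-8") as out:
        out.write(xml_text)
```

Calling split_by_tag("huge.xml", "item", write_to_file) would then write one file per item element, where the file name and tag name are placeholders. Only the element currently being built is held in memory, so the size of the input file should not matter much.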

Jeff Bauer
ElementTree is in stdlib
J.F. Sebastian
+8  A: 

There are two common ways to handle XML data.

One is called DOM, which stands for Document Object Model. This is probably the style of XML parsing you have seen when looking at documentation. It reads the entire XML document into memory to create the object model, which rules it out for a file that does not fit in memory.

The second is called SAX, which is a streaming method. The parser starts reading the XML and sends signals to your code about certain events, e.g. when a new start tag is found.

So SAX is clearly what you need for your situation. Event-driven parsers can be found in the Python standard library under xml.sax and xml.parsers.expat.
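As a rough illustration of the streaming approach, here is a sketch in Python 3 that uses xml.sax together with xml.sax.saxutils.XMLGenerator to re-serialise each matching element. The class name TagSplitter and the on_chunk callback are made up for this example. A depth counter is kept so that nested elements with the same name stay inside one chunk.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class TagSplitter(xml.sax.ContentHandler):
    """Pass every complete <tag> element to `on_chunk` as a string."""

    def __init__(self, tag, on_chunk):
        super().__init__()
        self.tag = tag
        self.on_chunk = on_chunk
        self.depth = 0      # nesting depth inside the element being captured
        self.buffer = None  # StringIO collecting the current chunk
        self.writer = None  # XMLGenerator writing into self.buffer

    def startElement(self, name, attrs):
        if self.writer is None and name == self.tag:
            # A new matching element begins: start a fresh chunk.
            self.buffer = io.StringIO()
            self.writer = XMLGenerator(self.buffer, encoding="utf-8")
        if self.writer is not None:
            self.depth += 1
            self.writer.startElement(name, attrs)

    def characters(self, content):
        if self.writer is not None:
            self.writer.characters(content)

    def endElement(self, name):
        if self.writer is None:
            return
        self.writer.endElement(name)
        self.depth -= 1
        if self.depth == 0:
            # The matching element is closed: emit the chunk and reset.
            self.on_chunk(self.buffer.getvalue())
            self.buffer = None
            self.writer = None
```

Something like xml.sax.parse("huge.xml", TagSplitter("item", save_chunk)) would then stream the file, where "huge.xml", "item", and save_chunk are placeholders for your own file, tag, and output function.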

Van Gale
+1: SAX decomposition of large XML docs.
S.Lott
A: 

How serendipitous! Will Larson just made a good post about Handling Very Large CSV and XML File in Python.

The main takeaways seem to be to use the xml.sax module, as Van mentioned, and to make some macro-functions to abstract away the details of the low-level SAX API.

Alabaster Codify
+5  A: 

I have had success using the cElementTree.iterparse method for a similar task.

I had a large XML doc with repeated entries tagged 'resFrame', and I wanted to keep only the entries with a specific id.

The source document had this structure:

<snapDoc>
  <bucket>....</bucket>
  <bucket>....</bucket>
  <bucket>....</bucket>
  ...
  <resFrame><id>234234</id>.....</resFrame>
  <frame><id>344234</id>.....</frame>
  <resFrame>...</resFrame>
  <frame>...</frame>
</snapDoc>

I used the following script to create a smaller doc with the same structure, containing all the bucket entries but only the resFrame entries with a specific id.

#!/usr/bin/env python2.6

import xml.etree.cElementTree as cElementTree
start = '''<?xml version="1.0" encoding="UTF-8"?>
<snapDoc>'''

def main():
    print start
    context = cElementTree.iterparse('snap.xml', events=("start", "end"))
    context = iter(context)
    event, root = context.next() # get the root element of the XML doc

    for event, elem in context:
        if event == "end":
            if elem.tag == 'bucket':  # write out all <bucket> entries
                elem.tail = None
                print cElementTree.tostring(elem)
            if elem.tag == 'resFrame':
                # only write out resFrame entries with this id
                if elem.find("id").text == ":4:39644:482:-1:1":
                    elem.tail = None
                    print cElementTree.tostring(elem)
            if elem.tag in ['bucket', 'frame', 'resFrame']:
                root.clear()  # when done parsing a section, clear the tree to save memory
    print "</snapDoc>"

main()
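
The script above targets Python 2.6, and xml.etree.cElementTree was removed in Python 3.9. What follows is a Python 3 adaptation of the same idea, offered as a sketch and not as part of the original answer. The function name filter_snapdoc and its parameters are hypothetical. It uses findtext so that a resFrame without an id child is skipped instead of raising an error.

```python
import sys
import xml.etree.ElementTree as ET


def filter_snapdoc(source, wanted_id, out=sys.stdout):
    """Copy all <bucket> entries and only the <resFrame> entries whose <id> matches."""
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n<snapDoc>\n')
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end":
            continue
        if elem.tag == "bucket":
            keep = True
        elif elem.tag == "resFrame":
            # findtext returns None when there is no <id> child
            keep = elem.findtext("id") == wanted_id
        else:
            keep = False
        if keep:
            elem.tail = None
            out.write(ET.tostring(elem, encoding="unicode") + "\n")
        if elem.tag in ("bucket", "frame", "resFrame"):
            root.clear()  # a section is finished, so drop it to save memory
    out.write("</snapDoc>\n")
```

With the file and id from the script above, the call would be filter_snapdoc("snap.xml", ":4:39644:482:-1:1").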
James Dean
A: 

This is an old but very good article from Uche Ogbuji's (also very good) Python & XML column. It covers your exact question and uses the standard library's sax module, as the other answers have suggested: Decomposition, Process, Recomposition

prayfomojo