I have had success with the cElementTree.iterparse method in order to do a similar task.
I had a large xml doc with repeated 'entries' with tag 'resFrame' and I wanted to filter out entries for a specific id. Here is the code that I used for it:
source document had this structure
<snapDoc>
<bucket>....</bucket>
<bucket>....</bucket>
<bucket>....</bucket>
...
<resFrame><id>234234</id>.....</resFrame>
<frame><id>344234</id>.....</frame>
<resFrame>...</resFrame>
<frame>...</frame>
</snapDoc>
I used the following script to create a smaller doc that had the same structure, bucket entries and only resFrame entries with a specific id.
#!/usr/bin/env python2.6
import xml.etree.cElementTree as cElementTree
start = '''<?xml version="1.0" encoding="UTF-8"?>
<snapDoc>'''
def main():
print start
context = cElementTree.iterparse('snap.xml', events=("start", "end"))
context = iter(context)
event, root = context.next() # get the root element of the XML doc
for event, elem in context:
if event == "end":
if elem.tag == 'bucket': # i want to write out all <bucket> entries
elem.tail = None
print cElementTree.tostring( elem )
if elem.tag == 'resFrame':
if elem.find("id").text == ":4:39644:482:-1:1": # i only want to write out resFrame entries with this id
elem.tail = None
print cElementTree.tostring( elem )
if elem.tag in ['bucket', 'frame', 'resFrame']:
root.clear() # when done parsing a section clear the tree to safe memory
print "</snapDoc>"
main()