I'd like to parse a big XML file "on the fly", and I'd like to use a Python generator to do it. I've tried "iterparse" from "xml.etree.cElementTree" (which is really nice), but it still doesn't behave like a generator.

Other suggestions?

+6  A: 

"On the fly" parsing and document trees are not really compatible. SAX-style parsers are usually used for that (for example, Python's standard xml.sax). You basically have to define a class with handlers for various events like startElement, endElement, etc. and the parser will call the methods as it parses the XML file.

Lukáš Lalinský
that's what I want... I don't mind having to "react" to events such as "start tag" etc.
jldupont
@Jean-Lou: if you don't need the entire tree, then SAX is the way to go. It is made for processing documents as a stream of events instead of a tree of content.
D.Shawley
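
The handler-class approach described in this answer can be sketched roughly as follows. This is only a minimal illustration using the standard xml.sax module; the TitleCollector class, the SAMPLE document and the "title" element name are made up for the example and are not from the question.

```python
import io
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler: collects the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._chunks = []

    def startElement(self, name, attrs):
        # Called by the parser each time an opening tag is read.
        if name == "title":
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        # The parser may deliver one text node in several pieces.
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        # Called by the parser each time a closing tag is read.
        if name == "title":
            self.titles.append("".join(self._chunks))
            self._in_title = False


# Made-up sample document, standing in for a large file on disk.
SAMPLE = b"<db><entry><title>One</title></entry><entry><title>Two</title></entry></db>"

handler = TitleCollector()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.titles)
```

Note that the parser drives the handler here, so this is a push model: it reacts to events as they arrive, but it is not something you can iterate over with a for loop.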
A: 

PullDom does what you want. It reads XML from a stream, like SAX, but then builds a DOM for a selected piece of it.

"PullDOM is a really simple API for working with DOM objects in a streaming (efficient!) manner rather than as a monolithic tree."

RichieHindle
so if I put a "yield" statement in the for-loop {e.g. for (event,node) in events: yield (event, node)} PullDom won't restart at the beginning next time I enter the for-loop ?
jldupont
... because that's what happens with "iterparse"...
jldupont
@Jean-Lou Dupont: if you want iterator behavior, perhaps you should call `iter(...)` on the ElementTree object?
kaizer.se
@kaizer: example, please please?
jldupont
@Jean-Lou Dupont: I believe that's correct, but you'd have to try it in your own situation.
RichieHindle
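
To make the "yield" question from these comments concrete, here is a small sketch of PullDom wrapped in a generator function. It assumes Python 3 and the standard xml.dom.pulldom module; iter_entries, SAMPLE and the "entry"/"title" element names are hypothetical. Because the generator keeps its own reference to the event stream, each step continues where the previous one stopped instead of starting over.

```python
import io
from xml.dom import pulldom


def iter_entries(source, tag="entry"):
    """Lazily yield one small DOM node per <tag> element in the stream."""
    events = pulldom.parse(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Build a DOM subtree for this element only, not the whole file.
            events.expandNode(node)
            yield node


# Made-up sample document, standing in for a large file on disk.
SAMPLE = (
    b"<db>"
    b"<entry><title>One</title></entry>"
    b"<entry><title>Two</title></entry>"
    b"</db>"
)

titles = []
for entry in iter_entries(io.BytesIO(SAMPLE)):
    title = entry.getElementsByTagName("title")[0]
    titles.append(title.firstChild.data)
print(titles)
```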
+5  A: 

xml.etree.cElementTree comes close to a generator when used correctly: by default you receive each element after its 'end' event, at which point you can process it. Call element.clear() on the element if you don't need it after processing; this drops its children, attributes and text, and so frees memory.


Here is a complete example of what I mean, in which I parse the library of Rhythmbox (a music player). I use (c)ElementTree's iterparse, and for each processed element I call element.clear(), which saves quite a lot of memory. (Btw, the code below is the successor to some SAX code that did the same thing; the cElementTree solution was a relief since 1) the code is concise and expresses what I need and nothing more, 2) it is 3x as fast, and 3) it uses less memory.)

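import os
# cElementTree was deprecated in Python 3.3 and removed in 3.9; on Python 3,
# xml.etree.ElementTree uses the C accelerator automatically when available.
import xml.etree.ElementTree as ElementTree

NEEDED_KEYS = {"title", "artist", "album", "track-number", "location"}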

def _lookup_string(string, strmap):
    """Look up @string in the string map,
    and return the copy in the map.

    If not found, update the map with the string.
    """
    string = string or ""
    try:
        return strmap[string]
    except KeyError:
        strmap[string] = string
        return string

def get_rhythmbox_songs(dbfile, typ="song", keys=NEEDED_KEYS):
    """Return a list of info dictionaries for all songs
    in a Rhythmbox library database file, with dictionary
    keys as given in @keys.
    """
    rhythmbox_dbfile = os.path.expanduser(dbfile)

    lSongs = []
    strmap = {}

    # Parse with iterparse; we get the elements when
    # they are finished, and can remove them directly after use.

    for event, entry in ElementTree.iterparse(rhythmbox_dbfile):
        if not (entry.tag == "entry" and entry.get("type") == typ):
            continue
        info = {}
        # Iterate over the element directly; getchildren() was removed in Python 3.9.
        for child in entry:
            if child.tag in keys:
                tag = _lookup_string(child.tag, strmap)
                text = _lookup_string(child.text, strmap)
                info[tag] = text
        lSongs.append(info)
        entry.clear()
    return lSongs
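
Since the question asks for a generator, the same loop can also be written so that it yields each song instead of collecting a list. The following is a simplified sketch under a few assumptions: it targets Python 3, it leaves out the string-interning map, and iter_songs and the SAMPLE document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ElementTree


def iter_songs(source, typ="song", keys=("title", "artist", "album")):
    """Yield one info dict per matching <entry>, clearing each entry after use."""
    for event, entry in ElementTree.iterparse(source):
        # With the default events, we only see elements once they are complete.
        if entry.tag != "entry" or entry.get("type") != typ:
            continue
        info = {child.tag: child.text or "" for child in entry if child.tag in keys}
        entry.clear()  # the data is copied into info, so the element can be emptied
        yield info


# Made-up sample document, standing in for rhythmdb.xml.
SAMPLE = (
    b"<rhythmdb>"
    b"<entry type='song'><title>One</title><artist>A</artist></entry>"
    b"<entry type='ignore'><title>Skip</title></entry>"
    b"<entry type='song'><title>Two</title><artist>B</artist></entry>"
    b"</rhythmdb>"
)

songs = iter_songs(io.BytesIO(SAMPLE))
first = next(songs)   # parsing pauses after the first matching entry
rest = list(songs)    # and resumes here, without re-reading the input
print(first, rest)
```

The generator object holds the single iterparse iterator, so the caller can stop and resume at will.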


Now, I don't understand your expectations. Do you expect something like the following to work?

# take one
for event, entry in ElementTree.iterparse(rhythmbox_dbfile):
    # parse some entries, then exit loop
    break

# take two
for event, entry in ElementTree.iterparse(rhythmbox_dbfile):
    # parse the rest of entries
    pass

Each time you call iterparse you get a new iterator object, which reads the file anew! If you want a persistent object with iterator semantics, you have to refer to the same object in both loops (untried code):

# setup
parseiter = iter(ElementTree.iterparse(rhythmbox_dbfile))
# take one
for event, entry in parseiter:
    # parse some entries, then exit loop
    break

# take two
for event, entry in parseiter:
    # parse the rest of entries
    pass


I think it can be confusing since different objects have different semantics. A file object always has internal state and advances through the file, however you iterate over it. An ElementTree iterparse object apparently does not. The key is to remember that a for loop always calls iter() on the thing you iterate over. Here is an experiment comparing ElementTree.iterparse with a file object:

>>> import xml.etree.cElementTree as ElementTree
>>> pth = "/home/ulrik/.local/share/rhythmbox/rhythmdb.xml"
>>> iterparse = ElementTree.iterparse(pth)
>>> iterparse
<iterparse object at 0x483a0890>
>>> iter(iterparse)
<generator object at 0x483a2f08>
>>> iter(iterparse)
<generator object at 0x483a6468>
>>> f = open(pth, "r")
>>> f
<open file '/home/ulrik/.local/share/rhythmbox/rhythmdb.xml', mode 'r' at 0x4809af98>
>>> iter(f)
<open file '/home/ulrik/.local/share/rhythmbox/rhythmdb.xml', mode 'r' at 0x4809af98>
>>> iter(f)
<open file '/home/ulrik/.local/share/rhythmbox/rhythmdb.xml', mode 'r' at 0x4809af98>

What you see is that each call to iter() on an iterparse object returns a new generator. The file object, however, has internal operating-system state that must be preserved, and so it is its own iterator.
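
The transcript above comes from an old interpreter. As a hedged check of the "same object in both loops" idea on Python 3, here is a small self-contained sketch; the SAMPLE document and variable names are made up. As far as I know, on Python 3 the object returned by iterparse is already its own iterator, so the iter() call is harmless there, but treat that as an assumption and test it on your own version.

```python
import io
import xml.etree.ElementTree as ElementTree

# Made-up sample document, standing in for a large file on disk.
SAMPLE = b"<root><item>1</item><item>2</item><item>3</item></root>"

# One shared iterator object, used by both loops.
parseiter = iter(ElementTree.iterparse(io.BytesIO(SAMPLE)))

# take one: stop after the first <item>
first_batch = []
for event, elem in parseiter:
    if elem.tag == "item":
        first_batch.append(elem.text)
        break

# take two: continue with whatever the first loop did not consume
second_batch = [elem.text for event, elem in parseiter if elem.tag == "item"]

print(first_batch, second_batch)
```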

kaizer.se
@kaizer: So in effect it is like working with the subset of the document each time the for-loop is entered after the element.clear() ?
jldupont
You haven't defined what you want to do and your expectations surprise me; I would use iterparse in one for loop over the whole document. I will make an example.
kaizer.se
@kaizer: many thanks for all your efforts. I discovered the SAX parser thanks to this post and it looks like I'll be able to manage building my state-machine based parser neatly with this approach. (Can you tell I am an XML-newbie ? ;-)
jldupont
Well, I am too. I preferred ElementTree because I could get the job done and forget about it quickly. Your problem might be simpler with other methods, though!
kaizer.se
@Jean-Lou: Did my answer clear up anything for you about iterparse and pausing then resuming parsing? I'm just curious.
kaizer.se