tags:

views:

3535

answers:

3

Python has several ways to parse XML...

I understand the very basics of parsing with SAX. It functions as a stream parser, with an event-driven API.

I understand the DOM parser also. It reads the XML into memory and coverts it to objects that can be accessed with Python.

Generally speaking, it was easy to choose between the 2 depending on what you needed to do, memory constraints, performance, etc.

(hopefully I'm correct so far).

since Python 2.5, we also have ElementTree. How does this compare to DOM and SAX? Which is it more like? Why is it better than the previous parsers?

+3  A: 

ElementTree's parse() is like DOM, whereas iterparse() is like SAX. In my opinion, ElementTree is better than DOM and SAX in that it provides API easier to work with.

sanxiyn
Also, I find that I want the real structure, not a series of events.
S.Lott
A serial parser is often good enough for simple parsing. I started Python using sax, and only switched to minidom when my needs became too complex for sax. I should add that I haven't used ElementTree, yet, since it doesn't seem to offer enough more functionality for me to port my code to it.
giltay
+1  A: 

ElementTree has more pythonic API. It also is in standard library now so using it reduces dependencies.

I actually prefer http://codespeak.net/lxml/ as it has API like ElementTree, but has also nice additional features and performs well.

iny
+18  A: 

ElementTree is much easier to use, because it represents an XML tree (basically) as a structure of lists, and attributes are represented as dictionaries.

ElementTree needs much less memory for XML trees than DOM (and thus is faster), and the parsing overhead via iterparse is comparable to SAX. Additionally, iterparse returns partial structures, and you can keep memory usage constant during parsing by discarding the structures as soon as you process them.

ElementTree, as in Python 2.5, has only a small feature set compared to full-blown XML libraries, but it's enough for many applications. If you need a validating parser or complete XPath support, lxml is the way to go. For a long time, it used to be quite unstable, but I haven't had any problems with it since 2.1.

ElementTree deviates from DOM, where nodes have access to their parent and siblings. Handling actual documents rather than data stores is also a bit cumbersome, because text nodes aren't treated as actual nodes. In the XML snippet

<a>This is <b>a</b> test</a>

The string test will be the so-called tail of element b.

In general, I recommend ElementTree as the default for all XML processing with Python, and DOM or SAX as the solutions for specific problems.

Torsten Marek