A search for "python" and "xml" returns a variety of libraries for combining the two.

This list is probably faulty:

  • xml.dom
  • xml.etree
  • xml.sax
  • xml.parsers.expat
  • PyXML
  • beautifulsoup?
  • HTMLParser
  • htmllib
  • sgmllib

It would be nice if someone could offer a quick summary of when to use which, and why.

+3  A: 

The DOM/SAX divide is a basic one. It applies to more than just Python, since DOM and SAX are cross-language standards.

DOM: read the whole document into memory and manipulate it (a minimal sketch follows this list). Good for:

  • complex relationships across tags in the markup
  • small intricate XML documents
  • Cautions:
    • Easy to use excessive memory
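
A minimal DOM sketch, using the standard library's xml.dom.minidom; the "books.xml" file and its <book title="..."> elements are hypothetical:

    # Parse the whole file into an in-memory tree, then navigate it.
    from xml.dom.minidom import parse

    dom = parse("books.xml")  # the entire document now lives in memory
    for book in dom.getElementsByTagName("book"):
        print(book.getAttribute("title"))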

SAX: parse the document while you read it (sketch after this list). Good for:

  • Long documents or open-ended streams
  • places where memory is a constraint
  • Cautions:
    • You'll need to code a stateful parser, which can be tricky
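
A matching SAX sketch, again against the hypothetical "books.xml"; note that the state (here just a counter) has to live in the handler you write:

    import xml.sax

    class BookTitleHandler(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.count = 0  # parser state lives in the handler

        def startElement(self, name, attrs):
            # Called for each opening tag as the stream is read.
            if name == "book":
                self.count += 1
                print(attrs.getValue("title"))

    handler = BookTitleHandler()
    xml.sax.parse("books.xml", handler)  # never holds the whole document
    print(handler.count, "books seen")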

beautifulsoup:

Great for HTML or not-quite-well-formed markup. Easy to use and fast. Good for screen scraping, etc. It can work with markup where the XML-based parsers would just throw an error saying the markup is incorrect.
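
For instance, a sketch using BeautifulSoup 4 (a third-party install; the markup below is deliberately malformed) recovers content a strict XML parser would reject:

    from bs4 import BeautifulSoup

    broken = "<ul><li>one<li>two<li>three"  # unclosed tags
    soup = BeautifulSoup(broken, "html.parser")
    for li in soup.find_all("li"):
        print(li.get_text())  # prints one, two, three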

Most of the rest I haven't used, but I don't think there are hard-and-fast rules about when to use which. Just your standard considerations: who is going to maintain the code, which APIs you find easiest to use, how well they work, etc.

In general, for basic needs, it's nice to use the standard library modules, since they are "standard" and thus available and well known. However, if you need to dig deep into something, there is almost always a newer third-party module with superior functionality.

Peter Lyons
Came across this good article with elementtree examples using both styles of parsers: http://www.doughellmann.com/PyMOTW/xml/etree/ElementTree/parse.html
Peter Lyons
+1  A: 

I don't do much with XML, but when I've needed to, lxml has been a joy to work with, and it is apparently quite fast. The ElementTree API is very nice in an object-oriented setting.

Autoplectic
+2  A: 

I find xml.etree essentially sufficient for everything, except BeautifulSoup when I need to parse broken XML (not a common problem, unlike broken HTML, which BeautifulSoup also helps with and which is everywhere). xml.etree has reasonable support for reading entire XML docs into memory, navigating them, creating them, and incrementally parsing large docs. lxml supports the same interface and is generally faster -- useful to push performance when you can afford to install third-party Python extensions (e.g. on App Engine you can't -- but xml.etree is still there, so you can run exactly the same code). lxml also has more features, and offers BeautifulSoup integration too.
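
A short sketch of the two xml.etree styles described above (the "books.xml" file is hypothetical; swapping the import for lxml.etree should run unchanged):

    import xml.etree.ElementTree as ET

    # Whole document in memory:
    tree = ET.parse("books.xml")
    for book in tree.getroot().iter("book"):
        print(book.get("title"))

    # Incremental parsing, for large documents:
    for event, elem in ET.iterparse("books.xml"):
        if elem.tag == "book":
            print(elem.get("title"))
            elem.clear()  # discard the element to keep memory flat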

The other libs you mention mimic APIs designed for very different languages, and in general I see no reason to contort Python into those gyrations. If you have very specific needs, such as support for XSLT or various kinds of validation, it may be worth looking at other libraries, but I haven't had such needs in a long time, so I'm not current on those offerings.

Alex Martelli
+1  A: 

For many problems you can get by with the standard library's xml package. It has the major advantage of being part of the standard library: it is pre-installed on almost every system, and its interface is stable. It is not the best, nor the fastest, but it is there.

For everything else there is lxml. Specifically, lxml is best for parsing broken HTML, XHTML, or suspect feeds. It uses libxml2 and libxslt to handle XPath, XSLT, and EXSLT. The tutorial is clear and the interface is straightforward. The rest of the libraries mentioned exist because lxml was not available in its current form.
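
As a sketch of why (lxml is a third-party install, and the markup here is hypothetical), note how it repairs tag soup and then answers an XPath query:

    from lxml import html

    broken = "<div><p>first<p>second"  # tag soup, not well-formed
    doc = html.fromstring(broken)      # lxml parses it anyway
    for p in doc.xpath("//p"):
        print(p.text_content())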

This is my opinion.

Charles Merriam