tags:

views:

317

answers:

3

EDIT

The use of the phrase "bad at XML" in this question has been a point of contention, so I'd like to start out by providing a very clear definition of what I mean by this term in this context: if support for standard XML APIs is poor, and forces one to use a language-specific API, in which namespaces seem to be an afterthought, then I would be inclined to characterize that language as being not as well suited to using XML as other mainstream languages that do not have these issues. "Bad at XML" is just a shorthand for these conditions, and I think it is a fair way to characterize it. As I will describe, my initial experience with Python has raised concerns about whether it fulfils these conditions; but, because in general my experience with Python has been quite positive, it seems likely that I'm missing something, thus motivating this question.

I'm trying to do some very simple XML processing with Python. I had initially hoped to be able to reuse my knowledge of standard W3C DOM API's, and happily found that the xml.dom and xml.dom.minidom modules did a good job of supporting these API's. Unfortunately, however, serialization proved to be problematic, for the following reasons:

  • xml.dom does not come with a serializer
  • the PyXML library, which includes a serializer for xml.dom, is no longer maintained, AND
  • minidom does not support serialization of namespaces, even though namespaces are supported in the API

I looked through the list of other W3C-like libraries here:

http://wiki.python.org/moin/PythonXml#W3CDOM-likelibraries

I found that many other libraries, such as 4Suite and libxml2dom, are also not maintained.

On the other hand, itools at first glance appears to be maintained, but there does not appear to be an Ubuntu/Debian package available, and so would be difficult to deploy and maintain.

At this point, it seemed like trying to use W3C DOM API's in my Python application was going to be dead-end, and I began to look at the ElementTree API. But the way the eTree API supports namespaces I think is horribly ugly, requiring one to use string concatenation every time an element in a particular namespace is created:

http://codespeak.net/lxml/tutorial.html#namespaces

So, my question is, have I overlooked something, or is support for XML (in particular W3C DOM) actually quite bad in Python?

EDIT

Here follows a list of more precise questions, the answers to which would really help me:

  • Is there reasonable support for W3C DOM in Python?
  • If not xml.dom, do you use e.g. etree instead of W3C DOM?
  • If so, which library is best, and how do you overcome the issues regarding namespacing in the API?
  • If you use W3C DOM instead, are you aware of a library that implements serialization with support for namespaces?
+4  A: 

I would say python handles XML pretty well. The number of different libraries available speaks to that - you have lots of options. And if there are features missing from libraries that you would like to use, feel free to contribute some patches!

I personally use the DOM and lxml.etree (etree is really fast). However, I feel your pain about the namespace thing. I wrote a quick helper function to deal with it:

DEFAULT_NS = "http://www.domain.org/path/to/xml"

def add_xml_namespace(path, namespace=DEFAULT_NS):
    """Adds namespaces to an XPath-ish expression path for etree

    Test simple expression:
    >>> add_xml_namespace('image/namingData/fileBaseName')
    '{http://www.domain.org/path/to/xml}image/{http://www.domain.org/path/to/xml}namingData/{http://www.domain.org/path/to/xml}fileBaseName'

    More complicated expression
    >>> add_xml_namespace('.//image/*') 
    './/{http://www.domain.org/path/to/xml}image/*'

    >>> add_xml_namespace('.//image/text()')
    './/{http://www.domain.org/path/to/xml}image/text()'
    """
    pattern = re.compile(r'^[A-Za-z0-9-]+$')
    tags = path.split('/')
    for i in xrange(len(tags)):
        if pattern.match(tags[i]):
            tags[i] = "{%s}%s" % (namespace, tags[i])
    return '/'.join(tags)

I use it like so:

from lxml import etree
from utilities import add_xml_namespace as ns

tree = etree.parse('file.xml')
node = tree.get_root().find(ns('root/group/subgroup'))
# etc. 

If you don't know the namespace ahead of time, you can extract it from a root node:

tree = etree.parse('file.xml')
root = tree.getroot().tag
namespace = root[1:root.index('}')]
ns = lambda path: add_xml_namespace(path, namespace)
...

Additional comment: There is a little work involved here, but work is necessary when dealing with XML. That's not a python issue, it's an XML issue.

Seth
"If you want it, submit a patch" is the most embarrassing open source one-liner. (And, of course it's a Python issue--it's not XML's fault if a Python API is missing functions and makes you do the work, it's the API's fault.)
Glenn Maynard
For an end-user application, it sounds bad, but for a library written for programmers, what's wrong with that? I've had to use many proprietary libraries (especially in less-dynamic languages) where the situation was basically "If you need one trivial feature we didn't think to include, #@!$ off". People afraid of writing a little code aren't the ones downloading XML libraries in the first place...
Ken
@Ken: It's an entirely different situation when the library is no longer maintained, or left in an ambiguous state, as seems to be the case for many of these python XML libraries. Maintaining a library is much more work than simply submitting a patch.
echo-flow
@echo-flow: I don't understand what's your problem with `lxml`. I for one am very happy with it, and my current project revolves **solely** around XML! What features are you missing? (Serialization is simply `etree.tostring(the_node, formatting_and_encoding_options)`)
delnan
@delnan: I think the ElementTree APIs involving namespaces are unfortunate, and do not seem fully thought-out in comparison with W3C DOM. Requiring one to do string manipulation to create a namespaced element, or extract the namespace from an existing element, seems like a brittle and hacked-together solution compared with DOM APIs. With that said, since I originally asked this question, I have been using lxml in my project, as that seems to be the popular choice, and it's been working well enough for me. It's just not optimal. Do you find it difficult working with namespaces in your project?
echo-flow
@echo-flow: Although the documents in questions make heavy use of namespaces (IIRC, I didn't encounter a single tag that's not in some namespace while testing), it was only a minor annoyance. For the most part, it meant calling `.format()` once before evaluating an XPath query, and ignoring the `{...}` while reading debugging output. Maybe I'm just lucky since I read the XMLs into a custom, much more lightweight data structure for processing and convert it back to an etree for output (preserving namespaces, but that's trivial) - so only those parts need to deal with namespaces.
delnan
Having a plethora of external options for something so common is considered a shortcoming in Python.
Edmund
@delnan: thank you for the useful reply. @Edmund: could you please elaborate on what you mean? What are the "external options" here and what is common?
echo-flow
+1  A: 

Python is great at handling XML, I consider lxml to be the best xml library I have ever worked with, it is powerful and significantly simpler the DOM. The namespace handling took some getting used to, but I think it is another great way lxml keeps things simple.

EDIT

After re reading the question, it is unclear exactly whether the the author meant the serialization of python objects, or just the DOM tree. The portion of my answer bellow assumed the former.

XML serialization is a completely different issue. Personal I don't think it is very important. Most XML serializers produce output that its pretty specific to the language or runtime, which defeats the purpose of having such an open format. I realize that there are some generic XML serialization schema, but Python provides 2 solutions that are superior for 95% of situations, Pickling and JSON.

If your application doesn't have to share objects with non-python systems, Pickling is the fastest and most powerful serialization solution you will find. JSON is significantly faster to parse and generate, and much easier to work with than XML. JSON has plenty of limitations, but it is frequently easier to work around them, than deal with the headaches of XML.

There are plenty of other serialization formats that, depending on the application, I would recommend ahead of XML (E.G.: Google Protocol Buffers, or YAML.)

Also, don't forget about SAX. Event driven parsers are only useful for reading XML, but I have found that it is still the best solution for some problems.

mikerobi
+1  A: 

But the way the eTree API supports namespaces I think is horribly ugly, requiring one to use string concatenation every time an element in a particular namespace is created

Here's how you create an element with a namespace in .NET's System.Linq.Xml DOM:

XNamespace ns = "my-namespace";
XElement elm = new XElement(ns + "foo");

Here's how you create an element in a namespace in lxml:

ns = "{my-namespace}"
elm = etree.Element(ns + "foo")

I'm not seeing horrible ugliness here. In fact, the developers of the .NET API have bent over backwards, creating base classes that support operator overloading, to make it possible for their API to handle namespaces as intuitively as lxml's does.

How this is uglier than the W3C DOM requiring you to use different methods to create elements with and without namespaces is beyond me.

Robert Rossney
More problematic than creation is querying. To get the namespace of an element using W3C DOM, you simply access its namespaceURI property; to get the tag name, it's localName. If you look at @Seth's example, you can see that accessing the namespaceURI using etree requires the following invocation "root[1:root.index('}')]". I think this is not as clean. If there's a better way to do this, however, that would be good to know.
echo-flow
I agree with you there; it's kind of dumb that `Element` doesn't have a `localName` property. On the other hand, if you're querying XML by iterating over DOM objects and looking at the names of elements, instead of using XPath, you're probably making other mistakes too.
Robert Rossney