ansaurus

Question

Is Python bad at XML?

Answer 1

+4 A:

I would say python handles XML pretty well. The number of different libraries available speaks to that - you have lots of options. And if there are features missing from libraries that you would like to use, feel free to contribute some patches!

I personally use the DOM and lxml.etree (etree is really fast). However, I feel your pain about the namespace thing. I wrote a quick helper function to deal with it:

DEFAULT_NS = "http://www.domain.org/path/to/xml"

def add_xml_namespace(path, namespace=DEFAULT_NS):
    """Adds namespaces to an XPath-ish expression path for etree

    Test simple expression:
    >>> add_xml_namespace('image/namingData/fileBaseName')
    '{http://www.domain.org/path/to/xml}image/{http://www.domain.org/path/to/xml}namingData/{http://www.domain.org/path/to/xml}fileBaseName'

    More complicated expression
    >>> add_xml_namespace('.//image/*') 
    './/{http://www.domain.org/path/to/xml}image/*'

    >>> add_xml_namespace('.//image/text()')
    './/{http://www.domain.org/path/to/xml}image/text()'
    """
    pattern = re.compile(r'^[A-Za-z0-9-]+$')
    tags = path.split('/')
    for i in xrange(len(tags)):
        if pattern.match(tags[i]):
            tags[i] = "{%s}%s" % (namespace, tags[i])
    return '/'.join(tags)

I use it like so:

from lxml import etree
from utilities import add_xml_namespace as ns

tree = etree.parse('file.xml')
node = tree.get_root().find(ns('root/group/subgroup'))
# etc.

If you don't know the namespace ahead of time, you can extract it from a root node:

tree = etree.parse('file.xml')
root = tree.getroot().tag
namespace = root[1:root.index('}')]
ns = lambda path: add_xml_namespace(path, namespace)
...

Additional comment: There is a little work involved here, but work is necessary when dealing with XML. That's not a python issue, it's an XML issue.

Seth 2010-10-23 19:40:14

"If you want it, submit a patch" is the most embarrassing open source one-liner. (And, of course it's a Python issue--it's not XML's fault if a Python API is missing functions and makes you do the work, it's the API's fault.)

Glenn Maynard 2010-10-23 20:30:26

For an end-user application, it sounds bad, but for a library written for programmers, what's wrong with that? I've had to use many proprietary libraries (especially in less-dynamic languages) where the situation was basically "If you need one trivial feature we didn't think to include, #@!$ off". People afraid of writing a little code aren't the ones downloading XML libraries in the first place...

Ken 2010-10-24 16:10:01

@Ken: It's an entirely different situation when the library is no longer maintained, or left in an ambiguous state, as seems to be the case for many of these python XML libraries. Maintaining a library is much more work than simply submitting a patch.

echo-flow 2010-10-24 17:58:17

@echo-flow: I don't understand what's your problem with `lxml`. I for one am very happy with it, and my current project revolves **solely** around XML! What features are you missing? (Serialization is simply `etree.tostring(the_node, formatting_and_encoding_options)`)

delnan 2010-10-24 18:39:58

@delnan: I think the ElementTree APIs involving namespaces are unfortunate, and do not seem fully thought-out in comparison with W3C DOM. Requiring one to do string manipulation to create a namespaced element, or extract the namespace from an existing element, seems like a brittle and hacked-together solution compared with DOM APIs. With that said, since I originally asked this question, I have been using lxml in my project, as that seems to be the popular choice, and it's been working well enough for me. It's just not optimal. Do you find it difficult working with namespaces in your project?

echo-flow 2010-10-24 19:16:58

@echo-flow: Although the documents in questions make heavy use of namespaces (IIRC, I didn't encounter a single tag that's not in some namespace while testing), it was only a minor annoyance. For the most part, it meant calling `.format()` once before evaluating an XPath query, and ignoring the `{...}` while reading debugging output. Maybe I'm just lucky since I read the XMLs into a custom, much more lightweight data structure for processing and convert it back to an etree for output (preserving namespaces, but that's trivial) - so only those parts need to deal with namespaces.

delnan 2010-10-24 20:19:56

Having a plethora of external options for something so common is considered a shortcoming in Python.

Edmund 2010-10-24 21:10:50

@delnan: thank you for the useful reply. @Edmund: could you please elaborate on what you mean? What are the "external options" here and what is common?

echo-flow 2010-10-24 22:21:07

Answer 2

+1 A:

Python is great at handling XML, I consider lxml to be the best xml library I have ever worked with, it is powerful and significantly simpler the DOM. The namespace handling took some getting used to, but I think it is another great way lxml keeps things simple.

EDIT

After re reading the question, it is unclear exactly whether the the author meant the serialization of python objects, or just the DOM tree. The portion of my answer bellow assumed the former.

XML serialization is a completely different issue. Personal I don't think it is very important. Most XML serializers produce output that its pretty specific to the language or runtime, which defeats the purpose of having such an open format. I realize that there are some generic XML serialization schema, but Python provides 2 solutions that are superior for 95% of situations, Pickling and JSON.

If your application doesn't have to share objects with non-python systems, Pickling is the fastest and most powerful serialization solution you will find. JSON is significantly faster to parse and generate, and much easier to work with than XML. JSON has plenty of limitations, but it is frequently easier to work around them, than deal with the headaches of XML.

There are plenty of other serialization formats that, depending on the application, I would recommend ahead of XML (E.G.: Google Protocol Buffers, or YAML.)

Also, don't forget about SAX. Event driven parsers are only useful for reading XML, but I have found that it is still the best solution for some problems.

mikerobi 2010-10-23 19:56:30

Answer 3

+1 A:

But the way the eTree API supports namespaces I think is horribly ugly, requiring one to use string concatenation every time an element in a particular namespace is created

Here's how you create an element with a namespace in .NET's System.Linq.Xml DOM:

XNamespace ns = "my-namespace";
XElement elm = new XElement(ns + "foo");

Here's how you create an element in a namespace in lxml:

ns = "{my-namespace}"
elm = etree.Element(ns + "foo")

I'm not seeing horrible ugliness here. In fact, the developers of the .NET API have bent over backwards, creating base classes that support operator overloading, to make it possible for their API to handle namespaces as intuitively as lxml's does.

How this is uglier than the W3C DOM requiring you to use different methods to create elements with and without namespaces is beyond me.

Robert Rossney 2010-10-24 20:48:34

More problematic than creation is querying. To get the namespace of an element using W3C DOM, you simply access its namespaceURI property; to get the tag name, it's localName. If you look at @Seth's example, you can see that accessing the namespaceURI using etree requires the following invocation "root[1:root.index('}')]". I think this is not as clean. If there's a better way to do this, however, that would be good to know.

echo-flow 2010-10-24 22:18:26

I agree with you there; it's kind of dumb that `Element` doesn't have a `localName` property. On the other hand, if you're querying XML by iterating over DOM objects and looking at the names of elements, instead of using XPath, you're probably making other mistakes too.

Robert Rossney 2010-10-24 23:02:24

ansaurus

tags:

views:

answers:

Is Python bad at XML?

EDIT

EDIT

related questions