Musing over a recently asked question, I started to wonder if there is a really simple way to deal with XML documents in Python. A pythonic way, if you will.

Perhaps I can explain best with an example: let's say the following - which I think is a good example of how XML is (mis)used in web services - is the response I get from an HTTP request to http://www.google.com/ig/api?weather=94043

<xml_api_reply version="1">
  <weather module_id="0" tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0" >
    <forecast_information>
      <city data="Mountain View, CA"/>
      <postal_code data="94043"/>
      <latitude_e6 data=""/>
      <longitude_e6 data=""/>
      <forecast_date data="2010-06-23"/>
      <current_date_time data="2010-06-24 00:02:54 +0000"/>
      <unit_system data="US"/>
    </forecast_information>
    <current_conditions>
      <condition data="Sunny"/>
      <temp_f data="68"/>
      <temp_c data="20"/>
      <humidity data="Humidity: 61%"/>
      <icon data="/ig/images/weather/sunny.gif"/>
      <wind_condition data="Wind: NW at 19 mph"/>
    </current_conditions>
    ...
    <forecast_conditions>
      <day_of_week data="Sat"/>
      <low data="59"/>
      <high data="75"/>
      <icon data="/ig/images/weather/partly_cloudy.gif"/>
      <condition data="Partly Cloudy"/>
    </forecast_conditions>
  </weather>
</xml_api_reply>

After loading/parsing such a document, I would like to be able to access the information as simply as, say,

>>> xml['xml_api_reply']['weather']['forecast_information']['city'].data
'Mountain View, CA'

or

>>> xml.xml_api_reply.weather.current_conditions.temp_f['data']
'68'

From what I have seen so far, ElementTree seems the closest to what I dream of. But it's not quite there; there is still some fumbling to do when consuming XML. OTOH, what I have in mind is not that complicated - probably just a thin veneer on top of a parser - and yet it can decrease the annoyance of dealing with XML. Is there such magic? (And if not - why?)

PS. Note I have tried BeautifulSoup already and while I like its approach, it has real issues with empty <element/>s - see below in comments for examples.

+2  A: 

If you don't mind using a 3rd party library, then BeautifulSoup will do almost exactly what you ask for:

>>> from BeautifulSoup import BeautifulStoneSoup
>>> soup = BeautifulStoneSoup('''<snip>''')
>>> soup.xml_api_reply.weather.current_conditions.temp_f['data']
u'68'
Mike Boers
I already looked at Beautiful[Stone]Soup - but it is **broken** (as documented http://www.crummy.com/software/BeautifulSoup/documentation.html ) for empty tags like <element/> - which are aplenty in the example. For example `soup.xml_api_reply.weather.current_conditions.icon` returns `<icon data="/ig/images/weather/partly_cloudy.gif"><wind_condition data="Wind: N at 25 mph"></wind_condition></icon>` or you can get `temp_c` via `soup.xml_api_reply.weather.current_conditions.condition.temp_f.temp_c['data']` which seems demented to me
Nas Banov
It will work if 1) you know what empty tags you're looking for, and 2) they can be relied upon to be empty. Then you can specify them as an argument to the parser: selfClosingTags=['city','postal_code', ...]
Owen S.
@Owen S: <nod>, selfClosingTags indeed helped with the StoneSoup but one shouldn't have to do that really. The example is chock full of empty tags (that should have been attributes ... but aren't) - and so is the case in many XMLs
Nas Banov
A: 

If you haven't already, I'd suggest looking into the DOM API for Python. DOM is a pretty widely used XML interpretation system, so it should be pretty robust.

It's probably a little more complicated than what you describe, but that comes from its attempts to preserve all the information implicit in XML markup rather than from bad design.
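For comparison, here is a minimal sketch of what that looks like with the standard library's xml.dom.minidom. The XML is a trimmed copy of the reply from the question, and the variable names are only illustrative.

```python
from xml.dom import minidom

# Trimmed copy of the weather reply from the question.
XML = """<xml_api_reply version="1">
  <weather module_id="0">
    <current_conditions>
      <condition data="Sunny"/>
      <temp_f data="68"/>
    </current_conditions>
  </weather>
</xml_api_reply>"""

doc = minidom.parseString(XML)

# DOM has no attribute-style shortcut: look the element up by tag name,
# then read the attribute explicitly.
temp_f = doc.getElementsByTagName("temp_f")[0]
temp = temp_f.getAttribute("data")
print(temp)
```

Note that getElementsByTagName searches the whole document, so it does not express the weather/current_conditions nesting the way the question's dotted access would.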

tlayton
The question is specifically for easy, pythonic XML access. DOM is many things (not _all_ of which are evil, I admit), but "easy" and "pythonic" it most certainly is _not_. Reverting to DOM for interacting with XML is like dropping to C (or worse, assembly) for a webapp -- it should be done rarely and only for remarkably good reason.
Nicholas Knight
And I'd also like to note that's not because of preserving XML structure; it's because that library tries hard to adhere to a cross-language API for its interface. There are more Pythonic libraries that preserve the XML structure quite precisely.
Owen S.
+3  A: 

Take a look at Amara 2, particularly the Bindery part of this tutorial.

It works in a way pretty similar to what you describe.

On the other hand, ElementTree's find*() methods can give you 90% of that and are packaged with Python.
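As a rough sketch of the find*() approach, using only the standard library and a trimmed copy of the XML from the question (the variable names are just for illustration):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the weather reply from the question.
XML = """<xml_api_reply version="1">
  <weather module_id="0">
    <forecast_information>
      <city data="Mountain View, CA"/>
    </forecast_information>
    <current_conditions>
      <temp_f data="68"/>
    </current_conditions>
    <forecast_conditions>
      <day_of_week data="Sat"/>
      <high data="75"/>
    </forecast_conditions>
  </weather>
</xml_api_reply>"""

root = ET.fromstring(XML)  # root is the <xml_api_reply> element

# find() takes a slash-separated path relative to the element it is called on.
city = root.find("weather/forecast_information/city").get("data")
temp_f = root.find("weather/current_conditions/temp_f").get("data")

# findall() returns every match, which covers repeated elements.
highs = [fc.find("high").get("data")
         for fc in root.findall("weather/forecast_conditions")]

print(city, temp_f, highs)
```

find() returns None when nothing matches, so a real caller would probably want to check for that before calling .get().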

Walter Mundt
I looked and indeed `amara.bindery` does what I am looking for - but it seems way too big (600k installer, 3MB source) - it's like someone said: I wanted a banana, but now I got a "free" gorilla with it. Re ElementTree find*() - it's close but lacks that pythonic []/iterator veneer I was thinking about
Nas Banov
A: 

I believe that Python's built-in xml package will do the trick. Look at "xml.parsers.expat".

xml.parsers.expat
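For reference, a minimal sketch of what using expat directly looks like. It is callback-based, so you collect the values yourself; the `found` dictionary and the handler name are illustrative choices, and the XML is a trimmed copy of the reply from the question.

```python
import xml.parsers.expat

# Trimmed copy of the weather reply from the question.
XML = """<xml_api_reply version="1">
  <weather module_id="0">
    <current_conditions>
      <condition data="Sunny"/>
      <temp_f data="68"/>
    </current_conditions>
  </weather>
</xml_api_reply>"""

found = {}  # maps tag name -> value of its "data" attribute

def start_element(name, attrs):
    # expat calls this for every opening tag; attrs is a plain dict.
    if "data" in attrs:
        found.setdefault(name, attrs["data"])

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(XML, True)  # True marks this as the final chunk of input

print(found["temp_f"])
```

The parser gives you a stream of events rather than a tree, so any object-style access has to be built on top of it.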

iform
A low-level SAXlike parser interface provides a Pythonic object interface to the parsed XML document?? What am I missing?
Owen S.
+3  A: 

I highly recommend lxml.etree and xpath to parse and analyse your data. Here is a complete example. I have truncated the xml to make it easier to read.

import lxml.etree

s = b"""<?xml version="1.0" encoding="utf-8"?>
<xml_api_reply version="1">
  <weather module_id="0" tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0" >
    <forecast_information>
      <city data="Mountain View, CA"/> <forecast_date data="2010-06-23"/>
    </forecast_information>
    <forecast_conditions>
      <day_of_week data="Sat"/>
      <low data="59"/>
      <high data="75"/>
      <icon data="/ig/images/weather/partly_cloudy.gif"/>
      <condition data="Partly Cloudy"/>
    </forecast_conditions>
  </weather>
</xml_api_reply>"""

tree = lxml.etree.fromstring(s)
for weather in tree.xpath('/xml_api_reply/weather'):
    # find() cannot select attributes; xpath() returns a list of attribute values
    print(weather.xpath('forecast_information/city/@data')[0])
    print(weather.xpath('forecast_information/forecast_date/@data')[0])
    print(weather.xpath('forecast_conditions/low/@data')[0])
    print(weather.xpath('forecast_conditions/high/@data')[0])
Jerub
Seems fairly easy indeed and I am taking note - but it is more xpath-ish (xpathologic?) than pythonic.
Nas Banov
+4  A: 

You want a thin veneer? That's easy to cook up. Try the following trivial wrapper around ElementTree as a start:

# geetree.py
import xml.etree.ElementTree as ET

class GeeElem(object):
    """Wrapper around an ElementTree element. a['foo'] gets the
       attribute foo, a.foo gets the first subelement foo."""
    def __init__(self, elem):
        self.etElem = elem

    def __getitem__(self, name):
        # a['foo']: a missing attribute raises KeyError, like a dict
        res = self._getattr(name)
        if res is None:
            raise KeyError("No attribute named '%s'" % name)
        return res

    def __getattr__(self, name):
        # a.foo: a missing subelement raises AttributeError, as hasattr() expects
        res = self._getelem(name)
        if res is None:
            raise AttributeError("No element named '%s'" % name)
        return res

    def _getelem(self, name):
        res = self.etElem.find(name)
        if res is None:
            return None
        return GeeElem(res)

    def _getattr(self, name):
        return self.etElem.get(name)

class GeeTree(object):
    "Wrapper around an ElementTree."
    def __init__(self, fname):
        self.doc = ET.parse(fname)

    def __getattr__(self, name):
        # Only the root element's own tag name is reachable from the tree.
        if self.doc.getroot().tag != name:
            raise AttributeError("No element named '%s'" % name)
        return GeeElem(self.doc.getroot())

    def getroot(self):
        return self.doc.getroot()

You invoke it so:

>>> import geetree
>>> t = geetree.GeeTree('foo.xml')
>>> t.xml_api_reply.weather.forecast_information.city['data']
'Mountain View, CA'
>>> t.xml_api_reply.weather.current_conditions.temp_f['data']
'68'
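If you also want iteration and repeated elements, the same idea extends easily. The following is a standalone sketch, not part of geetree.py: the class name `Node` and its `all()` method are hypothetical, it parses from a string instead of a file, and it wraps the root element directly (so access starts at `root.weather` rather than `t.xml_api_reply.weather`). The XML is a trimmed copy of the reply from the question.

```python
import xml.etree.ElementTree as ET

class Node(object):
    """Hypothetical variant of GeeElem: n.foo is the first child <foo>,
    n['attr'] is an attribute, and iterating n yields wrapped children."""

    def __init__(self, elem):
        self._elem = elem

    def __getattr__(self, name):
        # Private names never map to elements; this also avoids recursion
        # if _elem itself is looked up before __init__ has run.
        if name.startswith("_"):
            raise AttributeError(name)
        child = self._elem.find(name)
        if child is None:
            raise AttributeError("No element named %r" % name)
        return Node(child)

    def __getitem__(self, name):
        try:
            return self._elem.attrib[name]
        except KeyError:
            raise KeyError("No attribute named %r" % name)

    def __iter__(self):
        # Iterate over direct children, wrapped.
        for child in self._elem:
            yield Node(child)

    def all(self, name):
        # Every direct child called `name`, for repeated elements.
        # Note: this shadows any child element that is itself named "all".
        return [Node(e) for e in self._elem.findall(name)]

XML = """<xml_api_reply version="1">
  <weather module_id="0">
    <current_conditions>
      <condition data="Sunny"/>
      <temp_f data="68"/>
      <temp_c data="20"/>
    </current_conditions>
    <forecast_conditions>
      <day_of_week data="Sat"/>
      <high data="75"/>
    </forecast_conditions>
  </weather>
</xml_api_reply>"""

root = Node(ET.fromstring(XML))  # wraps <xml_api_reply> itself

temp = root.weather.current_conditions.temp_f["data"]
current = [child["data"] for child in root.weather.current_conditions]
highs = [fc.high["data"] for fc in root.weather.all("forecast_conditions")]
print(temp, current, highs)
```

Using a method for repeated elements keeps plain attribute access simple, at the cost of reserving one name.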
Owen S.
+4  A: 

lxml has been mentioned. You might also check out lxml.objectify for some really simple manipulation.

>>> from lxml import objectify
>>> tree = objectify.fromstring(your_xml)
>>> tree.weather.attrib["module_id"]
'0'
>>> tree.weather.forecast_information.city.attrib["data"]
'Mountain View, CA'
>>> tree.weather.forecast_information.postal_code.attrib["data"]
'94043'
Ryan Ginstrom
++. Does what I asked for, although as in the Amara case, a free gorilla (a ton of non-distro library) comes with the order of a banana. Btw, it seems one can also use `.get('data')` instead of `.attrib['data']`
Nas Banov
A: 

The suds project provides a Web Services client library that works almost exactly as you describe -- provide it a wsdl and then use factory methods to create the defined types (and process the responses too!).

d.w.
Ok, that's interesting... but doesn't the name **suds** imply this is only for use with **SOAP**? The example above is not SOAPy and I don't want to go down the slippery-SOAP. Also, where do I find the WSDL of - for example - the weather service above?
Nas Banov
Yes, you're right -- suds is definitely geared towards SOAP web-services rather than generic XML. A WSDL would be the published contract for that kind of service. My bad, I had assumed you had rinsed the bubbles from that example prior to posting :-)
d.w.
A: 

I found the following python-simplexml module, which is the author's attempt to get something close to PHP's SimpleXML and is indeed a small wrapper around ElementTree. It's under 100 lines but seems to do what was requested:

>>> import urllib
>>> import SimpleXml
>>> x = SimpleXml.parse(urllib.urlopen('http://www.google.com/ig/api?weather=94043'))
>>> print x.weather.current_conditions.temp_f['data']
58
Nas Banov