I have many rows in a database that contain XML, and I'm trying to write a Python script that will go through those rows and count how many instances of a particular node attribute show up. For instance, my tree looks like:

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

I'm looking for the easiest way to access the attribute values 1 and 2 in the XML above.

+8  A: 

You can use BeautifulSoup

from BeautifulSoup import BeautifulSoup

x="""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

y=BeautifulSoup(x)
>>> y.foo.bar.type["foobar"]
u'1'

>>> y.foo.bar.findAll("type")
[<type foobar="1"></type>, <type foobar="2"></type>]

>>> y.foo.bar.findAll("type")[0]["foobar"]
u'1'
>>> y.foo.bar.findAll("type")[1]["foobar"]
u'2'
S.Mark
Interesting, I've always thought of Beautiful Soup as a brilliant HTML parsing library and API, but for some reason I never really thought of using it for XML. Hmmm…
Avi Flax
Actually, there is `BeautifulStoneSoup` in BeautifulSoup for XML.
S.Mark
I wouldn't rely too much on BeautifulSoup now that its future is uncertain. http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
ionut bizau
Thanks for the info, @ibz. Yeah, actually, if the source is not well-formed, it will be difficult for any parser to handle.
S.Mark
+12  A: 

I suggest ElementTree. There are other compatible implementations, such as lxml, but what they add is "just" even more speed; the ease of programming depends on the API, which ElementTree defines.

After building an Element instance e from the XML, e.g. with the XML function, just:

for atype in e.findall('.//type'):  # './/type' matches <type> at any depth below e
    print(atype.get('foobar'))

and the like.
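
For the tree in the question, a minimal end-to-end sketch might look like the following. It assumes the XML is already available as a string; the names `xml_text` and `counts` are illustrative and not from the question. It uses `iter()` so that `type` elements are found regardless of nesting depth.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample document from the question; in practice this would come from a database row.
xml_text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

root = ET.fromstring(xml_text)

# iter('type') walks the whole tree, so the <bar> wrapper does not matter.
counts = Counter(node.get('foobar') for node in root.iter('type'))

print(dict(counts))
```

To process many rows, the same `Counter` can be updated once per row instead of being rebuilt each time.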

Alex Martelli
You seem to ignore xml.etree.cElementTree, which comes with Python and in some aspects is faster than lxml ("lxml's iterparse() is slightly slower than the one in cET" -- e-mail from lxml author).
John Machin
+1  A: 

I'm still a Python newbie myself, but my impression is that ElementTree is the state-of-the-art in Python XML parsing and handling.

Mark Pilgrim has a good section on Parsing XML with ElementTree in his book Dive Into Python 3.

Avi Flax
+2  A: 

Python has an interface to the Expat XML parser:

xml.parsers.expat

It's a non-validating parser, so XML that is well-formed but does not conform to its DTD will not be caught (malformed XML still raises an error). But if you know your file is correct, then this is pretty good: you'll probably get the exact info you want, and you can discard the rest on the fly.

from xml.parsers import expat

stringofxml = """<foo>
    <bar>
        <type arg="value" />
        <type arg="value" />
        <type arg="value" />
    </bar>
    <bar>
        <type arg="value" />
    </bar>
</foo>"""

count = 0

def start(name, attr):
    # Called once for every opening tag; attr is a dict of that tag's attributes.
    global count
    if name == 'type':
        count += 1

p = expat.ParserCreate()
p.StartElementHandler = start
p.Parse(stringofxml)

print(count)  # prints 4
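
The same handler approach can also tally attribute values rather than just elements. Here is a minimal sketch, assuming the `foobar` attribute and sample tree from the question; `xml_text`, `foobar_counts` and `start_element` are illustrative names.

```python
from collections import Counter
from xml.parsers import expat

# Sample document from the question.
xml_text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

foobar_counts = Counter()

def start_element(name, attrs):
    # attrs maps attribute names to their string values for this tag.
    if name == 'type' and 'foobar' in attrs:
        foobar_counts[attrs['foobar']] += 1

parser = expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(xml_text, True)  # True marks this as the final chunk of input

print(dict(foobar_counts))
```

A parser object created by `ParserCreate()` handles a single document, so a fresh one is needed for each row.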
Tor Valamo
A: 

I find Python's xml.dom and xml.dom.minidom quite easy to use. Keep in mind that DOM isn't good for large amounts of XML, but if your input is fairly small then this will work fine.
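
As a rough illustration for the tree in the question, something like this should work. It assumes the XML is held in a string; `xml_text` and `values` are illustrative names.

```python
from xml.dom import minidom

# Sample document from the question.
xml_text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

doc = minidom.parseString(xml_text)

# getElementsByTagName searches the whole document, not just direct children.
values = [node.getAttribute('foobar') for node in doc.getElementsByTagName('type')]

print(values)
```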

Evgeny
+4  A: 

minidom is the quickest and pretty straightforward:

XML:

<data>
    <items>
     <item name="item1"></item>
     <item name="item2"></item>
     <item name="item3"></item>
     <item name="item4"></item>
    </items>
</data>

PYTHON:

from xml.dom import minidom

xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
print(itemlist[0].attributes['name'].value)
for s in itemlist:
    print(s.attributes['name'].value)

OUTPUT:

4
item1
item1
item2
item3
item4

Ryan Christensen
Really Helpful! Thanks for the simple example
Nimbuz
+1  A: 

lxml.objectify is really simple.

Taking your sample text:

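from lxml import objectify
from collections import defaultdict

# text is assumed to hold the sample XML from the question
count = defaultdict(int)

root = objectify.fromstring(text)

# root.bar.type iterates over all <type> siblings under the first <bar>
for item in root.bar.type:
    count[item.attrib.get("foobar")] += 1

print(dict(count))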

Output:

{'1': 1, '2': 1}
Ryan Ginstrom