ansaurus

Question

Answer 1

+8 A:

from BeautifulSoup import BeautifulSoup

x="""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

y=BeautifulSoup(x)
>>> y.foo.bar.type["foobar"]
u'1'

>>> y.foo.bar.findAll("type")
[<type foobar="1"></type>, <type foobar="2"></type>]

>>> y.foo.bar.findAll("type")[0]["foobar"]
u'1'
>>> y.foo.bar.findAll("type")[1]["foobar"]
u'2'

S.Mark 2009-12-16 05:12:43

Interesting, I've always thought of Beautiful Soup as a brilliant HTML parsing library and API, but for some reason I never really thought of using it for XML. Hmmm…

Avi Flax 2009-12-16 05:24:58

Actually, there is `BeautifulStoneSoup` in BeautifulSoup for XML.

S.Mark 2009-12-16 05:28:46

I wouldn't rely too much on BeautifulSoup now that its future is uncertain. http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

ionut bizau 2009-12-16 06:10:44

Thanks for the info, @ibz. And yes, if the source is not well-formed, it will be difficult for any parser to handle.

S.Mark 2009-12-16 06:27:36

Answer 2

+12 A:

I suggest ElementTree (there are other compatible implementations, such as lxml, but what they add is "just" even more speed; the ease of programming depends on the API, which ElementTree defines).

After building an Element instance e from the XML, e.g. with the XML function, just:

for atype in e.findall('type'):
  print(atype.get('foobar'))

and the like.
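
A fuller sketch of the same idea, assuming the sample XML from the first answer (the question's own XML is not shown on this page) and a made-up helper name, foobar_values. Note that findall with a bare tag name matches direct children only, so with <foo> as the root the path has to be 'bar/type' rather than 'type':

```python
import xml.etree.ElementTree as ET

# Sample document borrowed from the first answer (an assumption; the
# question's own XML is not reproduced on this page).
SAMPLE = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""


def foobar_values(xml_text):
    """Return the foobar attribute of every foo/bar/type element."""
    root = ET.XML(xml_text)  # root is the <foo> element
    # 'bar/type' walks root -> bar -> type; './/type' would match at any depth.
    return [atype.get("foobar") for atype in root.findall("bar/type")]


print(foobar_values(SAMPLE))
```

If the type elements can appear at varying depths, './/type' is probably the more forgiving path.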

Alex Martelli 2009-12-16 05:21:55

You seem to ignore xml.etree.cElementTree which comes with Python and in some aspects is faster than lxml ("lxml's iterparse() is slightly slower than the one in cET" -- e-mail from lxml author).
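
For reference, incremental parsing with iterparse looks roughly like the sketch below. It imports xml.etree.ElementTree, which on Python 3.3 and later uses the C accelerator automatically (the separate cElementTree module was removed in Python 3.9). The in-memory sample and the helper name iter_foobar are made up for illustration:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like the XML in the first answer.
SAMPLE = b"""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""


def iter_foobar(source):
    """Yield the foobar attribute of each <type> element as it is parsed."""
    # "end" events fire once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "type":
            yield elem.get("foobar")
            elem.clear()  # drop the element's contents once processed


print(list(iter_foobar(io.BytesIO(SAMPLE))))
```

iterparse also accepts a filename, so the BytesIO wrapper is only there to keep the sketch self-contained.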

John Machin 2009-12-16 11:37:14

Answer 3

+1 A:

I'm still a Python newbie myself, but my impression is that ElementTree is the state-of-the-art in Python XML parsing and handling.

Mark Pilgrim has a good section on Parsing XML with ElementTree in his book Dive Into Python 3.

Avi Flax 2009-12-16 05:23:31

Answer 4

+2 A:

Python has an interface to the expat XML parser.

xml.parsers.expat

It's a non-validating parser, so XML that is well-formed but does not conform to its DTD will not be caught (XML that is not well-formed still raises an error). But if you know your file is correct, then this is pretty good, and you'll probably get the exact info you want and you can discard the rest on the fly.

from xml.parsers import expat

stringofxml = """<foo>
    <bar>
        <type arg="value" />
        <type arg="value" />
        <type arg="value" />
    </bar>
    <bar>
        <type arg="value" />
    </bar>
</foo>"""
count = 0

def start(name, attr):
    # Called for every opening tag; attr is a dict of its attributes.
    global count
    if name == 'type':
        count += 1

p = expat.ParserCreate()
p.StartElementHandler = start
p.Parse(stringofxml)

print(count)  # prints 4
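
The same handler approach can pull out attribute values instead of counting tags. This is a minimal sketch: the helper name collect_attr and the inline sample strings are made up for illustration, and the last part shows that input which is not well-formed raises ExpatError rather than passing silently:

```python
from xml.parsers import expat


def collect_attr(xml_text, tag, attr):
    """Return the attr value of every <tag> start element, in document order."""
    values = []

    def start(name, attrs):
        # attrs is a dict mapping attribute names to their values.
        if name == tag and attr in attrs:
            values.append(attrs[attr])

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(xml_text, True)  # True marks this as the final chunk of input
    return values


print(collect_attr('<foo><bar><type arg="a"/><type arg="b"/></bar></foo>',
                   "type", "arg"))

try:
    collect_attr("<foo><bar></foo>", "type", "arg")  # mismatched tags
except expat.ExpatError as err:
    print("not well-formed:", err)
```

Keeping the list inside the function avoids the module-level global used above, which matters once more than one document is parsed.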

Tor Valamo 2009-12-16 05:28:00

Answer 5

A:

I find the Python xml.dom and xml.dom.minidom quite easy. Keep in mind that DOM isn't good for large amounts of XML, but if your input is fairly small then this will work fine.

Evgeny 2009-12-16 05:28:55

Answer 6

+4 A:

minidom is the quickest and pretty straightforward:

XML:

<data>
    <items>
     <item name="item1"></item>
     <item name="item2"></item>
     <item name="item3"></item>
     <item name="item4"></item>
    </items>
</data>

PYTHON:

from xml.dom import minidom
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))  # number of <item> elements
print(itemlist[0].attributes['name'].value)  # name of the first item
for s in itemlist:
    print(s.attributes['name'].value)

OUTPUT:

4
item1
item1
item2
item3
item4

Ryan Christensen 2009-12-16 05:30:15

Really helpful! Thanks for the simple example.

Nimbuz 2010-01-19 06:51:57

Answer 7

+1 A:

lxml.objectify is really simple.

Taking your sample text:

from lxml import objectify
from collections import defaultdict

count = defaultdict(int)

# text is assumed to hold the question's sample XML as a string
root = objectify.fromstring(text)

for item in root.bar.type:
    count[item.attrib.get("foobar")] += 1

print(dict(count))

Output:

{'1': 1, '2': 1}

Ryan Ginstrom 2009-12-16 10:42:24


easiest way to parse xml in python
