ansaurus

Question

Good python XML parser to work with namespace heavy documents

Answer 1

+1 A:

How about:

http://docs.python.org/library/pyexpat.html

pyfunc 2010-09-24 09:12:44

Do you have an example of how it can be used with namespaces?

Frank Malina 2010-09-24 09:29:33
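
A minimal sketch of namespace handling with pyexpat, offered as an illustration rather than the answerer's own code. It assumes the standard-library `xml.parsers.expat` interface and reuses the small `<root xmlns="foo" ...>` document that appears later in this thread. The helper name `qualified_names` is hypothetical. Passing `namespace_separator` to `ParserCreate` turns on namespace processing, so each element name arrives as the namespace URI, the separator, and the local name.

```python
from xml.parsers import expat

# Small sample document borrowed from the lxml answer in this thread.
DOC = b'<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz /></bar></root>'


def qualified_names(data, sep=" "):
    """Return (namespace_uri, local_name) for every start tag in data.

    Hypothetical helper for illustration only.
    """
    names = []
    # A non-None namespace_separator enables namespace processing.
    parser = expat.ParserCreate(namespace_separator=sep)

    def start(name, attrs):
        # name looks like "uri<sep>local"; without a namespace it is just "local".
        uri, _, local = name.rpartition(sep)
        names.append((uri, local))

    parser.StartElementHandler = start
    parser.Parse(data, True)  # True marks this as the final chunk
    return names


names = qualified_names(DOC)
print(names)
```

The prefixes (`stuff:`) are gone by the time the handler runs; only the URIs matter, which is usually what you want in a namespace-heavy document.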

Answer 2

A:

libxml (http://xmlsoft.org/) is the best and fastest library for XML parsing. There are bindings for Python.

iscarface 2010-09-24 09:54:50

lxml from codespeak wraps and uses libxml

ma3 2010-09-25 04:41:32

Answer 3

A:

SAX handles namespaces too, thanks to its namespace mode.

mripard 2010-09-24 09:58:25

Example would be appreciated, thank you.

Frank Malina 2010-09-25 10:51:23
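
A rough sketch of SAX in namespace mode, assuming the standard-library `xml.sax` module; it is not the answerer's code. The class name `NSCollector` and the function `collect` are hypothetical. With `feature_namespaces` switched on, the parser calls `startElementNS` instead of `startElement`, and the element name is passed as a `(uri, localname)` tuple.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# Small sample document borrowed from the lxml answer in this thread.
DOC = b'<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz /></bar></root>'


class NSCollector(ContentHandler):
    """Hypothetical handler that records (uri, localname) for each element."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def startElementNS(self, name, qname, attrs):
        # In namespace mode, name is a (uri, localname) tuple.
        uri, local = name
        self.elements.append((uri, local))


def collect(data):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)  # turn on namespace mode
    handler = NSCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.elements


elements = collect(DOC)
print(elements)
```

Compared with the XPath approach, a handler like this has to track its own position in the document, so it tends to pay off mainly for very large inputs.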

Answer 4

+6 A:

lxml is namespace-aware.

>>> from lxml import etree
>>> et = etree.XML("""<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz /></bar></root>""")
>>> etree.tostring(et, encoding=str) # encoding=str only needed in Python 3, to avoid getting bytes
'<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz/></bar></root>'
>>> et.xpath("f:bar", namespaces={"b":"bar", "f": "foo"})
[<Element {foo}bar at ...>]

Edit: On your example:

from lxml import etree

# remove the b prefix in Python 2
# needed in python 3 because
# "Unicode strings with encoding declaration are not supported."
et = etree.XML(b"""...""")

ns = {
    'lom': 'http://ltsc.ieee.org/xsd/LOM',
    'zs': 'http://www.loc.gov/zing/srw/',
    'dc': 'http://purl.org/dc/elements/1.1/',
    'voc': 'http://www.schooletc.co.uk/vocabularies/',
    'srw_dc': 'info:srw/schema/1/dc-schema'
}

# according to the docs, .xpath always returns a list when querying for elements
# .find returns a single element, but only supports a subset of XPath
record = et.xpath("zs:records/zs:record", namespaces=ns)[0]
# in this example, we know there's only one record;
# otherwise, apply the following to every element the query above returns

# ".//" keeps the search inside this record; a bare "//" would start at the document root
name = record.xpath(".//voc:name", namespaces=ns)[0].text
print("name:", name)

lom_entry = record.xpath("zs:recordData/srw_dc:dc/"
                         "lom:metaMetadata/lom:identifier/"
                         "lom:entry",
                         namespaces=ns)[0].text

print('lom_entry:', lom_entry)

# "el" rather than "id", so the builtin id() is not shadowed
lom_ids = [el.text for el in
           record.xpath("zs:recordData/srw_dc:dc/"
                        "lom:classification/lom:taxonPath/"
                        "lom:taxon/lom:id",
                        namespaces=ns)]

print("lom_ids:", lom_ids)

Output:

name: Frank Malina
lom_entry: 2.6
lom_ids: ['PYTHON', 'XML', 'XML-NAMESPACES']
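
The comments above note that, with more than one record, the per-record queries should be applied to every element the first query returns. A minimal sketch of that loop follows. To stay dependency-free it uses the standard library's `xml.etree.ElementTree`, whose `findall` accepts the same prefix-to-URI mapping; with lxml the equivalent call would be `record.xpath(..., namespaces=ns)`. The `SAMPLE` document, its root element, and the names in it are invented stand-ins, since the real XML is elided in this thread, and `names_per_record` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented stand-in document (the thread's real XML is not shown).
SAMPLE = b"""<zs:searchRetrieveResponse
    xmlns:zs="http://www.loc.gov/zing/srw/"
    xmlns:voc="http://www.schooletc.co.uk/vocabularies/">
  <zs:records>
    <zs:record><zs:recordData><voc:name>first</voc:name></zs:recordData></zs:record>
    <zs:record><zs:recordData><voc:name>second</voc:name></zs:recordData></zs:record>
  </zs:records>
</zs:searchRetrieveResponse>"""

# Same prefix map idea as in the answer; the prefixes are local shorthand.
ns = {
    'zs': 'http://www.loc.gov/zing/srw/',
    'voc': 'http://www.schooletc.co.uk/vocabularies/',
}


def names_per_record(data):
    """Return one list of voc:name texts per zs:record (hypothetical helper)."""
    root = ET.fromstring(data)
    result = []
    for record in root.findall("zs:records/zs:record", ns):
        # ".//" searches only below the current record.
        result.append([el.text for el in record.findall(".//voc:name", ns)])
    return result


per_record = names_per_record(SAMPLE)
print(per_record)
```

Keeping each query relative to `record` is what stops values from one record leaking into another.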

delnan 2010-09-24 11:03:53

+1 lxml is the only python tool/package you'll ever need for xml/xslt/xpath related tasks

ma3 2010-09-25 04:42:47

Edit: How would you code around the example provided? The lack of recipes on the web for this sort of lxml work is appalling. At the moment, I have proceeded by stripping the namespaces and traversing with BeautifulSoup. This is suboptimal on a number of levels.

Frank Malina 2010-09-25 10:46:16

@Frank Malina: XPath isn't lxml-specific, and there are some usable resources on XPath across the web. But I will make a stab at it...

delnan 2010-09-25 11:00:35

That is actually quite beautiful.

Frank Malina 2010-09-25 19:31:02

I know XML and XPath backwards and forwards and I've always found using lxml a challenge because of the lack of good examples. The above is really valuable. Thanks.

Robert Rossney 2010-09-25 20:35:05
