I have a problem retrieving information from an XML tree.

My XML has this shape:

<?xml version="1.0"?>
<records xmlns="http://www.mysyte.com/foo">
  <record>
    <id>first</id>
    <name>john</name>
    <papers>
      <paper>john_1</paper>
      <paper>john_2</paper>
    </papers>
  </record>
  <record>
    <id>second</id>
    <name>mike</name>
    <papers>
      <paper>mike_a</paper>
      <paper>mike_b</paper>
    </papers>
  </record>
  <record>
    <id>third</id>
    <name>albert</name>
    <papers>
      <paper>paper of al</paper>
      <paper>other paper</paper>
    </papers>
  </record>
</records>

What I want to do is extract tuples of data like the following:

[{'code': 'first', 'name': 'john'}, 
 {'code': 'second', 'name': 'mike'}, 
 {'code': 'third', 'name': 'albert'}]

I wrote this Python code:

try:
  doc = libxml2.parseDoc(xml)
except (libxml2.parserError, TypeError):
  print "Problems loading XML"

ctxt = doc.xpathNewContext()
ctxt.xpathRegisterNs("pre", "http://www.mysyte.com/foo")

record_nodes = ctxt.xpathEval('/pre:records/pre:record')

for record_node in record_nodes:
  id = record_node.xpathEval('id')[0].content
  name = record_node.xpathEval('name')[0].content
  ret_list.append({'code': id, 'name': name})

My problem is that I don't get any results, and I have the impression that I'm doing something wrong with the XPath when I iterate over the nodes.

I also tried these XPath expressions for the id and the name:

/id
/name
/record/id
/record/name
/pre:id
/pre:name

and so on, but without any result (BTW, if I use the prefix in the subqueries I get an error).

Any idea?

A: 

If it is possible to switch to lxml, here is one way it could be done:

import lxml.etree as le
root=le.XML(content)
result=[]
namespaces={'pre':'http://www.mysyte.com/foo'}
for record in root:
    id=record.xpath('pre:id',namespaces=namespaces)[0]
    name=record.xpath('pre:name',namespaces=namespaces)[0]
    result.append({'code':id.text,'name':name.text})
print(result)
# [{'code': 'first', 'name': 'john'}, {'code': 'second', 'name': 'mike'}, {'code': 'third', 'name': 'albert'}]

Building off of Dimitre Novatchev's XPath expression, you could do this:

id_name_nodes = iter(ctxt.xpathEval('/pre:records/pre:record/*[self::pre:id or self::pre:name]'))

ret_list=[]
for id,name in zip(id_name_nodes,id_name_nodes):
    ret_list.append({'code':id.content,'name':name.content})
print(ret_list)

This libxml2 code relies on every record having both an id and a name. If either is missing, ret_list will pair the wrong id and name, failing silently. Under the same circumstances, the lxml code would raise an error.
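
The silent mispairing can be reproduced without any XML library. This is only a minimal sketch: pair_up and the two sample lists are hypothetical stand-ins for the flat node list that the XPath query returns, not part of libxml2.

```python
def pair_up(items):
    """Pair consecutive items the same way zip(it, it) does in the loop above."""
    it = iter(items)
    # Both arguments are the SAME iterator, so zip consumes two items per tuple.
    return list(zip(it, it))


# Every record has an id and a name: the pairs line up.
complete = ["first", "john", "second", "mike"]

# Hypothetical case: the second record has no id, so the flat list is one short.
missing_id = ["first", "john", "mike", "third", "albert"]

good_pairs = pair_up(complete)
bad_pairs = pair_up(missing_id)

print(good_pairs)
print(bad_pairs)
```

In the second call, "mike" ends up in the id position and the last item is dropped, with no exception raised. Querying each record separately avoids this, because a missing child then shows up as an empty result for that record only.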

unutbu
I'm using libxml2 everywhere and I would like to keep using it in this case too. However, thanks for your answer!
Giovanni Di Milia
Tim McNamara
OK, but there should be a way to do it directly in libxml2!
Giovanni Di Milia
A: 

You can select all the elements you need with a single XPath expression:

/pre:records/pre:record/*[self::pre:id or self::pre:name]

Then just process the selected nodes in Python.

Dimitre Novatchev
Sorry, but this doesn't answer my question.
Giovanni Di Milia
@Giovanni-Di-Milia: This answers the XPath part -- I don't know Python. Having selected all nodes you want, you should be able to process them in Python and to produce the wanted result.
Dimitre Novatchev
A: 

Here is a suggestion. Note the setContextNode() method:

import libxml2

xml = "test.xml"
doc = libxml2.parseFile(xml) 

ctxt = doc.xpathNewContext() 
ctxt.xpathRegisterNs("pre","http://www.mysyte.com/foo") 

ret_list = []
record_nodes = ctxt.xpathEval('/pre:records/pre:record') 

for node in record_nodes:
    ctxt.setContextNode(node)
    _id = ctxt.xpathEval('pre:id')[0].content
    name = ctxt.xpathEval('pre:name')[0].content
    ret_list.append({'code': _id, 'name': name}) 

print ret_list
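
For readers who cannot install the libxml2 bindings, the same per-record idea (a namespace-prefixed query evaluated relative to each record) can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in library, not the libxml2 API; the names XML, NS and extract_records are assumptions made for the illustration, and the sample document is shortened to two records.

```python
import xml.etree.ElementTree as ET

# Shortened sample with the same default namespace as the question's document.
XML = """<?xml version="1.0"?>
<records xmlns="http://www.mysyte.com/foo">
  <record><id>first</id><name>john</name></record>
  <record><id>second</id><name>mike</name></record>
</records>"""

# Prefix-to-URI mapping; the prefix "pre" is arbitrary, only the URI must match.
NS = {"pre": "http://www.mysyte.com/foo"}


def extract_records(xml_text):
    """Return one {'code': ..., 'name': ...} dict per record element."""
    root = ET.fromstring(xml_text)
    rows = []
    # Paths are relative to the element they are called on, like a context node.
    for record in root.findall("pre:record", NS):
        rows.append({
            # findtext returns None when the child is absent.
            "code": record.findtext("pre:id", namespaces=NS),
            "name": record.findtext("pre:name", namespaces=NS),
        })
    return rows


ret_list = extract_records(XML)
print(ret_list)

# An unprefixed path finds nothing, because the elements live in the namespace.
unprefixed = ET.fromstring(XML).find("record")
print(unprefixed)
```

As in the libxml2 version, the child elements must be addressed with the prefix, since the default namespace declared on records applies to every descendant.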
mzjn
No comments on this one? It is indeed a way to "do it directly in libxml2".
mzjn
Sorry! I forgot to mark this answer as the best one! It works exactly the way I want. Thanks!
Giovanni Di Milia