I have an XML file in the following format:

<doc>

<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
   </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>

<type...>

</type>


</doc>

I would like to parse this document and build a hash table (dictionary) like this:

{X: {"A": [(100,80), (200,90)], "B": [(100,20), (20,90)]}, Y: .....}

How would I do this in Python?

+2  A: 

I would recommend using the minidom library.

The docs are pretty good so you should be up and running in no time.
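
As a rough sketch of what that could look like for the XML in the question: `getElementsByTagName` and `getAttribute` are enough to build the nested dictionary. The function name `parse_ranges` is made up for illustration, and the sample is inlined as a string rather than read from a file. Values come back as strings.

```python
from xml.dom import minidom

# Sample document from the question, inlined so the sketch is self-contained.
SAMPLE = """<doc>
<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
  </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>
</doc>"""


def parse_ranges(xml_text):
    """Build {id name: {type name: [(val, id), ...]}} from the XML text."""
    doc = minidom.parseString(xml_text)
    data = {}
    for id_node in doc.getElementsByTagName("id"):
        types = data[id_node.getAttribute("name")] = {}
        for type_node in id_node.getElementsByTagName("type"):
            pairs = types[type_node.getAttribute("name")] = []
            # childNodes also contains whitespace text nodes; keep elements only.
            for child in type_node.childNodes:
                if child.nodeType == child.ELEMENT_NODE:
                    pairs.append((child.getAttribute("val"),
                                  child.getAttribute("id")))
    return data


print(parse_ranges(SAMPLE))
```

Note that `getElementsByTagName` searches all descendants, not just direct children, which is fine for a document shaped like the sample.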

Dan.

freeasinbeer
A: 

Why not try something like the PyXML library? It has lots of documentation and tutorials.

Gordon
**WARNING** Norwegian Blue Parrot syndrome: Last release 5 years ago. No Windows installers for Python 2.5 and 2.6.
John Machin
A: 

Another XML parsing library: http://www.crummy.com/software/BeautifulSoup/

The MYYN
A: 

As others have stated, minidom is the way to go here. You open (and parse) the file, then walk through the nodes, checking whether each one is relevant and should be read. That way, you also know whether you want to read its child nodes.

Threw this together, and it seems to do what you want. Attributes are read by name, because the order in which minidom hands back attributes by position is not something to rely on. There's no error handling. And the print() at the end means it's Python 3.x.

I'll leave it as an exercise to improve upon that; I just wanted to post a snippet to get you started.

Happy hacking! :)

xml.txt

<doc>
<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
   </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>
</doc>

parsexml.py

from xml.dom import minidom

data = {}
doc = minidom.parse("xml.txt")
# Whitespace between elements shows up as text nodes, so filter on localName.
for n in doc.documentElement.childNodes:
    if n.localName == "id":
        id_name = n.getAttribute("name")
        data[id_name] = {}
        for j in n.childNodes:
            if j.localName == "type":
                type_name = j.getAttribute("name")
                # Slot 0 holds the <min> pair, slot 1 the <max> pair.
                data[id_name][type_name] = [(), ()]
                for k in j.childNodes:
                    if k.localName == "min":
                        data[id_name][type_name][0] = (
                            k.getAttribute("val"), k.getAttribute("id"))
                    if k.localName == "max":
                        data[id_name][type_name][1] = (
                            k.getAttribute("val"), k.getAttribute("id"))
print(data)

Output:

{'X': {'A': [('100', '80'), ('200', '90')], 'B': [('100', '20'), ('20', '90')]}}
mizipzor
Sorry, wrong room. The fugly code competition is down the hall.
John Machin
+7  A: 

I disagree with the suggestion in other answers to use minidom -- that's a so-so Python adaptation of a standard originally conceived for other languages, usable but not a great fit. The recommended approach in modern Python is ElementTree.

The same interface is also implemented, faster, in the third-party module lxml, but unless you need blazing speed the version included with the Python standard library is fine (and faster than minidom anyway). The key point is to program to that interface; you can then switch to a different implementation of the same interface in the future if you want to, with minimal changes to your own code.
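
One common way to program to that interface is a guarded import, sketched below: the rest of the code only touches the name `et`, so it does not care which implementation it got. The helper `type_names` and the inline sample string are made up for illustration.

```python
try:
    # lxml is a third-party package; use it when it happens to be installed.
    from lxml import etree as et
except ImportError:
    # Otherwise fall back to the standard-library implementation.
    from xml.etree import ElementTree as et


def type_names(xml_text):
    """Return the name attribute of every <type> element, in document order."""
    root = et.fromstring(xml_text)
    return [t.get("name") for t in root.iter("type")]


print(type_names('<doc><id name="X"><type name="A"/><type name="B"/></id></doc>'))
```

Only calls that both implementations provide (`fromstring`, `iter`, `get`) are used here, which is what keeps the switch cheap.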

For example, after the needed imports &c, the following code is a minimal implementation of your example (it does not verify that the XML is correct, just extracts the data assuming correctness -- adding various kinds of checks is pretty easy of course):

from xml.etree import ElementTree as et  # or import any other, faster version of ET

def xml2data(xmlfile):
    tree = et.parse(xmlfile)
    data = {}
    # Iterate over elements directly; Element.getchildren() was removed in Python 3.9.
    for anid in tree.getroot():
        currdict = data[anid.get('name')] = {}
        for atype in anid:
            currlist = currdict[atype.get('name')] = []
            for c in atype:
                # Each <min>/<max> element contributes one (val, id) pair of strings.
                currlist.append((c.get('val'), c.get('id')))
    return data

This produces your desired structure given your sample input (the values come back as strings rather than numbers).
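
The question's target structure shows numbers, and the checks mentioned above are indeed easy to bolt on. The following is only one possible variant: the name `xml2data_checked`, the choice to raise `ValueError`, and the conversion to `int` are assumptions for illustration, not part of the original answer.

```python
import io
from xml.etree import ElementTree as et


def xml2data_checked(xmlfile):
    """Like xml2data, but converts values to int and rejects unexpected input."""
    data = {}
    for anid in et.parse(xmlfile).getroot():
        if anid.tag != "id":
            continue  # skip anything at the top level that is not an <id>
        currdict = data[anid.get("name")] = {}
        for atype in anid.findall("type"):  # direct <type> children only
            currlist = currdict[atype.get("name")] = []
            for c in atype:
                if c.tag not in ("min", "max"):
                    raise ValueError("unexpected element: %s" % c.tag)
                val, ident = c.get("val"), c.get("id")
                if val is None or ident is None:
                    raise ValueError("<%s> needs both val and id" % c.tag)
                # int() itself raises ValueError on non-numeric text.
                currlist.append((int(val), int(ident)))
    return data


# et.parse accepts a file name or a file object; BytesIO stands in for a file.
sample = io.BytesIO(
    b'<doc><id name="X"><type name="A">'
    b'<min val="100" id="80"/><max val="200" id="90"/>'
    b'</type></id></doc>'
)
print(xml2data_checked(sample))
```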

Alex Martelli
`for child in node.getchildren():` is unnecessary; use `for child in node:` instead.
John Machin
A: 

Do not reinvent the wheel. Use the Amara toolkit. Variable names are just keys in a dictionary anyway. http://www.xml3k.org/Amara

Hamish Grubijan
Another link - http://www.xml.com/pub/a/2005/01/19/amara.html You will end up with a variable doc, which has doc.id, which has doc.id.type[0], then doc.id.type[0].min, ... and so on. Super-easy to access!
Hamish Grubijan