Hey guys,

I need to parse an XML file and build a record-based output from the data. The problem is that the XML is in a "generic" form, in that it has several levels of nested "node" elements that represent some sort of data structure. I need to build the records dynamically based on the deepest level of the "node" element. Some example XML and expected output are at the bottom.

I am most familiar with Python's ElementTree, so I'd prefer to use that, but I just can't wrap my head around a way to build the output record when the node depth varies. We also can't assume that the nested nodes will be some fixed number of levels deep, so hardcoding a loop for each level isn't possible. Is there a way to parse the XML and build the output on the fly?

Some Additional Notes:

  • The node names are all "node" except the parent and detail info (rate, price, etc)
  • The node depth is not static. So - assume further levels than displayed in the sample
  • Each "level" can have multiple sub-levels. So - you need to loop on each child "node" to properly build each record.

Any ideas / input would be greatly appreciated.

<root>
   <node>101
      <node>A
         <node>PlanA     
            <node>default
                <rate>100.00</rate>
            </node>
            <node>alternative
                <rate>90.00</rate>
            </node>
         </node>
      </node>
   </node>
   <node>102
      <node>B
         <node>PlanZZ     
            <node>Group 1
               <node>default
                   <rate>100.00</rate>
               </node>
               <node>alternative
                   <rate>90.00</rate>
               </node>
            </node>
            <node>Group 2
               <node>Suba
                  <node>default
                      <rate>1.00</rate>
                  </node>
                  <node>alternative
                      <rate>88.00</rate>
                  </node>
               </node>
               <node>Subb
                  <node>default
                      <rate>200.00</rate>
                  </node>
                  <node>alternative
                      <rate>4.00</rate>
                  </node>
               </node>
            </node>
         </node>
      </node>  
   </node>
</root>

The Output would look like this:

SRV  SUB  PLAN    Group    SubGrp  DefRate  AltRate
101  A    PlanA                    100      90
102  B    PlanZZ  Group 1          100      90
102  B    PlanZZ  Group 2  Suba    1        88
102  B    PlanZZ  Group 2  Subb    200      4
+4  A: 

That's what ElementTree's find method, which accepts an XPath-style path, is for.

class Plan( object ):
    def __init__( self ):
        self.srv= None
        self.sub= None
        self.plan= None
        self.group= None
        self.subgroup= None
        self.defrate= None
        self.altrate= None
    def initFrom( self, other ):
        self.srv= other.srv
        self.sub= other.sub
        self.plan= other.plan
        self.group= other.group
        self.subgroup= other.subgroup
    def __str__( self ):
        return "%s %s %s %s %s %s %s" % (
            self.srv, self.sub, self.plan, self.group, self.subgroup,
            self.defrate, self.altrate )

def setRates( obj, aSearch ):
    for rate in aSearch:
        if rate.text.strip() == "default":
            obj.defrate= rate.find("rate").text.strip()
        elif rate.text.strip() == "alternative":
            obj.altrate= rate.find("rate").text.strip()
        else:
            raise Exception( "Unexpected Structure" )

def planIter( doc ):
    for topNode in doc.findall( "node" ):
        obj= Plan()
        obj.srv= topNode.text.strip()
        subNode= topNode.find("node")
        obj.sub= subNode.text.strip()
        planNode= topNode.find("node/node")
        obj.plan= planNode.text.strip()
        l3= topNode.find("node/node/node")
        if l3.text.strip() in ( "default", "alternative" ):
            setRates( obj, topNode.findall("node/node/node") )
            yield obj
        else:
            for group in topNode.findall("node/node/node"):
                grpObj= Plan()
                grpObj.initFrom( obj )
                grpObj.group= group.text.strip()
                l4= group.find( "node" )
                if l4.text.strip() in ( "default", "alternative" ):
                    setRates( grpObj, group.findall( "node" ) )
                    yield grpObj
                else:
                    for subgroup in group.findall("node"):
                        subgrpObj= Plan()
                        subgrpObj.initFrom( grpObj )
                        subgrpObj.subgroup= subgroup.text.strip()
                        setRates( subgrpObj, subgroup.findall("node") )
                        yield subgrpObj

import xml.etree.ElementTree as xml

# xml_text is assumed to hold the XML document from the question as a string
doc = xml.XML( xml_text )

for plan in planIter( doc ):
    print( plan )
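
To see what the relative paths used above actually select, here is a small self-contained illustration. The trimmed sample data and the variable names are assumptions made for this sketch; only `find`, `findall` and `fromstring` come from ElementTree itself.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the same shape as the question's XML (illustrative only).
SAMPLE = """
<root>
   <node>101
      <node>A
         <node>PlanA
            <node>default<rate>100.00</rate></node>
            <node>alternative<rate>90.00</rate></node>
         </node>
      </node>
   </node>
</root>
"""

root = ET.fromstring(SAMPLE)
top = root.find("node")                    # first <node> directly under <root>
plan = top.find("node/node")               # first <node> two levels below top
leaves = top.findall("node/node/node")     # every <node> three levels below top

# .text is the text before an element's first child, so strip the whitespace.
plan_name = plan.text.strip()
leaf_names = [leaf.text.strip() for leaf in leaves]
rates = [leaf.find("rate").text.strip() for leaf in leaves]
print(plan_name, leaf_names, rates)
```

Each path is relative to the element it is called on, which is why the depth has to be spelled out once per level in this approach.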


Edit

Whoever gave you this XML document needs to find another job. This is A Bad Thing (TM) and indicates a fairly casual disregard for what XML means.

S.Lott
Thanks for the quick response. The node names are all "node" and unfortunately as I stated earlier, I can't assume that "subgroup" is the last level, otherwise this would have been very easy. The node depth is not static. There could be children of the subgroup "node". Thoughts? Thanks again!
John
Also - each "level" can have multiple sub-levels. So creating a single object just from the top node loop won't work. You need to loop on each child "node" as well to build each record.
John
@S.Lott - I couldn't agree more on the XML structure but unfortunately, it is "system generated" and they refuse to change it. :-(
John
@John: They're incompetent. It's trivially changeable if they would simply subclass the DOM objects properly. Seriously. What they're doing is A Bad Thing on two levels: it's wrong, and they're refusing to change it.
S.Lott
@S.Lott - Ha! Again, couldn't agree more. When I explained the poor XML structure and the need to change, they just looked at me w/ blank faces and said "that can't be changed". Oh well -- nothing like building a complex solution to deal w/ a poor design. Thanks again.
John
A: 

I'm not too familiar with the ElementTree module, but you should be able to walk an element's children (the getchildren() method in older versions, or simply iterating over the element, since getchildren() was removed in Python 3.9) and recursively parse data until there are no more children. This is more pseudocode than anything:

from xml.etree.ElementTree import parse

def parseXml(root, data):
    # INSERT CODE to populate your data object here with the values
    # you want from this node
    for node in root:  # iterating an element yields its direct children
        parseXml(node, data)

data = {}  # I'm guessing you want a dict of some sort here to store the data you parse
# xml_path is a placeholder for the path to your XML file
parseXml(parse(xml_path).getroot(), data)
# data will be filled and ready to use
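
One way the recursion could be filled in for the question's XML is sketched below. It is a minimal sketch, not a definitive implementation: it assumes a record ends at any "node" whose child nodes all contain a `<rate>` element, and the names `label`, `iter_records` and `SAMPLE` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened sample with two different nesting depths (illustrative only).
SAMPLE = """
<root>
   <node>101
      <node>A
         <node>PlanA
            <node>default<rate>100.00</rate></node>
            <node>alternative<rate>90.00</rate></node>
         </node>
      </node>
   </node>
   <node>102
      <node>B
         <node>PlanZZ
            <node>Group 1
               <node>default<rate>100.00</rate></node>
               <node>alternative<rate>90.00</rate></node>
            </node>
            <node>Group 2
               <node>Suba
                  <node>default<rate>1.00</rate></node>
                  <node>alternative<rate>88.00</rate></node>
               </node>
            </node>
         </node>
      </node>
   </node>
</root>
"""

def label(elem):
    """Return the text directly inside an element, before its first child."""
    return (elem.text or "").strip()

def iter_records(elem, path=()):
    """Yield (path, rates) for every node whose child nodes all hold a <rate>."""
    children = elem.findall("node")
    if children and all(child.find("rate") is not None for child in children):
        # Assumption: children carrying <rate> are the default/alternative leaves.
        rates = {label(c): c.find("rate").text.strip() for c in children}
        yield path, rates
    else:
        # Not at the rate level yet: descend, extending the path with each label.
        for child in children:
            yield from iter_records(child, path + (label(child),))

records = list(iter_records(ET.fromstring(SAMPLE)))
for path, rates in records:
    print(*path, rates.get("default"), rates.get("alternative"), sep="\t")
```

Because the path is carried down as a tuple, each record keeps however many levels it actually has, so padding short records out to fixed columns such as Group and SubGrp can be left to the output step.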
jcoon