tags:

views:

62

answers:

3

Hi, I have a huge number (2k+) of xml files that I need to extract data from and dump that data into a table of some sort (Excel or simply one or multiple XML files would be fine). The fun part is that the xml files have wildly differing amount of nodes (with occasionally the same names in the subnodes) and also the depth of the hierarchy is not known.

Anyway, here is an example:

<?xml version="1.0" encoding="UTF-8"?>
   <SomeName>
     <identifier>
        <UID> 1234 </UID>
     </identifier>
     <MainNode1>
         <SubNode1>
            <Subnode1a>DATA1a0</Subnode1a>
         </SubNode1>
         <SubNode1>
            <Subnode1a>DATA1a1</Subnode1a>
         </SubNode1>
         <SubNode1>
            <Subnode1a>DATA1a2</Subnode1a>
         </SubNode1>
      </MainNode1>

      <MainNode2>
         <SubNode2>
            <Subnode2a>DATA2a0</Subnode2a>
         </SubNode2>
         <SubNode3>
            <Subnode3a>DATA3a0</Subnode3a>
         </SubNode3>
         <SubNode4>
            <Subnode4a>DATA4a0</Subnode4a>
         </SubNode4>
      </MainNode2>

      <MainNodeIDONTCARE>
           <SubnodeWhatever>
           </SubnodeWhatever>
      <MainNodeIDONTCARE>
    </SomeName>

And here is my table I want it to flatten to. Basically it should look like this:

 <?xml version="1.0" encoding="UTF-8"?>
 <SomeName> 
 <UID>1234</UID>
 <MainNode1_SubNode1_SubNode1aA>DATA1a0</MainNode1_SubNode1_SubNode1aA>
 <MainNode1_SubNode1_SubNode1aB>DATA1a1</MainNode1_SubNode1_SubNode1aB>
 <MainNode1_SubNode1_SubNode1aC>DATA1a2</MainNode1_SubNode1_SubNode1aC>
 <MainNode2_SubNode2_SubNode2a>Data2a0</MainNode2_SubNode2_SubNode2a>
 <MainNode2_SubNode3_SubNode3a>Data3a0</MainNode2_SubNode3_SubNode3a>
 <MainNode2_SubNode4_SubNode4a>Data4a0</MainNode2_SubNode4_SubNode4a>
 </SomeName>

As you can see the <MainNodeIDONTCARE> is not in the table, because I simply want to extract specific data sets from the XML files (I want to include MainNodes that I need and just ignore all others). In this case anything that is in MainNode1 through MainNode4.

What would be the best way to get this done? I don't care about performance, because the files aren't too big and the conversion can take as long as it wants to.

Is there some clever XSLT or anything else that can help me? Thank you.

A: 

Whether or not an XSLT is used (and yes that would work), it seems the trick is to make an xpath statement that makes one single collection of all those nodes. The proper xpath statement will not care about the depth.

Brent Arias
A: 

Here you are:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:template match="SomeName">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="UID" priority="1">
    <xsl:copy-of select="."/>
  </xsl:template>

  <xsl:template match="MainNodeIDONTCARE" priority="1"/> 

  <xsl:template match="SomeName//*[not(*)][text()]">
    <xsl:variable name="elementName">
      <xsl:call-template name="getElementName">
        <xsl:with-param name="element" select="."/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:element name="{$elementName}">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:template>

  <xsl:template name="getElementName">
    <xsl:param name="element"/>

    <xsl:if test="$element/parent::*[not(self::SomeName)]">
      <xsl:call-template name="getElementName">
        <xsl:with-param name="element" select="$element/parent::*"/>
      </xsl:call-template>
      <xsl:text>_</xsl:text>
    </xsl:if>
    <xsl:value-of select="name($element)"/>
  </xsl:template>

</xsl:stylesheet>
Jan Willem B
Hi,thanks for the XSLT! I'm using it right now and have modified it slightly to fit my case. What I need to do is to only copy and flatten specific nodes (in this case MainNode1, MainNode2)and drop all others.Currently I have modified the `<xsl:template match="SomeName//*[not(*)][text()]">` to look like `<xsl:template match="`MainNode1//*[not(*)][text()]">`, and copy the entire block for all the other Nodes I need. But at the end of my output I still have all the data from the Nodes I don'T care about and I'd like to drop that data. How do I do that?
Grinner
add an empty template like I added for MainNodeIDONTCARE with priority 1
Jan Willem B
A: 

Using XSLT 2.0 [Saxon] (I don't think this'll work in xslt 1.0):

First, the non empty text nodes contain the data, so select for them:

  <xsl:template match="/SomeName" >        
      <SomeName>
          <xsl:copy-of select="identifier/UID" />
        <xsl:apply-templates select="(MainNode1|MainNode2)//text()[normalizespace()]" />
      </SomeName>
  </xsl:template>

Then you need to make an element name by joining the ancestor names:

<xsl:template match="text()" >
  <xsl:element  name="{string-join( ./ancestor::*[name()!='SomeName']/name(),'_')}">
    <xsl:value-of select="." />
  </xsl:element>
</xsl:template>
Steven D. Majewski