Hey all, I have highly repetitive data, 5 levels deep (including the root), that needs to be broken apart. (I'll include a quick sample in a minute.) What I want to do is split a ~5 MB XML file into smaller sub-files based on the third-level nodes. But after that, it gets more complicated.

The task's requirements are these:

  1. Each sub-file must keep the ancestors of the extracted third-level node, including their attributes.
  2. Each sub-file must retain all of the extracted node's attributes and child nodes.
  3. If XSLT cannot handle the job, attempt it in Ruby. If you aren't good at XSLT but can tell me how to do it in Ruby or even Python, please feel free to contribute an answer in those languages. (Otherwise, try to stick with XSLT or pseudo-code.)

DOM Hierarchy:

<xml attr="whatever">
  <major-group name="whatever">
    <minor-group name="whatever">
      <another-group name="whatever">
        <last-node name="whatever"></last-node>
      </another-group>
    </minor-group>
  </major-group>
</xml>

I need to split this on the minor-group element while retaining both its children and its direct parents, and write the result for each minor-group to an external file. I have several files to split in this manner.

And... having never before parsed XML in Ruby, and having just begun using XSLT, I cannot yet write a script to accomplish my task with either.

I'm curious to see if XSLT is up to the task. :>

Edit:

Here's my resulting code, which also emits an xml-stylesheet processing instruction at the beginning of each output file.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="xml"/>
  <xsl:template match="minor-group">
    <xsl:variable name="filename"><xsl:value-of select="concat(@name,'.xml')"/></xsl:variable>
    <xsl:result-document href="{$filename}">
      <xsl:processing-instruction name="xml-stylesheet">type="text/xsl" href="../web.xslt"</xsl:processing-instruction>
      <xml>
        <xsl:attribute name="whatever"><xsl:value-of select="../../@whatever" /></xsl:attribute>
        <major-group>
          <xsl:attribute name="whatever"><xsl:value-of select="../@whatever" /></xsl:attribute>
          <xsl:copy-of select="."/>
        </major-group>
      </xml>
    </xsl:result-document>
  </xsl:template>
</xsl:stylesheet>
A: 

I don't believe you can split one file into multiple output files using XSLT alone.

If you were to break the XML up into separate XML files with Ruby, and then apply the XSLT to each of those files in turn, it should work.

Zachary Spencer
It used to be possible with Apache's Xalan, http://www.abbeyworkshop.com/howto/xslt/xslt_split/index.html but it seems defunct. I have found no other related result via Google. :/ (Besides, that breaking up is what I'm trying to do with either Ruby or XSLT -- I just don't know how to preserve it all with Ruby.)
The Wicked Flea
@Flea: That sample references the Redirect extension to Xalan. Looks like it's available for Xalan-J (Java version of Xalan), see http://xml.apache.org/xalan-j/extensionslib.html#redirect
system PAUSE
Which I don't know how to get/use. I haven't touched Java, ever. I'll look into it.... :/
The Wicked Flea
It's a bit of work to set up. But you can run a transform from the command-line, see http://xml.apache.org/xalan-j/getstarted.html#commandline
system PAUSE
+3  A: 

To select the list of "minor-group" elements, either of the following XPath expressions can be used.

/xml/major-group/minor-group    (the explicit way)
/*/*/*                          (the generic, any-third-level-element way)
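
As a quick check against the hierarchy from the question, both expressions select the same elements. This is only a sketch using Python's standard-library `xml.etree.ElementTree`, which supports a subset of XPath and evaluates paths relative to an element, so the absolute paths above are rewritten relative to the root (an adaptation, not the original expressions). The sample document and its names ("m1", "a", "b") are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that follows the question's hierarchy.
SAMPLE = """<xml attr="whatever">
  <major-group name="m1">
    <minor-group name="a"><another-group name="x"/></minor-group>
    <minor-group name="b"><another-group name="y"/></minor-group>
  </major-group>
</xml>"""

root = ET.fromstring(SAMPLE)

# /xml/major-group/minor-group, written relative to the root element
explicit = root.findall("./major-group/minor-group")

# /*/*/*, written relative to the root element
generic = root.findall("./*/*")

explicit_names = [el.get("name") for el in explicit]
generic_names = [el.get("name") for el in generic]
print(explicit_names, generic_names)
```

The generic form only matches the explicit one as long as every third-level element really is a minor-group.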

In a scripting language of your choice, read the document into a DOM, loop over the results of the XPath query, and write each result to its own output file.
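
The DOM-and-loop approach described above could look roughly like the following in Python (one of the languages the question allows). This is a minimal sketch, not a tested solution for the real data: the function name `split_minor_groups` is hypothetical, and naming each file after the minor-group's `name` attribute is an assumption borrowed from the question's own stylesheet, so the names must be unique and safe to use as file names.

```python
import copy
import os
import xml.etree.ElementTree as ET


def split_minor_groups(source_xml, out_dir):
    """Write one file per <minor-group>, wrapped in copies of its parents.

    Returns the list of file paths written. File names come from the
    minor-group's name attribute (an assumption, see the note above).
    """
    root = ET.fromstring(source_xml)
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for major in root.findall("./major-group"):
        for minor in major.findall("./minor-group"):
            # Rebuild the ancestors with their attributes but no other children.
            new_root = ET.Element(root.tag, dict(root.attrib))
            new_major = ET.SubElement(new_root, major.tag, dict(major.attrib))
            # Deep-copy the minor-group so its attributes and descendants survive.
            new_major.append(copy.deepcopy(minor))
            path = os.path.join(out_dir, minor.get("name") + ".xml")
            ET.ElementTree(new_root).write(
                path, encoding="utf-8", xml_declaration=True
            )
            written.append(path)
    return written
```

Calling `split_minor_groups(open("input.xml").read(), "out")` (with `input.xml` and `out` as placeholder names) would produce one file per minor-group, each containing the root and major-group elements with their original attributes.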

With standard XSLT 1.0 (processor-specific extensions aside) it is not possible to generate more than one output document at a time. However, XSLT 2.0 supports this via the <xsl:result-document> instruction.

If you have an XSLT 2.0 engine at your disposal, you could try that route. A random page I found at IBM's developerWorks website shows how to get started: Tip: Create multiple files in XSLT 2.0

Tomalak
Thanks very much for the tip-off about XSLT 2.0; this ought to fix my problem, but I'm going to test it first.
The Wicked Flea