As the title says, I have a huge XML file (several GB):

<root>  
<keep>  
   <stuff>  ...  </stuff>  
   <morestuff> ... </morestuff>  
</keep>  
<discard>  
   <stuff>  ...  </stuff>  
   <morestuff> ... </morestuff>
</discard>  
</root>  

and I'd like to transform it into a much smaller one that retains only a few of the elements.
My parser should do the following:
1. Parse through the file until a relevant element starts.
2. Copy the whole relevant element (with its children) to the output file, then go back to step 1.

Step 1 is easy with SAX and impossible with DOM parsers, since they have to load the whole file.
Step 2 is annoying with SAX, but easy with a DOM parser or XSLT.

So, is there a neat way to combine SAX and a DOM parser to do the task?

+7  A: 

StAX would seem to be one obvious solution: it's a pull parser rather than either the "push" of SAX or the "buffer the whole thing" approach of DOM. Can't say I've used it though. A "StAX tutorial" search may come in handy :)

Jon Skeet
+1 StAX is much easier to use than SAX if you haven't been exposed to handling XML files before. Besides, it also allows writing XML (in contrast to SAX).
Helper Method
+5  A: 

Yes, just write a SAX content handler, and when it encounters a certain element, build a DOM tree for that element. I've done this with very large files, and it works very well.

It's actually very easy: As soon as you encounter the start of the element you want, you set a flag in your content handler, and from there on, you forward everything to the DOM builder. When you encounter the end of the element, you set the flag to false, and write out the result.

(For more complex cases with nested elements of the same element name, you'll need to create a stack or a counter, but that's still quite easy to do.)
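
The flag-and-forward idea described above can be sketched in Java roughly as follows. This is only a minimal illustration, not necessarily how the answerer implemented it: the class name `KeepFilter`, the `target` parameter and the `sink` callback are hypothetical, and the "DOM builder" is assumed to be an identity `TransformerHandler` writing into a `DOMResult`. A depth counter covers nested elements with the same name. Namespace prefix mappings are not forwarded in this sketch.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: stream with SAX, build a small DOM only for each wanted element. */
public class KeepFilter extends DefaultHandler {
    private final String target;           // element name to keep (hypothetical parameter)
    private final Consumer<Element> sink;  // called once per completed subtree
    private final DocumentBuilder docs;
    private final SAXTransformerFactory factory;
    private TransformerHandler builder;    // non-null only while inside a kept element (the "flag")
    private Document current;              // DOM document receiving the current subtree
    private int depth;                     // open elements inside the current subtree

    public KeepFilter(String target, Consumer<Element> sink) throws ParserConfigurationException {
        this.target = target;
        this.sink = sink;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        this.docs = dbf.newDocumentBuilder();
        this.factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (builder == null && qName.equals(target)) {
            try {
                builder = factory.newTransformerHandler(); // identity: SAX events in, DOM out
            } catch (TransformerConfigurationException e) {
                throw new SAXException(e);
            }
            current = docs.newDocument();
            builder.setResult(new DOMResult(current));
            builder.startDocument();
        }
        if (builder != null) {
            depth++;
            builder.startElement(uri, local, qName, atts);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (builder != null) builder.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (builder == null) return;
        builder.endElement(uri, local, qName);
        if (--depth == 0) {                 // the kept element itself has just closed
            builder.endDocument();
            sink.accept(current.getDocumentElement());
            builder = null;                 // drop references so the subtree can be collected
            current = null;
        }
    }

    /** Parses the input and hands every completed target subtree to the sink. */
    public static void filter(InputSource in, String target, Consumer<Element> sink)
            throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.newSAXParser().parse(in, new KeepFilter(target, sink));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><keep><stuff>a</stuff></keep><discard><stuff>b</stuff></discard></root>";
        filter(new InputSource(new StringReader(xml)), "keep",
                e -> System.out.println(e.getTagName() + ": " + e.getTextContent()));
    }
}
```

For a multi-gigabyte file the sink would typically serialize each element to the output immediately rather than collect them, so that only one subtree is held in memory at a time.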

Chris Lercher
+1  A: 

Have a look at StAX; this might be what you need. There's a good introduction on IBM developerWorks.

ilikeorangutans
A: 

You can do this quite easily with an XMLEventReader and several XMLEventWriters from the javax.xml.stream package.
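
As a rough illustration of this approach, here is a minimal sketch using one `XMLEventReader` and a single `XMLEventWriter`. The class name `StaxKeepFilter`, the method `copyKept` and the `target` parameter are made up for the example, and it assumes the kept elements should be wrapped in the original root element so that the output stays well-formed.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Sketch: copy the root tag and every `target` subtree, skip everything else. */
public class StaxKeepFilter {

    public static void copyKept(Reader in, Writer out, String target) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        int level = 0;      // nesting depth in the input document
        int keepFrom = -1;  // level at which the current kept subtree started, or -1 if none
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                level++;
                String name = event.asStartElement().getName().getLocalPart();
                if (keepFrom < 0 && name.equals(target)) keepFrom = level;
            }
            boolean isRootTag = level == 1 && (event.isStartElement() || event.isEndElement());
            if (keepFrom >= 0 || isRootTag || event.isStartDocument() || event.isEndDocument()) {
                writer.add(event);  // events are copied one by one, nothing is buffered
            }
            if (event.isEndElement()) {
                if (level == keepFrom) keepFrom = -1;  // the kept subtree has just closed
                level--;
            }
        }
        writer.flush();
        writer.close();
        reader.close();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><keep><stuff>a</stuff></keep><discard><stuff>b</stuff></discard></root>";
        StringWriter out = new StringWriter();
        copyKept(new StringReader(xml), out, "keep");
        System.out.println(out);
    }
}
```

Because events are written as they are read, memory use does not grow with the size of the kept elements. For real files, the `Reader` and `Writer` would wrap file streams instead of strings.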

jarnbjo
+2  A: 

Since you're talking about gigabytes, I would prioritize memory usage. SAX needs roughly twice as much memory as the size of the document, while DOM needs at least five times as much. So if your XML file is 1 GB, DOM would require a minimum of 5 GB of free memory. That's no fun anymore. So SAX (or a streaming alternative such as StAX) is the best option here.

If you want the most memory-efficient approach, look at VTD-XML. It requires only a little more memory than the size of the file.

BalusC
Good point, memory is absolutely crucial here. BTW, SAX doesn't even necessarily need twice the size of the document - because it's a streaming API, you can constantly garbage collect previous parts of the document, as soon as you don't need them anymore.
Chris Lercher
True, but that depends on the functional requirements. He might for instance require to have the entire XML in memory before being able to gather the desired information.
BalusC
+2  A: 

For such a large XML document, something with a streaming architecture, like Omnimark, would be ideal.

It wouldn't have to be anything complex either. An Omnimark script like what's below could give you what you need:

process

submit #main-input

macro upto (arg string) is
    ((lookahead not string) any)*
macro-end

find (("<keep") upto ("</keep>") "</keep>")=>keep
    output keep

find any
DevNull
+2  A: 

I have had good experiences with STX (Streaming Transformations for XML). Basically, it is a streamed version of XSLT, well suited to processing huge amounts of data with a minimal memory footprint. It has an implementation in Java named Joost.

It should be easy to come up with an STX transform that ignores all elements until an element matches a given XPath, copies that element and all its children (using an identity template within a template group), and then continues to ignore elements until the next match.

UPDATE

I hacked together a STX transform that does what I understand you want. It mostly depends on STX-only features like template groups and configurable default templates.

<stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns"
    version="1.0" pass-through="none" output-method="xml">
    <stx:template match="element/child">
        <stx:process-self group="copy" />
    </stx:template>
    <stx:group name="copy" pass-through="all">
    </stx:group>
</stx:transform>

The pass-through="none" attribute on stx:transform configures the default templates (for nodes, attributes, etc.) to produce no output but still process child elements. The stx:template then matches the XPath element/child (this is where you put your own match expression) and "processes self" in the "copy" group, meaning that the matching template from the group with name="copy" is invoked on the current element. That group has pass-through="all", so its default templates copy their input and process child elements. When the element/child element ends, control passes back to the template that invoked process-self, and the following elements are ignored again until the template next matches.

The following is an example input file:

<root>
    <child attribute="no-parent, so no copy">
    </child>
    <element id="id1">
        <child attribute="value1">
            text1<b>bold</b>
        </child>
    </element>
    <element id="id2">
        <child attribute="value2">
            text2
            <x:childX xmlns:x="http://x.example.com/x">
            <!-- comment -->
                yet more<b i="i" x:i="x-i" ></b>
            </x:childX>
        </child>
    </element>
</root>

This is the corresponding output file:

<?xml version="1.0" encoding="UTF-8"?>
<child attribute="value1">
            text1<b>bold</b>
        </child><child attribute="value2">
            text2
            <x:childX xmlns:x="http://x.example.com/x">
            <!-- comment -->
                yet more<b i="i" x:i="x-i" />
            </x:childX>
        </child>

The unusual formatting is a result of skipping the text nodes containing newlines outside the child elements.

Christian Semrau
Sounds good. Can I just write an XSLT stylesheet and then run it with STX?
No, this is not possible. While XSLT uses modes to distinguish templates for the same match in different situations (skip mode vs. copy mode, in your case), STX uses template groups. The syntax within templates is similar to XSLT, but differs in the details. I have added an example transform to my answer.
Christian Semrau
Note that, in the XPath for matching a template, the only nodes you can access are the current node, its parent nodes, and their attributes. You cannot match on any other previous or following node, due to the streaming nature of the transform. If you need this kind of match, you can define variables (that are mutable) and use these in `stx:if` tests. But this is tricky and feels like implementing a content handler in XML.
Christian Semrau