ansaurus

Question

How to transform huge XML files in Java?

Answer 1

+7 A:

StAX would seem to be one obvious solution: it's a pull parser rather than either the "push" of SAX or the "buffer the whole thing" approach of DOM. Can't say I've used it though. A "StAX tutorial" search may come in handy :)
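For readers who want to see the pull model in code, here is a minimal sketch using the cursor API (`XMLStreamReader`) from `javax.xml.stream`. It is not from the answer above; the class name `StaxPullSketch` and the element name `child` are made up for illustration. The caller asks for each event in turn, so only the current event is held in memory.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
final class StaxPullSketch {

    /** Counts start tags with the given local name, pulling one event at a time. */
    static int countElements(Reader in, String localName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // next() advances the cursor and returns the type of the new event.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(localName)) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><child/><element><child>text</child></element></root>";
        System.out.println(countElements(new StringReader(xml), "child")); // prints 2
    }
}
```

For a real multi-gigabyte file you would pass a buffered file reader or an `InputStream` instead of a `StringReader`.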

Jon Skeet 2010-05-05 13:49:52

+1 StAX is much easier to use than SAX if you haven't been exposed to handling XML files before. Besides, it also allows writing XML (in contrast to SAX).

Helper Method 2010-05-05 13:52:36

Answer 2

+5 A:

Yes, just write a SAX content handler, and when it encounters a certain element, build a DOM tree for that element. I've done this with very large files, and it works very well.

It's actually very easy: As soon as you encounter the start of the element you want, you set a flag in your content handler, and from there on, you forward everything to the DOM builder. When you encounter the end of the element, you set the flag to false, and write out the result.

(For more complex cases with nested elements of the same element name, you'll need to create a stack or a counter, but that's still quite easy to do.)
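The approach described above could be sketched roughly as follows. This is one possible reading, not the answerer's actual code: it assumes an identity `TransformerHandler` writing into a `DOMResult` serves as the "DOM builder", and the class name `SaxToDomSketch` and the element name `child` are hypothetical. A depth counter replaces the simple flag so that nested elements with the same name are handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: builds one small DOM per matching element.
final class SaxToDomSketch extends DefaultHandler {
    private final String target;
    private final DocumentBuilder docBuilder;
    private final SAXTransformerFactory saxFactory;
    private final List<Document> fragments = new ArrayList<>();
    private TransformerHandler domBuilder; // non-null while inside a target element
    private Document current;
    private int depth;                     // open elements inside the current fragment

    SaxToDomSketch(String target) throws Exception {
        this.target = target;
        this.docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        this.saxFactory = (SAXTransformerFactory) TransformerFactory.newInstance();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (domBuilder == null && local.equals(target)) {
            try {
                domBuilder = saxFactory.newTransformerHandler(); // identity: SAX in, DOM out
            } catch (Exception e) {
                throw new SAXException(e);
            }
            current = docBuilder.newDocument();
            domBuilder.setResult(new DOMResult(current));
            domBuilder.startDocument();
        }
        if (domBuilder != null) {
            depth++;
            domBuilder.startElement(uri, local, qName, atts);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (domBuilder != null) domBuilder.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (domBuilder == null) return;
        domBuilder.endElement(uri, local, qName);
        if (--depth == 0) {               // the outermost target element just closed
            domBuilder.endDocument();
            fragments.add(current);       // real code would process and discard it here
            domBuilder = null;
        }
    }

    static List<Document> extract(String xml, String target) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);      // so that the local name is populated
        SaxToDomSketch handler = new SaxToDomSketch(target);
        spf.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.fragments;
    }
}
```

Collecting every fragment in a list is only for demonstration. With a huge input you would write out or otherwise process each small DOM as soon as its end tag arrives, so that it can be garbage collected.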

Chris Lercher 2010-05-05 13:50:03

Answer 3

+1 A:

Have a look at StAX; it might be what you need. There's a good introduction on IBM developerWorks.

ilikeorangutans 2010-05-05 13:51:04

Answer 4

A:

You can do this quite easily with an XMLEventReader and several XMLEventWriters from the javax.xml.stream package.
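A rough sketch of that idea, under some assumptions: each matching element is copied to its own `XMLEventWriter`, and the output goes to in-memory strings only to keep the example self-contained. The class name `StaxSplitSketch` and the element name `child` are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, for illustration only.
final class StaxSplitSketch {

    /** Copies every element with the given local name into its own output. */
    static List<String> split(Reader in, String localName) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        List<String> parts = new ArrayList<>();
        StringWriter buffer = null;
        XMLEventWriter writer = null;   // non-null while inside a matching element
        int depth = 0;                  // open elements inside the current match

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (writer == null && event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(localName)) {
                buffer = new StringWriter();
                writer = outFactory.createXMLEventWriter(buffer);
            }
            if (writer == null) {
                continue;               // outside any match: drop the event
            }
            writer.add(event);          // inside a match: copy the event unchanged
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement() && --depth == 0) {
                writer.close();         // flushes the serialized element into buffer
                parts.add(buffer.toString());
                writer = null;
            }
        }
        reader.close();
        return parts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><child a=\"1\">x</child><skip>n</skip><child>y</child></root>";
        split(new StringReader(xml), "child").forEach(System.out::println);
    }
}
```

For large inputs, each writer would more likely wrap a `FileOutputStream` or a shared output stream rather than a `StringWriter`.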

jarnbjo 2010-05-05 13:52:23

Answer 5

+2 A:

Since you're talking about gigabytes, I would prioritize memory usage. SAX needs about twice as much memory as the size of the document, while DOM needs at least five times as much. So if your XML file is 1 GB, DOM would require a minimum of 5 GB of free memory. That's not funny anymore. So SAX (or any variant of it, like StAX) is the best option here.

If you want the most memory-efficient approach, look at VTD-XML. It requires only a little more memory than the size of the file.

BalusC 2010-05-05 13:54:18

Good point, memory is absolutely crucial here. BTW, SAX doesn't even necessarily need twice the size of the document - because it's a streaming API, you can constantly garbage collect previous parts of the document, as soon as you don't need them anymore.

Chris Lercher 2010-05-05 14:01:49

True, but that depends on the functional requirements. He might, for instance, need the entire XML document in memory before being able to gather the desired information.

BalusC 2010-05-05 14:05:11

Answer 6

+2 A:

For such a large XML document, something with a streaming architecture, like Omnimark, would be ideal.

It wouldn't have to be anything complex either. An Omnimark script like the one below could give you what you need:

process

submit #main-input

macro upto (arg string) is
    ((lookahead not string) any)*
macro-end

find (("<keep") upto ("</keep>") "</keep>")=>keep
    output keep

find any

DevNull 2010-05-05 19:58:05

Answer 7

+2 A:

I have had good experiences with STX (Streaming Transformations for XML). Basically, it is a streamed version of XSLT, well suited to processing huge amounts of data with a minimal memory footprint. It has a Java implementation named Joost.

It should be easy to come up with an STX transform that ignores all elements until an element matches a given XPath, copies that element and all its children (using an identity template within a template group), and then continues to ignore elements until the next match.

UPDATE

I hacked together an STX transform that does what I understand you want. It mostly depends on STX-only features like template groups and configurable default templates.

<stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns"
    version="1.0" pass-through="none" output-method="xml">
    <stx:template match="element/child">
        <stx:process-self group="copy" />
    </stx:template>
    <stx:group name="copy" pass-through="all">
    </stx:group>
</stx:transform>

The pass-through="none" on stx:transform configures the default templates (for nodes, attributes, etc.) to produce no output but still process child elements. The stx:template then matches the XPath element/child (this is where you put your match expression) and "processes self" in the "copy" group, meaning that the matching template from the group named "copy" is invoked on the current element. That group has pass-through="all", so its default templates copy their input and process child elements. When the element/child element ends, control passes back to the template that invoked process-self, and the following elements are ignored again until the template matches again.

The following is an example input file:

<root>
    <child attribute="no-parent, so no copy">
    </child>
    <element id="id1">
        <child attribute="value1">
            text1<b>bold</b>
        </child>
    </element>
    <element id="id2">
        <child attribute="value2">
            text2
            <x:childX xmlns:x="http://x.example.com/x">
            <!-- comment -->
                yet more<b i="i" x:i="x-i" ></b>
            </x:childX>
        </child>
    </element>
</root>

This is the corresponding output file:

<?xml version="1.0" encoding="UTF-8"?>
<child attribute="value1">
            text1<b>bold</b>
        </child><child attribute="value2">
            text2
            <x:childX xmlns:x="http://x.example.com/x">
            <!-- comment -->
                yet more<b i="i" x:i="x-i" />
            </x:childX>
        </child>

The unusual formatting is a result of skipping the text nodes containing newlines outside the child elements.

Christian Semrau 2010-05-05 20:21:55

Sounds good. Can I just write an XSLT stylesheet and then run it with STX?

2010-05-06 09:43:14

No, this is not possible. While XSLT uses modes to distinguish templates for the same match in different situations (skip mode vs. copy mode, in your case), STX uses template groups. The syntax within templates is similar to XSLT but differs in detail. I have added an example transform to my answer.

Christian Semrau 2010-05-06 21:14:35

Note that, in the XPath for matching a template, the only nodes you can access are the current node, its parent nodes, and their attributes. You cannot match on any other previous or following node, due to the streaming nature of the transform. If you need this kind of match, you can define variables (that are mutable) and use these in `stx:if` tests. But this is tricky and feels like implementing a content handler in XML.

Christian Semrau 2010-05-06 21:39:36
