ansaurus

Question

Efficient merging of multiple, large xml files into one

Answer 1

+1 A:

You may want to have a look at the commercial edition of Saxon. It can run XSLT transformations on the fly without needing the full DOM in memory.

Thorbjørn Ravn Andersen 2010-02-16 20:08:46

Hmm, in XSLT you can look up the first node and the last node regardless of where you are, i.e. everything needs to be in memory, by the very definition of XSLT. Or what do you think?

Karussell 2010-02-16 20:12:10

There is a fairly large subset of XSLT programs that can be executed without the full DOM trees in memory.

Thorbjørn Ravn Andersen 2010-02-16 21:59:53

ah, ok thanks. Now I understand

Karussell 2010-02-18 22:19:31

Answer 2

+2 A:

I haven't done this myself, but I recall seeing an IBM developerWorks article that made this look pretty easy.

It's a bit old now, but try http://www.ibm.com/developerworks/xml/library/x-tipstx5/index.html

It uses StAX rather than SAX. I'm not sure whether current JDKs include StAX. If not, you can probably get it from http://stax.codehaus.org/
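
For orientation, here is a minimal sketch of what a StAX-based merge could look like. It is not taken from the linked article. The class name StaxMerge, the method merge, and the idea of wrapping all inputs in one new root element are assumptions made for illustration. The inputs are plain strings only to keep the example self-contained; for large files you would create the readers from file input streams and the writer from a file output stream.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies several XML documents under one new root element.
class StaxMerge {

    static String merge(List<String> inputs, String rootName) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        // Open the merged document exactly once.
        writer.add(events.createStartDocument());
        writer.add(events.createStartElement("", "", rootName));

        for (String xml : inputs) {
            XMLEventReader reader = inputFactory.createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                // Drop each input's own document start/end; copy everything else as-is.
                if (event.isStartDocument() || event.isEndDocument()) {
                    continue;
                }
                writer.add(event);
            }
            reader.close();
        }

        // Close the merged document exactly once.
        writer.add(events.createEndElement("", "", rootName));
        writer.add(events.createEndDocument());
        writer.close();
        return out.toString();
    }
}
```

Events are pulled and written one at a time, so memory use does not grow with the size of the inputs. The sketch does not handle inputs that carry a DOCTYPE declaration; those events would have to be skipped as well.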

Don Roby 2010-02-16 20:42:55

thank you for the link. I will investigate this!

Karussell 2010-02-16 21:16:05

+1 The JDK has included StAX since 1.6 (Java SE 6), as far as I remember. It is much more convenient to use than SAX.

Helper Method 2010-10-03 15:33:41

Answer 3

A:

I finally managed this via the following snippet:

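  import java.io.InputStream;
  import java.io.OutputStreamWriter;
  import javax.xml.transform.stream.StreamResult;
  import org.xml.sax.InputSource;
  import org.xml.sax.XMLReader;
  import org.xml.sax.helpers.XMLReaderFactory;

  // Single output sink shared by all inputs
  StreamResult finalHandler = new StreamResult(new OutputStreamWriter(System.out));
  // customHandler extends DefaultHandler
  CustomTransformerHandler customHandler = new CustomTransformerHandler(
         finalHandler);
  // Open the merged document once, before any input is parsed
  customHandler.startDocumentExplicitly();
  InputStream is = null;
  while ((is = customHandler.createNextInputStream()) != null) {
    // Parse each input stream in turn with the same handler
    XMLReader myReader = XMLReaderFactory.createXMLReader();
    myReader.setContentHandler(customHandler);
    myReader.parse(new InputSource(is));
  }
  // Close the merged document once, after the last input
  customHandler.endDocumentExplicitly();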

The important part was to leave the startDocument and endDocument methods of customHandler empty, so that parsing each input does not open or close the output document again. All other methods (characters, startElement, endElement) are redirected to the finalHandler. The customHandler.createNextInputStream method returns null once all input streams have been read.
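
Since the handler class itself is not shown, here is a minimal sketch of how such a handler might look. It is an illustration under assumptions, not the original CustomTransformerHandler: the class name MergingHandler, the static merge method, the root-name parameter, and the use of an identity TransformerHandler as the output are all hypothetical. The inputs are strings only to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: forwards element and text events, swallows document events.
class MergingHandler extends DefaultHandler {
    private final TransformerHandler out;

    MergingHandler(TransformerHandler out) {
        this.out = out;
    }

    // Called once by the driver before the first input is parsed.
    void startDocumentExplicitly(String root) throws SAXException {
        out.startDocument();
        out.startElement("", root, root, new AttributesImpl());
    }

    // Called once by the driver after the last input is parsed.
    void endDocumentExplicitly(String root) throws SAXException {
        out.endElement("", root, root);
        out.endDocument();
    }

    // Intentionally empty: each parsed input fires these, but the output
    // document must be opened and closed only once.
    @Override public void startDocument() { }
    @Override public void endDocument() { }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        out.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        out.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        out.characters(ch, start, length);
    }

    static String merge(List<String> inputs, String root) throws Exception {
        // An identity TransformerHandler serializes incoming SAX events.
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = tf.newTransformerHandler();
        StringWriter sink = new StringWriter();
        serializer.setResult(new StreamResult(sink));
        MergingHandler handler = new MergingHandler(serializer);

        SAXParserFactory pf = SAXParserFactory.newInstance();
        pf.setNamespaceAware(true);
        handler.startDocumentExplicitly(root);
        for (String xml : inputs) {
            XMLReader reader = pf.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        }
        handler.endDocumentExplicitly(root);
        return sink.toString();
    }
}
```

Only the three callbacks named above are forwarded. Prefix mappings, processing instructions, and ignorable whitespace are dropped in this sketch and would need their own forwarding methods if the inputs rely on them.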

Karussell 2010-02-16 23:31:20

Answer 4

A:

The most efficient way to merge files is to use the byte-level cut-and-paste feature offered by VTD-XML, AFAIK. You take both files, parse them into VTDNav objects, then instantiate an XMLModifier object, grab the fragments from the second file, and insert them into the first file. That has got to be far more efficient than SAX.

vtd-xml-author 2010-02-18 09:11:24

Hmm, but I don't want to have them in memory; I just want to pipe them directly to disk. And I don't understand how that would be faster than SAX.

Karussell 2010-02-18 09:45:21

With SAX you are doing a lot more than just piping them to disk; much of SAX's parsing overhead is a complete waste of cycles. Using VTD-XML, I wouldn't be surprised to see a 10x (at least) performance improvement.

vtd-xml-author 2010-02-18 09:49:55

OK, thanks for the VTD-XML hint. It looks promising (from what I can read on the SourceForge website). But even if it were 100 times faster, I could not use it if it needs as much RAM as the document itself (or even more): the resulting XML might not fit into memory at all.

Karussell 2010-02-18 22:16:33

Note that Mr. Zhang is the author of VTD-XML.

John Saunders 2010-03-09 09:19:51
