I need to read a large XML document from the network and split it up into smaller XML documents. In particular the stream I read from the network looks something like this:

<a> <b> ... </b> <b> ... </b> <b> ... </b> <b> ... </b> .... </a>

I need to break this up into chunks of

<a> <b> ... </b> </a>

(I actually only need the <b> .... </b> parts, if that makes it easier, as long as the namespace bindings declared higher up (e.g. in <a>) are moved to <b>.)

The file is too big for a DOM-style parser, so it has to be done in a streaming fashion. Is there any XML library that can do this?

[Edit]

I think what I'm ideally looking for is something like the ability to do XPath queries on an XML stream, where the stream parser only parses as far as necessary to return the next item in the result node set (and all its attributes and children). It doesn't have to be XPath, but something along those lines.

Thanks!

+2  A: 

The JAXP SAX API with a SAX filter is both fast and efficient. A good intro to filters can be seen here
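To make the idea more concrete, here is a minimal sketch of a SAX-based splitter that uses only the JDK. The class name SaxSplitter, its split() method and the Consumer sink are hypothetical names made up for this illustration; they are not part of JAXP. It assumes a namespace-aware parser, collects the namespace bindings declared on the ancestors (such as <a>), and replays them on each chunk root before forwarding the chunk's events to an identity TransformerHandler that serializes them. Only elements, attributes and text are forwarded; comments and processing instructions are dropped.

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: hands every <chunkName> element to a sink as a standalone XML string. */
public class SaxSplitter extends DefaultHandler {
    private final String chunkName;
    private final Consumer<String> sink;
    private final SAXTransformerFactory factory =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    // Namespace bindings declared by each open element outside a chunk (head = innermost).
    private final Deque<Map<String, String>> ancestors = new ArrayDeque<>();
    // Bindings announced for the element whose startElement has not arrived yet.
    private final Map<String, String> pending = new LinkedHashMap<>();
    private TransformerHandler out;   // non-null only while inside a chunk
    private StringWriter buffer;      // receives the serialized chunk
    private int depth;                // element depth inside the current chunk

    public SaxSplitter(String chunkName, Consumer<String> sink) {
        this.chunkName = chunkName;
        this.sink = sink;
    }

    /** Parses the stream and calls the sink once per chunk, as soon as that chunk is complete. */
    public static void split(InputStream in, String chunkName, Consumer<String> sink) throws Exception {
        SAXParserFactory parsers = SAXParserFactory.newInstance();
        parsers.setNamespaceAware(true); // needed to receive startPrefixMapping events
        parsers.newSAXParser().parse(in, new SaxSplitter(chunkName, sink));
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
        if (out != null) out.startPrefixMapping(prefix, uri);
        else pending.put(prefix, uri);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) throws SAXException {
        if (out == null && !local.equals(chunkName)) {
            ancestors.push(new LinkedHashMap<>(pending)); // an element outside any chunk, e.g. <a>
            pending.clear();
            return;
        }
        if (out == null) openChunk();
        depth++;
        out.startElement(uri, local, qName, atts);
    }

    private void openChunk() throws SAXException {
        Map<String, String> bindings = new LinkedHashMap<>();
        ancestors.descendingIterator().forEachRemaining(bindings::putAll); // outermost first
        bindings.putAll(pending); // the chunk root's own declarations win
        pending.clear();
        buffer = new StringWriter();
        try {
            out = factory.newTransformerHandler(); // identity transform, used as a serializer
        } catch (TransformerConfigurationException e) {
            throw new SAXException(e);
        }
        out.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        out.setResult(new StreamResult(buffer));
        out.startDocument();
        for (Map.Entry<String, String> b : bindings.entrySet()) {
            out.startPrefixMapping(b.getKey(), b.getValue());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (out != null) out.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (out == null) {
            ancestors.pop();
            return;
        }
        out.endElement(uri, local, qName);
        if (--depth == 0) { // the chunk root just closed
            out.endDocument();
            sink.accept(buffer.toString());
            out = null;
        }
    }
}
```

A call such as SaxSplitter.split(networkStream, "b", chunk -> handle(chunk)) (networkStream and handle being placeholders) would then deliver one <b> document at a time, so only the current chunk is ever held in memory. To get the <a> wrapper back, the sink can simply surround each chunk with it.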

Jimmy
Hmm, I'm not quite getting it. I can see how to catch the event when my tag gets parsed, but it's not clear to me how to get the filter to redirect the stream to a new document until the end tag, or how to include the parents and their various namespace bindings. Any chance you can expand on this a bit? I'm aware I can do this with plain SAX by catching all kinds of events, keeping track of things and copying them over, but I was hoping there was an easier solution.
Carsten
It's not the easy solution I hoped for, but it is correct and nobody had a better suggestion, so I'll give it to you ...
Carsten
A: 

I happen to like the XOM XML library, as its interface is simple, intuitive and powerful. To do what you want with XML, you can use your own NodeFactory and (for example) override the finishMakingElement() method. If it is making the element that you want (in your case, <b>) then you pass it along to whatever you need to do with it.

Adam Batkin
A: 

Am I a maverick in suggesting regular expressions...?
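For what it's worth, a minimal sketch of what such an expression could look like, using java.util.Scanner so that the input is still consumed as a stream. The class name RegexSplitter is made up for this illustration. It assumes that <b> elements are never nested and never self-closing, and that the text "</b>" does not occur inside comments or CDATA sections. It also does nothing about namespace bindings declared on <a>.

```java
import java.io.InputStream;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

/** Hypothetical helper: finds each <b>...</b> run in a stream with a regular expression. */
public class RegexSplitter {
    // "<b>" or "<b attr=...>", then lazily everything up to the first "</b>".
    private static final Pattern B_ELEMENT =
            Pattern.compile("<b(?:\\s[^>]*)?>.*?</b\\s*>", Pattern.DOTALL);

    public static void split(InputStream in, Consumer<String> sink) {
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            String match;
            // A horizon of 0 means "no limit": the scanner keeps reading until a match is found.
            while ((match = scanner.findWithinHorizon(B_ELEMENT, 0)) != null) {
                sink.accept(match);
            }
        }
    }
}
```

The scanner buffers input until the next match is complete, so memory use depends on the size of a single <b> element as long as matches keep coming.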

Neil Coffey
not if you supply the expression ....
Carsten
Well... without seeing the file...
Neil Coffey
A: 

As an XML splitter, VTD-XML is ideally suited for this task... it is also more memory efficient than DOM. The key method that simplifies the coding is VTDNav's getElementFragment()... Below is Java code that splits input.xml into out0.xml and out1.xml, turning

<a> <b> text1 </b>  <b> text2 </b> </a>

into

<a><b> text1 </b></a>

and

<a><b> text2 </b></a>

using XPath

/a/b

The code

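import java.io.*;
import com.ximpleware.*;

public class split {
    public static void main(String[] argv) throws Exception {
        VTDGen vg = new VTDGen();
        // Parse the whole file into memory; "true" turns on namespace awareness.
        if (vg.parseFile("c:/split/input.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/a/b");
            int k = 0;
            byte[] ba = vn.getXML().getBytes();
            // evalXPath() moves the cursor to the next matching <b> and returns -1 when done.
            while (ap.evalXPath() != -1) {
                // Lower 32 bits: byte offset of the fragment; upper 32 bits: its length.
                long l = vn.getElementFragment();
                // try-with-resources closes each output file once it has been written.
                try (FileOutputStream fos = new FileOutputStream("c:/split/out" + k + ".xml")) {
                    fos.write("<a>".getBytes());
                    fos.write(ba, (int) l, (int) (l >> 32));
                    fos.write("</a>".getBytes());
                }
                k++;
            }
        }
    }
}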

For further reading, please visit http://www.devx.com/xml/Article/36379

vtd-xml-author
Thanks for your reply. This looks like a DOM-style approach to me, requiring the whole document to be parsed before doing a query. My XML stream is too big for that; it needs to be done by a streaming parser.
Carsten
With the extended version it can do partial loading via memory mapping, but that is only available in the extended edition. With the standard version, 2GB is the most you can load, and it only consumes around 1/5 of the memory of DOM...
vtd-xml-author
A: 

go old school

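// Assumes the closing tag of each chunk (endStatementTag) sits on a line of its own.
StringBuilder buffer = new StringBuilder(1024 * 50);
try (BufferedReader reader = new BufferedReader(new FileReader(pstmtout))) {
  String line;
  while ((line = reader.readLine()) != null) {
    // readLine() strips the line terminator, so put it back.
    buffer.append(line).append('\n');
    if (line.trim().equalsIgnoreCase(endStatementTag)) {
      service.handle(buffer.toString());
      buffer.setLength(0); // start collecting the next chunk
    }
  }
}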