I'm trying to read a single XML document at a time from a stream using dom4j, process it, then proceed to the next document on the stream. Unfortunately, dom4j's SAXReader (using JAXP under the covers) keeps reading and chokes on the following document element.

Is there a way to get the SAXReader to stop reading the stream once it finds the end of the document element? Is there a better way to accomplish this?

A: 

Most likely, you don't want to have more than one document in the same stream at the same time. I don't think that the SAXReader is smart enough to stop when it gets to the end of the first document. Why is it necessary to have multiple documents in the same stream like this?

Ian McLaird
The XML specification describes an XML document as the prolog, document element, and trailing comments, processing instructions, and whitespace. There's nothing that states that a given medium like a file or stream can only hold a single document. Why *not* have multiple documents per stream?
Alan Krueger
A: 

I think you'd have to add an adapter, something to wrap the stream and return end of file when it sees the beginning of the next document. As far as I know, the parsers as written will go until the end of the file or an error, and seeing another <?xml version="1.0"?> would certainly be an error.
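
A minimal sketch of such an adapter, for illustration only. It assumes every document begins with an XML declaration and that the text `<?xml ` never occurs inside a document (for example in a comment or CDATA section). The class name `DeclarationSplittingReader` and its `nextDocument()` method are made up for this example and are not part of dom4j or JAXP.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical adapter: reports end of input just before the next "<?xml " marker.
class DeclarationSplittingReader extends Reader {
    private static final String MARKER = "<?xml ";
    private final PushbackReader in;
    private boolean started;     // at least one character of the current document was returned
    private boolean atBoundary;  // stopped in front of the next declaration
    private boolean exhausted;   // underlying reader hit its real end

    DeclarationSplittingReader(Reader source) {
        // Room for the whole marker, so a look-ahead can always be pushed back.
        this.in = new PushbackReader(source, MARKER.length());
    }

    // Re-arms the reader for the following document; false when nothing is left.
    boolean nextDocument() {
        if (exhausted) return false;
        atBoundary = false;
        started = false;
        return true;
    }

    @Override
    public int read() throws IOException {
        if (atBoundary || exhausted) return -1;
        int c = in.read();
        if (c == -1) { exhausted = true; return -1; }
        if (c == '<' && started && restOfMarkerFollows()) {
            in.unread('<');      // keep the declaration for the next document
            atBoundary = true;
            return -1;
        }
        started = true;
        return c;
    }

    // Peeks at the characters after '<' and pushes them back again.
    private boolean restOfMarkerFollows() throws IOException {
        char[] buf = new char[MARKER.length() - 1];
        int n = 0;
        while (n < buf.length) {
            int c = in.read();
            if (c == -1) break;
            buf[n++] = (char) c;
        }
        in.unread(buf, 0, n);
        return n == buf.length && MARKER.startsWith(new String(buf), 1);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            cbuf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public void close() {
        // Deliberately a no-op: a parser may close its input after each document.
    }

    public static void main(String[] args) throws IOException {
        String stream = "<?xml version=\"1.0\"?><a/>\n<?xml version=\"1.0\"?><b/>\n";
        DeclarationSplittingReader reader = new DeclarationSplittingReader(new StringReader(stream));
        do {
            StringBuilder doc = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) doc.append((char) c);
            System.out.println("document: " + doc.toString().trim());
        } while (reader.nextDocument());
    }
}
```

Note that this kind of wrapper only notices the end of a document once the next one starts arriving, so on a long-lived connection the parse of the most recent document would block until more data shows up.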

trenton
This seems like a hack to work around a parser limitation. The trouble is, though, determining when to insert this EOF marker requires parsing the XML.
Alan Krueger
+1  A: 

I was able to get this to work with some gymnastics using some internal JAXP classes:

  • Create a custom scanner, a subclass of XMLNSDocumentScannerImpl
    • Create a custom driver, an implementation of XMLNSDocumentScannerImpl.Driver, inside the custom scanner that returns END_DOCUMENT when it sees an XML declaration or an element. Get the ScannedEntity from fElementScanner.getCurrentEntity(). If the entity has a PushbackReader, push back the remaining unread characters in the entity buffer onto the reader.
    • In the constructor, replace the fTrailingMiscDriver with an instance of this custom driver.
  • Create a custom configuration class, a subclass of XIncludeAwareParserConfiguration, that replaces the stock DOCUMENT_SCANNER with an instance of this custom scanner in its constructor.
  • Install an instance of this custom configuration class as the "com.sun.org.apache.xerces.internal.xni.parser.XMLParserConfiguration" property so it will be instantiated when dom4j's SAXReader class tries to create a JAXP XMLReader.
  • When passing a Reader to dom4j's SAXReader.read() method, supply a PushbackReader with a buffer size considerably larger than the one-character default. At least 8192 should be enough to support the default buffer size of the XMLEntityManager inside JAXP's copy of Apache Xerces2.

This isn't the cleanest solution, as it involves subclassing internal JAXP classes, but it does work.
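
To illustrate only the last step (the buffer size), here is a small standalone sketch using nothing but java.io. It shows that a PushbackReader created with the default constructor can take back just one character, while one created with a capacity of 8192 can take back a whole parser buffer. The class and method names are made up for this example; the internal scanner and configuration classes are not shown.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

class PushbackCapacityDemo {
    // Returns true if `count` characters can be pushed back onto `reader`.
    static boolean canUnread(PushbackReader reader, int count) {
        try {
            reader.unread(new char[count], 0, count);
            return true;
        } catch (IOException overflow) {
            // Thrown when the pushback buffer is too small.
            return false;
        }
    }

    public static void main(String[] args) {
        // Default constructor: one-character pushback buffer.
        PushbackReader small = new PushbackReader(new StringReader(""));
        // Explicit capacity, sized like the 8192 suggested above.
        PushbackReader large = new PushbackReader(new StringReader(""), 8192);

        System.out.println(canUnread(small, 8192)); // false
        System.out.println(canUnread(large, 8192)); // true
    }
}
```

A reader built like `large` is what would be handed to SAXReader.read() in the setup described above.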

Alan Krueger
A: 

Assuming you are responsible for placing documents into the stream in the first place, it should be easy to delimit the documents in some fashion. For example:

// Any value that is invalid for an XML character will do.
static final char DOC_TERMINATOR = 4;

// Writes one document followed by the terminator character.
static void addDocumentToStream(BufferedWriter streamOut, char[] xmlData) throws IOException
{
  streamOut.write(xmlData);
  streamOut.write(DOC_TERMINATOR);
}

Then, when reading from the stream, read into an array until DOC_TERMINATOR is encountered.

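static char[] getNextDocument(BufferedReader streamIn) throws IOException
{
  StringBuffer buffer = new StringBuffer();
  int character;

  // Note: read() returns -1 at end of stream, which is not handled here.
  while (true)
  {
    character = streamIn.read();
    if (character == DOC_TERMINATOR)
      break;

    // The cast matters: append(int) would append the decimal digits instead.
    buffer.append((char) character);
  }
  return buffer.toString().toCharArray();
}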

Since 4 is not a valid XML character value, you won't encounter it except where you explicitly add it, which lets you split the documents. Now just wrap the resulting char array for input into SAX and you're good to go.

...
  XMLReader xmlReader = XMLReaderFactory.createXMLReader();
...
  while (true)
  {
    char[] xmlDoc = getNextDocument(streamIn);

    if (xmlDoc.length == 0)
      break;

    InputSource saxInputSource = new InputSource(new CharArrayReader(xmlDoc));
    xmlReader.parse(saxInputSource);
  }
...

Note that the loop terminates when it gets a doc of length 0. This means that you should either add a second DOC_TERMINATOR after the last document, or add something to detect the end of the stream in getNextDocument().
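
One way the end-of-stream variant might look, as a self-contained sketch. The names DOC_TERMINATOR, addDocumentToStream and getNextDocument follow the snippets above; returning null at end of stream, working with String instead of char[], and using SAXParserFactory rather than XMLReaderFactory are choices made for this example only.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DelimitedDocuments {
    // Any value that is not a legal XML character will do.
    static final char DOC_TERMINATOR = 4;

    static void addDocumentToStream(Writer out, String xml) throws IOException {
        out.write(xml);
        out.write(DOC_TERMINATOR);
    }

    // Returns the next document, or null once the stream is exhausted
    // (assumption of this sketch; the original signals the end with an empty array).
    static String getNextDocument(Reader in) throws IOException {
        StringBuilder buffer = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == DOC_TERMINATOR) {
                return buffer.toString();
            }
            buffer.append((char) c);
        }
        // End of stream: hand back an unterminated trailing document if there is one.
        return buffer.length() == 0 ? null : buffer.toString();
    }

    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        addDocumentToStream(out, "<a/>");
        addDocumentToStream(out, "<b/>");

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Reader in = new StringReader(out.toString());
        String doc;
        while ((doc = getNextDocument(in)) != null) {
            // A real handler would go here; DefaultHandler ignores all events.
            parser.parse(new InputSource(new StringReader(doc)), new DefaultHandler());
            System.out.println("parsed: " + doc);
        }
    }
}
```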

"Assuming you are responsible for placing documents into the stream ..." Sadly, such simplifying assumptions are sometimes not available, as in this case.
Alan Krueger
A: 

I have done this before by wrapping the base reader with another reader of my own creation that had very simple parsing capability. Assuming you know the closing tag for the document, the wrapper simply parses for a match, e.g. for "</MyDocument>". When it detects the match, it returns EOF. The wrapper can be made adaptive by parsing out the first opening tag and returning EOF on the matching closing tag. I found it was not necessary to actually detect the nesting level of the closing tag, since none of my documents used the document tag within itself, so the first occurrence of the closing tag was guaranteed to end the document.

As I recall, one of the tricks was to have the wrapper block close(), since the DOM reader closes the input source.

So, given Reader input, your code might look like:

SubdocReader sdr=new SubdocReader(input);
while(!sdr.eof()) {
    sdr.next();
    // read doc here using DOM
    // then process document
    }
input.close();

The eof() method returns true if EOF is encountered. The next() method flags the reader to stop returning -1 for read().
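
For illustration, a rough guess at what such a wrapper could look like. This is not the original SubdocReader: the two-argument constructor taking the closing tag, the whitespace skipping in eof(), and the no-op close() are all assumptions of this sketch, and it relies on the closing tag never appearing nested inside a document.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical reconstruction: reports EOF right after each closing tag.
class SubdocReader extends Reader {
    private final PushbackReader in;
    private final String closingTag;
    private int matched;              // characters of closingTag matched so far
    private boolean docEnded = true;  // read() returns -1 until next() is called

    SubdocReader(Reader input, String closingTag) {
        this.in = new PushbackReader(input, 1);
        this.closingTag = closingTag;
    }

    // Call between documents only: skips whitespace and reports
    // whether the underlying reader has nothing else to offer.
    boolean eof() throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (!Character.isWhitespace(c)) {
                in.unread(c);
                return false;
            }
        }
        return true;
    }

    // Flags the reader to stop returning -1 so the next document can be read.
    void next() {
        docEnded = false;
        matched = 0;
    }

    @Override
    public int read() throws IOException {
        if (docEnded) return -1;
        int c = in.read();
        if (c == -1) { docEnded = true; return -1; }
        // Simple matcher; sufficient because '<' occurs only at the start of a closing tag.
        if (c == closingTag.charAt(matched)) matched++;
        else matched = (c == closingTag.charAt(0)) ? 1 : 0;
        if (matched == closingTag.length()) docEnded = true; // this character is still returned
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            cbuf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public void close() {
        // Blocked on purpose: the parser closes its input source after each document.
    }

    public static void main(String[] args) throws Exception {
        Reader input = new StringReader("<doc id=\"1\"/>\n<doc id=\"2\"></doc>\n<doc id=\"3\"><x/></doc>\n");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Self-closing documents like the first one would defeat the matcher,
        // so this demo skips it by using "/>" aware input only from id=2 onward.
        input = new StringReader("<doc id=\"2\"></doc>\n<doc id=\"3\"><x/></doc>\n");
        SubdocReader sdr = new SubdocReader(input, "</doc>");
        while (!sdr.eof()) {
            sdr.next();
            Document doc = builder.parse(new InputSource(sdr));
            System.out.println("id = " + doc.getDocumentElement().getAttribute("id"));
        }
        input.close();
    }
}
```

As the comment in main() hints, a document element written in self-closing form has no closing tag to match, which is one more limitation of this simple matching approach.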

Hopefully this points you in a useful direction.

-- Kiwi.

Software Monkey
A: 
Michael Rutherfurd
A stream doesn't necessarily have an end, it could be a persistent network connection or operating system pipe that closes only rarely. Reading that entirely into memory is impossible and/or absurd.
Alan Krueger