I have a situation similar to an earlier question about emitting XML: I am analyzing data in a SAX ContentHandler while serializing it to a stream. I suspect that the solution in the linked question, though it offers exactly the API I am looking for, is not memory-efficient, since it involves an identity transform with the XSLT processor. I want the program's memory consumption to be bounded rather than growing with the input size.

How can I easily forward the arguments passed to my ContentHandler methods on to a serializer without doing acrobatics to adapt e.g. StAX to SAX, or worse yet, copying the SAX event contents to the output stream by hand?

Edit: here's a minimal example of what I am after. thingIWant should just write to the OutputStream given to it. Like I said, the earlier question has a TransformerHandler that gives me the right API, but it uses the XSLT processor instead of just a simple serialization.

public class MyHandler implements ContentHandler {

    ContentHandler thingIWant;

    MyHandler(OutputStream outputStream) {
        thingIWant = setup(outputStream);
    }

    public void startDocument() throws SAXException {
        // parsing logic
        thingIWant.startDocument();
    }

    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        // parsing logic
        thingIWant.startElement(uri, localName, qName, atts);
    }

    public void characters(char[] ch, int start, int length) throws SAXException {
        // parsing logic
        thingIWant.characters(ch, start, length);
    }

    // etc...
}
+1  A: 

Edit: Includes default JDK version

The most efficient approach would be an XMLWriter that implements ContentHandler. In a nutshell, you are reading from and writing to IO buffers. DOM4J has an XMLWriter, which is used below. You can either subclass XMLWriter or use an XMLFilter to do the analysis. I am using an XMLFilter in this example. Note that XMLFilterImpl is also a ContentHandler. Here is the complete code.

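import org.dom4j.io.XMLWriter;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;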

public class XMLPipeline {

    public static void main(String[] args) throws Exception {
        String inputFile = "build.xml";
        PrintStream outputStream = System.out;
        new XMLPipeline().pipe(inputFile, outputStream);
    }

    //dom4j
    public void pipe(String inputFile, OutputStream outputStream) throws
            SAXException, ParserConfigurationException, IOException {
        XMLWriter xwriter = new XMLWriter(outputStream);
        XMLReader xreader = XMLReaderFactory.createXMLReader();
        XMLAnalyzer analyzer = new XMLAnalyzer(xreader);
        analyzer.setContentHandler(xwriter);
        analyzer.parse(inputFile);

        //do what you want with analyzer
        System.err.println(analyzer.elementCount);
    }


    //default JDK
    public void pipeTrax(String inputFile, OutputStream outputStream) throws
            SAXException, ParserConfigurationException,
            IOException, TransformerException {
        StreamResult xwriter = new StreamResult(outputStream);
        XMLReader xreader = XMLReaderFactory.createXMLReader();
        XMLAnalyzer analyzer = new XMLAnalyzer(xreader);
        TransformerFactory stf = SAXTransformerFactory.newInstance();
        SAXSource ss = new SAXSource(analyzer, new InputSource(inputFile));
        stf.newTransformer().transform(ss, xwriter);
        //report on stderr so the count cannot mix with XML written to stdout
        System.err.println(analyzer.elementCount);
    }

    //This method simply reads from a file, runs it through SAX parser and dumps it
    //to dom4j writer
    public void dom4jNoop(String inputFile, OutputStream outputStream) throws
            IOException, SAXException {
        XMLWriter xwriter = new XMLWriter(outputStream);
        XMLReader xreader = XMLReaderFactory.createXMLReader();
        xreader.setContentHandler(xwriter);
        xreader.parse(inputFile);
    }

    //Simplest way to read a file and write it back to an output stream
    public void traxNoop(String inputFile, OutputStream outputStream)
            throws TransformerException {
        TransformerFactory stf = SAXTransformerFactory.newInstance();
        stf.newTransformer().transform(new StreamSource(inputFile),
                new StreamResult(outputStream));
    }

    //this analyzer counts the number of elements in sax stream
    public static class XMLAnalyzer extends XMLFilterImpl {
        int elementCount = 0;

        public XMLAnalyzer(XMLReader xmlReader) {
            super(xmlReader);
        }

        @Override
        public void startElement(String uri, String localName, String qName, 
          Attributes atts) throws SAXException {
            super.startElement(uri, localName, qName, atts);
            elementCount++;
        }
    }
}
Chandra Patni
I actually don't want to do any transforming. I want to parse the document and update some state based on its contents, while serializing the document as-is to an OutputStream.
Steven Huwig
Your edit is a little better but I don't currently have a dependency on DOM4J and don't want one if I can help it. I understand that specific XML parsers and APIs have XMLWriter-type serializers, but I'd like to restrict this to the javax.xml.* world.
Steven Huwig
I have added the JDK version. AFAIK, to print XML with the JDK alone, you need to use a Transformer. The JDK Transformer code is 2x slower in my micro-benchmark than the dom4j writer. If you want performance and no dependency, you can write your own XML writer, which is quite simple; see the DOM4J XMLWriter source code.
Chandra Patni
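For readers wondering what such a hand-written writer might look like, here is a minimal sketch. It is not taken from DOM4J; SimpleXmlWriter and its escape helper are hypothetical names, and the sketch assumes a parser that is not namespace-aware (so xmlns declarations arrive as ordinary attributes). It ignores comments, processing instructions, DTDs, and the XML declaration. It holds only the current event in memory, so its footprint does not grow with the document.

```java
import java.io.IOException;
import java.io.Writer;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical minimal serializer: writes each SAX event straight to a Writer.
class SimpleXmlWriter extends DefaultHandler {
    private final Writer out;

    SimpleXmlWriter(Writer out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        StringBuilder tag = new StringBuilder("<").append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            tag.append(' ').append(atts.getQName(i)).append("=\"")
               .append(escape(atts.getValue(i), true)).append('"');
        }
        write(tag.append('>').toString());
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        write("</" + qName + ">");
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        write(escape(new String(ch, start, length), false));
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            out.flush();
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    // Wraps IOExceptions, since ContentHandler methods may only throw SAXException.
    private void write(String s) throws SAXException {
        try {
            out.write(s);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    // Escapes markup characters; double quotes are escaped only inside attribute values.
    private static String escape(String s, boolean inAttribute) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"':
                    if (inAttribute) sb.append("&quot;"); else sb.append(c);
                    break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

An instance can be passed wherever a ContentHandler or DefaultHandler is expected, for example as the second argument of SAXParser.parse(InputSource, DefaultHandler). Empty elements are written as an open/close pair rather than in self-closing form.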
I'm actually looking for memory efficiency, not time efficiency. Specifically, the parse should be able to handle documents of arbitrary length while consuming only a fixed amount of memory.
Steven Huwig
+1  A: 

First: don't worry about the identity transform; it does not build an in-memory representation of the data.
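For reference, the identity-transform serializer in question can be sketched roughly as follows. This is a minimal sketch, not code from the linked question: the class name IdentitySerializer is made up, setup() simply mirrors the method name used in the question, and omitting the XML declaration is an arbitrary choice for the example. It assumes the configured TransformerFactory supports SAX, which the code checks before casting.

```java
import java.io.OutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;

// Hypothetical helper: builds the "thingIWant" from the question.
class IdentitySerializer {

    // Returns a ContentHandler that serializes every event it receives to 'out'.
    static ContentHandler setup(OutputStream out) throws TransformerConfigurationException {
        TransformerFactory factory = TransformerFactory.newInstance();
        if (!factory.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new TransformerConfigurationException("SAX transformer support not available");
        }
        // The no-argument overload creates a handler for the identity (copy) transform.
        TransformerHandler handler = ((SAXTransformerFactory) factory).newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(out));
        return handler;
    }
}
```

The returned handler expects a well-formed event sequence, starting with startDocument() and ending with endDocument().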

To implement your "tee" functionality, you have to create a content handler that listens to the stream of events produced by the parser and passes them on to the handler provided for you by the transformer. Unfortunately, this is not as easy as it sounds: the parser wants to send events to a DefaultHandler, while the transformer wants to read events from an XMLReader. The former is a concrete class, the latter is an interface. The JDK also provides the class XMLFilterImpl, which implements all of the interfaces of DefaultHandler, but does not extend from it ... that's what you get for incorporating two different projects as your "reference implementations."

So, you need to write a bridge class between the two:

import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/**
 *  Uses a decorator ContentHandler to insert a "tee" into a SAX parse/serialize
 *  stream.
 */
public class SaxTeeExample
{
    public static void main(String[] argv)
    throws Exception
    {
        StringReader src = new StringReader("<root><child>text</child></root>");
        StringWriter dst = new StringWriter();

        Transformer xform = TransformerFactory.newInstance().newTransformer();
        XMLReader reader = new MyReader(SAXParserFactory.newInstance().newSAXParser());
        xform.transform(new SAXSource(reader, new InputSource(src)),
                        new StreamResult(dst));

        System.out.println(dst.toString());
    }


    private static class MyReader
    extends XMLFilterImpl
    {
        private SAXParser _parser;

        public MyReader(SAXParser parser)
        {
            _parser = parser;
        }

        @Override
        public void parse(InputSource input) 
        throws SAXException, IOException
        {
            _parser.parse(input, new XMLFilterBridge(this));
        }

        // this is an example of a "tee" function
        @Override
        public void startElement(String uri, String localName, String name, Attributes atts) throws SAXException
        {
            System.out.println("startElement: " + name);
            super.startElement(uri, localName, name, atts);
        }
    }


    private static class XMLFilterBridge
    extends DefaultHandler
    {
        private XMLFilterImpl _filter;

        public XMLFilterBridge(XMLFilterImpl myFilter)
        {
            _filter = myFilter;
        }

        @Override
        public void characters(char[] ch, int start, int length)
        throws SAXException
        {
            _filter.characters(ch, start, length);
        }

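        @Override
        public void startDocument()
        throws SAXException
        {
            _filter.startDocument();
        }

        @Override
        public void endDocument()
        throws SAXException
        {
            _filter.endDocument();
        }

        @Override
        public void startElement(String uri, String localName, String name, Attributes atts)
        throws SAXException
        {
            _filter.startElement(uri, localName, name, atts);
        }

        @Override
        public void endElement(String uri, String localName, String name)
        throws SAXException
        {
            _filter.endElement(uri, localName, name);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length)
        throws SAXException
        {
            _filter.ignorableWhitespace(ch, start, length);
        }

        @Override
        public void processingInstruction(String target, String data)
        throws SAXException
        {
            _filter.processingInstruction(target, data);
        }

        // override the remaining methods of DefaultHandler the same way
        // (prefix mappings, setDocumentLocator, skippedEntity, and the
        // EntityResolver, DTDHandler, and ErrorHandler callbacks)
        // ...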
    }
}

The main method sets up the transformer. The interesting part is that the SAXSource is constructed around MyReader. When the transformer is ready for events, it will call the parse() method of that object, passing it the specified InputSource.

The next part is not obvious: XMLFilterImpl follows the Decorator pattern. The transformer will call various setter methods on this object before starting the transform, passing its own handlers. Any methods that I don't override (e.g., startDocument()) will simply call the delegate. As an example override, I'm doing "analysis" (just a println) in startElement(). You'll probably override other ContentHandler methods.

And finally, XMLFilterBridge is the bridge between DefaultHandler and XMLReader; it's also a decorator, and every method simply calls the delegate. I show only a few overrides, but you'll have to do them all.
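A possible alternative, which is not what the code above does: if the filter is given a parent XMLReader, the inherited XMLFilterImpl.parse() registers the filter as the parent's handlers and delegates to it, so no bridge class is needed. The sketch below assumes the parent comes from SAXParser.getXMLReader(); the names ParentedTee, create, and copy are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical "tee" filter that relies on a parent XMLReader instead of a bridge.
class ParentedTee extends XMLFilterImpl {
    int elementCount = 0;

    ParentedTee(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        elementCount++;                                   // the "analysis"
        super.startElement(uri, localName, qName, atts);  // forward to the serializer
    }

    // Builds a tee whose parent is the XMLReader underlying a JAXP SAXParser.
    static ParentedTee create() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return new ParentedTee(factory.newSAXParser().getXMLReader());
    }

    // Parses 'xml' through the tee and returns the serialized copy.
    static String copy(ParentedTee tee, String xml) throws Exception {
        StringWriter dst = new StringWriter();
        TransformerFactory.newInstance().newTransformer().transform(
                new SAXSource(tee, new InputSource(new StringReader(xml))),
                new StreamResult(dst));
        return dst.toString();
    }
}
```

Calling ParentedTee.copy(tee, xml) on a tee obtained from ParentedTee.create() serializes the input and leaves the number of elements seen in tee.elementCount.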

kdgregory
Incidentally, I'll be adding the bridge class to the Practical Xml library (http://sourceforge.net/projects/practicalxml/develop) ... I was pulling pieces of old XML code together for this answer, and I think it might be generally useful.
kdgregory
As I was reviewing this, I thought `parse()` is implemented by `XMLFilterImpl`, I don't need that bridge. Unfortunately, the implementation simply delegates to a parent `XMLReader`.
kdgregory
This is great information -- one question, though. Is it in the spec or the JDK implementation itself that an identity transform on a SAXSource is memory-safe, or is that an artifact of the transformation processor included as the default JDK processor? I.e. if someone uses a particularly bogus XSL system via JAXP configuration, will this still work?
Steven Huwig
It's reading between the lines of the spec: (1) the Transformer object is not tied to XSLT, (2) there's a distinct factory method for the copy transform, and (3) XSLT does not itself have a copy transform. I thought it did, but checked the spec; the default rules cover template matching, they do not specify any actions to take.
kdgregory
+2  A: 

I recently had a similar problem. Here is the class I wrote to get you thingIWant:

import java.io.OutputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerException;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.*;

public class XMLSerializer implements ContentHandler {
    static final private TransformerFactory tf = TransformerFactory.newInstance();
    private ContentHandler ch;

    public XMLSerializer(OutputStream os) throws SAXException {
        try {
            final Transformer t = tf.newTransformer();

            t.transform(new SAXSource(                
                new XMLReader() {     
                    public ContentHandler getContentHandler() { return ch; }
                    public DTDHandler getDTDHandler() { return null; }      
                    public EntityResolver getEntityResolver() { return null; }
                    public ErrorHandler getErrorHandler() { return null; }    
                    public boolean getFeature(String name) { return false; }
                    public Object getProperty(String name) { return null; } 
                    public void parse(InputSource input) { }               
                    public void parse(String systemId) { }  
                    public void setContentHandler(ContentHandler handler) { ch = handler; }                
                    public void setDTDHandler(DTDHandler handler) { }
                    public void setEntityResolver(EntityResolver resolver) { }
                    public void setErrorHandler(ErrorHandler handler) { }
                    public void setFeature(String name, boolean value) { }
                    public void setProperty(String name, Object value) { }
                }, new InputSource()),                                    
                new StreamResult(os));
        }
        catch (TransformerException e) {
            throw new SAXException(e);  
        }

        if (ch == null)
            throw new SAXException("Transformer didn't set ContentHandler");
    }

    public void setDocumentLocator(Locator locator) {
        ch.setDocumentLocator(locator);
    }

    public void startDocument() throws SAXException {
        ch.startDocument();
    }

    public void endDocument() throws SAXException {
        ch.endDocument();
    }

    public void startPrefixMapping(String prefix, String uri) throws SAXException {
        ch.startPrefixMapping(prefix, uri);
    }

    public void endPrefixMapping(String prefix) throws SAXException {
        ch.endPrefixMapping(prefix);
    }

    public void startElement(String uri, String localName, String qName, Attributes atts)
        throws SAXException {
        ch.startElement(uri, localName, qName, atts);
    }

    public void endElement(String uri, String localName, String qName)
        throws SAXException {
        ch.endElement(uri, localName, qName);
    }

    public void characters(char[] ch, int start, int length)
        throws SAXException {
        this.ch.characters(ch, start, length);
    }

    public void ignorableWhitespace(char[] ch, int start, int length)
        throws SAXException {
        this.ch.ignorableWhitespace(ch, start, length);
    }

    public void processingInstruction(String target, String data)
        throws SAXException {
        ch.processingInstruction(target, data);
    }

    public void skippedEntity(String name) throws SAXException {
        ch.skippedEntity(name);
    }
}

Basically, it hands the Transformer a dummy XMLReader whose parse() does nothing, and grabs a reference to the Transformer's internal ContentHandler when the Transformer calls setContentHandler(). After that, the class acts as a proxy to the snagged ContentHandler.

Not very clean, but it works.

Chris K