I would like to parse a document using SAX, and create a subdocument from some of the elements, while processing others purely with SAX. So, given this document:

  <DOC>
    <small>
      <element />
    </small>
    <entries>
      <!-- thousands here -->
    </entries>
  </DOC>

I would like to parse the DOC and DOC/entries elements using the SAX ContentHandler, but when I hit <small> I want to create a new document containing just the <small> and its children.

Is there an easy way to do this, or do I have to build the DOM myself, by hand?

+1  A: 

Seems to me the answer depends on whether you need the 'new document' in memory. If you do, use DOM. If you're just going to stream the 'new document', then StAX would probably fit better with the event-driven nature of SAX.
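As a rough illustration of the streaming option, here is a minimal sketch that copies one subtree to a writer with the StAX event API. The class and method names (`StaxSubtreeCopy`, `copySubtree`) are made up for this example, and it assumes elements without namespaces; the exact serialization of empty elements depends on the StAX implementation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams the first <name> subtree of the input to a string.
class StaxSubtreeCopy {
    static String copySubtree(String xml, String name) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        int depth = 0; // 0 = outside the subtree, >0 = nesting level inside it
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (depth == 0) {
                // Outside the subtree: wait for the opening tag we care about.
                if (event.isStartElement()
                        && event.asStartElement().getName().getLocalPart().equals(name)) {
                    writer.add(event);
                    depth = 1;
                }
            } else {
                // Inside the subtree: copy every event and track nesting.
                writer.add(event);
                if (event.isStartElement()) {
                    depth++;
                } else if (event.isEndElement() && --depth == 0) {
                    break; // matching close tag written, subtree complete
                }
            }
        }
        writer.close();
        reader.close();
        return out.toString();
    }
}
```

Calling `StaxSubtreeCopy.copySubtree(xml, "small")` on the document from the question should yield only the `<small>` element and its children, without ever holding the `<entries>` content in memory.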

Nick Holt
+2  A: 

One approach is to create a ContentHandler that watches for events that signal the entry or exit from a <small> element. This handler acts as a proxy, and in "normal" mode passes the SAX events straight through to the "real" ContentHandler.

However, when entry into a <small> element is detected, the proxy creates a TransformerHandler and plumbs it up to a DOMResult. The TransformerHandler expects all the events that a complete, well-formed document would produce, so you cannot immediately send it a startElement event. Instead, simulate the beginning of a new document by invoking setDocumentLocator, startDocument, and any other necessary events on the TransformerHandler instance first.

Then, until the end of the <small> element is detected by the proxy, all events are forwarded to this TransformerHandler instead of the "real" ContentHandler. When the closing </small> tag is encountered, the proxy simulates the end of a document by invoking endDocument on the TransformerHandler. A DOM is now available as the result of the TransformerHandler, and it contains only the <small> fragment.

This process is repeated through the whole, larger document.
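The proxy described above might look roughly like the following sketch. It is not a complete implementation: the class name `SubtreeSplitter` and the static `split` helper are invented for illustration, it builds on `XMLFilterImpl` so that uncaptured events pass through to the "real" handler, and it skips `setDocumentLocator`, prefix mappings, and ignorable whitespace for brevity. It also assumes the default `TransformerFactory` is a `SAXTransformerFactory`, which holds for the JDK's built-in implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical proxy: diverts each <target> subtree into its own DOM Document
// and forwards everything else to the downstream ("real") ContentHandler.
class SubtreeSplitter extends XMLFilterImpl {
    private final String target;          // local name of the element to capture
    final List<Document> fragments = new ArrayList<>();
    private TransformerHandler capture;   // non-null only while inside a target element
    private DOMResult result;
    private int depth;                    // nesting level inside the captured subtree

    SubtreeSplitter(XMLReader parent, String target) {
        super(parent);
        this.target = target;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (capture == null && target.equals(local)) {
            try {
                SAXTransformerFactory tf =
                        (SAXTransformerFactory) TransformerFactory.newInstance();
                capture = tf.newTransformerHandler(); // identity transform
            } catch (TransformerConfigurationException e) {
                throw new SAXException(e);
            }
            result = new DOMResult();
            capture.setResult(result);
            capture.startDocument();      // simulate the start of a new document
        }
        if (capture != null) {
            depth++;
            capture.startElement(uri, local, qName, atts);
        } else {
            super.startElement(uri, local, qName, atts); // pass through
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (capture == null) {
            super.endElement(uri, local, qName);
            return;
        }
        capture.endElement(uri, local, qName);
        if (--depth == 0) {
            capture.endDocument();        // simulate the end of the document
            fragments.add((Document) result.getNode());
            capture = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (capture != null) {
            capture.characters(ch, start, length);
        } else {
            super.characters(ch, start, length);
        }
    }

    // Convenience wrapper: parse a string, return the captured fragments.
    static List<Document> split(String xml, String target, ContentHandler real)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);  // so that localName is populated
        SubtreeSplitter splitter =
                new SubtreeSplitter(factory.newSAXParser().getXMLReader(), target);
        splitter.setContentHandler(real);
        splitter.parse(new InputSource(new StringReader(xml)));
        return splitter.fragments;
    }
}
```

With the document from the question, `SubtreeSplitter.split(xml, "small", realHandler)` should return one Document whose root is `<small>`, while `realHandler` only sees the events for DOC, entries, and the entries' children. The depth counter is there so that a `<small>` nested inside another `<small>` stays in the same fragment.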

erickson
A: 

I've had no problem building multiple simultaneous documents out of one SAX stream. It's pretty much SOP for any business-document-oriented stream. What difficulty are you having with doing that? The hierarchy of your classes needn't match the hierarchy of the SAX stream.

le dorfier