ansaurus

Question

Best Java approach for stream filtering in XML?

Answer 1

+1 A:

I find Apache Digester a big help for rules-based parsing of XML.

Update: If filtering and output are your concern, review this set of articles on developerWorks, which addresses the same issues. Parts 2, 3 and 4 are particularly relevant. In summary: use SAX, XMLFilter and XMLWriter.

While I suppose this is technically a good fit for XSLT, I've always found it hard to debug for complex transformations. YMMV :-)

Further Update: XMLWriter is available from here. I don't know what your particular difficulty with SAX is. I created a file groups.xml containing:

<groups>
<group name="beatles"><item name="paul" number="64"/></group>
<group name="rolling stones"><item name="mick" number="19"/></group>
<group name="who"><item name="roger" number="515"/></group>
</groups>

Note that I had to make some changes to make it well-formed XML. Then, I knocked up this simple Jython script, groups.py, to illustrate how to solve your problem:

import java.io
import org.xml.sax.helpers
import sys

sys.path.append("xml-writer.jar")
import com.megginson.sax

def get_factors(n):
    return "factors for %s" % n

class MyFilter(org.xml.sax.helpers.XMLFilterImpl):
    def startElement(self, uri, localName, qName, attrs):
        if qName == "item":
            newAttrs = org.xml.sax.helpers.AttributesImpl(attrs)
            n = attrs.length
            for i in range(n):
                name = attrs.getLocalName(i)
                if name == "number":
                    newAttrs.addAttribute("", "factors", "factors",
                                          "CDATA",
                                          get_factors(attrs.getValue(i)))
            attrs = newAttrs
        #call superclass method...
        org.xml.sax.helpers.XMLFilterImpl.startElement(self, uri, localName,
                                                       qName, attrs)

source = org.xml.sax.InputSource(java.io.FileInputStream("groups.xml"))
reader = org.xml.sax.helpers.XMLReaderFactory.createXMLReader()
filter = MyFilter(reader)
writer = com.megginson.sax.XMLWriter(filter,
                                     java.io.FileWriter("output.xml"))
writer.parse(source)

Obviously, I've mocked up the factor-finding function, since I believe your example was purely illustrative. The script reads groups.xml, applies the filter, and writes the result to output.xml. Let's run it:

$ jython groups.py
$ cat output.xml

<?xml version="1.0" standalone="yes"?>

<groups>
<group name="beatles"><item name="paul" number="64" factors="factors for 64"></item></group>
<group name="rolling stones"><item name="mick" number="19" factors="factors for 19"></item></group>
<group name="who"><item name="roger" number="515" factors="factors for 515"></item></group>
</groups>

Job done? Of course, you'll need to transcribe this code to Java.
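
For reference, here is a rough sketch of what that transcription to Java might look like. It is not the original answer's code: the class name GroupsFilter and the filter() helper are made up for illustration, getFactors is the same placeholder as in the Jython script, and the JDK's identity TransformerHandler stands in for XMLWriter so that the example runs without xml-writer.jar. With XMLWriter on the classpath you would wrap the filter in it instead, as the script does.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical Java counterpart of MyFilter in groups.py.
class GroupsFilter extends XMLFilterImpl {

    // Placeholder, mirroring get_factors in the Jython script.
    static String getFactors(String n) {
        return "factors for " + n;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs)
            throws SAXException {
        if ("item".equals(localName)) {
            String number = attrs.getValue("", "number");
            if (number != null) {
                // Copy the incoming attributes and append the computed one.
                AttributesImpl newAttrs = new AttributesImpl(attrs);
                newAttrs.addAttribute("", "factors", "factors", "CDATA", getFactors(number));
                attrs = newAttrs;
            }
        }
        // Forward the (possibly modified) event downstream.
        super.startElement(uri, localName, qName, attrs);
    }

    // Parses the XML text through the filter and returns the serialized result.
    static String filter(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // Assumption: an identity TransformerHandler replaces XMLWriter as the serializer.
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = tf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        GroupsFilter filter = new GroupsFilter();
        filter.setParent(reader);
        filter.setContentHandler(serializer);
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<groups><group name=\"beatles\">"
                + "<item name=\"paul\" number=\"64\"/></group></groups>";
        System.out.println(filter(xml));
    }
}
```

Everything except the item start tags passes through the filter untouched, so the input is processed as a stream of events rather than loaded as a tree.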

Vinay Sajip 2009-09-08 17:08:20

The parsing of what I want to do isn't hard. But I need to emit output that is essentially the same as my input, with intentional differences only.

Jason S 2009-09-08 17:13:09

@Jason S that sounds an awful lot like XSLT's domain

Jweede 2009-09-08 17:25:07

@Vinay: Now I remember why I hate SAX: I can never get the damn thing working. XMLWriter does not seem to be a recognized class, and as simple as XMLFilter sounds I can't seem to figure out how to create something that outputs XML.

Jason S 2009-09-08 18:29:39

Answer 2

+2 A:

XSLT seems like an appropriate model for what you are doing. Look into using XSLT with procedural extensions.

If you really can't keep the whole document in memory, Saxon is your only XSLT choice. It's likely that whatever calculations you need to do can be done in XSLT -- but if not, it's not too hard to write your own extension functions.
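
To make the XSLT route concrete, here is a minimal sketch of an identity transform with a single override, run from Java. The class name, the stylesheet and the 'factors for N' placeholder value are illustrative assumptions, not something from the question. Note that it uses the XSLT 1.0 processor bundled with the JDK, which builds the whole document in memory, so it only shows the shape of the stylesheet and not the streaming behaviour you would go to Saxon for.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: copy the input unchanged except for <item number="...">.
class XsltIdentityDemo {

    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      // Identity template: copies every node and attribute as-is.
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // Override: items that carry a number attribute also get a factors attribute.
      + "<xsl:template match='item[@number]'>"
      + "<xsl:copy>"
      + "<xsl:apply-templates select='@*'/>"
      + "<xsl:attribute name='factors'>"
      + "<xsl:value-of select=\"concat('factors for ', @number)\"/>"
      + "</xsl:attribute>"
      + "<xsl:apply-templates select='node()'/>"
      + "</xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML text and returns the result as a string.
    static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<groups><group name=\"rolling stones\">"
                + "<item name=\"mick\" number=\"19\"/></group></groups>";
        System.out.println(transform(xml));
    }
}
```

A real factorization would replace the concat() call, either in pure XSLT or through an extension function as suggested above.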

Steven Huwig 2009-09-08 17:23:09

Answer 3

+1 A:

StAX should work well for you. Piping input to output is super easy; you just write the XMLEvent you get from the XMLEventReader to the XMLEventWriter.

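// Needs javax.xml.stream.*, javax.xml.stream.events.*, javax.xml.namespace.QName, java.util.*
// Assumes reader and writer come from XMLInputFactory / XMLOutputFactory,
// and that factorize(int) returns the attribute value as a String.
XMLEventFactory EVT_FACTORY = XMLEventFactory.newInstance();
XMLEventReader reader;
XMLEventWriter writer;

QName numberQName = new QName("number");
QName factorsQName = new QName("factors");
while (reader.hasNext()) {
  XMLEvent e = reader.nextEvent();
  // Attributes are never delivered as events of their own; read them off the StartElement.
  if (e.isStartElement()) {
    StartElement start = e.asStartElement();
    Attribute number = start.getAttributeByName(numberQName);
    if (number != null) {
      String factors = factorize(Integer.parseInt(number.getValue()));
      // Rebuild the start tag with the original attributes plus the new one.
      List<Attribute> attrs = new ArrayList<Attribute>();
      for (Iterator<?> it = start.getAttributes(); it.hasNext(); ) {
        attrs.add((Attribute) it.next());
      }
      attrs.add(EVT_FACTORY.createAttribute(factorsQName, factors));
      e = EVT_FACTORY.createStartElement(start.getName(), attrs.iterator(), start.getNamespaces());
    }
  }
  // pass through
  writer.add(e);
}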

ykaganovich 2009-09-08 17:24:10

asAttribute doesn't seem to exist, but this gives me some food for thought. thanks.

Jason S 2009-09-08 18:00:01

...and it looks like XMLEvent returned from nextEvent() won't ever be an attribute directly, but rather a start document / start element / characters / end doc/element.

Jason S 2009-09-08 18:05:41

Unfortunately XMLEventReader / Writer doesn't preserve formatting quirks. drat.

Jason S 2009-09-08 18:08:03

I doubt that there's any XML library that will preserve "formatting quirks", if you're referring to things that the XML spec deems insignificant.

ykaganovich 2009-09-08 18:29:32
