I want to take an XML file as input and output the same XML except for some search/replace actions for attributes and text, based on matching certain node characteristics.

What's the best general approach for this, and are there tutorials somewhere?

DOM is out since I can't guarantee being able to keep the whole thing in memory.

I don't mind using SAX or StAX, except that I want the default behavior to be a pass-through no-op filter; I did something similar with StAX once and it was a pain, didn't work with namespaces, and I was never sure if I had included all the cases I needed to handle.

I think XSLT won't work (but am not sure), because it's declarative and I need to do some procedural calculations when figuring out what text/attributes to emit on the output.

(contrived example:

Suppose I was looking for all nodes with XPath of /group/item/@number and wanted to evaluate the number attribute as an integer, factor it using a method public List<Integer> factorize(int i), convert the list of factors to a space-delimited string, and add an attribute factors to the corresponding /group/item node?

input:

<group name="beatles"><item name="paul" number="64"></group>
<group name="rolling stones"><item name="mick" number="19"></group>
<group name="who"><item name="roger" number="515"></group>

expected output:

<group name="beatles"><item name="paul" number="64" factors="2 2 2 2 2 2"></group>
<group name="rolling stones"><item name="mick" number="19" factors="19"></group>
<group name="who"><item name="roger" number="515" factors="103 5"></group>

)
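To make the contrived example concrete, here is a minimal sketch of the factorize step and the space-delimited formatting. Only the factorize signature comes from the description above; the class name Factors and the helper toAttributeValue are made up for illustration, and plain trial division is just one assumed way to do it. It emits factors in ascending order, so 515 comes out as "5 103" rather than the "103 5" shown in the expected output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class; only the factorize signature is from the question.
class Factors {
    // Prime factors of n by trial division, smallest first.
    // Returns an empty list for n <= 1.
    public static List<Integer> factorize(int n) {
        List<Integer> factors = new ArrayList<>();
        for (int p = 2; (long) p * p <= n; p++) {
            while (n % p == 0) {
                factors.add(p);
                n /= p;
            }
        }
        if (n > 1) {
            factors.add(n); // whatever is left over is itself prime
        }
        return factors;
    }

    // Joins the factors with single spaces, as wanted for the factors attribute.
    public static String toAttributeValue(List<Integer> factors) {
        return factors.stream()
                      .map(String::valueOf)
                      .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        for (int n : new int[] {64, 19, 515}) {
            System.out.println(n + " -> " + toAttributeValue(factorize(n)));
        }
    }
}
```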

Update: I got the StAX XMLEventReader/Writer method working easily, but it doesn't preserve certain formatting quirks that are important in my application. (I guess the program that saves/loads XML doesn't honor valid XML files. >:( argh.) Is there a way to process XML that minimizes textual differences between input and output? (at least when it comes to character data.)

+1  A: 

I find Apache Digester a big help for rules-based parsing of XML.

Update: If it's filtering and output that you're concerned with, review this set of articles on developerWorks, which covers the same issues. Parts 2, 3 and 4 are particularly relevant. The summary: use SAX, XMLFilter and XMLWriter.

While I suppose this is technically a good fit for XSLT, I've always found it hard to debug for complex transformations. YMMV :-)

Further Update: XMLWriter is available from here. I don't know what your particular difficulty with SAX is. I created a file groups.xml containing:

<groups>
<group name="beatles"><item name="paul" number="64"/></group>
<group name="rolling stones"><item name="mick" number="19"/></group>
<group name="who"><item name="roger" number="515"/></group>
</groups>

Note that I had to make some changes to make it well-formed XML. Then, I knocked up this simple Jython script, groups.py, to illustrate how to solve your problem:

import java.io
import org.xml.sax.helpers
import sys

sys.path.append("xml-writer.jar")
import com.megginson.sax

def get_factors(n):
    return "factors for %s" % n

class MyFilter(org.xml.sax.helpers.XMLFilterImpl):
    def startElement(self, uri, localName, qName, attrs):
        if qName == "item":
            newAttrs = org.xml.sax.helpers.AttributesImpl(attrs)
            n = attrs.length
            for i in range(n):
                name = attrs.getLocalName(i)
                if name == "number":
                    newAttrs.addAttribute("", "factors", "factors",
                                          "CDATA",
                                          get_factors(attrs.getValue(i)))
            attrs = newAttrs
        #call superclass method...
        org.xml.sax.helpers.XMLFilterImpl.startElement(self, uri, localName,
                                                       qName, attrs)

source = org.xml.sax.InputSource(java.io.FileInputStream("groups.xml"))
reader = org.xml.sax.helpers.XMLReaderFactory.createXMLReader()
filter = MyFilter(reader)
writer = com.megginson.sax.XMLWriter(filter,
                                     java.io.FileWriter("output.xml"))
writer.parse(source)

Obviously, I've mocked up the factor finding function as your example was, I believe, purely illustrative. The script reads groups.xml, applies a filter, and outputs to output.xml. Let's run it:

$ jython groups.py
$ cat output.xml 
<?xml version="1.0" standalone="yes"?>

<groups>
<group name="beatles"><item name="paul" number="64" factors="factors for 64"></item></group>
<group name="rolling stones"><item name="mick" number="19" factors="factors for 19"></item></group>
<group name="who"><item name="roger" number="515" factors="factors for 515"></item></group>
</groups>

Job done? Of course, you'll need to transcribe this code to Java.
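A rough Java transcription of the script above might look like the following. Treat it as a sketch under a few assumptions: instead of XMLWriter (a separate download) it uses the JDK's identity Transformer as the serializer, the class name GroupsFilter and the transform helper are made up, and getFactors is the same mock placeholder as get_factors.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical class name. XMLFilterImpl forwards every SAX event unchanged,
// so only the callbacks overridden here alter the output.
class GroupsFilter extends XMLFilterImpl {
    GroupsFilter(XMLReader parent) {
        super(parent);
    }

    // Mock placeholder, mirroring get_factors in the Jython script.
    static String getFactors(String n) {
        return "factors for " + n;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) throws SAXException {
        if ("item".equals(localName)) {
            String number = attrs.getValue("", "number");
            if (number != null) {
                // Copy the existing attributes and append the new one.
                AttributesImpl newAttrs = new AttributesImpl(attrs);
                newAttrs.addAttribute("", "factors", "factors", "CDATA",
                                      getFactors(number));
                attrs = newAttrs;
            }
        }
        super.startElement(uri, localName, qName, attrs);
    }

    // Parses xml through the filter and serializes the filtered event stream.
    static String transform(String xml) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();
            GroupsFilter filter = new GroupsFilter(reader);

            // A Transformer created without a stylesheet copies input to output.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            identity.transform(
                new SAXSource(filter, new InputSource(new StringReader(xml))),
                new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XML filtering failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(
            "<groups><group name=\"who\">"
            + "<item name=\"roger\" number=\"515\"/></group></groups>"));
    }
}
```

For real files you would hand the InputSource a FileInputStream and the StreamResult a file or stream instead of strings; the filter itself stays the same.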

Vinay Sajip
The parsing of what I want to do isn't hard. But I need to emit output that is essentially the same as my input, with intentional differences only.
Jason S
@Jason S that sounds an awful lot like XSLT's domain
Jweede
@Vinay: Now I remember why I hate SAX: I can never get the damn thing working. XMLWriter does not seem to be a recognized class, and as simple as XMLFilter sounds I can't seem to figure out how to create something that outputs XML.
Jason S
+2  A: 

XSLT seems like an appropriate model for what you are doing. Look into using XSLT with procedural extensions.

If you really can't keep the whole document in memory, Saxon is your only XSLT choice. It's likely that whatever calculations you need to do can be done in XSLT -- but if not, it's not too hard to write your own extension functions.

Steven Huwig
+1  A: 

StAX should work well for you. Piping input to output is super easy; you just write the XMLEvent you get from the XMLEventReader to the XMLEventWriter.

XMLEventFactory EVT_FACTORY;
XMLEventReader reader;
XMLEventWriter writer;

QName numberQName = new QName("number");
QName factorsQName = new QName("factors");
while(reader.hasNext()) {
  XMLEvent e = reader.nextEvent();
  if(e.isAttribute() && ((Attribute)e).getName().equals(numberQName)) {
     String v = ((Attribute)e).getValue();
     String factors = factorize(Integer.parseInt(v));
     XMLEvent factorsAttr = EVT_FACTORY.createAttribute(factorsQName, factors);
     writer.add(factorsAttr);
  }
  // pass through
  writer.add(e);
}
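
For reference, here is a fuller sketch of that pass-through loop using only the JDK's javax.xml.stream API. It assumes attributes have to be read off the StartElement event rather than arriving as separate events from the reader, so it rebuilds the start element with one extra attribute. The class name StaxPassThrough, the rewrite and transform helpers, and the factorize stub are all made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class name; a sketch, not a definitive implementation.
class StaxPassThrough {
    private static final XMLEventFactory EVENTS = XMLEventFactory.newInstance();
    private static final QName NUMBER = new QName("number");
    private static final QName FACTORS = new QName("factors");

    // Stub standing in for the real factorization.
    static String factorize(int n) {
        return "factors for " + n;
    }

    // Returns the event unchanged unless it is an <item> with a number attribute,
    // in which case a copy with an added factors attribute is returned.
    static XMLEvent rewrite(StartElement start) {
        Attribute number = start.getAttributeByName(NUMBER);
        if (number == null || !"item".equals(start.getName().getLocalPart())) {
            return start;
        }
        List<Attribute> attrs = new ArrayList<>();
        Iterator<?> it = start.getAttributes();
        while (it.hasNext()) {
            attrs.add((Attribute) it.next());
        }
        attrs.add(EVENTS.createAttribute(
            FACTORS, factorize(Integer.parseInt(number.getValue()))));
        return EVENTS.createStartElement(
            start.getName(), attrs.iterator(), start.getNamespaces());
    }

    static String transform(String xml) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newInstance()
                .createXMLEventWriter(out);
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                // Default behavior is pass-through; only start elements are inspected.
                writer.add(e.isStartElement() ? rewrite(e.asStartElement()) : e);
            }
            writer.close();
            reader.close();
            return out.toString();
        } catch (XMLStreamException ex) {
            throw new IllegalStateException("XML rewriting failed", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(
            "<groups><group name=\"beatles\">"
            + "<item name=\"paul\" number=\"64\"/></group></groups>"));
    }
}
```

Everything that is not a matching start element goes to the writer untouched. Attribute order and empty-element syntax (<item/> versus <item></item>) may still differ from the input.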
ykaganovich
asAttribute doesn't seem to exist, but this gives me some food for thought. thanks.
Jason S
...and it looks like XMLEvent returned from nextEvent() won't ever be an attribute directly, but rather a start document / start element / characters / end doc/element.
Jason S
Unfortunately XMLEventReader / Writer doesn't preserve formatting quirks. drat.
Jason S
I doubt that there's any XML library that will preserve "formatting quirks", if you're referring to things that the XML spec deems insignificant.
ykaganovich