I would do it all in Ruby, writing a module that can do two tasks:
1) perform a SAX parse in Ruby of various XML input formats, outputting an intermediate format XML document and a validation error/key violation error list
2) create a DOM tree from an intermediate format XML file, modify it in place, enhance it with imported data, and output the modified DOM tree to a standard format
The first step using SAX allows redundant data to be stripped from the file (and not loaded into a DOM model!) and the non-redundant, wanted data groups to be placed in uniformly named tags quickly. For maximum speed the data groups should not be sorted in any way before going out to intermediate format XML, and intermediate format XML should use short tag names.
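To make the first phase concrete, here is a minimal Ruby sketch using the stdlib REXML stream parser (Nokogiri's SAX parser would work the same way). The tag names, the `TAG_MAP` table and the intermediate format (`imdoc`, `fn`, `ln`) are placeholders, not a real schema:

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Minimal phase-one pass: stream the source, keep only wanted fields,
# emit them under short intermediate-format tag names. Everything else
# is treated as redundant and never loaded into a tree.
class Phase1Listener
  include REXML::StreamListener

  # Maps source tag names to short intermediate tags; anything absent
  # from this table is dropped.
  TAG_MAP = { 'firstName' => 'fn', 'lastName' => 'ln' }.freeze

  attr_reader :output

  def initialize
    @output = +'<imdoc>'
    @current = nil
    @buffer = +''
  end

  def tag_start(name, _attrs)
    @current = TAG_MAP[name]
    @buffer = +'' if @current
  end

  def text(data)
    @buffer << data if @current
  end

  def tag_end(name)
    if (short = TAG_MAP[name])
      @output << "<#{short}>#{@buffer.strip}</#{short}>"
    end
    @current = nil
  end

  def finish
    @output << '</imdoc>'
  end
end

listener = Phase1Listener.new
xml = '<document><clients><firstName> Ada </firstName>' \
      '<lastName>Lovelace</lastName><junk>drop me</junk></clients></document>'
REXML::Parsers::StreamParser.new(xml, listener).parse
listener.finish
# listener.output => "<imdoc><fn>Ada</fn><ln>Lovelace</ln></imdoc>"
```

Note that `<junk>` never makes it into memory as a node; the listener only ever holds the current text buffer.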
The second step using DOM allows the tags for intermediate format XML where no validation errors were found to be sorted and processed quickly.
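A corresponding phase-two sketch, again with stdlib REXML standing in for whatever DOM library you prefer, and with the short tag names assumed above. It loads the intermediate document, sorts the client groups in place, and writes the result back out:

```ruby
require 'rexml/document'

# Phase two: load the (already validated) intermediate document into a
# DOM, reorder the <cl> groups by last name, and serialise the result.
doc = REXML::Document.new('<imdoc><cl><ln>Zed</ln></cl><cl><ln>Ada</ln></cl></imdoc>')
root = doc.root

# Detach the client groups, then re-attach them in sorted order.
clients = root.elements.to_a('cl')
clients.each { |cl| root.delete_element(cl) }
clients.sort_by { |cl| cl.elements['ln'].text }
       .each { |cl| root.add_element(cl) }

out = +''
doc.write(out)
# out now has the <cl> groups in last-name order
```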
By validation error here I mean a range of things, such as missing fields, invalid key formats, numbers out of range, etc. It would also detect objects referenced by keys that are missing from the file; for this it builds up two hashes, one of referenced keys and one of present keys, and checks the referenced keys against the present keys as one of the last steps before completion. Although you could do some checking with an XSD or DTD, Ruby allows you more flexibility, and many validation issues in practice are "softer" errors for which some limited correction can be made.
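The two-hash key check can be sketched like this (the key values and line numbers are illustrative; in the real handler the two lambdas would be called from `tag_start`/`startElement`):

```ruby
# Collect keys that objects declare and keys that other objects
# reference, then diff them once at the end of the parse.
present_keys = {}
referenced_keys = {}

record_present    = ->(key, line) { present_keys[key] = line }
record_referenced = ->(key, line) { (referenced_keys[key] ||= []) << line }

record_present.call('C001', 10)
record_referenced.call('C001', 42)
record_referenced.call('C999', 57)   # dangling reference

# One of the last steps before completion:
errors = referenced_keys
         .reject { |key, _| present_keys.key?(key) }
         .flat_map { |key, lines| lines.map { |ln| "line #{ln}: unknown key #{key}" } }
# errors => ["line 57: unknown key C999"]
```

Because both sides are hashes, the final check is linear in the number of referenced keys rather than quadratic.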
The module should limit how many of each task are done in parallel to avoid the system running out of CPU or RAM.
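One straightforward way to cap that in Ruby is a fixed pool of worker threads pulling jobs from a queue, so at most `MAX_WORKERS` parses run at once. The worker count, file names and job payloads below are placeholders:

```ruby
require 'thread'

MAX_WORKERS = 4
jobs = Queue.new
results = Queue.new

workers = MAX_WORKERS.times.map do
  Thread.new do
    # Each worker blocks on the queue; a nil job means "shut down".
    while (file = jobs.pop)
      # parse_one(file) would run the SAX or DOM phase here
      results << "done: #{file}"
    end
  end
end

%w[a.xml b.xml c.xml].each { |f| jobs << f }
MAX_WORKERS.times { jobs << nil }   # poison pills to stop the workers
workers.each(&:join)
# results now holds one entry per input file
```

For memory rather than CPU, the same idea works with a `SizedQueue` feeding the workers, so a fast producer can't queue up an unbounded number of documents.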
The essence of my recommendation is to do it all in Ruby but to separate the work in two phases - first phase, those tasks which can be done quickly with SAX and second phase, those tasks that can be done quickly with DOM.
EDIT
> How do we do structural transformations with SAX?
Well, you can't do any kind of reordering conveniently or else you're no longer really getting the memory-use benefits of parsing XML serially, but here's an illustration of the kind of approach I mean for stage one, using Java (sorry, not Ruby, but it should be fairly easy to translate - think of this as pseudocode!):
import java.util.ArrayList;
import org.xml.sax.SAXException;

class MySAXHandler extends org.xml.sax.helpers.DefaultHandler {
final static int MAX_DEPTH=512;
final static int FILETYPE_A=1;
final static int FILETYPE_B=2;
String[] qualifiedNames = new String[MAX_DEPTH];
String[] localNames = new String[MAX_DEPTH];
String[] namespaceURIs = new String[MAX_DEPTH];
int[] meanings = new int[MAX_DEPTH];
int pathPos=0;
public java.io.PrintWriter destination; // PrintWriter: write() throws no checked IOException
ArrayList errorList=new ArrayList();
org.xml.sax.Locator locator;
public int inputFileSchemaType;
String currentFirstName=null;
String currentLastName=null;
public void setDocumentLocator(org.xml.sax.Locator l) { this.locator=l; }
public void startElement(String uri, String localName, String qName,
org.xml.sax.Attributes atts) throws SAXException {
// record current tag in stack
qualifiedNames[pathPos] = qName;
localNames[pathPos] = localName;
namespaceURIs[pathPos] = uri;
// what is the meaning of the current tag, given its parent's meaning (pm)?
int meaning=0;
int pm = pathPos==0 ? 0 : meanings[pathPos-1];
switch (inputFileSchemaType) {
case FILETYPE_A:
switch(pathPos) {
// this checking can be as strict or as lenient as you like on case,
// namespace URIs and tag prefixes
case 0:
if(localName.equals("document")&&uri.equals("http://xyz")) meaning=1;
break; case 1: if (pm==1&&localName.equals("clients")) meaning=2;
break; case 2: if (pm==2&&localName.equals("firstName")) meaning=3;
else if (pm==2&&localName.equals("lastName")) meaning=4;
else if (pm==2) meaning=5;
}
break; case FILETYPE_B:
switch(pathPos) {
// this checking can be as strict or as lenient as you like on case,
// namespace URIs and tag prefixes
case 0:
if(localName.equals("DOC")&&uri.equals("http://abc")) meaning=1;
break; case 1: if (pm==1&&localName.equals("CLS")) meaning=2;
break; case 2: if (pm==2&&localName.equals("FN1")) meaning=3;
else if (pm==2&&localName.equals("LN1")) meaning=4;
else if (pm==2) meaning=5;
}
}
meanings[pathPos]=meaning;
// does the tag have unrecognised attributes?
// does the tag have all required attributes?
// record any keys in hashtables...
// (TO BE DONE)
// generate output
switch (meaning) {
case 0:errorList.add(new Object[]{locator.getPublicId(),
locator.getSystemId(),
locator.getLineNumber(),locator.getColumnNumber(),
"Meaningless tag found: "+localName+" ("+qName+
"; namespace: \""+uri+"\")"});
break;case 1:
destination.write("<?xml version=\"1.0\" ?>\n");
destination.write("<imdoc xmlns=\"http://someurl\" lang=\"xyz\">\n");
destination.write("<!-- Copyright notice -->\n");
destination.write("<!-- Generated by xyz -->\n");
break;case 2: destination.write(" <cl>\n");
currentFirstName="";currentLastName="";
}
pathPos++;
}
public void characters(char[] ch, int start, int length)
throws SAXException {
int meaning=meanings[pathPos-1]; switch (meaning) {
case 1: case 2:
errorList.add(new Object[]{locator.getPublicId(),
locator.getSystemId(),
locator.getLineNumber(),locator.getColumnNumber(),
"Unexpected extra characters found"});
break; case 3:
// APPEND to currentFirstName IF WITHIN SIZE LIMITS
break; case 4:
// APPEND to currentLastName IF WITHIN SIZE LIMITS
break; default: // ignore other characters
}
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
pathPos--;
int meaning=meanings[pathPos]; switch (meaning) { case 1:
destination.write("</imdoc>");
break; case 2:
destination.write(" <ln>"+currentLastName.trim()+"</ln>\n");
destination.write(" <fn>"+currentFirstName.trim()+"</fn>\n");
destination.write(" </cl>\n");
break; case 3:
if (currentFirstName==null||currentFirstName.equals(""))
errorList.add(new Object[]{locator.getPublicId(),
locator.getSystemId(),
locator.getLineNumber(),locator.getColumnNumber(),
"Invalid first name length"});
// ADD FIELD FORMAT VALIDATION USING REGEXES / RANGE CHECKING
break; case 4:
if (currentLastName==null||currentLastName.equals(""))
errorList.add(new Object[]{locator.getPublicId(),
locator.getSystemId(),
locator.getLineNumber(),locator.getColumnNumber(),
"Invalid last name length"});
// ADD FIELD FORMAT VALIDATION USING REGEXES / RANGE CHECKING
}
}
public void endDocument() {
// check for key violations
}
}
The stage-one code is not for reordering the data, just for standardising it to a single intermediate format (which may admittedly vary in the order of data groups depending on the source file type, as the data-group order will mirror that of the source file) and validating it.
But writing a SAX handler is only worth doing if you're not already happy with your XSLT. Presumably you're not if you're writing this question...?
OTOH if you like your XSLT and it's running fast enough, I say why change the architecture? In that case, you might find { this } article helpful, if you're not already wrapping the relevant Xalan calls in a Ruby module. You might want to try to make it a one-step process for the users (for cases where no data errors are found!).
EDIT
With this approach, you'll have to escape your XML on output manually, so:
& becomes &amp;
> becomes &gt;
< becomes &lt;
Non-ASCII becomes a character entity if necessary, otherwise a UTF-8 sequence
etc.
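A minimal Ruby escaper for text nodes, assuming UTF-8 output (so non-ASCII can pass through as-is; swap in character entities if your consumers need pure ASCII). Note the ampersand must be replaced first, or already-escaped text gets double-escaped:

```ruby
# Escape the three characters that are always unsafe in XML text nodes.
# '&' must go first so we don't re-escape the entities we just produced.
def escape_xml_text(s)
  s.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
end

escape_xml_text('Fish & Chips <Ltd>')
# => "Fish &amp; Chips &lt;Ltd&gt;"
```

Attribute values additionally need the quote character escaped (`"` becomes `&quot;`) if you quote them with double quotes.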
It's also worth writing a function that takes a SAX Attributes object and a flexible validation spec (chosen according to the input tag's meaning and file format) as an object array or similar, matches and returns the attribute values, and flags errors, strictly or leniently as required.
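One hypothetical shape for that helper, with the spec as a hash of attribute name to rules (the rule keys, attribute names and regexes below are all made up for illustration):

```ruby
# Validate a tag's attributes against a spec of the form
#   { 'name' => { required: true/false, pattern: /regex/ } }
# Returns [matched values, error strings]; strict mode also flags
# attributes the spec doesn't mention.
def validate_attributes(attrs, spec, strict: false)
  values, errors = {}, []
  spec.each do |name, rule|
    value = attrs[name]
    if value.nil?
      errors << "missing attribute #{name}" if rule[:required]
    elsif rule[:pattern] && value !~ rule[:pattern]
      errors << "bad format for #{name}: #{value.inspect}"
    else
      values[name] = value
    end
  end
  if strict
    (attrs.keys - spec.keys).each { |extra| errors << "unrecognised attribute #{extra}" }
  end
  [values, errors]
end

spec = { 'id'   => { required: true, pattern: /\A\d+\z/ },
         'lang' => { required: false } }
values, errors = validate_attributes({ 'id' => '42', 'x' => 'y' }, spec, strict: true)
# values => {"id"=>"42"}, errors => ["unrecognised attribute x"]
```

The same spec objects can then be shared across file formats wherever two formats happen to carry the same attribute shapes.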
And finally you should have a configurable MAX_ERRORS concept with a default of say 1000, record a "too many errors" error at this limit and stop recording errors after you reach the limit.
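That cap is a few lines in Ruby; a sketch (the default and messages are just the ones suggested above):

```ruby
# Error recorder that records up to a limit, appends one "too many
# errors" marker at the limit, then silently drops the rest.
class ErrorList
  attr_reader :errors

  def initialize(max_errors = 1000)
    @max_errors = max_errors
    @errors = []
    @overflowed = false
  end

  def add(message)
    if @errors.size < @max_errors
      @errors << message
    elsif !@overflowed
      @errors << "too many errors (limit #{@max_errors} reached)"
      @overflowed = true
    end
  end
end

list = ErrorList.new(2)
4.times { |i| list.add("error #{i}") }
# list.errors => ["error 0", "error 1", "too many errors (limit 2 reached)"]
```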
If you need to increase the number of XML files you can process in parallel and are still struggling with capacity/performance, I suggest that the DOM step only load, reorder and save - so it handles just one or two documents at a time, but relatively quickly, working in batches - and that a second SAX processor then make the Google calls and process the XML serially for N documents in parallel.
HTH
EDIT
> We have ~50 different incoming formats, so doing
> switch/case FORMAT_X is not good.
That is the conventional wisdom, sure, but what about the following:
// set meaning and attributesValidationRule (avr)
if (fileFormat>=GROUP10) switch (fileFormat) {
case GROUP10_FORMAT1:
switch(pathPos) {
case 0: if (...) { meaning=GROUP10_CUSTOMER; avr=AVR6_A; }
break; case 1: if (...) { meaning=...; avr=...; }
...
}
break; case GROUP10_FORMAT2: ...
break; case GROUP10_FORMAT3: ...
}
else if (fileFormat>=GROUP9) switch (fileFormat) {
case GROUP9_FORMAT1: ...
break; case GROUP9_FORMAT2: ...
}
...
else if (fileFormat>=GROUP1) switch (fileFormat) {
case GROUP1_FORMAT1: ...
break; case GROUP1_FORMAT2: ...
}
...
result = validateAttribute(atts,avr);
if (meaning >= MEANING_SET10) switch (meaning) {
case ...: ...
break; case ...: ...
}
else if (meaning >= MEANING_SET9) switch (meaning) {
}
etc
Could well be fast enough and much easier to read than lots of functions or classes.
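And since the recommendation is Ruby, the same grouped dispatch can equally be written as nested lookup tables, which keeps each format's rules as data rather than code. All format, meaning and AVR names here are placeholders:

```ruby
# format => path depth => local name => [meaning, attribute validation rule]
MEANING_TABLE = {
  format_a: {
    0 => { 'document' => [:customer_doc, :avr6_a] },
    1 => { 'clients'  => [:client_list, :avr2] },
  },
  format_b: {
    0 => { 'DOC' => [:customer_doc, :avr6_a] },
    1 => { 'CLS' => [:client_list, :avr2] },
  },
}.freeze

def lookup_meaning(file_format, path_pos, local_name)
  MEANING_TABLE.dig(file_format, path_pos, local_name) || [:unknown, nil]
end

lookup_meaning(:format_b, 1, 'CLS')   # => [:client_list, :avr2]
lookup_meaning(:format_a, 1, 'CLS')   # => [:unknown, nil]
```

With ~50 formats, adding one more is then a table entry rather than another case arm, and the two approaches dispatch in comparable time.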
> The part I am not happy about is that I cannot do structure
> and value transformations using some kind of homogeneous process
> (like with Java I can write extensions for Xalan).
Sounds like you've hit a limit of XSLT - or are you just talking about the obvious limit, that bringing in data from sources other than the source document is a pain?
Another idea is to have: a validating style sheet; a style sheet that outputs a list of keys to try against Google Maps; a style sheet that outputs a list of keys to try against your database; processes that actually make the Google/db calls and output more XML; an "XML concatenating" function; and a style sheet that combines the data, taking input like:
<?xml version="1.0" ?>
<myConsolidatedInputXmlDoc>
<myOriginalOrIntermediateFormatDoc>
...
</myOriginalOrIntermediateFormatDoc>
<myFetchedRelatedDataFromGoogleMaps>
...
</myFetchedRelatedDataFromGoogleMaps>
<myFetchedDataFromSQL>
...
</myFetchedDataFromSQL>
</myConsolidatedInputXmlDoc>
In this way you get to use XSLT in a "multi-pass" scenario without calling out to Xalan extensions.
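The "XML concatenating" function in that pipeline can be as naive as string interpolation, assuming the inputs are well-formed fragments without XML declarations. The element names mirror the example above:

```ruby
# Wrap the original document and the fetched fragments in one
# consolidated root for the final combining style sheet.
def consolidate(original, gmaps_data, sql_data)
  <<~XML
    <?xml version="1.0" ?>
    <myConsolidatedInputXmlDoc>
    <myOriginalOrIntermediateFormatDoc>#{original}</myOriginalOrIntermediateFormatDoc>
    <myFetchedRelatedDataFromGoogleMaps>#{gmaps_data}</myFetchedRelatedDataFromGoogleMaps>
    <myFetchedDataFromSQL>#{sql_data}</myFetchedDataFromSQL>
    </myConsolidatedInputXmlDoc>
  XML
end

doc = consolidate('<cl/>', '<geo/>', '<rows/>')
# doc contains all three fragments under one root
```

If the fragments might carry their own `<?xml ... ?>` declarations, strip those before interpolating, since a declaration is only legal at the very start of a document.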