We have fairly large (~200 MB) XML files from different sources that we want to transform into a common format.

For structural transformations (element names, nesting, etc.) we decided to use XSLT 1.0. Because it has to be fast (we receive a lot of these files), we chose Apache Xalan as the engine. The structural transformations can be quite complex (not just <tag a> -> <tag b>) and differ between sources.

However, we also need to transform the values of the elements. These transformations can be rather complex (e.g., some require access to the Google Maps API, others require access to our database), so we decided to use a simple Ruby-based DSL, which is a list of "xpath selector" => transformer entries, e.g.:

{"rss/channel/item" => {:class => 'ItemMutators', :method => :guess_location}}
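To make the idea concrete, here is a minimal sketch of how such a mapping might be applied to a value. The ItemMutators body, the TRANSFORMERS constant and the transform_value helper are all hypothetical stand-ins, not the real application code; the real mutator would presumably call out to an external service.

```ruby
# Hypothetical mutator module; the real ItemMutators lives in the main application.
module ItemMutators
  # Stand-in for a lookup that would normally call an external service.
  def self.guess_location(value)
    value.strip.upcase
  end
end

# Path => transformer table, in the same shape as the DSL entry above.
TRANSFORMERS = {
  "rss/channel/item" => { :class => 'ItemMutators', :method => :guess_location }
}

# Look up the transformer for a path and apply it; unknown paths pass through unchanged.
def transform_value(path, value)
  rule = TRANSFORMERS[path]
  return value unless rule
  Object.const_get(rule[:class]).public_send(rule[:method], value)
end

transform_value("rss/channel/item", " berlin ")  # => "BERLIN"
```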

However, keeping element transformations apart from value transformations seems rather like a hack. Are there any better solutions?


For example, with Java you can write extensions for Xalan and use them to transform the values. Is there something similar for Ruby?


Thank you, guys! All the responses were very valuable. I am currently thinking :)

+2  A: 

I would do it all in Ruby, writing a module that can do two tasks:

1) perform a SAX parse in Ruby of various XML input formats, outputting an intermediate format XML document and a validation error/key violation error list

2) create a DOM tree from an intermediate format XML file, modify it in place, enhance it with imported data, and output the modified DOM tree to a standard format

The first step using SAX allows redundant data to be stripped from the file (and not loaded into a DOM model!) and the non-redundant, wanted data groups to be placed in uniformly named tags quickly. For maximum speed the data groups should not be sorted in any way before going out to intermediate format XML, and intermediate format XML should use short tag names.

The second step using DOM allows the tags for intermediate format XML where no validation errors were found to be sorted and processed quickly.

By validation error here I mean a range of things such as missing fields, invalid key formats, out-of-range numbers, etc. It would also detect objects referenced by keys that are missing from the file; for this it builds up two hashes, one of referenced keys and one of present keys, and checks the referenced keys against the present keys as one of the last steps before completion. Although you could do some checking with an XSD or DTD, Ruby allows you more flexibility, and many validation issues in practice are "softer" errors for which some limited correction can be made.
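The key check described above could look roughly like this. It is only a sketch: the KeyChecker class and its method names are made up for illustration, and it uses a Set for the present keys and a Hash for the referenced keys (so the first referencing line can be reported).

```ruby
require 'set'

# Collects keys seen while parsing; all names here are illustrative only.
class KeyChecker
  def initialize
    @present = Set.new   # keys of objects actually defined in the file
    @referenced = {}     # referenced key => line number of the first reference
  end

  def define(key)
    @present << key
  end

  def reference(key, line)
    @referenced[key] ||= line
  end

  # Called once at end of document: every referenced key must also be present.
  def violations
    missing = @referenced.reject { |key, _line| @present.include?(key) }
    missing.map { |key, line| "line #{line}: key #{key.inspect} is referenced but never defined" }
  end
end

checker = KeyChecker.new
checker.define("c1")
checker.reference("c1", 10)
checker.reference("c2", 14)
checker.violations  # => ["line 14: key \"c2\" is referenced but never defined"]
```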

The module should limit how many of each task are done in parallel to avoid the system running out of CPU or RAM.
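One simple way to cap the parallelism is a fixed number of worker threads pulling from a queue. This is a sketch only; run_limited is a hypothetical helper, and since threads in standard Ruby mostly help while waiting on I/O, CPU-heavy parsing would more likely need the same idea applied to separate processes.

```ruby
# Run the given jobs (callables) with at most `limit` of them running at once.
def run_limited(jobs, limit)
  queue = Queue.new
  jobs.each { |job| queue << job }
  results = Queue.new

  workers = Array.new(limit) do
    Thread.new do
      loop do
        job = begin
          queue.pop(true)   # non-blocking pop; raises ThreadError once the queue is empty
        rescue ThreadError
          break             # no work left, so this worker stops
        end
        results << job.call
      end
    end
  end

  workers.each(&:join)
  Array.new(results.size) { results.pop }   # results arrive in completion order
end

jobs = (1..5).map { |i| -> { i * i } }
run_limited(jobs, 2).sort  # => [1, 4, 9, 16, 25]
```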

The essence of my recommendation is to do it all in Ruby but to separate the work into two phases: first, the tasks that can be done quickly with SAX, and second, the tasks that can be done quickly with DOM.

EDIT

> How do we do structural transformations with SAX?

Well, you can't conveniently do any kind of reordering, or else you're no longer really getting the memory-use benefits of parsing XML serially, but here's an illustration of the kind of approach I mean for stage one, using Java (sorry, not Ruby, but it should be fairly easily translatable; think of this as pseudocode!):

import java.util.ArrayList;
import org.xml.sax.SAXException;

// DefaultHandler supplies no-op versions of the ContentHandler callbacks not overridden here
class MySAXHandler extends org.xml.sax.helpers.DefaultHandler {
  final static int MAX_DEPTH=512;
  final static int FILETYPE_A=1;
  final static int FILETYPE_B=2;
  String[] qualifiedNames = new String[MAX_DEPTH];
  String[] localNames = new String[MAX_DEPTH];
  String[] namespaceURIs = new String[MAX_DEPTH];
  int[] meanings = new int[MAX_DEPTH]; // meaning code of each open tag, indexed by depth
  int pathPos=0;
  public java.io.Writer destination;
  ArrayList<Object[]> errorList=new ArrayList<Object[]>();
  org.xml.sax.Locator locator;
  public int inputFileSchemaType;

  String currentFirstName=null;
  String currentLastName=null;

  public void setDocumentLocator(org.xml.sax.Locator l) { this.locator=l; }

  public void startElement(String uri, String localName, String qName,
    org.xml.sax.Attributes atts) throws SAXException { 

    // record current tag in stack
    qualifiedNames[pathPos] = qName;
    localNames[pathPos] = localName;
    namespaceURIs[pathPos] = uri;

    // what is the meaning of the current tag?
    int meaning=0;                            // 0 = unrecognised tag
    int pm=pathPos==0?0:meanings[pathPos-1];  // meaning of the parent tag
    switch (inputFileSchemaType) {
           case FILETYPE_A:
      switch(pathPos) {
        // this checking can be as strict or as lenient as you like on case,
        // namespace URIs and tag prefixes
             case 0:
        if(localName.equals("document")&&uri.equals("http://xyz")) meaning=1;
      break; case 1: if (pm==1&&localName.equals("clients")) meaning=2;
      break; case 2: if (pm==2&&localName.equals("firstName")) meaning=3;
        else if (pm==2&&localName.equals("lastName")) meaning=4;
        else if (pm==2) meaning=5;
      }
      break; case FILETYPE_B:
      switch(pathPos) {
        // this checking can be as strict or as lenient as you like on case,
        // namespace URIs and tag prefixes
             case 0:
        if(localName.equals("DOC")&&uri.equals("http://abc")) meaning=1;
      break; case 1: if (pm==1&&localName.equals("CLS")) meaning=2;
      break; case 2: if (pm==2&&localName.equals("FN1")) meaning=3;
        else if (pm==2&&localName.equals("LN1")) meaning=4;
        else if (pm==2) meaning=5;
      }
    }

    meanings[pathPos]=meaning;

    // does the tag have unrecognised attributes?
    // does the tag have all required attributes?
    // record any keys in hashtables...
    // (TO BE DONE)

    // generate output
    try {
      switch (meaning) {
        case 0:errorList.add(new Object[]{locator.getPublicId(),
          locator.getSystemId(),
          locator.getLineNumber(),locator.getColumnNumber(),
          "Meaningless tag found: "+localName+" ("+qName+
          "; namespace: \""+uri+"\")"});
        break;case 1:
        destination.write("<?xml version=\"1.0\" ?>\n");
        destination.write("<imdoc xmlns=\"http://someurl\" lang=\"xyz\">\n");
        destination.write("<!-- Copyright notice -->\n");
        destination.write("<!-- Generated by xyz -->\n");
        break;case 2: destination.write(" <cl>\n");
          currentFirstName="";currentLastName="";
      }
    } catch (java.io.IOException e) {
      throw new SAXException(e); // report Writer failures as SAX errors
    }
    pathPos++;
  }
  public void characters(char[] ch, int start, int length)
            throws SAXException {
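    int meaning=meanings[pathPos-1]; switch (meaning) {
    case 1: case 2:
      // whitespace between child tags is normal; flag only real text
      if (!new String(ch,start,length).trim().isEmpty())
              errorList.add(new Object[]{locator.getPublicId(),
        locator.getSystemId(),
        locator.getLineNumber(),locator.getColumnNumber(),
        "Unexpected extra characters found"});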
    break; case 3:
      // APPEND to currentFirstName IF WITHIN SIZE LIMITS
    break; case 4:
      // APPEND to currentLastName IF WITHIN SIZE LIMITS
    break; default: // ignore other characters
    }
  }
  public void endElement(String uri, String localName, String qName)
    throws SAXException {
    pathPos--;
    int meaning=meanings[pathPos];
    try {
      switch (meaning) { case 1:
        destination.write("</imdoc>");
      break; case 2:
        destination.write("  <ln>"+currentLastName.trim()+"</ln>\n");
        destination.write("  <fn>"+currentFirstName.trim()+"</fn>\n");
        destination.write(" </cl>\n");
      break; case 3:
        if (currentFirstName==null||currentFirstName.equals(""))
                errorList.add(new Object[]{locator.getPublicId(),
          locator.getSystemId(),
          locator.getLineNumber(),locator.getColumnNumber(),
          "Invalid first name length"});
        // ADD FIELD FORMAT VALIDATION USING REGEXES / RANGE CHECKING
      break; case 4:
        if (currentLastName==null||currentLastName.equals(""))
                errorList.add(new Object[]{locator.getPublicId(),
          locator.getSystemId(),
          locator.getLineNumber(),locator.getColumnNumber(),
          "Invalid last name length"});
        // ADD FIELD FORMAT VALIDATION USING REGEXES / RANGE CHECKING
      }
    } catch (java.io.IOException e) {
      throw new SAXException(e); // report Writer failures as SAX errors
    }
  }
  public void endDocument() {
    // check for key violations
  }
}

The stage one code is not for reordering the data, just standardising to a single intermediate format (which may admittedly vary in the order of data groups depending on the source file type as the data groups order will mirror that of the source file) and validating it.

But writing a SAX handler is only worth doing if you're not already happy with your XSLT. Presumably you're not if you're writing this question...?

OTOH if you like your XSLT and it's running fast enough, I say why change the architecture? In that case, you might find { this } article helpful, if you're not already wrapping the relevant Xalan calls in a Ruby module. You might want to try to make it a one-step process for the users (for cases where no data errors are found!).

EDIT

With this approach, you'll have to escape your XML on output manually so:

& becomes &amp;

> becomes &gt;

< becomes &lt;

Non-ASCII becomes a character entity if necessary, otherwise a UTF-8 sequence

etc
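In Ruby the escaping rules above might be sketched like this. The xml_escape helper and its ascii_only flag are hypothetical names; the sketch also escapes double quotes so the result can be used inside a double-quoted attribute value, which goes slightly beyond the list above.

```ruby
# Replacement table for the characters that must be escaped in XML text.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escape text for use as element content or a double-quoted attribute value.
# With ascii_only: true, non-ASCII characters become numeric character references;
# otherwise they are left as they are (UTF-8 if the string is UTF-8).
def xml_escape(text, ascii_only: false)
  escaped = text.gsub(/[&<>"]/, XML_ESCAPES)
  return escaped unless ascii_only
  escaped.gsub(/[^\x00-\x7F]/) { |ch| format("&#%d;", ch.ord) }
end

xml_escape("a < b & c")  # => "a &lt; b &amp; c"
```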

It is also worth writing a function that takes a SAX Attributes object and a flexible validation spec (an object array or similar, chosen according to the input tag's meaning and the file format), matches and returns the values, and flags errors strictly or leniently as required.
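A rough Ruby sketch of such an attribute validator follows. Everything here is an assumption made for illustration: the attributes are a plain Hash standing in for the SAX Attributes object, the spec format (:required, :format) is invented, and "strict" is taken to mean that unknown attributes count as errors.

```ruby
# spec maps attribute name => { :required => true/false, :format => Regexp (optional) }.
# Returns [values, errors]; in strict mode unrecognised attributes are reported too.
def validate_attributes(attrs, spec, strict: true)
  values = {}
  errors = []

  spec.each do |name, rule|
    if attrs.key?(name)
      value = attrs[name]
      if rule[:format] && value !~ rule[:format]
        errors << "attribute #{name} has invalid value #{value.inspect}"
      else
        values[name] = value
      end
    elsif rule[:required]
      errors << "missing required attribute #{name}"
    end
  end

  if strict
    (attrs.keys - spec.keys).each { |name| errors << "unrecognised attribute #{name}" }
  end

  [values, errors]
end

spec = { "id" => { :required => true, :format => /\A\d+\z/ }, "lang" => { :required => false } }
validate_attributes({ "id" => "42", "colour" => "red" }, spec)
# => [{"id"=>"42"}, ["unrecognised attribute colour"]]
```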

And finally, you should have a configurable MAX_ERRORS limit with a default of, say, 1000: record a "too many errors" error when the limit is reached and stop recording errors after that.
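The error cap could be as small as this. It is a sketch; the ErrorList class name and the exact wording of the marker message are assumptions.

```ruby
# Error collector that stops growing once a configurable limit is reached.
class ErrorList
  attr_reader :errors

  def initialize(max_errors = 1000)
    @max_errors = max_errors
    @errors = []
  end

  # Record errors up to the limit, then add one "too many errors" marker and go silent.
  def add(message)
    if @errors.size < @max_errors
      @errors << message
    elsif @errors.size == @max_errors
      @errors << "too many errors"
    end
  end
end

list = ErrorList.new(2)
5.times { |i| list.add("error #{i}") }
list.errors  # => ["error 0", "error 1", "too many errors"]
```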

If you need to increase the number of XML files you can process in parallel and are still struggling with capacity or performance, I suggest that the DOM step only loads, reorders and saves. It can then handle only one or two documents at a time, but relatively quickly, so run it in batches. A second SAX processor then makes the Google calls and processes the XML serially, for N documents in parallel.

HTH

EDIT

> We have ~50 different incoming formats, so doing

> switch/case FORMAT_X is not good.

That is the conventional wisdom sure, but what about the following:

// set meaning and attributesValidationRule (avr)
if (fileFormat>=GROUP10) switch (fileFormat) {
  case GROUP10_FORMAT1: 

    switch(pathPos) {
    case 0: if (...) { meaning=GROUP10_CUSTOMER; avr=AVR6_A; }
    break; case 1: if (...) { meaning=...; avr=...; }
    ...
    }

  break; case GROUP10_FORMAT2: ...

  break; case GROUP10_FORMAT3: ...
}
else if (fileFormat>=GROUP9) switch (fileFormat) {
  case GROUP9_FORMAT1: ... 
  break; case GROUP9_FORMAT2: ...
}
...
else if (fileFormat>=GROUP1) switch (fileFormat) {
  case GROUP1_FORMAT1: ... 
  break; case GROUP1_FORMAT2: ...
}

...

result = validateAttribute(atts,avr);

if (meaning >= MEANING_SET10) switch (meaning) {
case ...:  ...
break; case ...:  ...
}
else if (meaning >= MEANING_SET9) switch (meaning) {
}
etc

Could well be fast enough and much easier to read than lots of functions or classes.

> The part I am not happy about is that I cannot do structure

> and value transformations using some kind of homogeneous process

> (like with Java I can write extensions for Xalan).

Sounds like you've hit a limit of XSLT or are you just talking about the obvious limit that bringing in data from sources other than the source document is a pain?

Another idea is to have a validating style sheet, a style sheet that outputs a list of keys to try on Google Maps, a style sheet that outputs a list of keys to try on your database, processes that actually make the Google/db calls and output more XML, an "XML concatenating" function, and a style sheet that combines the data, taking input like:

<?xml version="1.0" ?>
<myConsolidatedInputXmlDoc>
  <myOriginalOrIntermediateFormatDoc>
    ...
  </myOriginalOrIntermediateFormatDoc>
  <myFetchedRelatedDataFromGoogleMaps>
    ...
  </myFetchedRelatedDataFromGoogleMaps>
  <myFetchedDataFromSQL>
    ...
  </myFetchedDataFromSQL>
</myConsolidatedInputXmlDoc>

In this way you get to use XSLT in a "multi-pass" scenario without calling out to Xalan extensions.
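The "XML concatenating" function can be quite small. Here is a sketch in Ruby: concatenate_xml is a hypothetical helper, and it only strips each document's leading XML declaration, so inputs with a DOCTYPE or with differing encodings would need more care.

```ruby
# Wrap several XML documents in one root element so a single style sheet can read them all.
# sections maps wrapper element name => XML document (as a String).
def concatenate_xml(root_name, sections)
  body = sections.map do |wrapper_name, xml|
    inner = xml.sub(/\A\s*<\?xml.*?\?>\s*/m, "")   # drop the document's own XML declaration
    "  <#{wrapper_name}>\n#{inner.strip}\n  </#{wrapper_name}>"
  end
  "<?xml version=\"1.0\" ?>\n<#{root_name}>\n#{body.join("\n")}\n</#{root_name}>\n"
end

puts concatenate_xml("myConsolidatedInputXmlDoc",
  "myOriginalOrIntermediateFormatDoc" => "<?xml version=\"1.0\"?>\n<imdoc><cl/></imdoc>",
  "myFetchedDataFromSQL" => "<rows/>")
```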

martinr
I'm all in favour of DSLs. I like the thinking. But basically this is a job for SAX/DOM. Worth putting in some DSLs too only if you're doing a lot of this kind of thing.
martinr
How do we do structural transformations with SAX? Writing a DSL for structural transformations seems quite complex. We have many different formats that sometimes need some complicated structure changes. XSLT was perfect for describing them but, unfortunately, it couldn't transform the values. Right now our workflow consists of 4 rake tasks: 1. Download the XMLs. 2. Transform the downloaded XMLs with XSLT. 3. Transform the result of 2 (only element values) with Ruby (Nokogiri/SAX, DSL like the one in the post). 4. Read the result of 3, validate each item, save it to the db.
glebm
+1 for the Xalan-Ruby link. Right now we are just calling the Xalan CLI with %x[...]. The problem with DOM is that it is not really doable at all: a 200 MB file is too much for DOM to handle (it gets incredibly slow). We have ~50 different incoming formats, so doing switch/case FORMAT_X is not good. Writing a DSL for that is not really doable, because the transformations can be quite complex, but XSLT does handle them. The part I am not happy about is that I cannot do structure and value transformations using some kind of homogeneous process (like with Java I can write extensions for Xalan).
glebm
Could put it in a database for sorting. If you need to avoid using DOM... Stage 1: SAX->(validation/keycheck/partial-standardize)->"INSERT INTO queries". Stage 2 "Various SELECT ORDER BY queries"->(add GoogleMaps/main database data/rest of standardisation)->Final SQL. Could use local database to hold intermediate data and snapshot of production database, or if you have the capacity and data stability, do all the SQL on the main database.
martinr
I kind of agree (with certain caveats) with this guy: http://www.codinghorror.com/blog/2005/07/martin-fowler-hates-xslt-too.html. Basically, IMHO XSLT is a very good **prototyping** tool, but usually other tools are better once the proof of concept is done. I hate to be a drag, but what else can I say? That is my experience.
martinr
+2  A: 

You should be able to use XSLT extensions. A web search reveals that Xalan supports Java for doing extensions: http://xml.apache.org/xalan-j/extensions.html

Quote from the linked page:

For those situations where you would like to augment the functionality of XSLT with calls to a procedural language, Xalan-Java supports the creation and use of extension elements and extension functions. Xalan-Java also provides a growing extensions library available for your use.

Also, apparently someone has written a Ruby package that can provide XSLT extensions: http://greg.rubyfr.net/pub/packages/ruby-xslt/classes/XML/XSLT.html

Moron
Heck, If you follow the links there, you can even use JRuby.
Kyle Butt
@Kyle: There is apparently even one package developed for Ruby!
Moron
Java and JRuby, unfortunately, are not an option, because the value transformer needs access to the models from the main application (a lot of them). I took a look at the Ruby package; it has not been updated since 2006 and is probably quite slow (our source XML files are sometimes about 200 MB, and we need to import a lot of them daily).
glebm
@Glex: The Ruby implementation seems to be a shared source, perhaps you can modify it to suit your needs. btw, where is the bottleneck really? I would expect the Google API + DB calls to be the bottleneck. In that case, having the value transform separate from the element transform could actually turn out useful: You could batch up those API/DB calls, do some caching etc, perhaps?
Moron
We do all that (caching, background Google API calls, etc). There are 2 performance issues: (1) model validations on save are really slow (i.e. much slower than anything else); (2) the size of the XML files forces us to use SAX, but that is not really a problem. Now that I've been thinking about it, maybe we did the right thing by separating them.
glebm
If you really need the validation and that is the bottleneck, perhaps you can try optimizing the validation step. For instance, do you require just the structure for validation? Then perhaps you can transform to an XML document that is stripped down to its bare bones and validate that. Then do the transform again with validation turned off...
Moron
+1  A: 

One approach is to use Xalan-J with some extensions that make RPC calls back to your Ruby process. The returned data can be further processed by XSLT.

For tighter integration, you could bind Xalan-C++ as a Ruby library. You probably only need a small part of the Xalan API, similar to that used in the command line driver XalanExe. With Xalan running in-process, your extensions can then directly access your Ruby model.

Lachlan Roche