I need a tool to execute XSLTs against very large XML files. To be clear, I don't need anything to design, edit, or debug the XSLTs, just execute them. The transforms that I am using are already well optimized, but the large files are causing the tool I have tried (Saxon v9.1) to run out of memory.

A: 

Take a look at Xselerator.

SaaS Developer
+1  A: 

I have found that a custom tool built to run the XSLT using earlier versions of MSXML is very fast, but it also consumes incredible amounts of memory and will not complete at all if the file is too large. You also lose some advanced XSLT functionality, since the earlier versions of MSXML don't support the full XPath feature set.

It is worth a try if your other options take too long.

hova
The dirty secret for a while was that the early .NET equivalents of MSXML 4.0 were really slow!
David Robbins
+1  A: 

That's an interesting question. XSLT could potentially be optimized for space, but I expect all but the most obscure implementations start by parsing the source document into a DOM, which is bound to use a low multiple of the document size in memory.

Unless the stylesheet is specially designed to support a single-pass transformation, reasonable time performance would probably require parsing the source document into a disk-based hierarchical database.

I do not have an answer, though.

ddaa
Yes, this describes the underlying problem well.
fatcat1111
+3  A: 

I found a good solution: Apache's Xalan-C++. It provides a pluggable memory manager, which lets me tune allocation based on the input and transform.

In several cases it consumes ~60% less memory (measured as private bytes) than the other tools I have tried.

fatcat1111
Still not good for "processing files that are larger than RAM", but if it's good enough for you, great.
ddaa
+2  A: 

It sounds like you're sorted, but another approach that often helps is to split the data first. Obviously this only works for some transformations (i.e. where different chunks of data can be treated in isolation from the whole), but it means you can use a simple streaming parser (rather than a DOM) to split the input into manageable pieces, then process each chunk separately and reassemble.

Since I'm a .NET bod, I'd use something like XmlReader to do the chunking without a DOM; I'm sure there are equivalents in every language.

Again - just for completeness.

[edit re question] I'm not aware of a specific name; maybe Divide and Conquer. For example, if your data is actually a flat list of similar objects, you could simply split on the first-level children: rather than having 2M rows, you split the file into 10 lots of 200K rows, or 100 lots of 20K rows. I've done this many times when working with bulk data (for example, uploading data in chunks [each valid on its own] and re-assembling at the server, so that each individual upload is small enough to be robust).
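To make that concrete, here is a minimal Java/StAX sketch of the splitting step (the .NET XmlReader approach is analogous). The file names and the per-record stylesheet are placeholders, it assumes each first-level child can be transformed in isolation, and it leans on the JAXP rule that a StAXSource positioned on a start tag feeds the transformer exactly that subtree:

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stax.StAXSource;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import java.io.FileInputStream;

    public class ChunkedTransform {
        public static void main(String[] args) throws Exception {
            // Stream the big document; nothing is loaded into a DOM.
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream("big-input.xml"));

            // Reuse one Transformer for every chunk.
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource("per-record.xslt"));

            reader.nextTag(); // advance to the root element
            int n = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // A StAXSource positioned on a START_ELEMENT hands the
                    // transformer exactly that subtree, then returns control.
                    t.transform(new StAXSource(reader),
                                new StreamResult("chunk-" + (n++) + ".xml"));
                }
            }
            reader.close();
        }
    }

Because each child subtree is transformed on its own, peak memory is bounded by the largest single record rather than the whole document; reassembling the chunk-*.xml outputs is left to taste.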

Marc Gravell
Thanks Marc, that's a great idea. Can you tell me more about it? Is there a name for this technique so that I can research it?
fatcat1111
A: 

Could you describe the structure of the XML file?

Dimitre Novatchev
A: 

Are you using the Java version of Saxon, or the .Net port? If you are running out of memory, you can assign more memory to the Java VM running Saxon (using the -Xmx command-line parameter to raise the maximum heap size; -Xms only sets the initial size).

I've also found that the .Net version of Saxon runs out of memory less readily than the Java version.
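For example, something along these lines should give Saxon a 2 GB heap (the jar name and file paths are placeholders, and the -s:/-xsl:/-o: option syntax assumes a Saxon 9.x command line):

    java -Xmx2g -cp saxon9.jar net.sf.saxon.Transform -s:big-input.xml -xsl:transform.xslt -o:output.xml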

James Sulak
+2  A: 

You may want to look into STX for streaming-based XSLT-like transformations. Alternatively, I believe StAX can integrate with XSLT nicely through the Transformer interface.
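As a rough sketch of the StAX-to-Transformer wiring (it needs Java 6 / JAXP 1.4 for javax.xml.transform.stax, and the file names are placeholders; note that most XSLT engines will still buffer a tree internally, so by itself this mainly avoids the up-front DOM rather than cutting peak memory):

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamReader;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stax.StAXSource;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import java.io.FileInputStream;

    public class StaxXslt {
        public static void main(String[] args) throws Exception {
            // Pull-parse the input instead of building a DOM up front.
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream("big-input.xml"));

            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource("transform.xslt"));

            // The transformer pulls events straight from the StAX reader.
            t.transform(new StAXSource(reader), new StreamResult("output.xml"));
        }
    }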

ykaganovich
Thanks for adding this - STX was entirely new to me.
fatcat1111
+2  A: 

For what it's worth, I suspect that for Java, Saxon is as good as it gets if you need to use XSLT. It is quite efficient (in both CPU and memory) for larger documents, but XSLT itself essentially forces a full in-memory tree of the contents to be created and retained, except in limited cases. Saxon-SA (the for-fee version) supposedly has extensions that take advantage of such "streaming" cases, so that might be worth checking out.

But the advice to split up the contents is the best one: if you are dealing with independent records, just split the input using other techniques (like, use StAX! :-))

StaxMan