ansaurus

Question

What is the relative processing speed of manipulating data with XML or OOP techniques? (e.g. XProc or XSL vs C# or Java)

Answer 1

A:

All other things being equal, it's generally fastest to:

read the XML only once (disk I/O is slow)
build a document tree of nodes entirely in memory,
perform the transformations,
and generate the result.

That is, if you can represent the transformations as code operations on the in-memory node tree rather than having to read them from an XSLT description, that will generally be faster. Either way, some code has to perform the transformations you want, but with XSLT you have the extra step of "read in the transformations from this document and then transform the instructions into code", which tends to be a slow operation.
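As a rough illustration of the read-once, transform-in-memory approach, here is a minimal Java sketch using the standard JAXP DOM API. The `people`/`name` element names and the uppercasing step are hypothetical stand-ins for whatever transformation you actually need, and the identity `Transformer` is used only to serialize the result.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InMemoryTransform {
    // Parse the XML once, change the node tree in code, serialize once.
    static String upperCaseNames(String xml) {
        try {
            // 1. Read the XML a single time and build the in-memory tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // 2. Perform the transformation directly on the nodes.
            //    ("name" is a hypothetical element name.)
            NodeList names = doc.getElementsByTagName("name");
            for (int i = 0; i < names.getLength(); i++) {
                Element name = (Element) names.item(i);
                name.setTextContent(name.getTextContent().toUpperCase(Locale.ROOT));
            }

            // 3. Generate the result; no stylesheet is involved, the
            //    default Transformer just copies the tree to text.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            upperCaseNames("<people><name>ada</name><name>alan</name></people>"));
    }
}
```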

Your mileage may vary. You'll need to be more specific about the individual circumstances before a more precise answer can be given.

John Feminella 2010-06-26 23:40:50

Let's assume the XML is stored in a database. Would it be faster to programmatically execute a transform, or to programmatically loop through the nodes using the loaded XML document's API (i.e. the XElements of a LINQ XDocument)?

smartcaveman 2010-06-27 01:39:30

Either way you still have to read the XML and form an in-memory representation of it, unless you're using some sort of specialized database extension that can perform the XML transforms directly. That means that in general it'll be quicker to programmatically loop through nodes.

John Feminella 2010-06-27 04:20:29

Many XSL transformers have an ability to use a "compiled" stylesheet instead of reparsing the XSLT every time it's used. For libxslt the recommendation is to parse/compile the stylesheet once, then use the in-memory stylesheet struct multiple times. Saxon provides xsltc for non-long-lived processes. So, I wouldn't put too much emphasis on the impact of the stylesheet parse/compile on performance.
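In Java, the same compile-once idea is available through the standard JAXP `Templates` interface. The sketch below assumes a trivial, hypothetical stylesheet that counts `item` elements; the point is only that `compile()` runs once and the resulting object is reused for every document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class CompiledStylesheet {
    // Hypothetical stylesheet: outputs the number of <item> elements as text.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' "
        + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:value-of select='count(//item)'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Parse and compile the stylesheet once; Templates is thread-safe
    // and can be kept for the lifetime of the process.
    static Templates compile() {
        try {
            return TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(XSLT)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Each call creates a cheap Transformer from the compiled stylesheet
    // instead of re-reading the XSLT.
    static String apply(Templates compiled, String xml) {
        try {
            StringWriter out = new StringWriter();
            compiled.newTransformer().transform(
                    new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Templates compiled = compile();
        System.out.println(apply(compiled, "<r><item/><item/></r>"));
        System.out.println(apply(compiled, "<r><item/></r>"));
    }
}
```

Whether this removes enough of the overhead to matter depends on how many documents you push through each compiled stylesheet.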

Owen S. 2010-06-27 06:30:44

Answer 2

A:

Generally speaking, if you're only going to be using the source document's tree once, you're not going to gain much of anything by deserializing it into some specialized object model. The cost of admission - parsing the XML - is likely to dwarf the cost of using it, and any increase in performance that you get from representing the parsed XML in something more efficient than an XML node tree is going to be marginal.

If you're using the data in the source document over and over again, though, it can make a lot of sense to parse that data into some more efficiently-accessible structure. This is why XSLT has the xsl:key element and key() function: looking an XML node up in a hash table can be so much faster than performing a linear search on a list of XML nodes that it was worth putting the capability into the language.
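The code-side equivalent of `xsl:key` is to build a hash table over the parsed nodes once and then do constant-time lookups. Here is a minimal Java sketch, assuming hypothetical `Thing` elements that carry an `id` attribute:

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeIndex {
    // One linear pass to build the index; every later lookup is a hash
    // lookup rather than a scan over the node list.
    static Map<String, Element> indexById(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // "Thing" and "id" are assumed names for this sketch.
            NodeList things = doc.getElementsByTagName("Thing");
            Map<String, Element> byId = new HashMap<>();
            for (int i = 0; i < things.getLength(); i++) {
                Element thing = (Element) things.item(i);
                byId.put(thing.getAttribute("id"), thing);
            }
            return byId;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Element> byId = indexById(
            "<Things><Thing id='a'>first</Thing><Thing id='b'>second</Thing></Things>");
        System.out.println(byId.get("b").getTextContent());
    }
}
```

Building the index only pays off when the same document is queried repeatedly; for a single lookup the extra pass costs about as much as the linear search it replaces.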

To address your specific example, iterating over a List<Thing> is going to perform at the same speed as iterating over a List<XmlNode>. What will make the XSLT slower is not the iteration. It's the searching, and what you do with the found nodes. Executing the XPath query Things/Thing iterates through the child elements of the current node, does a string comparison to check each element's name, and if the element matches, it iterates through that element's child nodes and does another string comparison for each. (Actually, I don't know for a fact that it's doing a string comparison. For all I know, the XSLT processor has hashed the names in the source document and the XPath and is doing integer comparisons of hash values.) That's the expensive part of the operation, not the actual iteration over the resulting node set.

Additionally, most anything that you do with the resulting nodes in XSLT is going to involve linear searches through a node set. Accessing an object's property in C# doesn't. Accessing MyThing.MyProperty is going to be faster than getting at it via <xsl:value-of select='MyProperty'/>.

Generally, that doesn't matter, because parsing XML is expensive whether you deserialize it into a custom object model or an XmlDocument. But there's another case in which it may be relevant: if the source document is very large, and you only need a small part of it.

When you use XSLT, you essentially traverse the XML twice. First you create the source document tree in memory, and then the transform processes this tree. If you have to execute some kind of nasty XPath like //*[@some-attribute='some-value'] to find 200 elements in a million-element document, you're basically visiting each of those million nodes twice.

That's a scenario where it can be worth using an XmlReader instead of XSLT (or before you use XSLT). You can implement a method that traverses the stream of XML and tests each element to see if it's of interest, creating a source tree that contains only your interesting nodes. Or, if you want to get really crazy, you can implement a subclass of XmlReader that skips over uninteresting nodes, and pass that as the input to XslCompiledTransform.Transform(). (I suspect, though, that if you knew enough about how XmlReader works to subclass it you probably wouldn't have needed to ask this question in the first place.) This approach allows you to visit 1,000,200 nodes instead of 2,000,000. It's also a king-hell pain in the ass, but sometimes art demands sacrifice from the artist.
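For Java readers, the closest analogue to XmlReader is the StAX `XMLStreamReader`. The sketch below is one possible take on the first option (a single streaming pass that builds a reduced tree), not the XmlReader subclass. The `matches` wrapper element is an assumption of this example, and for brevity it copies only each matching element's name and attributes, not its children.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class StreamFilter {
    // Stream through the input once and keep only the elements whose
    // given attribute has the given value. The full source tree is
    // never built in memory.
    static Document filter(String xml, String attribute, String value) {
        try {
            Document reduced = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element root = reduced.createElement("matches"); // hypothetical wrapper
            reduced.appendChild(root);

            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && value.equals(reader.getAttributeValue(null, attribute))) {
                    // Shallow copy: element name plus its attributes.
                    Element copy = reduced.createElement(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        copy.setAttribute(reader.getAttributeLocalName(i),
                                          reader.getAttributeValue(i));
                    }
                    root.appendChild(copy);
                }
            }
            reader.close();
            return reduced;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document d = filter(
            "<root><a some-attribute='some-value'/>"
          + "<b some-attribute='other'><c some-attribute='some-value'/></b></root>",
            "some-attribute", "some-value");
        System.out.println(d.getDocumentElement().getChildNodes().getLength());
    }
}
```

The reduced document can then be handed to a transform (for example via a `DOMSource`), so the expensive `//*[...]` style search runs over a few hundred nodes instead of the whole input.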

Robert Rossney 2010-06-28 17:17:20
