I am very new to XSLT, and the first thing I need to do is process a 300MB file (and that's on the small end). The XSLT is not that complex for the moment; it just removes nodes that match certain criteria. I have two problems:

  1. It's too slow. It takes 50 seconds to process 500,000 records, and that's not fast enough.
  2. It consumes 500MB of memory, so this will only get worse as the files get bigger.

Is there anything I can do natively in .NET to make it perform better?

I know I can look into SAX-based parsing, or STX (which is mentioned in another post), but I would prefer to stay within the .NET boundaries.

Thank you!

EDIT: Here's my XSLT

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:test="http://schemas....">
  <xsl:output omit-xml-declaration="yes"/>
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="test:QueryRow[test:Columns/test:QueryColumn[test:Name='hit_count' and test:Value>200]]"/>
</xsl:stylesheet>

Here's the code I use to do the transform:

using System.Xml;
using System.Xml.Xsl;

XslCompiledTransform compiledTransform = new XslCompiledTransform();
XsltSettings settings = new XsltSettings();
settings.EnableScript = true;
compiledTransform.Load("format.xslt", settings, null);
// Create the writer from OutputSettings so xsl:output is honored; the using blocks flush and close out.xml.
using (XmlReader xmlReader = XmlReader.Create("in.xml"))
using (XmlWriter xmlWriter = XmlWriter.Create("out.xml", compiledTransform.OutputSettings))
{
    compiledTransform.Transform(xmlReader, xmlWriter); // this is what takes a long time
}

At the moment I am trying to just read the file in, and write it back out, but it seems to actually be reading the whole file into memory, so I am trying to find a way to read it line by line.

+1  A: 

You could try Saxon, which I hear is a very good and efficient XSLT processor. However, XSLT in general cannot be processed in a streaming manner, even though your particular transform sounds like it could be. So unless the XSLT processor is very good at optimizing (as I understand it, Saxon is one of the best, if not the best), your memory consumption problem may not be solvable.

jk
+3  A: 

Try profiling your XSLT. oXygen has a nice profiling capability that can tell you where the hot spots are in your transforms.


You could have some inefficient XPath expressions (e.g. //*), or logic buried inside your templates (e.g. lots of for-each, if, choose, etc.) that prevents the XSLT engine from optimizing. Moving some of that logic up into the template match criteria can help the engine optimize and reduce the size of the node sets that you iterate over and evaluate.

Mads Hansen
+1  A: 

The XPath expression you're filtering on doesn't have anything obviously wrong with it, as such. But it's easy to envision it being a problem. If your QueryRow elements all have 20 Columns children, each of which has 20 QueryColumn children, the XSLT processor has to examine up to 400 elements before deciding that a given QueryRow element doesn't match. That's conceivably pretty inefficient, because if it turns out that the element shouldn't be filtered, the XSLT processor then has to visit all 400 elements again to output them.

The .NET way to implement SAX-like XML parsing is to subclass XmlReader, which you could conceivably do in this case: you basically build an XmlReader that buffers QueryRow elements as it reads their descendants until it determines that they're OK, and then returns them to the caller of the Read method. That's going to be considerably faster than using XSLT to filter the XML, since using an XmlReader doesn't require you to build an in-memory representation of the unfiltered XML document before you can filter it.
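
A minimal sketch of that buffer-then-filter idea, offered as an illustration rather than a definitive implementation. Instead of subclassing XmlReader, it uses a plain XmlReader/XmlWriter loop and buffers one QueryRow at a time with XNode.ReadFrom (System.Xml.Linq, so .NET 3.5 or later is assumed). The names QueryRowFilter, ShouldRemove and Filter are made up for this example, and the namespace URI is a placeholder because the real one is elided in the question. The number parsing only approximates XPath's string-to-number conversion.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class QueryRowFilter
{
    // Hypothetical placeholder: the real namespace URI is elided in the question.
    static readonly XNamespace Ns = "urn:example:test";

    // Approximates the stylesheet's match pattern: drop a row when some
    // Columns/QueryColumn has Name 'hit_count' and a numeric Value above 200.
    // Assumes each QueryColumn has at most one Name and one Value child.
    static bool ShouldRemove(XElement row)
    {
        return row.Elements(Ns + "Columns")
                  .Elements(Ns + "QueryColumn")
                  .Any(c => (string)c.Element(Ns + "Name") == "hit_count"
                         && double.TryParse((string)c.Element(Ns + "Value"),
                                NumberStyles.Float, CultureInfo.InvariantCulture, out double v)
                         && v > 200);
    }

    public static void Filter(XmlReader reader, XmlWriter writer)
    {
        reader.Read();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element
                && reader.LocalName == "QueryRow"
                && reader.NamespaceURI == Ns.NamespaceName)
            {
                // Buffer just this one row. ReadFrom leaves the reader on the
                // node that follows the row, so no extra Read() is needed here.
                var row = (XElement)XNode.ReadFrom(reader);
                if (!ShouldRemove(row)) row.WriteTo(writer);
            }
            else if (reader.NodeType == XmlNodeType.Element)
            {
                // Copy the start tag and attributes only, then descend.
                writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                writer.WriteAttributes(reader, false);
                if (reader.IsEmptyElement) writer.WriteEndElement();
                reader.Read();
            }
            else if (reader.NodeType == XmlNodeType.EndElement)
            {
                writer.WriteFullEndElement();
                reader.Read();
            }
            else
            {
                // Text, whitespace, comments and so on: copy the node and advance.
                writer.WriteNode(reader, false);
            }
        }
    }
}
```

To use it, create the reader with XmlReader.Create("in.xml") and the writer with XmlWriter.Create("out.xml", new XmlWriterSettings { OmitXmlDeclaration = true }), both inside using blocks, and call QueryRowFilter.Filter(reader, writer). Only one QueryRow is held in memory at a time, so memory use should depend on the size of the largest row rather than on the size of the file.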

Robert Rossney
This is pretty much what I am trying to do now: buffer each QueryRow, and then filter that.
Pasha