views:

216

answers:

8

I have a situation where I want to extract some information from some very large but regular XML files (I just had to do it with a 500 MB file), and where XSLT would be perfect.

Unfortunately, the XSLT implementations I am aware of (except the most expensive version of Saxon) do not support reading in only the necessary part of the DOM; they read in the whole tree. This causes the computer to swap to death.

The XPath in question is

//m/e[contains(.,'foobar')]

so it is essentially just a grep.

Is there an XSLT implementation which can do this? Or an XSLT implementation which, given suitable "advice", can do this trick of pruning away the parts in memory which will not be needed again?

I'd prefer a Java implementation but both Windows and Linux are viable native platforms.


EDIT: The input XML looks like:

<log>
<!-- Fri Jun 26 12:09:27 CEST 2009 -->
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>Registering Catalina:type=Manager,path=/axsWHSweb-20090626,host=localhost</m></e>
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>Force random number initialization starting</m></e>
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>Getting message digest component for algorithm MD5</m></e>
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>Completed getting message digest component</m></e>
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>getDigest() 0</m></e>
......
</log>

Essentially I want to select some m-nodes (and I know the XPath above is wrong for that; it was just a quick hack), but maintain the XML layout.


EDIT: It appears that STX may be what I am looking for (I can live with another transformation language), and that Joost is an implementation thereof. Any experiences?


EDIT: I found that Saxon 6.5.4 with -Xmx1500m could load my XML, so this allows me to use my XPaths right now. This is just a stroke of luck, though, so I'd still like to solve this generically - which means scriptable, which in turn means no handcrafted Java filtering first.


EDIT: Oh, by the way. This is a log file very similar to what the log4j XMLLayout generates. The whole reason for using XML is to be able to do exactly this, namely run queries on the log. This is the initial try, hence the simple question. Later I'd like to ask more complex questions - therefore I'd like the query language to be able to handle the input file.

+2  A: 

You should be able to implement this without a full table scan. The '//' operator means find an element in the tree at any level, which is pretty expensive to evaluate, especially on a document of your size. If you optimize your XPath query or consider setting up match templates, the XSLT transformer may not need to load the entire document into memory.

Based on your XML sample, you are looking to match /log/e/m[ ... predicate ...]. Some XSLT processors can optimize that to avoid scanning the full document, where // would defeat any such optimization.
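To illustrate, a minimal stylesheet sketch for "select the matching e-nodes but keep the XML layout" (element names taken from your sample; 'foobar' is a placeholder for the real search term):

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Rebuild the log root, copying only the e elements whose m text matches -->
  <xsl:template match="/log">
    <log>
      <xsl:copy-of select="e[contains(m, 'foobar')]"/>
    </log>
  </xsl:template>
</xsl:stylesheet>
```

Note this avoids //, so a streaming-capable processor has a chance to process it without building the whole tree.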

Since your XML document is pretty simple, it might be easier to not use XSLT at all. StAX is a great streaming API for handling large XML documents. Dom4j also has good support for XPath-like queries against large documents. Info on using dom4j for large documents is here: http://dom4j.sourceforge.net/dom4j-1.6.1/faq.html#large-doc

Sample from the above source:

import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.ElementHandler;
import org.dom4j.ElementPath;
import org.dom4j.io.SAXReader;

SAXReader reader = new SAXReader();
reader.addHandler( "/ROWSET/ROW", 
    new ElementHandler() {
        public void onStart(ElementPath path) {
            // do nothing here...    
        }
        public void onEnd(ElementPath path) {
            // process a ROW element
            Element row = path.getCurrent();
            Element rowSet = row.getParent();
            Document document = row.getDocument();
            // ... use the row here ...
            // prune the tree
            row.detach();
        }
    }
);

Document document = reader.read(url);

// The document will now be complete but all the ROW elements
// will have been pruned.
// We may want to do some final processing now
// ...
Chris Dail
Yes, even if a processor is capable of operating on a subsection of a document, starting with "//" would remove that ability.
Paul Butcher
It is fine to do the complete scan of the XML _file_. I just don't have memory for the complete DOM tree of the XML file.
Thorbjørn Ravn Andersen
What XSLT processor doesn't load the entire source tree into memory?
Robert Rossney
@Robert, e.g. the enterprise version of Saxon. Please reread the question.
Thorbjørn Ravn Andersen
A: 

This is a stab in the dark, and maybe you'll laugh me out of the house.

Nothing stops you from connecting a SAX source to the input of your XSLT; and it is at least in theory easy enough to do your grep from a SAX stream without needing a DOM. So... wanna give that a try?
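To make the plumbing concrete, here is a minimal sketch of feeding a SAX reader into a transform via SAXSource. It uses the JDK's identity transformer just to show the hookup; a real run would pass a stylesheet to newTransformer(new StreamSource(...)) instead:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxPipe {
    // Run a transform whose input comes from a SAX XMLReader rather than a DOM.
    public static String identityViaSax(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // Identity transformer; swap in your stylesheet here.
        Transformer t = TransformerFactory.newInstance().newTransformer();

        StringWriter out = new StringWriter();
        t.transform(new SAXSource(reader, new InputSource(new StringReader(xml))),
                    new StreamResult(out));
        return out.toString();
    }
}
```

Whether this actually streams still depends on the processor: most will build their own source tree from the SAX events, so this only avoids an intermediate DOM copy.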

Carl Smotricz
My initial example is very simple in terms of what I want to do. As you may have guessed already, this is a log file with structured data. My future goal is to mine it with more complex questions, hence the XPath approach.
Thorbjørn Ravn Andersen
That doesn't change much. The thing to realize is that a log file structure means you'll have lots of relatively small elements at the (root+1) level, each of which won't consume much memory. As long as you don't do any operations that require more than one of the elements at this level, there's nothing stopping XSLT from doing them sequentially, and you can apply whatever selections and transforms you wish.
Carl Smotricz
+3  A: 

Consider VTD-XML. It is much more memory efficient. You can find an API here and benchmarks here.

[memory usage benchmark graph]

Note that the last graph shows that DOM uses at minimum 5x as much memory as the XML file itself is big. That is really astonishing, isn't it?

As a bonus, it is also faster at parsing and XPath than DOM and the JDK:

[parsing benchmark graph]

[XPath benchmark graph]

BalusC
Looks good, but does it actually prune the input or just make the internal representation smaller?
Thorbjørn Ravn Andersen
The internal representation. Also see http://vtd-xml.sourceforge.net/VTD.html and the other pages (developer's guide) over there.
BalusC
Ok, thanks. The smaller internal representation does not solve the scalability problem, only postpones it, so this is not the right approach, unfortunately - it does look interesting, though.
Thorbjørn Ravn Andersen
Much luck finding "the right approach" then :) You could eventually homegrow one yourself which opens and reads the XML file line by line *every time*, scanning for matches and forgetting the previous line so that it saves memory. But how far would you go? A painfully **slow** but memory efficient approach? Or a **fast** and memory efficient approach? By the way, a 500MB XML file sounds more like a candidate for an embedded DB or maybe a full-blown DB server. SQL is undoubtedly the best approach for those sizes.
BalusC
A: 

Try the CAX parser from xponentsoftware. It is a fast XML parser built on Microsoft's XmlReader. It gives the full path as you parse each element, so you could check whether the path is "m/e" and then whether the text node contains "foo".

bill seacham
So you suggest writing a small preprocessor to do this? Interesting, but not what I want to do.
Thorbjørn Ravn Andersen
Not at all. You mentioned trying StAX and Saxon, which are XML parsers; so is CAX. It gives you full access to any size of XML but uses a tiny amount of memory. Unlike StAX and Saxon, CAX allows you to look back at everything that has been parsed, which you often need to do when transforming XML.
bill seacham
A: 

The Enterprise Edition of the Saxon XSLT Processor supports streaming of large documents for exactly this type of problem.

Robert Christie
Which is the one I refer to as the most expensive version of Saxon. Our usage does not warrant a £300 purchase.
Thorbjørn Ravn Andersen
A: 

I'm not a Java guy, and I don't know if the tools I'd use to do this in .NET have analogs in the Java world.

To solve this problem in .NET, I'd derive a class from XmlReader, and have it only return the elements that I'm interested in. Then I can use the XmlReader as the input for any XML object, like an XmlDocument or an XslCompiledTransform. The XmlReader subclass basically pre-processes the input stream, making it look like a much, much smaller XML document to whatever class is using it to read from.

It seems like the technique described here is analogous. But I am, as I say, not a Java guy.
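For what it's worth, the closest Java analog is probably a SAX filter: subclass XMLFilterImpl, suppress the events you don't want, and hand the filter to the consumer as its XMLReader. A sketch under assumed simplifications (this one simply drops every e element to demonstrate the pruning; a real filter would buffer each e and forward it only when its m text matches):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// A SAX filter that hides e elements (and their content) from the consumer,
// making the stream look like a much smaller document.
public class PruneFilter extends XMLFilterImpl {
    private int depth = 0; // > 0 while inside a suppressed e element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || "e".equals(local)) { depth++; return; }
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (depth > 0) { depth--; return; }
        super.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int len) throws SAXException {
        if (depth == 0) super.characters(ch, start, len);
    }

    // Convenience: run the filtered stream through an identity transform.
    public static String filter(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        PruneFilter f = new PruneFilter();
        f.setParent(spf.newSAXParser().getXMLReader());
        StringWriter out = new StringWriter();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(new SAXSource(f, new InputSource(new StringReader(xml))),
                    new StreamResult(out));
        return out.toString();
    }
}
```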

Robert Rossney
So you suggest writing a small preprocessor to do this? Interesting, but not what I want to do.
Thorbjørn Ravn Andersen
A: 

STX contains a streamable subset of XPath, called STXPath I believe; I should remember, because I co-wrote the spec :-)

You could definitely pick up Joost and extract the relevant bits, but note that STX didn't get wide industry acceptance, so you need to do some due diligence as to the current stability and support of the tool.

xcut
STX looks very interesting. When you say "extract the relevant bits" do you mean that Joost actually supports STXPath or that some elbow grease is needed to make it do it?
Thorbjørn Ravn Andersen
AFAIK Joost implements STXPath. When I say "extract the relevant bits", I mean extract the STXPath processing, since your original question asks about extraction rather than transformation.
xcut
A: 

You could do it via STX/Joost as already suggested, but note that many XSLT implementations have a SAX streaming mode and don't need to keep everything in memory. You just need to make sure your XSLT file isn't looking along any of the wrong axes.

However, if I were you and really wanted performance, I'd do it in StAX. It's simple, standard and fast. It comes out of the box in Java 6, although you can also use Woodstox for a slightly better implementation.

For the XPath you listed, the implementation is trivial. The downside is that you have more code to maintain, and it's just not as expressive and high-level as the XPath you would have in Joost or XSLT.
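A sketch of that trivial StAX implementation (element names from the sample in the question; the keyword is a placeholder). Only one parse event is held at a time, so memory use is independent of file size:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class LogGrep {
    // Stream the document and collect the text of every m element
    // that contains the keyword - the moral equivalent of the grep XPath.
    public static List<String> grep(String xml, String keyword) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> hits = new ArrayList<>();
        boolean inM = false;
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if ("m".equals(reader.getLocalName())) { inM = true; text.setLength(0); }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (inM) text.append(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if ("m".equals(reader.getLocalName())) {
                        inM = false;
                        if (text.indexOf(keyword) >= 0) hits.add(text.toString());
                    }
                    break;
            }
        }
        return hits;
    }
}
```

In a real run you'd create the reader from a FileInputStream instead of a String, and emit the surrounding e element rather than just the text, but the event loop stays the same.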

David Roussel