views:

216

answers:

8

I have a situation where I want to extract some information from some very large but regular XML files (I just had to do it with a 500 MB file), and where XSLT would be perfect.

Unfortunately, the XSLT implementations I am aware of (except the most expensive version of Saxon) do not support reading in only the necessary part of the DOM; they read in the whole tree. This causes the computer to swap to death.

The XPath in question is

//m/e[contains(.,'foobar')]

so it is essentially just a grep.

Is there an XSLT implementation which can do this? Or an XSLT implementation which, given suitable "advice", can do this trick of pruning away the parts in memory which will not be needed again?

I'd prefer a Java implementation but both Windows and Linux are viable native platforms.


EDIT: The input XML looks like:

<log>
<!-- Fri Jun 26 12:09:27 CEST 2009 -->
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>Registering Catalina:type=Manager,path=/axsWHSweb-20090626,host=localhost</m></e>
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>Force random number initialization starting</m></e>
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>Getting message digest component for algorithm MD5</m></e>
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>Completed getting message digest component</m></e>
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>getDigest() 0</m></e>
......
</log>

Essentially I want to select some m-nodes (and I know the XPath above is wrong for that; it was just a quick hack), but maintain the XML layout.


EDIT: It appears that STX may be what I am looking for (I can live with another transformation language), and that Joost is an implementation thereof. Any experiences?


EDIT: I found that Saxon 6.5.4 with -Xmx1500m could load my XML, so this allows me to use my XPaths right now. This is just a stroke of luck, though, so I'd still like to solve this generically - which means scriptable, which in turn means no handcrafted Java filtering first.


EDIT: Oh, by the way. This is a log file very similar to what the log4j XMLLayout generates. The whole reason for using XML is to be able to do exactly this, namely run queries on the log. This is the initial try, hence the simple question. Later I'd like to ask more complex questions - therefore I'd like the query language to be able to handle the input file.

+2  A: 

You should be able to implement this without a full table scan. The '//' operator means find an element in the tree at any level, which is pretty expensive to evaluate, especially on a document of your size. If you optimize your XPath query or consider setting up match templates, the XSLT transformer may not need to load the entire document into memory.

Based on your XML sample, you are looking to match /log/e/m[ ... predicate ...]. Some XSLT processors can optimize that to avoid scanning the full document, where // would defeat any such optimization.
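To illustrate, a minimal stylesheet sketch for "select the matching e-nodes but keep the XML layout" (element names taken from your sample; 'foobar' is a placeholder for the real search term):

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Rebuild the log root, copying only the e elements whose m text matches -->
  <xsl:template match="/log">
    <log>
      <xsl:copy-of select="e[contains(m, 'foobar')]"/>
    </log>
  </xsl:template>
</xsl:stylesheet>
```

Note this avoids //, so a streaming-capable processor has a chance to process it without building the whole tree.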

Since your XML document is pretty simple, it might be easier to not use XSLT at all. StAX is a great streaming API for handling large XML documents. Dom4j also has good support for XPath-like queries against large documents. Info on using dom4j for large documents is here: http://dom4j.sourceforge.net/dom4j-1.6.1/faq.html#large-doc

Sample from the above source:

import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.ElementHandler;
import org.dom4j.ElementPath;
import org.dom4j.io.SAXReader;

SAXReader reader = new SAXReader();
reader.addHandler( "/ROWSET/ROW", 
    new ElementHandler() {
        public void onStart(ElementPath path) {
            // do nothing here...    
        }
        public void onEnd(ElementPath path) {
            // process a ROW element
            Element row = path.getCurrent();
            Element rowSet = row.getParent();
            Document document = row.getDocument();
            // ... use the row here ...
            // prune the tree
            row.detach();
        }
    }
);

Document document = reader.read(url);

// The document will now be complete but all the ROW elements
// will have been pruned.
// We may want to do some final processing now
// ...
Chris Dail
Yes, even if a processor is capable of operating on a subsection of a document, starting with "//" would remove that ability.
Paul Butcher
It is fine to do the complete scan of the XML _file_. I just don't have memory for the complete DOM tree of the XML file.
Thorbjørn Ravn Andersen
What XSLT processor doesn't load the entire source tree into memory?
Robert Rossney
@Robert, e.g. the enterprise version of Saxon. Please reread the question.
Thorbjørn Ravn Andersen
A: 

This is a stab in the dark, and maybe you'll laugh me out of the house.

Nothing stops you from connecting a SAX source to the input of your XSLT; and it is at least in theory easy enough to do your grep from a SAX stream without needing a DOM. So... wanna give that a try?
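To make the plumbing concrete, here is a minimal sketch of feeding a SAX reader into a transform via SAXSource. It uses the JDK's identity transformer just to show the hookup; a real run would pass a stylesheet to newTransformer(new StreamSource(...)) instead:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxPipe {
    // Run a transform whose input comes from a SAX XMLReader rather than a DOM.
    public static String identityViaSax(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // Identity transformer; swap in your stylesheet here.
        Transformer t = TransformerFactory.newInstance().newTransformer();

        StringWriter out = new StringWriter();
        t.transform(new SAXSource(reader, new InputSource(new StringReader(xml))),
                    new StreamResult(out));
        return out.toString();
    }
}
```

Whether this actually streams still depends on the processor: most will build their own source tree from the SAX events, so this only avoids an intermediate DOM copy.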

Carl Smotricz
My initial example is very simple in terms of what I want to do. As you may have guessed already, this is a log file with structured data. My future goal is to mine it with more complex questions, hence the XPath approach.
Thorbjørn Ravn Andersen
That doesn't change much. The thing to realize is that a log file structure means you'll have lots of relatively small elements at the (root+1) level, each of which won't consume much memory. As long as you don't do any operations that require more than one of the elements at this level, there's nothing stopping XSLT from doing them sequentially, and you can apply whatever selections and transforms you wish.
Carl Smotricz
+3  A: 

Consider VTD-XML. It is much more memory efficient. You can find an API here and benchmarks here.

[memory usage benchmark graph]

Note that the last graph shows that DOM uses at minimum 5x as much memory as the XML file itself is big. That is really astonishing, isn't it?

As a bonus, it is also faster at parsing and XPath than DOM and the JDK:

[parsing benchmark graph]

[XPath benchmark graph]

BalusC
Looks good, but does it actually prune the input or just make the internal representation smaller?
Thorbjørn Ravn Andersen
The internal representation. Also see http://vtd-xml.sourceforge.net/VTD.html and the other pages (developer's guide) over there.
BalusC
Ok, thanks. The smaller internal representation does not solve the scalability problem, only postpones it, so this is not the right approach, unfortunately - it does look interesting, though.
Thorbjørn Ravn Andersen
Much luck finding "the right approach" then :) You could eventually homegrow one yourself which opens and reads the XML file line by line *every time*, scanning for matches and forgetting the previous line so that it saves memory. But how far would you go? A painfully **slow** but memory efficient approach? Or a **fast** and memory efficient approach? By the way, a 500MB XML file sounds more like a candidate for an embedded DB or maybe a full-blown DB server. SQL is undoubtedly the best approach for those sizes.
BalusC
A: 

Try the CAX parser from xponentsoftware. It is a fast XML parser built on Microsoft's XmlReader. It gives the full path as you parse each element, so you could check whether the path is "m/e" and then whether the text node contains "foo".

bill seacham
So you suggest writing a small preprocessor to do this? Interesting, but not what I want to do.
Thorbjørn Ravn Andersen
Not at all. You mentioned trying StAX and Saxon, which are XML parsers; so is CAX. It gives you full access to any size of XML but uses a tiny amount of memory. Unlike StAX and Saxon, CAX allows you to look back at everything that has been parsed, which you often need to do when transforming XML.
bill seacham
A: 

The Enterprise Edition of the Saxon XSLT Processor supports streaming of large documents for exactly this type of problem.

Robert Christie
Which is the one I refer to as the most expensive version of Saxon. Our usage does not warrant a £300 purchase.
Thorbjørn Ravn Andersen
A: 

I'm not a Java guy, and I don't know if the tools I'd use to do this in .NET have analogs in the Java world.

To solve this problem in .NET, I'd derive a class from XmlReader, and have it only return the elements that I'm interested in. Then I can use the XmlReader as the input for any XML object, like an XmlDocument or an XslCompiledTransform. The XmlReader subclass basically pre-processes the input stream, making it look like a much, much smaller XML document to whatever class is using it to read from.

It seems like the technique described here is analogous. But I am, as I say, not a Java guy.
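For what it's worth, the closest Java analog is probably a SAX filter: subclass XMLFilterImpl, suppress the events you don't want, and hand the filter to the consumer as its XMLReader. A sketch under assumed simplifications (this one simply drops every e element to demonstrate the pruning; a real filter would buffer each e and forward it only when its m text matches):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// A SAX filter that hides e elements (and their content) from the consumer,
// making the stream look like a much smaller document.
public class PruneFilter extends XMLFilterImpl {
    private int depth = 0; // > 0 while inside a suppressed e element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || "e".equals(local)) { depth++; return; }
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (depth > 0) { depth--; return; }
        super.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int len) throws SAXException {
        if (depth == 0) super.characters(ch, start, len);
    }

    // Convenience: run the filtered stream through an identity transform.
    public static String filter(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        PruneFilter f = new PruneFilter();
        f.setParent(spf.newSAXParser().getXMLReader());
        StringWriter out = new StringWriter();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(new SAXSource(f, new InputSource(new StringReader(xml))),
                    new StreamResult(out));
        return out.toString();
    }
}
```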

Robert Rossney
So you suggest writing a small preprocessor to do this? Interesting, but not what I want to do.
Thorbjørn Ravn Andersen
A: 

STX contains a streamable subset of XPath, called STXPath I believe; I should remember, because I co-wrote the spec :-)

You could definitely pick up Joost and extract the relevant bits, but note that STX didn't get wide industry acceptance, so you need to do some due diligence as to the current stability and support of the tool.

xcut
STX looks very interesting. When you say "extract the relevant bits" do you mean that Joost actually supports STXPath or that some elbow grease is needed to make it do it?
Thorbjørn Ravn Andersen
AFAIK Joost implements STXPath. When I say "extract the relevant bits", I mean extract the STXPath processing, since your original question asks about extraction rather than transformation.
xcut
A: 

You could do it via STX/Joost as already suggested, but note that many XSLT implementations have a SAX streaming mode and don't need to keep everything in memory. You just need to make sure your XSLT file isn't looking along any of the wrong axes.

However, if I were you and really wanted performance, I'd do it in StAX. It's simple, standard and fast. It comes out of the box in Java 6, although you can also use Woodstox for a slightly better implementation.

For the XPath you listed, the implementation is trivial. The downside is that you have more code to maintain, and it's just not as expressive and high-level as the XPath you would have in Joost or XSLT.
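A sketch of that trivial StAX implementation (element names from the sample in the question; the keyword is a placeholder). Only one parse event is held at a time, so memory use is independent of file size:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class LogGrep {
    // Stream the document and collect the text of every m element
    // that contains the keyword - the moral equivalent of the grep XPath.
    public static List<String> grep(String xml, String keyword) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> hits = new ArrayList<>();
        boolean inM = false;
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if ("m".equals(reader.getLocalName())) { inM = true; text.setLength(0); }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (inM) text.append(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if ("m".equals(reader.getLocalName())) {
                        inM = false;
                        if (text.indexOf(keyword) >= 0) hits.add(text.toString());
                    }
                    break;
            }
        }
        return hits;
    }
}
```

In a real run you'd create the reader from a FileInputStream instead of a String, and emit the surrounding e element rather than just the text, but the event loop stays the same.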

David Roussel