Are there any production-ready libraries for streaming evaluation of XPath expressions against a given XML document? My investigation shows that most existing solutions load the entire DOM tree into memory before evaluating the XPath expression.

+1  A: 

Would this be practical for a complete XPath implementation, given that XPath syntax allows for:

/AAA/XXX/following::*

and

/AAA/BBB/following-sibling::*

which implies look-ahead requirements? That is, from a particular node you're going to have to load the rest of the document anyway.

The documentation for the Nux library (specifically StreamingPathFilter) makes this point, and references some implementations that rely on a subset of XPath. Nux claims some streaming query capability, but given the above, its XPath support will necessarily have some limitations.
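
To make the "subset of XPath" idea concrete, here is a minimal sketch of how a purely forward, child-axis path such as /AAA/BBB can be matched in a single pass with the JDK's StAX API, keeping only the stack of open element names in memory. This is not how Nux or any other library mentioned here works internally; the class name `StreamingPathMatch` and the method `match` are made up for illustration, and the sketch assumes the matched elements contain only text and that namespaces can be ignored.

```java
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only.
class StreamingPathMatch {

    // Returns the text of every element matching an absolute child-axis path
    // such as "/AAA/BBB". Predicates, wildcards and other axes are not supported,
    // and element names are compared by local name only (namespaces ignored).
    static List<String> match(Reader xml, String path) {
        String[] steps = path.substring(1).split("/");
        Deque<String> open = new ArrayDeque<>(); // names of currently open elements, root first
        List<String> hits = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    open.addLast(r.getLocalName());
                    if (matches(open, steps)) {
                        // getElementText() reads up to the matching END_ELEMENT,
                        // so the element is popped here rather than in the branch below.
                        // It throws if the element has child elements.
                        hits.add(r.getElementText());
                        open.removeLast();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    open.removeLast();
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML or non-text match", e);
        }
        return hits;
    }

    // True when the stack of open elements is exactly the requested path.
    private static boolean matches(Deque<String> open, String[] steps) {
        if (open.size() != steps.length) {
            return false;
        }
        int i = 0;
        for (String name : open) { // ArrayDeque iterates from first (root) to last
            if (!name.equals(steps[i++])) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `StreamingPathMatch.match(new FileReader("big.xml"), "/AAA/BBB")` (the file name is a placeholder) would stream through the file once. An axis such as following:: or following-sibling:: does not fit this scheme, because whether a node belongs to the result depends on what has been seen earlier or what comes later in the document, which is exactly the limitation described above.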

Brian Agnew
Actually, I need to execute simple XPath queries that check several nodes in a given XML document for validation purposes. The XML document represents an entity, and some of its nodes store foreign keys to other entities. As a result, some kind of integrity validation has to be applied against these special nodes. The overall document is quite large, and it would be inefficient to hold such an amount of data in memory just to execute several simple XPath queries.
nixau
It looks like the Nux library may well be able to help you in this scenario. Alternatively, could you use a StAX library and apply the XPath to the local XML fragment that you pull from a certain node?
Brian Agnew
Actually, I can't use the second approach: the structure of the XML document is relatively simple, so it makes no sense to pull out a certain node of the document and evaluate an XPath expression against it.
nixau
I think I will try XOM for now. @Brian thanks for your suggestions, I appreciate it.
nixau
A: 

Try Joost.

FoxyBOA
+2  A: 

There are several options:

  • DataDirect Technologies sells an XQuery implementation that employs projection and streaming where possible. It can handle files in the multi-gigabyte range, i.e. larger than available memory. It's a thread-safe library, so it's easy to integrate. Java-only.

  • Saxon is an open-source option, with a modestly priced paid cousin, which will do streaming in some contexts. Java, but with a .NET port also.

  • MarkLogic and eXist are XML databases that, if your XML is loaded into them, will process XPaths in a fairly intelligent fashion.

lavinio
A: 

FWIW, I've used Nux streaming filter XPath queries against very large (>3GB) files, and it has both worked flawlessly and used very little memory. My use case has been slightly different (not validation-centric), but I'd highly encourage you to give Nux a shot.

David