Hello Experts,

JAXB makes working with XML so much easier, but I currently have a big problem: the documents I have to process are too large for the in-memory unmarshalling that JAXB does. The data can be up to 4GB per document.

The data structure I have to process is very simple and flat: a root element containing millions of “element” children…

<root>
  <element>
    <sub>foo</sub>
  </element>
  <element>
    <sub>foo</sub>
  </element>
</root>

My questions are:

  1. Does JAXB perhaps support unmarshalling in a “stream-based” way that does not require building the whole object tree in memory, but instead gives me some kind of “Iterator” over the elements, element by element, as in the sketch after this list? (Maybe I just missed it somehow…)

  2. If not, what would you propose as a good alternative, with a) a flat learning curve, ideally very similar to JAXB, and b) VERY IMPORTANT: ideally with a tool for generating the unmarshaller code from an XSD file OR from annotated Java classes?

  3. I have searched SO, and the two libraries that ended up on my “watchlist” (without comparing them more closely) were Apache XMLBeans and XStream… What other libraries might be even better for this purpose, and what are their advantages and disadvantages?
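
To make point 1 concrete, what I am hoping for is something along these lines (a purely hypothetical API, just to illustrate the idea):

// Hypothetical: this is the kind of API I would like to have,
// not something I know JAXB (or any other library) to offer.
Iterator<Element> it = SomeStreamingUnmarshaller.iterate(new File("huge.xml"), Element.class);
while (it.hasNext()) {
    Element e = it.next();
    process(e);  // handle one element, then let it be garbage collected
}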

Thank you very much!!! Jan

A: 

Those are all the wrong approach, since they're all basically "bean" mappers. That is, they convert an XML document into Java Beans. In order to do that, you pretty much have to suck the whole thing into memory.

Now, obviously, there are "better" ways it could be done. For example, it's not actually necessary to load an entire XML DOM in order to map a bean, though I don't actually know HOW JAXB et al perform their serialisation. I suspect they don't bother with a DOM, but rather populate bean fields directly as the XML streams by. That saves some processing, but you still end up with the entire document in RAM as a set of class instances.

Now, if you just want a little bit of the XML document, you might want to consider a StAX implementation. This is a pull-based interface on top of a streaming parser: instead of the parser pushing events at you the way SAX does, you ask for the next event when you are ready for it. It only moves forward through the document, though, so if what you need is near the front you win, because you can stop early and throw the rest away; if what you need is scattered all the way to the end, you still have to walk the whole stream and hold on to whatever you want to keep yourself.
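
For reference, a bare StAX loop over the structure in the question would look roughly like this (file name made up, error handling omitted); only the current event is ever held in memory:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxScan {
    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("huge.xml"));
        while (reader.hasNext()) {
            // Pull the next event; nothing already seen is kept around.
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "sub".equals(reader.getLocalName())) {
                String value = reader.getElementText();  // text content of one <sub>
                process(value);                          // handle it and move on
            }
        }
        reader.close();
    }

    private static void process(String value) {
        System.out.println(value);  // placeholder for the real per-element work
    }
}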

Which leaves you with good ol' SAX. And everyone knows that with SAX, you get the blues, because it's such a primitive layer. But it's the most efficient option, and it gives you the most control.

The XSD mapping will be difficult, simply because the beauty of the mapping frameworks is that they know what to do with all of the elements (they create class instances and stuff them into parent classes). You want to do something different: something arbitrary, at arbitrary points.

SAX isn't that bad. I wrote a nice little crude mapper that kind of lets you do what you want to do, save that you have to hand-code it rather than generate it from an XSD, and it's in Obj-C, not Java. Basically it walked the XML stream and looked for setters on classes based on the path name, which replaced the typical huge "if element = "name"..." chains you otherwise get in the element callback with SAX code.
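
Not my mapper, obviously, but for the structure in the question the plain-SAX version is short enough to sketch here (class and file names are mine): it builds up one value at a time, hands it off, and forgets it, so memory use stays flat no matter how many elements there are.

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ElementSaxHandler extends DefaultHandler {

    private StringBuilder text;   // collects character data for the current <sub>
    private String currentSub;    // value of <sub> within the current <element>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("sub".equals(qName)) {
            text = new StringBuilder();
        } else if ("element".equals(qName)) {
            currentSub = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("sub".equals(qName)) {
            currentSub = text.toString();
            text = null;
        } else if ("element".equals(qName)) {
            process(currentSub);  // one <element> is complete; handle it and forget it
        }
    }

    private void process(String sub) {
        System.out.println(sub);  // placeholder for the real per-element work
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File("huge.xml"), new ElementSaxHandler());
    }
}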

Not the answer you were looking for, I'm sure... I'd be happy to be proved wrong.

Will Hartung
SAX, in my view, is not necessarily the best approach, and not the most efficient either.
vtd-xml-author
+1  A: 

I would dig the JAXB/StAX approach (for something stream-based but with typed Java objects). Have a look at this post (more a hint than a strong lead though).
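
In short, the usual shape of that approach is to drive a StAX XMLStreamReader yourself and hand it to a JAXB Unmarshaller one <element> at a time, roughly like this (the minimal Element class and the file name are mine; a class generated from the XSD works the same way):

import java.io.FileInputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class JaxbStaxDemo {

    // Minimal mapping for the <element><sub>...</sub></element> shape from the question.
    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Element {
        public String sub;
    }

    public static void main(String[] args) throws Exception {
        Unmarshaller um = JAXBContext.newInstance(Element.class).createUnmarshaller();
        XMLStreamReader xsr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("huge.xml"));

        while (xsr.hasNext()) {
            if (xsr.getEventType() == XMLStreamConstants.START_ELEMENT
                    && "element".equals(xsr.getLocalName())) {
                // Unmarshal exactly one <element>; the reader is left just past its end tag,
                // so only one typed object is in memory at a time.
                Element e = um.unmarshal(xsr, Element.class).getValue();
                process(e);
            } else {
                xsr.next();
            }
        }
        xsr.close();
    }

    private static void process(Element e) {
        System.out.println(e.sub);  // placeholder for the real per-element work
    }
}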

Pascal Thivent
A: 

The key to data binding for big documents is to use XPath to select only the items that you need and filter out everything else... see the article below:

http://onjava.com/pub/a/onjava/2007/09/07/schema-less-java-xml-data-binding-with-vtd-xml.html
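
If I recall the VTD-XML API correctly, the selective XPath approach from the article looks roughly like this (note that for multi-gigabyte input you would need the extended "huge" variant of the parser rather than the standard one):

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdXPathScan {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        if (!vg.parseFile("huge.xml", false)) {   // false = not namespace-aware
            throw new RuntimeException("parse failed");
        }
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("/root/element/sub");      // select only the items you need

        while (ap.evalXPath() != -1) {
            int t = vn.getText();                 // token index of the text node
            if (t != -1) {
                process(vn.toNormalizedString(t));
            }
        }
    }

    private static void process(String sub) {
        System.out.println(sub);  // placeholder for the real per-item work
    }
}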

vtd-xml-author
A: 

I wrote such a library a long, long time ago (6+ years ago, for Java 1.4). Since I finished my PhD it has been sitting untouched, and it does not work on modern JVMs because it uses internal APIs to invoke javac on generated Java code.

RP Bourret maintained a list of data binding related tools that may be of interest.

I would recommend the Apache Commons Digester project, as it builds on top of SAX. An oldish tutorial shows its use. The main point is that you set up a mapping from nested element patterns in the XML to actions in Java (e.g. create a new object, set a field) in order to build your data structure, and you can hook your per-item processing into that system, as sketched below.
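
A rough sketch of that for the structure in the question (the Element and Processor classes are mine; the rule-registration calls are the standard Digester ones, as far as I remember):

import java.io.File;
import org.apache.commons.digester.Digester;

public class DigesterDemo {

    // One record from the XML: <element><sub>foo</sub></element>
    public static class Element {
        private String sub;
        public void setSub(String sub) { this.sub = sub; }
        public String getSub() { return sub; }
    }

    // Receives each completed Element; old instances can then be garbage collected.
    public static class Processor {
        public void process(Element e) {
            System.out.println(e.getSub());  // placeholder for the real per-item work
        }
    }

    public static void main(String[] args) throws Exception {
        Digester d = new Digester();
        Processor p = new Processor();
        d.push(p);  // bottom of the stack: completed elements are handed to this object

        d.addObjectCreate("root/element", Element.class);     // new Element per <element>
        d.addBeanPropertySetter("root/element/sub", "sub");   // <sub> text -> setSub()
        d.addSetNext("root/element", "process");              // finished Element -> Processor.process()

        d.parse(new File("huge.xml"));
    }
}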

Note that the Digester rules are not generated from a schema the way JAXB classes are, but given the simplicity of the structure and the size of the input, I don't think that should be a major concern.

grrussel