I have to objectify (build an object tree from) very simple, small XML documents (less than 1k, and almost SGML: no namespaces, plain UTF-8, you name it...), read from a stream, in Java.

I am using JAXP to process the data from my stream into a Document object. I have tried Xerces, but it's way too big and slow. I am now using dom4j, but I am still spending way too much time in org.dom4j.io.SAXReader.

Does anybody out there have a suggestion for a faster, more efficient implementation, keeping in mind that I have very tight CPU and memory constraints?

[Edit 1] Keep in mind that my documents are very small, so the overhead of starting the parser can be significant. For instance, I am spending as much time in org.xml.sax.helpers.XMLReaderFactory.createXMLReader as in org.dom4j.io.SAXReader.read.
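To show the kind of amortization I mean: with plain JAXP one can pay the factory/builder creation cost once and reuse the builder across documents, calling reset() between parses. This is only a sketch of the idea (the class name and sample XML are made up), not my actual dom4j pipeline:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Sketch: pay the factory/builder creation cost once, then reuse the builder.
public class ReusableParser {
    // Created once. Note: DocumentBuilder is not thread-safe, so this
    // single static instance assumes single-threaded parsing.
    private static final DocumentBuilder BUILDER;
    static {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(false); // my documents have no namespaces
            BUILDER = f.newDocumentBuilder();
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static Document parse(InputStream in) throws Exception {
        BUILDER.reset(); // clear any state left over from the previous document
        return BUILDER.parse(in);
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<doc><value>42</value></doc>".getBytes(StandardCharsets.UTF_8);
        Document d1 = parse(new ByteArrayInputStream(xml));
        Document d2 = parse(new ByteArrayInputStream(xml)); // no factory lookup this time
        System.out.println(d1.getDocumentElement().getTagName());
        System.out.println(d2.getElementsByTagName("value").item(0).getTextContent());
    }
}
```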

[Edit 2] The result has to be in DOM format, as I pass the document to decision tools that do arbitrary processing on it, such as branching based on the values of arbitrary XPaths, but also extracting lists of values packed as children of a predefined node.
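For context, the downstream usage looks roughly like this (the XPaths and XML here are invented examples, using the JDK's javax.xml.xpath against an already-parsed W3C Document):

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathDecisions {
    // Evaluate an arbitrary XPath against an already-parsed document.
    public static String value(Document doc, String xpath) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        return (String) xp.evaluate(xpath, doc, XPathConstants.STRING);
    }

    // Extract the list of values packed as children of a predefined node.
    public static List<String> children(Document doc, String xpath) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xp.evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) out.add(nodes.item(i).getTextContent());
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<msg><type>order</type><items><item>a</item><item>b</item></items></msg>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        // Branch on the value of one XPath...
        System.out.println(value(doc, "/msg/type"));
        // ...and extract a packed list of child values with another.
        System.out.println(children(doc, "/msg/items/item"));
    }
}
```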

[Edit 3] In any case I eventually need to load/parse the complete document, since all the information it contains is going to be used at some point.

(This question is related to, but different from, http://stackoverflow.com/questions/373833/best-xml-parser-for-java )

+1  A: 

Generally speaking, Xerces is going to be the fastest you'll find. Also, in general, a SAX or pull parser should give you much better performance than a DOM parser.

Don Branson
As I understand it, Xerces is better for big, complex documents... And I need a DOM parser, as I have to pass my document down a chain of decision rules that perform actions based on the values of arbitrary XPaths.
Varkhan
I use it on small docs all the time. But it's possible that it's better for big stuff; I can't really speak to that. And if you need a DOM parser, well, they're all darned slow, which is why I almost always go the SAX route. Good luck, hope someone else has better info for you.
Don Branson
+3  A: 

Look into using StAX (Streaming API for XML) instead of SAX. It will be simpler than SAX, but not as heavyweight as a tree-based parser.
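To illustrate the difference: with the StAX cursor API (javax.xml.stream, in the JDK since Java 6) you pull events on demand instead of receiving SAX callbacks. A minimal sketch (class and method names are made up for the example):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxDemo {
    // Collect "element=text" pairs by pulling events from the stream,
    // with no handler class and no full in-memory tree.
    public static List<String> leaves(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance(); // cheap to cache and reuse
        XMLStreamReader r = factory.createXMLStreamReader(new StringReader(xml));
        List<String> out = new ArrayList<>();
        String current = null;
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                current = r.getLocalName();
            } else if (event == XMLStreamConstants.CHARACTERS
                    && !r.isWhiteSpace() && current != null) {
                out.add(current + "=" + r.getText());
            }
        }
        r.close();
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(leaves("<doc><a>1</a><b>2</b></doc>")); // [a=1, b=2]
    }
}
```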

Joshua
I did not know about StAX, but at first glance I would guess that the many passes back over the data that each tool needs at some point are going to be costly...
Varkhan
No, in general implementations of the StAX and SAX APIs run about equally fast (Woodstox, for example, implements both, and access speed is about the same; StAX is only slightly faster). The differences are more between the XML libraries themselves. Aalto is the fastest one AFAIK.
StaxMan
+5  A: 

I would give XOM a try. It uses a SAX parser and builds a compact tree model on top of it. You can make it even faster by using these tips.

If you want to process the document on the fly, you can implement a custom NodeFactory to process the document while it is still being parsed by the SAX parser. This is easier than a custom SAX handler, because you can process whole elements after they are parsed; a SAX handler would need to track the matching start/end events instead.

If you are parsing multiple documents, you can reuse the Builder object to save time.

Peter Štibraný
A: 

If you really must create a DOM tree, your best bet is probably just using Xerces. It's a decent parser (not the fastest, but quite fast). But with DOM come heavy memory usage and sub-standard speed; that cannot be avoided. Using JDOM/XOM/dom4j makes no sense if there is such a DOM limitation; otherwise XOM is very good. But in this case you'd be converting from one tree model to another, and that's a heavyweight operation.

It's worth noting that there is no such thing as a DOM parser: all actual XML parsers are built on streaming APIs (SAX, StAX, or XmlPull). You can also build a DOM tree using StAX parsers, but the overhead really is due to DOM, not the parser. So just using Xerces+DOM is a reasonable way to do it.
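For what it's worth, one JDK-only way to build a W3C DOM from a StAX parser is the identity transform from a StAXSource into a DOMResult. This is just a sketch of the wiring (whether it beats a plain DocumentBuilder is worth measuring; the class name and XML are invented):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;

public class StaxToDom {
    public static Document toDom(String xml) throws Exception {
        // A freshly created reader is positioned at START_DOCUMENT,
        // which is what StAXSource requires.
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        // The identity transform pipes StAX events straight into a new DOM tree.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new StAXSource(reader), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        Document doc = toDom("<root><child>hi</child></root>");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```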

StaxMan
+2  A: 

Have a look at VTD-XML; it's "the world's fastest and most memory-efficient XML processor", as their site says.

It doesn't support standard DOM, but it has its own techniques for Node traversal and it supports XPath. It supports incremental updates of the document.

It has implementations for Java, C and C#.

It's based on an innovative technique called "Virtual Token Descriptor".

Bahaa Zaid
As long as "...their site says" is taken with the usual grain of salt. As in: without someone else replicating the results, it's hardly objective. _EVERYONE_ claims theirs is the fastest and has "revolutionary" features and techniques.
StaxMan
Yes, I agree with you.
Bahaa Zaid
If this were Wikipedia, this would be pulled on account of reading like an advertisement. Do you have practical experience using it? If so, could you share some of your experiences: ease of use, benchmarks, hurdles?
toolbear74
No, I didn't use it for serious programming. I just saw the benchmarks on their site, and the idea behind VTD looks faster if you do some big-O analysis on it.
Bahaa Zaid
For what it's worth, this won't work as per "must be DOM". Which of course means that almost ALL alternatives are disqualified! Except for Xerces and... well... Crimson, which is not supported anymore. So Xerces it has to be.
StaxMan
+2  A: 

Not sure if it meets all of your requirements, but I've had very good results in terms of both speed and memory consumption (on XML documents both extremely large and very small) with Nux.

One of its design points was an "application router" that processes lots of small XML messages efficiently. It offers XPath query capability as well as DOM-like access to parent, child, and sibling nodes (depending upon which parsing mechanism you choose).

David
A: 

If you want fast, you need to compile the DTD or schema into a recursive descent parser. Such parsers can be 2-3x faster than general parsing engines, by virtue of knowing exactly what choices come next. I'd look at tools like XML Thunder (EDIT: oops, this isn't for Java) or XML Booster.

EDIT 2: Just noticed the change requiring it to be DOM-compatible. I don't think these are. But the OP is asking, IMHO, for a bad combination: small/fast and DOM-compatible. Pick one.
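To make the idea concrete, here is a toy hand-rolled version of what a schema-compiled parser does for a fixed `<point><x>…</x><y>…</y></point>` shape. Every method knows exactly which tag must come next, so there is no generic event dispatch (purely illustrative; no real-world error handling, and not DOM-compatible):

```java
public class PointParser {
    private final String s;
    private int pos;

    private PointParser(String s) { this.s = s; }

    // The fixed grammar is baked into the call sequence:
    // expect <point>, read x, read y, expect </point>.
    public static int[] parse(String xml) {
        PointParser p = new PointParser(xml);
        p.expect("<point>");
        int x = p.intElement("x");
        int y = p.intElement("y");
        p.expect("</point>");
        return new int[] { x, y };
    }

    // Read <name>integer</name>; the tag name is known at compile time.
    private int intElement(String name) {
        expect("<" + name + ">");
        int end = s.indexOf('<', pos);
        int value = Integer.parseInt(s.substring(pos, end).trim());
        pos = end;
        expect("</" + name + ">");
        return value;
    }

    private void expect(String token) {
        if (!s.startsWith(token, pos))
            throw new IllegalArgumentException("expected " + token + " at offset " + pos);
        pos += token.length();
    }

    public static void main(String[] args) {
        int[] xy = parse("<point><x>3</x><y>4</y></point>");
        System.out.println(xy[0] + "," + xy[1]); // 3,4
    }
}
```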

Ira Baxter
But XML Thunder seems to be for C, Cobol... not really useful with Java.
StaxMan
My, my. Actually, I've thought for years they did Java. My error! ... Aha, looks like I had this one confused with another. I've edited my answer.
Ira Baxter
A: 

Look at the XML parser benchmark results on the Piccolo parser site. I have also used Piccolo itself, and it was the best in terms of speed for my data.

Alexey Gopachenko