tags:

views:

101

answers:

7

I have huge XML files, up to 1-2 GB, and obviously I can't parse a whole file at once; I'd have to split it into parts, then parse the parts and do whatever with them.

How can I count the number of occurrences of a certain node, so I can keep track of how many parts I need to split the file into? Is there maybe a better way to do this? I'm open to all suggestions, thank you.

Question update:

Well, I did use StAX; maybe the logic I'm using it for is wrong. I'm parsing the file, then for each node I'm getting the node value and storing it in a StringBuilder. Then in another method I go through the StringBuilder and edit the output. Then I write that output to the file. I can do no more than 10000 objects like this.

Here is the exception I get:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at com.sun.org.apache.xerces.internal.util.NamespaceSupport.<init>(Unknown Source)
        at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.setNamespaceContext(Unknown Source)
        at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.getXMLEvent(Unknown Source)
        at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.allocate(Unknown Source)
        at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.StAXEvent2SAX.bridge(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.StAXEvent2SAX.parse(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)

Actually I think my whole approach is wrong. What I'm actually trying to do is convert XML files into CSV samples. Here is how I do it so far:

  • Read/parse the XML file
  • For each element node, get the text node value
  • Open a stream and write it to a temp file; after n nodes, flush and close the stream
  • Then open another stream reading from the temp file, use the Commons strip utils and some other stuff to create proper CSV output, then write it to the CSV file
A: 

You'd be better off using an event-based parser such as SAX.

spender
+3  A: 

The SAX or StAX APIs would be your best bet here. They don't parse the whole thing at once; they take one node at a time and let your app process it. They're good for arbitrarily large documents.

SAX is the older API and works on a push model; StAX is newer and is a pull parser, and is therefore rather easier to use. For your requirements, either one would be fine.

See this tutorial to get you started with StAX parsing.
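
If you go the cursor-based StAX route, the counting part can be as small as something like this. It's just a sketch: the file name and the element name "record" stand in for whatever you actually want to count.

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxElementCounter {
        public static void main(String[] args) throws Exception {
            // The cursor API keeps only the current event in memory,
            // so the size of the file doesn't matter.
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream("huge.xml"));

            long count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    count++;
                }
            }
            reader.close();
            System.out.println("record count: " + count);
        }
    }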

skaffman
+1 for mentioning that StAX (pull) is easier to use than SAX.
naikus
+1  A: 

I think you want to avoid creating a DOM, so SAX or StAX should be good choices.

With SAX, just implement a simple ContentHandler that increments a counter whenever an interesting element is found.
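
A rough sketch of such a handler, assuming the element you want to count is called "record" (adjust the name and the file path to your data):

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class CountingHandler extends DefaultHandler {
        private long count = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            // Count every start tag of the element we care about.
            if ("record".equals(qName)) {
                count++;
            }
        }

        public static void main(String[] args) throws Exception {
            CountingHandler handler = new CountingHandler();
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new java.io.File("huge.xml"), handler);
            System.out.println("record count: " + handler.count);
        }
    }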

Andreas_D
+2  A: 

You can use a streaming parser like StAX for this. It will not require you to read the entire file into memory at once.

Gerco Dries
+1  A: 

With SAX you don't have to split the file: It's streaming, so it holds only the current bits in memory. It's very easy to write a ContentHandler that just does the counting. And it's very fast (in my experience, almost as fast as simply reading the file).

Chris Lercher
A: 

I think splitting the file is not the way to go. You'd be better off handling the XML file as a stream and using the SAX API (and not the DOM API).

Even better, you should use XQuery to handle your requests.

Saxon is a good Java/.NET implementation (using SAX) that is amazingly fast, even on big files. The HE version is under the MPL open-source license.

Here is a little example:

java -cp saxon9he.jar net.sf.saxon.Query -qs:"count(doc('/path/to/your/doc/doc.xml')//YouTagToCount)"
alci
A: 

Well, I did use StAX; maybe the logic I'm using it for is wrong. I'm parsing the file, then for each node I'm getting the node value and storing it in a StringBuilder. Then in another method I go through the StringBuilder and edit the output. Then I write that output to the file. I can do no more than 10000 objects like this.

By this description, I'd say yes, the logic you're using it for is wrong. You're holding on to too much in memory.

Rather than parsing the entire file, storing all the node values into something and then processing the result, you should handle each node as you hit it, and output while parsing.

With more details on what you're actually trying to accomplish and what the input XML and desired output look like, we could probably help streamline it.
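
Just to illustrate "output while parsing", here's an untested sketch that writes one CSV row per record as it goes, instead of buffering everything. It assumes a flat structure of "record" elements whose children are text-only, and it skips proper CSV escaping (which you could still do with the Commons utilities you mentioned).

    import java.io.BufferedWriter;
    import java.io.FileInputStream;
    import java.io.FileWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class XmlToCsv {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream("huge.xml"));
            BufferedWriter out = new BufferedWriter(new FileWriter("out.csv"));

            StringBuilder row = new StringBuilder();
            boolean inRecord = false;

            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if ("record".equals(reader.getLocalName())) {
                        inRecord = true;
                        row.setLength(0);                  // start a fresh row
                    } else if (inRecord) {
                        if (row.length() > 0) {
                            row.append(',');
                        }
                        // one column per text-only child element;
                        // real code would escape commas/quotes here
                        row.append(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    out.write(row.toString());             // write the row immediately
                    out.newLine();
                    inRecord = false;
                }
            }
            out.close();
            reader.close();
        }
    }

The point is that nothing grows with the size of the input: the StringBuilder only ever holds one row, and the writer flushes to disk as it fills up.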

Don Roby