ansaurus

Question

How to read a large XML file consisting of a large number of small items efficiently in Java?

Answer 1

+1 A:

When the input is large, sequential (a.k.a. stream) processing of the document is generally what's called for. It's true that SAX can become a bit messy (or at least require a fair bit of code) because you basically have to build a state machine to do the extraction. If you use an XML pull parser rather than an event-based (push) implementation, you may find this approach somewhat simpler to work with.
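To illustrate the state-machine point, here is a minimal SAX sketch. The document layout (`<items>` containing `<item>` elements with a `<name>` child) and the class name `SaxItemNames` are assumptions made for the example, not something given in the question. The two boolean fields are the "state" the handler has to track by hand.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Assumed (hypothetical) layout: <items><item><name>...</name></item>...</items>
class SaxItemNames extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inItem; // state: currently inside an <item>
    private boolean inName; // state: currently inside a <name> within an <item>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default factory is not namespace-aware, so qName carries the element name.
        if ("item".equals(qName)) {
            inItem = true;
        } else if (inItem && "name".equals(qName)) {
            inName = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several chunks, so accumulate.
        if (inName) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inName && "name".equals(qName)) {
            names.add(text.toString());
            inName = false;
        } else if ("item".equals(qName)) {
            inItem = false;
        }
    }

    // Collects the names into a list for brevity; for a truly large file you
    // would handle each item in endElement instead of keeping them all.
    static List<String> parse(InputStream in) throws Exception {
        SaxItemNames handler = new SaxItemNames();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.names;
    }
}
```

Every additional field you want to extract adds another flag or another branch, which is where the messiness comes from.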

Your idea to extract the contents of the item elements is possible as well, using SAX for the first step, and may strike an acceptable balance between event/pull parsing and the flexibility of full DOM access. (It will still be much slower than pure event/pull parsing because of the heavy allocation, but at least the requirement to keep everything in memory at the same time is lifted.)

Cumbayah 2010-09-06 18:15:34

Any pointers for good XML pull parsers?

Juha Syrjälä 2010-09-06 18:30:04

@axtavt already gave a suggestion. I used http://www.xmlpull.org/ once but am not sure of its status nowadays.

Cumbayah 2010-09-06 18:48:30

Answer 2

+2 A:

Java 6 has StAX support. It performs stream processing like SAX, but uses a pull-based approach, which leads to simpler handling code.
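A minimal sketch of what the pull-based style can look like with the `javax.xml.stream` cursor API. It assumes a hypothetical layout where each `<item>` has a text-only `<name>` child; the class and method names are made up for the example.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Assumed (hypothetical) layout: <items><item><name>...</name></item>...</items>
class StaxItemNames {

    static List<String> parse(Reader in) throws XMLStreamException {
        List<String> names = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            // The application asks for the next event; nothing is pushed at it.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    names.add(readItemName(reader));
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    // Consumes events up to the closing </item> and returns the text of its
    // <name> child, or null if the item has none.
    private static String readItemName(XMLStreamReader reader) throws XMLStreamException {
        String name = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT && "name".equals(reader.getLocalName())) {
                name = reader.getElementText(); // leaves the reader on </name>
            } else if (event == XMLStreamConstants.END_ELEMENT && "item".equals(reader.getLocalName())) {
                break;
            }
        }
        return name;
    }
}
```

Because the code drives the parser, the "where am I" state lives in ordinary control flow (the call to `readItemName`) instead of in handler fields. As with any streaming approach, you would normally process each item as it is read rather than collect them all in a list.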

axtavt 2010-09-06 18:31:54

Answer 3

A:

I have not tried that, but... If your XML files always have the same format, you could parse them yourself with a BufferedReader, looking for <item> tags, and store the item content in a StringBuffer. You could then parse each string (with item as the root element) with a DOM parser, and process it. You need only one DocumentBuilder for all the items.

The advantage of this method is that you would parse the file quickly without any memory issue, and have the convenience of a DOM tree. The drawback is that you would not be doing real XML parsing: if the XML is not exactly what you expect (is <item/> possible?), your program might crash.

The problem here is that you need to treat some XML elements (the ones inside the items) as if they were not XML elements when you first parse the file. If you could find another way to do that, you could use SAX to parse the file, get the item content as strings in a safe way, and parse the items with a DOM parser as described above.

I guess another option would be to use SAX or StAX and create DOM trees for the items based on the related events. But it might get complex if the XML vocabulary has many kinds of elements.
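A rough sketch of that last option, using StAX to stream through the file and building a small DOM tree for one item at a time. It is untested against real data and makes simplifying assumptions: the element name `item` is hypothetical, namespaces are ignored, and comments and processing instructions inside items are dropped. `ItemDomStreamer` and `forEachItem` are names invented for the example.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class ItemDomStreamer {

    // Streams through the input and hands each <item> to the callback as a
    // standalone DOM element; only one item's tree is built at a time.
    static void forEachItem(Reader in, Consumer<Element> action) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    Document doc = builder.newDocument();
                    doc.appendChild(buildElement(reader, doc));
                    action.accept(doc.getDocumentElement());
                }
            }
        } finally {
            reader.close();
        }
    }

    // Expects the reader on a START_ELEMENT; returns with it on the matching END_ELEMENT.
    private static Element buildElement(XMLStreamReader reader, Document doc) throws Exception {
        Element element = doc.createElement(reader.getLocalName());
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            element.setAttribute(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
        }
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    element.appendChild(buildElement(reader, doc)); // recurse into child
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    element.appendChild(doc.createTextNode(reader.getText()));
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    return element;
                default:
                    break; // comments, processing instructions, etc. are ignored here
            }
        }
        throw new IllegalStateException("Document ended inside <" + element.getTagName() + ">");
    }
}
```

Because the builder is generic (it copies whatever elements, attributes, and text it sees), the amount of code does not grow with the number of element types; the complexity mentioned above mostly shows up if you need namespaces or want to map elements to typed objects.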

Damien 2010-09-07 13:27:07

Answer 4

+1 A:

Refer to this answer by skaffman for the JAXB approach:

http://stackoverflow.com/questions/1134189/can-jaxb-parse-large-xml-files-in-chunks/1134203#1134203

Blaise Doughan 2010-09-10 20:35:46

This seems to be what I was looking for. I'll look more into it next week.

Juha Syrjälä 2010-09-11 08:57:39
