I have a large XML file that consists of items of relatively fixed size, i.e.:

<rootElem>
  <item>...</item>
  <item>...</item>
  <item>...</item>
</rootElem>

The item elements are relatively shallow and typically rather small (<100 KB), but there may be a lot of them (hundreds of thousands). The items are completely independent of each other.

How can I process the file efficiently in Java? I can't read the whole file in as a DOM, and I'd rather not use SAX because the code gets rather complex. I'd also like to avoid splitting the file into smaller pieces.

Ideally, I could obtain each item element, one at a time, as a separate DOM document that I could process using tools like JAXB. Basically, I just want to loop once over all the items.

I would think that this is a rather common problem.

+1  A: 

When the input is large, sequential (a.k.a. stream) processing of the document is generally what's called for. It's true that SAX can become a bit messy (or at least require a fair bit of code), because you basically have to build a state machine to do the extraction. If you look at XML pull parsers rather than event-based implementations, you may at least find this approach slightly simpler to work with.

Your idea of extracting the contents of the item elements is possible as well, using SAX for the first step, and may strike an acceptable balance between event/pull parsing and the flexibility of full DOM access. (It will still be way slower than pure event/pull parsing because of the heavy allocation, but at least the requirement to keep everything in memory at the same time is lifted.)

Cumbayah
Any pointers for good XML pull parsers?
Juha Syrjälä
@axtavt already gave a suggestion. I used http://www.xmlpull.org/ once but am not sure of its status nowadays.
Cumbayah
+2  A: 

Java 6 has StAX support. It performs stream processing like SAX, but uses a pull-based approach, which leads to simpler handling code.
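
To give an idea of what the pull style looks like, here is a minimal sketch that loops once over the file and counts the item elements directly under the root. The class and method names (StaxItemLoop, countItems) are made up for illustration; only the javax.xml.stream calls are standard API.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class StaxItemLoop {

    // Counts the <item> elements that are direct children of the root element.
    static int countItems(Reader in) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int depth = 0;
        int count = 0;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    // depth 1 is the root, depth 2 is a child of the root
                    if (depth == 2 && "item".equals(reader.getLocalName())) {
                        count++; // a real program would process the item here
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rootElem><item>a</item><item>b</item></rootElem>";
        System.out.println(countItems(new StringReader(xml)));
    }
}
```

The code asks for the next event itself instead of receiving callbacks, so the "where am I" state can live in ordinary local variables (here just depth) rather than in handler fields.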

axtavt
A: 

I have not tried this, but... If your XML files always have the same format, you could scan them yourself with a BufferedReader, looking for <item> tags, and store each item's content in a StringBuffer. You could then parse each string (with item as the root element) with a DOM parser and process it. You need only one DocumentBuilder for all the items.

The advantage of this method is that you would get through the file quickly without any memory issues, and still have the convenience of a DOM tree. The drawback is that you would not be doing real XML parsing: if the XML is not exactly what you expect (is <item/> possible?), your program might crash.
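
A rough, untested-in-production sketch of that idea follows. It assumes every item starts with exactly <item> (no attributes, no <item/>) and ends with </item>, and that items are never nested. The names NaiveItemSplitter and forEachItem are invented for the example, and I used StringBuilder (the unsynchronized sibling of StringBuffer) for the buffer.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; it does plain text scanning, not real XML parsing.
class NaiveItemSplitter {
    static final String OPEN = "<item>";   // assumed exact start tag
    static final String CLOSE = "</item>"; // assumed exact end tag

    // Hands each item to the handler as its own DOM document; returns the item count.
    static int forEachItem(Reader in, Consumer<Document> handler) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        BufferedReader reader = new BufferedReader(in);
        StringBuilder buffer = new StringBuilder();
        char[] chunk = new char[8192];
        int count = 0;
        int n;
        while ((n = reader.read(chunk)) != -1) {
            buffer.append(chunk, 0, n);
            while (true) {
                int start = buffer.indexOf(OPEN);
                if (start < 0) {
                    // No start tag yet: keep only a short tail in case "<item>" straddles two chunks.
                    int keep = Math.min(buffer.length(), OPEN.length() - 1);
                    buffer.delete(0, buffer.length() - keep);
                    break;
                }
                int end = buffer.indexOf(CLOSE, start);
                if (end < 0) {
                    buffer.delete(0, start); // item incomplete: drop the prefix and read more
                    break;
                }
                String itemXml = buffer.substring(start, end + CLOSE.length());
                handler.accept(builder.parse(new InputSource(new StringReader(itemXml))));
                count++;
                buffer.delete(0, end + CLOSE.length());
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rootElem>\n  <item><a>1</a></item>\n  <item><a>2</a></item>\n</rootElem>";
        forEachItem(new StringReader(xml),
                doc -> System.out.println(doc.getDocumentElement().getTextContent()));
    }
}
```

Memory use is bounded by roughly one item plus one chunk, but anything that only looks like an item boundary, for example the text </item> inside a CDATA section or a comment, will confuse it.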

The problem here is that you need to treat some XML elements (the ones inside the items) as if they were not XML elements when you first parse the file. If you could find another way to do that, you could use SAX to parse the file, get the item content as strings in a safe way, and parse the items with a DOM parser as described above.

I guess another option would be to use SAX or StAX and build a DOM tree for each item from the corresponding events. But that might get complex if the XML vocabulary has many kinds of elements.
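
For what it's worth, a generic copy routine does not need to know the element names, so this option may be less work than it sounds. Below is a minimal sketch using StAX; it deliberately ignores namespaces, comments and processing instructions, and the names ItemDomReader, copyElement and forEachItem are invented for the example.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of any library.
class ItemDomReader {

    // Copies the element the reader is positioned on, with its subtree, into doc.
    // On return the reader is positioned on the matching END_ELEMENT.
    static Element copyElement(XMLStreamReader reader, Document doc) throws Exception {
        Element element = doc.createElement(reader.getLocalName());
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            element.setAttribute(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
        }
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                element.appendChild(copyElement(reader, doc)); // recurse into the child
            } else if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                element.appendChild(doc.createTextNode(reader.getText()));
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                break; // end tag of the element being copied
            }
        }
        return element;
    }

    // Hands each <item> to the handler as its own DOM document; returns the item count.
    static int forEachItem(Reader in, Consumer<Document> handler) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int count = 0;
        try {
            reader.nextTag(); // move to the root start tag
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT && "item".equals(reader.getLocalName())) {
                    Document doc = builder.newDocument();
                    doc.appendChild(copyElement(reader, doc));
                    handler.accept(doc);
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rootElem><item id=\"1\"><name>a</name></item></rootElem>";
        forEachItem(new StringReader(xml),
                doc -> System.out.println(doc.getDocumentElement().getAttribute("id")));
    }
}
```

Only one item's DOM is alive at a time, so memory stays proportional to the largest item rather than to the whole file.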

Damien
+1  A: 

Refer to this answer by skaffman for the JAXB approach:

Blaise Doughan
This seems to be what I was looking for. I'll look more into it next week.
Juha Syrjälä