tags:

views: 91

answers: 4

I have a large XML file (1 GB). I need to run many queries against this file (using XPath, for example). The results are small parts of the XML. I want the queries to be as fast as possible, but the 1 GB file is probably too large to fit in working memory.

The XML looks something like this:

<all>
  <record>
      <id>1</id>
      ... lots of fields. (Very different fields per record, including (sometimes) subrecords,
      so mapping onto a relational database would be hard.)
  </record>
  <record>
      <id>2</id>
      ... lots of fields.
  </record>
  .. lots and lots and lots of records
</all>

I need random access, selecting records using, for instance, the id as a key (the id is the most important key, but other fields might be used as keys too). I don't know the queries in advance; they arrive and have to be executed as soon as possible, in real time rather than in batches. SAX does not look very promising, because I don't want to reread the entire file for every query. But DOM doesn't look very promising either, because the file is very large, and the overhead of the additional structure almost certainly means it will not fit in working memory.

Which Java library or approach would be best suited to this problem?

A: 

Piccolo is a small, extremely fast XML parser for Java. It implements the SAX 1, SAX 2.0.1, and JAXP 1.1 (SAX parsing only) interfaces as a non-validating parser. It is available under the Apache License.
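
A minimal sketch of plugging Piccolo in as the SAX driver; the driver class name com.bluecast.xml.Piccolo is taken from memory of Piccolo's documentation and should be verified against the distribution, and the file name is a placeholder:

import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class PiccoloSketch {
    public static void main(String[] args) throws Exception {
        // Ask SAX for Piccolo's driver explicitly; the class name is an assumption.
        XMLReader reader = XMLReaderFactory.createXMLReader("com.bluecast.xml.Piccolo");
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     org.xml.sax.Attributes atts) {
                System.out.println("element: " + qName);
            }
        });
        reader.parse("records.xml"); // placeholder file name
    }
}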

venJava
The last release of Piccolo is from 2004 and there are open bug reports that are several years old, so I would not recommend using it.
Jörn Horstmann
+3  A: 

When handling XML you generally have two approaches: streaming (SAX) or loading the entire document into memory (various DOM implementations).

If you can pre-establish a set of queries to be processed in bulk, you could write a program that uses SAX to stream the file, looking for matches (a rough sketch of that approach is shown below). If the queries arrive at random intervals (i.e., a typical database application), then you will need to either load the entire document into memory or preprocess the XML document into a database of some kind.
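
A minimal sketch of the streaming approach, assuming the <all>/<record>/<id> layout from the question and that <id> appears only directly under <record>; the file name and target id are placeholders, and note that every query rescans the whole file:

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Streams the file with SAX and reports records whose <id> matches a target value.
public class RecordFinder extends DefaultHandler {
    private final String targetId;
    private final StringBuilder text = new StringBuilder();
    private boolean inRecord;
    private String currentId;

    RecordFinder(String targetId) { this.targetId = targetId; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("record".equals(qName)) { inRecord = true; currentId = null; }
        text.setLength(0); // start collecting the text of the current element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (inRecord && "id".equals(qName)) {
            currentId = text.toString().trim();
        } else if ("record".equals(qName)) {
            if (targetId.equals(currentId)) {
                System.out.println("found record with id " + currentId);
            }
            inRecord = false;
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("records.xml"), new RecordFinder("2"));
    }
}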

A better description of what you're trying to accomplish might help get better answers.

Jim Garrison
+1 for the better description for better answers ...
Xavier Combelle
A: 

Depending on the application, an XML-oriented database such as eXist (http://exist.sourceforge.net/) could be interesting.
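
A minimal sketch of that route, assuming the document has already been loaded into eXist and that its REST interface is reachable at the URL below with a _query parameter; both the endpoint and the parameter name are assumptions to check against the eXist documentation:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class ExistQuerySketch {
    public static void main(String[] args) throws Exception {
        // XPath over the stored document; endpoint and parameter name are assumptions.
        String xpath = "/all/record[id = '2']";
        URL url = new URL("http://localhost:8080/exist/rest/db/records.xml?_query="
                + URLEncoder.encode(xpath, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // the matching <record> fragments
            }
        }
    }
}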

Xavier Combelle
+1  A: 

vtd-xml is the best fit for your use case. http://vtd-xml.sourceforge.net/
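
A minimal sketch of what that could look like, assuming the VTD-XML classes VTDGen, VTDNav and AutoPilot behave as I recall (parse the file once, keep the resulting index around, and answer XPath queries against it); the file name and the XPath are placeholders:

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdLookupSketch {
    public static void main(String[] args) throws Exception {
        VTDGen gen = new VTDGen();
        // Parse once; the in-memory index can then serve many queries.
        if (!gen.parseFile("records.xml", false)) { // false = no namespace awareness
            throw new RuntimeException("parse failed");
        }
        VTDNav nav = gen.getNav();
        AutoPilot ap = new AutoPilot(nav);
        ap.selectXPath("/all/record[id = '2']"); // placeholder query
        while (ap.evalXPath() != -1) {
            nav.push(); // save the cursor before navigating into the match
            if (nav.toElement(VTDNav.FIRST_CHILD, "id")) {
                int token = nav.getText();
                if (token != -1) {
                    System.out.println("matched record id = " + nav.toString(token));
                }
            }
            nav.pop(); // restore the cursor for the next match
        }
    }
}

Note that, as far as I know, standard VTD-XML still keeps the whole document plus its location index in memory, so check that the 1 GB file fits; there is an extended edition aimed at larger files.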

Pangea
This looks promising. I'll look into it, and if it suits my needs I'll mark the question as answered.
Jan