views: 127
answers: 9

Hi,

I need an XML parser to parse a file that is approximately 1.8 GB,
so the parser should not load the whole file into memory.

Any suggestions?

+1  A: 

Use almost any SAX parser to stream the file a bit at a time.
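
For instance, a minimal sketch using the parser built into the JDK (the file name huge.xml and the empty DefaultHandler are just placeholders; you'd plug in your own handler):

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class SaxStreamExample {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // The handler receives events as the parser streams through the file;
        // the whole document is never held in memory.
        parser.parse(new File("huge.xml"), new DefaultHandler());
    }
}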

Nick Fortescue
+3  A: 

Stream the file into a SAX parser and read it into memory in chunks.

SAX gives you a lot of control, and being event-driven makes sense here. The API is a little hard to get a grip on (you have to pay attention to some things, like when the characters() method is called), but the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. So you can keep track of the current XPath in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
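
A rough sketch of that kind of content handler (the "record" element name and the process() method are made up for illustration):

import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();  // current position in the document
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        path.push(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so accumulate.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("record".equals(qName)) {          // hypothetical "chunk" element
            process(text.toString());          // hand the chunk off, then forget it
        }
        path.pop();
    }

    private void process(String chunk) { /* save to DB, write to a file, etc. */ }
}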

Nathan Hughes
+10  A: 

Use a SAX-based parser that presents you with the contents of the document as a stream of events.

andrewmu
+2  A: 

Try VTD-XML. I've found it to be more performant and, more importantly, easier to use than SAX.
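
A sketch of typical VTD-XML usage (the file name and the XPath are made up for illustration; check the VTD-XML docs for the details):

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdExample {
    public static void main(String[] args) throws Exception {
        VTDGen gen = new VTDGen();
        if (gen.parseFile("huge.xml", true)) {   // true = namespace aware
            VTDNav nav = gen.getNav();
            AutoPilot ap = new AutoPilot(nav);
            ap.selectXPath("/root/record");      // hypothetical XPath
            while (ap.evalXPath() != -1) {
                int t = nav.getText();
                if (t != -1) {
                    System.out.println(nav.toString(t));
                }
            }
        }
    }
}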

dogbane
+3  A: 

As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly, store it someplace else (a database, another file, what have you).

You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.

If you're spooling to a DB, take some care to make your process restartable. A lot can go wrong partway through 1.8 GB.
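
One way to get that restartability, sketched under the assumption that you can count records and persist a checkpoint somewhere (all the names here are hypothetical): commit in batches and remember how many records have already been loaded, so a re-run can skip them.

import java.util.ArrayList;
import java.util.List;

/** Checkpointed batch writer: a sketch, not tied to any particular DB API. */
public class RestartableSpooler {
    private static final int BATCH_SIZE = 1000;
    private final List<String> batch = new ArrayList<>();
    private long recordCount = 0;
    private final long alreadyCommitted;       // loaded from a checkpoint at startup

    public RestartableSpooler(long alreadyCommitted) {
        this.alreadyCommitted = alreadyCommitted;
    }

    /** Called from the SAX handler each time a complete record has been read. */
    public void record(String record) {
        recordCount++;
        if (recordCount <= alreadyCommitted) {
            return;                            // skip records a previous run already stored
        }
        batch.add(record);
        if (batch.size() == BATCH_SIZE) {
            flush();
        }
    }

    /** Call once at the end of the document to store any partial batch. */
    public void finish() {
        if (!batch.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        insertBatch(batch);                    // one DB transaction per batch
        saveCheckpoint(recordCount);           // persist progress so a crash can resume
        batch.clear();
    }

    private void insertBatch(List<String> records) { /* e.g. a JDBC batch insert */ }
    private void saveCheckpoint(long count)        { /* write the count somewhere durable */ }
}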

Will Hartung
+6  A: 

Aside from the recommended SAX parsing, you could use the StAX API (a kind of evolution of SAX), which is included in the JDK (package javax.xml.stream).
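
A minimal pull-parsing sketch with the JDK's XMLStreamReader (the file name and the "record" element are placeholders):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxExample {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("huge.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {                        // pull events one at a time
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {  // hypothetical element
                    // getElementText() reads the element's text and advances past its end tag
                    System.out.println(reader.getElementText());
                }
            }
            reader.close();
        }
    }
}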

Tomas Narros
+3  A: 

The StAX API is easier to deal with than SAX. Here is a short tutorial.

Eugene Kuleshov
+10 for the useful tutorial
Tomas Narros
A: 

+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done), and it has (AFAIK) no limit on the size of the files it can process.

Chris
+1  A: 

I had a similar problem: I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to write to my output file, but which wasn't important for the algorithm).

Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4 GB and I had an 8 GB machine, so I figured maybe 3 GB of the file was just text, and java.lang.String would probably need 6 GB for that text, since it stores characters as UTF-16.

If the JVM takes up more space than the computer has physical RAM, the machine will swap. A mark-and-sweep garbage collection then results in pages being accessed in random order and objects being moved from one pool to another, which basically kills the machine.

So I decided to write all my strings out to disk in a file (the file system can obviously handle sequential writes of 3 GB just fine, and when reading it back the OS will use available memory as a file-system cache; there may still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.

StringsFile file = new StringsFile();
StringInFile str = file.newString("abc");        // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
Adrian Smith