views:

1376

answers:

8

I need to serialize a large amount of data (around 2 GB) of small objects into a single file, to be processed later by another Java process. Performance is important. Can anyone suggest a good method to achieve this?

+3  A: 

Have you taken a look at google's protocol buffers? Sounds like a use case for it.
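To give a feel for it, here is a minimal sketch of what a message definition might look like (the message and field names are made up for illustration, not taken from the question):

```proto
// Hypothetical record layout for the small objects being serialized.
syntax = "proto2";

message Record {
  required int64 id = 1;
  optional string name = 2;
  repeated double values = 3;
}
```

The protocol buffer compiler then generates Java classes with efficient binary read/write methods for this message.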

André
A: 

Have you tried Java serialization? You would write the objects out using an ObjectOutputStream and read them back in using an ObjectInputStream. Of course the classes would have to implement Serializable. It is the low-effort solution and, because the objects are stored in binary, it is compact and fast.
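A minimal round-trip sketch (the Point class is a made-up example; buffering the streams matters a lot when writing many small objects):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical small object; Serializable is required for ObjectOutputStream.
class Point implements Serializable {
    private static final long serialVersionUID = 1L;
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

public class SerializeDemo {
    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("points", ".bin");
        List<Point> points = Arrays.asList(new Point(1, 2), new Point(3, 4));

        // Wrap in BufferedOutputStream: unbuffered writes of many small
        // objects are dominated by per-call I/O overhead.
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(points);
        }

        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            @SuppressWarnings("unchecked")
            List<Point> back = (List<Point>) in.readObject();
            System.out.println(back.size());
        }
        Files.delete(file);
    }
}
```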

sblundy
A: 

You should probably consider a database solution. Optimizing the storage and retrieval of information is exactly what databases do, and if you use Hibernate you keep your object model as-is and barely have to think about the database (I believe that's why it's called Hibernate: just store your data away, then bring it back).

Bill K
A: 

The simplest approach that comes to mind is using NIO's memory-mapped buffers (java.nio.MappedByteBuffer). Use a single buffer roughly corresponding to the size of one object and flush/append it to the output file when necessary. Memory-mapped buffers are very efficient.
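A small sketch of writing fixed-size records through a mapped buffer (record layout is hypothetical; note that a single MappedByteBuffer is limited to Integer.MAX_VALUE bytes, so a 2 GB file would be mapped in chunks):

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class MappedWriteDemo {
    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("mapped", ".bin");
        int count = 1000;                        // illustrative; real data would be far larger
        int recordSize = 2 * Integer.BYTES;      // two ints per record
        long size = (long) count * recordSize;

        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping READ_WRITE beyond the current end extends the file.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            for (int i = 0; i < count; i++) {
                buf.putInt(i);
                buf.putInt(i * 2);
            }
            buf.force(); // flush mapped pages to disk
        }
        System.out.println(Files.size(file));
        Files.delete(file);
    }
}
```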

Sergey Mikhanov
A: 

Protocol buffers: makes sense. Here's an excerpt from their wiki: http://code.google.com/apis/protocolbuffers/docs/javatutorial.html

Getting More Speed

By default, the protocol buffer compiler tries to generate smaller files by using reflection to implement most functionality (e.g. parsing and serialization). However, the compiler can also generate code optimized explicitly for your message types, often providing an order of magnitude performance boost, but also doubling the size of the code. If profiling shows that your application is spending a lot of time in the protocol buffer library, you should try changing the optimization mode. Simply add the following line to your .proto file:

option optimize_for = SPEED;

Re-run the protocol compiler, and it will generate extremely fast parsing, serialization, and other code.

anjanb
There's a reason they have a preview for your text. Look at it before posting.
shoosh
Dude, did ya mean to shout?
sblundy
-1 for font abuse
shemnon
Hi All, I pressed SUBMIT by mistake. No, I didn't mean to shout at all :-(
anjanb
A: 

If performance is very important, then you need to write it yourself, using a compact binary format. With 2 GB of data, disk I/O dominates the cost, and a human-readable format like XML inflates the data by a factor of 2 or more.

Depending on the data, writing can be sped up by compressing on the fly with a low compression level.

Java serialization is a total no-go here, because on reading, Java checks every object to see whether it is a reference to an already-deserialized object.
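A sketch of the compact-binary-plus-light-compression idea using the JDK's Deflater at its fastest level (the int/double record layout is a made-up example; the reader must use the same fixed field order, since no type information is written to disk):

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.*;

public class CompactBinaryDemo {
    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("records", ".bin");

        // Write fixed-layout binary records through a fast, low-ratio deflater.
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try (DataOutputStream out = new DataOutputStream(
                new DeflaterOutputStream(
                        new BufferedOutputStream(Files.newOutputStream(file)),
                        deflater))) {
            for (int i = 0; i < 100; i++) {
                out.writeInt(i);          // e.g. a record id
                out.writeDouble(i * 0.5); // e.g. a measured value
            }
        }

        // Read back in the same fixed order.
        long sum = 0;
        try (DataInputStream in = new DataInputStream(
                new InflaterInputStream(
                        new BufferedInputStream(Files.newInputStream(file))))) {
            for (int i = 0; i < 100; i++) {
                sum += in.readInt();
                in.readDouble();
            }
        }
        System.out.println(sum);
        Files.delete(file);
    }
}
```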

Horcrux7
+3  A: 

I don't know why Java Serialization got voted down, it's a perfectly viable mechanism.

It's not clear from the original post, but is all 2G of data in the heap at the same time? Or are you dumping something else?

Out of the box, serialization isn't the "perfect" solution, but if you implement Externalizable on your objects, serialization can work just fine. Serialization's big expense is figuring out what to write and how to write it. By implementing Externalizable, you take those decisions out of its hands, gaining quite a boost in performance and a space savings.
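A sketch of what implementing Externalizable looks like (the Measurement class and its fields are hypothetical; note that Externalizable requires a public no-arg constructor):

```java
import java.io.*;
import java.nio.file.*;

public class ExternalizableDemo {

    // Externalizable hands the read/write decisions to your code: no
    // reflection over fields, and a smaller on-disk footprint.
    public static class Measurement implements Externalizable {
        long timestamp;
        double value;

        public Measurement() {}  // required by Externalizable
        Measurement(long t, double v) { timestamp = t; value = v; }

        @Override public void writeExternal(ObjectOutput out) throws IOException {
            out.writeLong(timestamp);
            out.writeDouble(value);
        }

        @Override public void readExternal(ObjectInput in) throws IOException {
            timestamp = in.readLong();
            value = in.readDouble();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("m", ".bin");
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(new Measurement(42L, 3.5));
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            Measurement m = (Measurement) in.readObject();
            System.out.println(m.timestamp + " " + m.value);
        }
        Files.delete(file);
    }
}
```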

While I/O is a primary cost of writing large amounts of data, the incidental costs of converting the data can also be very expensive. For example, you don't want to convert all of your numbers to text and then back again; it is better to store them in a more native format if possible. ObjectOutputStream and ObjectInputStream have methods to read and write Java's primitive types directly.

If all of your data is designed to be loaded into a single structure, you could simply do ObjectOutputStream.writeObject(yourBigDatastructure) after you've implemented Externalizable.

However, you could also iterate over your structure and call writeObject on the individual objects.

Either way, you're going to need some "objectToFile" routine, perhaps several. And that's effectively what Externalizable provides, as well as a framework to walk your structure.

The other issue, of course, is versioning, etc. But since you implement all of the serialization routines yourself, you have full control over that as well.

Will Hartung
Because all the cool kids are doing protocol buffers
sblundy
A: 

I developed JOAFIP as a database alternative.