I must serialize a huge tree of objects (7,000) to disk. Originally we kept this tree in a database with Kodo, but it would make thousands upon thousands of queries to load the tree into memory, and that would take a good part of the local universe's available time.

I tried serialization for this and indeed I get a performance improvement. However, I get the feeling that I could improve this by writing my own, custom serialization code. I need to make loading this serialized object as fast as possible.

On my machine, serializing / deserializing these objects takes about 15 seconds. When loading them from the database, it takes around 40 seconds.

Any tips on what I could do to improve this performance, taking into consideration that because the objects are in a tree, they reference each other?

+1  A: 

Have you tried compressing the stream (GZIPOutputStream)?
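A minimal sketch of what that could look like, assuming the tree root is Serializable. The class name GzipSerialization and its two methods are made up for illustration; only the java.io and java.util.zip classes are real. Whether it helps depends on whether loading is disk-bound or CPU-bound, so it is worth measuring both ways.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: wraps standard Java serialization in a GZIP stream.
final class GzipSerialization {

    // Serializes the object graph and compresses it on the way to the sink.
    static void write(Serializable root, OutputStream sink) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new BufferedOutputStream(sink)))) {
            out.writeObject(root);
        } // closing the outermost stream also finishes the GZIP trailer
    }

    // Decompresses and deserializes the object graph from the source.
    static Object read(InputStream source) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new BufferedInputStream(source)))) {
            return in.readObject();
        }
    }
}
```

For a file, pass a FileOutputStream to write and a FileInputStream to read.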

Maurice Perry
I need improved performance on loading and storing, but I didn't specify in the question and indeed space is a "performance" measure also.
Mario Ortegón
Less space means less disk access means less time
Maurice Perry
Only if the serialization process is disk-bound. It doesn't seem to be on my system; it seems to be CPU-bound, so compression will just slow it down further.
Seun Osewa
+1  A: 

This is how I would do it, off the top of my head:

Serialization

  1. Serialize each object individually
  2. Assign each object a unique key
  3. When an object holds a reference to another object, put the unique key for that object in the object's place in the serialization. (I would use a UUID converted to binary)
  4. Save each object into a file/database/storage using the unique key

Unserialization

  1. Start from an arbitrary object (usually the root, I suspect), unserialize it, put it in a map with its unique key as the index, and return it
  2. When you step on an object key in the serialization stream, first check whether that object is already unserialized by looking up its unique key in the map. If it is, just grab it from there. If not, put a lazy-loading proxy in place of the real object; the proxy repeats these two steps for that object and has hooks to load the right object when you need it.

Edit: you might need to use two-pass serialization and unserialization if you have circular references in there. It complicates things a bit, but not that much.
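The steps above can be sketched roughly as follows. Everything here is an assumption made for illustration: KeyedStore, Record and Node are hypothetical names, the in-memory map stands in for the file/database, and the payload is just a String. Instead of a separate proxy class, each node resolves its child keys on first access, which also copes with circular references without a second pass.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical store: each object is serialized on its own, references become keys.
final class KeyedStore {

    // Flat stored form of one node: child references are replaced by their keys.
    static final class Record implements Serializable {
        private static final long serialVersionUID = 1L;
        final UUID id;
        final String payload;
        final ArrayList<UUID> childIds;

        Record(UUID id, String payload, List<UUID> childIds) {
            this.id = id;
            this.payload = payload;
            this.childIds = new ArrayList<UUID>(childIds);
        }
    }

    // In-memory node whose children are loaded only when first requested.
    static final class Node {
        final UUID id;
        final String payload;
        private final List<UUID> childIds;
        private final KeyedStore store;
        private List<Node> children; // null until resolved

        Node(Record record, KeyedStore store) {
            this.id = record.id;
            this.payload = record.payload;
            this.childIds = record.childIds;
            this.store = store;
        }

        List<Node> children() {
            if (children == null) {
                List<Node> resolved = new ArrayList<Node>();
                for (UUID childId : childIds) {
                    resolved.add(store.load(childId));
                }
                children = resolved;
            }
            return children;
        }
    }

    private final Map<UUID, byte[]> storage = new HashMap<UUID, byte[]>(); // stand-in for disk
    private final Map<UUID, Node> loaded = new HashMap<UUID, Node>();      // already unserialized

    void save(UUID id, String payload, List<UUID> childIds) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Record(id, payload, childIds));
        }
        storage.put(id, bytes.toByteArray());
    }

    Node load(UUID id) {
        Node cached = loaded.get(id);
        if (cached != null) {
            return cached; // same instance for every reference to this key
        }
        byte[] raw = storage.get(id);
        if (raw == null) {
            throw new IllegalArgumentException("Unknown key " + id);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            Node node = new Node((Record) in.readObject(), this);
            loaded.put(id, node);
            return node;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Cannot load node " + id, e);
        }
    }
}
```

Calling load with the root key returns the root immediately; the rest of the tree is only read as children() is called on each node.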

thr
That could work, but it would require reworking quite a bit of the code I have.
Mario Ortegón
How would that be an improvement over standard serialization? As far as I know that's already done by the default mechanism.
Joachim Sauer
@saua Because you can lazily load and instantiate each object when needed instead of loading everything at once. You can also go down to the byte level yourself and optimize the serialization format.
thr
+3  A: 

One optimization is customizing the class descriptors, so that you store the class descriptors in a different database and in the object stream you only refer to them by ID. This reduces the space needed by the serialized data. See for example how in one project the classes SerialUtil and ClassesTable do it.
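A rough sketch of the idea, not the code from the project mentioned above: ObjectOutputStream.writeClassDescriptor and ObjectInputStream.readClassDescriptor can be overridden so that only a small ID is written per class. ClassTable, CompactObjectOutputStream and CompactObjectInputStream are hypothetical names, and the table is kept in memory here; in practice it would have to be persisted separately and be available, unchanged, when reading.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical table mapping class names to small integer IDs.
final class ClassTable {
    private final List<String> names = new ArrayList<String>();
    private final Map<String, Integer> ids = new HashMap<String, Integer>();

    int idOf(String className) {
        Integer id = ids.get(className);
        if (id == null) {
            id = names.size();
            names.add(className);
            ids.put(className, id);
        }
        return id;
    }

    String nameOf(int id) {
        return names.get(id);
    }
}

// Writes a 4-byte ID where the default stream writes the full class descriptor.
final class CompactObjectOutputStream extends ObjectOutputStream {
    private final ClassTable table;

    CompactObjectOutputStream(OutputStream out, ClassTable table) throws IOException {
        super(out);
        this.table = table;
    }

    @Override
    protected void writeClassDescriptor(ObjectStreamClass desc) throws IOException {
        writeInt(table.idOf(desc.getName()));
    }
}

// Reads the ID back and rebuilds the descriptor from the locally loaded class.
final class CompactObjectInputStream extends ObjectInputStream {
    private final ClassTable table;

    CompactObjectInputStream(InputStream in, ClassTable table) throws IOException {
        super(in);
        this.table = table;
    }

    @Override
    protected ObjectStreamClass readClassDescriptor() throws IOException, ClassNotFoundException {
        String name = table.nameOf(readInt());
        ObjectStreamClass desc = ObjectStreamClass.lookup(Class.forName(name));
        if (desc == null) {
            throw new InvalidClassException(name, "class is not serializable");
        }
        return desc;
    }
}
```

The trade-off is that the stream no longer describes its own classes, so the reader must have compatible class versions and the same table.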

Making classes Externalizable instead of Serializable can give some performance benefits. The downside is that it requires lots of manual work.
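As an illustration of the manual work involved, here is a minimal sketch of an Externalizable tree node. ExternalNode and its fields are invented for the example; the real classes would have to write and read each of their own fields in the same order.

```java
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node that controls its own wire format.
class ExternalNode implements Externalizable {
    private static final long serialVersionUID = 1L;

    private String label;
    private List<ExternalNode> children = new ArrayList<ExternalNode>();

    // Externalizable requires a public no-argument constructor.
    public ExternalNode() {
    }

    ExternalNode(String label) {
        this.label = label;
    }

    String label() {
        return label;
    }

    List<ExternalNode> children() {
        return children;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(label);
        out.writeInt(children.size());
        for (ExternalNode child : children) {
            out.writeObject(child); // still goes through the object stream, so shared nodes stay shared
        }
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        // Fields must be read in exactly the order they were written.
        label = in.readUTF();
        int count = in.readInt();
        children = new ArrayList<ExternalNode>(count);
        for (int i = 0; i < count; i++) {
            children.add((ExternalNode) in.readObject());
        }
    }
}
```

Instances are written and read with an ordinary ObjectOutputStream and ObjectInputStream.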

Then there are other serialization libraries, for example jserial, which can give better performance than Java's default serialization. Also, if the object graph does not include cycles, then it can be serialized a little bit faster, because the serializer does not need to keep track of objects it has seen (see "How does it work?" in jserial's FAQ).

Esko Luontola
I have done the Externalizable route in the past, and I gained about 20-23% performance increase in the serialization/deserialization of large object graphs. The amount of work required for this will be proportional to the number of objects you have to customize.
Robin
A: 

For performance, I'd suggest not using java.io serialisation at all. Instead get down on to the bytes yourself.
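One way to read "get down on to the bytes" is a hand-written format over DataOutputStream and DataInputStream. The sketch below assumes a strict tree (no shared or cyclic references) whose nodes carry only a string label; RawTreeCodec and Node are hypothetical names. It writes each node in pre-order as its label followed by its child count.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-rolled codec: no class descriptors, no reference tracking.
final class RawTreeCodec {

    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<Node>();

        Node(String label) {
            this.label = label;
        }
    }

    // Pre-order encoding: label, number of children, then each child subtree.
    static void write(Node node, DataOutputStream out) throws IOException {
        out.writeUTF(node.label);
        out.writeInt(node.children.size());
        for (Node child : node.children) {
            write(child, out);
        }
    }

    // Mirrors write(): the child count tells the reader how many subtrees follow.
    static Node read(DataInputStream in) throws IOException {
        Node node = new Node(in.readUTF());
        int childCount = in.readInt();
        for (int i = 0; i < childCount; i++) {
            node.children.add(read(in));
        }
        return node;
    }
}
```

Both methods recurse once per tree level, so a very deep tree would need an iterative variant. Wrapping the file streams in BufferedOutputStream and BufferedInputStream is advisable.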

If you are going to java.io serialise the tree you might need to make sure your recursion doesn't get too deep, either by flattening (as say TreeSet does) or arranging to serialise the deepest nodes first (so you have back references rather than nested readObject calls).

I would be surprised if there wasn't a way in Kodo to read the entire tree in one go (or a few).

Tom Hawtin - tackline
There is a way in Kodo to do this, but the problem is that it depends on how the objects are created in the database. Unfortunately the database is structured in such a way that we can't do it (and there is no way to change the model).
Mario Ortegón
+5  A: 

Don't forget to use the 'transient' keyword for instance variables that don't have to be serialized. This gives you a performance boost because you are no longer reading/writing unnecessary data.
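A small illustration, with a made-up class: the cached value below can be recomputed from the other field, so it is marked transient and rebuilt lazily after deserialization instead of being written to the stream.

```java
import java.io.Serializable;
import java.util.Locale;

// Hypothetical node with one persistent field and one derived, transient field.
final class CachedNode implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String name;

    // Derived from name; skipped by default serialization and null after loading.
    private transient String upperName;

    CachedNode(String name) {
        this.name = name;
    }

    // Rebuilds the cached value on first use, including after deserialization.
    String upperName() {
        if (upperName == null) {
            upperName = name.toUpperCase(Locale.ROOT);
        }
        return upperName;
    }

    // Exposed only so the effect of 'transient' can be observed.
    boolean cacheFilled() {
        return upperName != null;
    }
}
```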

dogbane
That is a good and important general consideration in any case. I do this already, but it is important to mention it. +1
Mario Ortegón
+3  A: 

Hi, Mario

I would recommend implementing custom writeObject() and readObject() methods. That way you can avoid writing the child nodes for each node in the tree. With default serialization, each node is serialized together with all its children.

For example, writeObject() of a Tree class should iterate through all the nodes of the tree and write only the node data (without the Node objects themselves), together with markers that identify the tree level.

You can look at LinkedList to see how these methods are implemented there. It uses the same approach to avoid writing the prev and next entries for every single entry.
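A sketch of how that could look, under the assumption that each node only carries a string. Tree and Node are illustrative names, the depth is used as the level marker, and -1 marks the end of the data. The traversal is iterative, so deep trees do not cause nested writeObject calls.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical tree that writes (depth, data) pairs instead of Node objects.
final class Tree implements Serializable {
    private static final long serialVersionUID = 1L;

    // Deliberately not Serializable: nodes never reach the stream as objects.
    static final class Node {
        final String data;
        final List<Node> children = new ArrayList<Node>();

        Node(String data) {
            this.data = data;
        }
    }

    private transient Node root;

    Tree(Node root) {
        this.root = root;
    }

    Node root() {
        return root;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        Deque<Node> nodes = new ArrayDeque<Node>();
        Deque<Integer> depths = new ArrayDeque<Integer>();
        if (root != null) {
            nodes.push(root);
            depths.push(0);
        }
        while (!nodes.isEmpty()) { // iterative pre-order walk
            Node node = nodes.pop();
            int depth = depths.pop();
            out.writeInt(depth);   // level marker
            out.writeUTF(node.data);
            // Push children in reverse so they are popped in their original order.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                nodes.push(node.children.get(i));
                depths.push(depth + 1);
            }
        }
        out.writeInt(-1); // end marker
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        List<Node> lastAtDepth = new ArrayList<Node>(); // most recent node seen at each depth
        int depth;
        while ((depth = in.readInt()) >= 0) {
            Node node = new Node(in.readUTF());
            if (depth == 0) {
                root = node;
            } else {
                lastAtDepth.get(depth - 1).children.add(node); // parent is one level up
            }
            if (depth < lastAtDepth.size()) {
                lastAtDepth.set(depth, node);
            } else {
                lastAtDepth.add(node);
            }
        }
    }
}
```

The Tree is then written with a plain ObjectOutputStream.writeObject(tree) call; the private methods are picked up by the serialization mechanism.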

Andrey Vityuk
+4  A: 

To avoid having to write your own serialization code, give Google Protocol Buffers a try. According to their site:

Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages – Java, C++, or Python

I've not used it, but have heard a lot of positive things about it. Plus, I have to maintain some custom serialization code, and it can be an absolute nightmare to do (let alone tracking down bugs), so getting someone else to do it for you is always a Good Thing.

Rich
A: 

Hope my suggestion here will help.

01es
A: 

Also, have a look at XStream, a library to serialize objects to XML and back again.

cherouvim
I've tried it already; for this type of object it is even worse than Kodo. Java serialization is faster than XStream by far.
Mario Ortegón