views:

121

answers:

4

In my thesis, I would like to enhance messaging in a cluster.

It's important to log runtime information about how big a message is (to decide whether to process it locally or remotely).

So far I have only found frameworks that estimate an object's memory size based on Java instrumentation. I've tested Classmexer, which didn't come close to the serialization size, and SourceForge's SizeOf.

In a small test case, SizeOf was around 10% off and 10x faster than serialization. (Still, transient fields break the estimation completely, and since e.g. ArrayList's backing array is transient but gets serialized anyway via writeObject(), it's not easy to patch SizeOf. But I could live with that.)

On the other hand, 10x faster with 10% error doesn't seem very good. Any ideas how I could do better?

Update: I also tested ObjectSize (http://sourceforge.net/projects/objectsize-java). The results only seem good for objects without inheritance :(

+2  A: 

Just an idea: you could serialize the object to a byte buffer first, get its length, and then decide whether to send the buffer's content to a remote location or do the local processing (if that depends on the message's size).

Drawback: you may waste time on serialization if you later decide not to use the buffer. But with estimation, you waste the estimation effort whenever you do need to serialize (because in that case you estimate first and serialize in a second step).
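A minimal sketch of this serialize-first approach (the threshold value, class name, and message handling are all made up for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializeFirst {
    // Hypothetical threshold: messages up to this size go to the remote node.
    static final int REMOTE_THRESHOLD = 4096;

    // Serialize the message into an in-memory buffer once, keep the bytes.
    static byte[] toBytes(Serializable msg) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(msg);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = toBytes("hello");
        // The buffer's length is the exact wire size, so the decision is precise.
        if (payload.length <= REMOTE_THRESHOLD) {
            System.out.println("send remote: " + payload.length + " bytes");
        } else {
            System.out.println("process locally: " + payload.length + " bytes");
        }
    }
}
```

If the message is sent remotely, the already-filled buffer can be written to the socket directly, so the serialization work is not wasted in that case.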

Andreas_D
The performance measurement of Java serialization in the question was made with a ByteArrayOutputStream. I had the same idea in mind, but I assume that only every 50th message actually needs to be serialized (I'm using actors). So the performance hit of measuring the message size via serialization is significant.
Stefan K.
+1  A: 

There is no way to estimate the serialized size of an object with both good precision and speed. For example, an object could be a cache of the digits of Pi that constructs itself at runtime given only the required length. It would serialize only the 4 bytes of its 'length' attribute, while the object itself could be using hundreds of megabytes of memory to store the digits.

The only solution I can think of is adding your own interface with a method int estimateSerializedSize(). For every object implementing this interface you would call this method to get the proper size. If an object does not implement it, you would have to fall back to SizeOf.
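A sketch of such an interface (all names here are hypothetical, and the fallback merely stands in for a SizeOf-style deep-size estimator):

```java
import java.io.Serializable;

// Hypothetical interface as described in the answer: objects report
// their own serialized-size estimate.
interface SizeEstimable {
    int estimateSerializedSize();
}

// Example message type that knows roughly what it serializes to.
class TextMessage implements Serializable, SizeEstimable {
    private static final long serialVersionUID = 1L;
    private final String body;

    TextMessage(String body) {
        this.body = body;
    }

    @Override
    public int estimateSerializedSize() {
        // Rough guess: ~2 bytes per char plus a fixed overhead for headers.
        return 2 * body.length() + 64;
    }
}

class SizeHelper {
    // Prefer the object's own estimate; otherwise fall back to a
    // generic estimator (stubbed out here).
    static int estimate(Object o) {
        if (o instanceof SizeEstimable) {
            return ((SizeEstimable) o).estimateSerializedSize();
        }
        return fallbackEstimate(o);
    }

    static int fallbackEstimate(Object o) {
        return 128; // placeholder for e.g. SizeOf.deepSizeOf(o)
    }
}
```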

Max
To be honest: I don't get your point. If your PiCache is serialized with 4 bytes, all the other "hundreds of megabytes of memory" are transient. Or they're serialized with the object. I don't want to estimate the size of the object by its constructor. You're right that memory size and serialization size aren't comparable just like that. I guess following the non-transient object graph and summing up the memory sizes of the objects would come close. Serialization overhead, as described at www.javaworld.com/community/node/2915, could be neglected for a significant performance boost.
Stefan K.
Well, maybe the PiCache example wasn't very precise. I just wanted an example where the memory an object takes greatly exceeds what it serializes to. But still, objects often implement custom serialization, and that makes up the 10% difference. As with the PiCache example: it probably has no 'length' property at all, generates the cache in the constructor, puts it into some List<Integer>, and serializes only list.size(). This custom serialization is what makes up the 10% you want to minimize, and there is no automated way to predict it.
Max
+2  A: 

The size a class takes at runtime doesn't necessarily have any bearing on its serialized size. The example you've mentioned is transient fields; another is objects that implement Externalizable and handle serialization themselves.

If an object implements Externalizable or provides readObject()/writeObject(), then your best bet is to serialize the object to a memory buffer to find out its size. It's not going to be fast, but it will be accurate.

If an object uses the default serialization, then you could amend SizeOf to take transient fields into account.

After serializing many objects of the same type, you may be able to build up a "serialization profile" for that type that correlates serialized size with the runtime size reported by SizeOf. This lets you estimate quickly with SizeOf and then map that estimate to a serialized size, arriving at a more accurate result than SizeOf alone provides.
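Such a profile could be as simple as a learned per-class ratio between observed serialized sizes and the estimator's results (a sketch; the smoothing factor is an arbitrary choice):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: learn serializedSize / estimatedSize per class from messages
// that really were serialized, then scale the cheap estimate by it.
class SerializationProfile {
    private final Map<Class<?>, Double> ratios = new ConcurrentHashMap<>();

    /** Record an observation from a message that was actually serialized. */
    void record(Class<?> type, long estimatedSize, long serializedSize) {
        double ratio = (double) serializedSize / estimatedSize;
        // Exponential moving average smooths out per-message variance.
        ratios.merge(type, ratio, (old, cur) -> 0.8 * old + 0.2 * cur);
    }

    /** Scale a raw estimate by the learned ratio (1.0 if the type is unseen). */
    long correct(Class<?> type, long estimatedSize) {
        return Math.round(estimatedSize * ratios.getOrDefault(type, 1.0));
    }
}
```

Since only a fraction of messages get serialized anyway, each of those can feed `record()` for free, keeping the profile up to date.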

mdma
Good point. I have to remember that (assuming I have a good estimation) if I estimate a subclass of Externalizable, I should fall back to serialization for measuring.
Stefan K.
I just realized that overriding readObject()/writeObject() is possible without implementing Externalizable. So my "fallback" strategy now becomes my first choice :). Maybe I can increase performance slightly by implementing my own OutputStream which just counts the size. Phew.
Stefan K.
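A size-counting OutputStream as suggested in the comment above might look like this (a sketch; it gives the exact serialized size without keeping the bytes around):

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

// OutputStream that discards all data and only tracks the byte count.
class CountingOutputStream extends OutputStream {
    private long count;

    @Override
    public void write(int b) {
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) {
        count += len;
    }

    long getCount() {
        return count;
    }
}

class SizeMeasurer {
    /** Exact serialized size, measured without allocating a result buffer. */
    static long serializedSize(Serializable obj) throws IOException {
        CountingOutputStream counter = new CountingOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(counter)) {
            oos.writeObject(obj);
        }
        return counter.getCount();
    }
}
```

Note that ObjectOutputStream still does the full serialization work internally, so this saves allocation and copying but not CPU time.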
+2  A: 

There are many good points in the other answers; one thing missing so far is that the serialization mechanism may cache certain objects.

For example, you serialize a series of objects A, B, and C, all of the same class, each holding two objects o1 and o2. Say the object overhead is 100 bytes and the objects look like:

Object shared = new Object();
Object shared2 = new Object();

A.o1 = new Object();
A.o2 = shared;


B.o1 = shared2;
B.o2 = shared;


C.o1 = shared2;
C.o2 = shared;

For simplicity's sake, say the generic objects take 50 bytes each to serialize, so A's serialized size is 100 (overhead) + 50 (o1) + 50 (o2) = 200 bytes. One could make a similar naive estimate for B and C. However, if all three are serialized by the same ObjectOutputStream before reset() is called, what you will see in the stream is a full serialization of A with its o1 and o2; then a serialization of B with its o1, BUT only a reference to o2, since that same object was already serialized. Say an object reference takes 16 bytes: the size of B is now 100 (overhead) + 50 (o1) + 16 (reference to o2) = 166 bytes. A similar calculation for C, with both of its objects cached, gives 100 + 16 + 16 = 132 bytes. So the serialized sizes differ between the three objects, with ~33% between the largest and the smallest.

So unless you serialize each object through a fresh (or freshly reset) stream every time, it is difficult to accurately estimate the size required to serialize an object.
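The caching effect described above is easy to observe by writing the same object twice into one ObjectOutputStream (a small sketch; the exact byte counts are whatever the JDK produces):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Demonstrates back-reference caching: writing the SAME object a second
// time emits only a short handle; reset() clears the cache again.
public class CachingDemo {

    // Returns {bytes of first write, of second write, of write after reset()}.
    static int[] writeSizes(Object shared) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);

        oos.writeObject(shared);
        oos.flush();
        int first = bos.size();

        oos.writeObject(shared); // same identity -> back-reference only
        oos.flush();
        int second = bos.size() - first;

        oos.reset();             // clears the handle cache
        oos.writeObject(shared);
        oos.flush();
        int afterReset = bos.size() - first - second;

        return new int[] { first, second, afterReset };
    }

    public static void main(String[] args) throws IOException {
        int[] sizes = writeSizes("some fairly long shared payload string");
        System.out.println("first write:       " + sizes[0] + " bytes");
        System.out.println("second write:      " + sizes[1] + " bytes");
        System.out.println("write after reset: " + sizes[2] + " bytes");
    }
}
```

The first figure also includes the 4-byte stream header; the second write is only a few bytes because it is just a handle into the stream's object cache.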

M. Jessup
That's a good point. I forgot to mention that only one object is serialized per message, then the stream is reset (at least I hope so; otherwise it would be a problem of the framework). Do you know if Java's serialization is smart enough to serialize equal objects only once? E.g. replacing your shared example with new Long(10L): all objects would have their own Long instance (not ==), but all of them are equal().
Stefan K.