I've got two applications running on two different machines that communicate by sending Serializable "Message" objects over Java's Socket implementation. Each one creates a ServerSocket, connects to the other's server and then runs the following bits of code (pseudo-Java; error handling and connection details are elided for brevity):

Receiving code:

while (true) {
    Object received = oisFromOtherMachine.readUnshared();
    dispatch(received);
}

Sending code:

synchronized void sendMessage(Message m) {
    oosToOtherMachine.writeObject(m);
    oosToOtherMachine.flush();
    oosToOtherMachine.reset();
}

sendMessage is called fairly regularly from a variety of different threads.

This all worked fine and dandy up until about 3 weeks ago. Since then, in response to a particular bit of user input, the call to readUnshared sometimes throws. So far, we've seen "java.lang.IllegalStateException: unread block data" and "java.lang.ClassCastException: java.util.HashMap cannot be cast to java.io.ObjectStreamClass", both from deep in the internals of ObjectInputStream.

It happens about one time in 5, normally after the two systems have been up and talking to each other for 15+ minutes. For various reasons, we have two network cables that regularly get used between the two, one gnarled and knotted 15m cable (ping of 30ms+), the other about 1m (ping of <1ms). It's only ever happened over the short cable (and believe me, we've tried it over the long one a large number of times).

I've checked that everything reachable from any Message object is Serializable. There are no clues in the logs of either app before the message is sent, and the app that doesn't get the error continues merrily on its way, unaware of any trouble.

So. Google doesn't suggest any gotchas in OIS, OOS or Java Sockets that could cause it and my colleagues are as stumped as me... Has anyone seen anything like this before?

Edit: Thanks for suggestions everyone. (-: In conclusion I suspect some unsynchronized access to some of the logging status objects is producing a broken object graph which is causing OIS to choke. This Needs To Be Solved Yesterday though, and a liberal application of the synchronized keyword along with the following abomination ...

try { /* message loop */ } catch (RuntimeException e) { /* resync appstate and continue */ }

... will be done much quicker and with much higher chances of success than more frustrating (25min+) attempts to reproduce the problem & associated headscratching.
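For anyone hitting the same thing, here is a minimal sketch of one way to keep the serializer away from state that other threads are still mutating: copy it under the same lock that guards the updates, then serialize the copy. The class SnapshotSendSketch, its status map and the helper methods are hypothetical names made up for illustration, not our real code, and byte arrays stand in for the socket streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for state that several threads update (an assumption for this sketch).
class SnapshotSendSketch {
    private final Map<String, Integer> status = new HashMap<>();

    // All writers go through the same lock as snapshot().
    synchronized void update(String key, int value) {
        status.put(key, value);
    }

    // Copy under the lock, so the serializer never walks a map that is mid-update.
    synchronized HashMap<String, Integer> snapshot() {
        return new HashMap<>(status);
    }

    // Serialize a private copy; the live map may keep changing while this runs.
    static byte[] serialize(Serializable payload) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(payload);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a single object back, as the receiving side would.
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SnapshotSendSketch state = new SnapshotSendSketch();
        state.update("queued", 3);
        HashMap<String, Integer> copy = state.snapshot();
        state.update("queued", 4); // later updates do not affect the copy being sent
        System.out.println(deserialize(serialize(copy)));
    }
}
```

The copy is shallow, so this only helps if the values in the map are immutable or are copied as well.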

A: 

Never seen that happen, and I'm using Sockets + ObjectStreams quite heavily.

I suggest you try newer JVM versions; IllegalStateExceptions deep in the bowels of core class libraries smell strange. The fact that it's happening only on a very fast connection almost makes it sound like a race condition.

Perhaps this time you did "find a bug in GCC"?

Robert Munteanu
Where do you see GCC involved?
Michael Borgwardt
Fair question. That's a more or less obscure reference to programmers who think that they are always right and that something else is to blame - OS, compiler, hardware.
Robert Munteanu
Ah, I get it. However, it's equally stupid (and in my experience just as common) to get so used to infrastructure (OS, compiler, hardware) "just working" that you stop considering the possibility that it might be broken. Case in point: So far, I've twice encountered recurring JVM crashes in a project. Both times, it turned out to be a machine with faulty RAM.
Michael Borgwardt
A: 

My guesses: you have some data corruption between the two machines; or they run on different Java versions; or you have some tricky singletons in the object graph; or the reset() on the sender side is messing things up.

Why do you use readUnshared()?

kd304
MHarris: Could you share which of the guesses actually won?
kd304
A: 

Looks to me like the network data gets corrupted.

Could it simply be that the short cable is damaged? Have you tried using a different short cable?

Another possibility is a faulty network card or driver.

Michael Borgwardt
A: 

My random guess: Although sendMessage is marked synchronized, you have more than one instance of the enclosing object for each stream, so the lock does not actually keep concurrent writes apart. Or perhaps you have more than one ObjectOutputStream for each Socket OutputStream.
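A minimal sketch of the second case, with a ByteArrayOutputStream standing in for the socket's OutputStream (the class and method names are made up for illustration). Each new ObjectOutputStream writes its own stream header, so a second writer on the same underlying stream puts bytes on the wire that a single ObjectInputStream does not expect mid-stream. In this simplified setup the reader fails with a StreamCorruptedException; the exact exception in a real application may well differ from this one and from the ones in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.StreamCorruptedException;

// Hypothetical demo class; "wire" stands in for socket.getOutputStream().
class SharedStreamSketch {
    // Writes two messages and reports whether one reader can read both back.
    static boolean readsBack(boolean extraWriter) {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            ObjectOutputStream shared = new ObjectOutputStream(wire);
            shared.writeObject("first");
            shared.flush();

            // The suspected mistake: wrapping the same underlying stream a second time.
            ObjectOutputStream writer = extraWriter ? new ObjectOutputStream(wire) : shared;
            writer.writeObject("second");
            writer.flush();

            ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(wire.toByteArray()));
            try {
                in.readUnshared();
                in.readUnshared();
                return true;
            } catch (StreamCorruptedException e) {
                // The second stream header landed where the reader expected an object.
                return false;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("one shared writer readable: " + readsBack(false));
        System.out.println("extra writer readable:      " + readsBack(true));
    }
}
```

So it is worth checking that exactly one ObjectOutputStream exists per Socket and that every sender thread locks on the same object before writing to it.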

Tom Hawtin - tackline