I've got two applications running on two different machines that communicate by sending Serializable "Message" objects over Java's Socket implementation. Each one creates a ServerSocket, connects to the other's server, and then runs the following (pseudo-Java; error handling and connection details are elided for brevity):
Receiving code:
    while (true) {
        Object received = oisFromOtherMachine.readUnshared();
        dispatch(received);
    }
Sending code:
    synchronized void sendMessage(Message m) {
        oosToOtherMachine.writeObject(m);
        oosToOtherMachine.flush();
        oosToOtherMachine.reset();
    }
sendMessage is called fairly regularly from a variety of different threads.
This all worked fine and dandy until about 3 weeks ago, when, sometimes, in response to a particular bit of user input, the call to readUnshared started throwing. So far, we've seen "java.lang.IllegalStateException: unread block data" and "java.lang.ClassCastException: java.util.HashMap cannot be cast to java.io.ObjectStreamClass", both from deep in the internals of ObjectInputStream.
It happens about one time in 5, normally after the two systems have been up and talking to each other for 15+ minutes. For various reasons, we have two network cables that regularly get used between the two: one gnarled and knotted 15m cable (ping of 30ms+), the other about 1m (ping of <1ms). It's only ever happened over the short cable (and believe me, we've tried it over the long one a large number of times).
I've checked that everything reachable from any Message object is Serializable, there are no clues in the logs of either app before the message is sent, and the app that doesn't get the error continues merrily on its way, unaware of any trouble.
So. Google doesn't suggest any gotchas in OIS, OOS or Java Sockets that could cause it and my colleagues are as stumped as me... Has anyone seen anything like this before?
Edit: Thanks for suggestions everyone. (-: In conclusion I suspect some unsynchronized access to some of the logging status objects is producing a broken object graph which is causing OIS to choke. This Needs To Be Solved Yesterday though, and a liberal application of the synchronized keyword along with the following abomination ...
    try { /* message loop */ } catch (RuntimeException e) { /* resync appstate and continue */ }
... will be done much quicker and with much higher chances of success than more frustrating (25min+) attempts to reproduce the problem & associated headscratching.
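For anyone who lands here later, this is roughly what the "liberal application of synchronized" could look like for one of the shared status objects. It is only a sketch: LogStatus and SnapshotDemo are made-up names, not classes from our code, and it assumes the object's state is only ever touched through its own synchronized methods. The idea is that a private synchronized writeObject takes the same lock as the mutators, so the HashMap can't change while ObjectOutputStream is walking it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for one of the shared "logging status" objects.
class LogStatus implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Map<String, Integer> counters = new HashMap<>();

    synchronized void increment(String key) {
        counters.merge(key, 1, Integer::sum);
    }

    synchronized int count(String key) {
        return counters.getOrDefault(key, 0);
    }

    // Serialization hook: it holds the same lock as the mutators above, so
    // the map cannot be modified while its contents are being written.
    private synchronized void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
    }
}

// Hypothetical harness: serializes while another thread keeps mutating.
class SnapshotDemo {
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(o);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return ois.readUnshared();
        }
    }

    public static void main(String[] args) throws Exception {
        LogStatus status = new LogStatus();
        Thread mutator = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                status.increment("key" + (i % 50));
            }
        });
        mutator.start();
        // Each round trip sees a consistent snapshot of the map.
        for (int i = 0; i < 200; i++) {
            deserialize(serialize(status));
        }
        mutator.join();
        System.out.println("round trips completed");
    }
}
```

This only protects the one object whose lock is held. Anything mutable that it references and that other threads can reach by another route would need the same treatment, which is why I'm treating it as a stopgap rather than a proper fix.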