Our system setup consists of two Weblogic 10.3 servers: one hosts the presentation layer and the other hosts the EJBs. The system runs fine under moderate load for some time (one to several days) after which EJB method calls from the presentation server to the EJB server start to fail with the following error:

java.rmi.RemoteException: java.rmi.UnmarshalException: error unmarshalling arguments; nested exception is: java.io.OptionalDataException

Stack trace:

java.io.OptionalDataException
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
    at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:197)
    at weblogic.rjvm.MsgAbbrevInputStream.readObject(MsgAbbrevInputStream.java:564)
    at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:193)
    at weblogic.jndi.internal.RootNamingNode_WLSkel.invoke(Unknown Source)
    at weblogic.rmi.internal.BasicServerRef.invoke(BasicServerRef.java:589)
    at weblogic.rmi.cluster.ClusterableServerRef.invoke(ClusterableServerRef.java:230)
    at weblogic.rmi.internal.BasicServerRef$1.run(BasicServerRef.java:477)
    at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:363)
    at weblogic.security.service.SecurityManager.runAs(Unknown Source)
    at weblogic.rmi.internal.BasicServerRef.handleRequest(BasicServerRef.java:473)
    at weblogic.rmi.internal.wls.WLSExecuteRequest.run(WLSExecuteRequest.java:118)

Once the first OptionalDataException is encountered, all subsequent calls fail with the same result. Some sources suggest that this might be related to a misconfigured cluster multicast port. However, these servers do not belong to a cluster.

Restarting the EJB server always resolves the issue temporarily, but it occurs again after some time.

Update: it seems that the problem is not related to an overflow in the number of socket connections after all (see my own answer below). After disallowing network classloading we ran very steadily for a week after which we started receiving OptionalDataExceptions on the presentation server again (stack trace below). It is very strange that the system works fine for a week and then starts to fail.

javax.naming.CommunicationException [Root exception is java.rmi.UnmarshalException: error unmarshalling arguments; nested exception is:
    java.io.OptionalDataException]
    at weblogic.jndi.internal.ExceptionTranslator.toNamingException(ExceptionTranslator.java:74)
    at weblogic.jndi.internal.WLContextImpl.translateException(WLContextImpl.java:439)
    at weblogic.jndi.internal.WLContextImpl.lookup(WLContextImpl.java:395)
    at weblogic.jndi.internal.WLContextImpl.lookup(WLContextImpl.java:380)
    at javax.naming.InitialContext.lookup(InitialContext.java:392)
    ...
Caused by: java.rmi.UnmarshalException: error unmarshalling arguments; nested exception is:
    java.io.OptionalDataException
    at weblogic.rjvm.ResponseImpl.unmarshalReturn(ResponseImpl.java:234)
    at weblogic.rmi.cluster.ClusterableRemoteRef.invoke(ClusterableRemoteRef.java:348)
    at weblogic.rmi.cluster.ClusterableRemoteRef.invoke(ClusterableRemoteRef.java:259)
    at weblogic.jndi.internal.ServerNamingNode_1030_WLStub.lookup(Unknown Source)
    at weblogic.jndi.internal.WLContextImpl.lookup(WLContextImpl.java:392)  
    ... 38 more
Caused by: java.io.OptionalDataException
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
    at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:197)
    at weblogic.rjvm.MsgAbbrevInputStream.readObject(MsgAbbrevInputStream.java:564)
    at weblogic.utils.io.ChunkedObjectInputStream.readObject(ChunkedObjectInputStream.java:193)
    at weblogic.jndi.internal.RootNamingNode_WLSkel.invoke(Unknown Source)
    at weblogic.rmi.internal.BasicServerRef.invoke(BasicServerRef.java:589)
    at weblogic.rmi.cluster.ClusterableServerRef.invoke(ClusterableServerRef.java:230)
    at weblogic.rmi.internal.BasicServerRef$1.run(BasicServerRef.java:477)
    at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:363)
    at weblogic.security.service.SecurityManager.runAs(Unknown Source)
    at weblogic.rmi.internal.BasicServerRef.handleRequest(BasicServerRef.java:473)
    at weblogic.rmi.internal.wls.WLSExecuteRequest.run(WLSExecuteRequest.java:118)
    ... 2 more

We obtain the initial context in the standard way:

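import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;
// serverPath holds the provider URL of the EJB server
Properties p = new Properties();
p.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
p.put(Context.PROVIDER_URL, serverPath);
Context context = new InitialContext(p);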

Calls on any references obtained earlier also fail with a similar OptionalDataException. Restarting the presentation server alone resolves the issue temporarily.

+1  A: 

Finally we found the solution to this (Edit: later we found out that this was not the root cause of the issue, but a separate serious issue. For the final solution, please see the answer below). Once we started to receive the following exception, we got on the track of the cause:

<BEA-000403> <IOException occurred on socket: Socket[addr=/x.x.x.x,port=3266,localport=7001]
 java.net.SocketException: Connection refused.
java.net.SocketException: Connection refused
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at weblogic.socket.SocketMuxer.readReadySocketOnce(SocketMuxer.java:887)
at weblogic.socket.SocketMuxer.readReadySocket(SocketMuxer.java:859)
at weblogic.socket.DevPollSocketMuxer.processSockets(DevPollSocketMuxer.java:120)
at weblogic.socket.SocketReaderRequest.run(SocketReaderRequest.java:29)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:42)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:145)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:117)

On the presentation server, which is running on a different host than the EJB server, we had the option

-Dweblogic.NetworkClassLoadingEnabled=true

to enable class loading from the EJB server. What we did not know is that this option can cause a huge number of network sockets to be opened. Using netstat, we found several thousand sockets in either the CLOSE_WAIT or FIN_WAIT_2 state. It seems that all the elements of the web UI were loaded from the EJB server in addition to the classes, even though the war file on the presentation server contained all of them. The huge number of sockets did not produce "too many files" error messages because Weblogic removes the ulimit for files in its startup script. On a test server we found that a single user click in the web UI opened around 30 sockets between the two servers.
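For anyone who wants to repeat the netstat check, here is a minimal sketch of how the counting could be automated. It is not what we ran. The class name SocketStateCounter is made up for illustration, and it assumes that the TCP state is the last whitespace-separated column of each netstat line, which may not hold for every platform or set of netstat options.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: tallies half-closed sockets from netstat output lines.
public class SocketStateCounter {

    // Assumption: the TCP state is the last whitespace-separated column of a line.
    static Map<String, Integer> countStates(List<String> netstatLines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : netstatLines) {
            String[] cols = line.trim().split("\\s+");
            String state = cols[cols.length - 1];
            // Only the two states mentioned above are of interest here.
            if (state.equals("CLOSE_WAIT") || state.equals("FIN_WAIT_2")) {
                Integer old = counts.get(state);
                counts.put(state, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real output from our servers.
        List<String> sample = Arrays.asList(
                "tcp 0 0 10.0.0.1:7001 10.0.0.2:3266 CLOSE_WAIT",
                "tcp 0 0 10.0.0.1:7001 10.0.0.2:3267 FIN_WAIT_2",
                "tcp 0 0 10.0.0.1:7001 10.0.0.2:3268 ESTABLISHED",
                "tcp 0 0 10.0.0.1:7001 10.0.0.2:3269 CLOSE_WAIT");
        System.out.println(countStates(sample)); // prints {CLOSE_WAIT=2, FIN_WAIT_2=1}
    }
}
```

Feeding it the lines of a real netstat dump taken at intervals would show whether the count keeps growing between restarts.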

We removed this option and repackaged the war on the presentation server to contain all the needed classes, thus removing the need for network classloading. This reduced the number of socket connections between the two servers from thousands to one.

In summary, avoid network class loading in Weblogic if at all possible.

MarkoU
+1  A: 

Finally the OptionalDataExceptions are history. In short, in our application code a complex value object (used as a return value for remote method invocations) had a HashMap as an internal field. After changing this field to a synchronized map (such as the wrapper returned by Collections.synchronizedMap), the OptionalDataExceptions stopped occurring. It seems that somewhere in the legacy code this Map is handled in a non-thread-safe way.

What is strange is that this caused no problems with WLS 8.1, but somehow caused WLS 10 to enter a state where all subsequent remote method invocations (including JNDI lookups) started to fail.
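To illustrate the kind of change involved, here is a minimal sketch. It is not our actual code: the class OrderSummary, its field and its methods are hypothetical, and the in-memory round trip only approximates what happens during a remote call. The point is that the wrapper returned by Collections.synchronizedMap is itself serializable and writes itself while holding its lock, so another thread calling put() cannot change the map halfway through serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical value object returned from a remote method.
public class OrderSummary implements Serializable {
    private static final long serialVersionUID = 1L;

    // Before the fix this would have been a plain HashMap. The synchronized
    // wrapper serializes itself under its own lock, so a concurrent put()
    // cannot interleave with writeObject().
    private final Map<String, String> attributes =
            Collections.synchronizedMap(new HashMap<String, String>());

    public void setAttribute(String key, String value) {
        attributes.put(key, value);
    }

    public String getAttribute(String key) {
        return attributes.get(key);
    }

    // Serializes and deserializes in memory, as a stand-in for the RMI transport.
    public static OrderSummary roundTrip(OrderSummary in)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(in);
        }
        try (ObjectInputStream src =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (OrderSummary) src.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        OrderSummary summary = new OrderSummary();
        summary.setAttribute("status", "open");
        System.out.println(roundTrip(summary).getAttribute("status")); // prints open
    }
}
```

Note that the wrapper only protects single operations and serialization. Code that iterates over the map while other threads modify it still has to synchronize on the map itself.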

MarkoU
A: 

I am facing a similar issue in WebLogic 10.3.

But in our case we have the whole application installed on five WebLogic servers with a load balancer in front of them.

The solution mentioned above is not applicable to our case as there is no codesharing involved.

The exception indicates that WebLogic is not able to connect to the load balancer.

Mayur
Network classloading (is this what you referred to as codesharing?) was not the cause of the OptionalDataExceptions after all. Please see the accepted answer for the cause, which was concurrent modification of an unsynchronized data structure.
MarkoU