views:

1791

answers:

3

All,

We are running a J2EE application on WebLogic server 9.2 MP2 with a jrockit 64-bit JVM (27.3.1) on Solaris 10.

We call use runtime.exec to call an executable called jfmerge to create PDF documents.

We have found that in Solaris, when runtime.exec is called, a duplicate JVM is temporarily spawned to kick off the jfmerge process. While this is inefficient (our JVM is 5 GB, thus the duplicated shell JVM is also 5 GB), the major problem lies in the fact that when there is heavy load on this functionality (PDF generation) in our application, sometimes the duplicated JVM never exits.

When the JVM hangs, the servers create large issues (extreme application slowness and terminated user sessions) as the entire duplicate JVM get's all of its 5 GB of process size written to disk swap.

We have noted the following hung thread correlated with a hung JVM process until the process is manually killed:

"[STUCK] ExecuteThread: '17' for queue: 'weblogic.kernel.Default (self-tuning)'" id=3463 idx=0x158 tid=3460 prio=1 alive, in native, daemon at jrockit/io/FileNativeIO.readBytesPinned(Ljava/io/FileDescriptor;[BII)I(Native Method) at jrockit/io/FileNativeIO.readBytes(FileNativeIO.java:30) at java/io/FileInputStream.readBytes([BII)I(FileInputStream.java) at java/io/FileInputStream.read(FileInputStream.java:194) at java/lang/UNIXProcess$DeferredCloseInputStream.read(UNIXProcess.java:227) at java/io/BufferedInputStream.fill(BufferedInputStream.java:218) at java/io/BufferedInputStream.read(BufferedInputStream.java:235) ^-- Holding lock: java/io/BufferedInputStream@0xfffffffec6510470[thin lock] at gov/v3/common/formgeneration/sessionbean/FormsBean.getProcessStatus(FormsBean.java:809) at gov/v3/common/formgeneration/sessionbean/FormsBean.createPDF(FormsBean.java:750) at gov/v3/common/formgeneration/sessionbean/FormsBean.getTemplateDetails(FormsBean.java:450) at gov/v3/common/formgeneration/sessionbean/FormsBean.generateSinglePDF(FormsBean.java:1371) at gov/v3/common/formgeneration/sessionbean/FormsBean.generatePDF(FormsBean.java:263) at gov/v3/common/formgeneration/sessionbean/FormsBean.endorseDocument(FormsBean.java:2377) at gov/v3/common/formgeneration/sessionbean/Forms_qaco28_EOImpl.endorseDocument(Forms_qaco28_EOImpl.java:214) at gov/v3/delegates/common/FormsAndNoticesDelegate.endorseDocument(FormsAndNoticesDelegate.java:128) at gov/v3/actions/common/EndorseDocumentAction.executeRequest(EndorseDocumentAction.java:68) at gov/v3/fwk/controller/struts/action/V3CommonDispatchAction.dispatchToExecuteMethod(V3CommonDispatchAction.java:532) at gov/v3/fwk/controller/struts/action/V3CommonDispatchAction.executeBaseAction(V3CommonDispatchAction.java:336) at gov/v3/fwk/controller/struts/action/V3BaseDispatchAction.execute(V3BaseDispatchAction.java:69) at org/apache/struts/action/RequestProcessor.processActionPerform(RequestProcessor.java:484) at gov/v3/fwk/controller/struts/requestprocessor/V3TilesRequestProcessor.processActionPerform(V3TilesRequestProcessor.java:384) at org/apache/struts/action/RequestProcessor.process(RequestProcessor.java:274) at org/apache/struts/action/ActionServlet.process(ActionServlet.java:1482) at org/apache/struts/action/ActionServlet.doGet(ActionServlet.java:507) at gov/v3/fwk/controller/struts/servlet/V3ControllerServlet.doGet(V3ControllerServlet.java:110) at javax/servlet/http/HttpServlet.service(HttpServlet.java:743) at javax/servlet/http/HttpServlet.service(HttpServlet.java:856) at weblogic/servlet/internal/StubSecurityHelper$ServletServiceAction.run(StubSecurityHelper.java:227) at weblogic/servlet/internal/StubSecurityHelper.invokeServlet(StubSecurityHelper.java:125) at weblogic/servlet/internal/ServletStubImpl.execute(ServletStubImpl.java:283) at weblogic/servlet/internal/ServletStubImpl.execute(ServletStubImpl.java:175) at weblogic/servlet/internal/WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3231) at weblogic/security/acl/internal/AuthenticatedSubject.doAs(AuthenticatedSubject.java:321) at weblogic/security/service/SecurityManager.runAs(SecurityManager.java:121) at weblogic/servlet/internal/WebAppServletContext.securedExecute(WebAppServletContext.java:2002) at weblogic/servlet/internal/WebAppServletContext.execute(WebAppServletContext.java:1908) at weblogic/servlet/internal/ServletRequestImpl.run(ServletRequestImpl.java:1362) at weblogic/work/ExecuteThread.execute(ExecuteThread.java:209) at weblogic/work/ExecuteThread.run(ExecuteThread.java:181) at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method) -- end of trace

We would like to do a couple of things:

1.) Prevent the spawning of a duplicate JVM, as we do not need any of it's functions when executing the simple jfmerge executable, and it creates massive overhead.

2.) In the short term at least prevent this duplicate JVM from handing indefinitely.

A: 

Are you handling the spawned process stdout/stderr properly ? You need to consume both in separate threads to reliably prevent blocking. See this answer for details. It may be the case that your process spawning works properly for some jobs, and for others (due to the quantity of stdout/err that triggers a hang).

On the subject of duplicate processes, I would expect the JVM to fork/exec. This duplicates the Java process (fork) and then it should replace it with the new process (exec). I wonder if that's what you're seeing ? Note also that I'd expect the OS to be implementing COW (copy-on-write) to duplicate only those memory pages that differ between images, so in normal circumstances the duplication of the JVM wouldn't consume as much memory as you may think.

Brian Agnew
A: 

As Brian implied, on unix, the standard way for another process to start another program is to fork into a parent process and a child process. The child process then calls exec to replace itself with the new program. The JVM has to do this to start your jfmerge program.

Normally, the memory size of the child process isn't an issue, because the OS uses copy-on-write to let the two processes share the same memory image until the child calls exec. It could be that the JVM's model for child processes requires it to fork twice, with the grandchild exec'ing jfmerge and the child process that manages the grandchild. That would explain why you see a duplicate JVM process that you are seeing. The stack trace shows a process blocked reading from an input stream. It may be that jfmerge is running slowly and the process is just hung waiting for jfmerge to produce some output.

What you could do is to get some other process to launch jfmerge, instead of your 5GB JVM. Write a standalone program which just runs jfmerge on demand, and have it communicate with the main process through some form of inter-process communication. This standalone jfmerge server wouldn't require as much memory to operate, so the impact of forked child processes wouldn't be so great.

Kenster
A: 

This answer is late, but we have the same problem, and the problem for us is how Solaris manage the memory.

The problem is when we have a application server, using a lot of memory 10GB in my case, and we want to run a simple "ls", the new process needs 10GB to run.

Solaris needs the 10GB extra available in our server, Linux use a feature known as “copy-on-write” This feature reduces the overhead of forking a new process

http://developers.sun.com/solaris/articles/subprocess/subprocess.html

Historical Background and Problem Description

Traditionally, Unix has had only one way to create a new process: using a fork() system call, often followed by an exec() system call. The fork() call makes a copy of the entire parent process' address space, and exec() turns that copy into a new process.

(Note: In the Solaris OS, the term swap space is used to describe a combination of physical memory and disk swap space configured for the system. However, with other Unix systems this term may mean swap space on disk, also known as backing store. To avoid any confusion, I'll use the term Virtual Memory (VM) to mean physical memory plus disk swap space.)

Generally, the fork/exec method has worked quite well. However, it has disadvantages in some cases, such as running out of memory without a good reason and poor fork performance.

Out of Memory: For a large-memory process, the fork() system call can fail due to an inadequate amount of VM, because fork() requires twice the amount of the parent memory. This can happen even when fork() is immediately followed by an exec() call that would release most of that extra memory. When this happens, the application will usually terminate.

For example, suppose a 64-bit application is consuming 6 gigabytes (Gbytes) of VM at the moment, and it needs to create a subprocess to run the ls(1) command. The parent process issues a fork() call that will succeed only if there is another 6 Gbytes of VM available at the moment. If the system doesn't have that much VM available (which is a frequent situation), fork() will fail with ENOMEM. Obviously, the ls(1) command doesn't need anywhere near 6 Gbytes of memory to run, but fork() doesn't know that.

Not only applications, but also Sun's own tools can suffer from the same problem. For example, the following Sun RFE (request for enhancement) has been filed for dbx: "4748951 dbx shell should use posix_spawn() for non-builtin commands rather than fork(2)".

RFE 4748951 came about when a customer's utility invoked dbx to read a huge core file using a script that also needed to run a cut(1) command from within dbx. They got a cannot fork - try again error message causing dbx to abort. An investigation revealed that dbx used fork/exec to execute that tiny cut(1) command and ran out of VM during the fork() call.

The Solaris Java Virtual Machine (JVM) is also suffering from the same problem currently, as described in this Sun RFE: "5049299 Use posix_spawn, not fork, on S10 to avoid swap exhaustion".


So you have 3 options.

1.- Execute the Runtime.exec function earlier.

2.- Create a inter process comunication with other java server, and ececute there the Runtime.exec instruccion.

3.- Create a JNI class to call a system C function. I take this option, and it work perfect.

I put my sample code here.

Java Code.

public class CallOS {
    static {
            System.loadLibrary("CallOS");
    }

    public native int exec(java.lang.String cmd);

    public static void main(String[] args) {
            int returnValue = 0;
            returnValue = new CallOS().exec("ls -la");
            System.out.println("- " + returnValue);
    }
}

C header Code. This is generate with javah -jni CallOS

/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
/* Header for class CallOS */

#ifndef _Included_CallOS
#define _Included_CallOS
#ifdef __cplusplus
extern "C" {
#endif
/*
 * Class:     CallOS
 * Method:    exec
 * Signature: (Ljava/lang/String;)I
 */
JNIEXPORT jint JNICALL Java_CallOS_exec
  (JNIEnv *, jobject, jstring);

#ifdef __cplusplus
}
#endif
#endif

C code.

#include "CallOS.h"
#include <stdlib.h>

JNIEXPORT jint JNICALL Java_CallOS_exec
  (JNIEnv *env, jobject obj, jstring cmd)
{
   jint  retval;
   jbyte *str;

   str = (*env)->GetStringUTFChars(env, cmd, NULL);
   if(str == NULL) return NULL;

   retval = system(str);

   (*env)->ReleaseStringUTFChars(env, cmd, str);
   return retval;
};

I hope this help for you.

Abelardo Perez Pelayo
You refer to a problem in Solaris itself - do you know what that problem is?
Matthew Wilson
I modify the response, coz is not a Problem. Regards
Abelardo Perez Pelayo