views: 115
answers: 6
Hi!

We have a Java ERP-type application. Communication between server and client is via RMI. In peak hours up to 250 users can be logged in, and about 20 of them are working at the same time, which means roughly 20 threads are active at any given moment during peak hours. The server can run for hours without any problems, but all of a sudden response times get higher and higher, sometimes reaching several minutes.

We are running on Windows 2008 R2 with Sun's JDK 1.6.0_16. We have been using perfmon and Process Explorer to see what is going on. The only thing we find odd is that when the server starts to slow down, the number of handles the java.exe process has open is around 3500. I'm not saying this is the actual problem.

I'm just curious whether there are some guidelines I should follow to pinpoint the problem. What tools should I use?

+1  A: 

Sounds like the garbage collector cannot keep up and starts "stop-the-world" collections for some reason.

Attach jvisualvm (shipped with the JDK) when the server starts and have a look at the collected data once the performance drops.
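For example, GC logging and remote-JMX flags can go straight into the launch script so the GC activity ends up in a file and jvisualvm can attach remotely. A minimal sketch for a Windows launch script; the port, log file name and jar name are placeholders, and authentication/SSL are disabled here purely for brevity (only do that on a trusted network):

    rem sketch only -- port, log file and jar name are placeholders
    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log ^
         -Dcom.sun.management.jmxremote.port=9010 ^
         -Dcom.sun.management.jmxremote.authenticate=false ^
         -Dcom.sun.management.jmxremote.ssl=false ^
         -jar erp-server.jar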

Thorbjørn Ravn Andersen
I've added -verbose:gc in my launch script today. I'll see tomorrow if the GC is the problem.
kovica
+3  A: 

Can you access the log configuration of this application?

If you can, you should change the log level to "DEBUG". Tracing the DEBUG logs of a request could give you useful information about the contention point.

If you can't, profiling tools can help you:

  • VisualVM (free, and a good product)
  • Eclipse TPTP (free, but more complicated than VisualVM)
  • JProbe (not free but very powerful; it is my favorite Java profiler, but it is expensive)

If the application has been developed with JMX control points, you can plug in a JMX viewer to get information...
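If such control points do not exist yet, a standard MBean is enough to expose a few counters to JConsole or VisualVM. A minimal sketch; the ServerStats/ServerStatsMBean names and the erp:type=ServerStats object name are made up for illustration:

    // ServerStatsMBean.java -- illustrative only, not from the original application
    public interface ServerStatsMBean {
        int getActiveRequests();
    }

    // ServerStats.java
    import java.lang.management.ManagementFactory;
    import java.util.concurrent.atomic.AtomicInteger;
    import javax.management.ObjectName;

    public class ServerStats implements ServerStatsMBean {
        private final AtomicInteger activeRequests = new AtomicInteger();

        public int getActiveRequests() { return activeRequests.get(); }

        // Call these from the RMI request handling code.
        public void requestStarted()  { activeRequests.incrementAndGet(); }
        public void requestFinished() { activeRequests.decrementAndGet(); }

        public static void main(String[] args) throws Exception {
            // Register under a fixed name so a JMX viewer can browse it.
            ManagementFactory.getPlatformMBeanServer()
                    .registerMBean(new ServerStats(), new ObjectName("erp:type=ServerStats"));
            Thread.sleep(Long.MAX_VALUE); // keep this demo JVM alive so you can attach
        }
    }

Once registered, the attribute shows up under the MBeans tab of JConsole or VisualVM.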

If you want to stress the application to trigger the problem (to verify whether it is a load problem), you can use stress tools like JMeter.

Benoit Courtine
Sadly we are not using log4j or any logging framework. The server is now about 10 years old and it would be expensive to change it. Using a profiler would be great if the slowness also manifested itself on the test system. Since this is a production system, I cannot play with it there. JMeter does not support testing over RMI.
kovica
A: 

The problem you're describing is quite typical but also very general. Causes can range from memory leaks and resource contention to bad GC policies and heap/PermGen-space sizing. To pinpoint the exact problems in your application, you need to profile it (I am aware of tools like YourKit and JProfiler). If you profile your application wisely, a few application cycles should reveal the problems; otherwise profiling itself isn't very easy.

nabeelalimemon
A: 

In a similar situation, I coded some simple profiling code myself. Basically I used a ThreadLocal that holds a "StopWatch" (based on a LinkedHashMap), and I then insert code like this at various points in the application: watch.time("OperationX");

Then, after the thread finishes a task, I'd call watch.logTime(); and the class would write a log line that looks like this: [DEBUG] StopWatch time:Stuff=0, AnotherEvent=102, OperationX=150
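The original class isn't shown here, but a rough reconstruction of the idea (names and output format are approximations, not the poster's actual code) could look like this:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of the per-thread stopwatch described above.
    public class StopWatch {
        private static final ThreadLocal<StopWatch> CURRENT = new ThreadLocal<StopWatch>() {
            @Override protected StopWatch initialValue() { return new StopWatch(); }
        };

        // LinkedHashMap keeps the events in the order they were recorded.
        private final Map<String, Long> times = new LinkedHashMap<String, Long>();
        private long last = System.currentTimeMillis();

        public static StopWatch get() { return CURRENT.get(); }

        // Records the time elapsed since the previous event under the given name.
        public void time(String event) {
            long now = System.currentTimeMillis();
            times.put(event, now - last);
            last = now;
        }

        // Writes one log line per task and resets the watch for the next request.
        public void logTime() {
            StringBuilder sb = new StringBuilder("[DEBUG] StopWatch time:");
            boolean first = true;
            for (Map.Entry<String, Long> e : times.entrySet()) {
                if (!first) sb.append(", ");
                sb.append(e.getKey()).append('=').append(e.getValue());
                first = false;
            }
            System.out.println(sb); // the real version would write to the application log
            times.clear();
            last = System.currentTimeMillis();
        }
    }

Usage would then be StopWatch.get().time("OperationX"); at each checkpoint and StopWatch.get().logTime(); when the task ends.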

After this I wrote a simple parser that generates CSV from this log (per code path). The best thing you can do is to create a histogram (easily done using Excel). Averages, medians and even modes can fool you; I highly recommend creating a histogram.

Together with this histogram, you can create line graphs using the average/median/mode (whichever represents the data best; you can determine this from the histogram).

This way, you can be 100% sure exactly which operation is taking the time. If you can't determine the culprit, binary search is your friend (make the events finer-grained).

It might sound really primitive, but it works. Also, if you make a library out of it, you can use it in any project. It's also nice because you can easily turn it on in production as well.

Enno Shioji
A: 

Aside from the GC that others have mentioned, try taking thread dumps every 5-10 seconds for about 30 seconds during your slowdown. It could be the case that DB calls, web services, or some other dependency become slow. If you take a look at the thread dumps you will be able to see threads which don't appear to move, and you can narrow down your culprit that way.
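jstack or Ctrl+Break on the server console will give you such dumps from the outside; if you would rather produce them from inside the JVM, a minimal sketch (the ThreadDumper name is made up, and it assumes you can drop a small utility class into the server) could look like this:

    import java.util.Map;

    // Sketch: once started inside the server JVM, prints a thread dump every 5 seconds.
    // Threads that sit in the same stack frame dump after dump are the ones to look at.
    public class ThreadDumper implements Runnable {
        public static void start() {
            Thread t = new Thread(new ThreadDumper(), "thread-dumper");
            t.setDaemon(true); // never keeps the server from shutting down
            t.start();
        }

        public void run() {
            while (true) {
                System.out.println("==== thread dump at " + System.currentTimeMillis() + " ====");
                for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                    System.out.println(e.getKey().getName() + " (" + e.getKey().getState() + ")");
                    for (StackTraceElement frame : e.getValue()) {
                        System.out.println("    at " + frame);
                    }
                }
                try {
                    Thread.sleep(5000);
                } catch (InterruptedException ex) {
                    return; // stop dumping if interrupted
                }
            }
        }
    }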

From the GC standpoint, do you monitor your CPU usage during these times? If the GC is running frequently you will see a jump in your overall CPU usage.

If only this was a Solaris box, prstat would be your friend.

Sean
A: 

For acute issues like this, a quick jstack <pid> should quickly point out the problem area. Probably no need to get all fancy about it.

If I had to guess, I'd say HotSpot jumped in and tightly optimised some badly written code. NetBeans grinds to a halt where it uses a WeakHashMap keyed by newly created objects to cache file data. Once the code is optimised, the entries can be removed from the map straight after being added. Obviously, if the cache is being relied upon, a lot of file activity follows. You probably won't see the drive light up, because it will all be cached by the OS.
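Not the NetBeans code, but a tiny illustration of why a WeakHashMap keyed by objects nothing else references makes a fragile cache (System.gc() is only a hint, so the result is not guaranteed):

    import java.util.WeakHashMap;

    // Demonstrates the failure mode: the key is unreachable right after put(),
    // so the entry can vanish as soon as the collector runs.
    public class WeakCacheDemo {
        public static void main(String[] args) throws InterruptedException {
            WeakHashMap<Object, String> cache = new WeakHashMap<Object, String>();
            cache.put(new Object(), "expensively loaded file data");
            System.out.println("entries before GC: " + cache.size()); // 1
            System.gc();                                              // only a hint
            Thread.sleep(100);                                        // give the collector a moment
            System.out.println("entries after GC:  " + cache.size()); // typically 0
        }
    }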

Tom Hawtin - tackline