We're running a fairly complex app as a portlet on Websphere Portal Server 5.1 on AIX using IBM JDK 1.4.2. On our production system I can see a strange behaviour in the verbose GC logs. After a period of normal behaviour, the system can start rapidly allocating larger and larger blocks. The system starts to spend > 1000 ms completing each GC, but blocks are being allocated so quickly that there is only a 30 ms gap between allocation failures.

  • Each allocation failure is slightly larger than the last, by some integer multiple of 1024 bytes. E.g. you might see 5 MB, and then a short while later 5 MB + 17 * 1024 bytes (a rough sketch for pulling these numbers out of the verbosegc log follows this list).
  • This can go on for up to 10 minutes.
  • The blocks tend to grow to 8 to 14 MB in size before the growth stops.
  • It's a quad-core system, and I assume that it's now spending > 95% of its time doing GC, with three cores waiting for the other core to complete GC. For 10 minutes. Ouch.
  • Obviously system performance dies at this point.
  • We have JSF, Hibernate & JDBC, web service calls, log4j output and not much else.
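
Here is a rough scrap of code for pulling those numbers out of the log, so the pattern above isn't just eyeballed. It assumes the allocation-failure lines look roughly like "<AF[33]: Allocation Failure. need 5259264 bytes, 31 ms since last AF>"; the exact wording varies by SDK level, so the regex may need adjusting against the actual verbosegc output, and the file path is whatever -verbose:gc is redirected to.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Scans a verbosegc log and prints, for each allocation failure, the
    // requested size, the delta from the previous request and the gap in ms.
    // The regex is an assumption about the 1.4.2 log format - verify it.
    public class AfScanner {
        public static void main(String[] args) throws Exception {
            Pattern p = Pattern.compile("Allocation Failure\\. need (\\d+) bytes, (\\d+) ms since last AF");
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            long lastNeed = -1;
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = p.matcher(line);
                if (!m.find()) continue;
                long need = Long.parseLong(m.group(1));
                long gapMs = Long.parseLong(m.group(2));
                long delta = (lastNeed < 0) ? 0 : need - lastNeed;
                System.out.println("need=" + need + " bytes, delta=" + delta
                        + " bytes (" + (delta / 1024) + " KB), gap=" + gapMs + " ms");
                lastNeed = need;
            }
            in.close();
        }
    }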

I interpret this as likely to be something infrastructural rather than our application code. If it was a bad string concatenation inside a loop we would expect more irregular growth than neat multiples of 1024 bytes. If it was StringBuffer or ArrayList growth we would see the block sizes roughly doubling. The growth is making me think of log buffering or something else. I can't think of anything in our app that allocates even 1 MB, let alone 14. Today I looked for logging backing up in memory before being flushed to disk, but the volume of logging statements over this period of GC thrashing was nowhere near the MB range.
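
To make the doubling argument concrete: a minimal, self-contained sketch (nothing to do with our actual code) showing that StringBuffer-style buffers grow geometrically, so a runaway append loop would show up in the logs as capacities roughly doubling rather than creeping up in 1 KB steps.

    // Prints the underlying buffer capacity each time it changes while
    // appending one character at a time. On a 1.4.2 JDK the capacities
    // roughly double (geometric growth), which is the pattern we are NOT
    // seeing in the verbosegc output.
    public class GrowthPattern {
        public static void main(String[] args) {
            StringBuffer sb = new StringBuffer();
            int lastCapacity = sb.capacity();
            for (int i = 0; i < 1000000; i++) {
                sb.append('x');
                if (sb.capacity() != lastCapacity) {
                    System.out.println("capacity grew " + lastCapacity + " -> " + sb.capacity());
                    lastCapacity = sb.capacity();
                }
            }
        }
    }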

Clearly the problem is with the excessive memory allocation rather than with the garbage collection, which is just doing its best to keep up. Something is allocating a large block and trying to grow it inefficiently in increments that are far too small.
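
For contrast, this is the shape of code that would produce exactly what the logs show: a buffer re-grown by a fixed 1 KB step, so every resize allocates a fresh block only slightly larger than the one before and immediately turns the old one into garbage. This is purely an illustrative sketch, not something I've found in our app or in the stack.

    // Illustrative only: each grow step allocates a new array exactly 1024
    // bytes bigger than the last, so the GC sees a stream of steadily
    // growing allocations - the same arithmetic progression as the
    // allocation failures in the verbosegc log.
    public class IncrementalGrowth {
        private byte[] buf = new byte[1024];
        private int used = 0;

        public void write(byte[] data) {
            while (used + data.length > buf.length) {
                byte[] bigger = new byte[buf.length + 1024]; // +1 KB per resize
                System.arraycopy(buf, 0, bigger, 0, used);
                buf = bigger;                                // old array becomes garbage
            }
            System.arraycopy(data, 0, buf, used, data.length);
            used += data.length;
        }
    }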

Any ideas what might be causing all this when the system is under load? Anybody seen anything similar with Portal Server?

Note: for anybody who's interested it's starting to look like the cause is an occasional but enormous database query. It seems the culprit is either Hibernate or the JDBC driver.
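
If that theory holds up, one stopgap while we chase the real fix is to put a hard ceiling on what a single query can drag into memory. The sketch below is generic JDBC with placeholder connection details and SQL, not our actual data access code; the Hibernate equivalent would be setMaxResults() on the suspect query. Note that how setFetchSize() is honoured is driver-dependent.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Caps the size of a result set and streams it in batches instead of
    // letting one enormous query materialise in the heap. URL, credentials
    // and SQL are placeholders.
    public class BoundedQuery {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection("jdbc:db2://dbhost:50000/APPDB", "user", "password");
            try {
                Statement st = con.createStatement();
                st.setMaxRows(10000);   // hard upper bound on rows returned
                st.setFetchSize(500);   // hint to fetch rows in batches (driver-dependent)
                ResultSet rs = st.executeQuery("SELECT ID, NAME FROM SOME_TABLE");
                while (rs.next()) {
                    // process one row at a time; never build the whole result in memory
                }
                rs.close();
                st.close();
            } finally {
                con.close();
            }
        }
    }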

+2  A: 

Not sure what could cause the problem, but here is an idea on how to investigate further: the IBM JDK can be configured to produce a heap dump when it receives a SIGQUIT signal, which makes it very handy for this kind of issue.
In a previous project it was not the JDK we normally used, but we would switch to it whenever we had memory issues to investigate.

Here's how to enable the heapdump: http://publib.boulder.ibm.com/infocenter/javasdk/v1r4m2/index.jsp?topic=/com.ibm.java.doc.diagnostics.142j9/html/enabling_a_heapdump.html
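
If it helps, you can also trigger the same dump from code at a suspicious moment rather than waiting to catch it with kill -3. This is a speculative sketch: it assumes your IBM SDK ships the com.ibm.jvm.Dump class with a static HeapDump() method (check the diagnostics guide for your exact level); the reflection just lets it degrade gracefully if the class isn't there.

    // Attempts a programmatic heapdump on IBM SDKs; harmless no-op elsewhere.
    public class TriggerHeapDump {
        public static void heapDumpIfPossible() {
            try {
                Class dumpClass = Class.forName("com.ibm.jvm.Dump");
                dumpClass.getMethod("HeapDump", new Class[0]).invoke(null, new Object[0]);
            } catch (Throwable t) {
                System.err.println("Programmatic heapdump not available: " + t);
            }
        }
    }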

Then there's a tool called heaproot that will allow you to see what's in these dumps.

Finding the type of objects should lead you to the culprit.

Nicolas
If you gather a heapdump then you could do worse than use http://eclipse.org/mat to analyse the file. If you install the IBM DTFJ heapdump reader plugin it provides a really nice interface for browsing through the large objects in your heap.
fd
A: 

Only a hint... we once had a project that suffered major GC problems (Websphere and IBM JDK) due to heap fragmentation. In the end, we added a JDK switch to force heap compaction.

The Sun JDK does not tend to have a fragmented heap, but the IBM JDK does, due to its different memory/GC handling.

Just give it a try... I cannot remember the magic switch.
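
To illustrate what fragmentation means here (just a toy, and whether it actually misbehaves depends entirely on the collector and heap settings): after the small holes are punched, the total free heap is large, but a single big allocation still needs one contiguous region, which is what forces the collector into expensive work or compaction.

    // Toy sketch of fragmentation: fill the heap with small blocks, free
    // every other one to leave small holes, then ask for one large
    // contiguous block. Purely illustrative.
    public class FragmentationSketch {
        public static void main(String[] args) {
            byte[][] chunks = new byte[10000][];
            for (int i = 0; i < chunks.length; i++) {
                chunks[i] = new byte[8 * 1024];      // ~80 MB of small blocks
            }
            for (int i = 0; i < chunks.length; i += 2) {
                chunks[i] = null;                    // free every other block
            }
            System.gc();
            byte[] big = new byte[32 * 1024 * 1024]; // needs one contiguous 32 MB region
            System.out.println("allocated " + big.length + " bytes");
        }
    }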

ReneS
http://www.ibm.com/developerworks/ibm/library/i-incrcomp/
bwalliser
+1  A: 

Depending on the exact version of the IBM JDK you are using, there are various options for tracking "large allocations". The differences are mainly in the implementation; the result in each case is a logged Java stack trace whenever an allocation over a certain size is made (which should help you track down the culprit).

"Sovereign" 1.4.2 SR4+: http://www-01.ibm.com/support/docview.wss?uid=swg21236523

"J9" 1.4.2 (if Java is running under -Xj9 option): You need to get hold of a JVMPI / JVMTI agent for the same purpose, I can't find a link for this one right now.

fd