We're running a fairly complex app as a portlet on Websphere Portal Server 5.1 on AIX using IBM JDK 1.4.2. On our production system I can see a strange behaviour in the verbose GC logs. After a period of normal behaviour the system can start rapidly allocating larger and larger blocks. The system starts to spend > 1000 ms to complete each GC, but blocks are being allocated so quickly that there is only a 30 ms gap between allocation failures.
- Each allocation failure slightly larger than the last by some integer amount x 1024 bytes. E.g. you might have 5 MB, and then a short while later 5 MB + 17 * 1024.
- This can go on for up to 10 minutes.
- The blocks tend to grow up to 8 to 14 MB in size before it stops.
- It's a quad-core system, and I assume that it's now spending >95% of it's time doing GC with three cores waiting for the other core to complete GC. For 10 minutes. Ouch.
- Obviously system performance dies at this point.
- We have JSF, hibernate & JDBC, web services calls, log4j output and not much else.
I interpret this as likely to be something infrastructural rather than our application code. If it was a bad string concatenation inside a loop we would expect more irregular growth than blocks of 1024. If it was StringBuffer or ArrayList growth we would see the block sizes doubling. The growth is making me think of log buffering or something else. I can't think of anything in our app that allocations even 1 MB, let alone 14. Today I looked for logging backing up in memory before being flushed to disk, but the volume of logging statements over this period of GC thrashing was nowhere near the MB range.
Clearly the problem is with the excessive memory allocation rather than with the garbage collection, which is just doing its best to keep up. Something is allocating a large block and trying to grow it inefficiently in increments that are far too small.
Any ideas what might be causing all this when the system is under load? Anybody seen anything similar with Portal Server?
Note: for anybody who's interested it's starting to look like the cause is an occasional but enormous database query. It seems the culprit is either Hibernate or the JDBC driver.