Hi

We have built a webapp running on JBoss 4.2.3 on Red Hat Linux 5.2 on a 12-core IBM machine. Lately we have seen long response times, which seem to be caused by GC activity. The pattern is more or less as follows: requests keep arriving at the normal rate, but no bytes are sent until suddenly all responses are "flushed" at once. At the same time, gc.log shows this pattern:

// Here the unresponsiveness starts
36788.309: [GC [1 CMS-initial-mark: 1919297K(4456448K)] 1969797K(5111808K), 0.1473460 secs] [Times: user=0.15 sys=0.00, real=0.15 secs]
Total time for which application threads were stopped: 0.1530790 seconds
36788.457: [CMS-concurrent-mark-start]
36793.886: [CMS-concurrent-mark: 5.429/5.429 secs] (CMS-concurrent-mark yielded 46 times)
 [Times: user=6.69 sys=0.07, real=5.43 secs]
36793.886: [CMS-concurrent-preclean-start]
 (cardTable: 15956 cards, re-scanned 15956 cards, 1 iterations)
36793.988: [CMS-concurrent-preclean: 0.089/0.101 secs] (CMS-concurrent-preclean yielded 9 times)
 [Times: user=0.12 sys=0.01, real=0.10 secs]
36793.988: [CMS-concurrent-abortable-preclean-start]
 (cardTable: 75 cards, re-scanned 75 cards, 1 iterations)
Application time: 5.6923460 seconds
Total time for which application threads were stopped: 0.0170000 seconds
 (cardTable: 96 cards, re-scanned 96 cards, 1 iterations)
 (cardTable: 207 cards, re-scanned 207 cards, 1 iterations)
 (cardTable: 45 cards, re-scanned 45 cards, 1 iterations)
 (cardTable: 157 cards, re-scanned 157 cards, 1 iterations)
 (cardTable: 88 cards, re-scanned 88 cards, 1 iterations)
 (cardTable: 213 cards, re-scanned 213 cards, 1 iterations)
 (cardTable: 84 cards, re-scanned 84 cards, 1 iterations)
 (cardTable: 143 cards, re-scanned 143 cards, 1 iterations)
 (cardTable: 40 cards, re-scanned 40 cards, 1 iterations)
 (cardTable: 109 cards, re-scanned 109 cards, 1 iterations)
 (cardTable: 77 cards, re-scanned 77 cards, 1 iterations)
 (cardTable: 156 cards, re-scanned 156 cards, 1 iterations)
 (cardTable: 132 cards, re-scanned 132 cards, 1 iterations)
 (cardTable: 119 cards, re-scanned 119 cards, 1 iterations)
 (cardTable: 115 cards, re-scanned 115 cards, 1 iterations)
 (cardTable: 195 cards, re-scanned 195 cards, 1 iterations)
 (cardTable: 163 cards, re-scanned 163 cards, 1 iterations)
 (cardTable: 166 cards, re-scanned 166 cards, 1 iterations)
 (cardTable: 149 cards, re-scanned 149 cards, 1 iterations)
 (cardTable: 218 cards, re-scanned 218 cards, 1 iterations)
 (cardTable: 46 cards, re-scanned 46 cards, 1 iterations)
 (cardTable: 57 cards, re-scanned 57 cards, 1 iterations)
 (cardTable: 213 cards, re-scanned 213 cards, 1 iterations)
 (cardTable: 189 cards, re-scanned 189 cards, 1 iterations)
 (cardTable: 104 cards, re-scanned 104 cards, 1 iterations)
 (cardTable: 60 cards, re-scanned 60 cards, 1 iterations)
 (cardTable: 368 cards, re-scanned 368 cards, 1 iterations)
 (cardTable: 118 cards, re-scanned 118 cards, 1 iterations)
 (cardTable: 129 cards, re-scanned 129 cards, 1 iterations)
 (cardTable: 291 cards, re-scanned 291 cards, 1 iterations)
 (cardTable: 105 cards, re-scanned 105 cards, 1 iterations)
 (cardTable: 63 cards, re-scanned 63 cards, 1 iterations)
 (cardTable: 73 cards, re-scanned 73 cards, 1 iterations)
 (cardTable: 165 cards, re-scanned 165 cards, 1 iterations)
 (cardTable: 69 cards, re-scanned 69 cards, 1 iterations)
 (cardTable: 157 cards, re-scanned 157 cards, 1 iterations)
 CMS: abort preclean due to time  [37 iterations, 13 waits, 4954 cards)] 36799.161: [CMS-concurrent-abortable-preclean: 2.965/5.173 secs] (CMS-concurrent-abortable-preclean yielded 166 times)
 [Times: user=5.12 sys=0.10, real=5.17 secs]
Application time: 4.9945030 seconds
36799.172: [GC[YG occupancy: 572074 K (655360 K)]36799.172: [GC 36799.173: [ParNew
Desired survivor size 67108864 bytes, new threshold 2 (max 15)
- age   1:   59654872 bytes,   59654872 total
- age   2:   26827232 bytes,   86482104 total
- age   3:    2613360 bytes,   89095464 total
- age   4:     929600 bytes,   90025064 total
- age   5:     893176 bytes,   90918240 total
- age   6:     795736 bytes,   91713976 total
- age   7:     809240 bytes,   92523216 total
- age   8:     936216 bytes,   93459432 total
- age   9:     971376 bytes,   94430808 total
- age  10:     624472 bytes,   95055280 total
- age  11:     883616 bytes,   95938896 total
- age  12:     817296 bytes,   96756192 total
- age  13:     977992 bytes,   97734184 total
- age  14:     733816 bytes,   98468000 total
- age  15:     654744 bytes,   99122744 total
: 572074K->110988K(655360K), 0.1855140 secs] 2491372K->2031059K(5111808K) icms_dc=0 , 0.1866380 secs] [Times: user=0.72 sys=0.00, real=0.19 secs]
36799.359: [Rescan (parallel)  (Survivor:12chunks) Finished young gen rescan work in 3th thread: 0.032 sec
Finished young gen rescan work in 2th thread: 0.035 sec
Finished young gen rescan work in 0th thread: 0.035 sec
Finished young gen rescan work in 1th thread: 0.035 sec
Finished remaining root rescan work in 1th thread: 0.030 sec
Finished remaining root rescan work in 3th thread: 0.033 sec
Finished remaining root rescan work in 2th thread: 0.032 sec
Finished remaining root rescan work in 0th thread: 0.042 sec
Finished dirty card rescan work in 2th thread: 0.029 sec
Finished dirty card rescan work in 3th thread: 0.031 sec
Finished dirty card rescan work in 0th thread: 0.019 sec
Finished dirty card rescan work in 1th thread: 0.031 sec
Finished work stealing in 3th thread: 0.000 sec
Finished work stealing in 0th thread: 0.000 sec
Finished work stealing in 2th thread: 0.000 sec
Finished work stealing in 1th thread: 0.000 sec
, 0.0959640 secs]36799.455: [weak refs processing, 0.0121220 secs] [1 CMS-remark: 1920070K(4456448K)] 2031059K(5111808K), 0.2952350 secs] [Times: user=1.11 sys=0.00, real=0.30 secs]
Total time for which application threads were stopped: 0.3066430 seconds
36799.468: [CMS-concurrent-sweep-start]
36805.824: [CMS-concurrent-sweep: 6.355/6.356 secs] (CMS-concurrent-sweep yielded 515 times)
 [Times: user=8.76 sys=0.24, real=6.35 secs]
36805.824: [CMS-concurrent-reset-start]
36805.903: [CMS-concurrent-reset: 0.079/0.079 secs] (CMS-concurrent-reset yielded 0 times)
 [Times: user=0.64 sys=0.01, real=0.07 secs]
Application time: 6.5590510 seconds
// About here the server starts sending out responses
36806.044: [GC 36806.045: [ParNew
Desired survivor size 67108864 bytes, new threshold 2 (max 15)
- age   1:   55456640 bytes,   55456640 total
- age   2:   54278984 bytes,  109735624 total
: 635276K->131072K(655360K), 0.3230430 secs] 2554507K->2089990K(5111808K) icms_dc=0 , 0.3242750 secs] [Times: user=1.02 sys=0.02, real=0.33 secs]
Total time for which application threads were stopped: 0.3422430 seconds
Application time: 0.7628480 seconds
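
(To line the stalls up with collector activity, one simple approach is to poll the standard `GarbageCollectorMXBean`s from a small background thread and log the deltas next to the response-time log. The sketch below is only illustrative; the class name, sampling interval and use of `System.out` are placeholders, not part of our actual setup.)

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Rough sketch: periodically logs cumulative GC count/time per collector so the
// output can be lined up with the response-time log. Class name, interval and
// System.out are placeholders, not part of the real setup.
public class GcTimeLogger implements Runnable {

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            StringBuilder line = new StringBuilder();
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                line.append(gc.getName())
                    .append(": count=").append(gc.getCollectionCount())
                    .append(" time=").append(gc.getCollectionTime()).append("ms  ");
            }
            System.out.println(System.currentTimeMillis() + " " + line);
            try {
                Thread.sleep(1000L); // 1 s sampling; adjust as needed
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        new Thread(new GcTimeLogger(), "gc-time-logger").start();
    }
}
```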

We have tried various GC setups, but we always end up more or less like this. Our current parameters:

-Xms4096m -Xmx5120m -XX:MaxPermSize=384m
-XX:+UseConcMarkSweepGC -XX:+ExplicitGCInvokesConcurrent 
-XX:+PrintClassHistogram -XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails 
-XX:ReservedCodeCacheSize=64m -XX:NewSize=768m -XX:SurvivorRatio=4 
-XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark -XX:+CMSIncrementalMode 
-XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=10 
-XX:+UseTLAB -XX:ParallelGCThreads=4 
-XX:+UseParNewGC -XX:MaxGCPauseMillis=1000 -XX:GCTimeRatio=10 
-XX:+CMSPermGenSweepingEnabled 
-XX:+PrintCMSInitiationStatistics -XX:PrintCMSStatistics=1 -XX:+PrintGCTaskTimeStamps 
-XX:+PrintGCTimeStamps 
-verbose:gc -XX:+TraceClassLoading -XX:+TraceClassUnloading 
-XX:+PrintTenuringDistribution 

How can we avoid the unresponsiveness?

A: 
  • Use a real-time JVM
  • Try Oracle's JRockit
Dave Jarvis
I'm confused - are you advocating for calling `System.gc()` manually? What would this change?
matt b
If he is advising an explicit call to `System.gc()`, that will probably make things worse.
Stephen C
@Stephen: You are correct, I didn't think through the logic.
Dave Jarvis
A: 

Adding a question as an answer as I don't have the rep yet to add comments. What JRE are you running on?

karianna
JDK 1.6.0_06. Sorry, I forgot to mention that.
Morten Knudsen
Hmm, re-reading your question, I see that you stated 'lately', so does that mean it's been running 'normally' until recently? With roughly the same load/throughput/memory footprint?
karianna
Well, actually it has been there for a while. We did have a period where there were no problems.
Morten Knudsen
Are you able to take some measurements of the load on your server over time? It might not be the GC's fault (as it were), although you may need to change your GC settings in accordance with a changed usage profile of your app. If you can get a similar set-up in pre-PRD/UAT then I can recommend using JMeter alongside a profiling tool to discover what triggers the GC to behave as it does and/or help discover what the true root cause is.
karianna
I also really hate to add this as 'advice', but if you're really desperate you could try simply updating the JRE to 1.6.0_21; there have been some GC fixes and performance enhancements since _06. However, you're obviously best off finding the root cause before applying a fix such as this.
karianna
Yes, we are in the process of updating to u21, in order to see whether it is a bug that has since been fixed.
Morten Knudsen
Cool, please let me know how you get on! I don't normally like 'stab in the dark' attempts, but if this works then I think we'll all be interested in the answer :)
karianna
I might add that our problem looks a LOT like this: http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2010-February/001513.html
Morten Knudsen
If that's the case, then _21 should solve it (as that bug was fixed in _18). Please do let us know how you get on! This is interesting stuff :)
karianna
@karianna How can you tell that this bug is fixed in u18? Thanks.
Morten Knudsen
The Open JDK guy stated it in that thread.
karianna
The bug mentioned in the thread as fixed in u18 concerns the flag -XX:-CMSParallelRemarkEnabled. As I read it, it has no connection to the main problem.
Morten Knudsen
<goes back, re-reads thread>Hmm, yeah looks like you're right</goes back, re-reads thread>. Have you tried the JRE update anyhow?
karianna
Not yet, we are testing it in our staging environment.
Morten Knudsen
We have just switched to u21 in production. It didn't help. :-(
Morten Knudsen
In order to try out profilers, can you replicate the load on your test system? It would be interesting to see where your object spike is. Another question is, does this happen at regular intervals or randomly during the day?
karianna
A: 

The GC logs include lines like this:

Total time for which application threads were stopped: ... seconds

Unless I'm mistaken, they add up to less than 1 second. That does not jibe with your webapp going unresponsive for any significant length of time.
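
(If you want to check this yourself, something along the following lines will sum them up; a rough sketch that assumes the stopped-time lines look exactly like the one quoted above.)

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch: sums the "Total time for which application threads were stopped"
// lines in a gc.log passed as the first argument. Assumes the lines look exactly
// like the one quoted above.
public class SumStoppedTime {
    public static void main(String[] args) throws Exception {
        Pattern p = Pattern.compile(
            "Total time for which application threads were stopped: ([0-9.]+) seconds");
        double total = 0.0;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (m.find()) {
                total += Double.parseDouble(m.group(1));
            }
        }
        in.close();
        System.out.println("Total stopped time: " + total + " seconds");
    }
}
```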

I'm going to suggest that the cause may be something else. One possibility is that you are pushing the limits of your machine, and that the extra load of running the GC is causing requests to back up. For instance, since you've configured 4 parallel GC threads on a 12-core box, I'd expect the theoretical max request rate to drop by roughly 33% (4 of 12 cores) while the GC is running (16 secs?). That could lead to a significant request backlog.

A second possibility is that your machine is starved of physical memory, and that the GC is pushing it into a mild case of virtual-memory thrashing.
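
(An easy way to check the second possibility is to watch swap activity on the box while a CMS cycle runs, e.g. with vmstat, or programmatically. The sketch below just reads the Linux /proc/vmstat counters; the file layout is the only assumption.)

```java
import java.io.BufferedReader;
import java.io.FileReader;

// Rough sketch: prints the cumulative swap-in/swap-out page counters from
// /proc/vmstat (Linux only). Sampling these before and after a CMS cycle shows
// whether the box is actually swapping.
public class SwapActivity {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("/proc/vmstat"));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("pswpin") || line.startsWith("pswpout")) {
                System.out.println(line);
            }
        }
        in.close();
    }
}
```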

Stephen C
Yes, the unresponsiveness does not correlate with the "Application threads were stopped..." statements. However, the problems do correlate with the CMS statements. The request does not seem to decrease only the response times.
Morten Knudsen
@Morten - and that tends to support my alternative explanations, IMO.
Stephen C
@Morten - I do not understand what you mean by *"the request does not seem to decrease only the response times."*?
Stephen C
Sorry, I meant the request *rate*. The responses seem to hang, and then all of a sudden they are all delivered.
Morten Knudsen
@Morten - Have you actually measured the request rates and response times, or is this just an "it feels like ..." observation? In the latter case, you really ought to take some hard measurements before making any conclusions about what might be happening.
Stephen C
Yes, we have a log file which records the request and response times for certain requests.
Morten Knudsen
@Morten - so do your measurements confirm that *absolutely* no data is sent during the GC? Also, what about my two alternative theories? Do you have evidence to disprove either of them?
Stephen C
The machine has plenty of physical memory and does not swap. Yes, the GC will have an impact on performance, but the machine is not stretched to its limits by any means.
Morten Knudsen