I have an application which uses rather a lot of actors: 25,000 to be precise. It uses Scala 2.7.7 and runs on jdk6_u18. It basically listens to and processes market data and has very little state.

It starts at 8.02am every day and within the hour it has crashed with an OutOfMemoryError. "Aha," you say, "you have a memory leak!" Except that when I restart it, it never, ever crashes again for the rest of the day. This despite ramping up both GC and CPU overhead when the US markets open at 2.30pm.
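For context, the actors look roughly like the sketch below (hypothetical names, not my actual code): event-based `react` actors, so that 25,000 of them share a thread pool rather than each holding a thread.

```scala
import scala.actors.Actor

// Hypothetical message type; the real messages are market-data ticks.
case class Tick(symbol: String, price: Double)

// A minimal event-based actor: `react` suspends without tying up a
// thread between messages, which matters with ~25,000 of these alive.
class Listener extends Actor {
  def act() {
    loop {
      react {
        case Tick(sym, px) => process(sym, px)
      }
    }
  }

  // Very little state is kept per actor.
  private def process(sym: String, px: Double) {}
}
```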
A few anecdotal findings:
- it runs on Solaris. When I used to run it on Linux, it never crashed at all.
- I have tried messing with generational heap sizing, allocating more memory, etc. I don't think it makes any difference
- The behaviour of the collector seems to differ depending on whether I have `-verbose:gc` switched on
A few questions present themselves:
- Why would the behaviour of this program be different between Linux and Solaris?
- Why would the behaviour be different between starting it at 8.02 and starting it at 8.42?
- I have heard that the actors library had some memory-leak issues. What were they, when were they fixed and how would I discover if anything "similar" might be going on here? (Things to look for in jhat etc)
- Does anyone have an idea as to what might be going on?
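On the jhat point: the usual way to look for a leak like this (assuming the standard Sun JDK 6 tools) is to dump the heap while the problem is happening and browse the retained sets:

```shell
# Take a binary heap dump of the running JVM
# (replace <pid> with the application's process id)
jmap -dump:format=b,file=heap.hprof <pid>

# Serve the dump at http://localhost:7000; things worth looking for
# include large retained sets under scala.actors classes, e.g.
# mailbox queues of messages that were never consumed, or scheduler
# references keeping terminated actors alive.
jhat heap.hprof
```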
I am now trying the G1 collector to see if that makes any difference. I will update this question tomorrow!
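For reference, G1 is still experimental on jdk6, so (if I have the flags right) it has to be unlocked explicitly; the 256M heap matches the sizes in the logs below, and the jar name is made up:

```shell
# Both flags are needed to enable G1 on JDK 6, where it is experimental
java -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC \
     -Xms256m -Xmx256m \
     -jar marketdata-app.jar   # hypothetical jar name
```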
Some output from G1 with `-verbose:gc` on
I think I just caught it in the act:
600.290: [Full GC 255M->144M(256M), 1.5772616 secs]
602.084: [GC pause (young) 227M->145M(256M), 0.0556769 secs]
602.418: [Full GC 254M->144M(256M), 1.6415216 secs]
604.279: [GC pause (young) 227M->145M(256M), 0.0415157 secs]
604.602: [Full GC 255M->145M(256M), 1.6041762 secs]
606.422: [GC pause (young) 227M->145M(256M), 0.0237441 secs]
606.710: [Full GC 254M->145M(256M), 1.6022185 secs]
And then a bit later (you can see that each full GC is taking longer and reclaiming less):
849.084: [Full GC 254M->176M(256M), 1.9658754 secs]
851.191: [GC pause (young) 228M->176M(256M), 0.0218611 secs]
851.414: [Full GC 254M->176M(256M), 1.9352357 secs]
853.492: [GC pause (young) 228M->176M(256M), 0.0224688 secs]
853.716: [Full GC 254M->176M(256M), 1.9339705 secs]
855.793: [GC pause (young) 228M->176M(256M), 0.0215707 secs]
856.009: [Full GC 254M->176M(256M), 1.9805797 secs]
858.137: [GC pause (young) 228M->176M(256M), 0.0223224 secs]
Some output from G1 with `-verbose:gc` off
It's fine again! *Sigh*
303.656: [GC pause (young) 225M->93M(256M), 0.1680767 secs]
308.060: [GC pause (young) 226M->94M(256M), 0.1793724 secs]
312.746: [GC pause (young) 227M->93M(256M), 0.1674851 secs]
316.162: [GC pause (young) 227M->95M(256M), 0.1826145 secs]
320.147: [GC pause (young) 226M->94M(256M), 0.1656664 secs]
325.978: [GC pause (young) 226M->93M(256M), 0.1475760 secs]
330.176: [GC pause (young) 226M->94M(256M), 0.1727795 secs]
and much, much later it is still OK!
25882.894: [GC pause (young) 224M->125M(256M), 0.2126515 secs]
25884.880: [GC pause (young) 224M->126M(256M), 0.2059802 secs]
25887.027: [GC pause (young) 224M->125M(256M), 0.1851359 secs]
25889.940: [GC pause (young) 223M->126M(256M), 0.2046496 secs]
25891.567: [GC pause (young) 224M->126M(256M), 0.1600574 secs]
and later still, a full GC
37180.191: [GC pause (young) 225M->154M(256M), 0.1716404 secs]
37182.163: [GC pause (young) (initial-mark) 225M->153M(256M)37182.326: [GC concurrent-mark-start], 0.1622246 secs]
37183.089: [GC concurrent-mark-end, 0.7635219 sec]
37183.090: [GC remark, 0.0032547 secs]
37183.093: [GC concurrent-count-start]
37183.297: [GC concurrent-count-end, 0.2043307]
37183.393: [GC cleanup 198M->198M(256M), 0.0068127 secs]
37183.400: [GC concurrent-cleanup-start]
37183.400: [GC concurrent-cleanup-end, 0.0000393]
37183.648: [GC pause (young) 222M->153M(256M), 0.1483041 secs]
37184.235: [GC pause (partial) 171M->91M(256M), 0.2520714 secs]
37187.223: [GC pause (young) 221M->92M(256M), 0.1721220 secs]
UPDATE
Well, since switching to the G1 garbage collector on jdk1.6.0_18, the application has behaved itself on three consecutive days. I suspect Erik's analysis is correct: in high-throughput situations, once objects have been promoted into the tenured generation, the VM gets itself into a death spiral.