I have an application which uses rather a lot of actors: 25,000 to be precise. It uses Scala 2.7.7 and is running on jdk6_u18. It basically listens to and processes market data and has very little state.
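
For reference, the actors are roughly of this shape (a boiled-down, illustrative sketch only - the real message types, class names and processing logic are different):

import scala.actors.Actor
import scala.actors.Actor._

// Stand-in for whatever the market-data feed actually delivers
case class Tick(symbol: String, price: Double)

// One lightweight, almost stateless actor per instrument
class TickListener(val symbol: String) extends Actor {
  def act() {
    loop {
      react {
        case Tick(s, price) if s == symbol =>
          handle(price) // process the update; almost no state is retained
      }
    }
  }
  private def handle(price: Double) { /* recalculate, publish, etc. */ }
}

object MarketApp {
  // really ~25,000 instruments, loaded at startup
  val symbols = List("SYM1", "SYM2", "SYM3")

  def main(args: Array[String]) {
    val listeners = symbols.map { s => val a = new TickListener(s); a.start(); a }
    // the feed handler then sends Tick messages to the relevant listener
  }
}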

It starts at 8.02am every day and within the hour it has crashed with an OutOfMemoryError. "Aha" you say, "you have a memory leak!" Except that when I restart it, it never, ever crashes again for the rest of the day! This despite ramping up both GC and CPU overhead when the US markets open at 2.30pm.

A few anecdotal findings:

  • it runs on Solaris. When I used to run it on Linux, it never crashed at all.
  • I have tried messing with generational heap sizing, allocating more memory, etc., but it seems to make no difference
  • The behaviour of the collector seems to differ when I have verbose:gc switched on

A few questions present themselves:

  1. Why would the behaviour of this program be different between Linux and Solaris?
  2. Why would the behaviour be different between starting it at 8.02 and starting it at 8.42?
  3. I have heard that the actors library had some memory-leak issues. What were they, when were they fixed and how would I discover if anything "similar" might be going on here? (Things to look for in jhat etc)
  4. Does anyone have an idea as to what might be going on?

I am now trying G1 to see if that makes any difference. I will update this question tomorrow!
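
(For the record: on jdk1.6.0_18 G1 is still experimental and has to be switched on explicitly. The traces below come from runs started with roughly the options shown here - heap pinned at 256Mb as before, with the logging flags being the part that gets toggled on and off:)

java -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC \
     -verbose:gc -XX:+PrintGCTimeStamps \
     -Xms256m -Xmx256m ...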

Some output from G1 with verbose:gc on

I think I just caught it in the act:

600.290: [Full GC 255M->144M(256M), 1.5772616 secs]
602.084: [GC pause (young) 227M->145M(256M), 0.0556769 secs]
602.418: [Full GC 254M->144M(256M), 1.6415216 secs]
604.279: [GC pause (young) 227M->145M(256M), 0.0415157 secs]
604.602: [Full GC 255M->145M(256M), 1.6041762 secs]
606.422: [GC pause (young) 227M->145M(256M), 0.0237441 secs]
606.710: [Full GC 254M->145M(256M), 1.6022185 secs]

And then a bit later (you can see that the full GCs are taking longer and reclaiming less):

849.084: [Full GC 254M->176M(256M), 1.9658754 secs]
851.191: [GC pause (young) 228M->176M(256M), 0.0218611 secs]
851.414: [Full GC 254M->176M(256M), 1.9352357 secs]
853.492: [GC pause (young) 228M->176M(256M), 0.0224688 secs]
853.716: [Full GC 254M->176M(256M), 1.9339705 secs]
855.793: [GC pause (young) 228M->176M(256M), 0.0215707 secs]
856.009: [Full GC 254M->176M(256M), 1.9805797 secs]
858.137: [GC pause (young) 228M->176M(256M), 0.0223224 secs]

Some output from G1 with verbose:gc off

It's fine again! *Sigh*

303.656: [GC pause (young) 225M->93M(256M), 0.1680767 secs]
308.060: [GC pause (young) 226M->94M(256M), 0.1793724 secs]
312.746: [GC pause (young) 227M->93M(256M), 0.1674851 secs]
316.162: [GC pause (young) 227M->95M(256M), 0.1826145 secs]
320.147: [GC pause (young) 226M->94M(256M), 0.1656664 secs]
325.978: [GC pause (young) 226M->93M(256M), 0.1475760 secs]
330.176: [GC pause (young) 226M->94M(256M), 0.1727795 secs]

and much, much later it is still OK!

25882.894: [GC pause (young) 224M->125M(256M), 0.2126515 secs]
25884.880: [GC pause (young) 224M->126M(256M), 0.2059802 secs]
25887.027: [GC pause (young) 224M->125M(256M), 0.1851359 secs]
25889.940: [GC pause (young) 223M->126M(256M), 0.2046496 secs]
25891.567: [GC pause (young) 224M->126M(256M), 0.1600574 secs]

and later still, a full GC

37180.191: [GC pause (young) 225M->154M(256M), 0.1716404 secs]
37182.163: [GC pause (young) (initial-mark) 225M->153M(256M)37182.326: [GC concurrent-mark-start], 0.1622246 secs]
37183.089: [GC concurrent-mark-end, 0.7635219 sec]
37183.090: [GC remark, 0.0032547 secs]
37183.093: [GC concurrent-count-start]
37183.297: [GC concurrent-count-end, 0.2043307]
37183.393: [GC cleanup 198M->198M(256M), 0.0068127 secs]
37183.400: [GC concurrent-cleanup-start]
37183.400: [GC concurrent-cleanup-end, 0.0000393]
37183.648: [GC pause (young) 222M->153M(256M), 0.1483041 secs]
37184.235: [GC pause (partial) 171M->91M(256M), 0.2520714 secs]
37187.223: [GC pause (young) 221M->92M(256M), 0.1721220 secs]

UPDATE

Well, since switching to the G1 garbage collector on jdk1.6.0_18 the application has behaved itself for three consecutive days. I suspect Erik is correct in his analysis: in high-throughput bursts the VM promotes short-lived objects into the tenured generation and gets itself into a death spiral.

+1  A: 

At this point, I wonder if it would be worth trying to replace your actors with Lift's or Akka's? Though I think it's unlikely to be a bug in the actors library itself, a different implementation might not tickle whatever it is that causes the crash.

Daniel
You give up easily!
oxbow_lakes
@oxbow What can I say? I'm lazy. ;-)
Daniel
+3  A: 

Do you have any reason to expect your heap to slowly grow in size? It looks like it is growing in both traces. One thing I've done on many occasions is shrink my heap to try to make the problem worse (so it shows up sooner). However, 256M is about the lower limit for Scala.

One thing I've noticed before is that if you have short-lived objects that make it out of the eden generation due to too much pressure, it can gradually kill you. This can happen when there's a burst of activity (maybe you have a burst in the morning?) and your eden space isn't big enough. Scala in general, and actors in particular, make heavy use of small, short-lived objects, and there seems to be a magic threshold: once you cross it, it's all downhill. So one run will be fine and the next crashes and burns.
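
If you want to play with that, the usual knobs are the young-generation sizing flags; the numbers below are purely illustrative and would need tuning against your actual allocation rate:

-Xms512m -Xmx512m -Xmn192m -XX:SurvivorRatio=8

The idea is simply to give eden and the survivor spaces enough headroom that burst-time garbage dies young instead of being promoted.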

Another thing I've noticed before is that the GC settings that work well on OSX/x86 often do not work well on Sparc/Solaris and vice versa. If you're on CoolThreads, I suggest you configure the GC to have one GC thread per thread in your scheduler/pool.
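
For example, if your scheduler pool is capped at 8 threads, something like the following (with one of the parallel collectors) stops the GC threads fighting your worker threads for cores - treat the number as a starting point, not a recommendation:

-XX:+UseParallelGC -XX:ParallelGCThreads=8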

Which brings up another thing - make sure the scheduler isn't creating new threads willy-nilly. It will do that sometimes. I'd say you should almost always manually set a cap on the threads. I don't know how relevant it is, but an interesting factoid about the fork-join scheduler that actors use by default is that it is intended for short, CPU-bound tasks. Doing I/O on the threads it manages screws up its assumptions. Of course it should still work, but...
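
In Scala 2.7 the actor scheduler's pool can be capped from the command line with system properties - I'm going from memory on the exact property names, so check the scheduler source for your version:

-Dactors.corePoolSize=8 -Dactors.maxPoolSize=8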

Good luck! I've lost many, many days of my life to such problems.

Take a look at some of the options here: http://java.sun.com/performance/reference/whitepapers/tuning.html

It looks like you are using the concurrent-mark-sweep collector. Try setting:

-XX:+UseParallelGC
Erik Engbrecht
Hi Erik - No - I've added a third trace. The second example was quite stable and still going strong after 6 hours. A full GC would bring the memory down below 100Mb, whereupon it would slowly creep back up to 150Mb before a full GC kicked in again
oxbow_lakes
I've set the max pool size of the actor thread pool to 8. Your point about short bursts of creation in the morning is probably correct, but I can't quite understand it. When markets open at 8am there's obviously a ramping up of messages. These stay roughly constant until the US opens at 2.30pm, when activity doubles. I was thinking something along the lines of "my app can't grow the heap fast enough", so I set Xms/Xmx both to 256Mb and scheduled the app to start at 8.02am, but this hasn't worked. If the move to G1 does not prove fruitful, I'll look at changing the Eden ratios in the old collector
oxbow_lakes
GC performance can vary widely depending on what optimizations HotSpot has performed on the code. If you are hitting a cold VM with a high load that then tapers off that may stress it too much. By the time US markets open the VM may be warmed up and better able to handle the load.
Erik Engbrecht
I have looked around for tips on good VM options to use in certain situations without much success. The general opinion seems to be *don't use any bespoke options; let the VM decide what to do itself*
oxbow_lakes
@Erik - I have other actor-based applications which run happily with <10Mb of heap. Can you expand on what you mean by "Scala needs 256Mb?"
oxbow_lakes
I just find Scala works better with more memory. 256M is not a scientific number; rather, I find that if I don't want to be bothered with tuning, setting the heap to 512m (or often 1g) works better. It gives HotSpot more breathing room.
Erik Engbrecht