Not sure if this would be better suited for ServerFault, but since I am not an admin but a developer I figured I would try SO.

We've been struggling to keep our multi-server configuration stable for quite some time now. At the end of last month we were running CF 7.0.2 on a two-server setup (one instance each). At that point we managed to get our uptime to around one week per instance before they would restart by themselves. At the beginning of this month we upgraded to CF 9, and we're back to square one with multiple restarts a day.

Our current configuration is 2 Win2k3 servers, running a cluster of 4 instances, 2 instances per server. At this point we are pretty certain this is due to improper JVM settings.

We've been toying with them, and while some settings are more stable than others, we never quite got it right.

From the default:

java.args=-server -Xmx512m -Dsun.io.useCanonCaches=false -XX:MaxPermSize=192m -XX:+UseParallelGC -Dcoldfusion.rootDir={application.home}/

To currently:

java.args=-server -Xmx896m -Dsun.io.useCanonCaches=false -XX:MaxPermSize=512m -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:+UseParallelGC -Dcoldfusion.rootDir={application.home}/ -verbose:gc -Xloggc:c:/Jrun4/logs/gc/gcInstance1b.log

We have determined that we do need more than the default 512 MB simply by monitoring with FusionReactor: on average our memory consumption hovers in the mid-300 MB range and can go up to the low 700 MB range under heavy load.
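(For anyone who wants to sanity-check the same numbers without FusionReactor, a plain-Java sketch against java.lang.management reports the same pools, including the perm gen one. This is illustrative only, not something we run in production, and the class name is made up.)

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryUsage;

    public class MemoryProbe {
        public static void main(String[] args) {
            // Overall heap numbers (used / committed / max), same idea as the hs_err summary.
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.println("heap: " + toMB(heap));
            // Per-pool numbers; with the parallel collector one of the pools is "PS Perm Gen".
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                System.out.println(pool.getName() + ": " + toMB(pool.getUsage()));
            }
        }

        private static String toMB(MemoryUsage u) {
            // max can be -1 (undefined) for some pools.
            return (u.getUsed() >> 20) + "MB used, " + (u.getCommitted() >> 20)
                    + "MB committed, " + (u.getMax() >> 20) + "MB max";
        }
    }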

Most of the crashes are logged in jrun4/bin/hs_err_pid*.log, always with an "Out of swap space" error.

I've attached links to the hs_err and garbage collector log files from yesterday at the bottom of the post.

The relevant part is (I think) this:

Heap
 PSYoungGen      total 89856K, used 19025K [0x55490000, 0x5b6f0000, 0x5b810000)
  eden space 79232K, 16% used [0x55490000,0x561a64c0,0x5a1f0000)
  from space 10624K, 52% used [0x5ac90000,0x5b20e2f8,0x5b6f0000)
  to   space 10752K, 0% used [0x5a1f0000,0x5a1f0000,0x5ac70000)
 PSOldGen        total 460416K, used 308422K [0x23810000, 0x3f9b0000, 0x55490000)
  object space 460416K, 66% used [0x23810000,0x36541bb8,0x3f9b0000)
 PSPermGen       total 107520K, used 106079K [0x03810000, 0x0a110000, 0x23810000)
  object space 107520K, 98% used [0x03810000,0x09fa7e40,0x0a110000)

From it, I gather that it's the PSPermGen that is full (most logs show the same before a crash), which is why we increased MaxPermSize, but the total still shows as 107520K!?

No one here is a JRun expert, so any help or even ideas on what to try next would be greatly appreciated!

The log files: sorry, I know Sendspace isn't the friendliest of places - if you have other host suggestions for log files, let me know and I'll update the post (SO doesn't like them inline; it blows up the formatting of the post).

+2  A: 

This is an effect that could have many causes -- anything from the way your application is constructed (unconventional usage of application or server scope? Bad database drivers and connection management? Parsing giant XML files? Use of CFHTTP or other external resources? Problems with native session replication?) to your coding practices (var scoping everywhere?) to the kinds of CPUs in your servers. It's not likely you'll come up with some magic bullet JVM settings without much analysis (and perhaps not even then). But for starters, why do you have such an unusually large PermGen? Seems like a peculiar pattern, but of course I don't know anything about your app.

It seems you have little to lose by trying some different garbage collectors. If appropriate to your JVM version, try:

-XX:+UseConcMarkSweepGC -XX:+UseParNewGC 

and add in:

-XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled

which may help manage your large PermGen. Don't forget to take out -XX:+UseParallelGC if you try these.
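For example, the combined line might look something like this (just a sketch based on your current settings with the collector swapped; the sizes are yours, and you should verify each flag against your exact JVM version):

java.args=-server -Xmx896m -Dsun.io.useCanonCaches=false -XX:MaxPermSize=512m -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled -Dcoldfusion.rootDir={application.home}/ -verbose:gc -Xloggc:c:/Jrun4/logs/gc/gcInstance1b.log

It's also worth confirming that the larger MaxPermSize is actually being picked up: jstat -gccapacity <jrun pid> reports the perm gen ceiling in the PGCMX column.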

Ken Redler
We did see errors with session replication where the memory usage would shoot up; we disabled it on the cluster while troubleshooting the other issue. I'll try the other GCs and see if they're better. We increased MaxPermSize after seeing it was full in all the hs_err_*.log files --- PSPermGen total 107520K, used 106079K [0x03810000, 0x0a110000, 0x23810000) object space 107520K, 98% used [0x03810000,0x09fa7e40,0x0a110000)
jfrobishow
My comment about the big PermGen doesn't mean it's wrong -- just that it's not a pattern I've commonly seen. Good luck!
Ken Redler
+1  A: 

A little update: I've tried different GCs, and while some stabilized the system for a while, it kept crashing, only less frequently. So I kept digging and eventually found out that the JVM will report "Out of swap space" when the OS itself refuses to allocate the memory requested.

This usually happens when the maximum memory is already assigned to the JVM process: the JRun overhead, the JVM itself, all the libraries, the heap AND the stacks. Since each request lives on a thread's stack, if you have a lot of requests being spawned the total stack space will grow and grow. The stack size of each thread varies according to the OS and the version of the JVM, but it can be controlled using the -Xss argument. I reduced ours to 64k, so our java.args looks like this:

java.args=-server -Xmx768m -Xss64k -Dsun.io.useCanonCaches=false -XX:MaxPermSize=512m -XX:+UseParallelGC -Dcoldfusion.rootDir={application.home}/ -verbose:gc -Xloggc:c:/Jrun4/logs/gc/gcInstance2a.log
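To put rough numbers on it (back-of-the-envelope guesses, not measurements): a 32-bit process on Win2k3 normally gets about 2 GB of user address space. With -Xmx768m plus -XX:MaxPermSize=512m the JVM reserves roughly 1.3 GB up front, and JRun, the JVM code, the JDBC drivers and other native libraries take their share on top of that. Whatever is left has to hold the thread stacks: 300 request threads at, say, a 512 KB default stack would need about 150 MB, while the same 300 threads at -Xss64k need under 20 MB, which is where the extra headroom comes from.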

So far everything has been stable without any noticeable slowdown for 6 days, which is definitely the longest I've ever seen the application stay up. If you reduce the stack size too much, you'll start noticing stack overflow errors in the logs instead of the OOM errors.
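To illustrate that trade-off outside of JRun (a standalone sketch, class name made up; the per-thread stack size passed to Thread is only a hint and is platform-dependent):

    public class StackDepthDemo {
        static int depth;

        static void recurse() {
            depth++;
            recurse(); // recurse until the thread's stack is exhausted
        }

        public static void main(String[] args) throws InterruptedException {
            // Compare how deep a thread can recurse with a small vs. a larger stack.
            for (final long stackBytes : new long[] { 64 * 1024, 512 * 1024 }) {
                depth = 0;
                Thread t = new Thread(null, new Runnable() {
                    public void run() {
                        try {
                            recurse();
                        } catch (StackOverflowError expected) {
                            // the smaller the stack, the sooner this happens
                        }
                    }
                }, "stack-demo", stackBytes);
                t.start();
                t.join();
                System.out.println((stackBytes / 1024) + "k stack -> depth " + depth);
            }
        }
    }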

My next step will be to tweak the MaxPermSize but so far so good!

jfrobishow