views: 53

answers: 3

I'm doing some load testing on a web app deployed in JBoss. It starts up fine, but as the test ramps up and more simulated users start hitting JBoss, performance degrades severely:

Response time chart

Connecting VisualVM to it, I can see the threads are all fine at first, then suddenly start spending most of their time waiting on a monitor (green is running, red is monitor, yellow is wait):

Thread state graph

Running jstack, I see the threads are all waiting in the same place:

"http-0.0.0.0-8080-172" daemon prio=6 tid=0x000000005da90000 nid=0xd2c waiting for monitor entry [0x000000006cb4e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.log4j.Category.callAppenders(Category.java:185)
    - waiting to lock  (a org.apache.log4j.spi.RootCategory)
    at org.apache.log4j.Category.forcedLog(Category.java:372)
    at org.apache.log4j.Category.debug(Category.java:241)
    [my code]

Most of the ~200 HTTP processor threads are waiting for the same monitor. Looking at log4j.xml for the WAR, it has a single appender set up, for CONSOLE. I delete the appender and try my test again. Same behavior, except now jstack shows all the threads waiting in a different place:

"http-0.0.0.0-8080-251" daemon prio=6 tid=0x0000000059811800 nid=0x1108 waiting for monitor entry [0x0000000073ebe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.util.Hashtable.get(Hashtable.java:333)
    - waiting to lock  (a org.jboss.util.property.PropertyMap)
    at java.util.Properties.getProperty(Properties.java:932)
    at org.jboss.util.property.PropertyMap.getProperty(PropertyMap.java:626)
    at java.lang.System.getProperty(System.java:653)
    at org.jaxen.saxpath.helpers.XPathReaderFactory.createReader(XPathReaderFactory.java:109)
    at org.jaxen.BaseXPath.&lt;init&gt;(BaseXPath.java:124)
    at org.jaxen.BaseXPath.&lt;init&gt;(BaseXPath.java:153)
    at nu.xom.JaxenConnector.&lt;init&gt;(JaxenConnector.java:49)
    at nu.xom.Node.query(Node.java:424)
    [my code]

Changing nothing, I restart JBoss, run the test, then run jstack once it gets slow. This time all the threads are waiting in yet another place:

"http-0.0.0.0-8080-171" daemon prio=6 tid=0x000000005d0d1000 nid=0x15d4 waiting for monitor entry [0x000000006cb4e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:118)
    - waiting to lock  (a sun.nio.cs.StandardCharsets)
    at java.nio.charset.Charset.lookup2(Charset.java:449)
    at java.nio.charset.Charset.lookup(Charset.java:437)
    at java.nio.charset.Charset.isSupported(Charset.java:479)
    at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:49)
    at java.io.InputStreamReader.&lt;init&gt;(InputStreamReader.java:57)
    at java.io.FileReader.&lt;init&gt;(FileReader.java:41)
    [my code]

What in the heck is going on? I have used jstack in the past, and when I run it while things are going smoothly I get the expected results, so I assume jstack itself is fine. Any idea what could cause such weird behavior? Any ideas on where to go from here?

+2  A: 

This sort of behaviour is to be expected. As you scale up a load test, you're always going to find bottlenecks, and in a complex system, those bottlenecks are going to shift around.

Your job is to identify those bottlenecks and try to fix them, one at a time. Each time you do, you'll always find another one, but hopefully the system will scale better each time. It's not easy, but then scaling for load isn't easy.

  • Take your 1st example. You have lots of calls to log4j's Logger.debug() method. Log4j doesn't perform well when logging under load, so you need to take some precautions. Even if your log4j config says "do not log DEBUG messages", log4j still has to do some work before realising this. The recommended approach is to wrap every Logger.debug() call in an `if (logger.isDebugEnabled()) { logger.debug(...); }` block (see the sketch after this list). This should shift that particular bottleneck.

  • In your 2nd example, you're calling XOM's Node.query() method. This method has to recompile the XPath expression on every call, and that recompilation seems to be the bottleneck. Find an API where you can pre-compile the XPath expression and re-use it (also sketched below).

  • In the 3rd example, you're reading a File. That isn't a good idea in a high-load system; file I/O is slow when you're doing a large number of small operations. Consider re-implementing this to work a different way if you can.
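To make the first two points concrete, here is a minimal sketch. It is not code from the question: the class name, method, and XPath expression are hypothetical, and it uses the standard javax.xml.xpath API as a stand-in for XOM's Node.query(), since the standard API lets you compile an expression once and re-use it.

    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathExpressionException;
    import javax.xml.xpath.XPathFactory;

    import org.apache.log4j.Logger;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class RequestHandler {

        private static final Logger log = Logger.getLogger(RequestHandler.class);

        // Compile the XPath once per thread instead of on every request.
        // XPathExpression is not guaranteed to be thread-safe, hence the ThreadLocal.
        private static final ThreadLocal<XPathExpression> ITEM_XPATH =
                new ThreadLocal<XPathExpression>() {
                    @Override
                    protected XPathExpression initialValue() {
                        try {
                            return XPathFactory.newInstance().newXPath().compile("//item");
                        } catch (XPathExpressionException e) {
                            throw new IllegalStateException(e);
                        }
                    }
                };

        public void handle(Document doc, String requestId) throws XPathExpressionException {
            // Guard the debug call so the message string is only built (and the
            // appender lock only touched) when DEBUG is actually enabled.
            if (log.isDebugEnabled()) {
                log.debug("Handling request " + requestId);
            }

            // Re-use the pre-compiled expression instead of re-parsing the
            // XPath string on every call.
            NodeList items = (NodeList) ITEM_XPATH.get().evaluate(doc, XPathConstants.NODESET);

            if (log.isDebugEnabled()) {
                log.debug("Matched " + items.getLength() + " items");
            }
        }
    }

The ThreadLocal is only one option; a per-request compile that you cache yourself would also avoid re-parsing the expression under load.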

All of these are unrelated, but all present performance bottlenecks you'll see when scaling for load. You'll never get rid of them all, but hopefully you can get it to a point where it's good enough.

skaffman
I'm not seeing bottlenecks. Log4j calls (even writing to a file serially), calling System.getProperty, and calling Charset.isSupported are fast operations, not calls that normally need to be avoided. I'm only running 250 threads, and these locks are normally released very quickly. 100+ threads waiting on System.getProperty means the thread holding the Hashtable lock is holding it for a long time, and that thread is just calling Hashtable.get. There are few reasons for that to take a long time. Maybe jstack is misleading? That seems unlikely, since it doesn't show blocked threads when the web app is running smoothly.
NateS
@NateS: They may be fast operations, but they're single-threaded, and that makes them bottlenecks. It doesn't matter how fast they are, they're not concurrent. Your own stack dumps prove this.
skaffman
@skaffman, I can't believe that a simple test program that had 250 threads all calling System.getProperties would show the same contention problems I am seeing with my app.
NateS
@NateS: It doesn't matter if you believe it or not, the evidence is right there. Anyway, you have my advice, it is of course up to you if you want to act on it.
skaffman
@skaffman, I do appreciate the help. My point is, the evidence doesn't add up to your conclusion. The problem isn't XOM, but System.getProperty. This is called throughout code that isn't mine and cannot be avoided. E.g., JBoss calls it without involving my code at all, and I can't avoid using JBoss. Likewise, the problem isn't reading a File, but Charset.isSupported. That is called by a String constructor, URLEncoder, etc. I can't avoid creating Strings! Minor contention over these resources is fine, but I am baffled as to why these locks are held long enough to cause problems with only 250 threads.
NateS
A: 

I set up the application in Tomcat running through Eclipse and did not see the problem. Eventually I found we were starting JBoss using a 32-bit Windows service wrapper, even though we were using a 64-bit JDK. The machine was 64-bit. I'm not sure how this even worked. At any rate, switching to a 32-bit JDK (to match the wrapper) made the crazy problem go away and I was able to move on with my life.

NateS
A: 

I think we're seeing a similar situation on our 64-bit servers. Any chance you were running JRockit rather than the Sun JVM? I'm wondering if this is a bug in the 64-bit JRockit, since we both saw our 64-bit servers locking on Hashtable. If you could reply, I would really appreciate it.

Thanks!

mountainman99
Was a Sun JVM, not JRockit.
NateS