Over the past year I've made huge improvements in my application's Java heap usage--a solid 66% reduction. In pursuit of that, I've been monitoring various metrics, such as Java heap size, CPU, Java non-heap, etc. via SNMP.
Recently, I've been monitoring how much real memory (RSS, resident set size) is used by the JVM and am somewhat surprised. The real memory consumed by the JVM seems totally independent of my application's heap size, non-heap, eden space, thread count, etc.
Heap Size as measured by Java SNMP
Real Memory in KB. (E.g.: a reading of 1 M on the KB axis = 1 GB)
(The three dips in the heap graph correspond to application updates/restarts.)
This is a problem for me because all that extra memory the JVM is consuming is 'stealing' memory that could be used by the OS for file caching. In fact, once the RSS value reaches ~2.5-3GB, I start to see slower response times and higher CPU utilization from my application, mostly due to IO wait. At some point paging to the swap partition kicks in. This is all very undesirable.
So, my questions:
- Why is this happening? What is going on "under the hood"?
- What can I do to keep the JVM's real memory consumption in check?
The gory details:
- RHEL4 64-bit (Linux - 2.6.9-78.0.5.ELsmp #1 SMP Wed Sep 24 ... 2008 x86_64 ... GNU/Linux)
- Java 6 (build 1.6.0_07-b06)
- Tomcat 6
- Application (on-demand HTTP video streaming)
- High I/O via java.nio FileChannels
- Hundreds to low thousands of threads
- Low database use
- Spring, Hibernate
Relevant JVM parameters:
-Xms128m
-Xmx640m
-XX:+UseConcMarkSweepGC
-XX:+AlwaysActAsServerClassMachine
-XX:+CMSIncrementalMode
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+CMSLoopWarn
-XX:+HeapDumpOnOutOfMemoryError
How I measure RSS:
ps x -o command,rss | grep java | grep latest | cut -b 17-
This goes into a text file and is read into an RRD database by the monitoring system at regular intervals. Note that ps outputs RSS in kilobytes.
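As a cross-check on the ps number, the same figure can usually be read from the VmRSS line of /proc/<pid>/status on Linux. The following is only a minimal sketch under that assumption: the class and method names are made up for illustration, and it is written in Java 6 style (no try-with-resources) to match the JVM above.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical helper, not part of any library.
class RssReader {

    /** Parses a line such as "VmRSS:   2621440 kB"; returns -1 for any other line. */
    static long parseVmRssKb(String line) {
        if (!line.startsWith("VmRSS:")) {
            return -1;
        }
        // Drop the label, then take the first whitespace-separated token (the number).
        String[] parts = line.substring("VmRSS:".length()).trim().split("\\s+");
        return Long.parseLong(parts[0]);
    }

    /** Scans a status file (e.g. "/proc/self/status") and returns RSS in KB, or -1 if absent. */
    static long readRssKb(String statusPath) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(statusPath));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                long kb = parseVmRssKb(line);
                if (kb >= 0) {
                    return kb;
                }
            }
            return -1;
        } finally {
            in.close();
        }
    }
}
```

Calling RssReader.readRssKb("/proc/self/status") from inside the application would report the JVM's own RSS, in the same unit (KB) that ps prints.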
The Problem & Solution*s*:
While in the end it was ATorras's answer that proved ultimately correct, it was kdgregory who guided me to the correct diagnostic path with the use of pmap. (Go vote up both their answers!) Here is what was happening:
Things I know for sure:
- My application records and displays data with JRobin 1.4, something I coded into my app over three years ago.
- The busiest instance of the application currently creates
- Over 1000 a few new JRobin database files (at about 1.3MB each) within an hour of starting up
- ~100+ each day after start-up
- The app updates these JRobin database objects once every 15s, if there is something to write.
- In the default configuration JRobin:
  - uses a java.nio-based file access back-end. This back-end maps MappedByteBuffers to the files themselves.
  - once every five minutes a JRobin daemon thread calls MappedByteBuffer.force() on every underlying JRobin database MBB
- pmap listed:
  - 6500 mappings
  - 5500 of which were 1.3MB JRobin database files, which works out to ~7.1GB
That last point was my "Eureka!" moment.
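The same tally can be made without eyeballing pmap output by summing the address ranges in /proc/<pid>/maps, which pmap reads from. This is a rough sketch assuming the usual maps line layout (address range, permissions, offset, device, inode, path); the class name and the ".jrb" suffix in the usage note are hypothetical stand-ins for whatever extension the database files actually use.

```java
// Hypothetical helper, not part of any library.
class MapsSummary {

    /**
     * Counts the mappings whose backing path ends with the given suffix
     * and sums their sizes. Returns { count, totalBytes }.
     */
    static long[] summarize(Iterable<String> mapsLines, String suffix) {
        long count = 0;
        long bytes = 0;
        for (String line : mapsLines) {
            // Fields: range, perms, offset, dev, inode, path (path may be missing).
            String[] f = line.trim().split("\\s+", 6);
            if (f.length < 6 || !f[5].endsWith(suffix)) {
                continue;
            }
            // The range is "start-end" in hex; its size is end minus start.
            String[] range = f[0].split("-");
            bytes += Long.parseLong(range[1], 16) - Long.parseLong(range[0], 16);
            count++;
        }
        return new long[] { count, bytes };
    }
}
```

Feeding it the lines of /proc/<pid>/maps with a suffix such as ".jrb" gives the number of mapped database files and their combined size, which is the figure that should be compared against RSS.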
My corrective actions:
- Consider updating to the latest JRobinLite 1.5.2 which is apparently better
- Implement proper resource handling on JRobin databases. At the moment, once my application creates a database, it never dumps it, even after the database is no longer actively used.
- Experiment with moving the MappedByteBuffer.force() to database update events, and not a periodic timer. Will the problem magically go away?
- Immediately, change the JRobin back-end to the java.io implementation--a one-line change. This will be slower, but it is possibly not an issue. Here is a graph showing the immediate impact of this change.
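To illustrate why a java.io-style back-end avoids the problem, here is a minimal sketch of unmapped file access. It is not JRobin's actual code, and the class and method names are invented. Each access opens the file, seeks, reads or writes, and closes again, so no mapping is left behind in the process address space.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical illustration of plain java.io access, not JRobin's back-end.
class PlainFileBackendSketch {

    /** Writes one long at the given offset, then releases the file handle. */
    static void writeLong(File file, long offset, long value) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(file, "rw");
        try {
            raf.seek(offset);
            raf.writeLong(value);
        } finally {
            raf.close(); // nothing stays mapped after this point
        }
    }

    /** Reads one long from the given offset, then releases the file handle. */
    static long readLong(File file, long offset) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        try {
            raf.seek(offset);
            return raf.readLong();
        } finally {
            raf.close();
        }
    }
}
```

The trade-off is a system call per access instead of a memory write, which is the "slower" part; the pages touched live in the OS file cache rather than being counted against the JVM's RSS.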
Questions that I may or may not have time to figure out:
- What is going on inside the JVM with MappedByteBuffer.force()? If nothing has changed, does it still write the entire file? Part of the file? Does it load it first?
- Is there a certain amount of the MBB always in RSS at all times? (RSS was roughly half the total allocated MBB sizes. Coincidence? I suspect not.)
- If I move the MappedByteBuffer.force() to database update events, and not a periodic timer, will the problem magically go away?
- Why was the RSS slope so regular? It does not correlate to any of the application load metrics.
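For anyone who wants to experiment with these questions, the pattern under discussion can be reproduced in a few lines. This is a small sketch of the general map-write-force sequence, not JRobin's implementation; the class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical test harness for the map/force pattern.
class MappedFileSketch {

    /** Maps the file read-write, writes one long, forces it to disk, and reads it back. */
    static long mapWriteForce(File file, int size) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(file, "rw");
        try {
            raf.setLength(size);
            FileChannel channel = raf.getChannel();
            MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            mbb.putLong(0, 42L); // touches the first page of the mapping
            mbb.force();         // asks the OS to write changes back to the file
            return mbb.getLong(0);
        } finally {
            // Closing the file does not unmap the buffer. Per the Javadoc, the
            // mapping stays valid until the buffer itself is garbage-collected.
            raf.close();
        }
    }
}
```

Running this in a loop over many files while watching RSS and pmap would show how much of each mapping becomes resident after force(), and whether holding on to the buffers (as a never-closed database would) keeps the mappings alive.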