views: 544

answers: 6

I have to find a memory leak in a Java application. I have some experience with this, but I would like advice on a methodology/strategy for this. Any references and advice are welcome.

About our situation:

  1. Heap dumps are larger than 1 GB
  2. We have heap dumps from 5 occasions.
  3. We don't have any test case to provoke this. It only happens in the (massive) system test environment after at least a week's usage.
  4. The system is built on an internally developed legacy framework with so many design flaws that it is impossible to count them all.
  5. Nobody understands the framework in depth. It has been transferred to one guy in India who barely keeps up with answering e-mails.
  6. We have done snapshot heap dumps over time and concluded that there is not a single component increasing over time; instead, everything grows slowly.
  7. The above points us in the direction that it is the framework's homegrown ORM system that increases its usage without limit. (This system maps objects to files?! So not really an ORM.)

Question: What is the methodology that helped you succeed in hunting down leaks in an enterprise-scale application?

+4  A: 

Take a look at Eclipse Memory Analyzer. It's a great plugin which 1) can open up very large heaps very fast and 2) has some pretty good automatic detection tools. The latter isn't perfect, but EMA provides a lot of really nice ways to navigate through and query the objects in the dump to find any possible leaks.

I've used it in the past to help hunt down suspicious leaks.

matt b
I used this to successfully analyze a ~180 meg heap dump just yesterday, works like a charm.
Esko
Eclipse MAT is amazing, especially its Memory Leak Detector.
Pascal Thivent
Yes, this is what we are mainly using. It works nicely with at least 1.5 GB heap dumps on 64-bit Linux (of course, 32-bit Windows fails fast). The only downside is that I haven't gotten very much useful help from its automated analysis.
Rickard von Essen
+1  A: 

If it's happening after a week's usage, and your application is as byzantine as you describe, perhaps you're better off restarting it every week?

I know it's not fixing the problem, but it may be a time-effective solution. Are there time windows when you can have outages? Can you load balance and fail over one instance whilst keeping the second up? Perhaps you can trigger a restart when memory consumption breaches a certain limit (perhaps monitoring via JMX or similar).
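For the monitoring part, here is a minimal in-process sketch of the JMX idea; the 90% threshold, the one-minute interval, and the plain println standing in for a real restart signal are all placeholder assumptions, not something from this answer:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapWatchdog {
        // 90% is an arbitrary placeholder threshold, not a recommendation.
        private static final double RESTART_THRESHOLD = 0.90;

        public static void main(String[] args) throws InterruptedException {
            MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
            while (true) {
                MemoryUsage heap = memoryBean.getHeapMemoryUsage();
                long max = heap.getMax();              // -1 if no -Xmx limit is defined
                if (max > 0) {
                    double used = (double) heap.getUsed() / max;
                    if (used > RESTART_THRESHOLD) {
                        // A real watchdog would signal operations tooling
                        // (touch a file, call a script) rather than print.
                        System.out.println("Heap at " + (int) (used * 100) + "% - schedule a restart");
                    }
                }
                Thread.sleep(60_000);                  // check once a minute
            }
        }
    }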

Brian Agnew
The Windows solution! (Our IT department uses this for our Windows servers.) We don't run the system ourselves; it is sold to companies that cannot accept restarts (scheduled or unscheduled). The mere sign of instability would cause threats of fines.
Rickard von Essen
I don't like it, but it's pragmatic in some scenarios. I note your point about selling to companies, however.
Brian Agnew
I agree that it can be a (temporary) solution in some cases.
Rickard von Essen
+1  A: 

Can you accelerate time? I.e., can you write a dummy test client that forces it to do a week's worth of calls/requests in a few minutes or hours? Such a client is your biggest friend, and if you don't have one, write one.
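Such a client can be as crude as a loop that replays requests. A minimal sketch, assuming the system happens to expose an HTTP entry point (the URL and the request count below are made-up placeholders):

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Replays roughly a week's worth of calls in a tight loop.
    // The URL and the request count are made-up placeholders.
    public class TimeAccelerator {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8080/some-typical-operation");
            int requestsPerWeek = 500_000;   // rough guess, adjust to real traffic
            for (int i = 0; i < requestsPerWeek; i++) {
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.getResponseCode();      // fire the request and wait for the reply
                conn.disconnect();
                if (i % 10_000 == 0) {
                    System.out.println(i + " requests sent");
                }
            }
        }
    }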

We used NetBeans a while ago to analyse heap dumps. It can be a bit slow, but it was effective. Eclipse just crashed, and so did the 32-bit Windows tools.

If you have access to a 64-bit system or a Linux system with 3 GB or more, you will find it easier to analyse the heap dumps.

Do you have access to change logs and incident reports? Large scale enterprises will normally have change management and incident management teams and this may be useful in tracking down when problems started happening.

When did it start going wrong? Talk to people and try and get some history. You may get someone saying, "Yeah, it was after they fixed XYZ in patch 6.43 that we got weird stuff happening".

Fortyrunner
We thought it would be a good idea, but in our case it is unfeasible as a whole; we can only execute some test cases more often. The system test is only executed every 6 months or so, and the last time they decided to make it more intense. After this we found the problem. We tried to downgrade the framework and the application to a version that passed the test before. All three tests failed, which tells us that the fault is either in another component of the system or has been in ours for a long time. Another component is unlikely.
Rickard von Essen
A: 

I've had success with Heap Analyzer. It offers several views of the heap, including largest drop-off in object size, most frequently occurring objects, and objects sorted by size.

Drew Johnson
A: 

I've used jhat. It's a bit rough, but it depends on the kind of framework you have.

LB
We didn't manage to load such big heap dumps in jhat, and that seems to be a common problem. Also, as I remember from when I used it two years ago, it was a bit slow on bigger data sets.
Rickard von Essen
As another anecdote, I've also had loading issues with jhat and largish (1 GB) heaps.
matt b
What kind of issue? A heap space problem on the JVM running jhat?
LB
+4  A: 

It's almost impossible without some understanding of the underlying code. If you understand the underlying code, then you can better sort the wheat from the chaff among the zillion bits of information you are getting in your heap dumps.

Also, you can't know whether something is a leak or not without knowing why the class is there in the first place.

I just spent the past couple of weeks doing exactly this, and I used an iterative process.

First, I found the heap profilers basically useless. They can't analyze the enormous heaps efficiently.

Rather, I relied almost solely on jmap histograms.

I imagine you're familiar with these, but for those not:

jmap -histo:live <pid> > dump.out

creates a histogram of the live heap. In a nutshell, it tells you the class names, and how many instances of each class are in the heap.

I was dumping out the heap histogram regularly, every 5 minutes, 24 hours a day. That may well be too granular for you, but the gist is the same.

I ran several different analyses on this data.

I wrote a script to take two histograms, and dump out the difference between them. So, if java.lang.String was 10 in the first dump, and 15 in the second, my script would spit out "5 java.lang.String", telling me it went up by 5. If it had gone down, the number would be negative.

I would then take several of these differences, strip out all classes that went down from run to run, and take a union of the result. At the end, I'd have a list of classes that continually grew over a specific time span. Obviously, these are prime candidates for leaking classes.
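The script itself isn't shown in the answer; a minimal Java sketch of the same diff-two-histograms idea, assuming the usual jmap -histo column layout (rank, #instances, #bytes, class name), might look like this:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    // Prints, for each class, the change in instance count between two
    // "jmap -histo" dumps. Positive numbers mean the class grew.
    public class HistoDiff {

        public static void main(String[] args) throws IOException {
            Map<String, Long> before = parse(args[0]);
            Map<String, Long> after  = parse(args[1]);
            for (Map.Entry<String, Long> e : after.entrySet()) {
                long delta = e.getValue() - before.getOrDefault(e.getKey(), 0L);
                if (delta != 0) {
                    System.out.println(delta + " " + e.getKey());
                }
            }
        }

        // Data lines look roughly like: "   1:    123456   7890123  java.lang.String"
        private static Map<String, Long> parse(String file) throws IOException {
            Map<String, Long> counts = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get(file))) {
                String[] cols = line.trim().split("\\s+");
                if (cols.length >= 4 && cols[0].endsWith(":")) {
                    counts.put(cols[3], Long.parseLong(cols[1]));
                }
            }
            return counts;
        }
    }

Running it as "java HistoDiff histo1.txt histo2.txt | sort -n" puts the fastest-growing classes at the bottom of the output.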

However, some classes keep some instances preserved while others are GC'd. These classes can go up and down in their overall counts, yet still leak, so they can fall out of the "always rising" category of classes.

To find these, I converted the data into a time series and loaded it into a database, Postgres specifically. Postgres is handy because it offers statistical aggregate functions, so you can do simple linear regression analysis on the data and find classes that trend up, even if they aren't always at the top of the charts. I used the regr_slope function, looking for classes with a positive slope.
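The answer doesn't give the schema; assuming a table like histo_samples(sample_time timestamp, class_name text, instance_count bigint), the kind of regr_slope query being described could be issued from Java roughly like this (connection details, table, and column names are all assumptions):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Lists classes whose instance counts trend upward over time, using
    // Postgres' regr_slope aggregate. Requires the Postgres JDBC driver
    // on the classpath; table and column names are hypothetical.
    public class TrendingClasses {
        public static void main(String[] args) throws Exception {
            String sql =
                "SELECT class_name, " +
                "       regr_slope(instance_count, extract(epoch from sample_time)) AS slope " +
                "FROM histo_samples " +
                "GROUP BY class_name " +
                "HAVING regr_slope(instance_count, extract(epoch from sample_time)) > 0 " +
                "ORDER BY slope DESC";
            try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/heapstats", "user", "password");
                 Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("class_name") + " " + rs.getDouble("slope"));
                }
            }
        }
    }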

I found this process very successful and really efficient. The histogram files aren't insanely large, and it was easy to download them from the hosts. Taking them wasn't super expensive on the production system (they do force a full GC and may block the VM for a bit). I was running this on a system with a 2 GB Java heap.

Now, all this can do is identify potentially leaking classes.

This is where understanding how the classes are used, and whether they should or should not be there, comes into play.

For example, you may find that you have a lot of Map.Entry classes, or some other system class.

Unless you're simply caching String, the fact is these system classes, while perhaps the "offenders", are not the "problem". If you're caching some application class, THAT class is a better indicator of where your problem lies. If you don't cache com.app.yourbean, then you won't have the associated Map.Entry tied to it.

Once you have some classes, you can start crawling the code base looking for instances and references. Since you have your own ORM layer (for good or ill), you can at least readily look at its source code. If your ORM is caching stuff, it's likely caching ORM classes wrapping your application classes.

Finally, another thing you can do, once you know the classes, is start up a local instance of the server, with a much smaller heap and smaller dataset, and run one of the profilers against that.

In this case, you can run a unit test that affects only one (or a small number) of the things you think may be leaking. For example, you could start up the server, run a histogram, perform a single action, and run the histogram again. Your leaking class should have increased by 1 (or whatever your unit of work is).
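As a sketch of that before/after check, such a test could shell out to jmap against its own JVM and compare instance counts around a single unit of work; the class name and the "unit of work" below are placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.lang.management.ManagementFactory;

    // Takes a "jmap -histo:live" snapshot of the current JVM and returns
    // the instance count for one class, so a test can compare counts
    // before and after a single action.
    public class LeakCheck {

        static long instanceCount(String className) throws Exception {
            String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
            Process p = Runtime.getRuntime().exec(new String[] {"jmap", "-histo:live", pid});
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    String[] cols = line.trim().split("\\s+");
                    if (cols.length >= 4 && cols[3].equals(className)) {
                        return Long.parseLong(cols[1]);
                    }
                }
            }
            return 0;
        }

        public static void main(String[] args) throws Exception {
            long before = instanceCount("com.app.YourBean");   // hypothetical class
            // ... perform exactly one unit of work here ...
            long after = instanceCount("com.app.YourBean");
            System.out.println("delta = " + (after - before));
        }
    }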

A profiler may be able to help you track the owners of that "now leaked" class.

But, in the end, you're going to have to have some understanding of your code base to better understand what's a leak, and what's not, and why an object exists in the heap at all, much less why it may be being retained as a leak in your heap.

Will Hartung