views: 416
answers: 4

We are developing a rather large Windows Forms application. On several customers' computers it often crashes with an OutOfMemory exception. After obtaining a full memory dump of the application moments after the exception (clrdump invoked from an UnhandledException handler), I analyzed it with ".NET Memory Profiler" and windbg.
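
For reference, the handler is wired up roughly like this (simplified; TakeDump is a placeholder for the actual call into clrdump):

// Registered early in Main(), before the rest of the application starts.
AppDomain.CurrentDomain.UnhandledException +=
    delegate(object sender, UnhandledExceptionEventArgs e)
    {
        // e.ExceptionObject holds the unhandled exception (the OOM here);
        // TakeDump() stands in for the clrdump invocation that writes the dump.
        TakeDump();
    };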

The Memory Profiler shows only 130MB in live object instances. What's interesting is that for many object types it shows a very large number of unreachable instances (e.g. 22000 unreachable Byte[] instances). The native memory statistics total 127MB across all heaps for Data (which is ok), but indicate 133MB of unreachable memory in the gen #2 heap and 640MB in the large object heap (not ok!).

When analyzing the dump with windbg, the above stats are confirmed:

!dumpheap -stat
..... acceptable object sizes...
79330a00   467216     30638712 System.String
0016d488     4804    221756612      Free
79333470    27089    574278304 System.Byte[]

The application does use a large number of short-lived buffers throughout its run time, but does not leak them. Testing many of the Byte[] instances with !gcroot turns up no roots. Obviously most of those arrays are unreachable, as the memory profiler indicated.

Just to make sure everything is fine, !finalizequeue shows that no objects are waiting to be finalized:

generation 0 has 138 finalizable objects (18bd1938->18bd1b60)
generation 1 has 182 finalizable objects (18bd1660->18bd1938)
generation 2 has 75372 finalizable objects (18b87cb0->18bd1660)
Ready for finalization 0 objects (18bd1b60->18bd1b60)

A check of the native finalizer thread's stack trace also shows it is not blocked.

At the moment I don't know how to diagnose why the GC doesn't collect the data (and I believe it would love to, since the process ran out of memory).

edit: Based on the input below I read some more on Large Object Heap fragmentation, and it seems this could be the case.

I have seen advice to allocate bigger blocks of memory for this kind of data (various byte[] in my case) and manage the memory in that area myself, but this seems like a rather hackish solution, not one I would expect to need for a not-so-special desktop application.
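
For illustration only, a minimal sketch of what such pooling could look like (the BufferPool name, the bucket sizes and the Rent/Return API are my own assumptions, not code from the application or any particular library):

using System;
using System.Collections.Generic;

// Illustrative only: round large requests up to a few fixed bucket sizes
// (all above the ~85,000-byte LOH threshold) and reuse the buffers, so the
// LOH only ever sees a handful of distinct allocation sizes.
static class BufferPool
{
    private const int LohThreshold = 85000;
    private static readonly int[] Buckets = { 128 * 1024, 512 * 1024, 2 * 1024 * 1024, 8 * 1024 * 1024 };
    private static readonly Dictionary<int, Stack<byte[]>> pool = new Dictionary<int, Stack<byte[]>>();

    public static byte[] Rent(int size)
    {
        if (size < LohThreshold)
            return new byte[size]; // small buffers never reach the LOH anyway

        foreach (int bucket in Buckets)
        {
            if (size > bucket)
                continue;
            lock (pool)
            {
                Stack<byte[]> stack;
                if (pool.TryGetValue(bucket, out stack) && stack.Count > 0)
                    return stack.Pop();
            }
            return new byte[bucket];
        }
        return new byte[size]; // larger than the largest bucket
    }

    public static void Return(byte[] buffer)
    {
        if (Array.IndexOf(Buckets, buffer.Length) < 0)
            return; // not a pooled size, let the GC handle it
        lock (pool)
        {
            Stack<byte[]> stack;
            if (!pool.TryGetValue(buffer.Length, out stack))
                pool[buffer.Length] = stack = new Stack<byte[]>();
            stack.Push(buffer);
        }
    }
}

The idea is that callers ask for a buffer at least as large as they need and hand it back when done, so the LOH is only ever asked for a few fixed sizes that can be reused in place.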

The fragmentation issue is caused by the fact (at least that is what many people from Microsoft state in blogs) that objects on the LOH are never relocated during their lifetime, which is understandable; but it seems logical that once enough memory pressure builds up, such as the threat of an OOM, relocation should be performed.
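
(As it turns out, newer versions of the .NET Framework - 4.5.1 and later - do expose exactly this as an opt-in setting, though it is not available on the runtime discussed here:)

// .NET Framework 4.5.1+ only: request that the LOH be compacted during the
// next blocking gen 2 collection.
System.Runtime.GCSettings.LargeObjectHeapCompactionMode =
    System.Runtime.GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();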

The only thing that keeps me from fully trusting that fragmentation is the cause is that so many objects on the LOH have no gcroot references - is this because garbage collection is only performed partially even for the LOH?

I'd be grateful for pointers to any interesting solution, as at the moment the only one I know of is custom management of a preallocated memory block.

Any ideas are welcome. Thanks.

+2  A: 

The LOH is subject to fragmentation. This article provides an analysis and the basic directions to work around it.
Maybe you could post some code showing a 'typical' usage of those byte[] buffers?

Henk Holterman
The buffers vary from 20k to a few MB and are usually created either by the database provider to load data from the db, or by our code, which then fills them with data from a socket (the data size is known, so the buffers are allocated at the correct size, not grown)
grepfruit
Are they of known size at runtime or at design time? Looks like you will have to find a way to re-use them.
Henk Holterman
The size is determined at runtime - the buffers store email MIME parts, so each buffer differs in size. We may attempt to reuse our own buffers, but there are also the buffers from the provider. But yes, at the moment we will have to try to reuse some of the buffers, although I believe it will only postpone the problem.
grepfruit
grepfruit, it is paradoxical territory. It may pay to round __up__ allocations (allocate 1 MB when you need between 85kB and 1MB). Also you could post a more specific question here (how to optimize for the LOH).
Henk Holterman
The thing is, I am still not completely sure that this is LOH fragmentation, or that it is the only problem. What disturbs me is that so many objects on the LOH are unreferenced, yet they are not marked as Free space. From what I read, a full garbage collection is always performed for gen #2 and the LOH, so there should be a minimum of unreferenced objects, assuming a collection was performed right before the OOM.
grepfruit
+1  A: 

Sometimes Image.FromFile("a non-image file") throws OutOfMemoryException. A zero-byte file is one such file.
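
A rough illustration of guarding against that (the LoadImageOrNull helper is just a sketch):

// Image.FromFile reports unreadable image data as OutOfMemoryException,
// so an empty or corrupt file can look like a real memory problem.
private static System.Drawing.Image LoadImageOrNull(string path)
{
    System.IO.FileInfo info = new System.IO.FileInfo(path);
    if (!info.Exists || info.Length == 0)
        return null; // a zero-byte file would otherwise throw OutOfMemoryException

    try
    {
        return System.Drawing.Image.FromFile(path);
    }
    catch (OutOfMemoryException)
    {
        return null; // invalid image data, not actual memory exhaustion
    }
}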

Joshua
Thanks for the info, we have experienced this issue before too, so good point. However, in this case the OOM is thrown during an allocation, so probably there really is not enough contiguous space.
grepfruit
+1  A: 

As usual, things turned out to be a little different. We found a use case where the application did consume lots of memory and would eventually go OOM. What was strange in the dumps we took before finding this was that there were lots of objects without a gcroot - I didn't understand why that memory wasn't freed and used for new allocations. Then it occurred to me what probably happened when the OOM occurred: the stack was unwound, the objects that owned the memory were no longer reachable, and only THEN was the dump taken. That is why there seemed to be lots of memory that could be GCed.

What I did in a debug version - to capture a dump of the real state of the memory - is create a Threading.Timer that periodically checks whether some reasonably large object can still be allocated; if it can't, that is an indication that we're near OOM and that it's a good time to take the memory dump. Code follows:

private static void OomWatchDog(object obj)
{
    try
    {
        // Probe whether 20 MB could still be allocated, without actually
        // allocating it; MemoryFailPoint throws if it could not.
        using (System.Runtime.MemoryFailPoint memFailPoint =
                   new System.Runtime.MemoryFailPoint(20))
        {
            // Enough memory is still available - nothing to do.
        }
    }
    catch (InsufficientMemoryException)
    {
        // We are close to OOM: take the dump while the object graphs are intact.
        PerformDump();
    }
}
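
For completeness, the timer can be wired up along these lines (the 5-second interval is an arbitrary choice):

// Start once at application startup; keep the reference in a field so the
// timer itself is not garbage collected.
private static System.Threading.Timer oomWatchDogTimer;

private static void StartOomWatchDog()
{
    oomWatchDogTimer = new System.Threading.Timer(
        OomWatchDog,               // the callback shown above
        null,                      // no state object
        TimeSpan.FromSeconds(5),   // first check after 5 seconds
        TimeSpan.FromSeconds(5));  // then every 5 seconds
}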
grepfruit
+1  A: 
Naveen