Recently our application encountered a strange problem.

The application hosts a Win32 window inside a WPF window. The problem occurs when the WPF window is resized.

StackTrace:

Exception object: 0000000002ab2c78
Exception type: System.OutOfMemoryException
InnerException: <none>
StackTrace (generated):
    SP       IP       Function
    0048D94C 689FB82F PresentationCore_ni!System.Windows.Media.Composition.DUCE+Channel.SyncFlush()+0x80323f
    0048D98C 681FEE37 PresentationCore_ni!System.Windows.Media.Composition.DUCE+CompositionTarget.UpdateWindowSettings(ResourceHandle, RECT, System.Windows.Media.Color, Single, System.Windows.Media.Composition.MILWindowLayerType, System.Windows.Media.Composition.MILTransparencyFlags, Boolean, Boolean, Boolean, Int32, Channel)+0x127
    0048DA38 681FEAD1 PresentationCore_ni!System.Windows.Interop.HwndTarget.UpdateWindowSettings(Boolean, System.Nullable`1<ChannelSet>)+0x301
    0048DBC8 6820718F PresentationCore_ni!System.Windows.Interop.HwndTarget.UpdateWindowSettings(Boolean)+0x2f
    0048DBDC 68207085 PresentationCore_ni!System.Windows.Interop.HwndTarget.UpdateWindowPos(IntPtr)+0x185
    0048DC34 681FFE9F PresentationCore_ni!System.Windows.Interop.HwndTarget.HandleMessage(Int32, IntPtr, IntPtr)+0xff
    0048DC64 681FD0BA PresentationCore_ni!System.Windows.Interop.HwndSource.HwndTargetFilterMessage(IntPtr, Int32, IntPtr, IntPtr, Boolean ByRef)+0x3a
    0048DC88 68C6668E WindowsBase_ni!MS.Win32.HwndWrapper.WndProc(IntPtr, Int32, IntPtr, IntPtr, Boolean ByRef)+0xbe
    0048DCD4 68C665BA WindowsBase_ni!MS.Win32.HwndSubclass.DispatcherCallbackOperation(System.Object)+0x7a
    0048DCE4 68C664AA WindowsBase_ni!System.Windows.Threading.ExceptionWrapper.InternalRealCall(System.Delegate, System.Object, Boolean)+0x8a
    0048DD08 68C6639A WindowsBase_ni!System.Windows.Threading.ExceptionWrapper.TryCatchWhen(System.Object, System.Delegate, System.Object, Boolean, System.Delegate)+0x4a
    0048DD50 68C64504 WindowsBase_ni!System.Windows.Threading.Dispatcher.WrappedInvoke(System.Delegate, System.Object, Boolean, System.Delegate)+0x44
    0048DD70 68C63661 WindowsBase_ni!System.Windows.Threading.Dispatcher.InvokeImpl(System.Windows.Threading.DispatcherPriority, System.TimeSpan, System.Delegate, System.Object, Boolean)+0x91
    0048DDB4 68C635B0 WindowsBase_ni!System.Windows.Threading.Dispatcher.Invoke(System.Windows.Threading.DispatcherPriority, System.Delegate, System.Object)+0x40
    0048DDD8 68C65CFC WindowsBase_ni!MS.Win32.HwndSubclass.SubclassWndProc(IntPtr, Int32, IntPtr, IntPtr)+0xdc

StackTraceString: <none>
HResult: 8007000e

Also, I found some related links:

relatedA

relatedB

  1. Is there any way to avoid or handle this problem?

  2. How can we find out what the real problem is?

  3. From the call stack, can we determine that the problem comes from the .NET Framework?

Thank you for your answer or comments!

+2  A: 

Here's a useful article on memory leaks in WPF. You might also consider something like ANTS Performance and/or Memory Profiler from RedGate to help diagnose problems like this.

HTH

Chris Nicol
Yeah... Ants memory profiler looks like the best option for me.
Anvaka
+4  A: 

Your problem is not caused by a managed memory leak. Clearly you are tickling a bug somewhere in unmanaged code.

The SyncFlush() method is called after several MILCore calls, and it appears to cause the changes that have been sent to be processed immediately instead of being left in queue for later processing. Since the call processes everything previously sent, nothing in your visual tree can be ruled out from the call stack you sent.

A call stack that includes unmanaged calls may turn up more useful information. Run the application under VS.NET with native debugging, or with windbg or another native code debugger. Set the debugger to break on the exception, and get the call stack at that point.

The call stack will of course descend into MILCore, and from there it may go into the DirectX layer and the DirectX driver. A clue as to which part of your code caused the problem may be found somewhere in this native call stack.

Chances are that MILCore is passing a huge value of some parameter into DirectX based on what you are telling it. Check your application for anything that could make DirectX allocate a lot of memory. Examples of things to look for would be:

  • BitmapSources that are set to load at very high resolution.
  • Large WriteableBitmaps
  • Extremely large (or negative) transform or size values

Another way to attack this problem is to progressively simplify your application until the problem disappears, then look very closely at what you removed last. When convenient, it can be good to do this as a binary search: Initially cut out half of the visual complexity. If it works, put back half of what was removed, otherwise remove another half. Repeat until done.
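The binary search described above can be sketched roughly as follows. This is only an illustration in Python, not WPF code: `find_culprit`, the `fails` callback and the component names are all made up for the example, and it assumes there is exactly one offending component and that the failure reproduces reliably whenever that component is shown.

```python
from typing import Callable, List, Sequence


def find_culprit(components: Sequence[str],
                 fails: Callable[[List[str]], bool]) -> str:
    """Narrow a list of components down to the single one that triggers the failure.

    Assumptions (hypothetical): exactly one culprit, and `fails(shown)` returns
    True whenever the culprit is among the components currently shown.
    """
    candidates = list(components)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if fails(half):
            # The failure still reproduces, so the culprit is in this half.
            candidates = half
        else:
            # The failure disappeared, so the culprit is in the other half.
            candidates = candidates[len(half):]
    return candidates[0]


# Hypothetical component names; "chart" plays the role of the culprit here.
parts = ["toolbar", "chart", "hosted_win32", "status_bar", "video"]
print(find_culprit(parts, lambda shown: "chart" in shown))
```

In a real session the `fails` step is manual: show only the listed components, run the application, resize the window, and note whether the exception occurs.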

Also note that it is usually unnecessary to actually remove UI components to keep MILCore from seeing them. Any Visual with Visibility.Hidden may be skipped over entirely.

There is no generalized way to avoid this problem, but the search technique will help you pinpoint what specifically needs to be changed to fix it in the particular case.

It is safe to say from the call stack that you have found a bug in either the .NET Framework or the DirectX drivers for a particular video card.

Regarding the second stack trace you posted

John Knoeller is correct that the transition from RtlFreeHeap to ConvertToUnicode is nonsense, but draws the wrong conclusion from it. What we are seeing is that your debugger got lost when tracing back the stack. It started correctly from the exception but got lost below the Assembly.ExecuteMainMethod frame because that part of the stack had been overwritten as the exception was handled and the debugger was invoked.

Unfortunately any analysis of this stack trace is useless for your purposes because it was captured too late. What we are seeing is an exception occurring during processing of a WM_LBUTTONDOWN which is converted to a WM_SYSCOMMAND, which then catches an exception. In other words, you clicked on something that caused a system command (such as a resize), which caused an exception. At the point this stack trace was captured, the exception was already being handled. The reason you are seeing User32 and UxTheme calls is that these are involved in processing the button click. They have nothing to do with the real problem.

You are on the right track, but you will need to capture a stack trace at the moment the allocation fails (or you can use one of the other approaches I suggested above).

You will know you have the correct stack trace when all the managed frames in your first stack trace appear in it and the top of the stack is a failing memory allocation. Note that we are really interested only in the unmanaged frames that appear above the DUCE+Channel.SyncFlush call -- everything below that will be .NET Framework and your application code.

How to get a native stack trace at the right time

You want to get a stack trace at the time of the first memory allocation failure within the DUCE+Channel.SyncFlush call shown. This may be tricky. There are three approaches I use (note that in each case you start with a breakpoint inside the SyncFlush call; see the note below for more details):

  1. Set the debugger to break on all exceptions (managed and unmanaged), then keep hitting go (F5, or "g") until it breaks on the memory allocation exception you are interested in. This is the first thing to try because it is quick, but it often fails when working with native code because the native code often returns an error code to the calling native code instead of throwing an exception.

  2. Set the debugger to break on all exceptions and also set breakpoints on common memory allocation routines, then hit F5 (go) repeatedly until the exception occurs, counting how many F5s you hit. Next time you run, use one fewer F5 and you may be on the allocation call that generated the exception. Capture the call stack to Notepad, then F10 (step over) repeatedly from there to see if it really was the allocation that failed.

  3. Set a breakpoint on the first native frame called by SyncFlush (this is wpfgfx_v0300!MilComposition_SyncFlush) to skip over the managed to native transition, then F5 to run to it. F10 (step over) through the function until EAX contains one of the error codes E_OUTOFMEMORY (0x8007000E), ERROR_OUTOFMEMORY (0x0000000E), or ERROR_NOT_ENOUGH_MEMORY (0x00000008). Note the most recent "Call" instruction. The next time you run the program, run to there and step into it. Repeat this until you are down to the memory allocation call that caused the problem and dump the stack trace. Note that in many cases you will find yourself looping through a largish data structure, so some intelligence is required to set an appropriate breakpoint to skip over the loop so you can get where you need to be quickly. This technique is very reliable but very labor-intensive.
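When watching EAX in approach 3, it can help to have the three codes written down and to know how an HRESULT splits into its fields. Below is a small Python sketch for that. The constants are the standard Windows values quoted above; the helper names `is_out_of_memory` and `decode_hresult` are made up for this illustration and are not part of any debugger API.

```python
# Standard Windows values, as listed in approach 3.
E_OUTOFMEMORY = 0x8007000E            # HRESULT
ERROR_OUTOFMEMORY = 0x0000000E        # Win32 error code 14
ERROR_NOT_ENOUGH_MEMORY = 0x00000008  # Win32 error code 8

OUT_OF_MEMORY_CODES = {E_OUTOFMEMORY, ERROR_OUTOFMEMORY, ERROR_NOT_ENOUGH_MEMORY}


def is_out_of_memory(eax: int) -> bool:
    """True if a 32-bit register value matches one of the codes above."""
    return (eax & 0xFFFFFFFF) in OUT_OF_MEMORY_CODES


def decode_hresult(hr: int) -> tuple:
    """Split an HRESULT into (failed, facility, code)."""
    value = hr & 0xFFFFFFFF
    failed = bool(value >> 31)          # severity bit
    facility = (value >> 16) & 0x7FF    # 11-bit facility field
    code = value & 0xFFFF               # low 16 bits
    return failed, facility, code


# The HResult from the managed exception in the question:
print(decode_hresult(0x8007000E))   # (True, 7, 14)
print(is_out_of_memory(0x8007000E)) # True
```

Facility 7 is FACILITY_WIN32, so 0x8007000E is just Win32 error 14 (ERROR_OUTOFMEMORY) wrapped as an HRESULT, which is why both forms are worth watching for.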

Note: In each case you don't want to set breakpoints or start single-stepping until your application is inside the failing DUCE+Channel.SyncFlush call. To ensure this, start the application with all breakpoints disabled. When it is running, enable a breakpoint on System.Windows.Media.Composition.DUCE+Channel.SyncFlush and resize the window. The first time around just hit F5 to make sure the exception occurs on the first SyncFlush call (if not, count how many times you have to hit F5 before the exception occurs). Then disable the breakpoint and restart the program. Repeat the procedure, but this time, after you hit the SyncFlush call the right number of times, set your breakpoints or do your single-stepping as described above.

Recommendations

The debugging techniques I describe above are labor-intensive: Plan to spend several hours at least. Because of this, I generally try repeatedly simplifying my application to find out exactly what tickles the bug before jumping into the debugger for something like this. This has two advantages: It will give you a good repro to send the graphics card vendor, and it will make your debugging faster because there will be less displayed and therefore less code to single-step through, fewer allocations, etc.

Because the problem happens only with a specific graphics card, there is no doubt that the problem is either a bug in the graphics card driver or in the MilCore code that calls it. Most likely it is in the graphics card driver, but it is possible that MilCore is passing invalid values that are handled correctly by most graphics cards but not this one. The debugging techniques I describe above will tell you which is the case: For example, if MilCore is telling the graphics card to allocate a 1000000x1000000 pixel area and the graphics card is giving correct resolution information, the bug is in MilCore. But if MilCore's requests are reasonable then the bug is in the graphics card driver.
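To judge whether a request seen in the debugger is "reasonable", a quick back-of-the-envelope size estimate is usually enough. The Python sketch below assumes 4 bytes per pixel (a 32-bit pixel format) and a 32-bit process; both are assumptions for illustration, and `surface_bytes` and `fits_in_32bit_address_space` are made-up helper names.

```python
BYTES_PER_PIXEL = 4       # assumption: 32-bit pixel format
ADDRESS_SPACE_32 = 2**32  # assumption: 32-bit process, upper bound only


def surface_bytes(width: int, height: int,
                  bytes_per_pixel: int = BYTES_PER_PIXEL) -> int:
    """Rough memory needed for a width x height pixel area."""
    if width < 0 or height < 0:
        raise ValueError("negative dimensions are themselves a sign of a bug")
    return width * height * bytes_per_pixel


def fits_in_32bit_address_space(n_bytes: int) -> bool:
    """Very loose check: anything at or over 4 GiB cannot possibly fit."""
    return n_bytes < ADDRESS_SPACE_32


print(surface_bytes(1920, 1080))            # 8294400 (about 8 MB)
print(surface_bytes(1_000_000, 1_000_000))  # 4000000000000 (about 4 TB)
print(fits_in_32bit_address_space(surface_bytes(1_000_000, 1_000_000)))  # False
```

A request in the megabyte range points away from MilCore and towards the driver; a request that could never fit in the process points at whoever computed the dimensions.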

Ray Burns
Thanks! Hi Ray, I dumped the managed and unmanaged stack. Can I say that the problem comes from "uxtheme!_ThemeDefWindowProc"?
whunmr
No, it is not related to UxTheme. The UxTheme code is simply handling your button click. The stack trace is the right kind of trace, but not taken at the correct time. I've added more explanation to my answer, and some tips for getting a good stack trace. Hope they help.
Ray Burns
+1  A: 

I'm not sure that the stack part (or at least the UXTheme stuff) is trustworthy. The bottom of the stack seems normal. And we see what appears to be an exception handler trying to do cleanup. Then lots of nested calls to various layers of heap management code.

But this part where the stack transitions from RtlFreeHeap to ConvertToUnicode doesn't make any sense. I suspect that everything above that is leftover from previous use of the stack.

0048f40c 6b88f208 mscorwks!_EH_epilog3_GS+0xa, calling mscorwks!__security_check_cookie 
0048f410 6b8a756e mscorwks!SString::ConvertToUnicode+0x81, calling mscorwks!_EH_epilog3_GS 
0048f424 77b4371e ntdll_77b10000!RtlpFreeHeap+0xbb1, calling ntdll_77b10000!RtlLeaveCriticalSection 
0048f42c 77b436fa ntdll_77b10000!RtlpFreeHeap+0xb7a, calling ntdll_77b10000!_SEH_epilog4 

A crash in RtlFreeHeap points to heap corruption, which suggests that the problem is in unmanaged code, but the memory for managed objects must ultimately be allocated from unmanaged memory, so it could be either.

I suggest you look for places where your unmanaged window can corrupt the heap: multiple frees of the same allocation, or overwriting an allocation's boundaries.

John Knoeller
Thank you for your analysis. Now I can conclude that UXTheme works fine.
whunmr
@John Knoeller: Yes, the transition you identify is part of a stack overwrite. The rest of your analysis is plausible **except** for the fact that the original trace shows the OutOfMemoryException is occurring inside SyncFlush and in response to the button click. Since the RtlpFreeHeap call is completely outside of all managed code, including `Application.Run`, this call stack is clearly not showing the same exception as the previous one. Specifically, it is not showing the error that is causing the out of memory exception to be thrown on resize.
Ray Burns
Note that the stack overwrite in question is due to the thread being shut down. It is conceivable that the final exit included a GP fault from RtlpFreeHeap, but there is nothing visible in the stack trace to indicate that that is in fact the case. It is just as likely that there was no corruption at all and we are seeing a normal exception exit due to an unhandled OutOfMemoryException. Nevertheless, checking your unmanaged code for things that may corrupt the heap is always a good idea in these cases.
Ray Burns
@Ray: Good Point. I'm operating under the assumption that at some point a managed allocation request, especially a large one, can trigger unmanaged allocation. I'm also assuming that he isn't truly out of memory. In my experience when you _actually_ run out of memory, your process falls over so hard that it can't report 'out of memory' in a nice friendly dialog box.
John Knoeller