Hey, I am attempting to use the Microsoft.MSHTML (Version 7.0.3300.0) library to extract the body text from an HTML string. I've abstracted this functionality into a single helper method GetBody(string).

When called in an infinite loop, the process eventually runs out of memory (confirmed by eyeballing Mem Usage in Task Manager). I suspect the problem is due to my incorrect cleanup of the MSHTML objects. What am I doing wrong?

My current definition of GetBody(string) is:

public static string GetBody(string html)
{
    mshtml.IHTMLDocument2 htmlDoc = null;
    mshtml.IHTMLElement bodyElement = null;
    string body;

    try
    {
        htmlDoc = new mshtml.HTMLDocumentClass();
        htmlDoc.write(html);
        bodyElement = htmlDoc.body;
        body = bodyElement.innerText;
    }
    catch (Exception ex)
    {
        Trace.TraceError("Failed to use MSHTML to parse HTML body: " + ex.Message);
        // Note: `email` is not defined in this snippet; it is declared elsewhere in my code.
        body = email.Body;
    }
    finally
    {
        if (bodyElement != null)
            Marshal.ReleaseComObject(bodyElement);
        if (htmlDoc != null)
            Marshal.ReleaseComObject(htmlDoc);
    }

    return body;
}

Edit: the memory leak has been traced to the code used in populating a value for html. In this case it was Outlook Redemption.
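For anyone chasing a similar leak: rather than eyeballing Task Manager, one way to narrow down which call is responsible is to measure memory growth around a single call with fixed input. The sketch below is only an illustration; `LeakProbe` and `MeasureGrowth` are hypothetical names that do not appear in the original code, and private bytes are a rough indicator, not a precise leak detector.

```csharp
using System;
using System.Diagnostics;

public static class LeakProbe
{
    // Hypothetical helper: runs `action` `iterations` times and returns the
    // change in the process's private bytes across the run.
    public static long MeasureGrowth(Action action, int iterations)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));

        long before = SettledPrivateBytes();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        return SettledPrivateBytes() - before;
    }

    // Forces a full collection first so that managed garbage awaiting
    // collection or finalization is less likely to be counted as growth.
    private static long SettledPrivateBytes()
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        using (Process current = Process.GetCurrentProcess())
        {
            current.Refresh();
            return current.PrivateMemorySize64;
        }
    }
}
```

Calling something like `LeakProbe.MeasureGrowth(() => GetBody(fixedHtml), 10000)` with a constant string (here a made-up `fixedHtml` variable) exercises only the MSHTML path. If the growth stays roughly flat, the leak is more likely in whatever produces `html`.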

+2  A: 

It has been a long time since I have used mshtml, but doesn't the IHTMLDocument2 interface have a close method? Have you tried calling it?

How long did the loop run before the leak was obvious?

I will see if I can dig through some of the legacy code I have here that uses mshtml and see how the developers released the objects.

EDIT:

The old code we have here calls close() on the IHTMLDocument2 and then releases the COM object, as you have it.

One thing to note, though, is that the ReleaseComObject method is called in a loop until it returns zero. This ensures all COM wrappers and the original object are released; there is a note about it here.
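A minimal sketch of that release loop, assuming nothing beyond `System.Runtime.InteropServices.Marshal`. The `ComCleanup` class and `ReleaseFully` method are hypothetical names for illustration, not the legacy code mentioned above.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ComCleanup
{
    // Hypothetical helper: keeps releasing the runtime callable wrapper
    // until its reference count reaches zero.
    public static void ReleaseFully(object comObject)
    {
        // Nothing to do for null or purely managed objects;
        // ReleaseComObject would throw if given a non-COM object.
        if (comObject == null || !Marshal.IsComObject(comObject))
        {
            return;
        }

        // ReleaseComObject returns the wrapper's new reference count.
        while (Marshal.ReleaseComObject(comObject) > 0)
        {
        }
    }
}
```

In the `finally` block of `GetBody`, each `Marshal.ReleaseComObject(...)` call could be swapped for `ComCleanup.ReleaseFully(...)`. `Marshal.FinalReleaseComObject` has the same effect in a single call, since it sets the wrapper's reference count to zero.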

Glenn Condron
Thanks! I can see a close() method on IHTMLDocument2, I'll add in a call to it and see how it goes. The leak itself is quite consistent, memory usage grows steadily once the program begins executing.
NoizWaves
I would suspect this is the case. Since IHTMLDocument2 is a COM object, it is most likely not handled explicitly by the .NET garbage collection. Since, in your loop, you create a new instance of an object that implements this interface without cleaning up the previous instance, you would get a memory leak.
Matt Jordan
@Caelum - the call to close() has slowed the leak slightly, but it is still present. @Matt - This is what I'm thinking as well; do you know how I should be cleaning up the previous instances?
NoizWaves
Are you calling ReleaseComObject in a loop now? That is, looping until the call to ReleaseComObject returns 0.
Glenn Condron
Hey Caelum, I added the calls to ReleaseComObject in a loop, but the memory leak is still present. I also tried adding a forced call to System.Windows.Forms.Application.DoEvents() before GetBody returns, but there was no improvement. My loop is being run inside a Console Application.
NoizWaves
Interesting, the code I have here does no more than what you are doing and doesn't show symptoms of leaking like your description. The only other difference I can see is that the code I have uses the HTMLDocument object rather than HTMLDocumentClass, but I have no idea what the difference is, if any. I have just not used mshtml enough to really say. Sorry about that.
Glenn Condron
Thanks for your help Caelum, after reviewing my code, I've traced the source of the memory leak to code external to MSHTML... In this case it is the Outlook Redemption library... I've rerun my example and confirmed your findings, the MSHTML usage does not cause a leak.
NoizWaves
Good to hear :)
Glenn Condron