Edit 2: I just want to make sure my question is clear: why does the application use about 15 MB more (the size of the original log file) on each iteration of AppendToLog()?

I've got a function called AppendToLog() which receives the file path of an HTML document, does some parsing and appends it to a file. It gets called this way:

this.user_email = uemail;
string wanted_user = wemail;

string[] logPaths;
logPaths = this.getLogPaths(wanted_user);

foreach (string path in logPaths)
{              

    this.AppendToLog(path);                

}

On every iteration, RAM usage increases by about 15 MB. This is the function (it looks long, but it's simple):

public void AppendToLog(string path)
{
    Encoding enc = Encoding.GetEncoding("ISO-8859-2");
    StringBuilder fb = new StringBuilder();
    FileStream sourcef;
    string[] messages;

    try
    {
        sourcef = new FileStream(path, FileMode.Open);
    }
    catch (IOException)
    {
        throw new IOException("The chat log is in use by another process.");
    }

    using (StreamReader sreader = new StreamReader(sourcef, enc))
    {
        string file_buffer;
        while ((file_buffer = sreader.ReadLine()) != null)
        {
            fb.Append(file_buffer);
        }
    }

    //Array of each line's content
    messages = parseMessages(fb.ToString());

    fb = null;

    string destFileName = String.Format("{0}_log.txt", System.IO.Path.GetFileNameWithoutExtension(path));
    FileStream destf = new FileStream(destFileName, FileMode.Append);
    using (StreamWriter swriter = new StreamWriter(destf, enc))
    {
        foreach (string message in messages)
        {
            if (message != null)
            {
                swriter.WriteLine(message);
            }
        }
    }

    messages = null;

    sourcef.Dispose();
    destf.Dispose();

    sourcef = null;
    destf = null;
}

I've been days with this and I don't know what to do :(

Edit: This is ParseMessages, a function that uses HtmlAgilityPack to strip parts of an HTML log.

public string[] parseMessages(string what)
{
    StringBuilder sb = new StringBuilder();
    HtmlDocument doc = new HtmlDocument();

    doc.LoadHtml(what);

    HtmlNodeCollection messageGroups = doc.DocumentNode.SelectNodes("//body/div[@class='mplsession']");
    int messageCount = doc.DocumentNode.SelectNodes("//tbody/tr").Count;

    doc = null;

    string[] buffer = new string[messageCount];

    int i = 0;

    foreach (HtmlNode sessiongroup in messageGroups)
    {
        HtmlNode tablegroup = sessiongroup.SelectSingleNode("table/tbody");

        string sessiontime = sessiongroup.Attributes["id"].Value;

        HtmlNodeCollection messages = tablegroup.SelectNodes("tr");
        if (messages != null)
        {
            foreach (HtmlNode htmlNode in messages)
            {
                sb.Append(
                    ParseMessageDate(
                        sessiontime,
                        htmlNode.ChildNodes[0].ChildNodes[0].InnerText
                    )
                ); //Date
                sb.Append(" ");

                try
                {
                    foreach (HtmlTextNode node in htmlNode.ChildNodes[0].SelectNodes("text()"))
                    {
                        sb.Append(node.Text.Trim()); //Name
                    }
                }
                catch (NullReferenceException)
                {
                    /*
                     * We ignore this exception, it just means there's extra text
                     * and that means that it's not a normal message
                     * but a system message instead
                     * (i.e. "John logged off")
                     * Therefore we add the "::" mark for future organizing
                     */
                    sb.Append("::");
                }
                sb.Append(" ");

                string message = htmlNode.ChildNodes[1].InnerHtml;
                message = message.Replace("&quot;", "'");
                message = message.Replace("&nbsp;", " ");
                message = RemoveMedia(message);
                sb.Append(message); //Message
                buffer[i] = sb.ToString();
                sb = new StringBuilder();
                i++;
            }
        }
    }
    messageGroups = null;
    what = null;
    return buffer;
}
+1  A: 

One thing you may want to try is temporarily forcing a GC.Collect after each run. The GC is very intelligent, and will not reclaim memory until it feels the expense of a collection is worth the value of any recovered memory.
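For instance, as a temporary diagnostic around the calling loop from the question (remove it once you have seen what actually gets reclaimed):

foreach (string path in logPaths)
{
    this.AppendToLog(path);

    // Diagnostic only: force a full collection so you can see how much of
    // that ~15 MB per file is really unreachable garbage.
    GC.Collect();
    GC.WaitForPendingFinalizers();
}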

Edit: I just wanted to add that it's important to understand that calling GC.Collect manually is a bad practice (for any normal use case; abnormal == perhaps a load function for a game or some such). You should let the garbage collector decide what's best, as it generally has more information available to it than you do about system resources and the like on which to base its collection behaviour.

Gregory
Don't forget to remove it afterwards! Don't keep the collect there; bad idea.
Fredou
haha, I was just writing that in, thanks :)
Gregory
A: 

I would manually clear the messages array and the StringBuilder before setting them to null.
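Something like this, right before the nulls (just a sketch against the code in the question; note that StringBuilder.Clear() only exists from .NET 4 onwards, so setting Length to 0 is the older idiom):

// Release the references held by the array and empty the StringBuilder
// before dropping them.
Array.Clear(messages, 0, messages.Length);
fb.Length = 0;

messages = null;
fb = null;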

Edit:

Looking at what the process seems to do, I have a suggestion, if it's not too late: instead of parsing an HTML file, create a DataSet schema and use it to write and read an XML log file, then use an XSL file to convert it into an HTML file.

Fredou
Could you elaborate on that last point, please? I don't want to create another HTML file, the whole purpose of my application is to create a stripped down version of the bulky HTML logs :P
Daniel S
A: 

The try-catch block could use a finally (cleanup). If you look at what the using statement does, it is essentially a try/finally. Yes, running the GC is a good idea also. Without compiling this code and giving it a try, it is hard to say for sure...

Also, dispose of this one properly with a using statement:

FileStream destf = new FileStream(destFileName, FileMode.Append);

Look up Effective C#, 2nd edition.
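For example, something like this (a sketch of the pattern, based on the snippet and variable names in the question):

// Both streams get disposed even if WriteLine throws.
using (FileStream destf = new FileStream(destFileName, FileMode.Append))
using (StreamWriter swriter = new StreamWriter(destf, enc))
{
    foreach (string message in messages)
    {
        if (message != null)
        {
            swriter.WriteLine(message);
        }
    }
}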

Hamish Grubijan
+2  A: 

I would look carefully at why you need to pass a string to parseMessages, i.e. fb.ToString().

Your code comment says that this returns an array of each line's content. However, you are actually reading all lines from the log file into fb and then converting it to a string.

If you are parsing large files in parseMessages(), you could do this much more efficiently by passing the StringBuilder itself or the StreamReader into parseMessages(). That way only a portion of the file is loaded into memory at any time, as opposed to ToString(), which currently forces the entire log file into memory.
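As a rough sketch (ParseMessagesFrom is a hypothetical reworked version of your method, not something you already have), HtmlAgilityPack's HtmlDocument.Load can read from a TextReader directly:

// Hypothetical reworked signature: the parser reads from the stream itself,
// so the whole log file is never held as one big string.
public string[] ParseMessagesFrom(TextReader reader)
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load(reader); // HtmlAgilityPack can load straight from a TextReader

    // ... the existing node-walking code from parseMessages goes here ...
    return new string[0];
}

// Caller, replacing the ReadLine loop and fb.ToString():
using (StreamReader sreader = new StreamReader(sourcef, enc))
{
    messages = ParseMessagesFrom(sreader);
}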

You are less likely to have a true memory leak in a .NET application thanks to garbage collection. You do not look to be using any large resources such as files, so it seems even less likely that you have an actual memory leak.

It looks like you have disposed of resources ok, however the GC is probably struggling to allocate and then deallocate the large memory chunks in time before the next iteration starts, and so you see the increasing memory usage.

While GC.Collect() may allow you to force memory deallocation, I would strongly advise looking into the suggestions above before resorting to trying to manually manage memory via GC.

[Update] Seeing your parseMessages() and the use of HtmlAgilityPack (a very useful library, by the way), it looks likely that some large and possibly numerous memory allocations are being performed for every log file.

HtmlAgilityPack allocates memory for various nodes internally; combined with your buffer array and the allocations in the main function, I'm even more confident that the GC is being put under a lot of pressure to keep up.

To stop guessing and get some real metrics, I would run Process Explorer and add the columns that show the Gen 0, 1, and 2 GC collections. Then run your application and observe the number of collections. If you're seeing large numbers in these columns, the GC is struggling and you should redesign to use fewer memory allocations.

Alternatively, the free CLR Profiler 2.0 from Microsoft provides a nice visual representation of .NET memory allocations within your application.

Ash
"However you are actually reading all lines from the log file into fb and then converting to a string."Yes, because then parseMessages() uses HtmlAgilityPack to scrap the file.
Daniel S
@Daniel, HtmlAgilityPack can also read from a Stream such as StreamReader etc (pass it to the Load() method). Using a Stream allows you to avoid loading the whole string/file into memory.
Ash
A: 

I don't see any obvious memory leaks; my first guess would be that it's something in the library.

A good tool to figure this kind of thing out is the .NET Memory Profiler, by SciTech. They have a free two-week trial.

Short of that, you could try commenting out some of the library functions, and see if the problem goes away if you just read the files and do nothing with the data.

Also, where are you looking for memory use stats? Keep in mind that the stats reported by Task Manager aren't always very useful or reflective of actual memory use.

RickNZ
+3  A: 

As many have mentioned, this is probably just an artifact of the GC not cleaning up the memory as fast as you are expecting it to. This is normal for managed languages like C#, Java, etc. If you're interested in that usage, you really need to find out whether the memory allocated to your program is actually free or not. The questions to ask are:

  1. How long is your program running? Is it a service type program that runs continuously?
  2. Over the span of execution does it continue to allocate memory from the OS or does it reach a steady-state? (Have you run it long enough to find out?)

Your code does not look like it will have a "memory leak". In managed languages you really don't get memory leaks like you would in C/C++ (unless you are using unsafe code or external C/C++ libraries). What you do need to watch out for, though, are references that stay around or are hidden (like a collection class that has been told to remove an item but does not set the element of its internal array to null). Generally, objects referenced only from the stack (locals and parameters) cannot 'leak' unless you store a reference to them in an object/class variable.
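For example (this is a contrived illustration, not something in your code), a local array is fine on its own, but as soon as its reference is stored in something long-lived, everything it points to stays reachable:

static List<string[]> cache = new List<string[]>();

void ProcessLog(string path)
{
    // 'lines' is only a local, but adding it to the long-lived static list
    // keeps every file's contents reachable forever, so the GC can never
    // reclaim that memory (the classic managed "leak").
    string[] lines = File.ReadAllLines(path);
    cache.Add(lines);
}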

Some comments on your code:

  1. You can reduce the allocation/deallocation of memory by pre-allocating the StringBuilder to at least the proper size. Since you know you will need to hold the entire file in memory, allocate it to the file size (this will actually give you a buffer that is just a little bigger than required since you are not storing new-line character sequences but the file probably has them):

    FileInfo fi = new FileInfo(path);
    StringBuilder fb = new StringBuilder((int) fi.Length);
    

    You may want to ensure the file exists before getting its length, using fi to check for that. Note that I just down-cast the length to an int without error checking as your files are less than 2GB based on your question text. If that is not the case then you should verify the length before casting it, perhaps throwing an exception if the file is too big.

  2. I would recommend removing all the variable = null statements in your code. They are not necessary, since these are stack-allocated variables, and in this context they will not help the GC because the method does not live for a long time. Having them just adds clutter to the code and makes it harder to understand.

  3. In your ParseMessages method, you catch a NullReferenceException and assume it just means a non-text node. This could lead to confusing problems in the future. Since this is something you expect to happen normally, as a result of something that may exist in the data, you should check for the condition in the code, such as:

    if (node.Text != null)
        sb.Append(node.Text.Trim()); //Name
    

    Exceptions are for exceptional/unexpected conditions in the code. Assigning a NullReferenceException any meaning beyond "there was a null reference" can (and likely will) hide errors in other parts of that same try block, now or with future changes.

Kevin Brock
Looks like you were right, there's no memory leak. And thank you for the comments on my code, I'm still grasping C#.
Daniel S
+2  A: 

There is no memory leak. If you are using Windows Task Manager to measure the memory used by your .NET application you are not getting a clear picture of what is going on, because the GC manages memory in a complex way that Task Manager doesn't reflect.

A Microsoft engineer wrote a great article about why .NET applications that seem to be leaking memory probably aren't, and it has links to very in-depth explanations of how the GC actually works. Every .NET programmer should read them.

RossFabricant
I would mark this as accepted as well but I can't choose 2 answers. Thank you!
Daniel S
A: 

The HtmlDocument class (as far as I can determine) has a serious memory leak when used from managed code. I recommend using the XML DOM parser instead (though this does require well-formed documents, but that's another plus).
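For example, a minimal sketch of that route (it assumes the log files are, or can be made, well-formed XHTML):

// Only works on well-formed XML/XHTML input.
XmlDocument doc = new XmlDocument();
doc.Load(path);

// local-name() sidesteps any default XHTML namespace; otherwise you would
// need an XmlNamespaceManager for a plain //div selector to match.
XmlNodeList sessions = doc.SelectNodes("//*[local-name()='div'][@class='mplsession']");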

KVK