I have a tool that compares two CSV files and then buckets each cell into one of six buckets. Basically, it reads in the CSV files (using the fast CSV reader, credit: http://www.codeproject.com/KB/database/CsvReader.aspx) and then creates a dictionary for each file based on the key columns provided by the user. I then iterate through the dictionaries, comparing the values and writing a result CSV file.

While it is blazing fast, it is very inefficient in terms of memory usage: I cannot compare files larger than about 150 MB on my box with 3 GB of physical memory.

Here is the code snippet that reads the expected file. At the end of this piece, memory usage is close to 500 MB according to Task Manager.

// Read Expected
long rowNumExp;
System.IO.StreamReader readerStreamExp = new System.IO.StreamReader(@expFile);
SortedDictionary<string, string[]> dictExp = new SortedDictionary<string, string[]>();
List<string[]> listDupExp = new List<string[]>();
using (CsvReader readerCSVExp = new CsvReader(readerStreamExp, hasHeaders, 4096))
{
    readerCSVExp.SkipEmptyLines = false;
    readerCSVExp.DefaultParseErrorAction = ParseErrorAction.ThrowException;
    readerCSVExp.MissingFieldAction = MissingFieldAction.ParseError;
    fieldCountExp = readerCSVExp.FieldCount;                
    string keyExp;
    string[] rowExp = null;
    while (readerCSVExp.ReadNextRecord())
    {
        if (hasHeaders == true)
        {
            rowNumExp = readerCSVExp.CurrentRecordIndex + 2;
        }
        else
        {
            rowNumExp = readerCSVExp.CurrentRecordIndex + 1;
        }
        try
        {
            rowExp = new string[fieldCount + 1];                    
        }
        catch (Exception exExpOutOfMemory)
        {
            MessageBox.Show(exExpOutOfMemory.Message);
            Environment.Exit(1);
        }                
        keyExp = readerCSVExp[keyColumns[0] - 1];
        for (int i = 1; i < keyColumns.Length; i++)
        {
            keyExp = keyExp + "|" + readerCSVExp[keyColumns[i] - 1];
        }
        try
        {
            readerCSVExp.CopyCurrentRecordTo(rowExp);
        }
        catch (Exception exExpCSVOutOfMemory)
        {
            MessageBox.Show(exExpCSVOutOfMemory.Message);
            Environment.Exit(1);
        }
        try
        {
            rowExp[fieldCount] = rowNumExp.ToString();
        }
        catch (Exception exExpRowNumOutOfMemory)
        {
            MessageBox.Show(exExpRowNumOutOfMemory.Message);
            Environment.Exit(1);
        }
        // Dedup Expected                        
        if (!(dictExp.ContainsKey(keyExp)))
        {
            dictExp.Add(keyExp, rowExp);                        
        }
        else
        {
            listDupExp.Add(rowExp);
        }                    
    }                
    logFile.WriteLine("Done Reading Expected File at " + DateTime.Now);
    Console.WriteLine("Done Reading Expected File at " + DateTime.Now + "\r\n");
    logFile.WriteLine("Done Creating Expected Dictionary at " + DateTime.Now);
    logFile.WriteLine("Done Identifying Expected Duplicates at " + DateTime.Now + "\r\n");                
}

Is there anything I could do to make it more memory efficient? Anything I could do differently above to consume less memory?

Any ideas are welcome.

Thanks guys for all the feedback.

I have incorporated the suggested changes to store the index of the row, instead of the row itself, in the dictionaries.

Here is the same code fragment with the new implementation.

        // Read Expected
        long rowNumExp;
        SortedDictionary<string, long> dictExp = new SortedDictionary<string, long>();
        System.Text.StringBuilder keyExp = new System.Text.StringBuilder();
        while (readerCSVExp.ReadNextRecord())
        {
            if (hasHeaders == true)
            {
                rowNumExp = readerCSVExp.CurrentRecordIndex + 2;
            }
            else
            {
                rowNumExp = readerCSVExp.CurrentRecordIndex + 1;
            }
            for (int i = 0; i < keyColumns.Length - 1; i++)
            {
                keyExp.Append(readerCSVExp[keyColumns[i] - 1]);
                keyExp.Append("|");
            }
            keyExp.Append(readerCSVExp[keyColumns[keyColumns.Length - 1] - 1]);
            // Dedup Expected                       
            if (!(dictExp.ContainsKey(keyExp.ToString())))
            {
                dictExp.Add(keyExp.ToString(), rowNumExp);
            }
            else
            {
                // Process Expected Duplicates          
                string dupExp;
                for (int i = 0; i < fieldCount; i++)
                {
                    if (i >= fieldCountExp)
                    {
                        dupExp = null;
                    }
                    else
                    {
                        dupExp = readerCSVExp[i];
                    }
                    foreach (int keyColumn in keyColumns)
                    {
                        if (i == keyColumn - 1)
                        {
                            resultCell = "duplicateEXP: '" + dupExp + "'";
                            resultCell = CreateCSVField(resultCell);
                            resultsFile.Write(resultCell);
                            comSumCol = comSumCol + 1;
                            countDuplicateExp = countDuplicateExp + 1;
                        }
                        else
                        {
                            if (checkPTColumns(i + 1, passthroughColumns) == false)
                            {
                                resultCell = "'" + dupExp + "'";
                                resultCell = CreateCSVField(resultCell);
                                resultsFile.Write(resultCell);
                                countDuplicateExp = countDuplicateExp + 1;
                            }
                            else
                            {
                                resultCell = "PASSTHROUGH duplicateEXP: '" + dupExp + "'";
                                resultCell = CreateCSVField(resultCell);
                                resultsFile.Write(resultCell);
                            }
                            comSumCol = comSumCol + 1;
                        }
                    }
                    if (comSumCol <= fieldCount)
                    {
                        resultsFile.Write(csComma);
                    }
                }
                if (comSumCol == fieldCount + 1)
                {
                    resultsFile.Write(csComma + rowNumExp);
                    comSumCol = comSumCol + 1;
                }
                if (comSumCol == fieldCount + 2)
                {
                    resultsFile.Write(csComma);
                    comSumCol = comSumCol + 1;
                }
                if (comSumCol > fieldCount + 2)
                {
                    comSumRow = comSumRow + 1;
                    resultsFile.Write(csCrLf);
                    comSumCol = 1;
                }
            }
            keyExp.Clear();
        }
        logFile.WriteLine("Done Reading Expected File at " + DateTime.Now + "\r\n");
        Console.WriteLine("Done Reading Expected File at " + DateTime.Now + "\r\n");
        logFile.WriteLine("Done Analyzing Expected Duplicates at " + DateTime.Now + "\r\n");
        Console.WriteLine("Done Analyzing Expected Duplicates at " + DateTime.Now + "\r\n");
        logFile.Flush();

However, the problem is that I need both data sets in memory. I actually iterate through both dictionaries looking for matches, mismatches, duplicates and dropouts based on the key.

Using this approach of storing the row index, I am still using a lot of memory, because for random access I now have to use the cached version of the CSV reader. So although the dictionary is much smaller, the caching of the data cancels out the savings and I still end up with roughly the same memory usage.

Hope I am making sense... :)

One option is to get rid of the dictionaries entirely and just loop through the two files, but I am not sure whether the performance would be as good as comparing two dictionaries.
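Roughly, the two-file loop I have in mind would look something like the sketch below. It assumes both files have already been sorted by the composite key (e.g. by a pre-pass or an external sort), actFile is just a stand-in name for the second file's path, and ExtractKey, CompareRows and the Report... methods are hypothetical placeholders, not real code from the tool.

// Merge-style comparison of two files that are both sorted by key.
// ExtractKey(), CompareRows(), ReportMissingInActual() and
// ReportMissingInExpected() are hypothetical helpers.
using (StreamReader expReader = new StreamReader(expFile))
using (StreamReader actReader = new StreamReader(actFile))
{
    string expLine = expReader.ReadLine();
    string actLine = actReader.ReadLine();

    while (expLine != null && actLine != null)
    {
        string expKey = ExtractKey(expLine);
        string actKey = ExtractKey(actLine);
        int cmp = string.CompareOrdinal(expKey, actKey);

        if (cmp == 0)
        {
            // Keys match: compare the rows cell by cell, then advance both files.
            CompareRows(expLine, actLine);
            expLine = expReader.ReadLine();
            actLine = actReader.ReadLine();
        }
        else if (cmp < 0)
        {
            // Key exists only in the expected file (a dropout on the actual side).
            ReportMissingInActual(expLine);
            expLine = expReader.ReadLine();
        }
        else
        {
            // Key exists only in the actual file.
            ReportMissingInExpected(actLine);
            actLine = actReader.ReadLine();
        }
    }
    // Whatever is left in either file after the loop is unmatched.
    // Duplicate keys within one file would need extra handling here.
}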

Any inputs are much appreciated.

+4  A: 

You could replace keyExp with a StringBuilder. Reallocating the string in a loop like that will keep allocating more memory, as strings are immutable.

StringBuilder keyExp = new StringBuilder();
...
    keyExp.Append("|" + readerCSVExp[keyColumns[i] - 1]);
... 

Are a lot of the strings the same? You could try interning them; then any identical strings will share the same memory rather than being copies...

rowExp[fieldCount] = String.Intern(rowNumExp.ToString()); 

// Dedup Expected               
string internedKey = (String.Intern(keyExp.ToString()));        
if (!(dictExp.ContainsKey(internedKey)))
{
   dictExp.Add(internedKey, rowExp);                        
}
else
{
   listDupExp.Add(rowExp);
}  

I'm not certain exactly how the code works, but beyond that I'd say you don't need to keep rowExp in the dictionary; keep something else, like a number, and write rowExp back out to disk in another file. This will probably save you the most memory, as rowExp seems to be an array of strings from the file and so is probably big. If you write it to a file and keep the number of the line it's at, then you can get back to it again in the future if you need to process it. If you saved the offset in the file as the value in the dictionary, you'd be able to find it again quickly. Maybe :).
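Something along these lines, roughly. The spill-file name and the one-line-per-row format are made up purely for illustration; a real version would need proper CSV escaping and wouldn't flush on every row.

// Keep only a key -> offset map in memory; the full rows go to a spill file.
SortedDictionary<string, long> dictExp = new SortedDictionary<string, long>();

using (FileStream spill = new FileStream("expected.spill", FileMode.Create, FileAccess.ReadWrite))
using (StreamWriter spillWriter = new StreamWriter(spill))
{
    // ... inside the ReadNextRecord() loop, after rowExp and keyExp are built ...
    spillWriter.Flush();                 // so spill.Position is accurate
    long offset = spill.Position;        // byte offset where this row will start
    spillWriter.WriteLine(string.Join(",", rowExp));

    if (!dictExp.ContainsKey(keyExp))
    {
        dictExp.Add(keyExp, offset);
    }
    // Duplicates: write them (or just their keys) to another file instead of a list.

    // Later, to get a row back: spill.Seek(offset, SeekOrigin.Begin) and read
    // one line with a new StreamReader over the same stream.
}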

Sam Holder
Interesting, I was thinking that the compiler/interpreter/jitter/something interned strings automatically, but that's probably only for strings that are known to be identical at compile time, I guess.
Davy8
@Davy8, that is correct. String interning only happens by default on strings that are created from compile-time constants.
Eric Lippert
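A quick illustration of that distinction (this is standard .NET behaviour, nothing specific to the code in the question; the key values are made up):

string a = "key1|key2";
string b = "key1|key2";
// Identical compile-time literals are interned automatically: one shared object.
Console.WriteLine(object.ReferenceEquals(a, b));                  // True

string suffix = "key2";
string c = "key1|" + suffix;   // composed at run time, so not interned automatically
Console.WriteLine(object.ReferenceEquals(a, c));                  // False
Console.WriteLine(object.ReferenceEquals(a, string.Intern(c)));   // True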
+2  A: 

Tell me if I get anything wrong.

The code above reads one CSV file and looks for duplicate keys. Each row goes into one of two sets: one for duplicate keys and one for unique keys.

What do you do with these rowsets?

Are they written to different files?

If so, there's no reason to store the non-unique rows in a list; as you find them, write them to a file.

When you do find duplicates, there's no need to store the entire row; just store the key and write the row to file (obviously a different file if you want to keep them separate).

If you need to do further processing on the different sets, then instead of storing the entire row, why not store the row number? Then, when you do whatever it is you do with the rows, you have the row number necessary to fetch the row again.

NB: rather than storing a row number, you can store the byte offset of the start of the row in the file. Then you can access the file and read rows randomly if you need to.

Just comment on this answer with any questions (or clarifications) you might have and I'll update the answer; I'll be here for another couple of hours anyway.

Edit
You can reduce the memory footprint further by not storing the keys but storing hashes of the keys. If you find a duplicate, seek to that position in the file, re-read the row and compare the actual keys.
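For example, something like this. ReadKeyAtOffset is a hypothetical helper that seeks back into the file and rebuilds the key for the record at that offset, and currentOffset stands for wherever you got the current record's position from.

// Keep only a 4-byte hash of each key plus the record's offset.
Dictionary<int, long> hashToOffset = new Dictionary<int, long>();
...
int keyHash = keyExp.ToString().GetHashCode();
long earlierOffset;
if (!hashToOffset.TryGetValue(keyHash, out earlierOffset))
{
    hashToOffset.Add(keyHash, currentOffset);   // first record seen with this hash
}
else
{
    // Hashes are not unique, so confirm before treating this as a duplicate.
    string earlierKey = ReadKeyAtOffset(earlierOffset);   // hypothetical helper
    if (earlierKey == keyExp.ToString())
    {
        // genuine duplicate key: handle as before
    }
    else
    {
        // different keys with the same hash; a real version would need
        // something like Dictionary<int, List<long>> to cope with collisions
    }
}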

Binary Worrier
Please look at my reply in the edited post above. Sorry, I did not know how to paste the code sample into a comment.
+2  A: 

If you haven't already, get a profiler on this, like DotTrace, to see which objects are using the memory; that'll give you a good idea of what needs optimising.

Some ideas from looking at the code:

Do you need to store listDupExp? It seems to me that with that list you're effectively loading both files into memory, so 2 x 150 MB plus some overhead could easily approach the 500 MB you see in Task Manager.

Secondly, can you start writing the output before you've read all the input? I presume this is tricky, as it looks like you need all the output items sorted before you write them out, but it may be something you could look at.

Paolo