I have a tool that compares two CSV files and then buckets each cell into one of six buckets. Basically, it reads in the CSV files (using the fast CSV reader, credit: http://www.codeproject.com/KB/database/CsvReader.aspx), creates a dictionary for each file keyed on the columns the user specifies, and then iterates through the dictionaries comparing the values and writing a result CSV file.
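At a high level, the flow looks like this (a simplified sketch; actFile and the method names here are just placeholders for the real code):

// Simplified outline -- ReadCsvIntoDictionary / CompareDictionaries are placeholder names
SortedDictionary<string, string[]> dictExpected = ReadCsvIntoDictionary(expFile, keyColumns);
SortedDictionary<string, string[]> dictActual = ReadCsvIntoDictionary(actFile, keyColumns);

// Walk both dictionaries, bucket every cell into one of the six buckets,
// and stream the classified cells out to the result CSV.
CompareDictionaries(dictExpected, dictActual, resultsFile);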
While it is blazing fast, it is very inefficient in terms of memory usage. I cannot compare files larger than about 150 MB on my box with 3 GB of physical memory.
Here is the code snippet that reads the expected file. At the end of this piece, memory usage is close to 500 MB according to Task Manager.
// Read Expected
long rowNumExp;
System.IO.StreamReader readerStreamExp = new System.IO.StreamReader(@expFile);
SortedDictionary<string, string[]> dictExp = new SortedDictionary<string, string[]>();
List<string[]> listDupExp = new List<string[]>();

using (CsvReader readerCSVExp = new CsvReader(readerStreamExp, hasHeaders, 4096))
{
    readerCSVExp.SkipEmptyLines = false;
    readerCSVExp.DefaultParseErrorAction = ParseErrorAction.ThrowException;
    readerCSVExp.MissingFieldAction = MissingFieldAction.ParseError;
    fieldCountExp = readerCSVExp.FieldCount;

    string keyExp;
    string[] rowExp = null;

    while (readerCSVExp.ReadNextRecord())
    {
        // Row number as it appears in the file (offset by one extra for the header row)
        if (hasHeaders == true)
        {
            rowNumExp = readerCSVExp.CurrentRecordIndex + 2;
        }
        else
        {
            rowNumExp = readerCSVExp.CurrentRecordIndex + 1;
        }

        try
        {
            rowExp = new string[fieldCount + 1];
        }
        catch (Exception exExpOutOfMemory)
        {
            MessageBox.Show(exExpOutOfMemory.Message);
            Environment.Exit(1);
        }

        // Build the composite key from the user-specified key columns
        keyExp = readerCSVExp[keyColumns[0] - 1];
        for (int i = 1; i < keyColumns.Length; i++)
        {
            keyExp = keyExp + "|" + readerCSVExp[keyColumns[i] - 1];
        }

        try
        {
            readerCSVExp.CopyCurrentRecordTo(rowExp);
        }
        catch (Exception exExpCSVOutOfMemory)
        {
            MessageBox.Show(exExpCSVOutOfMemory.Message);
            Environment.Exit(1);
        }

        try
        {
            rowExp[fieldCount] = rowNumExp.ToString();
        }
        catch (Exception exExpRowNumOutOfMemory)
        {
            MessageBox.Show(exExpRowNumOutOfMemory.Message);
            Environment.Exit(1);
        }

        // Dedup Expected
        if (!(dictExp.ContainsKey(keyExp)))
        {
            dictExp.Add(keyExp, rowExp);
        }
        else
        {
            listDupExp.Add(rowExp);
        }
    }

    logFile.WriteLine("Done Reading Expected File at " + DateTime.Now);
    Console.WriteLine("Done Reading Expected File at " + DateTime.Now + "\r\n");
    logFile.WriteLine("Done Creating Expected Dictionary at " + DateTime.Now);
    logFile.WriteLine("Done Identifying Expected Duplicates at " + DateTime.Now + "\r\n");
}
Is there anything I could do to make it more memory efficient? Anything I could do differently above to consume less memory?
Any ideas are welcome.
Thanks guys for all the feedback.
I have incorporated the suggested changes and now store the index of the row in the dictionaries instead of the row itself.
Here is the same code fragment with the new implementation.
// Read Expected
long rowNumExp;
SortedDictionary<string, long> dictExp = new SortedDictionary<string, long>();
System.Text.StringBuilder keyExp = new System.Text.StringBuilder();

while (readerCSVExp.ReadNextRecord())
{
    if (hasHeaders == true)
    {
        rowNumExp = readerCSVExp.CurrentRecordIndex + 2;
    }
    else
    {
        rowNumExp = readerCSVExp.CurrentRecordIndex + 1;
    }

    for (int i = 0; i < keyColumns.Length - 1; i++)
    {
        keyExp.Append(readerCSVExp[keyColumns[i] - 1]);
        keyExp.Append("|");
    }
    keyExp.Append(readerCSVExp[keyColumns[keyColumns.Length - 1] - 1]);

    // Dedup Expected
    if (!(dictExp.ContainsKey(keyExp.ToString())))
    {
        dictExp.Add(keyExp.ToString(), rowNumExp);
    }
    else
    {
        // Process Expected Duplicates
        string dupExp;
        for (int i = 0; i < fieldCount; i++)
        {
            if (i >= fieldCountExp)
            {
                dupExp = null;
            }
            else
            {
                dupExp = readerCSVExp[i];
            }

            foreach (int keyColumn in keyColumns)
            {
                if (i == keyColumn - 1)
                {
                    resultCell = "duplicateEXP: '" + dupExp + "'";
                    resultCell = CreateCSVField(resultCell);
                    resultsFile.Write(resultCell);
                    comSumCol = comSumCol + 1;
                    countDuplicateExp = countDuplicateExp + 1;
                }
                else
                {
                    if (checkPTColumns(i + 1, passthroughColumns) == false)
                    {
                        resultCell = "'" + dupExp + "'";
                        resultCell = CreateCSVField(resultCell);
                        resultsFile.Write(resultCell);
                        countDuplicateExp = countDuplicateExp + 1;
                    }
                    else
                    {
                        resultCell = "PASSTHROUGH duplicateEXP: '" + dupExp + "'";
                        resultCell = CreateCSVField(resultCell);
                        resultsFile.Write(resultCell);
                    }
                    comSumCol = comSumCol + 1;
                }
            }

            if (comSumCol <= fieldCount)
            {
                resultsFile.Write(csComma);
            }
        }

        if (comSumCol == fieldCount + 1)
        {
            resultsFile.Write(csComma + rowNumExp);
            comSumCol = comSumCol + 1;
        }
        if (comSumCol == fieldCount + 2)
        {
            resultsFile.Write(csComma);
            comSumCol = comSumCol + 1;
        }
        if (comSumCol > fieldCount + 2)
        {
            comSumRow = comSumRow + 1;
            resultsFile.Write(csCrLf);
            comSumCol = 1;
        }
    }

    keyExp.Clear();
}

logFile.WriteLine("Done Reading Expected File at " + DateTime.Now + "\r\n");
Console.WriteLine("Done Reading Expected File at " + DateTime.Now + "\r\n");
logFile.WriteLine("Done Analyzing Expected Duplicates at " + DateTime.Now + "\r\n");
Console.WriteLine("Done Analyzing Expected Duplicates at " + DateTime.Now + "\r\n");
logFile.Flush();
However, the problem is that I need both data sets in memory: I iterate through both dictionaries looking for matches, mismatches, duplicates, and dropouts based on the key.
Using this approach of storing the row index, I am still using a lot of memory, because for random access I now have to use the cached version of the CSV reader. So although the dictionary is much smaller, the caching of the data makes up for the savings and I still end up with roughly the same memory usage.
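To make the comparison step concrete, this is roughly what that pass looks like (a simplified sketch, not the real code; dictAct is the dictionary built the same way from the actual file, and CompareRows / WriteDropoutExpected / WriteDropoutActual stand in for the code that fetches the full rows back from the cached readers and writes the result cells):

// Simplified sketch of the comparison pass (dictAct and the called methods are placeholders)
foreach (KeyValuePair<string, long> exp in dictExp)
{
    long rowNumAct;
    if (dictAct.TryGetValue(exp.Key, out rowNumAct))
    {
        // Key exists in both files: fetch both rows via the cached readers
        // and compare field by field to classify each cell as match or mismatch.
        CompareRows(exp.Key, exp.Value, rowNumAct);
        dictAct.Remove(exp.Key);   // so whatever remains below is a true dropout
    }
    else
    {
        // Key only in expected: dropout from the actual file
        WriteDropoutExpected(exp.Key, exp.Value);
    }
}

// Whatever is left in dictAct never matched a key in expected
foreach (KeyValuePair<string, long> act in dictAct)
{
    WriteDropoutActual(act.Key, act.Value);
}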
Hope I am making sense... :)
One option is to get rid of the dictionaries entirely and just loop through the two files, but I am not sure whether that would be as fast as comparing two dictionaries.
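Something like a sorted-merge pass is what I have in mind there (a rough sketch only, assuming both files have already been sorted on the key columns and ignoring duplicates; actFile, BuildKey and the Write... methods are placeholders):

// Rough sketch of a dictionary-free merge pass over two key-sorted CSV files.
using (CsvReader expReader = new CsvReader(new System.IO.StreamReader(@expFile), hasHeaders, 4096))
using (CsvReader actReader = new CsvReader(new System.IO.StreamReader(@actFile), hasHeaders, 4096))
{
    bool hasExp = expReader.ReadNextRecord();
    bool hasAct = actReader.ReadNextRecord();

    while (hasExp && hasAct)
    {
        string keyE = BuildKey(expReader, keyColumns);
        string keyA = BuildKey(actReader, keyColumns);
        int cmp = string.CompareOrdinal(keyE, keyA);

        if (cmp == 0)
        {
            WriteMatchOrMismatch(expReader, actReader);   // compare field by field
            hasExp = expReader.ReadNextRecord();
            hasAct = actReader.ReadNextRecord();
        }
        else if (cmp < 0)
        {
            WriteDropoutExpected(expReader);              // key missing from actual
            hasExp = expReader.ReadNextRecord();
        }
        else
        {
            WriteDropoutActual(actReader);                // key missing from expected
            hasAct = actReader.ReadNextRecord();
        }
    }

    // Drain whichever file still has records left
    while (hasExp) { WriteDropoutExpected(expReader); hasExp = expReader.ReadNextRecord(); }
    while (hasAct) { WriteDropoutActual(actReader); hasAct = actReader.ReadNextRecord(); }
}

This way only one record per file is ever in memory at a time, but it hinges on the files being pre-sorted by the key.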
Any inputs are much appreciated.