Hi,

I am developing a C# application which needs to process approximately 4,000,000 English sentences. All of these sentences are stored in a tree, where each node in the tree is a class with these fields:

class TreeNode
{
    protected string word;
    protected Dictionary<string, TreeNode> children;
}

My problem is that the application uses up all the RAM (I have 2 GB) by the time it reaches the 2,000,000th sentence. So it only manages to process half the sentences, and then it slows down drastically.

What can I do to try and reduce the memory footprint of the application?

EDIT: Let me explain my application in a bit more detail. I have approximately 300,000 English sentences, and from each sentence I generate further sub-sentences like this:

Example sentence: "Football is a very popular sport". Sub-sentences I need:

  1. Football is a very popular sport
  2. is a very popular sport
  3. a very popular sport
  4. very popular sport
  5. popular sport
  6. sport

Each sentence is stored in the tree word by word. So, considering the example above, I have a TreeNode class with the word field set to "Football", and its children list holds the TreeNode for the word "is". The child of the "is" node is the "a" node, and the child of the "a" node is the "very" node. I need to store the sentences word by word because I need to be able to search for all the sentences starting with, for example, "Football is".

So basically, for each word in a sentence I am creating a new sub-sentence, and this is why I ultimately end up with 4,000,000 different sentences. Storing the data in a database is not an option, since the app needs to work on the whole structure at once, and it would further slow down the process if I had to keep writing all the data to a database.
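For reference, the sub-sentence generation described above can be sketched like this (the class and method names are my own, not from the original code):

```csharp
using System;
using System.Collections.Generic;

static class SubSentenceGenerator
{
    // Returns every suffix of the sentence that starts on a word boundary,
    // e.g. "popular sport" and "sport" for the example above.
    public static List<string> GenerateSubSentences(string sentence)
    {
        string[] words = sentence.Split(' ');
        var result = new List<string>();
        for (int i = 0; i < words.Length; i++)
            result.Add(string.Join(" ", words, i, words.Length - i));
        return result;
    }
}
```

Calling `GenerateSubSentences("Football is a very popular sport")` yields exactly the six sub-sentences listed above.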

Thanks

+4  A: 

The Dictionary type itself can consume a lot of memory. Have you considered using a List<KeyValuePair<string, TreeNode>> instead? The generic List uses a lot less memory per instance than a generic Dictionary.

Of course, the limitation of using a List instead of a Dictionary is that you don't get automatic indexing by strings. This is a clear trade-off between time and space. If the lists are short, it might even be faster than the dictionary (a linear search of ~10 keys is often faster than a hashtable lookup). Even if only most of the lists are short, it could still be a large improvement (e.g. if 95% of the lists have 10 or fewer items and the other 5% have at most around 100 items).

You could even use Collection<KeyValuePair<string, TreeNode>>, which uses even less memory than List<T>.
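A sketch of what a list-based node might look like (the names and structure here are illustrative, not from the question's code; lookup becomes a linear scan in exchange for lower per-node memory overhead):

```csharp
using System;
using System.Collections.Generic;

class ListTreeNode
{
    protected string word;
    // A list of pairs instead of a Dictionary: less memory per node,
    // but no hashed lookup.
    protected List<KeyValuePair<string, ListTreeNode>> children =
        new List<KeyValuePair<string, ListTreeNode>>();

    public ListTreeNode(string word) { this.word = word; }

    // Linear scan; fast in practice when most nodes have few children.
    public ListTreeNode GetChild(string key)
    {
        foreach (var pair in children)
            if (pair.Key == key) return pair.Value;
        return null;
    }

    // Returns the existing child for the key, or adds a new one.
    public ListTreeNode AddChild(string key)
    {
        var existing = GetChild(key);
        if (existing != null) return existing;
        var node = new ListTreeNode(key);
        children.Add(new KeyValuePair<string, ListTreeNode>(key, node));
        return node;
    }
}
```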

Eilon
So ... there is a HybridDictionary for exactly that purpose. It starts out as a list and then turns into a dictionary.
Hamish Grubijan
Yes, there is HybridDictionary, but even it has some additional costs associated with it. HybridDictionary starts out using about 32 bytes of memory, Dictionary<K,V> about 44, List<T> about 16, and Collection<T> about 8. (This is excluding CLR overhead and assumes 32bit.)
Eilon
I'll try out the HybridDictionary first, since I would like to keep the string indexing if possible.
Spi1988
+9  A: 

What is it you are using as the key? Where are you getting the data from? If these are words (not full sentences), I wonder whether you have a lot of duplicated keys (different string instances with the same fundamental value), in which case you might benefit from implementing a local interner to re-use the values (and let the transient copies be garbage-collected).

public sealed class StringCache {
    private readonly Dictionary<string,string> values
        = new Dictionary<string,string>(StringComparer.Ordinal);

    // Returns the canonical instance for the given value: the first
    // instance seen is cached, so later duplicates can be collected.
    public string this[string value] {
        get {
            string cached;
            if (!values.TryGetValue(value, out cached)) {
                values.Add(value, value);
                cached = value;
            }
            return cached;
        }
    }
}

Instantiate this when building the tree, and use (when you think a value is likely to be duplicated):

StringCache cache = new StringCache(); // re-use this instance while building
                                       // your tree
...
string s = ... // whatever (from reading your input)
s = cache[s];
Marc Gravell
No question about it, this will reduce the memory requirements. There are far fewer than 4 million words - it's closer to 100k. Interning them will make a huge difference.
LBushkin
@Marc: any reason you would not use string.Intern()?
Mitch Wheat
@Marc: I think I just discovered why!: memory might not be released until the CLR terminates...
Mitch Wheat
Thanks, I think this will help me a lot.
Spi1988
Classic Flyweight pattern.
Cameron MacFarland
@Mitch - indeed; the intent here was to allow the data to be shared for a controlled time / batch, not for the entire process.
Marc Gravell
@Cameron - well, I *guess* you could categorise a reference as a flyweight, but ultimately it is object re-use.
Marc Gravell
I implemented your solution, and the memory usage improved a lot. Before using your approach, the system was filling up all of my 2 GB of RAM by the time it reached 1/3 of the whole process. Now it only uses approximately 200 MB of RAM for the whole process. The gain was enormous, because I have a lot of duplicate strings. Thanks for your help, guys.
Spi1988
(replied on question)
Marc Gravell
+2  A: 

Could you map each word to an int? That way you have one map of int to string that contains the unique English words, and a tree structure that contains sentences, like so:

class TreeNode
{
    protected int word;
    protected Dictionary<int, TreeNode> children;
}

Dictionary<string, int> _AllWords;

Now the _AllWords collection is not optimal for looking up words by key as is. What you probably want here is something like a multi-key list where you can do fast lookups based on both key and value. CodeProject has an article about that.
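One simple alternative to a multi-key structure (a sketch of my own, not the CodeProject code) is to pair the Dictionary<string,int> with a List<string> indexed by the id, which gives fast lookup in both directions:

```csharp
using System;
using System.Collections.Generic;

class WordTable
{
    // word -> id (hashed lookup)
    private readonly Dictionary<string, int> idByWord =
        new Dictionary<string, int>(StringComparer.Ordinal);
    // id -> word (plain array indexing)
    private readonly List<string> wordById = new List<string>();

    // Returns the word's existing id, or assigns the next free one.
    public int GetId(string word)
    {
        int id;
        if (!idByWord.TryGetValue(word, out id))
        {
            id = wordById.Count;
            idByWord.Add(word, id);
            wordById.Add(word);
        }
        return id;
    }

    // O(1) reverse lookup: id -> word.
    public string GetWord(int id) { return wordById[id]; }
}
```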

Igor Zevaka
Note that on x86, this is actually the same as the "interner" suggestion I give, but without the requirement to do an extra lookup between the int key and a string value. Instead, each int **is** the reference itself.
Marc Gravell
+2  A: 

If your requirement is performance and you feel you need all the words in memory, then I'd suggest you use a string array to contain all the words, and then store indexes into that array in a sorted binary tree.
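A minimal sketch of the lookup side of this idea, assuming the unique words have been collected and sorted up front (the helper name is my own):

```csharp
using System;

static class WordIndex
{
    // Looks up a word's index in a pre-sorted array via binary search;
    // returns -1 when the word is absent.
    public static int Find(string[] sortedWords, string word)
    {
        int i = Array.BinarySearch(sortedWords, word, StringComparer.Ordinal);
        return i >= 0 ? i : -1;
    }
}
```

The tree nodes would then store these int indexes instead of the strings themselves.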

Trey Sargent
+1  A: 

It might be overkill for your situation but you could store your nodes in files on disk and use a B-Tree implementation to maximize IO performance. This is what most databases use internally because there is simply too much data to store in memory.

Ash
A: 

The only way you can significantly reduce memory usage is by not keeping the sentences in memory.

What are you trying to accomplish? Why are you building a tree? If you are counting something, count and discard the strings as you read them in. If you are building a graph (i.e. to analyze relationships between sentences and/or words), try enumerating the sentences and words so that each can be identified by a unique ID, and use that ID in memory instead.

I hope this helps.

tgiphil
I'm happy to report that there *were* ways of significantly reducing memory usage.
Marc Gravell
+1  A: 

Some points to think about.

  1. When you initialize your Dictionary<,>, pass in the maximum number of items you expect. This makes it allocate enough buckets at startup. The default is to initialize with 0 buckets, which grows to 3 (a prime). Once you add more items, the dictionary must resize and copy all items to a new, larger storage. If your program never idles, then the GC will not collect the old dictionaries.
  2. You could save space by encoding your strings. Strings use two bytes per character in memory. With some helper functions you could have your class look like this:
    class TreeNode
    {
        protected byte[] word;
        protected Dictionary<byte[], TreeNode> children;

        public string Word
        {
            get { return Encoding.UTF8.GetString(word); }
            set { word = Encoding.UTF8.GetBytes(value); }
        }

        public TreeNode GetChildByKey( string key )
        {
            TreeNode node;
            if(children.TryGetValue( Encoding.UTF8.GetBytes(key), out node  ))
            {
                return node;
            }
            return null;
        }
    }

[Edit] And I forgot that you also need a custom comparer for the byte[] key.

var children = new Dictionary<byte[], TreeNode>(new ByteArrayComparer());

public class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        if (x.Length != y.Length)
            return false;

        for (int i = 0; i < x.Length; i++)
        {
            if (x[i] != y[i])
                return false;
        }

        return true;
    }

    public int GetHashCode(byte[] a)
    {
        // Indexing a[0]..a[3] directly (as originally written) throws for
        // words shorter than four bytes; accumulate over all bytes instead.
        int hash = 17;
        for (int i = 0; i < a.Length; i++)
            hash = hash * 31 + a[i];
        return hash;
    }
}
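Point 1 above, applied to the question's node class (the capacity of 50 is purely illustrative; use whatever child count you actually expect):

```csharp
using System.Collections.Generic;

class TreeNode
{
    protected string word;

    // Passing an expected capacity up front lets the dictionary allocate
    // enough buckets immediately, avoiding the resize-and-copy cycles
    // (and the garbage they leave behind) as children are added.
    protected Dictionary<string, TreeNode> children =
        new Dictionary<string, TreeNode>(50); // 50 is an assumed size
}
```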
Mikael Svenson
Just for completeness - encoding may have a place here since the question relates to "English sentences", but for some cultures this could actually double the memory used by the strings.
Marc Gravell
That's a nice observation which I actually didn't consider; I'm so used to working in the Western character set. Before doing encoding, test and see if it helps. Using a variable-byte structure might also help, especially if the strings are longer. But before going the compression route, you should reconsider the whole problem at hand.
Mikael Svenson