Hi,

I am developing a C# application which needs to process approximately 4,000,000 English sentences. All of these sentences are stored in a tree, where each node in the tree is a class with these fields:

class TreeNode
{
    protected string word;
    protected Dictionary<string, TreeNode> children;
}

My problem is that the application uses up all the RAM (I have 2 GB) by the time it reaches the 2,000,000th sentence. So it only manages to process half the sentences, and then it slows down drastically.

What can I do to try and reduce the memory footprint of the application?

EDIT: Let me explain my application in a bit more detail. I have approximately 300,000 English sentences, and from each sentence I generate further sub-sentences like this:

Example sentence: "Football is a very popular sport". Sub-sentences I need:

  1. Football is a very popular sport
  2. is a very popular sport
  3. a very popular sport
  4. very popular sport
  5. popular sport
  6. sport

Each sentence is stored in the tree word by word. So, considering the example above, I have a TreeNode class with the word field set to "Football", and its children list holds the TreeNode for the word "is". The child of the "is" node is the "a" node, and the child of the "a" node is the "very" node. I need to store the sentences word by word because I need to be able to search for all the sentences starting with, for example, "Football is".

So basically, for each word in a sentence I am creating a new sub-sentence, and this is why I ultimately end up with 4,000,000 different sentences. Storing the data in a database is not an option, since the app needs to work on the whole structure at once, and it would further slow down the process if I had to keep writing all the data to a database.
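For reference, the sub-sentence generation described above can be sketched like this (the class and method names are my own, not from the original code):

```csharp
using System;
using System.Collections.Generic;

static class SubSentenceGenerator
{
    // Returns every suffix of the sentence that starts on a word boundary,
    // e.g. "popular sport" and "sport" for the example above.
    public static List<string> GenerateSubSentences(string sentence)
    {
        string[] words = sentence.Split(' ');
        var result = new List<string>();
        for (int i = 0; i < words.Length; i++)
            result.Add(string.Join(" ", words, i, words.Length - i));
        return result;
    }
}
```

Calling `GenerateSubSentences("Football is a very popular sport")` yields exactly the six sub-sentences listed above.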

Thanks

+4  A: 

The Dictionary type itself can consume a lot of memory. Have you considered using a List<KeyValuePair<string, TreeNode>> instead? The generic List uses a lot less memory per instance than a generic Dictionary.

Of course, the limitation of using a List instead of a Dictionary is that you don't get automatic indexing by strings. This is a clear trade-off between time and space. If the lists are short, it might even be faster than the dictionary (a linear search of ~10 keys is often faster than a hashtable lookup). Even if only most of the lists are short, it could still be a large improvement (e.g. if 95% of the lists have 10 or fewer items and the other 5% have at most around 100 items).

You could even use Collection<KeyValuePair<string, TreeNode>>, which uses even less memory than List<T>.
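A sketch of what a list-based node might look like (the names and structure here are illustrative, not from the question's code; lookup becomes a linear scan in exchange for lower per-node memory overhead):

```csharp
using System;
using System.Collections.Generic;

class ListTreeNode
{
    protected string word;
    // A list of pairs instead of a Dictionary: less memory per node,
    // but no hashed lookup.
    protected List<KeyValuePair<string, ListTreeNode>> children =
        new List<KeyValuePair<string, ListTreeNode>>();

    public ListTreeNode(string word) { this.word = word; }

    // Linear scan; fast in practice when most nodes have few children.
    public ListTreeNode GetChild(string key)
    {
        foreach (var pair in children)
            if (pair.Key == key) return pair.Value;
        return null;
    }

    // Returns the existing child for the key, or adds a new one.
    public ListTreeNode AddChild(string key)
    {
        var existing = GetChild(key);
        if (existing != null) return existing;
        var node = new ListTreeNode(key);
        children.Add(new KeyValuePair<string, ListTreeNode>(key, node));
        return node;
    }
}
```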

Eilon
So ... there is a HybridDictionary for exactly that purpose. It starts out as a list and then turns into a dictionary.
Hamish Grubijan
Yes, there is HybridDictionary, but even it has some additional costs associated with it. HybridDictionary starts out using about 32 bytes of memory, Dictionary<K,V> about 44, List<T> about 16, and Collection<T> about 8. (This is excluding CLR overhead and assumes 32bit.)
Eilon
I'll try out the HybridDictionary first, since I would like to keep the string indexing if possible.
Spi1988
+9  A: 

What is it you are using as the key? Where are you getting the data from? If these are words (not full sentences), I wonder whether you have a lot of duplicated keys (different string instances with the same fundamental value), in which case you might benefit from implementing a local interner to re-use the values (and let the transient copies be garbage-collected).

public sealed class StringCache {
    private readonly Dictionary<string,string> values
        = new Dictionary<string,string>(StringComparer.Ordinal);

    // Returns the canonical instance for the given value: the first
    // instance seen is cached, so later duplicates can be collected.
    public string this[string value] {
        get {
            string cached;
            if (!values.TryGetValue(value, out cached)) {
                values.Add(value, value);
                cached = value;
            }
            return cached;
        }
    }
}

Instantiate this when building the tree, and use (when you think a value is likely to be duplicated):

StringCache cache = new StringCache(); // re-use this instance while building
                                       // your tree
...
string s = ... // whatever (from reading your input)
s = cache[s];
Marc Gravell
No question about it, this will reduce the memory requirements. There are far fewer than 4 million words - it's closer to 100k. Interning them will make a huge difference.
LBushkin
@Marc: any reason you would not use string.Intern()?
Mitch Wheat
@Marc: I think I just discovered why!: memory might not be released until the CLR terminates...
Mitch Wheat
Thanks, I think this will help me a lot.
Spi1988
Classic Flyweight pattern.
Cameron MacFarland
@Mitch - indeed; the intent here was to allow the data to be shared for a controlled time / batch, not for the entire process.
Marc Gravell
@Cameron - well, I *guess* you could categorise a reference as a flyweight, but ultimately it is object re-use.
Marc Gravell
I implemented your solution, and the memory usage improved a lot. Before using your approach, the system was filling up all of my 2 GB of RAM by the time it reached 1/3 of the whole process. Now it only uses approximately 200 MB of RAM for the whole process. The gain was enormous, because I have a lot of duplicate strings. Thanks for your help, guys.
Spi1988
(replied on question)
Marc Gravell
+2  A: 

Could you map each word to an int? That way you have one map of int to string that contains the unique English words, and a tree structure that contains sentences, like so:

class TreeNode
{
    protected int word;
    protected Dictionary<int, TreeNode> children;
}

Dictionary<string, int> _AllWords;

Now the _AllWords collection is not optimal for looking up words by key as is. What you probably want here is something like a multi-key list where you can do fast lookups based on both key and value. CodeProject has an article about that.
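One simple alternative to a multi-key structure (a sketch of my own, not the CodeProject code) is to pair the Dictionary<string,int> with a List<string> indexed by the id, which gives fast lookup in both directions:

```csharp
using System;
using System.Collections.Generic;

class WordTable
{
    // word -> id (hashed lookup)
    private readonly Dictionary<string, int> idByWord =
        new Dictionary<string, int>(StringComparer.Ordinal);
    // id -> word (plain array indexing)
    private readonly List<string> wordById = new List<string>();

    // Returns the word's existing id, or assigns the next free one.
    public int GetId(string word)
    {
        int id;
        if (!idByWord.TryGetValue(word, out id))
        {
            id = wordById.Count;
            idByWord.Add(word, id);
            wordById.Add(word);
        }
        return id;
    }

    // O(1) reverse lookup: id -> word.
    public string GetWord(int id) { return wordById[id]; }
}
```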

Igor Zevaka
Note that on x86, this is actually the same as the "interner" suggestion I give, but without the requirement to do an extra lookup between the int key and a string value. Instead, each int **is** the reference itself.
Marc Gravell
+2  A: 

If your requirement is performance and you feel you need all the words in memory, then I'd suggest you use a string array to contain all the words, and then store indexes into that array in a sorted binary tree.
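A minimal sketch of the lookup side of this idea, assuming the unique words have been collected and sorted up front (the helper name is my own):

```csharp
using System;

static class WordIndex
{
    // Looks up a word's index in a pre-sorted array via binary search;
    // returns -1 when the word is absent.
    public static int Find(string[] sortedWords, string word)
    {
        int i = Array.BinarySearch(sortedWords, word, StringComparer.Ordinal);
        return i >= 0 ? i : -1;
    }
}
```

The tree nodes would then store these int indexes instead of the strings themselves.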

Trey Sargent
+1  A: 

It might be overkill for your situation but you could store your nodes in files on disk and use a B-Tree implementation to maximize IO performance. This is what most databases use internally because there is simply too much data to store in memory.

Ash
A: 

The only way you can significantly reduce memory usage is by not keeping the sentences in memory.

What are you trying to accomplish? Why are you building a tree? If you are counting something, count and discard the strings as you read them in. If you are building a graph (i.e. to analyze relationships between sentences and/or words), try enumerating the sentences and words so that each can be identified by a unique ID, and use that ID in memory instead.

I hope this helps.

tgiphil
I'm happy to report that there *were* ways of significantly reducing memory usage.
Marc Gravell
+1  A: 

Some points to think about.

  1. When you initialize your Dictionary<,>, pass in the maximum number of items you expect. This makes it allocate enough buckets at startup. The default is to initialize with 0 buckets, which grows to 3 (a prime). Once you add more items, the dictionary must resize and copy all items to a new, larger storage. If your program never idles, then the GC will not collect the old dictionaries.
  2. You could save space by encoding your strings. Strings use two bytes per character in memory. With some helper functions you could have your class look like this:
    class TreeNode
    {
        protected byte[] word;
        protected Dictionary<byte[], TreeNode> children;

        public string Word
        {
            get { return Encoding.UTF8.GetString(word); }
            set { word = Encoding.UTF8.GetBytes(value); }
        }

        public TreeNode GetChildByKey( string key )
        {
            TreeNode node;
            if(children.TryGetValue( Encoding.UTF8.GetBytes(key), out node  ))
            {
                return node;
            }
            return null;
        }
    }

[Edit] And I forgot that you also need a custom comparer for the byte[] key.

var children = new Dictionary<byte[], TreeNode>(new ByteArrayComparer());

public class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        if (x.Length != y.Length)
            return false;

        for (int i = 0; i < x.Length; i++)
        {
            if (x[i] != y[i])
                return false;
        }

        return true;
    }

    public int GetHashCode(byte[] a)
    {
        // Indexing a[0]..a[3] directly (as originally written) throws for
        // words shorter than four bytes; accumulate over all bytes instead.
        int hash = 17;
        for (int i = 0; i < a.Length; i++)
            hash = hash * 31 + a[i];
        return hash;
    }
}
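Point 1 above, applied to the question's node class (the capacity of 50 is purely illustrative; use whatever child count you actually expect):

```csharp
using System.Collections.Generic;

class TreeNode
{
    protected string word;

    // Passing an expected capacity up front lets the dictionary allocate
    // enough buckets immediately, avoiding the resize-and-copy cycles
    // (and the garbage they leave behind) as children are added.
    protected Dictionary<string, TreeNode> children =
        new Dictionary<string, TreeNode>(50); // 50 is an assumed size
}
```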
Mikael Svenson
Just for completeness - encoding may have a place here since the question relates to "English sentences", but for some cultures this could actually double the memory used by the strings.
Marc Gravell
That's a nice observation which I actually didn't consider; I'm so used to working in the Western character set. Before doing encoding, test and see if it helps. Using a variable-byte structure might also help, especially if the strings are longer. But before going the compression route, you should reconsider the whole problem at hand.
Mikael Svenson