Data structure for recognizing repeat values.

+5 Q:

I am running a fairly large search, and am getting a System.OutOfMemoryException.

The problem is that I am storing a string key for each state I have previously visited in a HashSet<string>. Once this gets to around 7 million elements, it crashes. My thought is that I don't need to be able to retrieve the strings, only to recognize whether a string exists in the set.

I seem to remember a specialized data structure for this kind of thing, but I can't remember the name of it for the life of me. If I recall correctly, it had fairly constant memory requirements: you add elements to it, and it can tell you with some degree of certainty whether you have already added some value to it. Am I making this up, or does this exist? Any tips?

Are you talking about the Dictionary class?

http://msdn.microsoft.com/en-us/library/xfhwa508.aspx

An excerpt from MSDN:

Every key in a Dictionary must be unique according to the dictionary's equality comparer. A key cannot be null, but a value can be, if the value type TValue is a reference type.

You can use the ContainsKey method to check to see if an entry has already been inserted before inserting a new record.

Abe Miessler 2010-08-11 20:58:55

Yes, that is true, but the storage required by a dictionary is linear in the number of elements. I need something that uses less memory.

CaptnCraig 2010-08-11 21:01:43

+4 A:

You're probably thinking of a Bloom filter. It gives you a probabilistic result when you check if a string is in the set. If the string is in the set, the filter always reports it. If it isn't, the filter may still report that it is (a false positive), depending on what else is in your set. The memory it needs for a given false-positive rate does grow with the number of unique elements you add, but it's far below what a HashSet would take up.
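
To make the idea concrete, here is a minimal sketch of a Bloom filter, written in Python only as a language-neutral illustration (it is not a .NET class, and the names BloomFilter, expected_items and false_positive_rate are made up for this example). It assumes the standard sizing formulas m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, and it derives the k bit positions from a single SHA-256 digest by double hashing, which is one common choice rather than the only one.

```python
import hashlib
import math


class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Assumption for this sketch: take two 64-bit values from one SHA-256
        # digest and combine them (double hashing) to get k bit positions.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        # Set every bit position derived from the key.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # True only if all derived bits are set; unrelated keys may have set them.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Sized for the 7 million states from the question, at an assumed 1% error rate.
visited = BloomFilter(expected_items=7_000_000, false_positive_rate=0.01)
visited.add("state-A")
print("state-A" in visited)  # True: a key that was added is always found
print(len(visited.bits))     # size of the bit array in bytes
```

With these assumed parameters the bit array comes to roughly 8 MB and does not grow as keys are added. The price is that a false positive makes the search treat a new state as already visited and skip it, so whether this is acceptable depends on the search.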

Karmastan 2010-08-11 21:02:03

Yes! That is it! I would accept this, but I really do like nos' idea to use a trie.

CaptnCraig 2010-08-11 21:08:28

+2 A:

Bloom Filter?

bowenl2 2010-08-11 21:02:17

+2 A:

There's no standard collection in .NET for this, but you can store a lot of strings in a trie, using a lot less space than e.g. a hashtable/set.

nos 2010-08-11 21:02:19

+3 A:

I think you meant the trie data structure. A trie can be used to replace a hash table, over which it has the following advantages:

Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of the search key), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time (where N is the number of stored keys), but far more typically it is O(1), with O(m) time spent evaluating the hash.
There are no collisions of different keys in a trie.
Buckets in a trie, which are analogous to the hash table buckets that store key collisions, are only necessary if a single key is associated with more than one value.
There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.
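
The points above can be sketched as a small trie-backed set. This is a hypothetical illustration in Python (TrieSet and its methods are names invented here, not a .NET or standard-library type), using nested dictionaries as nodes. It shows the O(m) lookup, exact membership with no collisions, and alphabetical iteration; it is not tuned for memory, since Python dictionaries are themselves heavy.

```python
class TrieSet:
    """Exact set of strings stored as a character trie; shared prefixes are stored once."""

    _END = ""  # end-of-key marker; cannot clash with a one-character edge label

    def __init__(self):
        self.root = {}
        self.count = 0

    def add(self, key):
        """Insert key; return True if it was new, False if it was a repeat."""
        node = self.root
        for ch in key:  # one step per character: O(m)
            node = node.setdefault(ch, {})
        if self._END in node:
            return False
        node[self._END] = True
        self.count += 1
        return True

    def __contains__(self, key):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return False
        # A key counts only if it was added itself, not merely as a prefix.
        return self._END in node

    def __iter__(self):
        yield from self._walk(self.root, "")

    def _walk(self, node, prefix):
        # Visiting edges in sorted order yields the keys alphabetically.
        if self._END in node:
            yield prefix
        for ch in sorted(k for k in node if k != self._END):
            yield from self._walk(node[ch], prefix + ch)


seen = TrieSet()
for key in ["cart", "car", "bat", "car"]:
    if not seen.add(key):
        print("repeat:", key)  # prints "repeat: car"
print(list(seen))              # ['bat', 'car', 'cart']
```

How much space a trie saves in practice depends on how many prefixes the keys share and on how compactly the nodes are represented, so it is worth measuring with real keys before relying on it.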

Prabhu Jayaraman 2010-08-12 04:57:36
