I'm in need of a data structure that can handle small sets (10-20 strings, at most 50, of varying length) very fast. False positives are OK, but false negatives are not.

The last requirement makes Bloom filters seem like a good fit, but I'm not sure about their speed. Any other recommendations?

Edit: The set only needs to support insert + membership test.

+1  A: 

Depending on what operations you wish to perform against the set, the fastest will likely be a HashSet<string>. See HashSet for more.

ADDITION: Asking Mr. Google, here's an article written by a gentleman who wrote a Bloom filter function in C#. However, he's still using (multiple) hash codes to populate the filter. I would expect that on small data sets it will be slower than a HashSet.
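
For concreteness, here's a minimal sketch of the insert + membership-test usage (the names and sample strings are illustrative):

```csharp
using System;
using System.Collections.Generic;

class HashSetDemo
{
    static void Main()
    {
        // Insert: Add hashes the string once and stores it in a bucket.
        var set = new HashSet<string> { "alpha", "beta", "gamma" };
        set.Add("delta");

        // Membership test: Contains re-hashes the candidate and probes its bucket.
        Console.WriteLine(set.Contains("beta"));    // True
        Console.WriteLine(set.Contains("epsilon")); // False
    }
}
```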

Thomas
Using a HashSet atm, and it's too slow.
thr
@thr - From what I've read, a Bloom filter is used on very large datasets; on small datasets of the size you mentioned, you probably will not get any performance gain.
Thomas
That's my understanding also. But since I can pre-calculate all the hashes I need to check against, I wanted to investigate how fast Bloom filters are.
thr
@thr - Found an article that implemented a Bloom Filter. Still, I think that it will be slower for the size datasets you are discussing than a `HashSet`.
Thomas
`HashSet` pre-calculates hashes too (use Reflector).
Anton Tykhyy
A: 

Check out the System.Collections.Specialized Namespace on MSDN.

Especially the HybridDictionary and the StringDictionary.

I know they're not sets, but you can use null values for each key. (Java does the same with its out-of-the-box Sets and is still "fast".)
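
A rough sketch of the null-value trick with HybridDictionary (illustrative only; StringDictionary works similarly, with string values instead of null):

```csharp
using System;
using System.Collections.Specialized;

class HybridDemo
{
    static void Main()
    {
        // Keys are the set members; the values are just placeholders (null).
        var set = new HybridDictionary();
        set.Add("alpha", null);
        set.Add("beta", null);

        // Membership test checks the key only.
        Console.WriteLine(set.Contains("beta"));  // True
        Console.WriteLine(set.Contains("delta")); // False
    }
}
```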

TWiStErRob
A: 

Why not use a Radix Tree? It's a specialized, trie-based data structure for storing a set of strings.

gradbot
+1  A: 

If the set of strings to check for membership is much larger than the set of valid strings, then a trie might give you better performance than a HashSet. The speed of a lookup in a hash set depends on the run time of the hashing algorithm, which is usually O(k), where k is the length of the string. This is true whether or not the string is in the set.

With a trie, lookup is still O(k), but if the string is not in the trie, the lookup terminates as soon as a single character fails to match. So in the best case, a lookup for an invalid string is O(1).
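
A minimal trie sketch showing that early exit (illustrative, not tuned code):

```csharp
using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsTerminal; // true if a stored string ends at this node
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(string s)
    {
        TrieNode node = _root;
        foreach (char c in s)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new TrieNode();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsTerminal = true;
    }

    public bool Contains(string s)
    {
        TrieNode node = _root;
        foreach (char c in s)
        {
            // Early exit: the first unmatched character ends the lookup.
            if (!node.Children.TryGetValue(c, out node))
                return false;
        }
        return node.IsTerminal;
    }
}
```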

Niki Yoshiuchi
Big-oh analysis is rarely useful for such a small N.
Brian
Agreed, but if hash sets are too slow (as reported in another comment), maybe speeding up the hashing algorithm can help.
Niki Yoshiuchi
+4  A: 

How about a plain array of strings that you scan with a for-loop, checking membership with String.Equals?

For sets this small, fancy data structures may incur too much overhead, and big-oh does not apply. Have you tried doing the simplest possible thing and measuring that?

(If false positives are OK, you might also keep, say, an array of 1024 bools, computing a poor 'hash' of each string from the lowest 5 bits of its first two characters to get a 10-bit index into the boolean array. Seems like this would be just a few instructions long.)
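
A sketch of both ideas; the PoorHash helper is a hypothetical name implementing the 10-bit scheme just described:

```csharp
class TinySet
{
    private readonly string[] _items;
    private readonly bool[] _maybe = new bool[1024];

    public TinySet(string[] items)
    {
        _items = items;
        foreach (string s in items)
            _maybe[PoorHash(s)] = true;
    }

    // 10-bit "hash" from the lowest 5 bits of the first two characters.
    private static int PoorHash(string s)
    {
        int a = s.Length > 0 ? s[0] & 0x1F : 0;
        int b = s.Length > 1 ? s[1] & 0x1F : 0;
        return (a << 5) | b;
    }

    // Exact test: plain linear scan with String.Equals.
    public bool Contains(string s)
    {
        for (int i = 0; i < _items.Length; i++)
            if (string.Equals(_items[i], s))
                return true;
        return false;
    }

    // Approximate test: false positives possible, false negatives are not.
    public bool MightContain(string s)
    {
        return _maybe[PoorHash(s)];
    }
}
```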

Brian
A: 

If HashSet is too slow for you, you can use the classic LZ compressor technique: a fixed-size array, indexed by hash code, where each entry points to a linked list of strings.

If you know the domain of your data, just construct a perfect hash function and use it. Otherwise, you can use string.GetHashCode() or something like Murmur hash, and use hash(str) % array.Length as the array index.

I suppose an array size of 256-512 entries is good enough for your data structure with 50 strings.
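
A rough sketch of that layout, assuming string.GetHashCode() as the hash (a Murmur implementation would slot in the same way):

```csharp
class ChainedSet
{
    private class Node
    {
        public string Value;
        public Node Next;
    }

    // Fixed-size bucket array; each entry heads a linked list of strings.
    private readonly Node[] _buckets = new Node[512];

    private int Index(string s)
    {
        // Mask off the sign bit before taking the modulus.
        return (s.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;
    }

    public void Add(string s)
    {
        int i = Index(s);
        _buckets[i] = new Node { Value = s, Next = _buckets[i] };
    }

    public bool Contains(string s)
    {
        for (Node n = _buckets[Index(s)]; n != null; n = n.Next)
            if (n.Value == s)
                return true;
        return false;
    }
}
```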

ssp
A: 

The main benefit of Bloom filters over hash tables is that their size depends on the number of objects in the database and the permitted false-positive probability, but not on the size of the objects themselves. Since your database is so small, I doubt its size is your main concern.

HashSets are theoretically the best data structure for your requirement, but since the database is so small, an O(log n) structure like a SortedDictionary is often preferable, or maybe even plain linear search (as mentioned). I recall stories where switching from hash-based collections to tree-based ones drastically improved performance for small sets.

The best way is to try each approach and measure its performance.
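
For instance, a quick Stopwatch harness along these lines (illustrative; measure with your real strings and query mix):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class Bench
{
    static void Measure(string name, Func<bool> lookup, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            lookup();
        sw.Stop();
        Console.WriteLine("{0}: {1} ms", name, sw.ElapsedMilliseconds);
    }

    static void Main()
    {
        string[] items = { "alpha", "beta", "gamma", "delta" }; // substitute real data
        var hashSet = new HashSet<string>(items);
        var sortedDict = new SortedDictionary<string, bool>();
        foreach (string s in items)
            sortedDict[s] = true;

        const int n = 1000000;
        Measure("HashSet", () => hashSet.Contains("gamma"), n);
        Measure("SortedDictionary", () => sortedDict.ContainsKey("gamma"), n);
    }
}
```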

Oak