Disclaimer: I realize the totally obvious answer to this question is HashSet<string>
. It is absurdly fast, it is unordered, and its values are unique.
But I'm just wondering, because HashSet<T>
is a mutable class, so it has Add
, Remove
, etc.; and so I am not sure if the underlying data structure that makes these operations possible makes certain performance sacrifices when it comes to read operations -- in particular, I'm concerned with Contains
.
Basically, I'm wondering what are the absolute fastest-performing data structures in existence that can supply a Contains
method for objects of type string
. Within or outside of the .NET framework itself.
I'm interested in all kinds of answers, regardless of their limitations. For example I can imagine that some structure might be restricted to strings of a certain length, or may be optimized depending on the problem domain (e.g., range of possible input values), etc. If it exists, I want to hear about it.
One last thing: I'm not restricting this to read-only data structures. Obviously any read-write data structure could be embedded inside a read-only wrapper. The only reason I even mentioned the word "read-only" is that I don't have any requirement for a data structure to allow adding, removing, etc. If it has those functions, though, I won't complain.
UPDATE:
Moron's answer is an excellent example of the sort of thing I'm looking for. A Trie* definitely seems like a great possibility for the following reason: HashSet<T>.Contains
depends on the GetHashCode
function of some IEqualityComparer<string>
, which, as far as I can tell, is O(n)** by default in .NET. In other words, every character in a string must be examined for HashSet<string>.Contains
to return either true
or false
. For a Trie
, only a return value of true
would take O(n) to determine; a return value of false
could potentially return much more quickly.
This is of course hypothetical. So far I have not written or come across a Trie implementation in .NET that can beat a HashSet<string>
at Contains
(though an implementation I wrote myself came quite close for the alphabet 'a' to 'z'). I'm just saying, it seems possible.
*That link, by the way, also led me to another intriguing/similar possibility: DAWG.
**Here "n" is referring to the length of the string.