What are the fastest-performing options for a read-only, unordered collection of unique strings?

views:

100

answers:

+3 Q:

What are the fastest-performing options for a read-only, unordered collection of unique strings?

Disclaimer: I realize the totally obvious answer to this question is HashSet<string>. It is absurdly fast, it is unordered, and its values are unique.

But I'm just wondering, because HashSet<T> is a mutable class, so it has Add, Remove, etc.; and so I am not sure if the underlying data structure that makes these operations possible makes certain performance sacrifices when it comes to read operations -- in particular, I'm concerned with Contains.

Basically, I'm wondering what are the absolute fastest-performing data structures in existence that can supply a Contains method for objects of type string. Within or outside of the .NET framework itself.

I'm interested in all kinds of answers, regardless of their limitations. For example I can imagine that some structure might be restricted to strings of a certain length, or may be optimized depending on the problem domain (e.g., range of possible input values), etc. If it exists, I want to hear about it.

One last thing: I'm not restricting this to read-only data structures. Obviously any read-write data structure could be embedded inside a read-only wrapper. The only reason I even mentioned the word "read-only" is that I don't have any requirement for a data structure to allow adding, removing, etc. If it has those functions, though, I won't complain.

UPDATE:

Moron's answer is an excellent example of the sort of thing I'm looking for. A Trie* definitely seems like a great possibility for the following reason: HashSet<T>.Contains depends on the GetHashCode function of some IEqualityComparer<string>, which, as far as I can tell, is O(n)** by default in .NET. In other words, every character in a string must be examined for HashSet<string>.Contains to return either true or false. For a Trie, only a return value of true would take O(n) to determine; a return value of false could potentially return much more quickly.

This is of course hypothetical. So far I have not written or come across a Trie implementation in .NET that can beat a HashSet<string> at Contains (though an implementation I wrote myself came quite close for the alphabet 'a' to 'z'). I'm just saying, it seems possible.

_{*That link, by the way, also led me to another intriguing/similar possibility: DAWG.}
_{**Here "n" is referring to the length of the string.}

+1 A:

Apart from your wondering Hashset is the fastest collection.

There's no faster method because the underlying Hashtable allows O(1) read-write-access

Tobias P. 2010-06-17 16:25:25

There might nevertheless be a faster collection, e.g. one which is always twice as fast.

ChrisW 2010-06-17 16:26:32

O(1) doesn't mean "as fast as possible"; it just means that the cost doesn't increase with the size of the input (in this case, the size of the collection). I'm certainly open to the possibility that you're right, though. But I'm hoping to learn about some other options.

Dan Tao 2010-06-17 16:30:47

Yeah you'r right, but direct access - which a hashtable provides - is nerverless the fastest access method

Tobias P. 2010-06-17 16:32:55

+1 A:

A hashing container approaches O(1) for insert and retrieval, so from an order-of-magnitude perspective you can't get much better than that.

Within a hash container, your performance over time is going to be related to two things: how good of a distribution your hash function provides, and how fast it can compute it. These are not equivalent - a poorly distributed function (where you end up with a lot of collisions) is going to be much more performance-impacting than a slower but better distributed hash function.

Thus, if you could come up with a perfect hash function that was also extremely fast to compute, that would be an improvement. It's possible that constraining the data in specific ways may make that easier. But, odds are you, whatever you come up with will not be as good as what already exists.

Joe 2010-06-17 17:00:27

+1 A:

Tries are good for doing a Contains, especially for strings from a finite alphabet. Given a string s, the time complexity for Contains on a trie is O(|s|) (|s| = length of s), which is optimal.

Moron 2010-06-17 17:33:31

I was actually thinking of something just like this myself. Definitely seems like a good option: fixed on the characters 'a' to 'z' (I attempted an implementation that was case-insensitive), I was able to get something together that performed about *as* fast as a `HashSet<string>`. Obviously, though, if it's only *as* fast, it's not really worth implementing. Still, despite the limitations, this definitely seems like a promising option. I'll have to explore it further.

Dan Tao 2010-06-17 17:57:15

@Dan. I am surprised that it wasn't faster than HashSet. Does HashSet's/string's hashing function not look at all the characters in the string? Anyway, not sure what tests you did, but I suppose it is relevant to your scenarios. In general, what you say is kind of suprising. It all depends on the hash function used, though. For instance, if the hash function looks at each character in the string, tries _will probably_ outperfom hash (if implemented properly).

Moron 2010-06-17 20:26:42

@Moron: It could of course just be that my implementation sucks ;) I do intend to revisit this soon, in any case.

Dan Tao 2010-06-17 21:00:18

+1 A:

Hashing tables are amortized O(1) for lookup. Can't do better than that, O(1/n) algorithms are perpetual motion devices. There are only two things that make them behave poorly:

A poor hashing function that causes many collisions. The worst one will degenerate lookup to O(n). You'll have no trouble with strings, they hash really well. String.GetHashCode() does a terrific job.
A collection that is heavily mutated with many removed items that were added early. This can causes many empty hash buckets that need to be skipped by iterators. Degradation to O(n) is technically possible although quite rare. A simple workaround is to rebuild the collection by reassigning the reference (like table = new HashSet(table); )

These kind of problems are rare. You don't design for them up front (other than the hash function), you start considering them only when you detect perf problems with the program.

Hans Passant 2010-06-17 18:18:03

ansaurus

tags:

views:

answers:

What are the fastest-performing options for a read-only, unordered collection of unique strings?

related questions