views: 644
answers: 6

In some library code, I have a List that can contain 50,000 items or more.

Callers of the library can invoke methods that result in strings being added to the list. How do I efficiently check for uniqueness of the strings being added?

Currently, just before adding a string, I scan the entire list and compare each string to the to-be-added string. This starts showing scale problems above 10,000 items.

I will benchmark this, but I'm interested in insight.

  • if I replace the List<> with a Dictionary<>, will ContainsKey() be appreciably faster as the list grows to 10,000 items and beyond?
  • if I defer the uniqueness check until after all items have been added, will it be faster? At that point I would need to check every element against every other element, still an O(n^2) operation. (A sketch of the current scan and the hash-based alternative follows below.)
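
For concreteness, here is a rough sketch of the current linear scan next to a hash-based check; the class and method names are illustrative, not the actual library code:

```csharp
using System.Collections.Generic;

public class UniqueStringCollector
{
    private readonly List<string> _items = new List<string>();
    private readonly HashSet<string> _seen = new HashSet<string>();

    // Current approach: every Add scans the whole list, so n adds cost O(n^2).
    public bool AddWithListScan(string candidate)
    {
        foreach (var existing in _items)
            if (existing == candidate)
                return false;          // duplicate, do not add
        _items.Add(candidate);
        return true;
    }

    // Hash-based approach: HashSet.Add (or Dictionary.ContainsKey) answers
    // "have I seen this before?" in roughly constant time per item.
    public bool AddWithHashSet(string candidate)
    {
        if (!_seen.Add(candidate))     // returns false if already present
            return false;
        _items.Add(candidate);
        return true;
    }
}
```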

EDIT

Some basic benchmark results. I created an abstract class that exposes 2 methods: Fill and Scan. Fill just fills the collection with n items (I used 50,000). Scan scans the list m times (I used 5000) to see if a given value is present. Then I built an implementation of that class for List, and another for HashSet.

The strings used were uniformly 11 characters in length, and randomly generated via a method in the abstract class.
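
Roughly, the harness looked like the sketch below; the class and method names are my own reconstruction from the description above, not the exact test code:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public abstract class CollectionTester
{
    private static readonly Random Rng = new Random();

    // 11-character random strings, as in the test described above.
    protected static string RandomString()
    {
        const string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
        var chars = new char[11];
        for (int i = 0; i < chars.Length; i++)
            chars[i] = alphabet[Rng.Next(alphabet.Length)];
        return new string(chars);
    }

    public abstract void Fill(int count);   // add n random strings
    public abstract void Scan(int probes);  // probe the collection m times

    public void Run(int n, int m)
    {
        Console.WriteLine("Hello from " + GetType().FullName);
        Console.WriteLine("filling {0} items...", n);
        Console.WriteLine("scanning {0} items...", m);

        var sw = Stopwatch.StartNew();
        Fill(n);
        var fillTime = sw.Elapsed;

        sw = Stopwatch.StartNew();
        Scan(m);
        var scanTime = sw.Elapsed;

        Console.WriteLine("Time to fill: " + fillTime);
        Console.WriteLine("Time to scan: " + scanTime);
    }
}

public class ListTester : CollectionTester
{
    private readonly List<string> _items = new List<string>();

    public override void Fill(int count)
    {
        for (int i = 0; i < count; i++) _items.Add(RandomString());
    }

    public override void Scan(int probes)
    {
        // each Contains is a linear scan of the whole list
        for (int i = 0; i < probes; i++) _items.Contains(RandomString());
    }
}

public class HashSetTester : CollectionTester
{
    private readonly HashSet<string> _items = new HashSet<string>();

    public override void Fill(int count)
    {
        for (int i = 0; i < count; i++) _items.Add(RandomString());
    }

    public override void Scan(int probes)
    {
        // each Contains is a hash lookup, roughly constant time
        for (int i = 0; i < probes; i++) _items.Contains(RandomString());
    }
}
```

Each tester is then run as, for example, new HashSetTester().Run(50000, 5000).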

A very basic micro-benchmark.

Hello from Cheeso.Tests.ListTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.4428266
Time to scan: 00:00:13.0291180

Hello from Cheeso.Tests.HashSetTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.3797751
Time to scan: 00:00:00.4364431

So, for strings of that length, HashSet is roughly 30x faster than List when scanning for uniqueness. Also, for this size of collection, HashSet has zero penalty over List when adding items to the collection.

The results are interesting but not strictly valid. To get valid results, I'd need to do warmup intervals and multiple trials, with random selection of the implementation. But I feel confident that doing so would move the bar only slightly.

Thanks everyone.

EDIT2

After adding randomization and multiple trials, HashSet consistently outperforms List in this case, by about 20x.

These results don't necessarily hold for strings of variable length, more complex objects, or different collection sizes.

+38  A: 

You should use the HashSet<T> class, which is specifically designed for what you're doing.

SLaks
Yep, the `Add()` method will return false if the element was already present in the collection.
Josh Stodola
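
A minimal sketch of that pattern (the wrapper class and method here are hypothetical):

```csharp
using System.Collections.Generic;

public static class HashSetExample
{
    public static void Demo()
    {
        var seen = new HashSet<string>();

        // Add returns true only when the item was not already present,
        // so the membership test and the insert happen in one call.
        bool isNew = seen.Add("hello");   // true: first occurrence
        bool again = seen.Add("hello");   // false: duplicate

        // Contains is the read-only check, also ~O(1) on average.
        bool present = seen.Contains("hello");
    }
}
```
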
+14  A: 

Use HashSet<string> instead of List<string>, then it should scale very well.

Pent Ploompuu
A: 

I have read that Dictionary<> is implemented as an associative array. In some languages (not necessarily anything related to .NET), string indexes are stored as a tree structure that forks at each node based upon the character at that node. Please see http://en.wikipedia.org/wiki/Associative%5Farrays.

A similar data structure was devised by Aho and Corasick in 1975 (I think). If you store 50,000 strings in such a structure, then it matters not how many strings you are storing; what matters more is the length of the strings. If they are all about the same length, then you will likely never see a slow-down in lookups, because the search algorithm is linear in run-time with respect to the length of the string you are searching for. Even for a red-black tree or AVL tree, the search run-time depends more upon the length of the string you are searching for than on the number of elements in the index. However, if you choose to implement your index keys with a hash function, you now incur the cost of hashing the string (going to be O(m), m = string length) and also the lookup of the string in the index, which will likely be on the order of O(log(n)), n = number of elements in the index.
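
To make the tree idea concrete, here is a toy trie sketch in C# where the cost of Add and Contains depends on the key's length rather than on how many keys are stored. This illustrates the structure described above, not how .NET's Dictionary is actually implemented:

```csharp
using System.Collections.Generic;

public class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsTerminal;
    }

    private readonly Node _root = new Node();

    // O(m), where m is the length of the word, regardless of how many
    // words the trie already holds.
    public bool Add(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
        }
        bool added = !node.IsTerminal;
        node.IsTerminal = true;
        return added;
    }

    // Also O(m): walk one node per character of the query string.
    public bool Contains(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false;
        }
        return node.IsTerminal;
    }
}
```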

edit: I'm not a .NET guru. Other more experienced people suggest another structure. I would take their word over mine.

edit2: your analysis is a little off for comparing uniqueness. If you use a hashing structure or dictionary, then it will not be an O(n^2) operation because of the reasoning I posted above. If you continue to use a list, then you are correct that it is O(n^2) * (max length of a string in your set) because you must examine each element in the list each time.

San Jacinto
FYI, in .NET a Dictionary is implemented as a hash table. It's not a tree structure. String length only matters for calculating the hashes.
Martinho Fernandes
... and it yields O(1) lookup time, btw.
Martinho Fernandes
@Martinho does this "hashing" mean a Miller-Rabin type of hash or the type of hash I see in other languages that use the Aho-Corasick style of storage? That is my question. Could you point me to some docs? Thanks for correcting me :)
San Jacinto
O(1) lookup is impossible with strings, intuition says. How is such a thing accomplished? Even if you are taking advantage of immutability for strings, you still must examine every character to know if it's equal to the one in permanent storage.
San Jacinto
San Jacinto: "Probably" (def: without much doubt) doesn't cut it in programming, where ALL doubt can be removed by experiment and/or RTFM (i.e. consultation of documentation). Personally I prefer "Definite" (def: precise; explicit and clearly defined) over "Probable". Also, O(1) is possible with any data type; see this for a DEFINITE definition of Big O notation: http://en.wikipedia.org/wiki/Big_o_notation
Binary Worrier
For information on hash calculations see Wikipedia: http://en.wikipedia.org/wiki/Hash_function. It's not what you think it is. As to the O(1) part, actually it's documented as "very close to O(1)", assuming a reasonably fast hash function. The string comparison cost is not very relevant in asymptotic analysis here: the size of the input is the size of the collection. Here's the MSDN page for Dictionary: http://msdn.microsoft.com/en-us/library/xfhwa508.aspx where it says how it's implemented.
Martinho Fernandes
@Binary I understand algorithmic complexity. Thus, my questions on how the structure is implemented. Even hashing strings is not considered O(1) unless you bound the length of the string as a constant. In some applications this makes sense. In others, it does not. I used the word "probably" as a means to say: "this is common. It doesn't mean this is how .NET does it." I'll edit for clarity. @Martinho thanks for your patience and the link. I read it on my own before you posted it, and it doesn't provide a description of how the hash is implemented.
San Jacinto
@Martinho oops, I was thinking of Rabin-Karp. Not sure why Miller-Rabin popped into my head.
San Jacinto
@San Jacinto: Yes, MSDN only mentions that it uses a hash table. It doesn't mention the hash algorithm because each type is responsible for its own implementation by overriding the GetHashCode method. Still, about the O(1): you can actually place an upper bound on the length of the string: 2^31-1 (int.MaxValue), because the Length property is an int.
Martinho Fernandes
@Martinho so what you're saying is that you wouldn't rely on that property to prove constant hash time for DNA to ASCII encoding ;)
San Jacinto
Eh? Even assuming they were stored in a tree, how would it depend (only) on the length of the string and not the depth of the tree?
Mark
@Mark It depends on the structure that you've used. If you fork at the letters like the Aho-Corasick method, the search is O(M), M = length of pattern. It is independent of the tree (unless the strings in the tree are always shorter than M). The hash time for 1 character is assumed to be constant.
San Jacinto
@San Jacinto: Oh... I had a different kind of tree in mind.
Mark
+3  A: 

From my tests, HashSet<string> takes no time compared to List<string> :)

mYsZa
Did you really need to test that? I sure hope it does, or computer science is built on some pretty shady theories. (That, or whoever wrote the .net library mucked up big time)
Mark
A: 

Does the Contains(T) function not work for you?

Bobby
It probably does, but it's slow with List<T>.
mYsZa
Yes, that's right. It's slow. But actually the situation is a little more complicated than I described. It's a List<T> where T contains a property that is a String, so it is not in fact a List<String>. Furthermore, the Contains() *that I want* has to test for value equivalence, not object equality. So I had been using an iteration to go through all the items in the list, comparing the string property on each item against the String property on the candidate instance to be added to the list. And that is very slow.
Cheeso
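
In that situation a hash-based check still works, either by keeping a parallel HashSet of the property values or by giving HashSet<T> a custom comparer. A sketch with made-up names (Item and Name are placeholders for the real T and its string property):

```csharp
using System;
using System.Collections.Generic;

// "Item" and "Name" stand in for the real T and its string property.
public class Item
{
    public string Name { get; set; }
    // ... other members ...
}

// Option 1: keep a parallel HashSet of the property values.
public class ItemCollection
{
    private readonly List<Item> _items = new List<Item>();
    private readonly HashSet<string> _names = new HashSet<string>();

    public bool Add(Item candidate)
    {
        if (!_names.Add(candidate.Name))
            return false;              // an item with this Name already exists
        _items.Add(candidate);
        return true;
    }
}

// Option 2: define equality by the property and hand the comparer to HashSet<Item>.
public class ItemNameComparer : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return string.Equals(x.Name, y.Name, StringComparison.Ordinal);
    }

    public int GetHashCode(Item obj)
    {
        return obj.Name == null ? 0 : obj.Name.GetHashCode();
    }
}

// usage: var uniqueItems = new HashSet<Item>(new ItemNameComparer());
```
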
+1  A: 

Possibly off-topic, but if you want to scale very large unique sets of strings (millions+) in a language-independent way, you might check out Bloom Filters.

Rich Apodaca
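
For illustration only, here is a toy Bloom filter along those lines; the bit-array size, hash count, and the second hash function are arbitrary choices for the sketch, not a tuned implementation:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public class StringBloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public StringBloomFilter(int bitCount, int hashCount)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int index in Indexes(item))
            _bits[index] = true;
    }

    // false means "definitely not present"; true means "probably present"
    // (false positives are possible, false negatives are not).
    public bool MightContain(string item)
    {
        foreach (int index in Indexes(item))
            if (!_bits[index])
                return false;
        return true;
    }

    // Double hashing: derive k bit positions from two base hashes.
    private IEnumerable<int> Indexes(string item)
    {
        int h1 = item.GetHashCode();
        int h2 = SecondHash(item);
        for (int i = 0; i < _hashCount; i++)
        {
            int combined = unchecked(h1 + i * h2);
            yield return (combined & 0x7FFFFFFF) % _bits.Length;
        }
    }

    private static int SecondHash(string s)
    {
        // Simple FNV-1a-style hash so the two base hashes differ.
        uint hash = 2166136261;
        foreach (char c in s)
            hash = unchecked((hash ^ c) * 16777619);
        return unchecked((int)hash);
    }
}
```

A Bloom filter can only answer "definitely not present" or "probably present", so it serves as a space-efficient pre-filter in front of an exact index rather than a replacement for one.
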
I discussed Bloom filters in a paper I wrote about examining voluminous amounts of data during the forensic process. They seem to be good pre-processors to weed out obvious non-matches. If your scenario warrants it, you can build an index based off of the false positives only and check against that from then on. But why not just index everything from the beginning, skip that step, and throw out the Bloom filter?
San Jacinto
Probably space is the primary reason for implementing a bloom filter, no?
San Jacinto