views: 644
answers: 6

In some library code, I have a List that can contain 50,000 items or more.

Callers of the library can invoke methods that result in strings being added to the list. How do I efficiently check for uniqueness of the strings being added?

Currently, just before adding a string, I scan the entire list and compare each string to the to-be-added string. This starts showing scale problems above 10,000 items.

I will benchmark this, but I'm interested in insight.

  • if I replace the List<> with a Dictionary<>, will ContainsKey() be appreciably faster as the list grows to 10,000 items and beyond?
  • if I defer the uniqueness check until after all items have been added, will it be faster? At that point I would need to check every element against every other element, still an O(n^2) operation. (A sketch of the current scan and the hash-based alternative follows below.)
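
For concreteness, here is a rough sketch of the current linear scan next to a hash-based check; the class and method names are illustrative, not the actual library code:

```csharp
using System.Collections.Generic;

public class UniqueStringCollector
{
    private readonly List<string> _items = new List<string>();
    private readonly HashSet<string> _seen = new HashSet<string>();

    // Current approach: every Add scans the whole list, so n adds cost O(n^2).
    public bool AddWithListScan(string candidate)
    {
        foreach (var existing in _items)
            if (existing == candidate)
                return false;          // duplicate, do not add
        _items.Add(candidate);
        return true;
    }

    // Hash-based approach: HashSet.Add (or Dictionary.ContainsKey) answers
    // "have I seen this before?" in roughly constant time per item.
    public bool AddWithHashSet(string candidate)
    {
        if (!_seen.Add(candidate))     // returns false if already present
            return false;
        _items.Add(candidate);
        return true;
    }
}
```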

EDIT

Some basic benchmark results. I created an abstract class that exposes 2 methods: Fill and Scan. Fill just fills the collection with n items (I used 50,000). Scan scans the list m times (I used 5000) to see if a given value is present. Then I built an implementation of that class for List, and another for HashSet.

The strings used were uniformly 11 characters in length, and randomly generated via a method in the abstract class.
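
Roughly, the harness looked like the sketch below; the class and method names are my own reconstruction from the description above, not the exact test code:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public abstract class CollectionTester
{
    private static readonly Random Rng = new Random();

    // 11-character random strings, as in the test described above.
    protected static string RandomString()
    {
        const string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
        var chars = new char[11];
        for (int i = 0; i < chars.Length; i++)
            chars[i] = alphabet[Rng.Next(alphabet.Length)];
        return new string(chars);
    }

    public abstract void Fill(int count);   // add n random strings
    public abstract void Scan(int probes);  // probe the collection m times

    public void Run(int n, int m)
    {
        Console.WriteLine("Hello from " + GetType().FullName);
        Console.WriteLine("filling {0} items...", n);
        Console.WriteLine("scanning {0} items...", m);

        var sw = Stopwatch.StartNew();
        Fill(n);
        var fillTime = sw.Elapsed;

        sw = Stopwatch.StartNew();
        Scan(m);
        var scanTime = sw.Elapsed;

        Console.WriteLine("Time to fill: " + fillTime);
        Console.WriteLine("Time to scan: " + scanTime);
    }
}

public class ListTester : CollectionTester
{
    private readonly List<string> _items = new List<string>();

    public override void Fill(int count)
    {
        for (int i = 0; i < count; i++) _items.Add(RandomString());
    }

    public override void Scan(int probes)
    {
        // each Contains is a linear scan of the whole list
        for (int i = 0; i < probes; i++) _items.Contains(RandomString());
    }
}

public class HashSetTester : CollectionTester
{
    private readonly HashSet<string> _items = new HashSet<string>();

    public override void Fill(int count)
    {
        for (int i = 0; i < count; i++) _items.Add(RandomString());
    }

    public override void Scan(int probes)
    {
        // each Contains is a hash lookup, roughly constant time
        for (int i = 0; i < probes; i++) _items.Contains(RandomString());
    }
}
```

Each tester is then run as, for example, new HashSetTester().Run(50000, 5000).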

A very basic micro-benchmark.

Hello from Cheeso.Tests.ListTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.4428266
Time to scan: 00:00:13.0291180

Hello from Cheeso.Tests.HashSetTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.3797751
Time to scan: 00:00:00.4364431

So, for strings of that length, HashSet is roughly 30x faster than List when scanning for uniqueness. Also, for this size of collection, HashSet has zero penalty over List when adding items to the collection.

The results are interesting but not strictly valid. To get valid results, I'd need to do warmup intervals and multiple trials, with random selection of the implementation. But I feel confident that doing so would move the bar only slightly.

Thanks everyone.

EDIT2

After adding randomization and multiple trials, HashSet consistently outperforms List in this case, by about 20x.

These results don't necessarily hold for strings of variable length, more complex objects, or different collection sizes.

+38  A: 

You should use the HashSet<T> class, which is specifically designed for what you're doing.

SLaks
Yep, the `Add()` method will return false if the element was already present in the collection.
Josh Stodola
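
A minimal sketch of that pattern (the wrapper class and method here are hypothetical):

```csharp
using System.Collections.Generic;

public static class HashSetExample
{
    public static void Demo()
    {
        var seen = new HashSet<string>();

        // Add returns true only when the item was not already present,
        // so the membership test and the insert happen in one call.
        bool isNew = seen.Add("hello");   // true: first occurrence
        bool again = seen.Add("hello");   // false: duplicate

        // Contains is the read-only check, also ~O(1) on average.
        bool present = seen.Contains("hello");
    }
}
```
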
+14  A: 

Use HashSet<string> instead of List<string>, then it should scale very well.

Pent Ploompuu
A: 

I have read that Dictionary<> is implemented as an associative array. In some languages (not necessarily anything related to .NET), string indexes are stored as a tree structure that forks at each node based upon the character at that node. Please see http://en.wikipedia.org/wiki/Associative%5Farrays.

A similar data structure was devised by Aho and Corasick in 1975 (I think). If you store 50,000 strings in such a structure, then it matters not how many strings you are storing; what matters more is the length of the strings. If they are all about the same length, then you will likely never see a slow-down in lookups, because the search algorithm is linear in run-time with respect to the length of the string you are searching for. Even for a red-black tree or AVL tree, the search run-time depends more upon the length of the string you are searching for than on the number of elements in the index. However, if you choose to implement your index keys with a hash function, you now incur the cost of hashing the string (going to be O(m), m = string length) and also the lookup of the string in the index, which will likely be on the order of O(log(n)), n = number of elements in the index.
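
To make the tree idea concrete, here is a toy trie sketch in C# where the cost of Add and Contains depends on the key's length rather than on how many keys are stored. This illustrates the structure described above, not how .NET's Dictionary is actually implemented:

```csharp
using System.Collections.Generic;

public class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsTerminal;
    }

    private readonly Node _root = new Node();

    // O(m), where m is the length of the word, regardless of how many
    // words the trie already holds.
    public bool Add(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
        }
        bool added = !node.IsTerminal;
        node.IsTerminal = true;
        return added;
    }

    // Also O(m): walk one node per character of the query string.
    public bool Contains(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false;
        }
        return node.IsTerminal;
    }
}
```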

edit: I'm not a .NET guru. Other more experienced people suggest another structure. I would take their word over mine.

edit2: your analysis is a little off for comparing uniqueness. If you use a hashing structure or dictionary, then it will not be an O(n^2) operation because of the reasoning I posted above. If you continue to use a list, then you are correct that it is O(n^2) * (max length of a string in your set) because you must examine each element in the list each time.

San Jacinto
FYI, in .NET a Dictionary is implemented as a hash table. It's not a tree structure. String length only matters for calculating the hashes.
Martinho Fernandes
... and it yields O(1) lookup time, btw.
Martinho Fernandes
@Martinho does this "hashing" mean a Miller-Rabin type of hash or the type of hash I see in other languages that use the Aho-Corasick style of storage? That is my question. Could you point me to some docs? Thanks for correcting me :)
San Jacinto
O(1) lookup is impossible with strings, intuition says. How is such a thing accomplished? Even if you are taking advantage of immutability for strings, you still must examine every character to know if it's equal to the one in permanent storage.
San Jacinto
San Jacinto: "Probably" (def: without much doubt) doesn't cut it in programming, where ALL doubt can be removed by experiment and/or RTFM (i.e. consultation of documentation). Personally I prefer "Definite" (def: precise; explicit and clearly defined) over "Probable". Also, O(1) is possible with any data type; see this for a DEFINITE definition of Big O notation: http://en.wikipedia.org/wiki/Big_o_notation
Binary Worrier
For information on hash calculations see Wikipedia: http://en.wikipedia.org/wiki/Hash_function. It's not what you think it is. As to the O(1) part, actually it's documented as "very close to O(1)", assuming a reasonably fast hash function. The string comparison cost is not very relevant in asymptotic analysis here: the size of the input is the size of the collection. Here's the MSDN page for Dictionary: http://msdn.microsoft.com/en-us/library/xfhwa508.aspx where it says how it's implemented.
Martinho Fernandes
@Binary I understand algorithmic complexity. Thus, my questions on how the structure is implemented. Even hashing strings is not considered O(1) unless you bound the length of the string as a constant. In some applications this makes sense. In others, it does not. I used the word "probably" as a means to say: "this is common. It doesn't mean this is how .NET does it." I'll edit for clarity. @Martinho thanks for your patience and the link. I read it on my own before you posted it, and it doesn't provide a description of how the hash is implemented.
San Jacinto
@Martinho oops, I was thinking of Rabin-Karp. Not sure why Miller-Rabin popped into my head.
San Jacinto
@San Jacinto: Yes, MSDN only mentions that it uses a hash table. It doesn't mention the hash algorithm because each type is responsible for its own implementation by overriding the GetHashCode method. Still, about the O(1): you can actually place an upper bound on the length of the string: 2^31-1 (int.MaxValue), because the Length property is an int.
Martinho Fernandes
@Martinho so what you're saying is that you wouldn't rely on that property to prove constant hash time for DNA to ASCII encoding ;)
San Jacinto
Eh? Even assuming they were stored in a tree, how would it depend (only) on the length of the string and not the depth of the tree?
Mark
@Mark It depends on the structure that you've used. If you fork at the letters like the Aho-Corasick method, the search is O(M), M = length of pattern. It is independent of the tree (unless the strings in the tree are always shorter than M). The hash time for 1 character is assumed to be constant.
San Jacinto
@San Jacinto: Oh... I had a different kind of tree in mind.
Mark
+3  A: 

From my tests, HashSet<string> takes no time compared to List<string> :)

mYsZa
Did you really need to test that? I sure hope it does, or computer science is built on some pretty shady theories. (That, or whoever wrote the .net library mucked up big time)
Mark
A: 

Does the Contains(T) function not work for you?

Bobby
It probably does, but it's slow with List<T>.
mYsZa
Yes, that's right. It's slow. But actually the situation is a little more complicated than I described. It's a List<T> where T contains a property that is a String, so it is not in fact a List<String>. Furthermore, the Contains() *that I want* has to test for value equivalence, not object equality. So I had been using an iteration to go through all the items in the list, comparing the string property on each item against the String property on the candidate instance to be added to the list. And that is very slow.
Cheeso
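
In that situation a hash-based check still works, either by keeping a parallel HashSet of the property values or by giving HashSet<T> a custom comparer. A sketch with made-up names (Item and Name are placeholders for the real T and its string property):

```csharp
using System;
using System.Collections.Generic;

// "Item" and "Name" stand in for the real T and its string property.
public class Item
{
    public string Name { get; set; }
    // ... other members ...
}

// Option 1: keep a parallel HashSet of the property values.
public class ItemCollection
{
    private readonly List<Item> _items = new List<Item>();
    private readonly HashSet<string> _names = new HashSet<string>();

    public bool Add(Item candidate)
    {
        if (!_names.Add(candidate.Name))
            return false;              // an item with this Name already exists
        _items.Add(candidate);
        return true;
    }
}

// Option 2: define equality by the property and hand the comparer to HashSet<Item>.
public class ItemNameComparer : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return string.Equals(x.Name, y.Name, StringComparison.Ordinal);
    }

    public int GetHashCode(Item obj)
    {
        return obj.Name == null ? 0 : obj.Name.GetHashCode();
    }
}

// usage: var uniqueItems = new HashSet<Item>(new ItemNameComparer());
```
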
+1  A: 

Possibly off-topic, but if you want to scale very large unique sets of strings (millions+) in a language-independent way, you might check out Bloom Filters.

Rich Apodaca
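
For illustration only, here is a toy Bloom filter along those lines; the bit-array size, hash count, and the second hash function are arbitrary choices for the sketch, not a tuned implementation:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public class StringBloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public StringBloomFilter(int bitCount, int hashCount)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int index in Indexes(item))
            _bits[index] = true;
    }

    // false means "definitely not present"; true means "probably present"
    // (false positives are possible, false negatives are not).
    public bool MightContain(string item)
    {
        foreach (int index in Indexes(item))
            if (!_bits[index])
                return false;
        return true;
    }

    // Double hashing: derive k bit positions from two base hashes.
    private IEnumerable<int> Indexes(string item)
    {
        int h1 = item.GetHashCode();
        int h2 = SecondHash(item);
        for (int i = 0; i < _hashCount; i++)
        {
            int combined = unchecked(h1 + i * h2);
            yield return (combined & 0x7FFFFFFF) % _bits.Length;
        }
    }

    private static int SecondHash(string s)
    {
        // Simple FNV-1a-style hash so the two base hashes differ.
        uint hash = 2166136261;
        foreach (char c in s)
            hash = unchecked((hash ^ c) * 16777619);
        return unchecked((int)hash);
    }
}
```

A Bloom filter can only answer "definitely not present" or "probably present", so it serves as a space-efficient pre-filter in front of an exact index rather than a replacement for one.
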
I discussed Bloom filters in a paper I wrote about examining voluminous amounts of data during the forensic process. They seem to be good pre-processors to weed out obvious non-matches. If your scenario warrants it, you can build an index based off of the false positives only and check against that from then on. But why not just index everything from the beginning, skip that step, and throw out the Bloom filter?
San Jacinto
Probably space is the primary reason for implementing a bloom filter, no?
San Jacinto