views:

959

answers:

13

My problem is not a usual one. Imagine a few billion strings, each usually less than 15 characters. In this list I need to find the number of unique elements.

First of all, what object should I use? Bear in mind that every time I add a new element I have to check whether it already exists in the list. That is not a problem at the beginning, but after a few million words it can really slow down the process.

That's why I thought a Hashtable would be ideal for this task, because checking for an element is ideally only O(1). Unfortunately, a single object in .NET can be only 2 GB.

The next step would be to implement a custom hashtable which contains a list of 2 GB hashtables.

I am wondering whether some of you know a better solution. (The computer has an extremely high specification.)

+4  A: 

If the items are strings, which are comparable... then I would suggest abandoning the idea of a Hashtable and going with something more like a Binary Search Tree. There are several implementations out there in C# (none that come built into the Framework). Be sure to get one that is balanced, like a Red Black Tree or an AVL Tree.

The advantage is that each object in the tree is relatively small (it only contains its object and links to its parent and two children), so you can have a whole slew of them.

Also, because it's sorted, retrieval and insertion are both O(log n).
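
For illustration, a minimal sketch of the tree approach, using SortedDictionary<string, bool> (which is tree-based) as a stand-in for a dedicated red-black or AVL tree; the class and method names here are illustrative:

using System;
using System.Collections.Generic;

static class TreeCounter
{
    // Insert every string; the tree's Count is then the number of unique strings.
    // Insertion and lookup are both O(log n) in the number of distinct strings.
    public static long CountUnique(IEnumerable<string> source)
    {
        var tree = new SortedDictionary<string, bool>(StringComparer.Ordinal);
        foreach (string s in source)
            tree[s] = true;   // assigning an existing key simply overwrites it
        return tree.Count;
    }
}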

Nick
+1  A: 

divide and conquer - partition the data by the first two letters (say)

dictionary of xx => dictionary of string => count
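
A rough sketch of that two-level split (names and structure are illustrative only):

using System.Collections.Generic;

class PartitionedCounter
{
    // Outer dictionary keyed on the first two characters; inner dictionaries
    // hold the full strings with their occurrence counts.
    private readonly Dictionary<string, Dictionary<string, int>> partitions =
        new Dictionary<string, Dictionary<string, int>>();

    public void Add(string s)
    {
        string prefix = s.Length >= 2 ? s.Substring(0, 2) : s;
        Dictionary<string, int> bucket;
        if (!partitions.TryGetValue(prefix, out bucket))
        {
            bucket = new Dictionary<string, int>();
            partitions[prefix] = bucket;
        }
        int count;
        bucket.TryGetValue(s, out count);   // count stays 0 if the string is new
        bucket[s] = count + 1;
    }

    // Number of unique strings = sum of the inner dictionary sizes.
    public long UniqueCount()
    {
        long unique = 0;
        foreach (var bucket in partitions.Values)
            unique += bucket.Count;
        return unique;
    }
}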

pm100
My inclination too, but I would make the first partition more effective. Don't key on two characters; key on, say, the first 16 bits of a hash of the string.
Loren Pechtel
To get a hash of the string, you need to scan the entire string; inspecting the first couple of characters may be faster (though it will probably be limited by bus speeds, and since cache is loaded a line at a time, it may not be).
Pete Kirkham
+22  A: 

I'd consider a Trie or a Directed Acyclic Word Graph (DAWG), which should be more space-efficient than a hash table. Testing for membership of a string would be O(len), where len is the length of the input string, which is probably the same cost as a string hashing function.
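
As a rough illustration, a minimal trie that counts distinct strings as they are added could look like this (names are illustrative; each node stores its children in a small dictionary):

using System.Collections.Generic;

sealed class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsEnd;
}

sealed class Trie
{
    private readonly TrieNode root = new TrieNode();
    public long DistinctCount { get; private set; }

    // O(len) per string: walk (or extend) one node per character.
    public bool Add(string s)
    {
        TrieNode node = root;
        foreach (char c in s)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new TrieNode();
                node.Children[c] = next;
            }
            node = next;
        }
        if (node.IsEnd) return false;   // already seen
        node.IsEnd = true;
        DistinctCount++;
        return true;
    }
}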

Lee
I don't have hard data, but I would assume that a trie will be faster than a database.
David Rodríguez - dribeas
Added benefit: Tries are really really easy to implement.
trinithis
Let's not mix up our Ns. Testing for membership in a DAWG would be O(n), but with n being the number of characters in the string, NOT the number of strings in the set. HUGE difference.
Lucas
I've used Directed Acyclic Word Graphs. Very effective.
Norman Ramsey
A trie with this many words might or might not fit into 2G, depending on the data, and the implementation of a trie node (even an index will take 4 bytes...).
comingstorm
Exactly what I said, as one of the first answers - why no love for Raja?
BlueRaja - Danny Pflughoeft
A: 

Have you tried a hash map (Dictionary in .NET)? Dictionary<String, byte> would only take up 5 bytes per entry on x86 (4 for the pointer to the string pool, 1 for the byte), which allows about 400M elements within the 2 GB limit. If there are many duplicates, they should be able to fit. Implementation-wise, it might be very, very slow (or not work), since you also need to store all those strings in memory.

If the strings are very similar, you could also write your own Trie implementation.

Otherwise, your best bet would be to sort the data in-place on disk (after which counting unique elements is trivial), or to use a lower-level, more memory-efficient language like C++.
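
A minimal sketch of the Dictionary<string, byte> idea above, assuming the strings are streamed in from some source and all fit in memory:

using System.Collections.Generic;

static class DictionaryCounter
{
    public static int CountUnique(IEnumerable<string> source)
    {
        var seen = new Dictionary<string, byte>();
        foreach (string s in source)
            seen[s] = 0;        // assigning an existing key just overwrites the value
        return seen.Count;      // number of distinct strings
    }
}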

BlueRaja - Danny Pflughoeft
Why would strings have to be "very similar" in order to use a Trie?
khedron
Because every node is 26 extra pointers you have to have memory for. The less similar the strings are, the more nodes you have.
BlueRaja - Danny Pflughoeft
+3  A: 

Since you specify that a single object cannot contain all of the strings, I would presume that you have the strings on disk or in some other external memory. In that case I would probably go with sorting. From a sorted list it is simple to extract the unique elements. Merge sort is popular for external sorting and needs only an amount of extra space equal to the input. Start by dividing the input into pieces that fit into memory, sort those, and then start merging.
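
As a sketch of the final step, assuming the input has already been split into memory-sized runs, sorted, and merged into one sorted file on disk (one string per line), counting the unique elements is then a single pass:

using System;
using System.IO;

static class SortedFileCounter
{
    // path points to the already externally sorted file, one string per line.
    public static long CountDistinct(string path)
    {
        long distinct = 0;
        string previous = null;
        foreach (string line in File.ReadLines(path))
        {
            // Count a string only when it differs from the one before it.
            if (previous == null || !string.Equals(line, previous, StringComparison.Ordinal))
                distinct++;
            previous = line;
        }
        return distinct;
    }
}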

jk
+1  A: 

I would use a database; any database would do.

It is probably the fastest option, because modern databases are optimized for speed and memory usage.

You need only one column with an index, and then you can count the number of records.
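
For example (a sketch only, SQL Server-flavoured; the table and column names are made up), after bulk-loading the strings into that single indexed column, the count is one query away from C#:

using System.Data.SqlClient;

static class DbCounter
{
    // Assumes a table dbo.Strings with a single column Value that the data has
    // already been loaded into (for example with SqlBulkCopy or bcp).
    public static long CountUnique(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT COUNT_BIG(DISTINCT Value) FROM dbo.Strings", connection))
        {
            connection.Open();
            return (long)command.ExecuteScalar();
        }
    }
}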

GvS
Why 2 columns? One would do.
Mehrdad Afshari
You are right, thx, edited.
GvS
I doubt that a general purpose database can outperform a specialized, optimized algorithm in this case. General purpose DB's balance a number of competing needs (insert speed, update speed, query speed, memory vs. CPU). A specialized algorithm can be tuned according to the OP's needs.
Eric J.
But which is fastest: dumping the data into a database, or selecting (or inventing) and tuning a specialized algorithm? If you can keep everything in memory and don't hit the internal limits of Array, List<> or Dictionary<>, the implementation effort will be about the same, and the code will probably perform faster. If you reach those limits, however....
GvS
+26  A: 

I would skip the data structures exercise and just use an SQL database. Why write another custom data structure that you have to analyze and debug? Just use a database. They are really good at answering queries like this.

D.Shawley
+1 for the simple answer. Databases are partly designed for this kind of task. No sense reinventing the wheel.
Jamie Keeling
That really depends on his application constraints, and makes an assumption that might well not be valid.
Eric J.
This is a programming question, not a query question. (Yeah, queries are programs, but let's eschew that.) Plus the OP labeled the question as C#.
trinithis
This is indeed easiest to code, but the OP asked for fastest. A customised algorithm like a DAWG (see Lee's answer) will beat any database for the OP's specific needs in both space and performance.
SPWorley
A database engine like SQL Server is optimized for large amounts of data. Any in-memory algorithm runs the risk of taking too much time and causing too much paging, and/or thread contention. I do not think you should rule out a database as possibly being fastest in this case.
John Saunders
This is a very bad idea, and will take an eternity compared to a single iteration over the data using a trie to keep track of strings you've seen. My only regret is that I can only downvote this once.
Terry Mahaffey
(1) databases are optimized for set-based functions - existence, intersection, counting, etc. (2) C# has database access the last time that I checked (3) if the data set is larger than available/efficient memory size, then a custom data structure becomes *very* difficult - think about how you would page out portions of a trie to disk and still make it efficient (4) don't rule out the cost of loading the data if you need to do it more than once (5) try writing a trie that multiple threads can traverse and allow for modification
D.Shawley
@Terry: I don't see a requirement for iterating over the data set. Querying for existence does not necessarily imply iteration. Based on the fact that the OP is talking about data sets that exceed 2GB, any in-memory data structure is likely to cause enough VM cache misses and swapping to slow it to a crawl. The only effective option for non-memory solutions is a database IMHO.
D.Shawley
+1 Most practical answer. I swear nobody understands the word "fast". If you spend 6 hours creating/testing your own custom hash table, you are slower than molasses. Writing something to populate a database and then writing something to query it would take me roughly ten minutes (plus transfer times). Computer time is cheap. Programmer time is **expensive**. Think about it.
Josh Stodola
@Josh: most programmers do not like to hear the answer _well don't write a program to solve the problem, use something that exists_
D.Shawley
@D.Shawley I disagree. I think most programmers are lazy, and therefore that phrase is music to the ears. Re-inventing the wheel and optimizing prematurely are both huge wastes of time. The sooner a programmer understands that and utilizes tools that the previous generations have perfected, the more productive they shall be.
Josh Stodola
+1  A: 

A Dictionary<> is internally organized as a list of lists. You won't get close to the (2GB/8)^2 limit on a 64-bit machine.

Hans Passant
Is there a difference in the maximum size of a CLR object on 32-bit and 64-bit operating systems?
0xA3
No, it is still 2GB.
Hans Passant
Ok, so the limit is the maximum memory size of a process?
0xA3
No, a 64-bit process has terabytes, depending on the size of the paging file. The 2GB restriction is an x64 instruction set issue, combined with an Int32 array indexing limit.
Hans Passant
+2  A: 

With a few billion strings, if even a few percent are unique, the chances of a hash collision are pretty high (.NET hash codes are 32-bit ints, yielding roughly 4 billion unique hash values; if you have as few as 100 million unique strings, the risk of hash collision may be unacceptably high). Statistics isn't my strongest point, but some Google research turns up that, for a perfectly distributed 32-bit hash, the probability that a new item collides with one of the N - 1 items already hashed is (N - 1) / 2^32; by the birthday problem, the chance of at least one collision across the whole set is far higher.

You run a MUCH lower probability of a hash collision using an algorithm that uses significantly more bits, such as SHA-1.

Assuming an adequate hash algorithm, one simple approach close to what you have already tried would be to create an array of hash tables. Divide possible hash values into enough numeric ranges so that any given block will not exceed the 2GB limit per object. Select the correct hash table based on the value of the hash, then search in that hash table. For example, you might create 256 hash tables and use (HashValue)%256 to get a hash table number from 0..255. Use that same algorithm when assigning a string to a bucket, and when checking/retrieving it.
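
A rough sketch of that bucketing scheme (illustrative names; this variant stores the strings themselves in HashSet<string>s, so hash collisions cannot cause miscounts, and the hash is only used to pick a bucket):

using System.Collections.Generic;

static class BucketedCounter
{
    public static long CountUnique(IEnumerable<string> source)
    {
        // 256 sets, selected by hash range, so no single set has to hold everything.
        var buckets = new HashSet<string>[256];
        for (int i = 0; i < buckets.Length; i++)
            buckets[i] = new HashSet<string>();

        foreach (string s in source)
        {
            int bucket = (s.GetHashCode() & 0x7FFFFFFF) % 256;  // same rule for add and check
            buckets[bucket].Add(s);    // Add() is a no-op if the string is already present
        }

        long unique = 0;
        foreach (var set in buckets)
            unique += set.Count;
        return unique;
    }
}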

Eric J.
A: 

I agree with the other posters regarding a database solution, but further to that, a reasonably-intelligent use of triggers, and a potentially-cute indexing scheme (i.e. a numerical representation of the strings) would be the fastest approach, IMHO.

Noon Silk
+7  A: 

This can be solved in worst-case O(n) time using radix sort with counting sort as a stable sort for each character position. This is theoretically better than using a hash table (O(n) expected but not guaranteed) or mergesort (O(n log n)). Using a trie would also result in a worst-case O(n)-time solution (constant-time lookup over n keys, since all strings have a bounded length that's a small constant), so this is comparable. I'm not sure how they compare in practice. Radix sort is also fairly easy to implement and there are plenty of existing implementations.

If all strings are d characters or shorter, and the number of distinct characters is k, then radix sort takes O(d (n + k)) time to sort n keys. After sorting, you can traverse the sorted list in O(n) time and increment a counter every time you get to a new string. This would be the number of distinct strings. Since d is ~15 and k is relatively small compared to n (a billion), the running time is not too bad.

This uses O(dn) space though (to hold each string), so it's less space-efficient than tries.
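
A rough sketch of that approach (illustrative code): LSD radix sort with a stable counting sort per character position, followed by one pass over the sorted array. It assumes no string is longer than maxLength:

using System;

static class RadixCounter
{
    public static long CountDistinct(string[] keys, int maxLength)
    {
        // One "digit" per UTF-16 code unit, plus one slot for "past the end of the string".
        const int Alphabet = char.MaxValue + 2;
        string[] src = keys;
        string[] dst = new string[keys.Length];

        // LSD radix sort: stable counting sort on each position, from last to first.
        for (int pos = maxLength - 1; pos >= 0; pos--)
        {
            int[] count = new int[Alphabet + 1];
            foreach (string s in src)
                count[Digit(s, pos) + 1]++;
            for (int i = 1; i < count.Length; i++)
                count[i] += count[i - 1];
            foreach (string s in src)
                dst[count[Digit(s, pos)]++] = s;
            string[] tmp = src; src = dst; dst = tmp;
        }

        // One pass over the sorted array: count the boundaries between runs of equal strings.
        long distinct = src.Length > 0 ? 1 : 0;
        for (int i = 1; i < src.Length; i++)
            if (!string.Equals(src[i], src[i - 1], StringComparison.Ordinal))
                distinct++;
        return distinct;
    }

    private static int Digit(string s, int pos)
    {
        return pos < s.Length ? s[pos] + 1 : 0;   // shorter strings sort before longer ones
    }
}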

KirarinSnow
Better than suggesting a database, but still - sorting the data is overkill, and not required by the problem space. Any solution which does so is not optimal. trie trees, however, are designed to solve almost this exact problem.
Terry Mahaffey
@Terry Mahaffey: Comparison sort (mergesort for example) is not optimal. The constraints of the problem, however, allow for radix sort, which is optimal (asymptotically). The tokens being sorted are strings of bounded length and there is a bounded number of possible characters in each position. I do agree that tries are better (for space reasons), but not because radix sort is not optimal.
KirarinSnow
A: 

Andras, you are on the right track; just make use of more than one hash table. I don't know .NET well, so treat the code below as a sketch:

using System.Collections;

class UniqueCounter
{
    private readonly Hashtable[] table = new Hashtable[256];
    private int counter = 0;

    public UniqueCounter()
    {
        for (int i = 0; i < table.Length; i++)
            table[i] = new Hashtable();
    }

    // A simple string hash, used only to pick one of the 256 tables.
    private int GetTableIndex(string str)
    {
        uint index = 0;
        for (int i = 0; i < str.Length; i++)
            index = unchecked(index + 1103515245u * str[i] + 12345u);
        return (int)(index & 255); // bitwise &
    }

    public int GetUnique() { return counter; }

    // Returns the number of unique strings seen so far.
    public int AddString(string str)
    {
        int tableIndex = GetTableIndex(str);
        if (table[tableIndex].ContainsKey(str)) return counter;
        table[tableIndex].Add(str, null);
        return ++counter;
    }
}

The GetTableIndex function is just a string hash function; I would recommend using one of the ones from here rather than the one I have in the code. The time complexity of adding the strings will be linear in the number of strings, since adding a string takes constant time on average.

Jeff Kubina
A: 

+1 for the SQL/DB solutions; they keep things simple and allow you to focus on the real task at hand.

But just for academic purposes, I would like to add my 2 cents.

-1 for hashtables (I cannot vote down yet). Because they are implemented using buckets, the storage cost can be huge in many practical implementations. Plus, I agree with Eric J. that the chances of collisions will undermine the time-efficiency advantages.

Lee, the construction of a trie or DAWG will take up space as well as some extra time (initialization latency). If that is not an issue (which will be the case if you may also need to perform search-like operations on the set of strings in the future and you have ample memory available), tries can be a good choice.

Space will be the problem with Radix sort or similar implementations (as mentioned by KirarinSnow) because the dataset is huge.

Below is my solution for one-time duplicate counting with limits on how much space can be used.

If we have enough storage available to hold 1 billion elements in memory, we can sort them in place with heap-sort in Θ(n log n) time, then traverse the collection once in O(n) time, doing this:

for (int i = 0; i < n - 1; i++)
    if (a[i] == a[i + 1])
        dupCount++;
// number of distinct elements = n - dupCount

If we do not have that much memory available, we can divide the input file on disk into smaller files (until each is small enough to hold in memory); then sort each such small file using the above technique; then merge them together. This requires many passes over the main input file.

I would like to keep away from quick-sort because the dataset is huge. If I could squeeze in some memory in the second case, I would rather use it to reduce the number of passes than spend it on merge-sort/quick-sort (actually, it depends heavily on the type of input we have at hand).

Edit: SQL/DB solutions are good only when you need to store this data for a long duration.

Ak