views:

413

answers:

2

I have to store lot of strings in c++ map to keep unique strings and when ever duplicate string occurs I just need to increment the counter (pair.second). I've used c++ map and it well fits to this situation. Since the file that processing is gone now upto 30gig I am trying to keep this in a file instead of memory.

I also came across trie which is faster than map in this case. Any one aware of file backed trie implementation? I came across a Trie implementation similar to what I am looking for but not seems to be bug free ..

+1  A: 

How are you going to load 30GB into memory all at once? And since it is a dictionary-based behavior you want, I'd imagine everytime you insert, or increment, you'll need to load the whole file (even if piece by piece) for lookup.

I suggest using a database. That is what they're for...

navigator
+1  A: 

If you can sort your file containing the strings, then reading the sorted list and counting duplicates would be easy. (You can retain the original file and create a new file of sorted strings.) Sorting large files efficiently is old technology. You should be able to find a utility for that.

If you can't sort, then consider digesting the strings. MD5 may be overkill for your purpose. You can cobble something up. For billions of strings, you could use 8 byte digests. Use a tree (probably a BST) of digests. For each digest, store the file offsets of the unique strings that produce that digest.

When you read a string, compute it's digest, and look it up. If you don't find the digest, you know the string is unique. Store it in the tree. If you do find the digest, check each associated string for a match and handle accordingly.

To compare strings, you will need to go to the file, since all you've stored is the file offsets.

What's important to remember it that if two digests are different, the strings that produced them must be different. If the digests are the same, the strings may not be the same, so you need to check. This algorithm will be more efficient when there are fewer duplicate strings.

Rob deFriesse