I have a data set that looks like this:

000 100 200 300 010 020 030 001 002 003     
001 101 201 301 011 021 031 000 002 003    
002 102 202 302 012 022 032 001 000 003    
003 103 203 303 013 023 033 001 002 000    
010 110 210 310 000 020 030 011 012 013     
020 120 220 320 010 000 030 021 022 023     
030 130 230 330 010 020 000 031 032 033     
033 133 233 333 013 023 003 031 032 030     
100 000 200 300 110 120 130 101 102 103     
133 033 233 333 113 123 103 131 132 130     
200 100 000 300 210 220 230 201 202 203     
233 133 033 333 213 223 203 231 232 230     
300 100 200 000 310 320 330 301 302 303     
303 103 203 003 313 323 333 301 302 300     
313 113 213 013 303 323 333 311 312 310     
330 130 230 030 310 320 300 331 332 333    
331 131 231 031 311 321 301 330 332 333    
332 132 232 032 312 322 302 331 330 333    
333 133 233 033 313 323 303 331 332 330

What I intend to do is to generate a list of the unique strings in it, yielding:

000
001
002
003                      
010
011
012
013
020
021
022
023
030
031
032
033
100
101
102
103
110
113
120
123
130
131
132
133
200
201
202
203
210
213
220
223
230
231
232
233
300
301
302
303
310
311
312
313
320
321
322                      
323
330
331      
332      
333

The code I use to generate that is shown below, but it is very memory-hungry, because in reality the strings are of length >36, the file has more than 35 million lines, and each line has more than 36*3 columns/entries. Is there a memory-efficient way to do this?

#include <cstdlib>   // EXIT_FAILURE
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
using namespace std;

int main(int arg_count, char *arg_vec[]) {
    if (arg_count != 2) {
        cerr << "expected one argument" << endl;
        return EXIT_FAILURE;
    }

    string line;
    ifstream myfile(arg_vec[1]);

    map<string, int> Tags;                 // key = tag; the int value is unused

    if (myfile.is_open()) {
        while (getline(myfile, line)) {
            stringstream ss(line);
            string Elem;
            while (ss >> Elem) {
                Tags[Elem] = 1;            // keeps one copy of each distinct tag
            }
        }
        myfile.close();
    }
    else {
        cerr << "Unable to open file" << endl;
    }

    // std::map iterates in sorted key order, so this prints a sorted unique list
    for (map<string, int>::iterator iter = Tags.begin(); iter != Tags.end(); ++iter) {
        cout << iter->first << endl;
    }

    return 0;
}
+4  A: 

O(1) memory [RAM]:

If you want to use no memory at all (besides a couple of temp variables), you could simply read one entry at a time and add it to the output file if it doesn't already exist there. This would be slower, though, since you'd have to read through the output file one entry at a time for every input entry.

You could instead keep the output file in alphabetical order; then you can check whether an entry already exists in O(log n) time via binary search, per entry being inserted. To actually insert, you need to re-create the file, though, which is O(n log n). You do this once per input string, n times in total, so overall the algorithm would run in O(n^2 log n) (which includes the lookup to find the insertion position plus the insertion itself) and use O(1) RAM.

Since your output file is kept in alphabetical order, future lookups would also be only O(log n) via binary search.

You could also minimize the re-creation phase of the file by leaving extra space between entries in the file. When the algorithm is done you could do a vacuum on the file. This would bring it down to O(n log n).
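
As an illustration, a rough sketch of the lookup half of this scheme, assuming the output file holds sorted fixed-width records (one tag plus a newline each); the rewrite-on-insert step is omitted:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Binary search over a sorted file of fixed-width records
// (each record is one tag of `width` characters plus '\n').
// Uses O(1) RAM and O(log n) seeks per lookup.
bool contains(std::ifstream& out, const std::string& key, std::size_t width)
{
    out.seekg(0, std::ios::end);
    const std::streamoff rec = static_cast<std::streamoff>(width) + 1;
    std::streamoff lo = 0;
    std::streamoff hi = static_cast<std::streamoff>(out.tellg()) / rec;  // search [lo, hi)
    std::vector<char> buf(width);

    while (lo < hi) {
        std::streamoff mid = lo + (hi - lo) / 2;
        out.seekg(mid * rec);
        out.read(&buf[0], width);
        std::string cur(buf.begin(), buf.end());
        if (cur == key)  return true;
        if (cur < key)   lo = mid + 1;
        else             hi = mid;
    }
    return false;
}

int main(int argc, char* argv[])
{
    if (argc != 3) return 1;                 // usage: lookup <sorted file> <tag>
    std::ifstream out(argv[1], std::ios::binary);
    std::string key(argv[2]);
    std::cout << (contains(out, key, key.size()) ? "present" : "absent") << std::endl;
    return 0;
}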


Another way to reduce memory:

If it's common for your strings to share prefixes, then you can use a trie and probably use less memory overall, since you mentioned your strings are of length > 36. This would still use a lot of memory, though.

Brian R. Bondy
Inserting in alphabetical order requires sorting properly first (O(n) storage, on disk or in RAM) - and then you need only compare against the prior entry each time.
bdonlan
Ya, O(1) RAM usage but O(n) disk usage. I'm assuming that you insert in alphabetical order into the file each time. But this requires you to rewrite the file each time, unfortunately, because you can't just insert into the middle of a file. So each lookup is O(log n) but each insertion is O(n log n), and for the whole file it is O(n^2 log n).
Brian R. Bondy
Insertion sort is O(n^2), unless you're using a file format such as a B*tree to effectively do an on-disk sort - in which case I'd expect just doing a normal external sort to be much more efficient (due to less seeking) unless the proportion of uniques is very low.
bdonlan
Good points about storing with a smart file format, and also about it not being so bad if you don't have many unique entries.
Brian R. Bondy
+1  A: 

You could probably just write an in-place sort for it. You're still going to have to load the file into memory, though, because in-place sorting with IO wouldn't be efficient; you'd have to read and write for every comparison.

hypoxide
Sorting with IO is obviously less efficient than in-memory sorting, but it can use a heck of a lot less memory if you're willing to make the tradeoff. Google "external sort" - there are a bunch of clever algorithms.
Charlie Tangora
I'm sure there are, but the disk seek time on an O(1) algorithm performed on 10^36 values is still going to be ridiculous. Just put it in memory and sort it, I say.
hypoxide
correction, O(n)*
hypoxide
+3  A: 

Well, std::set might be slightly faster and use less memory than std::map.
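
For illustration, a minimal sketch of the same program using std::set, reading whitespace-separated tags from standard input:

#include <iostream>
#include <set>
#include <string>
using namespace std;

// Sketch: same de-duplication as the question's program, but std::set
// stores only the key, avoiding the unused int kept per entry by std::map.
int main() {
    set<string> Tags;
    string Elem;
    while (cin >> Elem)
        Tags.insert(Elem);
    for (set<string>::iterator it = Tags.begin(); it != Tags.end(); ++it)
        cout << *it << endl;       // set iterates in sorted order
    return 0;
}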

Dingo
+6  A: 

This depends a bit on the characteristics of your dataset. In the worst case, where all strings are unique, you will need either O(n) memory to record your seen-set, or O(n^2) time to re-scan the entire file on each word. However, there are improvements that can be made.

First off, if your dataset only consists of 3-digit integers, then a simple array of 1000 bools will be much more memory efficient than a map.
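
As a rough sketch, assuming every entry really is a 3-digit decimal string:

#include <bitset>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;

// Sketch: one bit per possible 3-digit value instead of a map of strings.
int main() {
    bitset<1000> seen;
    string elem;
    while (cin >> elem)
        seen.set(atoi(elem.c_str()));   // "042" -> slot 42
    for (int i = 0; i < 1000; ++i)
        if (seen[i])
            printf("%03d\n", i);        // re-pad with leading zeros
    return 0;
}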

If you're using general data, then another good approach would be to sort the set, so copies of the same string end up adjacent, and then simply drop adjacent duplicates. There are well-researched algorithms for sorting a dataset too large to fit in memory. This is most effective when a large percentage of the words in the set are unique, and thus holding a set of seen words in memory is prohibitively expensive.

Incidentally, this can be implemented easily with a shell pipeline, as GNU sort does the external sort for you:

tr " " "\n" < testdata | LC_ALL=C sort | uniq

Passing LC_ALL=C to sort disables locale processing and multibyte character set support, which can give a significant speed boost here.

bdonlan
Note that 'sort -u' can be more efficient than 'sort | uniq'. That's what I was about to post as an answer.
Darius Bacon
Ah, nice, didn't know about -u. Thanks for the tip :)
bdonlan
+1  A: 

std::set would be better, but you should look into hash sets. Those are not available in the current C++ standard library (although they are supposed to be in C++0x's), but some compilers have implementations. Visual C++ has stdext::hash_set, and Boost has offerings in this area as well; see the Boost Multi-index Containers Library.
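
As a rough sketch, using the TR1 form of the container; which header provides it is compiler-specific, so treat the include as an assumption:

#include <iostream>
#include <string>
#include <tr1/unordered_set>     // GCC/TR1; recent MSVC ships it as <unordered_set>
using namespace std;

// Sketch: a hash set gives average O(1) insert/lookup per tag, at the cost
// of unsorted output (pipe through sort afterwards if order matters).
int main() {
    tr1::unordered_set<string> tags;
    string elem;
    while (cin >> elem)
        tags.insert(elem);
    for (tr1::unordered_set<string>::iterator it = tags.begin(); it != tags.end(); ++it)
        cout << *it << '\n';
    return 0;
}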

Trillian
The C++0x name for these containers is `std::tr1::unordered_set`. Recent versions of Microsoft and GNU C++ compilers (and probably others) have TR1 support out of the box.
Bklyn
+2  A: 

It seems that, given the large number of entries, there will be a reasonable amount of overlap in the sequences of symbols. You could build a tree using the symbol at each position of each sequence as a node. Say you have the entries 12345 and 12346: the first four symbols overlap, so the two entries could be stored in a tree with 6 nodes.

You could walk the tree to see whether a given symbol is already present at a given position, and add it if not. When you reach the end of a given string you would mark that node as a string-terminating node. Reproducing the unique entries is then a matter of a depth-first traversal of the tree; the path from the root node to each terminator represents a unique entry.

If you partition the dataset, say into X-thousand-line chunks, and aggregate the trees, it would make a nice map-reduce job.

You're looking at a node space of 10^36, so if the data is entirely random you're looking at a very large possible number of nodes. If there's a good deal of overlap and a smaller number of unique entries, you will probably find you use a good deal less.
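
A rough sketch of such a tree as a plain pointer-based trie (the node layout and names here are illustrative only):

#include <iostream>
#include <map>
#include <string>
using namespace std;

// One node per symbol position; children are indexed by the symbol.
struct Node {
    map<char, Node*> children;
    bool terminal;                  // true if some entry ends here
    Node() : terminal(false) {}
};

// Insert an entry, creating nodes only where the prefix is new.
void insert(Node* root, const string& s) {
    Node* cur = root;
    for (string::size_type i = 0; i < s.size(); ++i) {
        map<char, Node*>::iterator it = cur->children.find(s[i]);
        if (it == cur->children.end())
            it = cur->children.insert(make_pair(s[i], new Node())).first;
        cur = it->second;
    }
    cur->terminal = true;
}

// Depth-first traversal: each terminal node yields one unique entry.
void dump(const Node* n, string& prefix) {
    if (n->terminal)
        cout << prefix << '\n';
    for (map<char, Node*>::const_iterator it = n->children.begin();
         it != n->children.end(); ++it) {
        prefix.push_back(it->first);
        dump(it->second, prefix);
        prefix.erase(prefix.size() - 1);
    }
}

int main() {
    Node root;                      // nodes are never freed in this sketch
    string elem, prefix;
    while (cin >> elem)
        insert(&root, elem);
    dump(&root, prefix);            // prints the unique entries in sorted order
    return 0;
}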

sgargan
+1 for you. You beat me to it. I was going to suggest the exact same thing :)
Shing Yip
A: 

Try STXXL, an external-memory STL, for huge datasets.

piotr
+1  A: 

The solution I would suggest is to use a memory-mapped file to access the data and radix sort to sort the list. This is trivial if the strings are all the same length. If the strings are of different lengths, use radix sort for a fast presorting on the first n characters, then sort the remaining unsorted sublists with whatever method is most appropriate. With very short sublists, a simple bubble sort would do it. With longer lists use quicksort, or use the STL set with a custom class providing the compare operator.

With memory mapping and radix sort, the memory requirement and performance would be optimal. All you need is the mapping space (~size of file) and a table of 32-bit values, one per string, to hold the linked list for the radix sort.

You could also save memory and speed up the sorting by using a more compact encoding of the strings. A simple encoding would use 2 bits per character, using the values 1, 2 and 3 for the three letters and 0 to signal the end of the string. Or, more efficiently, use a base-3 encoding packed into integers, with the length of the string encoded in front. Say you have the characters c1, c2, c3, c4; the encoding would yield the integer c1*3^3 + c2*3^2 + c3*3^1 + c4*3^0. This supposes you assign a value from 0 to 2 to each character. By using this encoding you'll save memory and will optimize the sorting if the number of strings to sort is huge.
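
As a rough sketch of the 2-bit packing, assuming the symbol-to-value mapping shown in the comment; one 64-bit word holds up to 32 characters, so the >36-character strings in the question would need two words:

#include <stdint.h>
#include <string>

// Pack up to 32 characters into one 64-bit word, 2 bits per character.
// The first character goes into the most significant bits so that
// comparing two packed words gives the same order as comparing the
// original strings (0 marks "end of string" and sorts first).
uint64_t pack(const std::string& s)
{
    uint64_t code = 0;
    for (std::string::size_type i = 0; i < s.size() && i < 32; ++i) {
        uint64_t sym = static_cast<uint64_t>(s[i] - '0') + 1;  // assumed mapping: '0'..'2' -> 1..3
        code |= sym << (2 * (31 - i));
    }
    return code;
}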

chmike
+1  A: 

You can look at the excellent Judy arrays library: a compact, trie-based, very fast and memory-efficient data structure which scales to billions of strings. Better than any search tree. You can use the JudySL functions for your purpose. You can use it much as your program uses the map, but it is much faster and more memory efficient.
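
As a rough sketch of how that could look with the JudySL string-keyed variant; the macros used here (JSLI, JSLF, JSLN, JSLFA) come from the Judy.h interface, so check the JudySL documentation for the exact signatures, and the 64-byte buffer size is an assumption:

#include <cstdio>
#include <Judy.h>                         // libjudy; link with -lJudy

#define MAXLINELEN 64                     // longest tag + terminating NUL (assumed)

int main()
{
    Pvoid_t PJSLArray = (Pvoid_t) NULL;   // the JudySL array itself
    PWord_t PValue;                       // value slot returned by the macros
    uint8_t Index[MAXLINELEN];

    // One whitespace-separated tag at a time; JSLI inserts it (or finds it).
    while (scanf("%63s", reinterpret_cast<char*>(Index)) == 1)
        JSLI(PValue, PJSLArray, Index);

    // Walk the array in sorted key order and print each unique tag.
    Index[0] = '\0';
    JSLF(PValue, PJSLArray, Index);
    while (PValue != NULL) {
        printf("%s\n", Index);
        JSLN(PValue, PJSLArray, Index);
    }

    Word_t bytes;
    JSLFA(bytes, PJSLArray);              // free the whole array
    return 0;
}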

bill