There is 1TB of data on a disk, with around 1KB per data record. How do you find duplicates using 512MB of RAM and infinite disk space?

+1  A: 

Load the data into memory 512M at a time then sort that chunk and write it out to disk (as its own file). Once the entire 1T has been done this way, merge-sort the individual files into one big honkin' file, then read that big (sorted) file sequentially, writing it to the final file while removing the duplicate records.

1T, 512M at a time, will be about 2.1 million files (assuming the binary definitions of SI units rather than decimal). 512M of 1K records will only allow 524,288 records in memory at one time so you'll probably have to do the merge sort in two stages. In other words, merge-sort the 2.1 million files in four groups to create four bigger files, then merge-sort those four into the big sorted file. Then that's the one you process sequentially to remove duplicates.

The merge phase of a merge-sort simply merges multiple already-sorted files by repeatedly selecting the first remaining record from each file and choosing the "lowest". For example, with the two files a and b (each shown with its next, smallest record at the bottom):

a   b
7   6
3   5
1   4
    2
 \_/
  1 (a)
  2 (b)
  3 (a)
  4 (b)
  5 (b)
  6 (b)
  7 (a)
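
A minimal sketch of that merge step (assuming fixed-size 1 KB binary records and made-up file names; the final duplicate-removal pass is folded into the merge here for brevity):

```cpp
#include <fstream>
#include <queue>
#include <string>
#include <vector>

constexpr std::size_t kRecordSize = 1024;   // fixed record size from the question

struct Head {
    std::string record;   // the next (smallest) record of one sorted chunk file
    std::size_t source;   // which chunk file it came from
    bool operator>(const Head& other) const { return record > other.record; }
};

int main(int argc, char** argv) {
    // Open every already-sorted chunk file named on the command line.
    std::vector<std::ifstream> chunks;
    for (int i = 1; i < argc; ++i)
        chunks.emplace_back(argv[i], std::ios::binary);

    // Min-heap ordered by the byte-wise smallest head record.
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    auto refill = [&](std::size_t i) {
        std::string buf(kRecordSize, '\0');
        if (chunks[i].read(&buf[0], kRecordSize))
            heap.push({std::move(buf), i});
    };
    for (std::size_t i = 0; i < chunks.size(); ++i)
        refill(i);

    std::ofstream out("deduplicated.bin", std::ios::binary);  // hypothetical output
    std::string previous;   // last record written; duplicates compare equal to it
    while (!heap.empty()) {
        Head smallest = heap.top();
        heap.pop();
        if (smallest.record != previous) {       // skip exact duplicates
            out.write(smallest.record.data(), kRecordSize);
            previous = smallest.record;
        }
        refill(smallest.source);                 // pull the next record from that chunk
    }
}
```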
paxdiablo
If you're only trying to find duplicates, sorting the whole 1TB of data is overkill.
Mark Hurd
Really? It seems to me that a duplicate test of an unsorted dataset is O(n*n) whereas any decent sort will be more like O(n log n). If you have a better way, I'd like to see it.
paxdiablo
If you load the data into RAM 512MB at a time, that leaves you no space in RAM for your code, nor any stack space for your code to run. Even with a non-recursive sort algorithm, you'll still need stack space for your local variables, probably some memory for your heap, and possibly some global variables.
Ants
@Ants, I considered it unclear as to whether that memory limitation was for everything or just for the data. In any case, if you have less memory, you just do it in smaller chunks. The theory still holds.
paxdiablo
Finding duplicates is an O(n) operation with a hash set.
Rex Kerr
With respect, Rex, no it's not. Unless you guarantee that every hash is unique then you're still checking n/X entries against n/X where X is roughly the number of items in each bucket - that's still an O(n^2) operation since the constants are irrelevant in big-O notation. You can greatly reduce the working set by only having to check for duplicates within a bucket but claiming it's O(n) is not correct.
paxdiablo
Constants matter if they're large enough.
Mark Ransom
Notably, there are actually only 2048 chunks of size 512MiB in 1TiB. Your solution (merge sort, then check for duplicates) would be ideal when a lot of duplicates are expected. Even though it would read from and write to disk a total of 22TiB, it could do so as contiguous reads, rather than the random lookups that, e.g., my solution would require.
jemfinch
+3  A: 

Find a suitable hashing function and hash each record, storing the list of hashes with indices into a file. Now sort the hash file. Finally check the whole records of matching hashes for real duplicates.

Of course, it depends upon how many duplicates you expect to find and what you're going to do with the information afterwards.
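
A minimal sketch of the first step (assuming fixed-size 1 KB records in a single file; the file names are made up and std::hash stands in for whatever "suitable hashing function" is chosen):

```cpp
#include <cstdint>
#include <fstream>
#include <functional>
#include <string>

constexpr std::size_t kRecordSize = 1024;

int main() {
    std::ifstream data("records.bin", std::ios::binary);   // hypothetical input name
    std::ofstream hashes("hashes.bin", std::ios::binary);
    std::string record(kRecordSize, '\0');
    for (std::uint64_t index = 0; data.read(&record[0], kRecordSize); ++index) {
        std::uint64_t h = std::hash<std::string>{}(record);           // stand-in hash
        hashes.write(reinterpret_cast<const char*>(&h), sizeof h);    // hash value
        hashes.write(reinterpret_cast<const char*>(&index), sizeof index); // record index
    }
    // Next: sort hashes.bin by hash (externally if needed), then for each run of
    // equal hashes re-read the referenced records and compare them byte for byte.
}
```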

Mark Hurd
This is like a Rabin-Karp algorithm, but for records, not string searching. Except, why are you sorting the hash file? You already have a set of buckets containing possible duplicates. Why not just check the buckets?
Alex
@Alex You're thinking I placed the records into a hashtable, which is possibly an option.
Mark Hurd
Ah, right. (padding)
Alex
A: 

Generate a hash of each record; record the record number and the hash in memory, spilling to a file when necessary (sort the data in hash order before writing it to the file). As you come up with a new hash, check whether it already exists in memory - that's an early detection. (That may or may not be a major benefit.)

When you've read all the data, you will have several files of hashes plus record numbers, already sorted. Merge these files together, spotting duplicates as you go. You don't even need to do more than record the duplicates for this application, so you can discard hashes once they're proven unique.

Given the sizes - 0.5 GB of memory, 1000 GB of data, 1 KB per record, so around 1 billion records - and assuming a 256-bit hash (though 128-bit might well be adequate), we'd be using 32 bytes for the hash plus 4 bytes for the record number, so we need about 36 GB for the sort files, generated in 500 MB files (corresponding to the memory that's available). There'd therefore be 70-80 files to merge at the end, which seems pretty manageable. The list would give you the record numbers - you'd then have to access the 1 TB file to read the records. You need to give some thought to what you're going to do with the duplicates: do you need the information about the initial record and the extras, and does it matter which of the duplicates you keep and which you reject? Etc.
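
A rough sketch of the run-generation phase described above; digest128 here is only a placeholder built on std::hash so the example compiles (in practice it would be MD5 or similar), and the file names are invented:

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <functional>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

constexpr std::size_t kRecordSize = 1024;
constexpr std::size_t kEntriesPerRun = 20000000;   // ~480 MB of 24-byte entries; tune to budget

struct Entry {
    std::uint64_t hi, lo;        // the record's 128-bit digest, split into two words
    std::uint32_t recordNumber;  // which record it came from
    bool operator<(const Entry& o) const {
        return std::tie(hi, lo) < std::tie(o.hi, o.lo);
    }
};

// Placeholder digest: NOT a real 128-bit hash, just enough to make this compile.
std::pair<std::uint64_t, std::uint64_t> digest128(const std::string& bytes) {
    std::uint64_t a = std::hash<std::string>{}(bytes);
    std::uint64_t b = std::hash<std::string>{}(std::string(bytes.rbegin(), bytes.rend()));
    return {a, b};
}

void flushRun(std::vector<Entry>& run, int runNumber) {
    std::sort(run.begin(), run.end());   // sort by digest before writing the run out
    std::ofstream out("run" + std::to_string(runNumber) + ".bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(run.data()), run.size() * sizeof(Entry));
    run.clear();
}

int main() {
    std::ifstream data("records.bin", std::ios::binary);   // hypothetical input file
    std::string record(kRecordSize, '\0');
    std::vector<Entry> run;
    run.reserve(kEntriesPerRun);
    int runNumber = 0;
    for (std::uint32_t n = 0; data.read(&record[0], kRecordSize); ++n) {
        auto digest = digest128(record);
        run.push_back({digest.first, digest.second, n});
        if (run.size() == kEntriesPerRun)
            flushRun(run, runNumber++);
    }
    if (!run.empty())
        flushRun(run, runNumber++);
    // The sorted runs are then merged, spotting duplicate digests as described above.
}
```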

Jonathan Leffler
'128 might well be adequate' - if it's not, I'll eat my hat.
Nick Johnson
@Nick - leave your hat alone; it looks better on your head than on your plate. I agree. For this purpose, rather than for cryptographic security, MD5 should be fine.
Jonathan Leffler
Right. My point was that with a hash length of 128 bits, there's no plausible collection of data large enough to make a collision even remotely likely.
Nick Johnson
+9  A: 

That's a lot of records ;-) on the order of 1,000,000,000. We'd better be smart about it...

The nature of the records is unspecified: do we just discover them one at a time by reading them sequentially, or is there some kind of index, or are they maybe stored as files in various directories? Also unspecified in the question is the availability of a DBMS which we can use for index-like data (rather than having to sort it with our own code). Also, an [even rough] idea of the number of duplicates would help direct some of the choices towards an efficient process.

If no index exists, we can/should create one; this could be done as the first pass through the data. The same pass would be used to produce a message digest (a hash) of sorts for each record (or possibly, for efficiency purposes, for the first few hundred bytes of the record).

The general idea is to quickly produce an index that can be used to identify possible duplicates, and to finalize the list of actual duplicates, possibly through parallel processing.

The info useful in the index would be:

  • length of record
  • first few bytes of the text
  • hash code (more on this below)
  • also the offset in the file (or whatever pointer to the data); of course, unlike the above 3 elements, this can't be used for identifying potential matches.

The choice of the hash is critical: it should favor a fast algorithm at the expense of one that is perfectly distributed. The number of bytes hashed for each record is also a compromise; maybe 100 to 200 bytes (i.e. circa 10 to 20% of the average record size) is a good value, depending on the expected ratio of duplicates and on the time saving this provides (compared with hashing the whole record). (See the edit below.)
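
A minimal sketch of one entry of such an index (field widths are illustrative assumptions, not something the answer prescribes):

```cpp
#include <array>
#include <cstdint>

// One entry of the index described above; roughly 28-32 bytes per record, so the
// whole index stays on the order of tens of GB for about a billion records.
struct IndexEntry {
    std::uint32_t length;        // length of the record
    std::array<char, 8> prefix;  // first few bytes of the record
    std::uint64_t hash;          // fast hash of the first ~100-200 bytes
    std::uint64_t offset;        // where the record lives in the 1 TB store
};
// Candidate duplicates are records agreeing on length, prefix and hash; the
// offset is only used afterwards to fetch and compare the candidates themselves.
```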

Once such an index is available, we can [relatively quickly/effortlessly] obtain a count of possible duplicates; based on this result, a second pass aimed at improving the quality of the index can be done if it is not deemed selective enough (leaving out the records which are readily deemed unique). This second pass can compute another hash on the whole record (excluding the first x bytes covered by the first hash), or on yet another subset of the record. Note that, thanks to the index, this second pass can be multi-threaded.

The second or final pass requires sorting the records within each group of possible matches (same length, same hash code(s), same first x bytes). This can be achieved as described by Pax Diablo; the advantage of the index is that such an operation can, again, be multi-threaded and involves much smaller sets (many of them). Added: Here again Nick Johnson makes a great point that the second pass could possibly be unnecessary if we used a long hash code (he suggests a 128-bit-or-longer hash such as SHA-1). Assuming that there is no gain in partially hashing the records, this is a very plausible solution, since the index could reside on disk and yet be sorted and stored more quickly than if we were sorting/storing the whole records.


Edit: Nick Johnson makes the excellent point that seek latency on disk storage may be such that a plain sequential read is faster anyway, and that since the bottleneck is disk I/O, a fast hash function run concurrently with the read may effectively be free and hence not add to the overall process. This is a likely possibility (particularly if a sequential read is effectively required to detect each record's start/end, etc.), and that's why I "hedged my bet" by writing "depending on the time saving this provides...". That said, the actual structure of the records on disk is one of the open parameters of the question (for example, if we're just reading individual files from various directories, a non-sequential read is imposed on us), and terabyte-sized storage is likely backed by a fancy RAID where seek latency, while remaining a concern, is typically much improved.
I stand by my suggestion that a two-pass approach may be more efficient than one where each record is completely hashed, but I wish I had stressed the possibility and benefits of a single-pass approach. As with many interview questions, several characteristics of the situation at hand were unspecified; the idea is not so much to see the applicant supply the absolute right answer (although some answers may be quite wrong!) but instead to gain insight into his/her thought process and ability to identify options and decision points.

mjv
Good answer. +1
Alex
Generally a good strategy, but I think you underestimate the ratio of CPU time to disk seek latency. Given the characteristics of spinning-rust storage, there's definitely no point in only hashing some of each record - you have to read past the whole record in order to get to the next one anyway - and since your CPU is so much faster than your disk, you may as well use a well distributed hash function like SHA1.
Nick Johnson
Also, if you're going to go to all the trouble of writing an index like this to disk, you may as well include a hash long enough to be fairly certain that if H(a) == H(b) then a == b - eg, at least 128 bits - thus allowing you to skip subsequent passes.
Nick Johnson
@Alex: I'd be interested if you could qualify your comment. As far as I can tell, this particular solution is just as bad as the others (barring Potatoswatter's), and I can't seem to understand why people are voting this answer up.
Monomer
@Monomer, it seems fairly efficient (from a high-level read), though I do agree with @Nick that it may not be worth the effort to save CPU cycles by only computing a partial hash. Care to point out the major problems you find in it?
Alex
@Nick Johnson: you make two excellent points. See my edits. I certainly failed to stress the plausibility of a one-pass approach. I stand by partial hashing as an equally plausible, multi-pass approach. The eventual decision towards a solution would be driven by many unstated parameters, some of which I hinted at throughout the answer and some I left unturned (estimated percentage of duplicates; structure of the record set: one big sequential blob, files on disk...; frequency of the operation; is this one-time; will we add to this repository... etc.)
mjv
@Monomer: could you please qualify *your* comment? What in particular do you find wrong with the suggestions here? A Bloom filter *may* be an alternative, but even with Potatoswatter's settings you will get an average of **one error**, which may simply be unacceptable.
Konrad Rudolph
@Nick Johnson: ahh, good old spinning rust.
mskfisher
+15  A: 

Use a Bloom filter: a table of simultaneous hashes. According to Wikipedia, the optimal number of hashes is ln(2) * 2^32 / 2^30 ≈ 2.77 ≈ 3. (Hmm, plugging in 4 gives fewer false positives but 3 is still better for this application.) This means that you have a table of 512 megabytes, or 4 gigabits, and processing each record sets three new bits in that vast sea. If all three bits were already set, it's a potential match: record its three hash values to one file. Otherwise, record them to another file. Note the record index along with each match.

(If a 5% error rate is tolerable, omit the large file and use the small file as your results.)
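
A minimal sketch of that first pass (fixed 1 KB records assumed; the three hash functions shown are crude stand-ins for independent, well-distributed 32-bit hashes, and the record index is written alongside the hashes in both files for simplicity):

```cpp
#include <array>
#include <cstdint>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// A 2^32-bit (512 MB) Bloom filter; each record sets three bits.
class BloomFilter {
    std::vector<std::uint64_t> bits_ = std::vector<std::uint64_t>(1ull << 26);
    bool testAndSet(std::uint32_t pos) {
        std::uint64_t mask = 1ull << (pos & 63);
        bool wasSet = bits_[pos >> 6] & mask;
        bits_[pos >> 6] |= mask;
        return wasSet;
    }
public:
    // Returns true if all three bits were already set: a *potential* duplicate.
    bool insert(const std::array<std::uint32_t, 3>& h) {
        bool all = true;
        for (std::uint32_t pos : h)
            all &= testAndSet(pos);   // no short-circuit: every bit still gets set
        return all;
    }
};

// Crude stand-in for three independent 32-bit hashes of the record.
std::array<std::uint32_t, 3> threeHashes(const std::string& record) {
    std::uint64_t a = std::hash<std::string>{}(record);
    std::uint64_t b = std::hash<std::string>{}(record + '\x01');
    return {std::uint32_t(a), std::uint32_t(a >> 32), std::uint32_t(b)};
}

int main() {
    BloomFilter filter;
    std::ifstream data("records.bin", std::ios::binary);           // hypothetical names
    std::ofstream positives("positives.bin", std::ios::binary);
    std::ofstream negatives("negatives.bin", std::ios::binary);
    std::string record(1024, '\0');
    for (std::uint64_t index = 0; data.read(&record[0], record.size()); ++index) {
        std::array<std::uint32_t, 3> h = threeHashes(record);
        std::ofstream& out = filter.insert(h) ? positives : negatives;
        out.write(reinterpret_cast<const char*>(h.data()), sizeof h);    // the 3 hashes
        out.write(reinterpret_cast<const char*>(&index), sizeof index);  // record index
    }
}
```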

When finished, you should have a file of about 49M possible positive matches and a file of 975M negatives which may yet match positives. Read the former into a vector<pair<vector<uint32_t>,vector<uint32_t> > > (indexes go in the latter vector; the former can be an array) and sort it. Put the indexes in another vector<uint32_t>; they're already sorted. Read the large file, but instead of setting bits in a table, find the hash values in the vector. (For example, use equal_range.) Use the list of positive-file indices to track the index of the current record in the negative file. If no match is found, ignore it. Otherwise, append the record's index: match->second.push_back(current_negative_record_index).

Finally, iterate through the map and the vectors of record-indices. Any bucket with more than one entry is "almost" certain to contain a set of duplicates, but you've come this far, so look them up and compare them completely to be sure.

Total synchronous disk I/O: (one pass = 1 TiB) + (96 hash bits per record = 12 GiB) + (32 index bits per positive = ~200 MiB).

Final edit (seriously): On second thought, the Bloom Filter aspect might not really be helping here. The amount of hash data is more of a limiting factor than the number of false positives. With just one hash function, the total amount of hash data would be 4 GiB and the indexes of the 124 million expected false positives would be ~500 MiB. That should globally optimize this strategy.

Clarification (got a downvote): there's a distinction between a false positive from the Bloom filter and a hash collision. A hash collision can't be resolved except by returning to the original records and comparing, which is expensive. A Bloom false positive can be resolved by returning to the original hash values and comparing them, which is what the second pass of this algorithm does. So on second thought, the one-hash filter described in the "final" edit would unduly cause disk seeks. A two-hash Bloom filter would increase the number of false positives ending up in a single bucket of the match map, and would bring the number of false positives back down to the tens of millions.

Potatoswatter
@Potatoswatter: Remember that if the Bloom filter is correctly sized, then the worst-case false positive probability will only occur once the final unique element is inserted. Up until then the false positive probability of a query can potentially be orders of magnitude less - but it does increase gradually as more elements are added.
Monomer
@Monomer: Does that change anything here?
Potatoswatter
@Potatoswatter: It sure does; it only requires one pass. Imagine you've constructed a BF with a false positive prob of 1/2n where n is the number of items you will insert into the BF. The second item you query will see a false positive prob of ~1/(2n^e*delta_0), the third will see 1/(2n^e*delta_1), where delta_i = m*i+c with a limit towards 1/e. The point being, only when you insert/query the last unique element will you actually see the desired false positive probability; all the other queries have had the pleasure of a false positive prob roughly at least an order of magnitude less ...
Monomer
@Monomer: I understand that (although "all the other queries" is not precisely correct). But it doesn't affect my answer any. I said "the first pass tells you with high certainty about all duplicates besides the 'first.'" If you want absolute certainty or identification of *all* duplicates, you need one pass plus a 24 GiB second pass (20 GiB with five hash functions; less if fewer hashes are cleverly combined). I think the main issue with my answer is storage of the `map`… it's too risky to assume its size is negligible.
Potatoswatter
The false-positive chance of the last insertion *alone* is about 22%. The expected number of false positives total is about 58 million or 5.8%. That's enough to need to account for storing the `map`, and enough to seriously limit the usefulness of the single-pass version.
Potatoswatter
+1  A: 

First, configure computer with infinitely large swap file on infinitely large drive...

brone
I was tempted to give you a +1, but then somebody might think this was a serious answer.
Mark Ransom
+11  A: 

I'm late to the party, so unfortunately I doubt my solution will get much consideration, but I think it deserves to be offered.

The solutions offered so far seem too complicated. A bloom filter, while being the data structure du jour for the last several years, isn't best applied in a situation like this: because no data can be associated with the hashed content, you must not only maintain the bloom filter, but you must still record each (only 6-bit!) hash value and record to disk, destroying the benefit of the bloom filter and having a preposterously high collision rate.

On the other hand, merge sorting the entire terabyte is not only going to take O(n log n) comparisons, but O(n log n) disk traffic, since the majority of the intermediate files would have to be merged from disk, rather than from memory. Any real solution should try to reduce disk traffic as much as possible, since that's our primary bottleneck.

My solution is simple, making one assumption: that the terabyte of data is recorded in what's effectively one file.

Iterate through the records of the terabyte file and hash them. A cryptographic hash is unnecessary, costly and too large here; instead, use something like the 64-bit version of murmurhash. It can hash more than 2GiB/sec (far faster than we'll likely need, given the speed of storage these days) and has excellent (though not cryptographically secure) collision resistance. With a 64-bit hash, we would expect our first collision at 2^32, so it's probable that our approximately one billion records will not have any collisions at all.

Write the hashes and their associated record offsets out to another file. Since the records contain arbitrary binary data, we can't rely on UNIX's sort(1) to sort it, because some of the hashes and offsets may contain what sort(1) will interpret as newlines. We'll simply write the records out as fixed-width (probably 16 bytes: 8 bytes for the murmur2 64-bit hash, and 8 bytes for the offset in the terabyte file) records. The resulting file should be about 16GB, given our number of records.

We can sort this file by reading as many records as will safely fit into memory, sorting them, and flushing the sorted chunks back to disk. We can fit more records into memory with a heapsort (it uses O(1) space) than with a quicksort (which uses O(log n) memory for the call stack), but in most implementations quicksort wins by virtue of its memory locality and lower instruction count. These intermediate files (there should be 35-40 of them) will be written out to disk.

The last step is to merge these files (in memory; there's no need to store the result on disk for this), collecting all hash collisions, looking up the associated records in the terabyte file, comparing the records for duplication, and emitting the records (or their offsets) in whatever way the problem specifies.
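
A small sketch of that verification step, assuming the merged, hash-sorted (hash, offset) stream has already been produced; the file names are made up:

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

constexpr std::size_t kRecordSize = 1024;

std::string readRecord(std::ifstream& file, std::uint64_t offset) {
    std::string buf(kRecordSize, '\0');
    file.seekg(offset);
    file.read(&buf[0], kRecordSize);
    return buf;
}

int main() {
    std::ifstream sorted("hash_offset_sorted.bin", std::ios::binary);  // hypothetical
    std::ifstream data("terabyte.bin", std::ios::binary);              // hypothetical

    std::uint64_t prevHash = 0, prevOffset = 0, hash, offset;
    bool havePrev = false;
    while (sorted.read(reinterpret_cast<char*>(&hash), sizeof hash) &&
           sorted.read(reinterpret_cast<char*>(&offset), sizeof offset)) {
        // Equal adjacent hashes are only *candidate* duplicates; confirm by
        // re-reading both records from the terabyte file and comparing bytes.
        if (havePrev && hash == prevHash &&
            readRecord(data, offset) == readRecord(data, prevOffset)) {
            std::cout << "duplicate records at offsets "
                      << prevOffset << " and " << offset << '\n';
        }
        prevHash = hash; prevOffset = offset; havePrev = true;
    }
}
```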

As far as I can tell, this task hits the disk significantly less than any other offered solution and it's very conceptually simple: hash the records, look for duplicates in the hashes, and verify in the actual records.

For disk IO, it would read the terabyte data file, write 16GB to disk, read that 16GB from disk and write it back sorted, then read it and return the duplicates. As an optimization, the process hashing the records could accumulate them in memory before flushing them out to disk, sorting them before doing so: that cuts out the 16GB intermediate file, and allows the process to move from hashing directly to merging and reporting duplicates.

jemfinch
It's not a 6-bit hash. Each record sets six bits within a ~4-gigabit vector, and you'd store the indices of those bits in the event of a collision. In more traditional hash terms, it's closer to 6*32=192 bits.
Boojum
You have got 10^9 blocks and you expect your first collision at 2^32 (is this actually true? I believe you under-estimate the collision rate of murmurhash but I didn’t check this). How can you conclude that you’ll have no collision at all? In fact, you’ve got a 1 in 4 chance of collision: 2^32 ≈ 4e9.
Konrad Rudolph
@Red-nosed unicorn: Did you read the article I linked on the birthday paradox? That covers the math involved in my claim and has a specific section devoted to collisions in hash functions. As a general rule, given a uniformly distributed hash with N bits, you should expect (by which I mean "there's a greater than 50% chance of") a collision around the 2^(N/2)th object. So with the 64-bit hash I recommend, chances are you won't see a collision (apart from a genuine duplicate) in the billion things hashed. In any case, collisions will be unlikely, and my algorithm accounts for them.
jemfinch
@Boojum: You're right, I didn't intend to imply that the hash was only effectively 6 bits. The fact remains, however, that a bloom filter is just not the right data structure for this problem, as evinced by its 58 (now 49?) *million* expected collisions. Just checking that many collisions would take more than 80 hours on a disk with a 5ms seek time.
jemfinch
@jemfinch Didn’t read it but I *know* the birthday paradox and I know the math. However, my remark still stands, since this 2^(N/2) happens to be in the same order of magnitude as the 1e9 elements so chances are, you *will perhaps* see a collision (although my number of 1/4 was of course wrong – it’s 1/4 of 50%, i.e. 1/8). A 1 in 8 chance of a collision is pretty bad if we want to *guarantee* that the algorithm works without errors.
Konrad Rudolph
@Red-nosed unicorn: Yes, I know that the algorithm *can* see a collision. It *will* see a collision for every duplicate record, of course, as well. That's why, after sorting the hash records, the algorithm verifies by actual comparison whether the collided records are indeed duplicates ("collecting all hash collisions and looking up the associated records in the terabyte file, comparing the records for duplication"). My algorithm does not depend on a small number of collisions for *correctness*, but for *efficiency*, since false positives and their random reads are relatively expensive.
jemfinch
The efficiency of this algorithm is very much predicated on the presumption that *actual* duplicates are uncommon. If duplicates are common (I'm not exactly sure of the break-even point) the random reads necessary to verify them with a hash-based solution like this one would utterly swamp the computation time involved here. If it's known ahead of time that duplicates are common, this algorithm should be switched to a cryptographically secure hash and duplicate hashes should be reported without verification, or the data should simply be merge sorted, with one final pass reporting duplicates.
jemfinch
+1, not bad. But you're not correctly accounting for the disk I/O of sorting the 16GiB file using just 0.5GiB of memory. You'll need `log n` passes (a factor of 5?). Also, my algo doesn't need a disk seek to filter out every Bloom false positive (which number ~50M), only collisions between the 192 bit six-hash catenations, which should never happen.
Potatoswatter
@Potatoswatter Remember, though, that merging sorted files doesn't have to go two files at a time; we can merge an arbitrary number of files in O(number_of_files) memory by using a min-heap of (head record, file descriptor) pairs. We emit the smallest record, read a new record from the given file, and sift that (new record, file descriptor) pair down in the heap as appropriate. That merge would still be O(n log n) time, but wouldn't require any additional writes to disk.
jemfinch
To optimize our disk usage (contiguous reads rather than many random seeks), we could actually make the heap maintain (top_records, current_index_into_top_records, file descriptor) triples, where "top records" is an array of 512k/number_of_files (about 15k) records read in chunks from the file descriptor. The triples would be compared by `top_records[current_index_into_top_records]`.
jemfinch
Ah, I see. The optimization sounds basically like a cache. Why write the entire sorted list back to disk? Can't you read from the 32 files in 16 MB chunks and output only potential duplicates? Well, that's pretty cool that there are two such distinct solutions leading to "about" one pass plus one percent.
Potatoswatter
I never intended to write the entire sorted list back to disk; I only intended to read from the 32(ish) files in chunks and output the real duplicates :) I actually wrote a heap-based merge of the kind I described above in C++ back in the day; maybe I'll see if I can dig it up and put it on github.
jemfinch