views:

285

answers:

11

Say there is a 3TB TXT file in which every line is a string. How would you find the duplicated strings in it? It's an interview question from a friend of mine. We'd better get these questions straight after an interview, in case they come up in the next one.

PS: If I were the interviewee, I would tell the interviewer: how can you guys store so many strings in a TXT file? It's really a bad idea!

A: 

Does speed count?

The obvious solution that comes to mind is to load up, say, the first 1000 lines into some kind of Set class, and then read the remaining lines one at a time and check if they're contained in the set. Then read the next 1000 lines, and repeat. That way you're only storing 1000 lines in memory at any one time.
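
A rough Java sketch of that chunked approach (the file name and chunk size are just placeholders, and a line that repeats often may be printed more than once):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class ChunkedDuplicateFinder {

    private static final int CHUNK_SIZE = 1000;

    public static void main(String[] args) throws IOException {
        Path file = Path.of("bigfile.txt"); // placeholder path
        long offset = 0;                    // lines already processed as chunks
        while (true) {
            // Load the next CHUNK_SIZE lines into a set; duplicates inside
            // the chunk itself show up as failed add() calls.
            Set<String> chunk = new HashSet<>();
            int loaded = 0;
            try (BufferedReader in = Files.newBufferedReader(file)) {
                skipLines(in, offset);
                String line;
                while (loaded < CHUNK_SIZE && (line = in.readLine()) != null) {
                    loaded++;
                    if (!chunk.add(line)) {
                        System.out.println(line);
                    }
                }
            }
            if (loaded == 0) {
                break; // end of file
            }
            // Stream the rest of the file and report lines already in the chunk.
            try (BufferedReader in = Files.newBufferedReader(file)) {
                skipLines(in, offset + loaded);
                String line;
                while ((line = in.readLine()) != null) {
                    if (chunk.contains(line)) {
                        System.out.println(line);
                    }
                }
            }
            offset += loaded;
        }
    }

    private static void skipLines(BufferedReader in, long n) throws IOException {
        for (long i = 0; i < n && in.readLine() != null; i++) {
            // just advancing the reader
        }
    }
}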

I don't think you'll score many brownie points for telling the interviewer that storing that much data in a text file is a bad idea. Who knows how this text file came to be... maybe it's the result of some legacy system, or who knows what. There are perfectly legitimate reasons for its existence.

Mark
This solution is O((n^2-n)/2), which is worse than other proposed solutions that involve sorting then removing duplicates, whose complexity is O(n log n + n)
Giuseppe Cardone
Did I get down-voted for that? I didn't say it was efficient.. it was just the first solution that came to my mind.
Mark
O(n.log(n) + n) is O(n.log(n)).
Ricky Clarkson
@Mark: Merge sort. You sort memory-sized chunks, then merge them together by streaming.
Steven Sudit
To sort the file you'll have to read it completely of course, but you don't need to completely load it up in the primary memory. Merge sort is the classic example of sorting algorithm that can sort data on disk that is too large to fit entirely into primary memory.
Giuseppe Cardone
@Ricky Of course it is, I was nitpicking :) For the same reason O((n^2-n)/2) is O(n^2), I just wanted to be precise.
Giuseppe Cardone
Meh :p At 80 chars/line, we're only talking 13 billion lines... square that and we're looking on the order of 94,447,329,657,392,904,273 operations.... efficiency doesn't matter, right? :P
Mark
Opposed to 43,980,465,111,040.
Mark
+3  A: 

One possibility is to use a Bloom filter.

According to Wikipedia:

The Bloom filter, conceived by Burton Howard Bloom in 1970,[1] is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.

This way, you can keep track of the strings you have already seen much more efficiently, but at the cost of precision.
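
A rough sketch of the idea in Java, using a plain BitSet with a handful of hash functions (the sizes and the hash family below are illustrative; a real implementation would size the filter to the expected number of lines, or use a ready-made one such as Guava's BloomFilter). Lines the filter has "seen" before are only candidate duplicates and would still need an exact second check:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.BitSet;

public class BloomDuplicateCandidates {

    private static final int BITS = 1 << 30;   // illustrative: ~128 MB of bits
    private static final int HASHES = 4;       // illustrative number of hash functions

    public static void main(String[] args) throws IOException {
        BitSet filter = new BitSet(BITS);
        try (BufferedReader in = Files.newBufferedReader(Path.of("bigfile.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (probablyContains(filter, line)) {
                    // Candidate duplicate -- may be a false positive.
                    System.out.println(line);
                } else {
                    add(filter, line);
                }
            }
        }
    }

    // Double hashing: derive HASHES indexes from two base hash values.
    private static int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = secondHash(s);
        return Math.floorMod(h1 + i * h2, BITS);
    }

    private static boolean probablyContains(BitSet filter, String s) {
        for (int i = 0; i < HASHES; i++) {
            if (!filter.get(index(s, i))) {
                return false;
            }
        }
        return true;
    }

    private static void add(BitSet filter, String s) {
        for (int i = 0; i < HASHES; i++) {
            filter.set(index(s, i));
        }
    }

    // A second, simple string hash so the two base hashes differ.
    private static int secondHash(String s) {
        int h = 0x9747b28c;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x5bd1e995;
            h ^= h >>> 15;
        }
        return h;
    }
}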

Ikke
This doesn't seem to fit the problem.
Steven Sudit
+1  A: 

A fairly straightforward way off the top of my head:

You could merge-sort the text file (merge sort performs well on data too large to fit into main memory). Then you can identify duplicates in a single pass through the file: O(n log n). Of course this will either modify the original text file, or you could make a copy.
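
Assuming the file has already been sorted (by an external merge sort, or by the `sort` utility in the answer below), the duplicate-reporting pass is a short streaming loop; a sketch in Java (the file name is a placeholder):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SortedDuplicatePass {
    public static void main(String[] args) throws IOException {
        // Identical lines are adjacent in a sorted file, so each duplicated
        // line is printed exactly once, when its second copy is seen.
        try (BufferedReader in = Files.newBufferedReader(Path.of("bigfile.sorted.txt"))) {
            String prev = null;
            boolean reported = false;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals(prev)) {
                    if (!reported) {
                        System.out.println(line);
                        reported = true;
                    }
                } else {
                    reported = false;
                }
                prev = line;
            }
        }
    }
}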

Greg Sexton
+4  A: 

sort bigfile.txt | uniq -d

Thorbjørn Ravn Andersen
The `sort` command is a DOS command or something?
Danny Chen
Standard Unix command.
Thorbjørn Ravn Andersen
How is sort implemented? Does it handle 3TB files?
Thilo
@Thilo: GNU sort uses an n-way merge sort and temporary files. Thus, given enough disk space, it is able to sort a 3TB file.
Giuseppe Cardone
The Windows/DOS version of `sort` also uses a merge-sort if needed. However, there's no `uniq` filter.
Steven Sudit
Just out of interest (and I'm not disparaging this answer at all), has anyone here actually _tried_ to sort a 3TB file? I'd test it out but I don't have that much disk space (not just free space, I don't have that much space at _all_). Timing 1M, 10M, 100M and 1G word files takes user+system of 0.276s, 3.052s, 36.978s and 455s, roughly linear (each 10x size is about 12x time), though that may be all in-memory; it may get substantially worse once it has to use disk. That would put a 3TB file, _at best_, clocking in at 2,358,720 secs or just a touch over 27 days.
paxdiablo
Typically sorting is O(n log n), so a factor of 12 per 10x size sounds reasonable. Also you should do this under a Unix operating system, not Windows.
Thorbjørn Ravn Andersen
@paxdiablo I've never sorted a 3TB file, but `sort` is well suited to huge files. `sort`ing <10 gigabytes takes no more than a minute or so on a system with a suitably fast hard disk (i.e., 10k SATA RAIDs). This is something that I used to have to do a lot.
Seth
+1  A: 

If you've got plenty of extra disk space, something like this should be workable:

for every line in the file:
    calculate a hash function for that line.
    append to a file named based on that hash (create if new).
for every file created:
    sort it.
    for every line in sorted file:
        if first line in file:
            set count to 0.
            set lastline to line.
        else:
            if line identical to lastline:
                add 1 to count.
                if count is 1:
                    Output line.
            else:
                set count to 0.
        set lastline to line.

Assuming your hash function is relatively balanced, the sorts shouldn't be too onerous.
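
A rough Java rendering of the same idea (the bucket count and paths are illustrative; with thousands of buckets you may also need to raise the OS open-file limit or close and reopen writers in append mode). Each bucket ends up small enough that a plain in-memory set, rather than a sort, is enough to spot its duplicates:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HashPartitionDuplicates {

    private static final int BUCKETS = 4096; // illustrative; tune to the target bucket size

    public static void main(String[] args) throws IOException {
        Path input = Path.of("bigfile.txt"); // placeholder path
        Path dir = Files.createTempDirectory("buckets");

        // Phase 1: split the big file into bucket files keyed by line hash,
        // so identical lines always land in the same bucket file.
        Map<Integer, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            String line;
            while ((line = in.readLine()) != null) {
                int bucket = Math.floorMod(line.hashCode(), BUCKETS);
                BufferedWriter w = writers.get(bucket);
                if (w == null) {
                    w = Files.newBufferedWriter(dir.resolve("bucket-" + bucket + ".txt"));
                    writers.put(bucket, w);
                }
                w.write(line);
                w.newLine();
            }
        }
        for (BufferedWriter w : writers.values()) {
            w.close();
        }

        // Phase 2: each bucket is small, so an in-memory set is enough.
        for (int b = 0; b < BUCKETS; b++) {
            Path bucketFile = dir.resolve("bucket-" + b + ".txt");
            if (!Files.exists(bucketFile)) {
                continue;
            }
            Set<String> seen = new HashSet<>();
            Set<String> reported = new HashSet<>();
            try (BufferedReader in = Files.newBufferedReader(bucketFile)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Print each duplicated line once.
                    if (!seen.add(line) && reported.add(line)) {
                        System.out.println(line);
                    }
                }
            }
        }
    }
}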

paxdiablo
That...... sounds like a really dirty hack. Creating files for every line? And relying on the OS?
Mark
Given the size of the file, we should expect that the number of lines will far exceed what any OS can comfortably handle. Therefore, this is not at all a viable solution.
Steven Sudit
Mark/Steven, I'm not proposing one line per file. With a 3TB file, I'd suspect there'd be quite a lot of hash collisions. The idea is to get the individual file sizes down to a more easily sortable lot and ensure all identical lines are within one file (since they'd have the same hash).
paxdiablo
Might not even need something as complex as a hash. The concept is easier introduced as "Copy all strings starting with "A" to A.tmp, "B" to B.tmp, etcetera. The 26 resulting files will still be 3 TB total, but all duplicates can now be found in a smaller file". From there, it's easy to show that "T.tmp" will have a lot of entries starting with "The", while Q.tmp will be small. The hash ensures a more fine-grained, more even distribution.
MSalters
@MSalters <pedantic> A hash function need not be complex. A simple hash is, as you described, the first letter of the string, so your solution is the same as paxdiablo's, only with a specified hash function. </pedantic>
emory
Actually, I have to admit, I didn't think of the more simplistic "starts with a letter" option. It may be faster than a hash over the whole string and, assuming a reasonably balanced distribution, using the first three letters would give you about 4000 files each about 750M, certainly doable (those calcs all assume 26 letters; actual figures might be slightly different if the character set is larger).
paxdiablo
Yup - and for the interview context, explaining that you understand all such tradeoffs is more important than the precise numbers. (But being pedantic isn't a good strategy)
MSalters
+2  A: 

Hi,

if there is just one word per line, why don't you just dump the text file into a database table with the columns id and text, and run

select text, count(text) 
from table 
group by text
having count(text)>1

Then you should get the right answer in a very easy way.
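
For completeness, a rough JDBC sketch of this (the JDBC URL, driver and table name are illustrative, e.g. an embedded H2 database, and the id column is left out since the query only needs the text; the point is to let the database's own external sorting/hashing do the heavy lifting):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcDuplicates {
    public static void main(String[] args) throws Exception {
        // Illustrative embedded-database URL; any JDBC database would do.
        try (Connection con = DriverManager.getConnection("jdbc:h2:./dupes")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE txt_lines (text VARCHAR(4000))");
            }

            // Batch-insert the file's lines.
            try (PreparedStatement ins = con.prepareStatement("INSERT INTO txt_lines (text) VALUES (?)");
                 BufferedReader in = Files.newBufferedReader(Path.of("bigfile.txt"))) {
                String line;
                int pending = 0;
                while ((line = in.readLine()) != null) {
                    ins.setString(1, line);
                    ins.addBatch();
                    if (++pending == 10_000) {   // flush in chunks
                        ins.executeBatch();
                        pending = 0;
                    }
                }
                if (pending > 0) {
                    ins.executeBatch();
                }
            }

            // The query from the answer above.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT text, COUNT(text) FROM txt_lines GROUP BY text HAVING COUNT(text) > 1")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " x" + rs.getLong(2));
                }
            }
        }
    }
}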

nWorx
Presumably, the DBMS has already optimally solved this problem. So why reinvent the wheel?
emory
I suspect this would count as "cheating".
Steven Sudit
Why cheating? I don't see any constraints on the language, and this is a very fast and simple solution.
nWorx
@nWorx we can presume by the java tag that the result should be written in Java. So to make your solution fully compliant, add some JDBC commands :)
emory
:-) that's why the solution is a dos command :-)
nWorx
A: 

Sort this file; duplicates will sort together. Alternatively, create a second file and hash (MD5?) each line into it, then sort that.
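
The hashing variant might look roughly like this in Java (the file names are placeholders; writing the line number next to each MD5 lets duplicates in the sorted hash file be traced back to the original lines):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class HashLines {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (BufferedReader in = Files.newBufferedReader(Path.of("bigfile.txt"));
             BufferedWriter out = Files.newBufferedWriter(Path.of("bigfile.md5.txt"))) {
            // Each output line is "md5-of-line <tab> line-number"; sorting the
            // output file groups identical lines (identical hashes) together.
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                byte[] digest = md5.digest(line.getBytes(StandardCharsets.UTF_8));
                out.write(String.format("%032x\t%d", new BigInteger(1, digest), lineNo));
                out.newLine();
            }
        }
    }
}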

Jaydee
The latter does not seem to be an improvement over the former.
Steven Sudit
+1  A: 
SELECT String
FROM TextFile
GROUP BY String
HAVING COUNT(*) > 1
ORDER BY String
devio
A: 

I'd propose 2 solutions.

The first would be to place each of the lines into sets, then look through the sets for ones with more than one element. I'd have the solution write the sets to disk to save on memory space.

The second would be to sort the text file like others have been suggesting.

James Raybould
A: 

A Probabilistic Solution

The technique below tries to use hash functions to identify strings which are proven unique. After the first pass, the strings will be divided into (1) proven unique and (2) possibly duplicate.

There will be many unique strings labelled possibly duplicate because of hash code collisions. Subsequent passes only work with the possibly duplicate strings, to reduce the rate of collision.

This technique does not guarantee to weed out every falsely-flagged unique string (just most of them), so the final possibly-duplicate set may still contain some strings that are in fact unique.

Let

  1. s[1], s[2], ..., s[n] be the input strings.
  2. h[1], h[2], ..., h[m] be m different hash functions, each with range 1..k.
  3. a[1,...n] be an array of bits having values 0, 1.
  4. b[1,...,m][1,...,k] be an array of flags having values 0, 1, 2.

Then

  1. For i=1 to m:
    1. For j=1 to n:
      1. if a[j]==0 // this string may or may not be unique
        1. Let x be h[i] (s[j]).
        2. if b[i][x]==0 then b[i][x]=1
        3. else if b[i][x]==1 then b[i][x]=2
      2. else if a[j]==1, this string has been proven to be unique, skip it.
    2. For j=1 to n:
      1. if a[j]==0 // this string may or may not be unique
        1. Let x be h[i] (s[j])
        2. if b[i][x]==1 then set a[j]=1 // we have proven the string to be unique
        3. else if b[i][x]==2 this string may or may not be unique
        4. else if b[i][x]==0 there is an implementation problem
      2. else if a[j]==1, this string has been proven to be unique, skip it
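
A compact Java sketch of the procedure above, done in memory over a List<String> for clarity; for a 3TB file, each inner loop over j would become a streaming pass over the file, and only the a[] array plus one row of k flags need to stay in memory. The hash family is illustrative:

import java.util.List;

public class ProbabilisticUniqueFilter {

    // Returns a[]: a[j] == true means s.get(j) is proven unique;
    // a[j] == false means it is a possible duplicate.
    static boolean[] provenUnique(List<String> s, int m, int k) {
        int n = s.size();
        boolean[] a = new boolean[n];
        for (int i = 0; i < m; i++) {
            byte[] b = new byte[k];   // flags 0, 1, 2 for hash function i
            // Pass 1: count hash hits (saturating at 2) for unresolved strings.
            for (int j = 0; j < n; j++) {
                if (!a[j]) {
                    int x = hash(s.get(j), i, k);
                    if (b[x] == 0) b[x] = 1;
                    else if (b[x] == 1) b[x] = 2;
                }
            }
            // Pass 2: a string whose bucket was hit exactly once is proven unique.
            for (int j = 0; j < n; j++) {
                if (!a[j] && b[hash(s.get(j), i, k)] == 1) {
                    a[j] = true;
                }
            }
        }
        return a;
    }

    // Family of hash functions h[i] with range 0..k-1, seeded by i (illustrative).
    static int hash(String s, int i, int k) {
        int h = 31 * i + 17;
        for (int c = 0; c < s.length(); c++) {
            h = h * 1_000_003 + s.charAt(c);
        }
        return Math.floorMod(h, k);
    }
}
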
emory
A: