views:

285

answers:

11

Say there is a 3TB TXT file in which every line is a string. How would you find the duplicated strings in it? It's an interview question from a friend of mine. We'd better get these questions straight after an interview, in case they come up in the next one.

PS: If I were the interviewee, I would tell the interviewer: how can you guys store so many strings in a TXT file? It's really a bad idea!

A: 

Does speed count?

The obvious solution that comes to mind is to load up, say, the first 1000 lines into some kind of Set class, and then read the remaining lines one at a time and check if they're contained in the set. Then read the next 1000 lines, and repeat. That way you're only storing 1000 lines in memory at any one time.
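
A rough Java sketch of that chunked approach (the file name and chunk size are just placeholders, and a line that repeats often may be printed more than once):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class ChunkedDuplicateFinder {

    private static final int CHUNK_SIZE = 1000;

    public static void main(String[] args) throws IOException {
        Path file = Path.of("bigfile.txt"); // placeholder path
        long offset = 0;                    // lines already processed as chunks
        while (true) {
            // Load the next CHUNK_SIZE lines into a set; duplicates inside
            // the chunk itself show up as failed add() calls.
            Set<String> chunk = new HashSet<>();
            int loaded = 0;
            try (BufferedReader in = Files.newBufferedReader(file)) {
                skipLines(in, offset);
                String line;
                while (loaded < CHUNK_SIZE && (line = in.readLine()) != null) {
                    loaded++;
                    if (!chunk.add(line)) {
                        System.out.println(line);
                    }
                }
            }
            if (loaded == 0) {
                break; // end of file
            }
            // Stream the rest of the file and report lines already in the chunk.
            try (BufferedReader in = Files.newBufferedReader(file)) {
                skipLines(in, offset + loaded);
                String line;
                while ((line = in.readLine()) != null) {
                    if (chunk.contains(line)) {
                        System.out.println(line);
                    }
                }
            }
            offset += loaded;
        }
    }

    private static void skipLines(BufferedReader in, long n) throws IOException {
        for (long i = 0; i < n && in.readLine() != null; i++) {
            // just advancing the reader
        }
    }
}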

I don't think you'll score many brownie points for telling the interviewer that storing that much data in a text file is a bad idea. Who knows how this text file came to be... maybe it's the result of some legacy system, or who knows what. There are perfectly legitimate reasons for its existence.

Mark
This solution is O((n^2-n)/2), which is worse than other proposed solutions that involve sorting then removing duplicates, whose complexity is O(n log n + n)
Giuseppe Cardone
Did I get down-voted for that? I didn't say it was efficient.. it was just the first solution that came to my mind.
Mark
O(n.log(n) + n) is O(n.log(n)).
Ricky Clarkson
@Mark: Merge sort. You sort memory-sized chunks, then merge them together by streaming.
Steven Sudit
To sort the file you'll have to read it completely of course, but you don't need to completely load it up in the primary memory. Merge sort is the classic example of sorting algorithm that can sort data on disk that is too large to fit entirely into primary memory.
Giuseppe Cardone
@Ricky Of course it is, I was nitpicking :) For the same reason O((n^2-n)/2) is O(n^2), I just wanted to be precise.
Giuseppe Cardone
Meh :p At 80 chars/line, we're only talking 13 billion lines... square that and we're looking on the order of 94,447,329,657,392,904,273 operations.... efficiency doesn't matter, right? :P
Mark
Opposed to 43,980,465,111,040.
Mark
+3  A: 

One possibility is to use a Bloom filter.

According to Wikipedia:

The Bloom filter, conceived by Burton Howard Bloom in 1970,[1] is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.

This way, you can keep track of the strings you have already seen much more efficiently, but at the cost of precision.
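
A rough sketch of the idea in Java, using a plain BitSet with a handful of hash functions (the sizes and the hash family below are illustrative; a real implementation would size the filter to the expected number of lines, or use a ready-made one such as Guava's BloomFilter). Lines the filter has "seen" before are only candidate duplicates and would still need an exact second check:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.BitSet;

public class BloomDuplicateCandidates {

    private static final int BITS = 1 << 30;   // illustrative: ~128 MB of bits
    private static final int HASHES = 4;       // illustrative number of hash functions

    public static void main(String[] args) throws IOException {
        BitSet filter = new BitSet(BITS);
        try (BufferedReader in = Files.newBufferedReader(Path.of("bigfile.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (probablyContains(filter, line)) {
                    // Candidate duplicate -- may be a false positive.
                    System.out.println(line);
                } else {
                    add(filter, line);
                }
            }
        }
    }

    // Double hashing: derive HASHES indexes from two base hash values.
    private static int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = secondHash(s);
        return Math.floorMod(h1 + i * h2, BITS);
    }

    private static boolean probablyContains(BitSet filter, String s) {
        for (int i = 0; i < HASHES; i++) {
            if (!filter.get(index(s, i))) {
                return false;
            }
        }
        return true;
    }

    private static void add(BitSet filter, String s) {
        for (int i = 0; i < HASHES; i++) {
            filter.set(index(s, i));
        }
    }

    // A second, simple string hash so the two base hashes differ.
    private static int secondHash(String s) {
        int h = 0x9747b28c;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x5bd1e995;
            h ^= h >>> 15;
        }
        return h;
    }
}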

Ikke
This doesn't seem to fit the problem.
Steven Sudit
+1  A: 

A fairly straightforward way off the top of my head:

You could merge-sort the text file (merge sort performs well on data too large to fit into main memory). Then you can identify duplicates in a single pass through the file: O(n log n). Of course this will either modify the original text file, or you could make a copy.
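
Assuming the file has already been sorted (by an external merge sort, or by the `sort` utility in the answer below), the duplicate-reporting pass is a short streaming loop; a sketch in Java (the file name is a placeholder):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SortedDuplicatePass {
    public static void main(String[] args) throws IOException {
        // Identical lines are adjacent in a sorted file, so each duplicated
        // line is printed exactly once, when its second copy is seen.
        try (BufferedReader in = Files.newBufferedReader(Path.of("bigfile.sorted.txt"))) {
            String prev = null;
            boolean reported = false;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals(prev)) {
                    if (!reported) {
                        System.out.println(line);
                        reported = true;
                    }
                } else {
                    reported = false;
                }
                prev = line;
            }
        }
    }
}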

Greg Sexton
+4  A: 

sort bigfile.txt | uniq -d

Thorbjørn Ravn Andersen
The `sort` command is a DOS command or something?
Danny Chen
Standard Unix command.
Thorbjørn Ravn Andersen
How is sort implemented? Does it handle 3TB files?
Thilo
@Thilo: GNU sort uses an n-way merge sort and temporary files. Thus, given enough disk space, it is able to sort a 3TB file.
Giuseppe Cardone
The Windows/DOS version of `sort` also uses a merge-sort if needed. However, there's no `uniq` filter.
Steven Sudit
Just out of interest (and I'm not disparaging this answer at all), has anyone here actually _tried_ to sort a 3TB file? I'd test it out but I don't have that much disk space (not just free space, I don't have that much space at _all_). Timing 1M, 10M, 100M and 1G word files takes user+system of 0.276s, 3.052s, 36.978s and 455s, roughly linear (each 10x size is about 12x time), though that may be all in-memory; it may get substantially worse once it has to use disk. That would put a 3TB file, _at best_, clocking in at 2,358,720 secs or just a touch over 27 days.
paxdiablo
Typically sorting is O(n log n), so a factor of 12 per 10x size sounds reasonable. Also you should do this under a Unix operating system, not Windows.
Thorbjørn Ravn Andersen
@paxdiablo I've never sorted a 3TB file, but `sort` is well suited to huge files. `sort`ing <10 gigabytes takes no more than a minute or so on a system with a suitably fast hard disk (i.e., 10k SATA RAIDs). This is something that I used to have to do a lot.
Seth
+1  A: 

If you've got plenty of extra disk space, something like this should be workable:

for every line in the file:
    calculate a hash function for that line.
    append to a file named based on that hash (create if new).
for every file created:
    sort it.
    for every line in sorted file:
        if first line in file:
            set count to 0.
            set lastline to line.
        else:
            if line identical to lastline:
                add 1 to count.
                if count is 1:
                    Output line.
            else:
                set count to 0.
        set lastline to line.

Assuming your hash function is relatively balanced, the sorts shouldn't be too onerous.
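
A rough Java rendering of the same idea (the bucket count and paths are illustrative; with thousands of buckets you may also need to raise the OS open-file limit or close and reopen writers in append mode). Each bucket ends up small enough that a plain in-memory set, rather than a sort, is enough to spot its duplicates:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HashPartitionDuplicates {

    private static final int BUCKETS = 4096; // illustrative; tune to the target bucket size

    public static void main(String[] args) throws IOException {
        Path input = Path.of("bigfile.txt"); // placeholder path
        Path dir = Files.createTempDirectory("buckets");

        // Phase 1: split the big file into bucket files keyed by line hash,
        // so identical lines always land in the same bucket file.
        Map<Integer, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            String line;
            while ((line = in.readLine()) != null) {
                int bucket = Math.floorMod(line.hashCode(), BUCKETS);
                BufferedWriter w = writers.get(bucket);
                if (w == null) {
                    w = Files.newBufferedWriter(dir.resolve("bucket-" + bucket + ".txt"));
                    writers.put(bucket, w);
                }
                w.write(line);
                w.newLine();
            }
        }
        for (BufferedWriter w : writers.values()) {
            w.close();
        }

        // Phase 2: each bucket is small, so an in-memory set is enough.
        for (int b = 0; b < BUCKETS; b++) {
            Path bucketFile = dir.resolve("bucket-" + b + ".txt");
            if (!Files.exists(bucketFile)) {
                continue;
            }
            Set<String> seen = new HashSet<>();
            Set<String> reported = new HashSet<>();
            try (BufferedReader in = Files.newBufferedReader(bucketFile)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Print each duplicated line once.
                    if (!seen.add(line) && reported.add(line)) {
                        System.out.println(line);
                    }
                }
            }
        }
    }
}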

paxdiablo
That...... sounds like a really dirty hack. Creating files for every line? And relying on the OS?
Mark
Given the size of the file, we should expect that the number of lines will far exceed what any OS can comfortably handle. Therefore, this is not at all a viable solution.
Steven Sudit
Mark/Steven, I'm not proposing one line per file. With a 3TB file, I'd suspect there'd be quite a lot of hash collisions. The idea is to get the individual file sizes down to a more easily sortable lot and ensure all identical lines are within one file (since they'd have the same hash).
paxdiablo
Might not even need something as complex as a hash. The concept is easier introduced as "Copy all strings starting with "A" to A.tmp, "B" to B.tmp, etcetera. The 26 resulting files will still be 3 TB total, but all duplicates can now be found in a smaller file". From there, it's easy to show that "T.tmp" will have a lot of entries starting with "The", while Q.tmp will be small. The hash ensures a more fine-grained, more even distribution.
MSalters
@MSalters <pedantic> A hash function need not be complex. A simple hash is, as you described, the first letter of the string, so your solution is the same as paxdiablo's, only with a specified hash function. </pedantic>
emory
Actually, I have to admit, I didn't think of the more simplistic "starts with a letter" option. It may be faster than a hash over the whole string and, assuming a reasonably balanced distribution, using the first three letters would give you about 4000 files each about 750M, certainly doable (those calcs all assume 26 letters; actual figures might be slightly different if the character set is larger).
paxdiablo
Yup - and for the interview context, explaining that you understand all such tradeoffs is more important than the precise numbers. (But being pedantic isn't a good strategy)
MSalters
+2  A: 

Hi,

if there is just one word per line, why don't you just dump the text file into a database table with the columns id and text, and run

select text, count(text) 
from table 
group by text
having count(text)>1

Then you should get the right answer in a very easy way.
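
For completeness, a rough JDBC sketch of this (the JDBC URL, driver and table name are illustrative, e.g. an embedded H2 database, and the id column is left out since the query only needs the text; the point is to let the database's own external sorting/hashing do the heavy lifting):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcDuplicates {
    public static void main(String[] args) throws Exception {
        // Illustrative embedded-database URL; any JDBC database would do.
        try (Connection con = DriverManager.getConnection("jdbc:h2:./dupes")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE txt_lines (text VARCHAR(4000))");
            }

            // Batch-insert the file's lines.
            try (PreparedStatement ins = con.prepareStatement("INSERT INTO txt_lines (text) VALUES (?)");
                 BufferedReader in = Files.newBufferedReader(Path.of("bigfile.txt"))) {
                String line;
                int pending = 0;
                while ((line = in.readLine()) != null) {
                    ins.setString(1, line);
                    ins.addBatch();
                    if (++pending == 10_000) {   // flush in chunks
                        ins.executeBatch();
                        pending = 0;
                    }
                }
                if (pending > 0) {
                    ins.executeBatch();
                }
            }

            // The query from the answer above.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT text, COUNT(text) FROM txt_lines GROUP BY text HAVING COUNT(text) > 1")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " x" + rs.getLong(2));
                }
            }
        }
    }
}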

nWorx
Presumably, the DBMS has already optimally solved this problem. So why reinvent the wheel?
emory
I suspect this would count as "cheating".
Steven Sudit
Why cheating? I don't see any constraints on the language, and this is a very fast and simple solution.
nWorx
@nWorx we can presume by the java tag that the result should be written in Java. So to make your solution fully compliant, add some JDBC commands :)
emory
:-) that's why the solution is a dos command :-)
nWorx
A: 

Sort this file; duplicates will sort together. Alternatively, create a second file and hash (MD5?) each line into it, then sort that.
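
The hashing variant might look roughly like this in Java (the file names are placeholders; writing the line number next to each MD5 lets duplicates in the sorted hash file be traced back to the original lines):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class HashLines {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (BufferedReader in = Files.newBufferedReader(Path.of("bigfile.txt"));
             BufferedWriter out = Files.newBufferedWriter(Path.of("bigfile.md5.txt"))) {
            // Each output line is "md5-of-line <tab> line-number"; sorting the
            // output file groups identical lines (identical hashes) together.
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                byte[] digest = md5.digest(line.getBytes(StandardCharsets.UTF_8));
                out.write(String.format("%032x\t%d", new BigInteger(1, digest), lineNo));
                out.newLine();
            }
        }
    }
}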

Jaydee
The latter does not seem to be an improvement over the former.
Steven Sudit
+1  A: 
SELECT String
FROM TextFile
GROUP BY String
HAVING COUNT(*) > 1
ORDER BY String
devio
A: 

I'd propose 2 solutions.

The first would be to place each of the lines into sets, then look through the sets for ones with more than one element. I'd have the solution write the sets to disk to save on memory space.

The second would be to sort the text file like others have been suggesting.

James Raybould
A: 

A Probabilistic Solution

The technique below tries to use hash functions to identify strings which are proven unique. After the first pass, the strings will be divided into (1) proven unique and (2) possibly duplicate.

There will be many unique strings labelled possibly duplicate because of hash code collisions. Subsequent passes only work with the possibly duplicate strings, to reduce the rate of collision.

This technique does not guarantee to weed out every falsely-flagged unique string (just most of them), so the final possibly-duplicate set may still contain some strings that are in fact unique.

Let

  1. s[1], s[2], ..., s[n] be the input strings.
  2. h[1], h[2], ..., h[m] be m different hash functions, each with range 1..k.
  3. a[1,...n] be an array of bits having values 0, 1.
  4. b[1,...,m][1,...,k] be an array of flags having values 0, 1, 2.

Then

  1. For i=1 to m:
    1. For j=1 to n:
      1. if a[j]==0 // this string may or may not be unique
        1. Let x be h[i] (s[j]).
        2. if b[i][x]==0 then b[i][x]=1
        3. else if b[i][x]==1 then b[i][x]=2
      2. else if a[j]==1, this string has been proven to be unique, skip it.
    2. For j=1 to n:
      1. if a[j]==0 // this string may or may not be unique
        1. Let x be h[i] (s[j])
        2. if b[i][x]==1 then set a[j]=1 // we have proven the string to be unique
        3. else if b[i][x]==2 this string may or may not be unique
        4. else if b[i][x]==0 there is an implementation problem
      2. else if a[j]==1, this string has been proven to be unique, skip it
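
A compact Java sketch of the procedure above, done in memory over a List<String> for clarity; for a 3TB file, each inner loop over j would become a streaming pass over the file, and only the a[] array plus one row of k flags need to stay in memory. The hash family is illustrative:

import java.util.List;

public class ProbabilisticUniqueFilter {

    // Returns a[]: a[j] == true means s.get(j) is proven unique;
    // a[j] == false means it is a possible duplicate.
    static boolean[] provenUnique(List<String> s, int m, int k) {
        int n = s.size();
        boolean[] a = new boolean[n];
        for (int i = 0; i < m; i++) {
            byte[] b = new byte[k];   // flags 0, 1, 2 for hash function i
            // Pass 1: count hash hits (saturating at 2) for unresolved strings.
            for (int j = 0; j < n; j++) {
                if (!a[j]) {
                    int x = hash(s.get(j), i, k);
                    if (b[x] == 0) b[x] = 1;
                    else if (b[x] == 1) b[x] = 2;
                }
            }
            // Pass 2: a string whose bucket was hit exactly once is proven unique.
            for (int j = 0; j < n; j++) {
                if (!a[j] && b[hash(s.get(j), i, k)] == 1) {
                    a[j] = true;
                }
            }
        }
        return a;
    }

    // Family of hash functions h[i] with range 0..k-1, seeded by i (illustrative).
    static int hash(String s, int i, int k) {
        int h = 31 * i + 17;
        for (int c = 0; c < s.length(); c++) {
            h = h * 1_000_003 + s.charAt(c);
        }
        return Math.floorMod(h, k);
    }
}
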
emory
A: