I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.
Each row contains 15 fields and several hundred characters, and all of the fields are needed to determine uniqueness. Instead of comparing entire rows to find duplicates, I compare hash(row-as-a-string) in an attempt to save memory. I set up a filter that partitions the data into groups with roughly equal numbers of rows (e.g. by day of the week), and each partition is small enough that a lookup table of hash values for that partition fits in memory. I pass through the file once for each partition, checking for unique rows and writing them out to a second file (simplified code):
import csv

# only three of the fifteen fields are shown here to keep the example short
headers = ['DayOfWeek', 'a', 'b']
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

outs = csv.DictWriter(open('c:/dedupedFile.csv', 'wb'), headers)
outs.writerow(dict(zip(headers, headers)))  # no DictWriter.writeheader() in Python 2.5

for day in days:
    htable = {}
    ins = csv.DictReader(open('c:/bigfile.csv', 'rb'), headers)
    for line in ins:
        if line['DayOfWeek'] == day:
            # hash the concatenated field values so only an int is kept per row
            hvalue = hash(reduce(lambda x, y: x + y, line.itervalues()))
            if hvalue not in htable:
                htable[hvalue] = None
                outs.writerow(line)
One way I was thinking of speeding this up is to find a better filter that reduces the number of passes needed. Assuming the row lengths are roughly uniformly distributed, maybe instead of
for day in days:
and
if line['DayOfWeek']==day:
we have
for i in range(n):
and
if len(reduce(lambda x, y: x + y, line.itervalues())) % n == i:
where 'n' is as small as memory will allow. But this is still the same basic method.
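For concreteness, here is a minimal sketch of that variant; the setup mirrors the code above, and n = 4 is just a placeholder for whatever partition count memory allows:

import csv

n = 4  # assumed partition count for illustration; pick the smallest n that fits in memory
headers = ['DayOfWeek', 'a', 'b']

outs = csv.DictWriter(open('c:/dedupedFile.csv', 'wb'), headers)
outs.writerow(dict(zip(headers, headers)))

for i in range(n):
    htable = {}
    ins = csv.DictReader(open('c:/bigfile.csv', 'rb'), headers)
    for line in ins:
        rowstring = reduce(lambda x, y: x + y, line.itervalues())
        # partition on row length mod n instead of on DayOfWeek
        if len(rowstring) % n == i:
            hvalue = hash(rowstring)
            if hvalue not in htable:
                htable[hvalue] = None
                outs.writerow(line)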
Wayne Werner provided a good practical solution below; I was curious whether there is a better/faster/simpler way to do this from an algorithmic perspective.
P.S. I'm limited to Python 2.5.