I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.
Each row contains 15 fields and several hundred characters, and all of the fields are needed to determine uniqueness. Instead of comparing entire rows to find duplicates, I compare hash(row-as-a-string) in an attempt to save memory. I set up a filter that partitions the data into groups with roughly equal numbers of rows (e.g. by day of the week), and each partition is small enough that a lookup table of hash values for that partition fits in memory. I pass through the file once for each partition, checking for unique rows and writing them out to a second file (simplified code):
import csv

# only three of the fifteen fields are shown here to keep the example short
headers = ['DayOfWeek', 'a', 'b']
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

outs = csv.DictWriter(open('c:/dedupedFile.csv', 'wb'), headers)
outs.writerow(dict(zip(headers, headers)))  # no DictWriter.writeheader() in Python 2.5

for day in days:
    htable = {}
    ins = csv.DictReader(open('c:/bigfile.csv', 'rb'), headers)
    for line in ins:
        if line['DayOfWeek'] == day:
            # hash the concatenated field values so only an int is kept per row
            hvalue = hash(reduce(lambda x, y: x + y, line.itervalues()))
            if hvalue not in htable:
                htable[hvalue] = None
                outs.writerow(line)
One way I was thinking of speeding this up is to find a better filter that reduces the number of passes needed. Assuming the row lengths are roughly uniformly distributed, maybe instead of
for day in days:
and
if line['DayOfWeek']==day:
we have
for i in range(n):
and
if len(reduce(lambda x, y: x + y, line.itervalues())) % n == i:
where 'n' is as small as memory will allow. But this is still the same basic method.
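For concreteness, here is a minimal sketch of that variant; the setup mirrors the code above, and n = 4 is just a placeholder for whatever partition count memory allows:

import csv

n = 4  # assumed partition count for illustration; pick the smallest n that fits in memory
headers = ['DayOfWeek', 'a', 'b']

outs = csv.DictWriter(open('c:/dedupedFile.csv', 'wb'), headers)
outs.writerow(dict(zip(headers, headers)))

for i in range(n):
    htable = {}
    ins = csv.DictReader(open('c:/bigfile.csv', 'rb'), headers)
    for line in ins:
        rowstring = reduce(lambda x, y: x + y, line.itervalues())
        # partition on row length mod n instead of on DayOfWeek
        if len(rowstring) % n == i:
            hvalue = hash(rowstring)
            if hvalue not in htable:
                htable[hvalue] = None
                outs.writerow(line)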
Wayne Werner provided a good practical solution below; I was curious whether there is a better/faster/simpler way to do this from an algorithmic perspective.
P.S. I'm limited to Python 2.5.