Hi folks,

I'm trying to do a somewhat sophisticated diff between individual rows in two CSV files. I need to ensure that a row from one file does not appear in the other file, but I am given no guarantee of the order of the rows in either file. As a starting point, I've been trying to compare the hashes of the string representations of the rows (i.e. Python lists). For example:

import csv

hashes = []
for row in csv.reader(open('old.csv','rb')):
  hashes.append( hash(str(row)) )

for row in csv.reader(open('new.csv','rb')):
  if hash(str(row)) not in hashes:
    print 'Not found'

But this is failing miserably. I am constrained by artificially imposed memory limits that I cannot change, so I went with the hashes instead of storing and comparing the lists directly. Some of the files I am comparing can be hundreds of megabytes in size. Any ideas for a way to accurately compress Python lists so that they can be compared for simple equality with other lists, i.e. a hashing scheme that actually works? Bonus points: why didn't the above method work?

EDIT:

Thanks for all the great suggestions! Let me clarify some things. "Miserable failure" means that two rows containing the exact same data, after being read in by the csv.reader object, are not hashing to the same value after calling str on the list object. I shall try hashlib, as some of the suggestions below recommend. I also cannot hash the raw file lines, since two lines can contain the same data but different characters, as below:

1, 2.3, David S, Monday
1, 2.3, "David S", Monday

I am also already doing things like stripping strings to make the data more uniform, but seemingly to no avail. I'm not looking for extremely smart diff logic, e.g. treating 0 as the same as 0.0.

EDIT 2:

Problem solved. What basically worked is that I needed to do a bit more pre-formatting, like converting ints and floats and so forth, AND I needed to change my hashing function. Both changes together did the job for me.
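
Roughly, the working approach looks like this (a simplified sketch, not my exact code; `normalize` and the separator choice are illustrative):

import csv
import hashlib

def normalize(row):
  # Illustrative pre-formatting: strip whitespace and canonicalize
  # numeric fields so '2.3' and ' 2.30' end up identical.
  out = []
  for field in row:
    field = field.strip()
    try:
      field = repr(float(field))
    except ValueError:
      pass
    out.append(field)
  return '\x00'.join(out)  # unambiguous field separator

def row_digest(row):
  # hashlib gives a stable digest of the normalized fields
  return hashlib.sha1(normalize(row)).hexdigest()

hashes = set()
for row in csv.reader(open('old.csv', 'rb')):
  hashes.add(row_digest(row))

for row in csv.reader(open('new.csv', 'rb')):
  if row_digest(row) not in hashes:
    print 'Not found'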

+1  A: 

More information would be needed on what exactly "failing miserably" means. If you are just not getting correct comparisons between the two, perhaps hashlib might solve that.

I've run into trouble previously when using the built-in hash() function, and solved it by switching to hashlib.

Edit: As someone suggested in another post, the issue could be the assumption that matching lines in the two files must be EXACTLY the same. You might want to try parsing the CSV fields and appending them to a string with identical formatting (maybe trim spaces, force lowercase, etc.) before computing the hash.
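
Something along these lines, for example (just a sketch; the `|` separator assumes your data never contains that character):

import hashlib

def canonical(row):
  # Trim spaces and force lowercase so cosmetic differences vanish.
  return '|'.join(field.strip().lower() for field in row)

def digest(row):
  return hashlib.md5(canonical(row)).hexdigest()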

Tyler
A: 

This is likely a problem with (mis)using hash. See this SO question; as the answers there point out, you probably want hashlib.
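
For example, a stable drop-in for `hash(str(row))` might look like this (illustrative):

import hashlib

row = ['1', '2.3', 'David S', 'Monday']
# md5 of the same bytes is identical across runs and platforms
print hashlib.md5(str(row)).hexdigest()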

Hank Gay
Just hashing the raw lines of the file assumes that there are not two rows that are identical data-wise but represented differently, i.e. with different quoting, escaping, spacing, etc.
intuited
@intuited If you're immediately invoking `str()` on the result of the read, aren't you still running into those same problems?
Hank Gay
@Hank Gay: Feeding it through csv.reader() normalizes it; see the question edit. eg `>>> cr = csv.reader(['1,"2",3', '1,2,3']); str(cr.next()) == str(cr.next())` gives `True`
intuited
@intuited Hadn't noticed the edit, thanks.
Hank Gay
+4  A: 

It's hard to give a great answer without knowing more about your constraints, but if you can store a hash for each line of each file then you should be OK. At the very least you'll need to be able to store the hash list for one file, which you would then sort and write to disk; then you can march through the two sorted lists together.

The only reason I can imagine the above not working as written is that your hashing function doesn't always give the same output for a given input. You could test whether a second run through old.csv generates the same list. It may have to do with errant spaces, tabs instead of spaces, differing capitalization, and the like.

Mind, even if the hashes are equivalent you don't know that the lines match; you only know that they might match. You still need to check that the candidate lines do match. (You may also get the situation where more than one line in the input file generates the same hash, so you'll need to handle that as well.)

After you fill your `hashes` variable, you should consider turning it into a set (`hashes = set(hashes)`) so that your lookups can be faster than linear.
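
A rough sketch of the verification idea, using file offsets so candidate rows can be re-read rather than stored (names are illustrative, and it assumes one CSV record per physical line):

import csv
import hashlib

def digest(row):
  return hashlib.sha1(str(row)).hexdigest()

# Map each hash to the file offsets of the lines that produced it,
# so a hash hit can be verified against the actual row.
offsets = {}
old = open('old.csv', 'rb')
while True:
  pos = old.tell()
  line = old.readline()
  if not line:
    break
  row = csv.reader([line]).next()
  offsets.setdefault(digest(row), []).append(pos)

for row in csv.reader(open('new.csv', 'rb')):
  found = False
  for pos in offsets.get(digest(row), []):
    old.seek(pos)
    if csv.reader([old.readline()]).next() == row:
      found = True
      break
  if not found:
    print 'Not found'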

dash-tom-bang
+1 for using a set. This also prevents storing the same hash more than once, thereby saving memory.
intuited
Whoops, actually I misread that. Wouldn't it be better to make `hashes` a set from the beginning? He says that memory is critical, not processing time.
intuited
all of this is true, but it requires `{hashval: [row1, row2, ... rowN]}` be stored so the collision cases can be checked, which has the net effect of forming an implied set of hashvals as the keys of the dict. `rowN` would be better stored as `file_offsetN` to avoid myriad walks back through the file.
msw
@msw I am actually trying out the hash:row dictionaries now.
daveslab
Yes, the dict approach (hash: set_of_rows) seems like it'd be the most efficient approach all around. The thinking behind my answer in general though was to provide similar functionality to that demonstrated by the OP, but then the middle paragraph addresses possible considerations that were not in the OP's implementation. As memory is a concern, it is not automatically feasible to store all of the input data in memory, but at the very least there needs to be a mapping of hash value to information-which-can-be-used-to-find-the-row.
dash-tom-bang
+2  A: 

Given the loose syntactic definition of CSV, it is possible for two rows to be semantically equal while being lexically different. The various Dialect definitions give some clue as to how two rows could be individually well-formed but incommensurable. And this example shows how they could be in the same dialect and not string-equivalent:

0, 0
0, 0.0

More information would help yield a better answer to your question.

msw
A: 

You need to say what your problem really is. Your description "I need to ensure that a row from one file does not appear in the other file" is consistent with the body of your second loop being `if hash(...) in hashes: print "Found (an interloper)"` rather than what you have.

We can't tell you "why didn't the above method work" because you haven't told us what the symptoms of "failed miserably" and "didn't work" are.

John Machin
+1  A: 

I'm pretty sure that "failing miserably" refers to a failure in running time, which comes from your current algorithm being O(N^2); that is quite bad for files this big. As has been mentioned, you can use a set to alleviate this problem (bringing it to O(N)) or, if you aren't able to do that for some reason, you can sort the list of hashes and use a binary search on it (O(N log N), which is also doable). You can use the bisect module if you go the binary search route.
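
For instance, the bisect variant could look something like this (a sketch, reusing the original hashing for illustration):

import bisect
import csv

hashes = sorted(hash(str(row)) for row in csv.reader(open('old.csv', 'rb')))

for row in csv.reader(open('new.csv', 'rb')):
  h = hash(str(row))
  i = bisect.bisect_left(hashes, h)  # O(log N) lookup
  if i == len(hashes) or hashes[i] != h:
    print 'Not found'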

Also, as has been mentioned, you may have the problem of hash collisions: two lines yielding the same hash when the lines aren't exactly the same. If you discover that this is happening, you will have to store, with each hash, information about where to seek the corresponding line in the old.csv file, then seek out that line and compare the two.

An alternative to your current method is to sort the two files beforehand (using some sort of on-disk merge sort, perhaps, or a shell sort) and then, keeping a pointer into each file, compare the current lines: if they match, advance both; if not, advance the pointer whose line sorts lesser, as sketched below. This algorithm is also O(N log N) as long as an O(N log N) method is used for the sorting. The sorting could also be done by loading each file into a database and having the database sort it.
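
The lockstep comparison over the two pre-sorted files might look like this (a sketch; old_sorted.csv and new_sorted.csv are hypothetical outputs of the external sort):

old = open('old_sorted.csv', 'rb')
new = open('new_sorted.csv', 'rb')
a, b = old.readline(), new.readline()
while b:
  if a == b:
    # same line in both files; advance both pointers
    a, b = old.readline(), new.readline()
  elif a and a < b:
    a = old.readline()
  else:
    # b sorts before every remaining old line: not in old.csv
    print 'Not found:', b.rstrip()
    b = new.readline()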

Justin Peel
A: 

Have you perhaps considered sorting the files first (if possible)? You'll have to go over the data twice, of course, but that might solve the memory problem.

flyingcrab