Hi folks,

I'm trying to do a somewhat sophisticated diff between individual rows in two CSV files. I need to ensure that a row from one file does not appear in the other file, but I am given no guarantee of the order of the rows in either file. As a starting point, I've been trying to compare the hashes of the string representations of the rows (i.e. Python lists). For example:

import csv

hashes = []
for row in csv.reader(open('old.csv','rb')):
  hashes.append( hash(str(row)) )

for row in csv.reader(open('new.csv','rb')):
  if hash(str(row)) not in hashes:
    print 'Not found'

But this is failing miserably. I am constrained by artificially imposed memory limits that I cannot change, so I went with the hashes instead of storing and comparing the lists directly. Some of the files I am comparing can be hundreds of megabytes in size. Any ideas for a way to accurately compress Python lists so that they can be compared for simple equality with other lists, i.e. a hashing scheme that actually works? Bonus points: why didn't the above method work?

EDIT:

Thanks for all the great suggestions! Let me clarify some things. "Miserable failure" means that two rows containing the exact same data, after being read in by the csv.reader object, are not hashing to the same value after calling str on the list object. I shall try hashlib, as some of the suggestions below recommend. I also cannot hash the raw file lines, since two lines can contain the same data but different characters, as below:

1, 2.3, David S, Monday
1, 2.3, "David S", Monday

I am also already doing things like stripping strings to make the data more uniform, but seemingly to no avail. I'm not looking for extremely smart diff logic, e.g. treating 0 as the same as 0.0.

EDIT 2:

Problem solved. What basically worked is that I needed to do a bit more pre-formatting, like converting ints and floats and so forth, AND I needed to change my hashing function. Both changes together did the job for me.
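
Roughly, the working approach looks like this (a simplified sketch, not my exact code; `normalize` and the separator choice are illustrative):

import csv
import hashlib

def normalize(row):
  # Illustrative pre-formatting: strip whitespace and canonicalize
  # numeric fields so '2.3' and ' 2.30' end up identical.
  out = []
  for field in row:
    field = field.strip()
    try:
      field = repr(float(field))
    except ValueError:
      pass
    out.append(field)
  return '\x00'.join(out)  # unambiguous field separator

def row_digest(row):
  # hashlib gives a stable digest of the normalized fields
  return hashlib.sha1(normalize(row)).hexdigest()

hashes = set()
for row in csv.reader(open('old.csv', 'rb')):
  hashes.add(row_digest(row))

for row in csv.reader(open('new.csv', 'rb')):
  if row_digest(row) not in hashes:
    print 'Not found'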

+1  A: 

More information would be needed on what exactly "failing miserably" means. If you are just not getting correct comparisons between the two, perhaps hashlib might solve that.

I've run into trouble previously when using the built-in hash() function, and solved it by switching to hashlib.

Edit: As someone suggested in another post, the issue could be the assumption that matching lines in the two files must be EXACTLY the same. You might want to try parsing the CSV fields and appending them to a string with identical formatting (maybe trim spaces, force lowercase, etc.) before computing the hash.
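
Something along these lines, for example (just a sketch; the `|` separator assumes your data never contains that character):

import hashlib

def canonical(row):
  # Trim spaces and force lowercase so cosmetic differences vanish.
  return '|'.join(field.strip().lower() for field in row)

def digest(row):
  return hashlib.md5(canonical(row)).hexdigest()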

Tyler
A: 

This is likely a problem with (mis)using hash. See this SO question; as the answers there point out, you probably want hashlib.
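
For example, a stable drop-in for `hash(str(row))` might look like this (illustrative):

import hashlib

row = ['1', '2.3', 'David S', 'Monday']
# md5 of the same bytes is identical across runs and platforms
print hashlib.md5(str(row)).hexdigest()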

Hank Gay
Just hashing the raw lines of the file assumes that there are not two rows that are identical data-wise but represented differently, i.e. with different quoting, escaping, spacing, etc.
intuited
@intuited If you're immediately invoking `str()` on the result of the read, aren't you still running into those same problems?
Hank Gay
@Hank Gay: Feeding it through csv.reader() normalizes it; see the question edit. eg `>>> cr = csv.reader(['1,"2",3', '1,2,3']); str(cr.next()) == str(cr.next())` gives `True`
intuited
@intuited Hadn't noticed the edit, thanks.
Hank Gay
+4  A: 

It's hard to give a great answer without knowing more about your constraints, but if you can store a hash for each line of each file then you should be OK. At the very least you'll need to be able to store the hash list for one file, which you would then sort and write to disk; then you can march through the two sorted lists together.

The only reason I can imagine the above not working as written is that your hashing function doesn't always give the same output for a given input. You could test whether a second run through old.csv generates the same list. It may have to do with errant spaces, tabs instead of spaces, differing capitalization, and the like.

Mind, even if the hashes are equivalent you don't know that the lines match; you only know that they might match. You still need to check that the candidate lines do match. (You may also get the situation where more than one line in the input file generates the same hash, so you'll need to handle that as well.)

After you fill your `hashes` variable, you should consider turning it into a set (`hashes = set(hashes)`) so that your lookups can be faster than linear.
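
A rough sketch of the verification idea, using file offsets so candidate rows can be re-read rather than stored (names are illustrative, and it assumes one CSV record per physical line):

import csv
import hashlib

def digest(row):
  return hashlib.sha1(str(row)).hexdigest()

# Map each hash to the file offsets of the lines that produced it,
# so a hash hit can be verified against the actual row.
offsets = {}
old = open('old.csv', 'rb')
while True:
  pos = old.tell()
  line = old.readline()
  if not line:
    break
  row = csv.reader([line]).next()
  offsets.setdefault(digest(row), []).append(pos)

for row in csv.reader(open('new.csv', 'rb')):
  found = False
  for pos in offsets.get(digest(row), []):
    old.seek(pos)
    if csv.reader([old.readline()]).next() == row:
      found = True
      break
  if not found:
    print 'Not found'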

dash-tom-bang
+1 for using a set. This also prevents storing the same hash more than once, thereby saving memory.
intuited
Whoops, actually I misread that. Wouldn't it be better to make `hashes` a set from the beginning? He says that memory is critical, not processing time.
intuited
all of this is true, but it requires `{hashval: [row1, row2, ... rowN]}` be stored so the collision cases can be checked, which has the net effect of forming an implied set of hashvals as the keys of the dict. `rowN` would be better stored as `file_offsetN` to avoid myriad walks back through the file.
msw
@msw I am actually trying out the hash:row dictionaries now.
daveslab
Yes, the dict approach (hash: set_of_rows) seems like it'd be the most efficient approach all around. The thinking behind my answer in general though was to provide similar functionality to that demonstrated by the OP, but then the middle paragraph addresses possible considerations that were not in the OP's implementation. As memory is a concern, it is not automatically feasible to store all of the input data in memory, but at the very least there needs to be a mapping of hash value to information-which-can-be-used-to-find-the-row.
dash-tom-bang
+2  A: 

Given the loose syntactic definition of CSV, it is possible for two rows to be semantically equal while being lexically different. The various Dialect definitions give some clue as to how two rows could be individually well-formed but incommensurable. And this example shows how they could be in the same dialect and not string-equivalent:

0, 0
0, 0.0

More information would help yield a better answer to your question.

msw
A: 

You need to say what your problem really is. Your description "I need to ensure that a row from one file does not appear in the other file" is consistent with the body of your second loop being `if hash(...) in hashes: print "Found (an interloper)"` rather than what you have.

We can't tell you "why didn't the above method work" because you haven't told us what the symptoms of "failed miserably" and "didn't work" are.

John Machin
+1  A: 

I'm pretty sure that "failing miserably" refers to a failure in running time, which comes from your current algorithm being O(N^2); that is quite bad for files this big. As has been mentioned, you can use a set to alleviate this problem (bringing it to O(N)) or, if you aren't able to do that for some reason, you can sort the list of hashes and use a binary search on it (O(N log N), which is also doable). You can use the bisect module if you go the binary search route.
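
For instance, the bisect variant could look something like this (a sketch, reusing the original hashing for illustration):

import bisect
import csv

hashes = sorted(hash(str(row)) for row in csv.reader(open('old.csv', 'rb')))

for row in csv.reader(open('new.csv', 'rb')):
  h = hash(str(row))
  i = bisect.bisect_left(hashes, h)  # O(log N) lookup
  if i == len(hashes) or hashes[i] != h:
    print 'Not found'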

Also, as has been mentioned, you may have the problem of hash collisions: two lines yielding the same hash when the lines aren't exactly the same. If you discover that this is happening, you will have to store, with each hash, information about where to seek the corresponding line in the old.csv file, then seek out that line and compare the two.

An alternative to your current method is to sort the two files beforehand (using some sort of on-disk merge sort, perhaps, or a shell sort) and then, keeping a pointer into each file, compare the current lines: if they match, advance both; if not, advance the pointer whose line sorts lesser, as sketched below. This algorithm is also O(N log N) as long as an O(N log N) method is used for the sorting. The sorting could also be done by loading each file into a database and having the database sort it.
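
The lockstep comparison over the two pre-sorted files might look like this (a sketch; old_sorted.csv and new_sorted.csv are hypothetical outputs of the external sort):

old = open('old_sorted.csv', 'rb')
new = open('new_sorted.csv', 'rb')
a, b = old.readline(), new.readline()
while b:
  if a == b:
    # same line in both files; advance both pointers
    a, b = old.readline(), new.readline()
  elif a and a < b:
    a = old.readline()
  else:
    # b sorts before every remaining old line: not in old.csv
    print 'Not found:', b.rstrip()
    b = new.readline()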

Justin Peel
A: 

Have you perhaps considered sorting the files first (if possible)? You'll have to go over the data twice, of course, but that might solve the memory problem.

flyingcrab