I have a table in PyTables with ~50 million records. The combination of two fields (specifically userID and date) should be unique (i.e. a user should have at most one record per day), but I need to verify that this is indeed the case.

Illustratively, my table looks like this:

userID |   date
A      |    1
A      |    2
B      |    1
B      |    2
B      |    2   <- bad! Problem with the data!

Additional details:

  • The table is currently 'mostly' sorted.
  • I can just barely pull one column into memory as a numpy array, but I can't pull two into memory at the same time.
  • Both userID and date are integers
A: 

I don't know much about PyTables, but I would try this approach:

  1. For each userID, fetch all of that user's (userID, date) pairs.
  2. assert len(rows) == len(set(rows)) - this assertion holds if and only if every (userID, date) tuple in the rows list is unique.
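A minimal sketch of that check in plain Python. The PyTables access shown in the comment is an assumption (`tbl` is a hypothetical table handle); the uniqueness test itself is just the len/set comparison from step 2:

```python
def dates_are_unique(dates):
    """True if no date occurs more than once in the given sequence."""
    return len(dates) == len(set(dates))

# With PyTables, the per-user dates might be fetched with something like
#   dates = [row['date'] for row in tbl.where('userID == uid')]
# (hypothetical table handle `tbl`). Here we just demonstrate the logic
# on the example data from the question:
assert dates_are_unique([1, 2])        # user A: fine
assert not dates_are_unique([1, 2, 2]) # user B: duplicate date -> problem
```

Since only one user's dates are in memory at a time, this stays well within the one-column memory budget described in the question.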
Otto Allmendinger
A: 

It seems that indexes in PyTables are limited to single columns.

I would suggest adding a hash column and putting an index on it. Define each row's unique key as the concatenation of the other columns, with a separator between fields so that two different rows can never yield the same string. The hash column could store this string directly, but if your data is long you will want to run it through a hash function first; a fast hash like md5 or sha1 is fine for this application.

Compute the hashed data and check if it's in the DB. If so, you know you hit some duplicate data. If not, you can safely add it.
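A sketch of that idea using md5 from Python's standard hashlib (the separator character and sample rows are illustrative; in practice the hashes would live in the indexed column rather than an in-memory set):

```python
import hashlib

def row_hash(user_id, date):
    # The '|' separator keeps distinct pairs from concatenating to the
    # same string, e.g. ('A1', 2) -> 'A1|2' vs ('A', 12) -> 'A|12'.
    key = f"{user_id}|{date}".encode()
    return hashlib.md5(key).hexdigest()

# Detect duplicates by checking each row's hash against those already seen.
seen = set()
for uid, date in [('A', 1), ('A', 2), ('B', 2), ('B', 2)]:
    h = row_hash(uid, date)
    if h in seen:
        print("duplicate row:", uid, date)  # fires for the second ('B', 2)
    seen.add(h)
```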

Craig Younkins