I have a table in PyTables with ~50 million records. The combination of two fields (specifically userID and date) should be unique (i.e. a user should have at most one record per day), but I need to verify that this is indeed the case.

Illustratively, my table looks like this:

userID |   date
A      |    1
A      |    2
B      |    1
B      |    2
B      |    2   <- bad! Problem with the data!

Additional details:

  • The table is currently 'mostly' sorted.
  • I can just barely pull one column into memory as a numpy array, but I can't pull two into memory at the same time.
  • Both userID and date are integers
A: 

I don't know much about PyTables, but I would try this approach:

  1. For each userID, fetch all of that user's (userID, date) pairs.
  2. assert len(rows) == len(set(rows)) - this assertion holds if and only if every (userID, date) tuple in the rows list is unique.
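A minimal sketch of that check in plain Python. The PyTables access shown in the comment is an assumption (`tbl` is a hypothetical table handle); the uniqueness test itself is just the len/set comparison from step 2:

```python
def dates_are_unique(dates):
    """True if no date occurs more than once in the given sequence."""
    return len(dates) == len(set(dates))

# With PyTables, the per-user dates might be fetched with something like
#   dates = [row['date'] for row in tbl.where('userID == uid')]
# (hypothetical table handle `tbl`). Here we just demonstrate the logic
# on the example data from the question:
assert dates_are_unique([1, 2])        # user A: fine
assert not dates_are_unique([1, 2, 2]) # user B: duplicate date -> problem
```

Since only one user's dates are in memory at a time, this stays well within the one-column memory budget described in the question.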
Otto Allmendinger
A: 

It seems that indexes in PyTables are limited to single columns.

I would suggest adding a hash column and putting an index on it. Define each row's unique key as the concatenation of the other columns, with a separator between fields so that two different rows can never yield the same string. The hash column could store this string directly, but if your data is long you will want to run it through a hash function first; a fast hash like md5 or sha1 is fine for this application.

Compute the hashed data and check if it's in the DB. If so, you know you hit some duplicate data. If not, you can safely add it.
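A sketch of that idea using md5 from Python's standard hashlib (the separator character and sample rows are illustrative; in practice the hashes would live in the indexed column rather than an in-memory set):

```python
import hashlib

def row_hash(user_id, date):
    # The '|' separator keeps distinct pairs from concatenating to the
    # same string, e.g. ('A1', 2) -> 'A1|2' vs ('A', 12) -> 'A|12'.
    key = f"{user_id}|{date}".encode()
    return hashlib.md5(key).hexdigest()

# Detect duplicates by checking each row's hash against those already seen.
seen = set()
for uid, date in [('A', 1), ('A', 2), ('B', 2), ('B', 2)]:
    h = row_hash(uid, date)
    if h in seen:
        print("duplicate row:", uid, date)  # fires for the second ('B', 2)
    seen.add(h)
```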

Craig Younkins