I have a table in PyTables with ~50 million records. The combination of two fields (specifically userID and date) should be unique (i.e. a user should have at most one record per day), but I need to verify that this is indeed the case.
For illustration, my table looks like this:
userID | date
A | 1
A | 2
B | 1
B | 2
B | 2 <- bad! Problem with the data!
Additional details:
- The table is currently 'mostly' sorted.
- I can just barely pull one column into memory as a numpy array, but I can't pull two into memory at the same time.
- Both userID and date are integers.
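
One way to check this under those memory constraints, sketched here as an assumption rather than a tested solution, is to scan the table several times and on each pass only track the users whose userID falls in one partition (userID % n_passes). Each pass then needs to hold only a fraction of the (userID, date) pairs in a set. The names find_duplicate_pairs, read_chunks and n_passes are hypothetical; read_chunks stands in for whatever code re-reads the table in slices (for example, reading row ranges from the PyTables table) and yields rows as (userID, date) tuples.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[int, int]  # (userID, date)


def find_duplicate_pairs(
    read_chunks: Callable[[], Iterable[Iterable[Pair]]],
    n_passes: int = 8,
) -> List[Pair]:
    """Return every (userID, date) pair that occurs more than once.

    read_chunks is a hypothetical callable: each call must restart the scan
    from the first row and yield chunks of (userID, date) rows.
    n_passes trades run time for memory: more passes means a smaller set
    per pass, at the cost of re-reading the table more often.
    """
    duplicates = set()
    for partition in range(n_passes):
        seen = set()  # only holds pairs belonging to this partition
        for chunk in read_chunks():
            for user_id, date in chunk:
                # All rows of a given user land in the same partition,
                # so a duplicate pair is always caught within one pass.
                if user_id % n_passes != partition:
                    continue
                key = (user_id, date)
                if key in seen:
                    duplicates.add(key)
                else:
                    seen.add(key)
    return sorted(duplicates)


if __name__ == "__main__":
    # The illustrative table from above, with A -> 1 and B -> 2.
    rows = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 2)]
    print(find_duplicate_pairs(lambda: [rows[:3], rows[3:]], n_passes=2))
    # prints [(2, 2)]
```

This pure-Python loop will be slow for ~50 million rows; it is only meant to show the partitioning idea. The same idea should carry over to vectorised numpy operations on each chunk, and because the table is mostly sorted, partitioning by contiguous userID ranges instead of a modulus might let each pass read only part of the table.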