views:

91

answers:

4

I have a 15 integer column with 5,000,000 rows in a table. Given a input record containing 15 integers I need to compare the input record with the 5,000,000 record table and obtain all matching rows.

Note1: All integers within a row are unique
Note2: the order of columns matching and the input record is not important.
for example: 1, 10, 15, 23, 9, 22, 99, 11, 19, 32, 45, 21, 76, 12, 33 and 33, 10, 15, 99, 11, 19, 32, 45, 21, 23, 9, 22, 76, 12, 1 should yield a match result

Is it possible to implement a hashing function / bitwise operation to generate a unique index for each row. The function can return the same index for 2 rows if the values in the records are same

+2  A: 

This isn't much, but should get you started.

You probably want a hash function that generates as few collisions as possible; but it also has to be commutative (ie: the order in which you add numbers to the hash is irrelevant). You can accomplish that by using a combination of XOR's and bit shifts (see this page).

You might want to store the hash in another column. Then you can hash the input you are looking for and lookup the hash on your database. Note that hashes allow for false positives, so you'd still need to check if the candidate rows are actually what you want (ie: sort everything and compare).

NullUserException
A: 

For fast queries, you can preprocess the table. I would create a HashMap where the sorted array of the 15 values is the key, and a list of column indices where sorting results into the same array are the values. For example, an entry might look like this:

[1,9,10,11,12,15,19,21,22,23,32,33,45,76,99] => [12, 33]

so the 15 values are in column 12, and 33.

For the key, you have to create a custom hash and equals function.

  • A simple method for the hash: sort the query, and calculate hash *= 120941 + x for each entry. See e.g. here for much better hash functions.
  • For equality check, just compare the numbers of each index of the sorted query with the key.
martinus
+1  A: 

Do the job properly and sort the integers in each row and sort the rows in the table. Over the lifetime of the table's use the cost of sorting will be less than all the hashing and unhashing that you are leaning towards. And while you are at it, build an index into the table, probably from the first 2 or 3 integers in each row.

High Performance Mark
Yes, thats the right approach..
Amrinder Arora
A: 

Exactly as "High Performance Mark" suggested (+1 from my side) - indeed, that is the right approach. You should keep the rows sorted (so that 15 integers are in columns in sorted order). This way, when comparing two rows, you can easily find if they are identical or not (start from any end, and continue till you find mismatch - if you all 15 numbers match then it is a match).

If you just need a hash function for indexing, then also the same idea helps you out: sort the 15 numbers in a row, and create a hash that is equal to:

Sum for i=1 to 15 (a_i * k^i) // k is a positive integer - see below

This gives you a pretty decent index. If you can keep k to be very large, this becomes provably collision free, but the size of the indexed value increases. Even if k is 2, it is largely collision free for 5 million rows and 15 columns assuming integer range is 2^16.

Another idea - since you are largely looking at heuristics, you can also consider a simpler approach:

Keep 3 more columns, for min, max and the sum of the 15 columns. Checking if these 3 match for 2 rows will eliminate a LARGE number of true negatives. Some false positives will still remain. (It is easy to note that using k=1 in the scheme above is the same as keeping the sum of columns as the index value, which is one of 3 values mentioned in this subsolution.)

[A perhaps off-limits question - is your DB design flexible? This does not appear to be a stable design as columns seem to be representing child entities, but I don't have details to be able to say that conclusively.]

Amrinder Arora