I've got a table of 5,651,744 rows with a six-column primary key (three ints, a smallint, a varchar(39), and a varchar(2)). I'm looking to improve join performance between this table and a second table of 37 million rows, which shares this primary key plus one additional column.
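For concreteness, the main table looks roughly like this (the column names are real, but which type goes with which column below is my paraphrase, not the exact DDL):

CREATE TABLE CUST_ACCT_PRFTBLT (
    DATA_DT_ID   int          NOT NULL,
    BANK_NUM     int          NOT NULL,
    COST_CTR_NUM int          NOT NULL,
    GL_ACCT_NUM  smallint     NOT NULL,
    ACCT_NUM     varchar(39)  NOT NULL,
    APPN_CD      varchar(2)   NOT NULL,
    -- ... plus the non-key columns ...
    PRIMARY KEY (DATA_DT_ID, BANK_NUM, COST_CTR_NUM,
                 GL_ACCT_NUM, ACCT_NUM, APPN_CD)
);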
In anticipation of adding a hash-key column, I ran an analysis of the candidate values and found that 18,733 rows share a hash value with at least one other row:
-- How many rows share a CHECKSUM value with at least one other row?
SELECT SUM(CT)
FROM (
    SELECT HASH_KEY,
           COUNT(*) AS CT
    FROM (
        SELECT CHECKSUM(DATA_DT_ID, BANK_NUM, COST_CTR_NUM,
                        GL_ACCT_NUM, ACCT_NUM, APPN_CD) AS HASH_KEY
        FROM CUST_ACCT_PRFTBLT
    ) AS X
    GROUP BY HASH_KEY
    HAVING COUNT(*) > 1
) AS Y;

-- Total row count, to put the collision count in proportion
SELECT COUNT(*)
FROM CUST_ACCT_PRFTBLT;
The collision count is roughly twice as high with BINARY_CHECKSUM().
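For reference, the column I'm anticipating would be a persisted computed column along these lines (a sketch; I haven't settled on the exact definition):

ALTER TABLE CUST_ACCT_PRFTBLT
    ADD HASH_KEY AS CHECKSUM(DATA_DT_ID, BANK_NUM, COST_CTR_NUM,
                             GL_ACCT_NUM, ACCT_NUM, APPN_CD) PERSISTED;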
Does this collision rate (18,733 / 5,651,744 ≈ 0.33%) seem too high, given how small a fraction of the 32-bit destination space I'm covering? And if collisions are this frequent, is there any benefit to joining on this manufactured key first, at the cost of an extra 4 bytes per row, given that you still have to compare the regular key columns to handle the occasional collision?
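The join pattern I'm weighing looks like this (DTL_TABLE is a stand-in name for the 37-million-row table, which would get a matching computed column):

SELECT COUNT(*)
FROM CUST_ACCT_PRFTBLT AS P
JOIN DTL_TABLE AS D
    ON  D.HASH_KEY     = P.HASH_KEY      -- narrow 4-byte comparison first
    AND D.DATA_DT_ID   = P.DATA_DT_ID    -- full key still compared,
    AND D.BANK_NUM     = P.BANK_NUM      -- so the occasional collision
    AND D.COST_CTR_NUM = P.COST_CTR_NUM  -- is filtered out
    AND D.GL_ACCT_NUM  = P.GL_ACCT_NUM
    AND D.ACCT_NUM     = P.ACCT_NUM
    AND D.APPN_CD      = P.APPN_CD;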