In my line of work, we have to do a lot of duplicate checking on name and address data. In the past, I've always just created an index on the relevant columns (firstname, lastname, address, city, state, zip, etc.) and queried against those columns directly. Lately, I've been wondering if it makes more sense to hash those values before inserting and index only the hash for comparison. Assuming I use something like SHA-256 for the hash, I shouldn't have to worry about collisions (my data set isn't that large), and my index key would be only 32 bytes per record, or 64 bytes if I store it as a hex string, whereas an index on the NACSZ data might be 200-300 bytes.
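To make the idea concrete, here's a minimal sketch of the kind of schema I have in mind (the table and column names are made up; also, HASHBYTES('SHA2_256', ...) only exists in SQL Server 2012 and later, so on earlier versions I'd compute the hash in application code before the insert):

-- Sketch only: a hypothetical table with the NACSZ columns plus a 32-byte hash column
create table CustomerAddress (
    id         int identity(1,1) primary key,
    firstname  varchar(50),
    lastname   varchar(50),
    address    varchar(100),
    city       varchar(50),
    state      char(2),
    zip        varchar(10),
    nacsz_hash binary(32) not null  -- SHA-256 of the concatenated, canonicalized NACSZ values
)

create nonclustered index IX_CustomerAddress_nacsz_hash
    on CustomerAddress (nacsz_hash)

-- At insert time the hash would be computed from the canonicalized values, e.g.:
-- hashbytes('SHA2_256', upper(firstname) + '|' + upper(lastname) + '|' + upper(address)
--                       + '|' + upper(city) + '|' + upper(state) + '|' + zip)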

Is this a good idea? Am I stupid for not considering it before? Is there a valid reason not to do so? And as a matter of style, what would be a good name for this column?

+2  A: 

Good idea? Sure. Are you stupid for not considering it before? Definitely not. I also can't think of any valid reason not to do so, particularly if your dataset isn't that large. Of course, that may make someone wonder whether there's a valid reason to do it at all (i.e. if your dataset isn't that large, why do you care that the NACSZ index would be 200-300 bytes per record).

As always, there are lots of things to consider to determine what is best for your scenario (e.g. usage patterns, selectivity, read vs. write ratios), so it's very hard for anyone to provide a definitive answer on something like this.

One thing to consider, given that you are using SQL Server (not sure which version, but I'll assume 2005 or later): though there is a built-in function for generating SHA/MDx hashes (HASHBYTES), it isn't as easy to use as the similar CHECKSUM hashing function. HASHBYTES requires you to build the input string to be hashed explicitly, whereas CHECKSUM can simply take a valid column list. A simple example shows the differences:

-- CHECKSUM - easily hash an entire record over a set, regardless of column datatypes
select checksum(c.*)
from sys.columns c

-- CHECKSUM - easily hash a subset of columns for a record over a set, regardless of column datatypes
select checksum(c.name, c.object_id, c.column_id)
from sys.columns c

-- HASHBYTES - this DOES NOT work
select hashbytes('MD5', c.*)
from sys.columns c

-- HASHBYTES - this DOES NOT work either
select hashbytes('MD5', c.name, c.object_id, c.column_id)
from sys.columns c

-- HASHBYTES - you have to explicitly build the string, casting to valid character-based datatypes
select hashbytes('MD5', a.name + cast(a.object_id as varchar(50)) + cast(a.column_id as varchar(50)))
from sys.columns a

The ease of use also comes in very handy when performing joins/unions/etc. across different tables/sets of data.
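For example (a sketch only; StagingTable is a hypothetical table with the same NACSZ columns as TableName), matching rows across two sets on the hash needs no explicit string building:

-- Sketch only: join two tables on a CHECKSUM over their NACSZ columns.
-- Because CHECKSUM can collide, residual equality checks on the underlying
-- columns would still be needed (see below).
select s.firstname, s.lastname, s.address, s.city, s.state, s.zip
from StagingTable s
join TableName t
    on checksum(t.firstname, t.lastname, t.address, t.city, t.state, t.zip)
     = checksum(s.firstname, s.lastname, s.address, s.city, s.state, s.zip)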

Obviously, the CHECKSUM hash is just a 32-bit value and will likely lead to collisions; however, this may not make any difference if your primary goal is to create a seekable index to improve performance and then perform additional secondary checks. For example, if you create a CHECKSUM() hash column on the table and index that column alone, you'll end up with a small index that can still be used to seek into, with secondary/residual comparisons on the NACSZ values performed only on the small subset of rows that match the checksum. A query in this scenario might look like this:

declare @hash int

select @hash = checksum(@first_name,@last_name,@address,@city,@state,@zip)

select t.firstname, t.lastname, t.address, t.city, t.state, t.zip
from TableName t
where t.record_hash = @hash
and  t.firstname = @first_name
and  t.lastname = @last_name
and  t.address = @address
and  t.city = @city
and  t.state = @state
and  t.zip = @zip

The optimizer will seek primarily on the hash index and perform a residual check on the NACSZ values for the subset of records that match the hash value.
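To make the setup concrete, here's a sketch of one way the record_hash column and its index might be created (assuming the TableName table and NACSZ columns from the example above):

-- Sketch only: add record_hash as a persisted computed CHECKSUM column and index it.
-- CHECKSUM is deterministic, so the computed column can be persisted and indexed.
alter table TableName
    add record_hash as checksum(firstname, lastname, address, city, state, zip) persisted
go

create nonclustered index IX_TableName_record_hash
    on TableName (record_hash)
go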

Naturally, if you are going to generate the hash in application code, this is likely less of an issue altogether.

As for naming conventions/style, I can't say I've ever heard of anything specific for this type of use; however, in the ones I've seen/used, the column names typically include a designation of 'hash' and the type of hash it is - for example 'record_checksum_hash', 'record_sha1_hash', or 'md5_hash'.

chadhoc
Well, when I say my dataset isn't that large, I mean it's only ~20-50 million rows. Still pretty large, but not approaching terabyte+ capacity. Given an index key size of 300 bytes per record, I'm looking at roughly 5.7 GB of index data for 20 million rows vs. 610 MB for a 32-byte key. The latter will be much better performance-wise, I think, because the entire index might potentially be kept in memory.
Chris
Well, smaller indexes are definitely better, but whether or not the entire index fits into memory opens up a whole bunch of other points and questions - for one, it likely doesn't matter whether the entire index fits into memory (for various reasons), and what stays in memory is a very broad topic in itself. I'd consider 50m rows a fairly small dataset, but a ~600 MB index certainly has the potential to be much more efficient than a ~6 GB one, depending on what you are doing with them.
chadhoc
+1  A: 

It's not a bad idea, but if your data set is small I'm not sure you'll gain anything tangible other than some duplicate fields (hashes of already-existing fields). Are you considering doing this because the dupe finding is running slowly, or just out of a desire to optimize?

If it's the latter, I'd probably not do it. Don't forget that you now have to update two fields instead of just one whenever you change those fields, and that's easy to forget to do, especially if you ever run updates in SQL directly rather than through your front end.

Donnie
I'm trying to optimize the query that checks for duplicates because for every 1,000 inserts we may have 10,000 duplicate checks.
Chris
10k rows (i.e. 10k checks) is not a lot. Like I said, is it actually running slowly? If not, I wouldn't worry about it yet.
Donnie
A: 

If your comparisons look for an exact match, and if the comparisons happen often (or if the speed of comparison is more important than the speed of insert), then I think computing a hash would be a very good idea. Not only would the index be smaller, but the per-row comparison will be faster. In fact, you should be able to get away with SHA-1, which is only 160 bits, instead of SHA-256: the extra bits help from a cryptographic perspective, but they don't really improve uniqueness.

Of course, be sure to canonicalize your data before computing the hash. Also note that "LIKE" comparisons don't work against a hash the way they would against the original strings.
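For example (just a sketch with a made-up variable), trimming and upper-casing before hashing means trivially different forms of the same value produce the same hash:

-- Sketch only: canonicalize (trim + upper-case) before hashing so that
-- 'Cafe', 'cafe', and ' Cafe' all hash to the same value; accented variants
-- like 'Café' would still hash differently unless handled separately.
declare @first_name varchar(50)
set @first_name = ' Cafe'

select hashbytes('SHA1', upper(ltrim(rtrim(@first_name))))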

RickNZ
A: 

Most hashes won't detect duplicates where case or accented letters differ. Please compare:

select checksum('Cafe'), checksum('cafe'), checksum('Café'), checksum(' Cafe');
select hashbytes('SHA1', 'Cafe'), hashbytes('SHA1', 'cafe'), hashbytes('SHA1', 'Café'), hashbytes('SHA1', ' Cafe');
Álvaro G. Vicario
Yeah, I'd .ToUpperInvariant() everything before hashing to avoid this problem.
Chris