tags:

views:

538

answers:

5

I would like to calculate some sort of hash or checksum value for a column value in Informix SQL.

The problem:

We need to anonymize personal information in test data, and would like to do so by hashing the relevant values, as in:

UPDATE personal_data SET name=HASH(name), employee_no=HASH(employee_no)

We want to use a hash, rather than just a fixed value, because it is useful for different values to (generally) map to different results, and for the same values to map to the same result. That makes it possible to still do comparisons on the test data (for example it is possible to have several records with the same employee_no, and it is useful to preserve that information).

Most DBMS have some kind of hashing function (PostgreSQL and MySQL have MD5(), Oracle has DBMS_UTILITY.GET_HASH_VALUE), but I could not find anything for Informix (Informix IDS 9). Is this a missing feature?

+1  A: 

sleske,

A few years ago, I wrote a T-SQL function to compute SHA-1 hashes. If that will work for you, maybe you can rework my function for Informix. This newsgroup thread has a little more information about its limitations.

create function S( 
  @N int, 
  @x bigint 
) returns binary(4) as begin 
  declare @two_N bigint 
  set @two_N = power(cast(2 as bigint), @N) 
  declare @two_32_N bigint 
  set @two_32_N = power(cast(2 as bigint), 32-@N) 
  return cast(@x%@two_32_N*@two_N + @x/@two_32_N as binary(4)) 
end 
go 
create function f( 
  @t bigint, 
  @B bigint, 
  @C bigint, 
  @D bigint 
) returns bigint as begin 
  declare @2_32 bigint set @2_32 = power(cast(2 as bigint),32) 
  if @t between 0 and 19 
    return (@B & @C) | ((@2_32-@B-1) & @D) 
  if @t between 20 and 39 
    return @B ^ @C ^ @D 
  if @t between 40 and 59 
    return (@B & @C) | (@B & @D) | (@C & @D) 
  return @B ^ @C ^ @D 
end 
go 
create function SHA1 ( 
  @s varchar(55) 
) returns binary(20) as begin 
  declare @b varbinary(55) 
  set @b = cast(@s as varbinary(55)) 
  declare @zeros binary(64) 
  set @zeros = 0x 
  declare @padded binary(64) 
  set @padded = 
    @b + 0x80 + substring(@zeros,1,55-datalength(@b)) 
       + cast(8*datalength(@b) as binary(8)) 
  declare @H5 binary(20) 
  set @H5 = 0x67452301EFCDAB8998BADCFE10325476C3D2E1F0 
  declare @K4 binary(16) 
  set @K4 = 0x5A8279996ED9EBA18F1BBCDCCA62C1D6 
  declare @ABCDE binary(20) set @ABCDE = @H5 
  declare @W80   varbinary(320) set @W80 = @padded 
  declare @TEMP  binary(4) set @TEMP = 0x 
  declare @2_32 bigint set @2_32 = power(cast(2 as bigint),32) 
  declare @t int 
  set @t = 16 
  while @t < 80 begin 
    set @W80 = @W80 + 
      dbo.S(1,cast(substring(@W80,(@t-3)*4+1,4) as bigint) 
             ^cast(substring(@W80,(@t-8)*4+1,4) as bigint) 
             ^cast(substring(@W80,(@t-14)*4+1,4) as bigint) 
             ^cast(substring(@W80,(@t-16)*4+1,4) as bigint)) 
    set @t = @t + 1 
  end 
  set @t = 0 
  while @t < 80 begin 
    set @TEMP = cast((cast(dbo.S(5,substring(@ABCDE,1,4)) as bigint) 
          + dbo.f(@t,substring(@ABCDE,5,4) 
                    ,substring(@ABCDE,9,4) 
                    ,substring(@ABCDE,13,4)) 
          + cast(substring(@ABCDE,17,4) as bigint) 
          + cast(substring(@W80,4*@t+1,4) as bigint) 
          + cast(substring(@K4,4*(@t/20)+1,4) as bigint))%@2_32 as 
binary(4)) 
    set @ABCDE = @TEMP+substring(@ABCDE,1,4) 
                  +dbo.S(30,substring(@ABCDE,5,4)) 
                  +substring(@ABCDE,9,8) 
    set @t = @t + 1 
  end 
  set @H5 
  = cast((cast(substring(@H5, 1,4) as bigint) + cast(substring(@ABCDE, 1,4) 
as bigint))% @2_32 as binary(4)) 
  + cast((cast(substring(@H5, 5,4) as bigint) + cast(substring(@ABCDE, 5,4) 
as bigint))% @2_32 as binary(4)) 
  + cast((cast(substring(@H5, 9,4) as bigint) + cast(substring(@ABCDE, 9,4) 
as bigint))% @2_32 as binary(4)) 
  + cast((cast(substring(@H5,13,4) as bigint) + cast(substring(@ABCDE,13,4) 
as bigint))% @2_32 as binary(4)) 
  + cast((cast(substring(@H5,17,4) as bigint) + cast(substring(@ABCDE,17,4) 
as bigint))% @2_32 as binary(4)) 
  return @H5 
end
Steve Kass
Interesting! There'd be quite a lot of work to do to translate that into Informix - the bitwise operations are not directly supported, for example.
Jonathan Leffler
I don't think it makes sense for me to re-implement SHA-1. If I cannot do it inside Informix, I'll probably just write an external program to do it :-/.
sleske
+1  A: 

IDS does not have any of the hash functions exposed as built-in functions, AFAICR.

Assuming you are using IBM Informix Dynamic Server (IDS) version 10.00 or later (earlier versions are not supported - well, 7.31 is supported until 2009-09-30, but not beyond), then it is possible to create a UDR (user-defined routine) that does the job. This is not incredibly difficult - but neither is it trivial.

Jonathan Leffler
Thanks, that's what I feared. I'll probably have to write a UDR or a helper program to do the job :-(.
sleske
Sorry about that. Can I suggest that you consider carefully whether your UDR should be called `MD5()` or whether it would be better to call it, say, `app_hash()` so that if/when IDS acquires an MD5 (or SHA-1, or SHA-256, or ...) function, you won't run into a naming problem? Or you can risk that IDS does provide a function with the same signature as your UDR, but that the output will be incompatible in the future.
Jonathan Leffler
@Jonathan: Good point. I'd probably just call it myHash or similar, because I don't really care about the algorithm, I just need a hash. At any rate, doing it as an external program will probably be simpler, as I know little about UDRs (and we don't use them at present).
sleske
A: 

Why not just use ROWID instead of HASH value?

Frank Rotolo
I don't see how this answers my question.
sleske
I also explained in the question that I want to use a hash to keep identical values identical. A ROWID wouldn't do that.
sleske
A: 

If you can alter the table structure then you can use WITH VERCOLS option. See Using the WITH VERCOLS Option. In short: this option gives you 2 additional hidden columns: checksum for the originally inserted row and the number of changes (updates) on the row.

robsosno
Interesting. But I don't see how this would help me, as it apparently generates a checksum for the whole row, not for the value in one column.
sleske
+1  A: 

You can use ENCRYPT_TDES. Encrypted string is much longer than original but this shouldn't be a problem - value will be truncated. Encryption + truncation should get result similar to hash value. I'm not quite sure if encryption is available in 9.40, probably not. In that case you have to calculate hash value on the client.

robsosno
Good solution. Yes, encryption + truncation would most probably do it. It's listed as a new feature in the feature list for IDS 10.0, so unfortunately it's probably not available in 9.40, which is where I'd need it :-/ .
sleske
BTW: Welcome to stackoverflow. Your first answer is a good start :-).
sleske