views:

682

answers:

3

I'm looking for a Perl string checksum function with the following properties:

  • Input: Unicode string of undefined length ($string)
  • Output: Unsigned integer ($hash), for which 0 <= $hash <= 2^32-1 holds (0 to 4294967295, matching the size of a 4-byte MySQL unsigned int)

Pseudo-code:

sub checksum {
    my $string = shift;
    my $hash;
    ... checksum logic goes here ...
    die unless ($hash >= 0);
    die unless ($hash <= 4_294_967_295);
    return $hash;
}

Ideally the checksum function should be quick to run and should generate values somewhat uniformly in the target space (0 .. 2^32-1) to avoid collisions. In this application random collisions are totally non-fatal, but obviously I want to avoid them to the extent that it is possible.

Given these requirements, what is the best way to solve this?

+4  A: 

Don't know how quick it is, but you might try String::CRC.

Pim
+1  A: 

Any hash function will be sufficient - simply truncate it to 4-bytes and convert to a number. Good hash functions have a random distribution, and this distribution will be constant no matter where you truncate the string.

I suggest Digest::MD5 because it is the fastest hash implementation that comes with Perl as standard. String::CRC, as Pim mentions, it also implemented in C and should be faster.

Here's how to calculate the hash and convert it to an integer:

use Digest::MD5 qw(md5);
my $str = substr( md5("String-to-hash"), 0, 4 );
print unpack('L', $str);  # Convert to 4-byte integer (long)
rjh
+3  A: 

From perldoc -f unpack:

        For example, the following computes the same number as the
        System V sum program:

            $checksum = do {
                local $/;  # slurp!
                unpack("%32W*",<>) % 65535;
            };
Randal Schwartz