views:

564

answers:

3

I am creating a file-oriented database of some test results performed by various users. For this I need to generate unique id for every entry in the database. The ids must satisfy following requirements:

  • Ids should be fairly small (6 characters at most)
  • For every test case and user combination each time same id should be generated

What I tried was a simple BKDR hash function with seed value 31 and used ord() function as follows:

@chars = split(//,$hash_var);

$hash = 0;
$seed = 31;

foreach $char ( @chars ) {
   if( $char !~ m/\d/ ) {
       $hash = ( $seed * $hash ) + ord( $char );
   }  
   else {
       $hash = ( $seed * $hash ) + $char ;
   }
}

$hash = ( $hash & 0x7FFFFFFF ) % 1000;
$hash = "$chars[0]$chars[$#chars]$hash" ;

This sometimes leads to same results for various combinations i.e uniqueness is not observed. Is their any other way to accomplish this? Does changing seed value help accomplish uniqueness.

+5  A: 

Do you have more than 256 users and/or more than 65536 test cases per user? If not, you can just index users from 0 .. 255 and test cases from 0 .. 65535 and encode it as a string of hexadecimal digits so six characters would be fine.

If you have more users or test cases than that, I would again index the users and test cases and then combine them into a 32-bit integer which would actually only take 4 bytes and be trivial to implement but slightly harder on humans.

In any case, I am assuming you are given user name and test case information. Just keep two tied hashes: %users and %cases to map users and test cases to their index numbers.

Sinan Ünür
+1  A: 

If you don't have a lot of users/testcases a simple solution like this might be enough. You'd have to add the limit (and probably pack the integer when storing it).

vinko@parrot:~# more hash.pl
use strict;
use warnings;

my %hash;
my $count = 0;

sub getUniqueId {

        my $_user = shift;
        my $_test = shift;
        my $val;

        my $key = $_user."|".$_test;
        if (defined $hash{$key}) {
                $val = $hash{$key};
        } else {
                $hash{$key} = $count;
                $val = $count;
                $count = $count + 1;
        }
        return $val;
}

my @users = qw{ user1 user2 user3 user4 user5 user3 user5 };
my @testcases = qw{ test1 test2 test3 test1 test1 };

for my $user (@users) {
        for my $test (@testcases) {
                print "$user $test: ".getUniqueId($user,$test)."\n";
        }
}
vinko@parrot:~# perl hash.pl
user1 test1: 0
user1 test2: 1
user1 test3: 2
user1 test1: 0
user1 test1: 0
user2 test1: 3
user2 test2: 4
user2 test3: 5
user2 test1: 3
user2 test1: 3
user3 test1: 6
user3 test2: 7
user3 test3: 8
user3 test1: 6
user3 test1: 6
user4 test1: 9
user4 test2: 10
user4 test3: 11
user4 test1: 9
user4 test1: 9
user5 test1: 12
user5 test2: 13
user5 test3: 14
user5 test1: 12
user5 test1: 12
user3 test1: 6
user3 test2: 7
user3 test3: 8
user3 test1: 6
user3 test1: 6
user5 test1: 12
user5 test2: 13
user5 test3: 14
user5 test1: 12
user5 test1: 12
Vinko Vrsalovic
What exactly do you think you are getting out of that prototype? You may want read the following link before using another one: http://www.perl.com/language/misc/fmproto.html
Chas. Owens
The problem of mixing languages. I had written getUniqueId() and then decided upon fixing the prototype rather than removing it.
Vinko Vrsalovic
+3  A: 

Part of your problem may be that you are using floating point math and BKDR is almost certainly wanting integer math. You can fix that bug by saying

my @chars = split(//,$hash_var);

my $hash = 0;
my $seed = 31;

for my $char ( @chars ) {
   use integer;
   if( $char !~ m/\d/ ) {
       $hash = ( $seed * $hash ) + ord( $char );
   }  
   else {
       $hash = ( $seed * $hash ) + $char ;
   }
}

$hash = ( $hash & 0x7FFFFFFF ) % 1000;
$hash = "$chars[0]$chars[$#chars]$hash" ;

Another tweak that might help is using characters other than the first and last. If the first and last characters tend to be the same, they add no uniqueness to the hash.

You may also want to use a better hash function like MD5 (available in Digest::MD5) and trim the result to your desired size. However, the fact that you are using a hash at all means that you run the risk of having a collision.

Chas. Owens
The last sentence (hashing risks collision) is worth +1 all by itself.
Michael Carman