tags:

views:

331

answers:

7

Hi there!

I need to generate a 10 character unique id (SIP/VOIP folks need to know that it's for a param icid-value in the P-Charging-Vector header). Each character shall be one of the 26 ASCII letters (case sensitive), one of the 10 ASCII digits, or the hyphen-minus.

It MUST be 'globally unique (outside of the machine generating the id)' and sufficiently 'locally unique (within the machine generating the id)', and all that needs to be packed into 10 characters, phew!

Here's my take on it. I'm FIRST encoding the 'MUST' be encoded globally unique local ip address into base-63 (its an unsigned long int that will occupy 1-6 characters after encoding) and then as much as I can of the current time stamp (its a time_t/long long int that will occupy 9-4 characters after encoding depending on how much space the encoded ip address occupies in the first place).

I've also added loop count 'i' to the time stamp to preserve the uniqueness in case the function is called more than once in a second.

Is this good enough to be globally and locally unique or is there another better approach?

Gaurav

#include <stdio.h>
#include <string.h>
#include <sys/time.h>

//base-63 character set
static char set[]="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";

// b63() returns the next vacant location in char array x
int b63(long long longlong,char *x,int index){
    if(index > 9)
     return index+1;

    //printf("index=%d,longlong=%lld,longlong%63=%lld\n",index,longlong,longlong%63);
    if(longlong < 63){
     x[index] = set[longlong];
     return index+1;
    }  

    x[index] = set[longlong%63];
    return b63(longlong/63,x,index+1);
}

int main(){
    char x[11],y[11] = {0}; /* '\0' is taken care of here */

    //let's generate 10 million ids
    for(int i=0; i<10000000; i++){

     /* add i to timestamp to take care of sub-second function calls,
      3770168404(is a sample ip address in n/w byte order) =                84.52.184.224 */
     b63((long long)time(NULL)+i,x,b63((long long)3770168404,x,0));

     // reverse the char array to get proper base-63 output
     for(int j=0,k=9; j<10; j++,k--)
      y[j] = x[k];

     printf("%s\n",y);
    }

    return 0;
}
+1  A: 

Can't you just have a distributed ID table ?

Aurélien Vallée
I love the "KISS" solution.
ssteidl
+1  A: 

Machines on NAT'ed LANs will often have an IP from a small range, and not all of the 32-bit values would be valid (think multicast, etc). Machines may also grab the same timestamp, especially if the granularity is large (such as seconds); keep in mind that the year is very often going to be the same, so it's the lower bits that will give you the most 'uniqueness'.

You may want to take the various values, hash them with a cryptographic hash, and translate that to the characters you are permitted to use, truncating to the 10 characters.

But you're dealing with a value with less than 60 bits; you need to think carefully about the implications of a collision. You might be approaching the problem the wrong way...

retracile
+5  A: 

It MUST be 'globally unique (outside of the machine generating the id)' and sufficiently 'locally unique (within the machine generating the id)', and all that needs to be packed into 10 characters, phew!

Are you in control of all the software generating ids? Are you doling out the ids? If not...

I know nothing about SIP, but there's got to be a misunderstanding that you have about the spec (or the spec must be wrong). If another developer attempts to build an id using a different algorithm than the one you've cooked up, you will have collisions with their ids, meaning they will know longer be globally unique in that system.

I'd go back to the SIP documentation, see if there's an appendix with an algorithm for generating these ids. Or maybe a smarter SO user than I can answer what the SIP algorithm for generating these id's is.

Doug T.
A: 

Hmm, using the system clock may be a weakness... what if someone sets the clock back? You might re-generate the same ID again. But if you are going to use the clock, you might call gettimeofday() instead of time(); at least that way you'll get better resolution than one second.

Jeremy Friesner
+1  A: 

Well, if I cast aside the fact that I think this is a bad idea, and concentrate on a solution to your problem, here's what I would do:

You have an id range of 10^63, which correspond to roughly 60 bits. You want it to be both "globally" and "locally" unique. Let's generate the first N bits to be globally unique, and the rest to be locally unique. The concatenation of the two will have the properties you are looking for.

First, the global uniqueness : IP won't work, especially local ones, they hold very little entropy. I would go with MAC addresses, they were made for being globally unique. They cover a range of 256^6, so using up 6*8 = 48 bits.

Now, for the locally unique : why not use the process ID ? I'm making the assumption that the uniqueness is per process, if it's not, you'll have to think of something else. On Linux, process ID is 32 bits. If we wanted to nitpick, the 2 most significant bytes probably hold very little entropy, as they would at 0 on most machines. So discard them if you know what you're doing.

So now you'll see you have a problem as it would use up to 70 bits to generate a decent (but not bulletproof) globally and locally unique ID (using my technique anyway). And since I would also advise to put in a random number (at least 8 bits long) just in case, it definitely won't fit. So if I were you, I would hash the ~78 generated bits to SHA1 (for example), and convert the first 60 bits of the resulting hash to your ID format. To do so, notice that you have a 63 characters range to chose from, so almost the full range of 6 bits. So split the hash in 6 bits pieces, and use the first 10 pieces to select the 10 characters of your ID from the 63 character range. Obviously, the range of 6 bits is 64 possible values (you only want 63), so if you have a 6 bits piece equals to 63, either floor it to 62 or assume modulo 63 and pick 0. It will slightly bias the distribution, but it's not too bad.

So there, that should get you a decent globally and locally pseudo-unique ID.

A few last points: according to the Birthday paradox, you'll get a ~ 1 % chance of collisions after generating ~ 142 million IDs, and a 99% chance after generating 3 billions IDs. So if you hit great commercial success and have millions of IDs being generated, get a larger ID.

Finally, I think I provided a "better than the worse" solution to your problem, but I can't help but think you're attacking this problem in the wrong fashion, and possibly as other have mentioned, misreading the specs. So use this if there are no other ways that would be more "bulletproof" (centralised ID provider, much longer ID ... ).

Edit: I re-read your question, and you say you call this function possibly many times a second. I was assuming this was to serve as some kind of application ID, generated once at the start of your application, and never changed afterwards until it exited. Since it's not the case, you should definitely add a random number and if you generate a lot of IDs, make that at least a 32 bits number. And read and re-read the Birthday Paradox I linked to above. And seed your number generator to a highly entropic value, like the usec value of the current timestamp for example. Or even go so far as to get your random values from /dev/urandom . Very honestly, my take on your endeavour is that 60 bits is probably not enough...

Florian
+2  A: 

I would have a serious look at RFC 4122 which describes the generation of 128-bit GUIDs. There are several different generation algorithms, some of which may fit (MAC address-based one springs to mind). This is a bigger number-space than yours 2^128 = 3.4 * 10^38 compared with 63^10 = 9.8 * 10^17, so you may have to make some compromises on uniqueness. Consider factors like how frequently the IDs will be generated.

However in the RFC, they have considered some practical issues, like the ability to generate large numbers of unique values efficiently by pre-allocating blocks of IDs.

Robert Tuck
GUID's are only globally unique as long as everyone agrees on the algorithms used to generate them. The OP sounds like everyone might mash up their own ad hoc algorithm to generate something that's supposed to be "globally unique", which means it's going to fall over.
jalf
A: 

Hi there again!

@Doug T. No, I'm not in control of all the software generating the ids. I agree without a standardized algorithm there maybe collisions, I've raised this issue in the appropriate mailing lists.

@Florian Taking a cue from you're reply. I decided to use the /dev/urandom PRNG for a 32 bit random number as the space unique component of the id. I assume that every machine will have its own noise signature and it can be assumed to be safely globally unique in space at an instant of time. The time unique component that I used earlier remains the same.

These unique ids are generated to collate all the billing information collected from different network functions that independently generated charging information of a particular call during call processing.

Here's the updated code below:

Gaurav

 #include <stdio.h>
 #include <string.h>
 #include <time.h>

 //base-63 character set
 static char set[]="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";

 // b63() returns the next vacant location in char array x
 int b63(long long longlong, char *x, int index){
     if(index > 9)
      return index+1;

     if(longlong < 63){
      x[index] = set[longlong];
      return index+1;
     }  

     x[index] = set[longlong%63];
     return b63(longlong/63, x, index+1);
 }

 int main(){
     unsigned int number;
     char x[11], y[11] = {0};

     FILE *urandom = fopen("/dev/urandom", "r");
     if(!urandom)
      return -1;

     //let's generate a 1 billion ids
     for(int i=0; i<1000000000; i++){

      fread(&number, 1, sizeof(number), urandom);

      // add i to timestamp to take care of sub-second function calls, 
      b63((long long)time(NULL)+i, x, b63((long long)number, x, 0));

      // reverse the char array to get proper base-63 output
      for(int j=0, k=9; j<10; j++, k--)
       y[j] = x[k];

      printf("%s\n", y);
     }

     if(urandom)
  fclose(urandom);

     return 0;
 }
Gaurav
Please not that I originated this post earlier as an unregistered user.
Gaurav