I've not given this much thought yet, so this might turn out to be a silly question.

How can I take a unique 5-character ASCII string and convert it into a unique and reproducible (i.e. it needs to be the same every time) 32-bit integer?

Any ideas?

+2  A: 

If they're guaranteed to be alphanumeric only, and case-insensitive ([A-Z0-9]), you can treat the string as a base-36 number.
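A minimal sketch of the base-36 idea, assuming the input is exactly 5 characters from [A-Z0-9] and that case does not matter. The names `DIGITS`, `encode_base36` and `decode_base36` are made up for illustration. Since 36^5 = 60,466,176, every result fits comfortably in 32 bits.

```python
# Hypothetical helper names; the alphabet is digits first, then A-Z.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base36(s: str) -> int:
    """Map a 5-character [A-Z0-9] string (case-insensitive) to an integer."""
    if len(s) != 5 or any(c not in DIGITS for c in s.upper()):
        raise ValueError("expected exactly 5 characters from [A-Z0-9]")
    return int(s, 36)  # int() accepts both upper- and lower-case letters

def decode_base36(n: int) -> str:
    """Inverse of encode_base36; always returns 5 upper-case characters."""
    chars = []
    for _ in range(5):
        n, r = divmod(n, 36)      # peel off the least significant digit
        chars.append(DIGITS[r])
    return "".join(reversed(chars))
```

Because the mapping is a plain positional number system, two different (case-folded) strings can never produce the same integer.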

egrunin
You can go up to base-84: 84^5 is less than 2^32. That's both cases, digits, and 22 symbols, which is pretty decent.
Tom Anderson
+2  A: 

If you need to handle extended ASCII you are out of luck, as you would need 5 full chars which is 40 bits. Even with non-extended chars (top bit not used), you are still out of luck as you are trying to encode 35 bits of ASCII data into 32 bits of integer.

Steve Townsend
+2  A: 

Extended ASCII goes from 0-255, which takes 8 bits per character. In 32 bits you have room for 4 of those, not 5. So, to make it short and sweet, you can't do this.

Even if you are willing to ignore the high-order characters (values 128-255), use only ASCII characters 0-127, and spend just 7 bits per character, you are still 3 bits short (7*5 = 35 and you only have 32 available).
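The bit counting above can be checked with a few lines. This is only an illustration; the function name `shortfall` is made up.

```python
def shortfall(bits_per_char: int, length: int = 5, int_bits: int = 32) -> int:
    """Number of bits by which `length` characters exceed the integer width."""
    return bits_per_char * length - int_bits

print(shortfall(8))  # 8-bit characters: 40 - 32 = 8 bits too many
print(shortfall(7))  # 7-bit ASCII:      35 - 32 = 3 bits too many
```

A positive result means there are more possible strings than integers, so some strings must share a code.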

Charles Bretana
+1  A: 

Assuming it is in fact ASCII (i.e., no characters with ordinal values greater than 127), you have five characters of 7 bits, or 35 bits of information. There is no way to generate a 32-bit code from 35 bits that is guaranteed to be unique; you're missing three bits, so on average each code will also represent 7 other valid ASCII strings. However, you can make it very, very unlikely that you will ever see a collision by being careful in how you calculate the code, so that input strings that are very similar have very different codes. I see another answer has suggested CRC-32. You could also use a hash function such as MD5 or SHA-1 and use only the first 32 bits; this is probably best because hash functions are specifically designed for this purpose.
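As a rough sketch of both suggestions, using Python's standard `zlib` and `hashlib` modules (the function names `code_crc32` and `code_sha1_32` are made up, and SHA-1 is picked here only as one of the two hashes mentioned):

```python
import hashlib
import zlib

def code_crc32(s: str) -> int:
    """CRC-32 of the ASCII bytes, as an unsigned 32-bit integer."""
    return zlib.crc32(s.encode("ascii")) & 0xFFFFFFFF

def code_sha1_32(s: str) -> int:
    """First 32 bits (4 bytes, big-endian) of the SHA-1 digest."""
    digest = hashlib.sha1(s.encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big")
```

Both are reproducible across runs, but neither is guaranteed collision-free: they only make collisions unlikely for a modest number of strings.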

If you can further constrain the values of the input string (say, only alphanumeric, no lowercase, no control characters, or something of the sort), you can probably eliminate that extra data and generate guaranteed unique 32-bit codes for each string.

kindall
+1  A: 

One way is to treat the 5 characters as numerals in base N, where N is the number of characters in your alphabet (the set of allowed characters). From there on, it's just simple base conversion.

Given that you have 32 bits available, and 5 characters to store, you can have at most (2^32)^(1/5) ≈ 84.4, i.e. 84, characters in your alphabet. Assuming you only include basic ASCII, not extended ASCII (>127), you have 7 bits of information in a single character, so that's a bit of a problem: there are too many possibilities to create unique values for every string. However, the first 32 characters, as well as the last character, are control characters, and if you exclude those, you're down to 95 characters.

You still have to cut 11 characters, though. Wikipedia has a nice chart of the characters in ASCII which you can use to determine which characters you need.
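The numbers above can be verified with a short calculation. This is just a sanity check; `max_alphabet_size` is a made-up name.

```python
def max_alphabet_size(length: int = 5, int_bits: int = 32) -> int:
    """Largest N such that N**length distinct strings fit in int_bits bits."""
    n = 1
    while (n + 1) ** length <= 2 ** int_bits:
        n += 1
    return n

# Printable ASCII: codes 32 (space) through 126 (~); 0-31 and 127 are controls.
printable = [chr(c) for c in range(32, 127)]

limit = max_alphabet_size()
print(limit)                   # size of the largest usable alphabet
print(len(printable) - limit)  # how many printable characters must be dropped
```

Integer arithmetic is used instead of a floating-point fifth root to avoid rounding surprises near the boundary.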

Michael Madsen
+2  A: 

If all five characters will belong to a set of 84 or fewer distinct characters, then you can squish five of them into a longword. Convert each character into a value 0..83, then

  // intvalue must be an unsigned 32-bit integer (84^5 - 1 exceeds 2^31)
  intvalue = ((((char4*84+char3)*84+char2)*84+char1)*84+char0);
  char0 = intvalue % 84;
  char1 = (intvalue / 84) % 84;
  char2 = (intvalue / (84*84)) % 84;
  char3 = (intvalue / (84*84L*84)) % 84;
  char4 = (intvalue / (84*84L*84*84L)) % 84;
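For a round-trip version of the same idea, here is a sketch in Python. The 84-character alphabet is an arbitrary assumption (digits, both letter cases, and the first 22 characters of `string.punctuation`); any fixed set of 84 distinct characters would do. Unlike the formulas above, the first character of the string is treated as the most significant digit.

```python
import string

# Assumed alphabet: 10 digits + 26 upper + 26 lower + 22 symbols = 84.
ALPHABET = (string.digits + string.ascii_uppercase +
            string.ascii_lowercase + string.punctuation[:22])
BASE = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode(s: str) -> int:
    """Pack a 5-character string into an integer below 84**5 (< 2**32)."""
    if len(s) != 5:
        raise ValueError("expected exactly 5 characters")
    value = 0
    for ch in s:
        if ch not in INDEX:
            raise ValueError(f"character {ch!r} is not in the alphabet")
        value = value * BASE + INDEX[ch]
    return value

def decode(value: int) -> str:
    """Inverse of encode."""
    chars = []
    for _ in range(5):
        value, digit = divmod(value, BASE)  # least significant digit first
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))
```

In a language with fixed-width integers the result would need an unsigned 32-bit type, since the largest value is above 2^31.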

BTW, I wonder if anyone uses base-84 encoding as a standard; on many platforms it could be easier to handle than base-64, and the results would be more compact.

supercat
A: 

+1 to kindall's answer... yes, use hash functions.

Dhana