views:

1069

answers:

7

Hi,

I have a table with a column of unique string values. The max length of the string value is 255 char. I want to generate a unique id with the string value as input. In other words I am looking for a compact representation for a string. The unique id generated can be alpha-numeric. A useful feature to have would be to be able to regenerate the string value from the unique id.

Is there an efficient function to generate such an unique id. Some ways could be using checksum or hash functions. I want to know if there is a standard way to do this.

I am using MySql database and java.

Thanks!

--edit: I am looking for a more compact representation. Than just using the string itself.

A: 
public String getUniqueId(String uniqueString) {
    return uniqueString;
}

Unless the ID has any other constraints on it than "be unique".

Sean Nyman
he said "a compact representation" - presumably a shorter version.
philfreo
+1  A: 

If your database requires that the column contain unique values, then why not use the string itself? Anything else is just another step to encode/decode it.

FrustratedWithFormsDesigner
I am looking for a more compact representation.
pkrish
@pkrish: Ah ok, so a lossless compression of a string, so you don't need to display the full 255 characters? Have you looked into ZIP compression?
FrustratedWithFormsDesigner
+1  A: 

You have much much more possibilities for a 255 long string than a 64 (or whatever) bit long number. It is impossible. Add an auto_increment field.

Notinlist
A: 

If you have a limited number of strings that occur frequently, creating a reference table with a numeric (auto-increment) ID, and a FK to that reference table in your main table could be an option.

If not, you could run your strings through GZIP or any other compression algorithm if you need to retrieve the original.

If you don't need to retrieve the original, a hash function such as MD5 is what you're looking for.

Henning
The problem with compression is that since I will be using it as in ID, I want this to be human readable. So may be a good hash function like MD5 is the way to go.
pkrish
+1  A: 

How unique is "unique"? Using any good hashing function (MD5 is decent for most uses, and easily implemented via java.security.MessageDigest.getInstance("MD5") can get you to a 128-bit number that's very very likely to be unique. Using a subset of the hash gets you a smaller ID, with a higher chance of collision.

Using an auto_increment field in the DB, if it fits your design, might be easier to implement, will truly guarantee uniqueness, and will use smaller IDs than the 16 bytes of MD5. You can also then meet your requirement of finding the string by the key, which you can't do for a hash.

Dagon
My use case cannot use auto_increment. I like the idea of using the MD5 value as the id. Also found out from @philfreo post that there is an MD5 function in mysql which is good. I guess it is ok that I cannot get the string back using the MD5 hash.
pkrish
A: 

This is related to compression. The simplest way would be to bit-pack and get each character down to the bear minimum number of bits.

A-Z is 26 chars which is less than 32 (5 bits)

add a-z and it's 6 bits (with somewhere around 12 bit-patterns left over to represent other characters).

Let's say that is enough for you. So you have 6x255 bits which is 1530 bits to store your string. (191 bytes)

Going with only caps would reduce that a little (to 159 bytes)

You can optimize it more, but then you have to go into a compression algorithm that expects a specific language or patterns in the Strings and optimizes those patterns.

Unless you can further specify the contents of the strings, you're just not going to get what you want. Sorry. (If you can tell more about the contents of the strings, do so. One of us may see patterns that will allow much better "Compression")

This lack of ability to do what you want is why hashtables are so cool. They get a "Mostly Unique" number and then have a second level of resolution to test cases where two strings hashed to the same number.

Bill K
A: 

Since you're using MySQL, take a look at CRC32

http://www.bitbybit.dk/carsten/blog/?p=191

philfreo