views:

133

answers:

7

This is for the purpose of having a nice short URL which refers to an md5 hash in a database. I would like to convert something like this:

a7d2cd9e0e09bebb6a520af48205ced1

into something like this:

hW9lM5f27

Those both contain about the same amount of information. The method doesn't have to be direct and reversible but that would be nice (more flexible). At the least I would want a randomly generated string with the hex hash as the seed so it is reproducible. I'm sure there are many possible answers, I am curious to see how people would do it in an elegant way.

Oh, this doesn't have to have perfect 1:1 correspondence with the original hash but that would be a bonus (I guess I already implied that with the reversibility criteria). And I would like to avoid collisions if possible.

EDIT I realized my initial calculations were totally wrong (thanks to the people answering here but it took me awhile to clue in) and you can't really reduce the string length very much by throwing in all the lower case and uppercase letters into the mix. So I guess I will want something that doesn't directly convert from hex to base 62.

A: 

It depends on what a7d2cd9e0e09bebb6a520af48205ced1 is. Assuming you're talking about a hex number since it's coming from md5, you could just run a base64_encode. If you have the hex in string form, you'd want to run hexdec. Be careful you don't run into maxint issues though.

Steven Xu
+1  A: 

You could just do plain old base conversion. The hash is expressed in hexadecimal, and you can then create an alphabet of the size you want to express the hash. Base64 works well for this purpose, although you will probably want to write your own function so you end up encoding the value, not the string.

Note, however, that standard Base64 contains characters you wouldn't want to put in a URL; +, / and the padding character =. You can replace those characters with something else when converting back and forth to get a URL-safe Base64 encoding (or use a safe set of characters to begin with if you write your own function).

Michael Madsen
+6  A: 

Here's a little function for consideration:

/** Return 22-char compressed version of 32-char hex string (eg from PHP md5). */
function compress_md5($md5_hash_str) {
    // (we start with 32-char $md5_hash_str eg "a7d2cd9e0e09bebb6a520af48205ced1")
    $md5_bin_str = "";
    foreach (str_split($md5_hash_str, 2) as $byte_str) { // ("a7", "d2", ...)
        $md5_bin_str .= chr(hexdec($byte_str));
    }
    // ($md5_bin_str is now a 16-byte string equivalent to $md5_hash_str)
    $md5_b64_str = base64_encode($md5_bin_str);
    // (now it's a 24-char string version of $md5_hash_str eg "VUDNng4JvrtqUgr0QwXOIg==")
    $md5_b64_str = substr($md5_b64_str, 0, 22);
    // (but we know the last two chars will be ==, so drop them eg "VUDNng4JvrtqUgr0QwXOIg")
    $url_safe_str = str_replace(array("+", "/"), array("-", "_"), $md5_b64_str);
    // (Base64 includes two non-URL safe chars, so we replace them with safe ones)
    return $url_safe_str;
}

Basically you have 16-bytes of data in the MD5 hash string. It's 32 chars long because each byte is encoded as 2 hex digits (i.e. 00-FF). So we break them up into bytes and build up a 16-byte string of it. But because this is no longer human-readable or valid ASCII, we base-64 encode it back to readable chars. But since base-64 results in ~4/3 expansion (we only output 6 bits per 8 bits of input, thus requiring 32 bits to encode 24 bits), the 16-bytes becomes 22 bytes. But because base-64 encoding typically pads to lengths multiples of 4, we can take only the first 22 chars of the 24 character output (the last 2 of which are padding). Then we replace non-URL-safe characters used by base-64 encoding with URL-safe equivalents.

This is fully reversible, but that is left as an exercise to the reader.

I think this is the best you can do, unless you don't care about human-readable/ASCII, in which case you can just use $md5_bin_str directly.

And also you can use a prefix or other subset of the result from this function if you don't need to preserve all the bits. Throwing out data is obviously the simplest way to shorten things! (But then it's not reversible)

P.S. for your input of "a7d2cd9e0e09bebb6a520af48205ced1" (32 chars), this function will return "VUDNng4JvrtqUgr0QwXO0Q" (22 chars).

dkamins
According to my calculations 9 characters of a-zA-Z0-9 should be adequate to store an md5 hash, so 22 characters isn't as good as I was hoping for. I don't quite understand base64, why does it increase the size? Isn't there something more suitable that will actually shrink the size of the string?
Moss
OK, my calculations must be wrong and you do need 22 characters to express the hash but I can't figure out where my math is wrong. If each character in an md5 hash represents 16 bits and there are 32 characters that should be 16*32 = 512 bits (but Wikipedia says md5 is 128 bits). And so 62*9 = 558 bits. It looks like 9 digits should be able to contain the supposed 512 bits of of an md5. -- BAH, ok, I just realized that one character in hex is in fact 4 bits, not 16. Why is this confusing me so much...
Moss
Each hex digit char = 4 bits. 32 hex chars = 128 bits = 16 bytes. Base-64 only uses 6 bits of each output byte (to maintain ASCII-safe output), so it takes 4 bytes (6+6+6+6) to encode 3 bytes (8+8+8). This is why 16 raw bytes requires 22 encoded bytes. Base-64 sacrifices space-efficiency to achieve wider medium compatibility.
dkamins
+1  A: 

I would advise against a 1-1 correspondence:

With base-64 encoding you will only be able to decrease the input to (4/8)/(6/8) -> 4/6 ~ 66% in size (and this is assuming you deal with the "ugly" base64 characters without adding anything new).

I would likely consider a (secondary) look-up method to get truly "pretty" values. Once you have this alternate method established, choosing how to generate values in that range -- e.g. random numbers -- can be free of the source hash value (because correspondence is lost anyway) and an arbitrary "pretty" target-set can be used, perhaps [a-z][A-Z][0-9].

You can convert to the base (62 above) by simply following the divide-and-carry method and a look-up into an array. It should be fun little exercise.

Note: If you choose the random number from [0, 62^5) then you will get a value that will completely pack the encoded output (and fit within 32-bit integer values). You can then perform this process multiple times in a row to get a nice multiple of-5 result value, such as xxxxxyyyyyzzzzzz (where x,y,z are different groups and the total value is in the range (62^5)^3 -> 62^15 -> "a huge value")

Edit, for comment:

Because without the 1-1 correspondence you can make truly short pretty things -- perhaps as "small" as 8 characters long -- with base62, 8 characters can store up to 218340105584896 values, which is likely more than you will ever need. Or even 6 characters which "only" allows storage of 56800235584 different values! (And you still can't store that number in a plain 32-bit integer :-) If you drop to 5 characters, you once again reduce the space (to just under one billion: 916,132,832), but now you have something that can fit in a signed 32-bit integer (albeit it is somewhat wasteful).

The DB should ensure no duplicates, albeit an index on this value will be "fast-fragmenting" with a random source (but you could use counters or whatnot). A well-distributed PRNG should have minimal conflicts (read: retries) in a large enough range (assuming you keep the seed rolling and don't reset it, or reset it appropriately) -- Super 7 can even guarantee NO duplicates during a cycle (of only ~32k), but as you can see above, the target space is still big. See the math at the top of what maintaining a 1-1 relationship requires in terms of minimum encoded size.

The divide-and-carry method just explains how to get your source number into a different base -- perhaps base62. The same general method can be applied to go from the "natural" base (base10 in PHP) to any base.

pst
Why would you recommend against 1-1 correspondence? I don't know what the divide-and-carry method is you are talking about but it sounds interesting.
Moss
A: 

If you're doing a URL shortener why not just add URLs sequentially so you don't run into hash collisions, you can start at 0 and goto Z, and you don't "run out" when you get to 8 chars.

A: 

Of course if I want a function to satisfy my needs perfectly I better make it myself. Here is what I came up with.

//takes a string input, int length and optionally a string charset
//returns a hash 'length' digits long made up of characters a-z,A-Z,0-9 or those specified by charset
function custom_hash($input, $length, $charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUFWXIZ0123456789'){
    $output = '';
    $input = md5($input); //this gives us a nice random hex string regardless of input 

    do{
        foreach (str_split($input,8) as $chunk){
            srand(hexdec($chunk));
            $output .= substr($charset, rand(0,strlen($charset)), 1);
        }
        $input = md5($input);

    } while(strlen($output) < $length);

    return substr($output,0,$length);
}

This is a very general purpose random string generator, however it is not just any old random string generator because the result is determined by the input string and any slight change to that input will produce a totally different result. You can do all sort of things with this:

custom_hash('1d34ecc818c4d50e788f0e7a9fd33662', 16); // 9FezqfFBIjbEWOdR
custom_hash('Bilbo Baggins', 5, '0123456789bcdfghjklmnpqrstvwxyz'); // lv4hb
custom_hash('', 100, '01'); 
// 1101011010110001100011111110100100101011001011010000101010010011000110000001010100111000100010101101

Anyone see any problems with it or any room for improvement?

Moss
i don't see why you keep calculating hd5 of the input... $input = md5($input);in each iteration of the DO loop
Peter Perháč
Because otherwise the random digits would repeat if your output is larger than 32 digits. I used str_shuffle originally but even that caused repetition on a larger scale.
Moss
+1  A: 

Here are two conversion functions for Base-16 to Base-64 conversion and the inverse Base-64 to Base-16 for arbitrary input lengths:

function base16_to_base64($base16) {
    return base64_encode(pack('H*', $base16));
}
function base64_to_base16($base64) {
    return implode('', unpack('H*', base64_decode($base64)));
}

If you need Base-64 encoding with the URL and filename safe alphabet , you can use these functions:

function base64_to_base64safe($base64) {
    return strtr($base64, '+/', '-_');
}
function base64safe_to_base64($base64safe) {
    return strtr($base64safe, '-_', '+/');
}

If you now want a function to compress your hexadecimal MD5 values using URL safe characters, you can use this:

function compress_hash($hash) {
    return base64_to_base64safe(rtrim(base16_to_base64($hash), '='));
}

And the inverse function:

function uncompress_hash($hash) {
    return base64_to_base16(base64safe_to_base64($hash));
}
Gumbo
Very nice. This looks like the best method to do a pure, reversible conversion. I was looking at pack/unpack in the PHP manual but I couldn't comprehend it.I decided for my needs to go with a 'lossy' compression method. Does stackoverflow allow two accepted answers?
Moss
@Moss: No, you can only accept one answer.
Gumbo