views:

1847

answers:

10

I'm currently using MD5 hashes but I would like to find something that will create a shorter hash that uses just [a-z][A-Z][0-9]. It only needs to be around 5-10 characters long.

Is there something out there that already does this?

Update:

I like the CRC32 hash. Is there a clean way of calculating it in .NET?

Update2:

I'm using the CRC32 function from the link Joe provided. How can I convert the uInt into the characters defined above?

+1  A: 

You can use CRC32, it is 8 bytes long and similar to MD5. Unique values will be supported by adding timestamp to actual value.

So its will look like http://foo.bar/abcdefg12.

or, from another way, you can use alphabetical increment. The keys gonna be like this:/a, /b, ... /z, /a0, /aa, /ab, /ac, ... /az, /aba, /abb, /abc, ...
check this article - http://damieng.com/blog/2006/08/08/calculating_crc32_in_c_and_net
-1: You can't use CRC32, it's an error checking code.
Simeon Pilgrim
When prefixing or suffixing a timestamp to the hashed value, then what is the use of the hash?
Arjan
+1  A: 

You could take the first alphanumeric 5-10 characters of the MD5 hash.

M4N
That's not very unique. The following snippet of code shows that the sequence of numbers from 1 - 1000 has 30 collisions in the first 5 characters:`for f in `seq 0 10000` ; do md5 -s $f ; done | awk '{print substr($4, 0, 5)}' | sort | uniq -c | sort -n`
Stig Brautaset
Since he's looking for hash with a length of only 5 characters, I thought that uniqueness is not a strong requirement.
M4N
Well, referring to TinyURL.com suggest a 100% uniqueness requirement to me. So: no *short* hashes (or *any* hash if I'd program it).
Arjan
A: 

If you don't care about cryptographic strength, any of the CRC functions will do.

Wikipedia lists a bunch of different hash functions, including length of output. Converting their output to [a-z][A-Z][0-9] is trivial.

Kevin Montrose
-1: a CRC only provides error checking, not unique collision avoidance.
Simeon Pilgrim
If you don't need cryptographic guarantees, they do a pretty good job for damn cheap in terms of CPU.
Kevin Montrose
but two urls will make the same CRC, and therefore have the same short-url, which is useless for a shorting service.
Simeon Pilgrim
Two urls could also conceivably make the same md5, or sha1, or sha256. In practice these are rare occurrences, but are possible for all hashing schemes given the pigeon-hole principle. More likely with a non-cryptographic hash than with one certainly, but its a case you have to handle regardless of hash function.
Kevin Montrose
That's why tinyUrl etc dont hash or crc the url, they just assign the next number. Really depends on what actually trying to be solved here.
Simeon Pilgrim
True, URL shortening services almost certainly publish a counter not a hash; but the question is for a hash function with a short output.
Kevin Montrose
+8  A: 

I dont think URL shortening services use hashes, I think they just have a running alphanumerical string that is increased with every new URL and stored in a database. If you really need to use a hash function have a look at this link: some hash functions Also, a bit offtopic but depending on what you are working on this might be interesting: Coding Horror article

jörg
+2  A: 

You can decrease the number of characters from the MD5 hash by encoding them as alphanumerics. Each MD5 character is usually represented as hex, so that's 16 possible values. [a-zA-Z0-9] includes 62 possible values, so you could encode each value by taking 4 MD5 values.

EDIT:

here's a function that takes a number ( 4 hex digits long ) and returns [0-9a-zA-Z]. This should give you an idea of how to implement it. Note that there may be some issues with the types; I didn't test this code.

char num2char( unsigned int x ){
    if( x < 26 ) return (char)('a' + (int)x);
    if( x < 52 ) return (char)('A' + (int)x - 26);
    if( x < 62 ) return (char)('0' + (int)x - 52);
    if( x == 62 ) return '0';
    if( x == 63 ) return '1';
}
Colin
do you have any more details on how this can be done?
Arron
See codymanix's answer, http://stackoverflow.com/questions/1116860/whats-the-best-way-to-create-a-short-hash-similiar-to-what-tiny-url-does/1117008#1117008
Arjan
Hmmm, wouldn't the variable length make it hard to reverse the encoding? When invoking `num2char` multiple times for longer numbers, the result would need some separator between each encoded value, to tell them apart while decoding again. That makes the result much longer than when using a fixed-length encoding. If one doesn't mind using the + and / characters, then Base 64 encoding is easier I guess.
Arjan
According to the question, he's looking for some hash that's shorter than the MD5 that he's currently using, and that uses alphanumerics. So, the current hash is irreversible; I think that's a requirement, or at the least not a problem. And this doesn't have 'variable length' - you take 4 hex digits from the MD5 hash, then pass it to num2char. Then take the next 4, pass that number to num2char, etc. The MD5 hash has 32 hex digits. The string you get out of my algorithm uses 32/4=8 alphanumeric characters.
Colin
Of course, the MD5 is irreversible, but isn't the idea that *your* mapping should be able to decode back to that MD5 value? As for variable length: I was wrong indeed. (I thought 0 would yield "a0", while 25 would yield "a25", but that's obviously "a" and "z" -- don't know how I could be so confused.) However, returning "0" and "1" for 62 and 63 will yield duplicates from the 3rd if(..), right? Base 64 needs the + and / characters for a reason... ;-) (And I guess the 3rd `if` reads `(int)x - 52` instead?)
Arjan
hmmm. I didn't consider that *my* mapping should be decodable back to MD5... I do realize that returning "0" and "1" for 62 and 63 create possible duplicates, which could be a problem, but I was just outlining an idea here. If I can think of a better way, that's easy to interpret and/or elegant, I'll edit my post. Thanks for pointing out my error on the third if statement btw :)
Colin
+3  A: 

You cannot use a short hash as you need a one-to-one mapping from the short version to the actual value. For a short hash the chance for a collision would be far too high. Normal, long hashes, would not be very user-friendly (and even though the chance for a collision would probably be small enough then, it still wouldn't feel "right" to me).

TinyURL.com seems to use an incremented number that is converted to Base 36 (0-9, A-Z).

Arjan
Of course you *can*. Maybe you *shouldn't*, but it's perfectly possible.
Treb
You're very right indeed. :-) One *surely* shouldn't use a short hash in this situation though. I'll edit my answer and rewrite my *"You cannot create a short hash"*.
Arjan
A: 

You could encode your md5 hash code with base64 instead of hexadecimal, this way you get a shorter url using exactly the characters [a-z][A-Z][0-9].

codymanix
Though we do not know what Arron wants to use this for: *if* the URLs are to be entered by humans, then I would make them case insensitive without special characters. Base 36 does this (if the script on the server treats them as such). Unfortunately, Base 36 encoding yields longer URLs than Base 64, but they are less prone to errors. Again: *if* humans would need to type them.
Arjan
+3  A: 

Is your goal to create a URL shortener or to create a hash function?

If you goal is to create a URL shortener, then you don't need a hash function. In that case, you just want to pre generate a sequence of cryptographically secure random numbers, and then assign each url to be encoded a unique number from the sequence.

You can do this using code like:

using System.Security.Cryptography;

const int numberOfNumbersNeeded = 100;
const int numberOfBytesNeeded = 8;
var randomGen = RandomNumberGenerator.Create();
for (int i = 0; i < numberOfNumbersNeeded; ++i)
{
     var bytes = new Byte[numberOfBytesNeeded];
     randomGen.GetBytes(bytes);
}

Using the cryptographic number generator will make it very difficult for people to predict the strings you generate, which I assume is important to you.

You can then convert the 8 byte random number into a string using the chars in your alphabet. This is basically a change of base calculation (from base 256 to base 62).

Scott Wisniewski
*"difficult for people to predict the strings you generate, which I assume is important to you"* -- aha, that might be true, given Arron's *"It only needs to be around 5-10 characters long"*. This would not be like TinyURL.com then, so it's about time Arron gives us some more details!
Arjan
I'm not sure why I got a downvote for this... seems silly.
Scott Wisniewski
A: 

There's a wonderful but ancient program called btoa which converts binary to ASCII using upper- and lower-case letters, digits, and two additional characters. There's also the MIME base64 encoding; most Linux systems probably have a program called base64 or base64encode. Either one would give you a short, readable string from a 32-bit CRC.

Norman Ramsey
+1  A: 

Just take a Base36 (case-insensitive) or Base64 of the ID of the entry.

So, lets say I wanted to use Base36:

(ID - Base36)
1 - 1
2 - 2
3 - 3
10 - A
11 - B
12 - C
...
10000 - 7PS
22000 - GZ4
34000 - Q8C
...
1000000 - LFLS
2345000 - 1E9EW
6000000 - 3KLMO

You could keep these even shorter if you went with base64 but then the URL's would be case-sensitive. You can see you still get your nice, neat alphanumeric key and with a guarantee that there will be no collisions!

KingNestor