tags:

views:

656

answers:

6

I am trying to understand crc32 to generate the unique url for web page.

If we use the crc32, what is the maximum number of urls can be used so that we can avoid duplicates?

What could be the approximative string length to keep the checksum to be 2^32?

When I tried UUID for an url and convert the uuid bytes to base 64, I could reduce to 22 chars long. I wonder I can reduce still further.

Mostly I want to convert the url (maximum 1024 chars) to shorted id.

+3  A: 

There is no such number as the "maximum number of urls can be used so that we can avoid duplicates" for CRC32.

The problem is that CRC32 can produce duplicates, and it's not a function of how many values you throw at it, it's a function of what those values look like.

So you might have a collision on the second url, if you're unlucky.

You should not base your algorithm on producing a unique hash, instead produce a unique value for each url manually.

Lasse V. Karlsen
While I do agree with your conclusion, quantity does matter. Even if CRC32 had perfect distribution, the birthday paradox would make a collision likely as you get near 2^16 items. Check out http://en.wikipedia.org/wiki/Birthday_attack for a handy table, as well as more math.
Steven Sudit
I agree that by feeding it "random" data or at least data without a discernible pattern, the likelihood of a collision will increase with quantity. What I meant is that the function isn't such that it allows you to feed it a fixed number of items and get guaranteed unique values out of it for each.
Lasse V. Karlsen
That's certainly true: aside from perfect hashes, which exist only for inputs known in advance, there can be no guarantee of uniqueness, except in a statistical sense. For example, it is extremely unlikely for the SHA-2 value of your comment and mine to be indentical, and we can quantify that probability if we wish.
Steven Sudit
+2  A: 

If you're already storing the full URL in a database table, an integer ID is pretty short, and can be made shorter by converting it to base 16, 64, or 85. If you can use a UUID, you can use an integer, and you may as well, since it's shorter and I don't see what benefit a UUID would provide in your lookup table.

Joel Mueller
+1  A: 

CRC32 means cyclic redundancy check with 32 bits where any arbitrary amount of bits is summed up to a 32 bit check sum. And check sum functions are surjective, that means multiple input values have the same output value. So you cannot inverse the function.

Gumbo
I think you might mean non-injective rather than surjective. A function is surjective if every value in the 'target set' (codomain) of your function gets hit by something in your 'source set' (domain).This means that surjectivity very much depends on the choice of codomain (by just deciding to use a bigger codomain your function stops being surjective)
+1  A: 

No, even you use md5, or any other check sum, the URL CAN BE duplicate, it all depends on your luck.

So don't make an unique url base on those check sum

Fu4ny
A: 

The right way to make a short URL is to store the full one in the database and publish something that maps to the row index. A compact way is to use the Base64 of the row ID, for example. Or you could use a UID for the primary key and show that.

Do not use a checksum, because it's too small and very likely to conflict. A cryptographic hash is larger and less likely, but it's still not the right way to go.

Steven Sudit
A: 

The quickest (and perhaps best!) way to solve things may be to simply use a hash of the local path and query of a given URI, as follows:

using System;

namespace HashSample
{
    class Program
    {
        static void Main(string[] args)
        {
            Uri uri = new Uri(
                "http://host.com/folder/file.jpg?code=ABC123");

            string hash = GetPathAndQueryHash(uri);

            Console.WriteLine(hash);
        }

        public static string GetPathAndQueryHash(Uri uri)
        {
            return uri.PathAndQuery.GetHashCode().ToString();
        }
    }
}

The above presumes that the URI scheme and host remain the same. If not GetHashCode will work with any string.

For a great discussion on CRC32 Hash Collision visit: http://episteme.arstechnica.com/eve/forums/a/tpc/f/6330927813/m/821008399831

lsb
+1 for linking to an article on collisions, -1 for nonetheless offering code that is susceptible to these collisions.
Steven Sudit
-1 since it depends on String.GetHashCode remaining constant, which is not a safe bet.
Steven Sudit