ansaurus

Question

Pros and cons of using md5 hash of URI as the primary key in a database

Answer 1

A:

Maybe this document is something you want to read:

http://www.hpl.hp.com/techreports/2002/HPL-2002-216.pdf

MysticSlayer 2008-10-21 08:39:35

Interesting read, but slightly different problem. Furthermore, in my case the content may well be edited, but I'd want to keep the URI unchanged. The approach in the HP tech report would require me to change the URI with each edit.

rdmpage 2008-10-21 08:52:02

Answer 2

A:

Often lots of different urls point to the same page. http://example.com/ example.com http://www.example.com/ http://example.com/index.html http://example.com/. https://example.com/ etc.

This might or might not be a problem for you.

2008-10-21 08:49:32

I do indeed have multiple identifiers for the same object (articles often have more than one presence on the web). That's not a huge problem, I'm not relying on people generating the hash to find out whether the URI exists in my database.

rdmpage 2008-10-21 08:53:49

Answer 3

A:

After browsing stackoverfow a little more I found an earlier question Advantages and disadvantages of GUID / UUID database keys which covers much of this ground.

rdmpage 2008-10-21 09:02:22

Answer 4

+1 A:

Multiple strings can produce the same md5 hash. Primary keys must be unique. So using the hash as the primary key is not good. Better is to use the GUID directly.

Is a GUID suitable for use in a URL. Sure. Here's a GUID (actually, a UUID) I jsut created using Java: 1ccb9467-e326-4fed-b9a7-7edcba52be84

The url could be:

http://example.com/view?id=1ccb9467-e326-4fed-b9a7-7edcba52be84

It's longish but perfectly usable and achieves what you describe.

Steve McLeod 2008-10-21 09:07:07

But if I want to use the primary key in a URL I'll need to reformat it, as opposed to just using the md5 hash. I'm looking to keep things simple.

rdmpage 2008-10-21 09:29:20

Why would you need to reformat it? The Internet Archive uses URLs verbatim, as substrings of the URLs of archived pages.

Hugh Allen 2008-10-21 10:03:39

rdmpage -- Md5 is not unique. Hence, it's not fit for a primary key.

MichaelGG 2008-10-21 10:03:56

But if the probability of a collision is low, then this isn't a big issue. Plus, unlike a UUID I can guarantee that I can regenerate the same md5 hash from the same URI, which is useful if the GUID is generated at different times, for example. UUIDs seem to address a different problem.

rdmpage 2008-10-21 12:20:37

Hugh -- But some URIs are very messy, may contain parts that are url encoded and parts that aren't, etc, making life difficult. A hash seems a lot easier to manage.

rdmpage 2008-10-21 12:24:19

Answer 5

+1 A:

It's perfectly fine. Accidental collision of MD5 is practically impossible (to get a 50% chance of collision you'd have to hash 6 billion URLs per second, every second for 100 years).

porneL 2008-11-13 22:44:25

Exactly. You still need to be aware that deliberate MD5 hash collisions are *easy* to do. If this is going to be a problem you would be better off using SHA-1 or some other newer hash function that isn't as broken as MD5.

paranoidgeek 2009-01-01 03:20:31

Answer 6

A:

MD5 is considered deprecated - at least for cryptographic purposes, but I would suggest only using md5 for backwards compatibility with existing stuff. You should have a good reason to go with md5 when we do have other hash algos out there that aren't (at least yet) broken.

Problems I see with the approach:

Duplicate objects, because the url identifier is different (As arend mentioned)
URLs changing

The latter being the one that might be important - this could be done as simply as a remove and an add. That is, if these ids are never visible/storable outside the database. (Like as a component of a URL.)

I guess these won't be a problem for DOIs.

How would it work with a non-autonumber integer id setup, but where the offline inserter agent creates the numbers? (Can use a dedicated range of numbers, maybe?) Might have a problem with duplication should two users independently add the same url?

MaHuJa 2009-08-03 11:34:52

ansaurus

tags:

views:

answers:

Pros and cons of using md5 hash of URI as the primary key in a database

related questions