views:

1635

answers:

6

I'm building a database that will store information on a range of objects (such as scientific papers, specimens, DNA sequences, etc.) that all have a presence online and can be identified by a URL, or an identifier such as a DOI. Using these GUIDs as the primary key for the object seems a reasonable idea, and I've followed delicious and Connotea in using the md5 hash of the GUID. You'll see the md5 hash in your browser status bar if you mouse over the edit or delete buttons in a delicious or Connotea book mark. For example, the bookmark for http://stackoverflow/ is

http://delicious.com/url/e4a42d992025b928a586b8bdc36ad38d

where e4a42d992025b928a586b8bdc36ad38d ais the md5 hash of http://stackoverflow/.

Does anybody have views on the pros and cons of this approach?

For me an advantage of this approach (as opposed to using an auto incrementing primary key generated by the database itself) is that I have to do a lot of links between objects, and by using md5 hashes I can store these links externally in a file (say, as the result of data mining/scraping), then import them in bulk into the database. In the same way, if the database has to be rebuilt from scratch, the URLs to the objects won't change because they use the md5 hash.

I'd welcome any thoughts on whether this sounds sensible, or whether there other (better?) ways of doing this.

A: 

Maybe this document is something you want to read:

http://www.hpl.hp.com/techreports/2002/HPL-2002-216.pdf

MysticSlayer
Interesting read, but slightly different problem. Furthermore, in my case the content may well be edited, but I'd want to keep the URI unchanged. The approach in the HP tech report would require me to change the URI with each edit.
rdmpage
A: 

Often lots of different urls point to the same page. http://example.com/ example.com http://www.example.com/ http://example.com/index.html http://example.com/. https://example.com/ etc.

This might or might not be a problem for you.

I do indeed have multiple identifiers for the same object (articles often have more than one presence on the web). That's not a huge problem, I'm not relying on people generating the hash to find out whether the URI exists in my database.
rdmpage
A: 

After browsing stackoverfow a little more I found an earlier question Advantages and disadvantages of GUID / UUID database keys which covers much of this ground.

rdmpage
+1  A: 

Multiple strings can produce the same md5 hash. Primary keys must be unique. So using the hash as the primary key is not good. Better is to use the GUID directly.

Is a GUID suitable for use in a URL. Sure. Here's a GUID (actually, a UUID) I jsut created using Java: 1ccb9467-e326-4fed-b9a7-7edcba52be84

The url could be:

http://example.com/view?id=1ccb9467-e326-4fed-b9a7-7edcba52be84

It's longish but perfectly usable and achieves what you describe.

Steve McLeod
But if I want to use the primary key in a URL I'll need to reformat it, as opposed to just using the md5 hash. I'm looking to keep things simple.
rdmpage
Why would you need to reformat it? The Internet Archive uses URLs verbatim, as substrings of the URLs of archived pages.
Hugh Allen
rdmpage -- Md5 is not unique. Hence, it's not fit for a primary key.
MichaelGG
But if the probability of a collision is low, then this isn't a big issue. Plus, unlike a UUID I can guarantee that I can regenerate the same md5 hash from the same URI, which is useful if the GUID is generated at different times, for example. UUIDs seem to address a different problem.
rdmpage
Hugh -- But some URIs are very messy, may contain parts that are url encoded and parts that aren't, etc, making life difficult. A hash seems a lot easier to manage.
rdmpage
+1  A: 

It's perfectly fine. Accidental collision of MD5 is practically impossible (to get a 50% chance of collision you'd have to hash 6 billion URLs per second, every second for 100 years).

porneL
Exactly. You still need to be aware that deliberate MD5 hash collisions are *easy* to do. If this is going to be a problem you would be better off using SHA-1 or some other newer hash function that isn't as broken as MD5.
paranoidgeek
A: 

MD5 is considered deprecated - at least for cryptographic purposes, but I would suggest only using md5 for backwards compatibility with existing stuff. You should have a good reason to go with md5 when we do have other hash algos out there that aren't (at least yet) broken.

Problems I see with the approach:

  • Duplicate objects, because the url identifier is different (As arend mentioned)
  • URLs changing

The latter being the one that might be important - this could be done as simply as a remove and an add. That is, if these ids are never visible/storable outside the database. (Like as a component of a URL.)

I guess these won't be a problem for DOIs.


How would it work with a non-autonumber integer id setup, but where the offline inserter agent creates the numbers? (Can use a dedicated range of numbers, maybe?) Might have a problem with duplication should two users independently add the same url?

MaHuJa