We all know that for relational databases it is best practice to use numerical IDs for the primary key.

In CouchDB, the default generated ID is a UUID. Is it best to stick with the default, or to use an easily memorable identifier that the user will see and use in the application?

For example, if you were designing the stackoverflow.com database in CouchDB, would you use the question slug (e.g. what-is-best-practice-when-creating-document-ids-in-couchdb) or a UUID for each document?

A: 

The primary key in a DB should never have any "meaning" except maybe to encode sequence. You might later want to change the slug, but you should never have to change the primary key.

There might be a good argument for using something that starts with a timestamp, so that your keys have an inherent ordering. I often use "%f@%s" % (time(), hostname()) to get ordered, unique keys. (This works only if your time() implementation never returns the same value twice.)
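
A minimal Python sketch of that format string, assuming time() means time.time() and hostname() means socket.gethostname() (the answer does not say which functions it uses). The name make_ordered_id and its optional arguments are made up for illustration:

```python
import socket
import time


def make_ordered_id(now=None, host=None):
    """Build an ID of the form '<timestamp>@<hostname>'.

    `now` and `host` are hypothetical parameters added so the function
    can be called with fixed values; by default they come from the system.
    """
    now = time.time() if now is None else now
    host = socket.gethostname() if host is None else host
    # Same format string as in the answer: "%f" renders six decimal places.
    return "%f@%s" % (now, host)


if __name__ == "__main__":
    print(make_ordered_id())
```

Note that the string order of these IDs matches time order only while the integer part of the timestamp has the same number of digits, and uniqueness still depends on the clock never returning the same value twice on one host.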

For other stuff (e.g. images), where I want to avoid duplicates, I often use sha(data) as the key.
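
A rough sketch of a content-derived key. The answer only says sha(data), so the choice of SHA-256 here is an assumption, and content_id is a made-up name:

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return a hex digest of the raw bytes, usable as a document ID.

    SHA-256 is an assumption; the answer does not name a specific SHA variant.
    """
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    first = content_id(b"same image bytes")
    second = content_id(b"same image bytes")
    # Identical content maps to the same ID, so a second upload of the
    # same bytes would target the same document instead of a new one.
    print(first == second)
```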

mdorseif
+6  A: 

I'm no CouchDB expert, but after doing a little research, this is what I've found.

The simple answer is, use UUIDs unless you have a good reason not to.

The longer answer is, it depends on:

Cost of changing ID Vs How likely the ID is to change

Low cost of changing and likely to change ID

An example of this might be a blog with a denormalized design, such as jchris' blog (the Sofa code is available on GitHub).

Every time another website links to a blog post, this is another reference to the id, so the cost of changing the id increases.

High cost of changing ID and an ID that will never change

An example of this is any DB design that is highly normalized that uses auto-increment IDs. Stackoverflow.com is a good example with its auto-incrementing question IDs that you see in every URL. The cost of changing the ID is extremely high since every foreign key would need to be updated.

How many references, or "foreign keys" (in relational DB language) will there be to the id?

Any "foreign keys" will greatly increase the cost of changing the ID. Having to update other documents is a slow operation and definitely should be avoided.

How likely is the ID to change?

If you don't want to use UUIDs, you probably already have an idea of what ID you want to use.

If that ID is likely to change, make sure the cost of changing it is low. If the cost is not low, pick a different ID.

What is your motivation for wanting to use an easily memorable ID?

Don't say performance.

Benchmarks show that "CouchDB’s view key lookups are almost, but not quite, as fast as direct document lookups". This means that having to do a search to find a record is no big deal. Don't choose friendly ids just because you can do a direct lookup on a document.
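
To make the two kinds of lookup concrete, here is a small sketch that only builds the request URLs and sends nothing. The database name, design document name, view name, and the idea that the view emits the slug as its key are all hypothetical; the URL shapes follow CouchDB's HTTP API, where a view's key parameter is JSON-encoded:

```python
import json
from urllib.parse import quote, urlencode

BASE = "http://localhost:5984"  # CouchDB's default port; adjust as needed


def doc_url(db, doc_id):
    """URL for a direct document lookup by _id."""
    return "%s/%s/%s" % (BASE, quote(db, safe=""), quote(doc_id, safe=""))


def view_lookup_url(db, design_doc, view, key):
    """URL for a view lookup by key (the key is sent as JSON)."""
    query = urlencode({"key": json.dumps(key)})
    return "%s/%s/_design/%s/_view/%s?%s" % (BASE, db, design_doc, view, query)


if __name__ == "__main__":
    # Direct lookup when the slug itself is the _id.
    print(doc_url("questions", "my-slug"))
    # View lookup when the _id is a UUID and a (hypothetical) view is keyed by slug.
    print(view_lookup_url("questions", "lookup", "by_slug", "my-slug"))
```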

Will you be doing many bulk inserts?

If so, it is better to use incremental UUIDs for better performance.

See this post about bulk inserts. Damien Katz comments and says:

"If you want to have the fastest possible insert times, you should give the _id's ascending values, so get a UUID and increment it by 1, that way it's always inserting in the same place in the index, and being cache friendly once you are dealing with files larger than RAM. For an easier way to do the same thing, just sequentially number the documents but make it fixed length with padding so that they sort correctly, "0000001" instead of "1" for example."

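Both suggestions in the quote can be sketched in a few lines of Python. The function names and the default width of 7 are illustrative choices, and the UUID is assumed to be a 32-character hex string:

```python
def padded_id(n, width=7):
    """Sequential number as a fixed-length, zero-padded string."""
    return str(n).zfill(width)


def next_hex_id(hex_id):
    """Increment a hex ID by 1, keeping its length.

    Overflow (an ID made entirely of 'f') is not handled in this sketch.
    """
    return format(int(hex_id, 16) + 1, "0%dx" % len(hex_id))


if __name__ == "__main__":
    numbers = (1, 2, 10, 100)
    # Padded IDs sort in numeric order; unpadded strings do not.
    print(sorted(padded_id(n) for n in numbers))
    print(sorted(str(n) for n in numbers))
    print(next_hex_id("0" * 31 + "f"))
```

With padding, string order and numeric order agree, which is what keeps new inserts landing at the end of the index.
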
andyuk
This answer seems predicated on the notion that conflict avoidance is always desirable; however, sometimes conflicts are a natural part of the problem domain, and rather than simply being avoided, they should be proactively detected and resolved. In such cases, a natural ID is an excellent choice. For example, don't use the title of a blog post as an ID on a massively multi-user system, but do use the fully qualified domain name and IP address when modeling DNS address records.
+3  A: 

The _id is used a lot in the CouchDB internals and any extra hashing cost is going to slow down a bunch of the internals so it's best to stick with the UUID provided.

mikeal
I'm confused. What do you mean by "extra hashing cost"? Are you saying a user-generated ID will end up hashed, internally, whereas an auto-generated UUID will not?