views:

593

answers:

4

We're considering using UUID values as primary keys for our MySQL database. The data being inserted is generated from dozens, hundreds, or even thousands of remote computers and being inserted at a rate of 100-40,000 inserts per second, and we'll never do any updates.

The database itself will typically get to around 50M records before we start to cull data, so not a massive database, but not tiny either. We're also planing to run on InnoDB, though we are open to changing that if there is a better engine for what we're doing.

We were ready to go with Java's Type 4 UUID, but in testing have been seeing some strange behavior. For one, we're storing as varchar(36) and I now realize we'd be better off using binary(16) - though how much better off I'm not sure.

The bigger question is: how badly does this random data screw up the index when we have 50M records? Would we be better off if we used, for example, a type-1 UUID where the leftmost bits were timestamped? Or maybe we should ditch UUIDs entirely and consider auto_increment primary keys?

I'm looking for general thoughts/tips on the performance of different types of UUIDs when they are stored as an index/primary key in MySQL. Thanks!

+5  A: 

A UUID is a Universally Unique ID. It's the universally part that you should be considering here.

Do you really need the IDs to be universally unique? If so, then UUIDs may be your only choice.

I would strongly suggest that if you do use UUIDs, you store them as a number and not as a string. If you have 50M+ records, then the saving in storage space will improve your performance (although I couldn't say by how much).

If your IDs do not need to be universally unique, then I don't think that you can do much better then just using auto_increment, which guarantees that IDs will be unique within a table (since the value will increment each time)

Dancrumb
Our reason for considering UUIDs is because in some situations we will have 1000+ machines dumping data in, and I didn't want us getting blocked on central ID generation - though perhaps I'm being pennywise, pound foolish :)
Patrick Lightbody
Interesting point; this would parallelize the generation of the keys. I believe that this would increase the performance of key generation.However, you're choosing INSERT performance over SELECT performance if you use VARCHAR for storing the UUID. You most definitely should choose VARBINARY for storing to ensure SELECT performance. The extra step *may* impact INSERT performance, but you'll be paid off with the SELECT performance improvement.
Dancrumb
We ended up doing some benchmarking on real data and GUIDs w/o keys was pretty fast, GUIDs w/ keys was horrible (even when stored as BINARY), and int w/ AUTO_COMPLETE was the fastest. I think in our case, we were indeed missing the forest from the trees, as the sequence generation seemed inconsequential compared to the cost of storing more data + having a really crappy BTREE due to the randomness of the GUIDs
Patrick Lightbody
+1  A: 

Something to take into consideration is that Autoincrements are generated one at a time and cannot be solved using a parallel solution. The fight for using UUIDs eventually comes down to what you want to achieve versus what you potentially sacrifice.

On performance, briefly:

A UUID like the one above is 36 characters long, including dashes. If you store this VARCHAR(36), you're going to decrease compare performance dramatically. This is your primary key, you don't want it to be slow.

At its bit level, a UUID is 128 bits, which means it will fit into 16 bytes, note this is not very human readable, but it will keep storage low, and is only 4 times larger than a 32-bit int, or 2 times larger than a 64-bit int. I will use a VARBINARY(16) Theoretically, this can work without a lot of overhead.

I recommend reading the following two posts:

I reckon between the two, they answer your question.

Kyle Rozendo
Actually, I read both those articles prior to posting this question, and I still didn't have a good answer here. For example, neither talk about type 1 vs type 4 UUIDS :(
Patrick Lightbody
Fair that, I updated my answer a touch. I don't think it provides too much extra insight however.
Kyle Rozendo
@Patrick: you put too many different topics into your question.
hop
@Hop it's a complex topic :)
Patrick Lightbody
A: 

What about some hand crafted UID? Give each of the thousands of servers an ID and make primary key a combo key of autoincrement,MachineID ???

MindStalker
I've thought about that and might need to run some benchmarks. Even a temporary local sequence on each of the 1000 machines, combined with timestamp, might be a enough. Ex: machine_id + temp_seq + timestamp
Patrick Lightbody
Is it possible to have a temp_sequence that resets every timestamp tick? I'm not sure.
MindStalker
A: 

Since the primary key is generated decentralised, you don't have the option of using an auto_increment anyway.

If you don't have to hide the identity of the remote machines, use Type 1 UUIDs instead of UUIDs. They are easier to generate and can at least not hurt the performance of the database.

The same goes for varchar (char, really) vs. binary: it can only help matters. Is it really important, how much performance is improved?

hop