views: 1473
answers: 2
Assuming that I want to use a hash as an ID instead of a numeric value: would there be a performance advantage to storing it as BINARY over non-binary?

CREATE TABLE `test`.`foobar` (
  `id` CHAR(32) BINARY CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  PRIMARY KEY (`id`)
)
CHARACTER SET ascii;
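For illustration, the kind of 32-character hex ID such a CHAR(32) column would hold can be produced like this (a Python sketch using the standard hashlib module; the helper name make_id is my own, not from the question):

```python
import hashlib

def make_id(value: bytes) -> str:
    # MD5 digest rendered as 32 lowercase hex characters -- exactly the
    # shape of value an ascii CHAR(32) ID column would store.
    return hashlib.md5(value).hexdigest()

print(make_id(b"hash"))       # 32-character ASCII hex string
print(len(make_id(b"hash")))  # 32
```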
+1  A: 

From the manual:

The BINARY and VARBINARY types are similar to CHAR and VARCHAR, except
that they contain binary strings rather than non-binary strings. That is,
they contain byte strings rather than character strings. This means that
they have no character set, and sorting and comparison are based on the
numeric values of the bytes in the values.

Since the BINARY attribute on CHAR(32) causes the binary (_bin) collation of the column's character set to be used, sorting and comparison work on raw byte values. The benefit is that it will take less time to sort by that column, and probably less time to find corresponding rows if the column is indexed.

Allain Lalonde
Thanks for your answer. You wouldn't happen to have some figures on that?
Gumbo
+6  A: 

Yes. Often a hash digest is stored as the ASCII representation of hex digits, for example MD5 of the word 'hash' is:

0800fc577294c34e0b28ad2839435945

This is a 32-character ASCII string.

But MD5 actually produces a 128-bit binary hash value. Stored as raw binary instead of hex digits, it requires only 16 bytes. So you can gain some space efficiency by using binary strings.

CREATE TABLE test.foobar (
  id BINARY(16) NOT NULL PRIMARY KEY
);

INSERT INTO test.foobar (id) VALUES (UNHEX(MD5('hash')));
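The SQL above can be mirrored in Python (a sketch using the standard hashlib module) to show that the raw digest UNHEX produces is half the size of its hex representation:

```python
import hashlib

hex_digest = hashlib.md5(b"hash").hexdigest()  # what MD5('hash') returns in SQL
raw_digest = hashlib.md5(b"hash").digest()     # what UNHEX(MD5('hash')) stores

print(len(hex_digest))  # 32 hex characters -- needs CHAR(32)
print(len(raw_digest))  # 16 raw bytes -- fits BINARY(16)

# UNHEX is equivalent to decoding the hex string back into bytes:
assert raw_digest == bytes.fromhex(hex_digest)
```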


Re. your comments that you are more concerned about performance than space efficiency:

I don't know of any reason that the BINARY data type would be speedier than CHAR.

Being half as large can be an advantage for performance if you use cache buffers effectively. That is, a given amount of cache memory can store twice as many rows worth of BINARY data if the string is half the size of the CHAR needed to store the same value in hex. Likewise the cache memory for the index on that column can store twice as much.

The result is a more effective cache, because a random query has a greater chance of hitting the cached data or index, instead of requiring a disk access. Cache efficiency is important for most database applications, because usually the bottleneck is disk I/O. If you can use cache memory to reduce frequency of disk I/O, it's a much bigger bang for the buck than the choice between one data type or another.

As for the difference between a hash string stored in BINARY versus a BIGINT, I would choose BIGINT. The cache efficiency will be even greater, and also on 64-bit processors integer arithmetic and comparisons should be very fast.
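As a sketch of how a hash could be squeezed into a BIGINT, one option (my own assumption, not something the answer prescribes) is to keep only the first 8 bytes of the digest and read them as an unsigned 64-bit integer:

```python
import hashlib

def hash_to_bigint(value: bytes) -> int:
    # Truncate the 16-byte MD5 digest to its first 8 bytes and interpret
    # them as an unsigned 64-bit integer (fits MySQL's BIGINT UNSIGNED).
    digest = hashlib.md5(value).digest()
    return int.from_bytes(digest[:8], byteorder="big")

bigint_id = hash_to_bigint(b"hash")
print(bigint_id)
assert bigint_id < 2**64
```

Note that truncating discards half the digest, so the collision space shrinks from 2^128 to 2^64.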

I don't have measurements to support the claims above. The net benefit of choosing one data type over another depends a lot on data patterns and types of queries in your database and application. To get the most precise answer, you must try both solutions and measure the difference.


Re. your supposition that binary string comparison is quicker than default case-insensitive string comparison, I tried the following test:

mysql> SELECT BENCHMARK(100000000, 'foo' = 'FOO');
1 row in set (5.13 sec)

mysql> SELECT BENCHMARK(100000000, 'foo' = BINARY 'FOO');
1 row in set (4.23 sec)

So binary string comparison is 17.5% faster than case-insensitive string comparison. But notice that after evaluating this expression 100 million times, the total difference is still less than 1 second. While we can measure the relative difference in speed, the absolute difference in speed is really insignificant.
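What the binary comparison gets to skip can be illustrated outside MySQL as well (a Python analogy, not the server's actual implementation): a case-insensitive match needs an extra case-folding step before the bytes are compared, whereas a binary match compares the bytes directly.

```python
a, b = "foo", "FOO"

# Case-insensitive comparison, analogous to MySQL's default collation:
print(a.casefold() == b.casefold())  # True

# Byte-for-byte comparison, analogous to BINARY / a _bin collation:
print(a == b)                        # False
print(a.encode() == b.encode())      # False
```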

So I'll reiterate:

  • Measure, don't guess or suppose. Your educated guesses will be wrong a lot of the time. Measure before and after every change you make, so you know how much it helped.
  • Invest your time and attention where you get the greatest bang for the buck.
  • Don't sweat the small stuff. Of course, a tiny difference adds up with enough iterations, but given those iterations, a performance improvement with greater absolute benefit is still preferable.
Bill Karwin
I’ve already thought of that. It would save me half of the space, although 2^128 is way more than enough (BIGINT has just 2^64 and even that’s already enough).
Gumbo
I’ve updated my question as you requested. It would be nice if you took another look at it.
Gumbo
I thought that the binary comparison would be faster since there are no additional comparisons of upper- and lower-case letters like a normal comparison does, so that “'foo'='FOO'” is true but “BINARY 'foo'='FOO'” is false.
Gumbo
Thanks for your reply. I’d forgotten about that. And thanks for your hint about benchmarking; I hadn’t thought of that either. I think I’ll build two test cases and measure it myself, although I’ll definitely go with the 16-byte BINARY variant, or even shorten it to 8 bytes.
Gumbo
An MD5 hash is always 16 bytes. If you shorten the storage to 8 bytes you'll be storing a truncated hash value. If that's okay, go ahead.
Bill Karwin