ansaurus

Question

What would the most efficient index type and table engine be for md5 lookups?

Answer 1

+1 A:

well md5 hashes have 128bit binary. its common to write them in hexa-decimals of 32 digits. so going for any char field and store the hexa decimal string (e.g. char 32) would be just stupid, only simple. you could go for two combined bigint 64 unsigned, which would be good if you needed some sort of sorting - which you don't. so the winner is: binary(16) ... which is exactly 128 and exactly what you need.

now which index should you use? thats a tough one. theoretically if you have solely and exclusevly only equality operators, you can be faster with hash indexes. but the thing is that btree is almost exclusively used and you cant even define hash in innodb any more. the implementations of hash may be sloppy. and theres relly not much difference. btree is more reliably.

i would worry more about the database engine. myisam generally performs faster because it lacks certain functions innodb has (such as rollback...), but it has only table locking. inndbo can do row locking and if you have a lot of updates and writes it will probably perform better.

okay... so far so good. now i'd like to suggest thinking about using something different than md5. why exactly do you need it? may it be possible to index a crc sum or something that is smaller? i guess you are indxing files and check them for existance etc...

and finally. i would consider sharding your database! sharding is mostly a touth thing and a measure of last resort, but in this case it could be pretty easy.

eveyrthing that ends with 00 goes to server 1, 01-> server 2, 10->3, 11->4 etc (use modulo arithmetic for that, its the fastest!) and so on... if you now check for an md5 hash in the database you exactly know which server to look on and vice versa where to store it! then you can split your databsae to as many servers as you like, you don't even need to replicate them any further and in this way you are eliminating any bottleneck...

well it of course depends on your application i dunno what additional data may be linked :)

Joe Hopfgartner 2010-08-21 02:25:40

Answer 2

A:

You have worry about the re-balancing of BTree index, this implies you have frequent inserts or updates, so you should avoid MyISAM (due to table level locking).
BTree is the only supported index type for MyISAM/InnoDB, you really don't have too much option. If you are going with InnoDB, make sure the hash is NOT the primary key (due to clustered index)

tszming 2010-08-21 05:33:05

"If you are going with InnoDB, make sure the hash is NOT the primary key (due to clustered index)" ... can you explain to me whats wrong with that? to my knowledge the dbms just trys to store data with keys inclose proximity as well in close proximity. this is of no use in this case but i don't think it will make it notable slower? i would say having a second column indexed would be a greater performance loss (two indexes caluclated, stored etc)... but i may be wrong..

Joe Hopfgartner 2010-08-21 13:40:25

InnoDB use clustered index, rows are physically stored in the index's leaf page. Since MD5 (or UUID) result in non sequential values, so insertions are random which lead to very poor insert performance.

tszming 2010-08-21 15:32:50

ansaurus

tags:

views:

answers:

What would the most efficient index type and table engine be for md5 lookups?

related questions