I have a table that contains a few columns and one of them is an md5 hash which is a unique key in the table.

What would be the most efficient engine and index type (hash/B-tree) for determining whether a hash already exists in the table? I expect to have billions of rows across 200 partitions (MySQL 5.1).

Right now I have it as MyISAM with a unique B-tree index on that hash column; however, I worry about the constant rebalancing of a B-tree with random hashes being inserted constantly.

pseudocode:

if hash not in table:
  process
else:
  skip, record already exists
+1  A: 

Well, MD5 hashes are 128 bits of binary data. It's common to write them as 32 hexadecimal digits, so going for a char field and storing the hexadecimal string (e.g. CHAR(32)) would be simple but wasteful: it takes at least twice the space. You could go for two combined BIGINT UNSIGNED (64-bit) columns, which would be good if you needed some sort of sorting, which you don't. So the winner is BINARY(16), which is exactly 128 bits and exactly what you need.
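
To make the size difference concrete, here is a minimal Python sketch. Python and the sample input are assumptions chosen only for illustration; nothing in the question says which client language is in use. It shows that the raw digest is 16 bytes (what BINARY(16) would hold), the hex form is 32 characters, and converting between them is lossless.

```python
import hashlib

# Sample input is made up for illustration.
data = b"example record"

md5 = hashlib.md5(data)
raw = md5.digest()         # 16 raw bytes, fits a BINARY(16) column
hex_str = md5.hexdigest()  # 32 hex characters, what CHAR(32) would store

print(len(raw), len(hex_str))

# Converting between the two forms loses nothing.
assert bytes.fromhex(hex_str) == raw
assert raw.hex() == hex_str
```

So the application can keep showing hex strings to humans and still pass the raw 16 bytes to the database.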

Now, which index should you use? That's a tough one. Theoretically, if you use solely equality operators, you can be faster with hash indexes. But the thing is that B-tree is used almost exclusively, and you can't even define a hash index on InnoDB or MyISAM (MySQL only offers them for the MEMORY and NDB engines). The implementations of hash may be sloppy, and there's really not much difference. B-tree is more reliable.

I would worry more about the database engine. MyISAM generally performs faster because it lacks certain features InnoDB has (such as rollback), but it has only table locking. InnoDB can do row locking, and if you have a lot of updates and writes it will probably perform better.

Okay, so far so good. Now I'd like to suggest thinking about using something other than MD5. Why exactly do you need it? Might it be possible to index a CRC sum or something else that is smaller? I guess you are indexing files and checking them for existence, etc.

And finally, I would consider sharding your database! Sharding is usually a tough thing and a measure of last resort, but in this case it could be pretty easy.

Everything whose hash ends with the bits 00 goes to server 1, 01 -> server 2, 10 -> server 3, 11 -> server 4, and so on (use modulo arithmetic for that, it's the fastest!). When you check for an MD5 hash in the database, you know exactly which server to look on, and likewise where to store it. Then you can split your database across as many servers as you like; you don't even need to replicate them any further, and this way you avoid a single-server bottleneck.
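
A rough sketch of that routing in Python, assuming the hash is available as the raw 16-byte digest. The function name `shard_for` and the shard count of 4 are hypothetical, picked only to match the example above.

```python
import hashlib

def shard_for(digest: bytes, num_shards: int) -> int:
    """Return a 0-based shard index for a raw MD5 digest (hypothetical helper)."""
    # Treat the digest as one big-endian integer and reduce it modulo the shard count.
    return int.from_bytes(digest, "big") % num_shards

# Sample input is made up for illustration.
digest = hashlib.md5(b"example record").digest()
print(shard_for(digest, 4))

# With 4 shards the modulo is just the last two bits of the hash.
assert shard_for(digest, 4) == digest[-1] & 0b11
```

The index is 0-based, so "server 1" in the text corresponds to index 0. Keep in mind that changing the number of shards later changes where most hashes map, so existing rows would have to be moved.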

Well, of course it depends on your application; I don't know what additional data may be linked :)

Joe Hopfgartner
A: 
  1. You are worried about the rebalancing of the B-tree index, which implies frequent inserts or updates, so you should avoid MyISAM (due to table-level locking).

  2. B-tree is the only supported index type for MyISAM/InnoDB, so you really don't have much choice. If you are going with InnoDB, make sure the hash is NOT the primary key (due to clustered index)

tszming
"If you are going with InnoDB, make sure the hash is NOT the primary key (due to clustered index)" ... can you explain what's wrong with that? To my knowledge the DBMS just tries to store rows with keys in close proximity physically close together as well. That is of no use in this case, but I don't think it will make it notably slower? I would say having a second indexed column would be a greater performance loss (two indexes calculated, stored, etc.), but I may be wrong.
Joe Hopfgartner
InnoDB uses a clustered index: rows are physically stored in the index's leaf pages. Since MD5 (or UUID) values are non-sequential, insertions land at random positions, which leads to very poor insert performance.
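
A toy illustration of the difference in Python. It is an assumption-laden model: a plain sorted list stands in for key order in the clustered index, and it says nothing about InnoDB's actual page handling. It counts how many inserts land at the very end of the sorted order for sequential keys versus MD5 keys.

```python
import bisect
import hashlib

def count_appends(keys):
    """Insert keys into a sorted list; count how many land at the very end."""
    ordered = []
    appends = 0
    for key in keys:
        pos = bisect.bisect_left(ordered, key)
        if pos == len(ordered):
            appends += 1
        ordered.insert(pos, key)
    return appends

n = 1000
# Auto-increment style keys: each new key sorts after every earlier one.
sequential_keys = [i.to_bytes(8, "big") for i in range(n)]
# MD5 keys: effectively random positions in the sorted order.
md5_keys = [hashlib.md5(str(i).encode()).digest() for i in range(n)]

sequential_appends = count_appends(sequential_keys)
md5_appends = count_appends(md5_keys)
print(sequential_appends, md5_appends)
```

Every sequential key is appended at the end, while almost every MD5 key has to go somewhere in the middle. In a clustered index that means touching pages all over the table rather than only the last one.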
tszming