I have a url column with a unique key on it, but its performance on updates is absolutely atrocious. I suspect that's because the index doesn't fit entirely in memory.

So I was thinking: how about adding a column holding md5(url) as 16 bytes of binary data, and unique-keying that instead?

What would be the best datatype for that? I'd love to be able to just see the 32-character hex hash while MySQL converts it to/from 16 binary bytes and indexes those, since programs using the database might have trouble with arbitrary binary data, which I'd rather avoid if possible. I'm also a bit afraid that MySQL might get some strange ideas about character sets and, for example, over-allocate storage for the column by 3:1 because it thinks it might need utf8. How do I avoid that for sure?

It seems one solution would be binary(16) null for storage, unhex(md5(url)) for insertion/comparison, and hex(url_hash) for retrieval (not that it really needs retrieval; there will be an unindexed url column there anyway). Is this the best way?
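For concreteness, here is roughly what I have in mind (table and column names are purely illustrative):

-- the BINARY type sidesteps character sets entirely, so no utf8 over-allocation
CREATE TABLE urls (
  url      TEXT       NOT NULL,  -- full URL, stored but not indexed
  url_hash BINARY(16) NULL,      -- raw 16 bytes of md5(url)
  UNIQUE KEY (url_hash)
) ENGINE=InnoDB;

-- insertion: store the hash as raw bytes
INSERT INTO urls (url, url_hash)
VALUES ('http://example.com/', UNHEX(MD5('http://example.com/')));

-- comparison/retrieval goes through the hash column
SELECT HEX(url_hash), url
FROM urls
WHERE url_hash = UNHEX(MD5('http://example.com/'));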

A: 

The index is probably already using a hash, in a more efficient way than your hand-made MD5 solution would be.

Nicolás
+3  A: 

MD5 is not guaranteed unique, so you cannot create a unique index on it unless your business model permits you to flat out deny insertions and updates in case of collision. Is this the case? I am asking because working around collisions (no matter how unlikely) will prove extremely complex from a performance standpoint.
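To make the cost of such a workaround concrete, here is a sketch (table and column names are only illustrative, continuing the hypothetical url_hash BINARY(16) column with a unique key on it): any duplicate-key error on the hash still has to be disambiguated by comparing the full URL.

-- a duplicate key on url_hash does not yet prove the URL is a duplicate;
-- you must compare the stored url to tell a real duplicate from a collision
SELECT url = 'http://example.com/' AS same_url
FROM urls
WHERE url_hash = UNHEX(MD5('http://example.com/'));
-- same_url = 1: genuine duplicate; same_url = 0: an MD5 collision that needs special handling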

In any case, I find it hard to believe (though it may yet turn out to be true) that a properly structured query, properly planned by MySQL to use the proper index (even over 500M rows), would have to suffer from atrocious performance; then again, it is hard to tell without knowing what your query looks like and what your numbers are.

If I were you, before even considering a workaround (such as the MD5 approach) to existing index lookup I'd make absolutely sure of where my problem really lies:

  • use EXPLAIN to confirm that your UPDATE statement is indeed using the proper index
    • you cannot EXPLAIN an UPDATE statement, but you can EXPLAIN its equivalent SELECT statement (you essentially care about the WHERE clause, JOINs etc.; see the sketch right after this list)
    • even with 500M rows, a btree index should only require a handful of pages per matching row
      • how many rows do you expect each one of your UPDATE statements to update? how many rows are actually updated?
      • do you have additional conditions in your WHERE clause in addition to url=? The planner may pick the less selective index first and thrash your cache -- find out from the EXPLAIN plan
    • when you actually run (not EXPLAIN) them: is UPDATE systematically slower than its corresponding SELECT? you may be having a write bottleneck, possibly due to locking issues. how many sessions are active at the time of a slow UPDATE? how many indices defined on your table incorporate the url column?
    • have you analyzed your table recently?
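To illustrate the UPDATE-to-SELECT trick mentioned above (table, column and literal values are only placeholders):

-- the UPDATE you actually run
UPDATE urls
SET    last_seen = NOW()
WHERE  url = 'http://example.com/';

-- it cannot be EXPLAINed directly, but its access path is the same as
-- that of the equivalent SELECT, which can be:
EXPLAIN SELECT *
FROM   urls
WHERE  url = 'http://example.com/';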

So anyway, before proceeding, please let us know:

  • are you doing bulk UPDATEs? how many UPDATEs per second (or how many milliseconds per UPDATE) would meet your performance requirements?
  • how many sessions are active at the time of the UPDATE?
  • have you analyzed your table? (see the note after this list)
  • what would be a sample UPDATE query? (please provide specific values for its parameters)
  • what is the explain plan for the corresponding SELECT? (using the same specific values)
  • how long does the corresponding SELECT (using the same specific values) actually take to complete when executed (not EXPLAINed), and which row(s) did it actually return?
  • how long does the actual UPDATE (using the same specific values) take when executed? (not EXPLAINed)
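For reference, the analyze step mentioned above is a single statement (table name illustrative):

ANALYZE TABLE urls;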

Cheers, V.

vladr
The chance of a random collision is acceptably small: by the birthday paradox, one collision is expected around 2**64 records, so that's just not going to happen. There is no prefix common to all URLs. EXPLAIN only works with SELECTs, not UPDATEs, so it's not helpful. SHOW TABLE STATUS says there's 122GB of data + 119GB of indexes in that table. It's a memory issue, not a btree comparison issue. The other indexes are over number/date/etc. fields, so they should be small, but how do I find the sizes of particular indexes just to be sure?
taw
`UPDATE ... SET ... WHERE ...` maps to `SELECT ... FROM ... WHERE ...` for `EXPLAIN` purposes. Even with 500M rows, if you're using an index you should not need more than 30 or so pages per updated row.
vladr
A: 

I'm not familiar with MySQL specifically, but my guess is that the unique index is a clustered index (meaning the data pages are ordered by it). When you update, you cause a reorganization of the entire table.

If you can move the clustered index to some stable value, then that should solve your problem.
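I don't know the exact MySQL syntax for this, but the general idea would be something like the following sketch (names illustrative): give the table a small surrogate key to act as the clustered index, so the unique constraint on url becomes an ordinary secondary index.

-- illustrative only: add a stable surrogate key as the primary (clustered) key
ALTER TABLE urls
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (id);
-- the uniqueness of url is then enforced by a secondary index instead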

Mark Brackett
In a comment he said the table has 122GB of data + 119GB of indexes, which probably means that the URL data (probably close to 119GB worth) is indexed separately from the rest of the table.
Gabe
A: 

If you are only using the index to guarantee uniqueness and not for retrieval, it's probably a win to use an MD5 in a binary(16) not null column. That way you could potentially have hundreds of keys in an index page, reducing the number of disk seeks for each insert.
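A possible migration to that layout might look roughly like this (column names are just placeholders; add the column first, backfill it, then make it NOT NULL and unique):

ALTER TABLE urls ADD COLUMN url_md5 BINARY(16) NULL;
UPDATE urls SET url_md5 = UNHEX(MD5(url));
ALTER TABLE urls
  MODIFY url_md5 BINARY(16) NOT NULL,
  ADD UNIQUE KEY (url_md5);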

An alternative is to use compression in your table by creating it like this:

-- note: the compressed row format needs the InnoDB plugin with
-- innodb_file_per_table=1 and innodb_file_format=Barracuda
CREATE TABLE foo (url varchar(255)) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;

Since URLs should compress pretty well, it could be just as big of a win as your hash idea and not require any extra code on your part.

Here's the InnoDB reference on compression: http://www.innodb.com/doc/innodb_plugin-1.0/innodb-compression.html

Gabe