ansaurus

Question

Use of MD5(URL) instead of URL in DB for WHERE.

Answer 1

A:

I think CRC32 would actually be better for this role: it's shorter, so it takes less space in the table and in its index. If you're receiving that many queries, isn't saving space the goal anyway? If it does the job, I'd say go for it.

That said, since it's only 32 bits long, it is far more likely to collide than MD5. You'll have to decide whether you want uniqueness or want to save space.

I still think I'd choose CRC32.

My system generates roughly 4k queries per second, and I use CRC32 for links.

Homework 2009-09-08 17:03:07

You can always store the full url in a separate column and ask MySQL to compare both: same CRC32 and same full URL.
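A minimal sketch of that suggestion, using Python's built-in `sqlite3` and `zlib` modules so it runs without a server. The table name `links`, the column names, and the helper functions are assumptions made for illustration; on MySQL the checksum could instead be computed with its `CRC32()` function.

```python
import sqlite3
import zlib

def url_crc(url: str) -> int:
    """Return the unsigned 32-bit CRC of the URL's UTF-8 bytes."""
    return zlib.crc32(url.encode("utf-8"))

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a short integer checksum next to the full URL.
conn.execute("CREATE TABLE links (url_crc INTEGER NOT NULL, url TEXT NOT NULL)")
# Only the small checksum column is indexed.
conn.execute("CREATE INDEX idx_links_crc ON links (url_crc)")

def add_link(url: str) -> None:
    conn.execute(
        "INSERT INTO links (url_crc, url) VALUES (?, ?)", (url_crc(url), url)
    )

def link_exists(url: str) -> bool:
    # The checksum narrows the search through the index; the URL
    # comparison rejects rows that merely share the same CRC32.
    row = conn.execute(
        "SELECT 1 FROM links WHERE url_crc = ? AND url = ? LIMIT 1",
        (url_crc(url), url),
    ).fetchone()
    return row is not None

add_link("http://example.com/a")
```

With this layout a CRC32 collision costs an extra string comparison on the few colliding rows, but it cannot produce a false match.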

too much php 2009-09-09 02:33:19

Will try this, thanks! :P

Homework 2009-09-09 18:24:06

Answer 2

+4 A:

Create a non-clustered index on URL. That will let your SQL engine do all the optimization internally and will produce the best results!

If you create an index on a VARCHAR column, the engine builds its own lookup structure for it (usually a B-tree; some engines also offer hash indexes), and using the index can improve performance by an order of magnitude or more!

Also, something to keep in mind if you're only checking whether a URL exists, is that certain SQL products will produce faster results with a query like this:

IF NOT EXISTS(SELECT * FROM `tablename` WHERE url='')
    -- return TRUE or do your logic here
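As a rough illustration of the index-plus-EXISTS approach, here is a sketch using Python's built-in `sqlite3` module. The table `pages`, the index name, and the helper `url_exists` are assumptions for the example, and SQLite stands in for whatever engine is actually in use, so the query planner's behaviour may differ elsewhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding one row per stored URL.
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT NOT NULL)")
# Ordinary secondary index on the URL column.
conn.execute("CREATE INDEX idx_pages_url ON pages (url)")
conn.executemany(
    "INSERT INTO pages (url) VALUES (?)",
    [("http://example.com/a",), ("http://example.com/b",)],
)

def url_exists(url: str) -> bool:
    # EXISTS lets the engine stop at the first matching row
    # instead of collecting or counting every match.
    (found,) = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM pages WHERE url = ?)", (url,)
    ).fetchone()
    return bool(found)

# Ask SQLite how it would run the lookup; the plan text should name the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT 1 FROM pages WHERE url = ?",
    ("http://example.com/a",),
).fetchall()
```

Checking the query plan like this is a quick way to confirm the index is actually used before reaching for a hand-rolled hash column.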

Miky Dinescu 2009-09-08 17:08:03

I thought "non-clustered" was SQL Server terminology - shouldn't that read as just being an index?

OMG Ponies 2009-09-08 17:12:18

non-clustered indexes are "virtual" indexes on the data, whereas clustered indexes are physical indexes on the data. You can only have one clustered index per table, while you can have multiple non-clustered indexes on the same table

Miky Dinescu 2009-09-08 17:15:38

Agreed, an NC index would give the same or similar performance as adding an MD5 or other hash. If you have a high ratio of tablename records per URL, I would consider a two-table structure, where unique URLs are maintained in, say, tblUrls and tablename stores only the corresponding key. This may slightly slow down your inserts, but it also reduces storage requirements and has a few other benefits, depending on the underlying application.
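One way the two-table layout might look, sketched with Python's built-in `sqlite3` module. The names `tblUrls` and `tablename` come from the comment above; the column names and helper functions are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblUrls (
    url_id INTEGER PRIMARY KEY,
    url    TEXT NOT NULL UNIQUE          -- each distinct URL is stored once
);
CREATE TABLE tablename (
    id     INTEGER PRIMARY KEY,
    url_id INTEGER NOT NULL REFERENCES tblUrls (url_id)
);
CREATE INDEX idx_tablename_url_id ON tablename (url_id);
""")

def url_id_for(url: str) -> int:
    # The extra lookup here is the insert-time cost of this design.
    conn.execute("INSERT OR IGNORE INTO tblUrls (url) VALUES (?)", (url,))
    (url_id,) = conn.execute(
        "SELECT url_id FROM tblUrls WHERE url = ?", (url,)
    ).fetchone()
    return url_id

def record_hit(url: str) -> None:
    # The large table stores only the small integer key, not the URL text.
    conn.execute("INSERT INTO tablename (url_id) VALUES (?)", (url_id_for(url),))

def count_hits(url: str) -> int:
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM tablename t "
        "JOIN tblUrls u ON u.url_id = t.url_id WHERE u.url = ?",
        (url,),
    ).fetchone()
    return n

for _ in range(3):
    record_hit("http://example.com/a")
record_hit("http://example.com/b")
```

The string comparison now happens once against the small table of unique URLs, and the large table is searched by integer key.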

mjv 2009-09-08 17:21:24

Here's an article that talks about clustered vs non-clustered index performance on MySQL InnoDB tables: http://dbscience.blogspot.com/2008/02/clustered-indexing-and-query.html

Miky Dinescu 2009-09-08 17:24:09

Answer 3

A:

If the count returned by that SELECT statement tends to be rather high, an alternative solution would be to have a separate table which keeps track of the counts. Obviously there are high penalties for using that technique, but if this specific query is a common one and is too slow, this might be a solution.

There are obvious trade-offs involved in this solution, and you probably do not want to update this second table after every individual insertion of a new record, as that would slow down your insertions.

Brian 2009-09-08 17:08:21

Answer 4

A:

Using the built-in indexing is always best; otherwise you should volunteer to add to their codebase anyway ;)

When using a hash, create a two-column index on the hash and the URL. If you index only the first few characters of the URL, the lookup still does a complete match, but the index doesn't store more than those first few characters.

Something like this:

INDEX(CRC32_col, URL_col(5))

Either hash would work in that case. It's a trade-off of space vs speed.

Also, this query will be much faster:

SELECT * FROM `tablename` WHERE hash_col = 'hashvalue' AND url_col = 'urlvalue' LIMIT 1;

This will find the first match and stop, which is much faster than finding every match for the COUNT(*) calculation.

Ultimately the best choice is to make test cases for each variant and benchmark.

Killroy 2009-09-08 17:17:43

Answer 5

A:

If you choose a hash you need to take collisions into account. Even with a large hash like MD5 you have to account for the birthday problem (the basis of the birthday attack): the chance that any two values collide grows much faster than the chance of hitting one specific value. For a smaller hash like CRC-32 the collision probability will be quite large, so your WHERE has to specify both the hash and the full URL.

But I gotta ask, is this the best way to spend your efforts? Is there nothing else left to optimize? You may well be doing premature optimization unless you have clear metrics and measurements indicating that this problem is the bottleneck of the system. After all, this kind of seek is what databases are optimized for (all of them), and by doing something like a hash you may actually decrease performance (e.g. your index may become fragmented because hashes have a different distribution than URLs).
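To put rough numbers on this, here is a small sketch of the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)), where n is the number of stored URLs and N the number of possible hash values. It assumes the hash output is uniformly distributed, and the figure of 100,000 URLs is an arbitrary example, not taken from the question.

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a hash value.

    Uses the birthday approximation and assumes uniformly distributed hashes.
    """
    pairs = n_items * (n_items - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-pairs / 2 ** hash_bits)

# Hypothetical table size of 100,000 distinct URLs.
crc32_risk = collision_probability(100_000, 32)   # about 0.69
md5_risk = collision_probability(100_000, 128)    # vanishingly small

print(f"CRC-32: {crc32_risk:.2f}  MD5: {md5_risk:.3g}")
```

Under these assumptions a CRC-32 column is more likely than not to contain at least one collision at that table size, which is why the full URL has to stay in the WHERE clause.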

Remus Rusanu 2009-09-08 17:24:51

Answer 6

A:

Don't most SQL engines use hash functions internally for text column searches?

Loadmaster 2009-09-09 02:05:04

Answer 7

A:

If you're going to use hashed keys and you're concerned about collisions, use two different hash functions and concatenate the two hashed values.

But even if you do this, you should always store the original key value in the row as well.
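A minimal sketch of that idea in Python, assuming CRC-32 and MD5 as the two hash functions (the answer does not say which two to use) and a hypothetical row layout with `url_key` and `url` fields.

```python
import hashlib
import zlib

def combined_key(url: str) -> str:
    """Concatenate two independent hashes of the URL as one hex string."""
    data = url.encode("utf-8")
    crc_part = format(zlib.crc32(data), "08x")   # 8 hex characters
    md5_part = hashlib.md5(data).hexdigest()     # 32 hex characters
    return crc_part + md5_part

url = "http://example.com/a"
# Keep the original URL next to the derived key, as advised above,
# so a key match can always be verified against the real value.
row = {"url_key": combined_key(url), "url": url}
```

The combined key is a fixed 40 characters however long the URL is, and a false match would require both hashes to collide on the same pair of URLs.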

Loadmaster 2009-09-09 02:59:13
