views:

328

answers:

7

I have a big MySQL InnoDB table (about 1 milion records, increase by 300K weekly) let's say with blog posts. This table has an url field with index.

By adding new records in it I'm checking for existent records with the same url. Here is how query looks like:

SELECT COUNT(*) FROM `tablename` WHERE url='http://www.google.com/';

Currently system produces about 10-20 queries per second and this amount will be increased. I'm thinking about improving performance by adding additional field which is MD5 hash of the URL.

SELECT COUNT(*) FROM `tablename` WHERE md5url=MD5('http://www.google.com/');

So it will be shorter and with constant length which is better for index compared to URL field. What do you guys think about it. Does it make sense?

Another suggestion by friend of mine is to use CRC32 instead of MD5, but I'm not sure about how unique will be result of CRC32. Let me know what you think about CRC32 for this role.

UPDATE: the URL column is unique for each row.

A: 

I think CRC32 would actually be better for this role, as it's shorter and it saves more SQL space. If you're receiving that many queries, the object is to save space anyways? If it does the job, I'd say go for it.

Although, since it's only 32bit, and shorter in length, it's not as unique as MD5 of course. You will have to decide if you want unique, or if you want to save space.

I still think I'd choose CRC32.

My system generates roughly 4k queries per second, and I use CRC32 for links.

Homework
You can always store the full url in a separate column and ask MySQL to compare both: same CRC32 and same full URL.
too much php
Will try this, thanks! :P
Homework
+4  A: 

Create a non-clustered index on URL. That will let your SQL engine do all the optimization internally and will produce the best results!

If you create an index on a VARCHAR column, SQL will create a hash internally anyway and using the index can give better performance by an order of magnitude or even more!

Also, something to keep in mind if you're only checking whether a URL exists, is that certain SQL products will produce faster results with a query like this:

IF NOT EXISTS(SELECT * FROM `tablename` WHERE url='')
    -- return TRUE or do your logic here
Miky Dinescu
I thought "non-clustered" was SQL Server terminology - shouldn't that read as just being an index?
OMG Ponies
non-clustered indexes are "virtual" indexes on the data, whereas clustered indexes are physical indexes on the data. You can only have one clustered index per table, while you can have multiple non-clustered indexes on the same table
Miky Dinescu
Agreed, a NC index would get same or similar performance as adding MD5 or other hash. If you have a high ratio of tablename records per url, I would consider a two table structure, where unique urls be maintained in say tblUrls and tablename would only store the corresponding key. This may slightly increase you insert performance but also reduce storage requirements and have a few other benefits, depending on the underlying application.
mjv
Here's an article that talks about clustered vs non-clustered index performance on MySQL InnoDB tables: http://dbscience.blogspot.com/2008/02/clustered-indexing-and-query.html
Miky Dinescu
A: 

If the tendency is for the result of that select statement to be rather high, an alternative solution would be to have a separate table which keeps track of the counts. Obviously there are high penalties for using that technique, but if this specific query is a common one and is too slow, this might be a solution.

There are obvious trade-offs involved in this solution, and you probably do not want to update this 2nd table after every individual insertion of a new record inserted, as that would slow down your insertions.

Brian
A: 

Using the build-in indexing is always best, or you should volunteer to add to their codebase anyways ;)

When using a hash, create a 2 column index on the hash and the URL. If you only choose the first couple of letters on the index, it still does a complete match, but it doesn't index more then the first few letters.

Something like this:

INDEX(CRC32_col, URL_col(5))

Either hash would work in that case. It's a trade-off of space vs speed.

Also, this query will be much faster:

SELECT * FROM table WHERE hash_col = 'hashvalue' AND url_col = 'urlvalue' LIMIT 1;

This will find the first value and stop. Much faster then finding many matches for the COUNT(*) calculation.

Ultimately the best choice is to make test cases for each variant and benchmark.

Killroy
A: 

If you choose a hash you need to take into account collissions. Even with a large hash like MD5 you have to account the meet-in-the-middle probability, better known as birthday attack. For a smaller hash like CRC-32 the collision probability will be quite large and your WHERE has to specify hash and the full URL.

But I gotta ask, is this the best way to spend your efforts? Is there nothing else left to optimize? You may be well doing premature optimizations unless you have clear metrics and measurements indicating that this problem is the bottleneck of the system. After all, this kind of seek is what databases are optimized for (all of them), and by doing something like a hash you may actually decrease performance (eg. your index may become fragmented becuase hashes have a different distribution than URLs).

Remus Rusanu
A: 

Don't most SQL engines use hash functions internally for text column searches?

Loadmaster
A: 

If you're going to use hashed keys and you're concerned about collisions, use two different hash functions and concatenate the two hashed values.

But even if you do this, you should always store the original key value in the row as well.

Loadmaster