views:

1003

answers:

2

I have a mysql table consisting of:

CREATE TABLE `url_list` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `crc32` int(10) unsigned NOT NULL,
  `url` varchar(512) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `crc32` (`crc32`)
);

When inserting data into a related table I need to lookup the primary key from this table, and using the crc32 really speeds that up whilst allowing a small index. The URLs do need to be unique, but I'd like to avoid having more index than actual data.

If the value isn't present I need to insert it, but using structures such as INSERT IGNORE, or ON DUPLICATE KEY either requires me to put a unique on the huge varchar, or don't take advantage of my index.

How can I "SELECT id else INSERT", whilst preserving the lookup speed for the 80-90% of hits that are already in the table?

+1  A: 

This website offers a solution to a similar problem.

Adriano Varoli Piazza
+2  A: 

I would recommend ditching the id column and the crc32 because they're not necessary.

You can use an MD5() hash to provide a fixed-length, virtually unique value computed from the lengthy URL data, and then use that hash as the primary key.

CREATE TABLE `url_list` (
  `url_hash` BINARY(16) NOT NULL PRIMARY KEY
  `url`      VARCHAR(512) NOT NULL
);

DELIM !!
CREATE TRIGGER `url_ins` BEFORE INSERT ON `url_list`
FOR EACH ROW
BEGIN
  SET NEW.`url_hash` = UNHEX( MD5( NEW.`url` ) );
END!!

Then you can use INSERT..ON DUPLICATE KEY UPDATE because unlike crc32, the hash should have a very low chance of collision.

edit: See http://en.wikipedia.org/wiki/Birthday_attack. If you log 1 million distinct URL's per day for 2,000 years, the MD5 hashes of these URL's are still less likely to include a collision than your hard disk is to have an uncorrectable bit error.

Bill Karwin
I had thought about using a hash, but I don't like the virtually part of virtually unique.
Ok I'm going for the hash after working out the collision % for md5. The int would overflow first by a long way.
I agree it's good to be wary of the word "virtually." It often means "not," as in, "I have finished virtually all of my homework," or "This new surgical procedure is virtually pain-free." :-)
Bill Karwin