views:

66

answers:

3

Hello,

which is the best primary key to store website address and page URLs?

To avoid the use of autoincremental id (which is not really tied to the data), I designed the schema with the use of a SHA1 signature of the URL as primary key.

This approach is useful in many ways: for example I don't need to read the last_id from the database so I can prepare all table updates calculating the key and do the real update in a single transaction. No constraint violation.

Anyway I read two books which tell me I am wrong. In "High performance MySQL" it is said that the random key is not good for the DB optimizer. Moreover, in each Joe Celko's books he says the primary key should be some part of the data.

The question is: the natural keys for URLs are... URLs themselves. The fact is that if for a site it is short (www.something.com), there's not an imposed limit for am URL (see http://www.boutell.com/newfaq/misc/urllength.html).

Consider I have to store (and work with) some millions of them.

Which is the best key, then? Autoincremental ids, URLs, hashes of URLs?

A: 

Depends on how you use the table. If you mostly select with WHERE url='<url>', then it's fine to have a one-column table. If you can use an autoincrement id to identify an URL in all places in your app, then use the autoincrement

Bozho
+1  A: 

Presumably you're talking about an entire URL, not just a hostname, including CGI parameters and other stuff.

SHA-1 hashing the URLs makes all the keys long, and makes sorting out trouble fairly obscure. I had to use indexes on hashes once to obscure some confidential data while maintaining the ability to join two tables, and the performance was poor.

There are two possible approaches. One is the naive and obvious one; it will actually work well in mySQL. It has advantages such as simplicity, and the ability to use URL LIKE 'whatever%' to search efficiently.

But if you have lots of URLs concentrated in a few domains ... for example ....

http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls
http://stackoverflow.com/questions/3735391/how-to-add-a-c-compiler-flag-to-extconf-rb

etc, you're looking at indexes which vary only in the last characters. In this case you might consider storing and indexing the URLs with their character order reversed. This may lead to a more efficiently accessed index.

(The Oracle table server product happens has a built in way of doing this with a so-called reversed index.)

If I were you I would avoid an autoincrement key unless you have to join more than two tables ON TABLE_A.URL = TABLE_B.URL or some other join condition with that kind of meaing.

Ollie Jones
One way to improve performance for joins on hashes is to add a second indexed column with a more "concentrated" version of the hash data. A BIGINT with the first 64 bits of an MD5 can be indexed more efficiently than a CHAR(32). Collisions will be a zillion times more common, which is to say, extremely rare. Your WHERE can join on both columns ("WHERE t1.inthash=t2.inthash AND t1.charhash=t2.charhash") and in the extremely rare case of a BIGINT collision, the full hash will ensure you still get the right answer.
Jamie McCarthy
+3  A: 

You'll want an autoincrement numeric primary key. For the times when you need to pass ids around or join against other tables (for example, optional attributes for a URL), you'll want something small and numeric.

As for what other columns and indexes you want, it depends, as always, on how you're going to use them.

A column storing a hash of each URL is an excellent idea for almost any application that uses a significant number of URLs. It makes SELECTing a URL by its full text about as fast as it's going to get. A second advantage is that if you make that column UNIQUE, you don't need to worry about making the column storing the actual URL unique, and you can use REPLACE INTO and INSERT IGNORE as simple, fast atomic write operations.

I would add that using MySQL's built-in MD5() function is just fine for this purpose. Its only disadvantage is that a dedicated attacker can force collisions, which I'm quite sure you don't care about. Using the built-in function makes, for example, some types of joins much easier. It can be a tiny bit slower to pass a full URL across the wire ("SELECT url FROM urls WHERE hash=MD5('verylongurl')" instead of "WHERE hash='32charhexstring'"), but you'll have the option to do that if you want. Unless you can come up with a concrete scenario where MD5() will let you down, feel free to use it.

The hard question is whether and how you're going to need to look up URLs in ways other than their full text: for example, will you want to find all URLs starting with "/foo" on any "bar.com" host? While "LIKE '%bar.com%/foo%'" will work in testing, it will fail miserably at scale. If your needs include things like that, you can come up with creative ways to generate non-UNIQUE indexes targeted at the type of data you need... maybe a domain_name column, for starters. You'll have to populate those columns from your application, almost certainly (triggers and stored procedures are a lot more trouble than they're worth here, especially if you're concerned about performance -- don't bother).

The good news is that relational databases are very flexible for that sort of thing. You can always add new columns and populate them later. I would suggest for starters: int unsigned auto_increment primary key, unique hash char(32), and (assuming 64K chars suffices) text url.

Jamie McCarthy
+1 - ther are serious performance implications on having wider primrary keys, well documented by the SQL team and mostly ignored by most developers.
TomTom