tags:
views: 719
answers: 3
I want to store a large set of URLs in MySQL and create a unique index on the column. If I make the column utf8 then I'm limited to a VARCHAR(333), which is not enough to hold some of my URLs. If I declare the column as latin1 then I get the full 1000 characters (I don't think I need that much). However, I'd have to encode the URL and be consistent about always working with the encoded URL. Is there a better way to manage large sets of URLs?
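For reference, assuming a MyISAM table (where an index key is limited to 1000 bytes and utf8 reserves 3 bytes per character), the limit described above looks like this; the table names are illustrative only:

CREATE TABLE urls_utf8 (
  url VARCHAR(334) CHARACTER SET utf8 NOT NULL,
  UNIQUE KEY (url)      -- rejected: 334 * 3 bytes exceeds the 1000-byte key limit
) ENGINE=MyISAM;

CREATE TABLE urls_latin1 (
  url VARCHAR(1000) CHARACTER SET latin1 NOT NULL,
  UNIQUE KEY (url)      -- accepted: 1000 * 1 byte fits exactly
) ENGINE=MyISAM;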

A: 

The most common practice I know of is a hash with collision control: run the URL through a quick one-way hash that produces very few collisions, and enforce uniqueness on that.
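A rough sketch of that idea in MySQL (the table and column names are made up for illustration): store the full URL in a TEXT column, keep a fixed-length hash of it alongside, and put the unique index on the hash rather than on the URL itself:

CREATE TABLE urls (
  id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  url      TEXT NOT NULL,
  url_sha1 CHAR(40) NOT NULL,   -- hex SHA1 of the URL
  UNIQUE KEY (url_sha1)
);

INSERT INTO urls (url, url_sha1)
VALUES ('http://example.com/some/long/path?q=1',
        SHA1('http://example.com/some/long/path?q=1'));

Note that a UNIQUE index on the hash alone treats a (very unlikely) SHA-1 collision as a duplicate; for exact collision control you would index the hash non-uniquely and compare the full URL in the application before inserting.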

Try chopping off the parts that you know will be the same across all URLs (i.e., http://, www, etc.). If the URLs are all part of your domain, chop that off too.

Otherwise, I'd rethink the problem and try to find a different way to accomplish whatever you're trying to accomplish. I assume having a unique set of URLs is really solving some other problem.

Martin Dale Lyness
You're making what are probably incorrect assumptions about why he's storing the URLs. It is probably /not/ safe to chop off www, etc.
Matthew Flaschen
Could you cite an example? By removing the http and www you can easily reproduce the original URL; it doesn't affect the definition's integrity... Could you explain this?
Martin Dale Lyness
That's wrong. There is no guarantee that http://www.foo.com and http://foo.com refer to the same resource, let alone https://www.foo.com and http://foo.com .
Matthew Flaschen
Mouse-over the links. Anyway, the point is that the site is free to give the www subdomain special significance.
Matthew Flaschen
+4  A: 

One thing you might consider is storing the hostname and protocol portion of the URL in a separate table and referencing it via a key. This could also prove useful later on for getting all URLs for a specific host, as well as helping with your string-length concerns.

For example:

PROTOCOLS
-----------------------
PROTOCOL_ID   INTEGER
PROTOCOL      VARCHAR(10)    (e.g., http, https, ftp)

HOSTS
-----------------------
HOST_ID       BIGINT
HOSTNAME      VARCHAR(256)

URL
-----------------------
PROTOCOL_ID   INTEGER       FK to PROTOCOLS
HOST_ID       BIGINT        FK to HOSTS
QUERY_STRING  VARCHAR(333)
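A minimal MySQL sketch of this layout, as a hedged example rather than a definitive schema: the table and column names, the InnoDB engine, and the latin1 character set (assuming hostnames and paths are stored punycoded/percent-encoded ASCII) are all illustrative choices.

CREATE TABLE protocols (
  protocol_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  protocol    VARCHAR(10) NOT NULL,              -- e.g. 'http', 'https', 'ftp'
  UNIQUE KEY (protocol)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE hosts (
  host_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  hostname VARCHAR(255) NOT NULL,
  UNIQUE KEY (hostname)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE urls (
  protocol_id  INT UNSIGNED NOT NULL,
  host_id      BIGINT UNSIGNED NOT NULL,
  query_string VARCHAR(333) NOT NULL,            -- path + query, already encoded
  UNIQUE KEY (protocol_id, host_id, query_string),
  FOREIGN KEY (protocol_id) REFERENCES protocols (protocol_id),
  FOREIGN KEY (host_id)     REFERENCES hosts (host_id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Splitting out the protocol and host also shrinks the portion of the URL that still needs a unique index, which is what works around the key-length limit from the question.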
RC
Agreed. You may want to go the route of longneck's suggestion of using a TEXT field. I was coming more at "Is there a better way to manage large sets of URLs?", with the side effect of it also helping with the length of your URLs. Depending on how you're using the URLs, you may also want to break the hostname down into host and domain, with the HOSTS table having a domain_key back to a DOMAIN table. That could make finding all URLs within a domain trivial.
RC
+3  A: 

Three good ways to do this:

1) Use TEXT instead of VARCHAR. To ensure uniqueness, you'll also have to create a separate VARCHAR column to store an MD5() or SHA1() hash and put a UNIQUE or PRIMARY index on it. This has the unfortunate consequence of an additional disk seek to retrieve the URL, but depending on your use case that might be OK.

2) Use VARCHAR with a binary collation and compress the URL using COMPRESS() (see the sketch after this list).

3) I forgot the third one as I was typing the first two. Grr...
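Option 1 is essentially the hash-column layout sketched under the first answer above. For option 2, a rough sketch, with the caveat that COMPRESS() returns arbitrary bytes, so a VARBINARY column (rather than a binary-collated VARCHAR) is assumed here, and the column size is made up:

CREATE TABLE urls_compressed (
  url_comp VARBINARY(767) NOT NULL,   -- COMPRESS(url); identical URLs compress identically
  UNIQUE KEY (url_comp)
);

INSERT INTO urls_compressed (url_comp)
VALUES (COMPRESS('http://example.com/some/long/path?q=1'));

SELECT UNCOMPRESS(url_comp) AS url FROM urls_compressed;

Keep in mind that COMPRESS() adds a small header, so very short URLs can come out larger than they went in; the savings only show up on longer, repetitive URLs.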

longneck
+1, I personally would go with #1 (placing the unique constraint on the hash of the url, not the url itself).
nathan