I want to store a large set of URLs in MySql and create a unique index on the column. If I make the column utf8 then I'll be limited to a varchar(333), which is not enough to hold some of my URLs. If I declare the column to be latin1 then I get the full 1000 characters (don't think I need that much). However, I'll have to encode the URL and be consistent about always working with the encoded URL. Is there a better way to manage large sets of URLs?
The most common practice I know of is to use a hash with collision control: apply a quick one-way hash function that produces very few collisions on URLs, index the hash rather than the URL itself, and compare the full URL on a hash match to rule out collisions.
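A minimal sketch of that lookup pattern, assuming a hypothetical `urls` table with an indexed `url_hash` column stored alongside the full URL text (table and column names are illustrative):

```sql
-- Store each URL with a fixed-width hash; the index goes on the hash,
-- not on the long URL column.
INSERT INTO urls (url_hash, url)
VALUES (MD5('http://example.com/some/long/path'),
        'http://example.com/some/long/path');

-- Lookup: filter on the short indexed hash, then compare the full URL
-- to eliminate the (rare) collision case.
SELECT id
FROM urls
WHERE url_hash = MD5('http://example.com/some/long/path')
  AND url      = 'http://example.com/some/long/path';
```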
Try chopping off parts that you know will be the same throughout all URLs (e.g., http://, www, etc.). If the URLs are all part of your own domain, chop that off too.
Otherwise, I'd re-think the problem and try to find a different way to accomplish whatever you are trying to do. I assume having a unique set of URLs is really solving some other problem.
One thing you might consider is storing the hostname and protocol portion of the URL in a separate table and referencing it via a key. This could also prove useful later for getting all URLs for a specific host, as well as helping with your string-length concerns.
For example:
PROTOCOLS
-----------------------
PROTOCOL_ID   INTEGER
PROTOCOL      VARCHAR(10)    (e.g., http, https, ftp)

HOSTS
-----------------------
id            BIGINT
hostname      varchar(256)

URL
-----------------------
PROTOCOL      INTEGER        FK to PROTOCOLS
HOSTNAME      BIGINT         FK to HOSTS
QUERY_STRING  VARCHAR(333)
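In MySQL DDL, that layout might look something like the sketch below; the exact types, sizes, and constraint names are illustrative, not prescriptive:

```sql
CREATE TABLE protocols (
    protocol_id INT AUTO_INCREMENT PRIMARY KEY,
    protocol    VARCHAR(10) NOT NULL          -- e.g. http, https, ftp
);

CREATE TABLE hosts (
    id       BIGINT AUTO_INCREMENT PRIMARY KEY,
    hostname VARCHAR(256) NOT NULL
);

CREATE TABLE urls (
    protocol_id  INT    NOT NULL,
    host_id      BIGINT NOT NULL,
    query_string VARCHAR(333) NOT NULL,       -- path + query, minus host/protocol
    UNIQUE KEY uniq_url (protocol_id, host_id, query_string),
    FOREIGN KEY (protocol_id) REFERENCES protocols (protocol_id),
    FOREIGN KEY (host_id)     REFERENCES hosts (id)
);
```

The unique key spans all three columns, so the "same" URL can only appear once even though it is split across tables.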
Three good ways to do this:
1) Use TEXT instead of VARCHAR. To ensure uniqueness, you'll also have to create a separate column to store an MD5() or SHA1() hash and add a UNIQUE or PRIMARY index on it. This has the unfortunate consequence of an additional disk seek to retrieve the URL, but depending on your use case that might be OK.
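A sketch of option 1, assuming a hypothetical `urls` table; since an MD5() hex digest is always 32 characters, CHAR(32) is enough to carry the unique index:

```sql
CREATE TABLE urls (
    id      BIGINT AUTO_INCREMENT PRIMARY KEY,
    url     TEXT NOT NULL,              -- full URL, no VARCHAR length worry
    url_md5 CHAR(32) NOT NULL,          -- MD5() hex digest of the URL
    UNIQUE KEY uniq_url_md5 (url_md5)   -- uniqueness enforced via the hash
);

INSERT INTO urls (url, url_md5)
VALUES ('http://example.com/a/very/long/path',
        MD5('http://example.com/a/very/long/path'));
```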
2) Use VARCHAR with a binary collation and compress the URL using COMPRESS().
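A sketch of option 2. COMPRESS() returns a binary string, so the example below uses VARBINARY (the binary-collation cousin of VARCHAR); the original value comes back via UNCOMPRESS(). Note that the unique index is on the compressed bytes, and that index key-length limits still apply:

```sql
CREATE TABLE urls (
    url VARBINARY(1000) NOT NULL,       -- compressed URL bytes
    UNIQUE KEY uniq_url (url)
);

INSERT INTO urls (url)
VALUES (COMPRESS('http://example.com/a/very/long/path'));

-- Decompress on the way out.
SELECT UNCOMPRESS(url) AS url FROM urls;
```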
3) I forgot the third one as I was typing the first two. Grr...