views:

266

answers:

6

What is the best way to deal with storing and indexing URLs in SQL Server 2005?

I have a WebPage table that stores metadata and content about Web Pages. I also have many other tables related to the WebPage table. They all use URL as a key.

The problem is that URLs can be very long, and using them as a key makes the indexes larger and slower. How much slower I don't know, but I have read many times that using large fields for indexing should be avoided. Assuming a URL is nvarchar(400), that is an enormous field to use as a primary key.

What are the alternatives?

How much pain would there likely be in using the URL as a key instead of a smaller field?

I have looked into giving the WebPage table an identity column, and then using that as the primary key for a WebPage. This keeps all the associated indexes smaller and more efficient, but it makes importing data a bit of a pain: each import into the associated tables first has to look up the id of a URL before inserting data into the tables.

I have also played around with using a hash of the URL to create a smaller index, but I am still not sure whether it is the best way of doing things. It wouldn't be a unique index and would be subject to a small number of collisions, so I am unsure what foreign key would be used in this case...
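For what it's worth, one way the hash-plus-identity combination could be sketched (table and column names here are my own assumptions, not anything from an existing schema): keep a small identity column as the primary key for foreign keys, and add a persisted CHECKSUM of the URL with a non-unique index purely to speed up lookups, filtering out collisions by comparing the full URL.

```sql
-- Sketch only: table and column names are assumptions.
CREATE TABLE WebPage (
    Id      int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- small surrogate key for foreign keys
    Url     nvarchar(400) NOT NULL,
    UrlHash AS CHECKSUM(Url) PERSISTED               -- 4-byte, non-unique hash of the URL
)

CREATE INDEX IX_WebPage_UrlHash ON WebPage (UrlHash)

-- Lookup: seek on the small hash index, then filter out collisions on the full URL
SELECT Id
FROM WebPage
WHERE UrlHash = CHECKSUM(N'http://www.stackoverflow.com/')
  AND Url = N'http://www.stackoverflow.com/'
```

The hash index here is never used as a key itself, only as a cheap way to narrow the seek before the exact URL comparison.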

There will be millions of records about web pages stored in the database, and there will be a lot of batch updating. There will also be quite a lot of activity reading and aggregating the data.

Any thoughts?

+1  A: 

I would stick with the hash solution. This generates a compact key with a fairly low chance of collision.

An alternative would be to create a GUID and use that as the key.
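If the GUID route is taken, one way it might look on SQL Server 2005 (table name is an assumption): let the server generate the key with NEWSEQUENTIALID(), which, unlike random NEWID() values, produces increasing GUIDs and so avoids heavy index fragmentation on the clustered key.

```sql
-- Hypothetical sketch: server-generated GUID key.
-- NEWSEQUENTIALID() may only be used in a DEFAULT constraint.
CREATE TABLE WebPage (
    Id  uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
    Url nvarchar(400) NOT NULL
)
```

Note that a GUID is still 16 bytes per key in every related table's index, four times the size of an int identity.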

David Robbins
If you go with the hash solution, what do the related tables use for the foreign key? You couldn't use the hash, as you would get collisions?
Andrew Rimmer
I think the GUID solution would be the same as having an identity column. Each import program would have to hit the WebPage table to translate from URL to GUID before using the GUID in related tables.
Andrew Rimmer
Not sure, but I think Sharepoint uses the GUID solution.
David Robbins
As far as the hash for the primary key goes, if your number of pages is low enough to prevent collisions, you would have a unique identifier. There should be no issue with related tables. Am I misunderstanding you, Andrew?
David Robbins
There will be a large number of records, so I couldn't really trust the hash to be unique. I could use it on the main table to speed up searching for a web page, and then use the WebPage primary key for the related tables...
Andrew Rimmer
That's what I meant - sorry for not being clearer. The main concern, as you said, is the sheer volume in the main table.
David Robbins
+4  A: 

I'd use a normal identity column as the primary key. You say:

This keeps all the associated indexes smaller and more efficient, but it makes importing data a bit of a pain: each import into the associated tables first has to look up the id of a URL before inserting data into the tables.

Yes, but the pain is probably worth it, and the techniques you learn in the process will be invaluable on future projects.

On SQL Server 2005, you can create a user-defined function GetUrlId that looks something like this:

CREATE FUNCTION dbo.GetUrlId (@Url nvarchar(400))
RETURNS int
AS BEGIN
  -- Return the surrogate Id for a URL already in the Url table, or NULL if it isn't there
  DECLARE @UrlId int
  SELECT @UrlId = Id FROM Url WHERE Url = @Url
  RETURN @UrlId
END

This will return the ID for URLs already in your Url table, and NULL for any URL not already recorded. You can then call this function inline in your import statements - something like:

INSERT INTO 
  UrlHistory(UrlId, Visited, RemoteIp) 
VALUES 
  (dbo.GetUrlId('http://www.stackoverflow.com/'), @Visited, @RemoteIp)

This is probably slower than a proper join statement, but for one-off or occasional import routines it might make things easier.
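For bulk imports, the join version alluded to above might look like the following sketch (the staging table name and its columns are assumptions): load the raw rows into a staging table first, then resolve all the URL ids in a single set-based statement instead of one function call per row.

```sql
-- Set-based alternative (sketch): resolve every UrlId in one join.
INSERT INTO UrlHistory (UrlId, Visited, RemoteIp)
SELECT u.Id, s.Visited, s.RemoteIp
FROM ImportStaging s
JOIN Url u ON u.Url = s.Url
```

An inner join also silently drops rows whose URL isn't in the Url table yet, whereas the function-based insert would put a NULL into UrlId; which behaviour you want depends on the import.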

Dylan Beattie
I like your answer - you could combine this with hashing the url and have 2 methods for uniquely identifying a page.
David Robbins
You must still have an index on the Url column (obviously with the UrlId as the clustered index), otherwise your lookups will take a loooong time.
Valerion
+2  A: 

Break the URL up into columns based on the bits you're concerned with, and use the RFC as a guide. Reverse the host and domain info so an index can group like domains together (Google does this).

stackoverflow.com      -> com.stackoverflow  
blog.stackoverflow.com -> com.stackoverflow.blog
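A minimal sketch of what that design might look like (column names and sizes are assumptions): store the reversed host in its own column so a single index range scan covers a domain and all of its subdomains.

```sql
-- Sketch only: reversed host groups a domain and its subdomains together in the index.
CREATE TABLE WebPage (
    Id           int IDENTITY(1,1) PRIMARY KEY,
    ReversedHost varchar(255) NOT NULL,   -- e.g. 'com.stackoverflow.blog'
    PathAndQuery nvarchar(400) NOT NULL   -- the rest of the URL
)

CREATE INDEX IX_WebPage_ReversedHost ON WebPage (ReversedHost)

-- All pages under stackoverflow.com, including every subdomain:
SELECT Id FROM WebPage WHERE ReversedHost LIKE 'com.stackoverflow%'
```

Reversing the labels would normally be done in application code before insert, since string splitting in T-SQL 2005 is awkward.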

Google has a paper that outlines what they do, but I can't find it right now.

http://en.wikipedia.org/wiki/Uniform_Resource_Locator

jms
A: 

I totally agree with Dylan. Use an IDENTITY column or a GUID column as a surrogate key in your WebPage table. That's a clean solution. The lookup of the id while importing isn't that painful, I think.

Using a big varchar column as the key column wastes a lot of space and hurts insert and query performance.

Jan
A: 

Not so much a solution. More another perspective.

Storing the total unique URI of a page perhaps defeats part of the point of URI construction. Each forward slash is supposed to refer to a unique semantic space within the domain (whether that space is actual or logical). Unless the URIs you intend to store are something along the lines of www.somedomain.com/p.aspx?id=123456789, it might really be better to break a single URI meta-table into a table representing the subdomains you have represented in your site.

For example, if you're going to hold a number of "News" section URIs in the same table as the "Reviews" URIs, you're missing a trick: have a "Sections" table whose content contains meta information about the section and whose own ID acts as a parent to all the URIs within it.
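A rough sketch of that parent/child split (all names here are assumptions, chosen to mirror the News/Reviews example): the section carries the shared metadata, and each page row stores only its own tail of the path.

```sql
-- Sketch only: a Sections table parents the URIs within each section.
CREATE TABLE Section (
    Id   int IDENTITY(1,1) PRIMARY KEY,
    Name nvarchar(100) NOT NULL          -- e.g. 'News', 'Reviews'
)

CREATE TABLE Page (
    Id        int IDENTITY(1,1) PRIMARY KEY,
    SectionId int NOT NULL REFERENCES Section(Id),
    Slug      nvarchar(200) NOT NULL     -- the page's path within its section
)
```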

+1  A: 

"Assuming a URL is nvarchar(400)"

I don't think that URLs need to be nvarchar; ordinary varchar should suffice.

Eyvind