views:

266

answers:

6

What is the best way to deal with storing and indexing URLs in SQL Server 2005?

I have a WebPage table that stores metadata and content about Web Pages. I also have many other tables related to the WebPage table. They all use URL as a key.

The problem is that URLs can be very long, and using them as a key makes the indexes larger and slower. How much slower I don't know, but I have read many times that using large fields for indexing should be avoided. Assuming a URL is nvarchar(400), that is an enormous field to use as a primary key.

What are the alternatives?

How much pain would there likely be in using the URL as a key instead of a smaller field?

I have looked into giving the WebPage table an identity column, and then using that as the primary key for a WebPage. This keeps all the associated indexes smaller and more efficient, but it makes importing data a bit of a pain: each import into the associated tables first has to look up the id of a URL before inserting data into the tables.

I have also played around with using a hash of the URL to create a smaller index, but I am still not sure whether it is the best way of doing things. It wouldn't be a unique index and would be subject to a small number of collisions, so I am unsure what foreign key would be used in this case...
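For what it's worth, one way the hash-plus-identity combination could be sketched (table and column names here are my own assumptions, not anything from an existing schema): keep a small identity column as the primary key for foreign keys, and add a persisted CHECKSUM of the URL with a non-unique index purely to speed up lookups, filtering out collisions by comparing the full URL.

```sql
-- Sketch only: table and column names are assumptions.
CREATE TABLE WebPage (
    Id      int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- small surrogate key for foreign keys
    Url     nvarchar(400) NOT NULL,
    UrlHash AS CHECKSUM(Url) PERSISTED               -- 4-byte, non-unique hash of the URL
)

CREATE INDEX IX_WebPage_UrlHash ON WebPage (UrlHash)

-- Lookup: seek on the small hash index, then filter out collisions on the full URL
SELECT Id
FROM WebPage
WHERE UrlHash = CHECKSUM(N'http://www.stackoverflow.com/')
  AND Url = N'http://www.stackoverflow.com/'
```

The hash index here is never used as a key itself, only as a cheap way to narrow the seek before the exact URL comparison.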

There will be millions of records about web pages stored in the database, and there will be a lot of batch updating. There will also be quite a lot of activity reading and aggregating the data.

Any thoughts?

+1  A: 

I would stick with the hash solution. This generates a compact key with a fairly low chance of collision.

An alternative would be to create a GUID and use that as the key.
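If the GUID route is taken, one way it might look on SQL Server 2005 (table name is an assumption): let the server generate the key with NEWSEQUENTIALID(), which, unlike random NEWID() values, produces increasing GUIDs and so avoids heavy index fragmentation on the clustered key.

```sql
-- Hypothetical sketch: server-generated GUID key.
-- NEWSEQUENTIALID() may only be used in a DEFAULT constraint.
CREATE TABLE WebPage (
    Id  uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
    Url nvarchar(400) NOT NULL
)
```

Note that a GUID is still 16 bytes per key in every related table's index, four times the size of an int identity.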

David Robbins
If you go with the hash solution, what do the related tables use for the foreign key? You couldn't use the hash, as you would get collisions?
Andrew Rimmer
I think the GUID solution would be the same as having an identity column. Each import program would have to hit the WebPage table to translate from URL to GUID before using the GUID in related tables.
Andrew Rimmer
Not sure, but I think Sharepoint uses the GUID solution.
David Robbins
As far as the hash for the primary key goes, if your number of pages is low enough to prevent collisions, you would have a unique identifier. There should be no issue with related tables. Am I misunderstanding you, Andrew?
David Robbins
There will be a large number of records, so I couldn't really trust the hash to be unique. I could use it on the main table to speed up searching for a web page, and then use the WebPage primary key for the related tables...
Andrew Rimmer
That's what I meant - sorry for not being clearer. The main concern, as you said, is the sheer volume in the main table.
David Robbins
+4  A: 

I'd use a normal identity column as the primary key. You say:

This keeps all the associated indexes smaller and more efficient, but it makes importing data a bit of a pain: each import into the associated tables first has to look up the id of a URL before inserting data into the tables.

Yes, but the pain is probably worth it, and the techniques you learn in the process will be invaluable on future projects.

On SQL Server 2005, you can create a user-defined function GetUrlId that looks something like this:

CREATE FUNCTION dbo.GetUrlId (@Url nvarchar(400))
RETURNS int
AS BEGIN
  -- Return the surrogate Id for a URL already in the Url table, or NULL if it isn't there
  DECLARE @UrlId int
  SELECT @UrlId = Id FROM Url WHERE Url = @Url
  RETURN @UrlId
END

This will return the ID for URLs already in your Url table, and NULL for any URL not already recorded. You can then call this function inline in your import statements - something like:

INSERT INTO 
  UrlHistory(UrlId, Visited, RemoteIp) 
VALUES 
  (dbo.GetUrlId('http://www.stackoverflow.com/'), @Visited, @RemoteIp)

This is probably slower than a proper join statement, but for one-off or occasional import routines it might make things easier.
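For bulk imports, the join version alluded to above might look like the following sketch (the staging table name and its columns are assumptions): load the raw rows into a staging table first, then resolve all the URL ids in a single set-based statement instead of one function call per row.

```sql
-- Set-based alternative (sketch): resolve every UrlId in one join.
INSERT INTO UrlHistory (UrlId, Visited, RemoteIp)
SELECT u.Id, s.Visited, s.RemoteIp
FROM ImportStaging s
JOIN Url u ON u.Url = s.Url
```

An inner join also silently drops rows whose URL isn't in the Url table yet, whereas the function-based insert would put a NULL into UrlId; which behaviour you want depends on the import.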

Dylan Beattie
I like your answer - you could combine this with hashing the url and have 2 methods for uniquely identifying a page.
David Robbins
You must still have an index on the Url column (obviously with the UrlId as the clustered index), otherwise your lookups will take a loooong time.
Valerion
+2  A: 

Break the URL up into columns based on the bits you're concerned with, and use the RFC as a guide. Reverse the host and domain info so an index can group like domains together (Google does this).

stackoverflow.com      -> com.stackoverflow  
blog.stackoverflow.com -> com.stackoverflow.blog
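A minimal sketch of what that design might look like (column names and sizes are assumptions): store the reversed host in its own column so a single index range scan covers a domain and all of its subdomains.

```sql
-- Sketch only: reversed host groups a domain and its subdomains together in the index.
CREATE TABLE WebPage (
    Id           int IDENTITY(1,1) PRIMARY KEY,
    ReversedHost varchar(255) NOT NULL,   -- e.g. 'com.stackoverflow.blog'
    PathAndQuery nvarchar(400) NOT NULL   -- the rest of the URL
)

CREATE INDEX IX_WebPage_ReversedHost ON WebPage (ReversedHost)

-- All pages under stackoverflow.com, including every subdomain:
SELECT Id FROM WebPage WHERE ReversedHost LIKE 'com.stackoverflow%'
```

Reversing the labels would normally be done in application code before insert, since string splitting in T-SQL 2005 is awkward.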

Google has a paper that outlines what they do, but I can't find it right now.

http://en.wikipedia.org/wiki/Uniform_Resource_Locator

jms
A: 

I totally agree with Dylan. Use an IDENTITY column or a GUID column as a surrogate key in your WebPage table. That's a clean solution. The lookup of the id while importing isn't that painful, I think.

Using a big varchar column as the key column wastes a lot of space and hurts insert and query performance.

Jan
A: 

Not so much a solution. More another perspective.

Storing the total unique URI of a page perhaps defeats part of the point of URI construction. Each forward slash is supposed to refer to a unique semantic space within the domain (whether that space is actual or logical). Unless the URIs you intend to store are something along the lines of www.somedomain.com/p.aspx?id=123456789, it might really be better to break a single URI meta-table into a table representing the subdomains you have represented in your site.

For example, if you're going to hold a number of "News" section URIs in the same table as the "Reviews" URIs, you're missing a trick: have a "Sections" table whose content contains meta information about the section and whose own ID acts as a parent to all the URIs within it.
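A rough sketch of that parent/child split (all names here are assumptions, chosen to mirror the News/Reviews example): the section carries the shared metadata, and each page row stores only its own tail of the path.

```sql
-- Sketch only: a Sections table parents the URIs within each section.
CREATE TABLE Section (
    Id   int IDENTITY(1,1) PRIMARY KEY,
    Name nvarchar(100) NOT NULL          -- e.g. 'News', 'Reviews'
)

CREATE TABLE Page (
    Id        int IDENTITY(1,1) PRIMARY KEY,
    SectionId int NOT NULL REFERENCES Section(Id),
    Slug      nvarchar(200) NOT NULL     -- the page's path within its section
)
```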

+1  A: 

"Assuming a URL is nvarchar(400)"

I don't think that URLs need to be nvarchar; ordinary varchar should suffice.

Eyvind