I'm not getting everything I want from Google Analytics, so I'm building my own simple tracking system to fill in some of the gaps.

I have a session key that I send to the client as a cookie. This is a GUID. I also have a surrogate IDENTITY int column.

I will frequently have to access the session row to make updates to it during the life of the client. Finding this session row to make updates is where my concern lies.
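As a concrete sketch of the setup described above (all table and column names here are my own invention, not from the question), the schema might look like this in SQL Server:

```sql
-- Hypothetical session table: a surrogate int IDENTITY key for internal use,
-- plus a GUID that is the only value ever sent to the client as a cookie.
CREATE TABLE Sessions (
    SessionId   int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    SessionGuid uniqueidentifier  NOT NULL DEFAULT NEWID(),
    StartedAt   datetime          NOT NULL DEFAULT GETDATE(),
    PageViews   int               NOT NULL DEFAULT 0
);
```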

I only send the GUID to the client browser:

a) I don't want my technical 'hacker' users to be able to gauge what 'user id' they are - i.e. to work out how many visitors the site has had in total

b) I want to make sure no one messes with data maliciously - nobody can guess a GUID

I know GUID indexes are inefficient, but I'm not sure exactly how inefficient. I'm also not clear on how to maximize the efficiency of multiple updates to the same row.

I don't know which of the following I should do :

  • Index the GUID column and always use that to find the row
  • Do a table scan to find the row based on the GUID (assuming recent sessions are easy to find), scanning in reverse date order (if that's even possible!)
  • Avoid a GUID index and keep a hashtable of active sessions in my application tier: IDictionary<Guid, int>, to allow the 'secret' IDENTITY surrogate key to be found from the 'non-secret' GUID key.
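For reference, the first option boils down to a query like this (assuming a hypothetical Sessions table whose SessionGuid column carries an index; the names are illustrative, not from the question):

```sql
-- With an index on SessionGuid, each mid-session update is an index seek,
-- not a scan, regardless of how many historical sessions the table holds.
UPDATE Sessions
SET PageViews = PageViews + 1
WHERE SessionGuid = @sessionGuid;
```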

There may be several thousand sessions a day.

PS. I am just trying to better understand the SQL aspects of this. I know I can do other clever things like only writing to the table on session expiration, but please keep answers SQL/index related.

+3  A: 

In this case, I'd just create an index on the GUID. Thousands of sessions a day is a completely trivial load for a modern database.

Some notes:

  • If you create the GUID index as non-clustered, the index will be small and probably be cached in memory. By default most databases cluster on primary key.
  • A GUID column is larger than an integer. But this is hardly a big issue nowadays. And you need a GUID for the application.
  • An index on a GUID is just like an index on a string, for example Last Name. That works efficiently.
  • The B-tree of an index on a GUID is harder to balance than an index on an identity column. (But not harder than an index on Last Name.) This effect can be countered by starting with a low fill factor and reorganizing the index in a weekly job. That is a micro-optimization for databases that handle a million inserts an hour or more.
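The fill-factor and weekly-reorganization idea in the last point could be sketched as follows (index and table names are placeholders I've assumed, not from the answer):

```sql
-- A low fill factor leaves free space in each index page, so the random
-- insert order of NEWID() values causes fewer page splits.
CREATE UNIQUE NONCLUSTERED INDEX IX_Sessions_SessionGuid
    ON Sessions (SessionGuid)
    WITH (FILLFACTOR = 70);

-- Scheduled weekly, this defragments the index in place.
ALTER INDEX IX_Sessions_SessionGuid ON Sessions REORGANIZE;
```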
Andomar
So the fact that there's no 'order' to the GUIDs shouldn't concern me? The index knows that 123456 is newer than 123455, but loses that knowledge for a GUID. I guess the index behaves like a hashtable whether you're indexing on a GUID or an integer column anyway? I.e. find the 'block' or whatever it is, and scan for the specific item requested. Is that a correct assumption?
Simon_Weaver
An index is a b-tree, not a hashtable. The order does not matter; SQL Server would "walk" the b-tree for both a GUID and an identity column. At the end of the walk it finds a reference to the row (a lookup). So your assumption is correct. I've edited the post after thinking it over again. :)
Andomar
@Andomar The implementation of indexes changes from database to database, column to column, and table to table. Usually it's a B-tree, but it might also be an R-tree, a hash, or something else. The database will try to guess which is best, or you can often tell it which to use. Postgres even has something called GiST: http://www.postgresql.org/docs/8.1/static/gist.html
Schwern
+1  A: 

If I understand what you're asking, you're worried that indexing and looking up your users by their GUID might slow your application down? I'm with Andomar; this is unlikely to matter unless you're inserting rows so fast that updating the index slows things down. Only on something like a logging table might that happen, and then only for complicated indices.

More importantly, did you profile it first? You don't have to guess why your program is slow, you can find out which bits are slow with a profiler. Otherwise you'll waste hours optimizing bits of code that are either A) never used or B) already fast enough.

Schwern
+2  A: 

Assuming you are using SQL Server 2005 or above, your scenario might benefit from NEWSEQUENTIALID(), the function that gives you ordered GUIDs.

Consider this quote from the article Performance Comparison - Identity() x NewId() x NewSequentialId

"The NEWSEQUENTIALID system function is an addition to SQL Server 2005. It seeks to bring together, what used to be, conflicting requirements in SQL Server 2000; namely identity-level insert performance, and globally unique values."

Declare your table as

create table MyTable(
    id uniqueidentifier default newsequentialid() not null primary key clustered
);

However, keep in mind, as Andomar noted, that the sequential nature of the GUIDs produced also makes them easy to predict. There are ways to make this harder, but none that would make this better than applying the same techniques to sequential integer keys.

Like the other authors, I seriously doubt that the overhead of using straight NEWID() GUIDs would be significant enough for your application to notice. You would be better off focusing on minimizing roundtrips to your DB than on implementing custom caching scenarios such as the dictionary you propose.
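As one way of trimming roundtrips, SQL Server 2005's OUTPUT clause lets you fold the write and the read-back into a single statement. A sketch, again with assumed table and column names:

```sql
-- Update the session and return the columns the application needs
-- in the same roundtrip, instead of issuing a separate SELECT.
UPDATE Sessions
SET PageViews = PageViews + 1
OUTPUT inserted.SessionId, inserted.PageViews
WHERE SessionGuid = @sessionGuid;
```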

Peter Stuer
+1 Interesting though probably not applicable here, because sequential guids would be easy to guess?
Andomar
@Andomar: True. http://blogs.msdn.com/sqlprogrammability/archive/2006/03/23/559061.aspx explains how NEWSEQUENTIALID is implemented. You could make it less predictable by running random-interval calls to UuidCreateSequential through the system, tuning the mean frequency to the desired dispersal, but that would be a kludge, and impractical depending on the desired level of unguessability.
Peter Stuer