I have a number of tables with text columns that contain only a few distinct values. I often weigh the benefits (primarily reduced row size) of extracting the possible values into a lookup table and storing a small index in the main table against the amount of work required to do so.

For the columns that have a fixed set of values known in advance (enumerated values), this isn't so bad, but the more painful case is when I know I have a small set of unique values, but I don't know in advance what they will be.

For example, if I have a table that stores log information on different URLs in a web application:

CREATE TABLE [LogData]
(
  ResourcePath varchar(1024) NOT NULL,
  EventTime datetime NOT NULL,
  ExtraData varchar(MAX) NOT NULL
)

I waste a lot of space by repeating the ResourcePath for every request. There will be a very large number of duplicate entries in this table. I usually end up with something like this:

CREATE TABLE [LogData]
(
  ResourcePathId smallint NOT NULL,
  EventTime datetime NOT NULL,
  ExtraData varchar(MAX) NOT NULL
)
CREATE TABLE [ResourcePaths]
(
  ResourcePathId smallint NOT NULL,
  ResourceName varchar(1024) NOT NULL
)

In this case, however, I no longer have a simple way to append data to the LogData table. I have to do a lookup on the ResourcePaths table to get the Id, add it if it is missing, and only then can I perform the actual insert. This makes the code much more complicated and changes my write-only logging function into one that requires some sort of transaction against the lookup table.

Am I missing something obvious?

A: 

If you have a unique index on ResourceName, the lookup should be very fast even on a big table. However, it has disadvantages. For instance, if you log a lot of data and have to archive it periodically, say the previous month or year of LogData, you are forced to keep all of ResourcePaths. You can come up with solutions for all of that.
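
As a rough sketch of the index mentioned above (the index name is made up for illustration): note that ResourceName is declared as varchar(1024), and as far as I know older SQL Server versions cap index keys at 900 bytes (nonclustered keys were raised to 1700 bytes in SQL Server 2016), so inserting a very long path could fail on older versions.

```sql
-- Sketch only: UX_ResourcePaths_ResourceName is a hypothetical index name.
-- The unique index lets the lookup by name be an index seek and also
-- guarantees that the same path is never stored twice.
CREATE UNIQUE NONCLUSTERED INDEX UX_ResourcePaths_ResourceName
  ON ResourcePaths (ResourceName);
```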

Matt Wrock
I recognize that the lookups can be fast; the real problem with the simple table design is that the size of the tables grows very rapidly. This affects both the efficiency of queries against the table (due to increased IO activity) and, more importantly, the amount of data I have to transfer when I am inserting.
FlipThePig
Personally, I would go with the two-table design. If this is going to grow as fast as you say, I would add a partition key to both tables. Let's say you partition by month. You would have a month key in both tables, and the resource names would be unique only within a month key. This lets you archive partitions of both tables together when you need to purge for space, and the partition scheme should provide performance gains as well.
Matt Wrock
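
The month-key idea in the comment above might look roughly like this. It is only a sketch: the MonthKey column, its yyyymm encoding, and the constraint names are assumptions, and the actual partition function and scheme are left out.

```sql
-- Sketch only: MonthKey and the constraint names are hypothetical.
CREATE TABLE ResourcePaths
(
  MonthKey int NOT NULL,               -- assumed encoding, e.g. 202401 for January 2024
  ResourcePathId smallint NOT NULL,
  ResourceName varchar(1024) NOT NULL,
  CONSTRAINT PK_ResourcePaths PRIMARY KEY (MonthKey, ResourcePathId),
  -- A name is unique only within its month, so a month can be archived as a unit.
  CONSTRAINT UQ_ResourcePaths_Month_Name UNIQUE (MonthKey, ResourceName)
);

CREATE TABLE LogData
(
  MonthKey int NOT NULL,               -- same month key as the matching ResourcePaths row
  ResourcePathId smallint NOT NULL,
  EventTime datetime NOT NULL,
  ExtraData varchar(MAX) NOT NULL
);

-- Lookups and joins then always include the month key.
SELECT l.EventTime, p.ResourceName, l.ExtraData
FROM LogData AS l
JOIN ResourcePaths AS p
  ON p.MonthKey = l.MonthKey
 AND p.ResourcePathId = l.ResourcePathId
WHERE l.MonthKey = 202401;
```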
A: 

Yes: insert from the existing data, doing the lookup as part of the insert.

Given @resource, @time and @data as inputs

insert into LogData (ResourcePathId, EventTime, ExtraData)
    select ResourcePathId, @time, @data
        from ResourcePaths
        where ResourceName = @resource
Mark
This only works if the resource path is already defined. If the path doesn't already exist, I have to INSERT it...
FlipThePig
True, but it is quicker to do this first and, only if it inserts no rows, then do the insert into ResourcePaths, especially if adding a new resource is much less frequent than adding a log message.
Mark
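
The "try the insert first, add the path only when nothing was inserted" approach from this exchange could be sketched as a stored procedure along these lines. The procedure name is invented, and the sketch assumes that ResourcePathId is an IDENTITY column and that ResourceName has a unique index, neither of which is stated in the thread.

```sql
-- Sketch only. Assumptions not stated in the thread:
--   * ResourcePaths.ResourcePathId is an IDENTITY column (ids are generated automatically)
--   * ResourcePaths.ResourceName has a unique index
--   * dbo.AddLogEntry is a hypothetical procedure name
CREATE PROCEDURE dbo.AddLogEntry
  @resource varchar(1024),
  @time datetime,
  @data varchar(MAX)
AS
BEGIN
  SET NOCOUNT ON;

  -- Common case: the path is already known, so a single statement
  -- performs both the lookup and the insert.
  INSERT INTO LogData (ResourcePathId, EventTime, ExtraData)
  SELECT ResourcePathId, @time, @data
  FROM ResourcePaths
  WHERE ResourceName = @resource;

  -- Zero rows inserted means the path was not in ResourcePaths yet.
  IF @@ROWCOUNT = 0
  BEGIN
    -- Add the path unless another session added it in the meantime.
    -- UPDLOCK + HOLDLOCK keep the existence check and the insert from racing.
    INSERT INTO ResourcePaths (ResourceName)
    SELECT @resource
    WHERE NOT EXISTS (
      SELECT 1
      FROM ResourcePaths WITH (UPDLOCK, HOLDLOCK)
      WHERE ResourceName = @resource
    );

    -- Retry the log insert now that the path exists.
    INSERT INTO LogData (ResourcePathId, EventTime, ExtraData)
    SELECT ResourcePathId, @time, @data
    FROM ResourcePaths
    WHERE ResourceName = @resource;
  END
END
```

The slow path runs only for a path seen for the first time, so in the usual case the logging call remains a single insert statement.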