How should I manage tables that record site 'events', i.e. certain activities a user has performed on a website, which I use for tracking? I want to be able to do all kinds of data mining and correlation between the different activities of users and what they have done.

Today alone I added 107,000 rows to my SiteEvent table. I don't think this is sustainable!

The database is SQL Server. I'm mainly referring to best-practice activities with respect to managing large amounts of data.

For instance :

  • Should I keep these tables in a database all of their own? If I need to join with other tables this could be a problem. Currently I just have one database with everything in it.
  • How ought I to purge old records? I want to ensure my DB file doesn't keep growing.
  • What are the best practices for backing up and truncating logs?
  • Will adding additional indexes dramatically increase the size of the DB with so many records?
  • Any other things I need to do in SQL Server that might come back to bite me later?

FYI, these are the tables:

CREATE TABLE [dbo].[SiteEvent](
    [SiteEventId] [int] IDENTITY(1,1) NOT NULL,
    [SiteEventTypeId] [int] NOT NULL,
    [SiteVisitId] [int] NOT NULL,
    [SiteId] [int] NOT NULL,
    [Date] [datetime] NULL,
    [Data] [varchar](255) NULL,
    [Data2] [varchar](255) NULL,
    [Duration] [int] NULL,
    [StageSize] [varchar](10) NULL,

and

CREATE TABLE [dbo].[SiteVisit](
    [SiteVisitId] [int] IDENTITY(1,1) NOT NULL,
    [SiteUserId] [int] NULL,
    [ClientGUID] [uniqueidentifier] ROWGUIDCOL  NULL CONSTRAINT [DF_SiteVisit_ClientGUID]  DEFAULT (newid()),
    [ServerGUID] [uniqueidentifier] NULL,
    [UserGUID] [uniqueidentifier] NULL,
    [SiteId] [int] NOT NULL,
    [EntryURL] [varchar](100) NULL,
    [CampaignId] [varchar](50) NULL,
    [Date] [datetime] NOT NULL,
    [Cookie] [varchar](50) NULL,
    [UserAgent] [varchar](255) NULL,
    [Platform] [int] NULL,
    [Referer] [varchar](255) NULL,
    [RegisteredReferer] [int] NULL,
    [FlashVersion] [varchar](20) NULL,
    [SiteURL] [varchar](100) NULL,
    [Email] [varchar](50) NULL,
    [FlexSWZVersion] [varchar](20) NULL,
    [HostAddress] [varchar](20) NULL,
    [HostName] [varchar](100) NULL,
    [InitialStageSize] [varchar](20) NULL,
    [OrderId] [varchar](50) NULL,
    [ScreenResolution] [varchar](50) NULL,
    [TotalTimeOnSite] [int] NULL,
    [CumulativeVisitCount] [int] NULL CONSTRAINT [DF_SiteVisit_CumulativeVisitCount]  DEFAULT ((0)),
    [ContentActivatedTime] [int] NULL CONSTRAINT [DF_SiteVisit_ContentActivatedTime]  DEFAULT ((0)),
    [ContentCompleteTime] [int] NULL,
    [MasterVersion] [int] NULL CONSTRAINT [DF_SiteVisit_MasterVersion]  DEFAULT ((0)),
A: 

Re-thinking the problem might be just what the doctor ordered. Can 100k records per day really be that useful? Seems like information overload to me. Maybe start by reducing the granularity of your usage tracking?

xanadont
Yes! I definitely want to do that! This is only about 9 events per visitor though, so it's not completely overkill. Plus we expect a lot more traffic coming.
Simon
A: 

In terms of re-thinking the problem, you might explore one of the many web statistics packages out there. There are only a few fields in your sample table that are not part of an out-of-the-box implementation of WebTrends or Google Analytics or many others. The other items in your table can be set up as well, but take a bit more thought and some research into which package will meet all of your needs. Most of the off-the-shelf stuff can deal with campaign tracking, etc. these days.

One other option would be to offload the common stuff to a standard web-stats package and then parse this back into SQL Server with your custom data out-of-band.

I don't know how much other data you have, but if the 107K+ records a day represent the bulk of it, you might end up spending your time keeping your web stats working rather than working on your application's actual functionality.

JasonS
The main reason we're not using some out-of-the-box tracking is that the site is Flash/Flex based. I also specifically wanted to be able to join with other domain-specific tables. It's doing OK, but I just wanted to start listening to advice! Thanks.
Simon
A: 

I would keep them in the same database, unless you can safely purge/archive old records into a separate store for OLAP querying and keep the primary database for OLTP purposes.
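For example, a nightly archive/purge job along these lines could move old events out of the OLTP table. This is only a sketch: the Archive database, the matching table in it, and the 90-day retention window are assumptions, not anything from your schema.

-- Sketch only: assumes an Archive database containing a dbo.SiteEvent table of
-- the same shape but without the IDENTITY property, so original IDs copy over.
DECLARE @cutoff datetime;
SET @cutoff = DATEADD(DAY, -90, GETDATE());

-- Copy old rows into the archive store used for OLAP-style querying.
INSERT INTO Archive.dbo.SiteEvent
    (SiteEventId, SiteEventTypeId, SiteVisitId, SiteId, [Date], Data, Data2, Duration, StageSize)
SELECT SiteEventId, SiteEventTypeId, SiteVisitId, SiteId, [Date], Data, Data2, Duration, StageSize
FROM dbo.SiteEvent
WHERE [Date] < @cutoff;

-- Then remove them from the OLTP table so it stays lean.
DELETE FROM dbo.SiteEvent
WHERE [Date] < @cutoff;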

Make sure you set a large initial size for the database and set a large autogrow value, and ensure you don't run out of disk space. 107k records a day is going to take up space no matter how you store it.
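For example (the database and logical file names below are placeholders; sys.master_files lists the real logical names), you could pre-size the files and switch to a fixed-size autogrow instead of the default percentage growth:

-- Sketch only: MyTrackingDb and the logical file names are assumptions.
-- Pre-size the data and log files so routine inserts never trigger autogrow.
ALTER DATABASE MyTrackingDb MODIFY FILE (NAME = MyTrackingDb_Data, SIZE = 20GB);
ALTER DATABASE MyTrackingDb MODIFY FILE (NAME = MyTrackingDb_Log,  SIZE = 4GB);

-- Use a fixed growth increment as a safety net rather than percentage growth.
ALTER DATABASE MyTrackingDb MODIFY FILE (NAME = MyTrackingDb_Data, FILEGROWTH = 2GB);
ALTER DATABASE MyTrackingDb MODIFY FILE (NAME = MyTrackingDb_Log,  FILEGROWTH = 1GB);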

As for backups, that's completely dependent on your requirements. A weekly full backup, daily differentials, and transaction log backups every hour or two should work fine, as long as the I/O subsystem can cope with it.
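As a rough sketch (the database name and backup paths are placeholders), that schedule maps to three scheduled jobs; the log backups are also what keep the transaction log truncated:

-- Weekly full backup.
BACKUP DATABASE MyTrackingDb
    TO DISK = N'D:\Backups\MyTrackingDb_full.bak' WITH INIT;

-- Daily differential backup.
BACKUP DATABASE MyTrackingDb
    TO DISK = N'D:\Backups\MyTrackingDb_diff.bak' WITH DIFFERENTIAL, INIT;

-- Transaction log backup every hour or two (requires the FULL recovery model);
-- regular log backups are what stop the log file from growing without bound.
BACKUP LOG MyTrackingDb
    TO DISK = N'D:\Backups\MyTrackingDb_log.trn' WITH INIT;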

Additional indexes will take up space, but again, it depends on which columns you add. As a rough estimate, for a nonclustered index on an int column each index entry stores the 4-byte indexed column plus the 4-byte clustering key (your int primary key). So for each 1 million rows, such an index takes up roughly 10^6 * (4 + 4) bytes, i.e. about 8 MB, ignoring page and row overhead.
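Once an index exists you can check what it actually costs. This is just a sketch against the standard DMVs; it reports the size in MB of each index on SiteEvent:

-- Sketch only: sum the pages used by each index on dbo.SiteEvent (8 KB pages).
SELECT  i.name AS index_name,
        SUM(ps.used_page_count) * 8 / 1024.0 AS size_mb
FROM    sys.dm_db_partition_stats AS ps
JOIN    sys.indexes AS i
        ON i.object_id = ps.object_id AND i.index_id = ps.index_id
WHERE   ps.object_id = OBJECT_ID('dbo.SiteEvent')
GROUP BY i.name;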

When the table grows, you can add servers and do horizontal partitioning on the table so you spread out the data on multiple servers.
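Before going multi-server, single-server date partitioning is a useful stepping stone for the same idea. This is only a sketch: the boundary dates, object names, trimmed column list, and the single PRIMARY filegroup are placeholders, and table partitioning is an Enterprise Edition feature in SQL Server 2005/2008.

-- Sketch only: partition the event table by month on [Date]. Splitting ranges
-- across servers (federation) follows the same principle, just across instances.
CREATE PARTITION FUNCTION pfSiteEventByMonth (datetime)
    AS RANGE RIGHT FOR VALUES ('2009-01-01', '2009-02-01', '2009-03-01');

CREATE PARTITION SCHEME psSiteEventByMonth
    AS PARTITION pfSiteEventByMonth ALL TO ([PRIMARY]);

-- Example table on the partition scheme (remaining columns omitted for brevity).
-- Old monthly partitions can later be switched out and archived or dropped cheaply.
CREATE TABLE dbo.SiteEventPartitioned (
    SiteEventId     int IDENTITY(1,1) NOT NULL,
    SiteEventTypeId int NOT NULL,
    SiteVisitId     int NOT NULL,
    SiteId          int NOT NULL,
    [Date]          datetime NOT NULL,
    Data            varchar(255) NULL
) ON psSiteEventByMonth ([Date]);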

As for the I/O, which is probably going to be the largest hurdle, make sure you have enough spindles to handle the load, preferably with the indexes on their own disk set/LUN and the actual data on its own set of disks/LUN.

Mark S. Rasmussen
+1  A: 

Personally, I would absolutely keep the log records outside the main database. The performance of your application would take a huge hit from having to constantly do these writes.

I think the way to go is to create a secondary database on a different machine, publish a SOAP API that is independent of the underlying DB schema, and have the application report to that. I'd also suggest that maybe-write semantics (don't wait for a confirmation response) could work for you, if you can risk losing some of this information.

On the secondary DB you can have your API calls trigger some sort of database pruning or detach/backup/recreate maintenance procedure. If you need a log then you shouldn't give up on the possibility of it being useful in the future.
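For the pruning side, a batched delete keeps the job from holding long locks or bloating the transaction log; the six-month retention window and batch size below are just placeholder assumptions:

-- Sketch only: prune events older than the retention window in small batches
-- so each delete transaction stays short.
DECLARE @cutoff datetime;
SET @cutoff = DATEADD(MONTH, -6, GETDATE());

WHILE 1 = 1
BEGIN
    DELETE TOP (5000) FROM dbo.SiteEvent
    WHERE [Date] < @cutoff;

    -- Stop once a pass deletes nothing.
    IF @@ROWCOUNT = 0 BREAK;
END;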

If you need some sort of analysis service on top of that, SQL Server is the best way to go. Otherwise MySQL or PostgreSQL will do the job much more cheaply.

KCorax
+1  A: 

You said two things that are in conflict with each other.

  1. I want to be able to do all kinds of data mining and correlation between the different activities of users and what they have done.
  2. I want to ensure my DB file doesn't keep growing.

I am also a big fan of data mining, but you need data to mine. In my view, create a scalable database design and plan for it to grow TREMENDOUSLY. Then go grab all the data you can. Then, finally, you will be able to do all the cool data mining you are dreaming about.

Maitus