views:

303

answers:

4

Performance consideration: spreading rows across multiple tables vs. concentrating all rows in one table.

Hi.

I need to log information about every step that goes on in the application to an SQL database. There are certain tables I want the log to be related to: Product (log when a product has been created, changed, etc.), Order (same as above), Shipping (same), and so on.

The data will need to be retrieved often.

I have a few ideas on how to do it:

  1. Have a log table that will contain columns for all these tables; then when I want to present data in the UI for a certain Product, I will do select * from Log where LogId = Product.ProductId. I know it might look funny to have that many columns, but I have a feeling that performance will be better. On the other hand, there will be a huge number of rows in this table. (A rough sketch of this option follows below the list.)
  2. Have a separate log table for each log type (ProductLogs, OrderLogs, etc.). I really don't like this idea since it's not consistent, and having many tables with the same structure doesn't make sense, but (?) it might be quicker when searching a table that has fewer rows (am I wrong?).
  3. Building on option 1, I could add a second many-to-one table with LogId, TableNameId and RowId columns that relates a log row to many table rows in the DB, and then have a UDF to retrieve the data (e.g. log 234 belongs to table Customer at CustomerId 345 and to the Product table where ProductId = RowId). I think this is the nicest way to do it, but again, there might be a huge number of rows; will it slow down the search, or is this how it should be done? What do you say?
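
For comparison, a rough sketch of what the option 1 table might look like (the per-table columns are just illustrative):

CREATE TABLE [dbo].[Log](
    [LogId] [int] IDENTITY(1,1) NOT NULL,
    [UserId] [int] NULL,
    [ProductId] [int] NULL,   -- filled only for product-related entries
    [OrderId] [int] NULL,     -- filled only for order-related entries
    [ShippingId] [int] NULL,  -- filled only for shipping-related entries
    [Description] [varchar](1024) NOT NULL,
 CONSTRAINT [PK_Log] PRIMARY KEY CLUSTERED ([LogId] ASC)
)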

Example of No. 3 in the above list:

CREATE TABLE [dbo].[Log](
    [LogId] [int] IDENTITY(1,1) NOT NULL,
    [UserId] [int] NULL,
    [Description] [varchar](1024) NOT NULL,
 CONSTRAINT [PK_Log] PRIMARY KEY CLUSTERED 
(
    [LogId] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO
ALTER TABLE [dbo].[Log]  WITH CHECK ADD  CONSTRAINT [FK_Log_Table] FOREIGN KEY([UserId])
REFERENCES [dbo].[Table] ([TableId])
GO
ALTER TABLE [dbo].[Log] CHECK CONSTRAINT [FK_Log_Table]
---------------------------------------------------------------------
CREATE TABLE [dbo].[LogReference](
    [LogId] [int] NOT NULL,
    [TableName] [varchar](32) NOT NULL,
    [RowId] [int] NOT NULL,
 CONSTRAINT [PK_LogReference] PRIMARY KEY CLUSTERED 
(
    [LogId] ASC,
    [TableName] ASC,
    [RowId] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO
SET ANSI_PADDING OFF
GO
ALTER TABLE [dbo].[LogReference]  WITH CHECK ADD  CONSTRAINT [FK_LogReference_Log] FOREIGN KEY([LogId])
REFERENCES [dbo].[Log] ([LogId])
GO
ALTER TABLE [dbo].[LogReference] CHECK CONSTRAINT [FK_LogReference_Log]
---------------------------------------------------------------------
CREATE FUNCTION [dbo].[GetLog]
(   
    @TableName varchar(32),
    @RowId int
)
RETURNS 
@Log TABLE
(    
    LogId int not null,
    UserId int null, -- nullable to match Log.UserId
    Description varchar(1024) not null
)
AS
BEGIN

INSERT INTO @Log
SELECT     [Log].LogId, [Log].UserId, [Log].Description
FROM         [Log] INNER JOIN
                      LogReference ON [Log].LogId = LogReference.LogId
WHERE     (LogReference.TableName = @TableName) AND (LogReference.RowId = @RowId)
    RETURN 
END
GO
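
The function would then be called along these lines (the ids are made up):

-- all log entries attached to product 345
SELECT LogId, UserId, Description
FROM dbo.GetLog('Product', 345)
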
+3  A: 

Be careful about pre-optimizing databases. Most databases are reasonably fast and fairly complicated internally. You want to run a test for efficiency first.

Second, putting everything in one table makes it more likely that the results you want are in the cache, which will speed up performance immensely. Unfortunately, it also makes it much more likely that you will have to search a gigantic table to find what you are looking for. This can be partly solved with an index, but indexes don't come for free (they make writes more expensive, for one).
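
For example, with the LogReference table from the question, a nonclustered index along these lines (the name is just illustrative) would let the (TableName, RowId) lookup avoid scanning the whole table; verify against the actual execution plan:

-- LogId is carried along automatically because it is part of the clustered key
CREATE NONCLUSTERED INDEX IX_LogReference_TableName_RowId
    ON [dbo].[LogReference] ([TableName], [RowId])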

My advice would be to run a test to see if the performance really matters, and then test the different scenarios to see which is the fastest.
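
One rough way to compare the candidate designs, once each one is loaded with representative data, is to run the query you will actually use and look at the I/O and timing statistics (the ids here are made up):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT l.LogId, l.UserId, l.Description
FROM dbo.[Log] l
INNER JOIN dbo.LogReference r ON r.LogId = l.LogId
WHERE r.TableName = 'Product' AND r.RowId = 345;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;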

tomjen
I am kind of tight on development time; what would be the most recommended way, in your opinion?
Shimmy
The cleanest and most efficient way is No. 3 in the list; the question is whether it will be too slow.
Shimmy
A: 

Try to implement your data access layer in such a way that you can change from one database model to another if needed; that way you can just pick one and worry about performance implications later.
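
For instance, if the application only ever reads the log through a view (or through the UDF from the question) rather than hitting the base tables directly, the underlying storage can be swapped later without touching the callers; a minimal sketch, assuming the Log/LogReference pair from the question:

-- Callers query the view; it can later be redefined over one wide table,
-- per-type tables, or any other layout without changing application code.
CREATE VIEW dbo.vw_ProductLog
AS
SELECT l.LogId, l.UserId, l.Description, r.RowId AS ProductId
FROM dbo.[Log] l
INNER JOIN dbo.LogReference r ON r.LogId = l.LogId
WHERE r.TableName = 'Product'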

Without doing some performance testing and having an accurate idea of the sort of load you're going to get, it's going to be difficult to optimise, as the performance depends on a number of factors, such as the number of reads, the number of writes, and whether or not the reads and writes are likely to conflict and cause locking.

My preference would be for option 1, by the way: it's the simplest to do, and there are a number of tweaks you can make to help fix the various sorts of problems you might run into.

Kragen
+1  A: 

If you're talking about large volumes of data (millions of rows and up), then you will get a benefit from using different tables to store them in.

e.g. a basic example: 50 million log entries, assuming 5 different "types" of log. It's better to have 5 x 10-million-row tables than 1 x 50-million-row table (a sketch of one such per-type table follows the list below).

  • INSERT performance will be better with individual tables - indexes on each table will be smaller and so quicker/easier to update and maintain as part of the insert operation

  • READ performance will be better with individual tables - less data to query, smaller indexes to traverse. Also, with a single table it sounds like you'd need to store an extra column to identify what type of log entry a record is (Product, Shipping, ...)

  • MAINTENANCE on smaller tables is less painful (statistics, index defragging/rebuilding etc)
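
A sketch of one such per-type table (the names are illustrative); it stays narrow, and its indexes only cover that type's rows:

CREATE TABLE [dbo].[ProductLog](
    [ProductLogId] [int] IDENTITY(1,1) NOT NULL,
    [ProductId] [int] NOT NULL,
    [UserId] [int] NULL,
    [Description] [varchar](1024) NOT NULL,
 CONSTRAINT [PK_ProductLog] PRIMARY KEY CLUSTERED ([ProductLogId] ASC)
)
GO
-- lookups for a single product only have to traverse this (smaller) index
CREATE NONCLUSTERED INDEX IX_ProductLog_ProductId
    ON [dbo].[ProductLog] ([ProductId])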

Essentially, this is about partitioning data. From SQL Server 2005 onwards there is built-in support for partitioning, but you need Enterprise Edition for that. It basically allows you to partition the data in one table to improve performance (e.g. you'd have your one Log table, and then define how the data within it is partitioned); see the sketch below.
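
A rough sketch of what that could look like, assuming you add a LogTypeId column to the single log table and map each log type to its own partition (the names and boundary values are made up):

-- e.g. 1 = Product, 2 = Order, 3 = Shipping, ...
CREATE PARTITION FUNCTION pfLogType (int)
AS RANGE LEFT FOR VALUES (1, 2, 3, 4)
GO
CREATE PARTITION SCHEME psLogType
AS PARTITION pfLogType ALL TO ([PRIMARY])
GO
CREATE TABLE [dbo].[PartitionedLog](
    [LogId] [int] IDENTITY(1,1) NOT NULL,
    [LogTypeId] [int] NOT NULL,
    [UserId] [int] NULL,
    [Description] [varchar](1024) NOT NULL,
 CONSTRAINT [PK_PartitionedLog] PRIMARY KEY CLUSTERED ([LogId] ASC, [LogTypeId] ASC)
) ON psLogType ([LogTypeId])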

I listened to an interview with one of the eBay architects recently, who stressed the importance of partitioning when you need performance and scalability, and I strongly agree based on my experience.

AdaTheDev
+1  A: 

I would definitely go for option 3, for several reasons:

Data should be in the fields of a table, not in a table name (option 2) or a field name (option 1). That way the database becomes easier to work with and easier to maintain.

Narrower tables generally perform better. The number of rows has less impact on performance than the number of fields.

If you have a field for each table (option 1), you are likely to get a lot of empty fields when only a few of the tables are affected by an operation.
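
On the insert side, option 3 keeps the write path a small, fixed pattern no matter how many tables an operation touches; a sketch using the tables from the question (the ids are made up):

DECLARE @LogId int

-- one log entry...
INSERT INTO dbo.[Log] (UserId, Description)
VALUES (12, 'Order 345 created for customer 234')
SET @LogId = SCOPE_IDENTITY()

-- ...referenced from every affected row
INSERT INTO dbo.LogReference (LogId, TableName, RowId) VALUES (@LogId, 'Order', 345)
INSERT INTO dbo.LogReference (LogId, TableName, RowId) VALUES (@LogId, 'Customer', 234)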

Guffa
You see, I agree with you. The question is especially about the insert AND the search; I don't mind if retrieval is slow.
Shimmy