tags:

views:

83

answers:

5

I have a table where I store comments for user users. I will have 100 Million+ comments.

2 ways I can create it:

Option 1: user name and comment id as PK. That way all comments are stored physically by user name and comment id.

CREATE TABLE [dbo].[Comments](
    [user] [varchar](20) NOT NULL,
    [com_id] [int] IDENTITY(1,1) NOT NULL,
    [com_posted_by] [varchar](20) NOT NULL,
    [com_posted_on] [smalldatetime] NOT NULL CONSTRAINT DEFAULT (getdate()),
    [com_text] [nvarchar](225) COLLATE NOT NULL,
 CONSTRAINT [PK_channel_comments] PRIMARY KEY CLUSTERED 
 ([channel] ASC, [com_id] ASC) WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]) ON [PRIMARY]

Pros: My query will be get all or top 10 comments for a user order by comment_id DESC. This is SEEK

Option 2: I can make the comment id as the PK. That will store the comments sorted by the comment id, not user name.

Cons: Getting latest top 10 comments of a given user is not a seek anymore as data not stored by user (ie. not sorted by user). So I have to create other index to improve the query performance.

Which way is best way to proceed? How about insertion and deletion? These operations are allowed. But read is frequent.

User can't modify their comments.

I tested both tables with 1.1M rows. Here is the result:

table_name  rows        reserved    data        index_size  unused
comments2   1079892     99488 KB    62824 KB    36576 KB    88 KB  (PK: com_id  Second Index on (user_name, com_id))
comments1   1079892     82376 KB    82040 KB    328 KB      8 KB   (PK: user_name, no other indices)
--------------------------------------------------------------------
diff:       same rows   17112KB     -19216KB    36,248KB    80KB

So the table with com_id as PK is using 36MB extra disk space just for the 2 index The select top query on both table using SEEK, but table with com_id as PK is slower But insertion is slightly faster when I have com_id as PK

Any comments?

+2  A: 

I would use the Comment ID as the Primary Key for the table. If you are going to have a lot of queries that use the Comment ID and the User name, its probably simpler just to add an Index on those fields.

William Leader
Which option will perform better for insertion and deletion?
Projapati
someone deleted comments from marc?
Projapati
The only sure way to know for sure in your environment is to try it both ways and see. However, I agree that my first choice would be to make the Comment ID the primary key, them have a separate index dedicated to making the last 10 comments from a user go faster (perhaps an index based on the com_posted_by / com_posted_on columns.You have more reads than writes in your scenario, so the little bit of overhead (in terms of creating index data on insert) is small compared to the gains you'll get from the retrieves (which happen many more times).
RQDQ
posted test result in the question above
Projapati
A: 

I would not use User name in a PK as it may change, creating cascade update issues later. Also, concatenating those two into the PK creates a large(r) PK that might have to be passed to other tables as a FK. I try to keep PK that appear as FKs as small as possible, unless I know I will want all the PK of the contributing tables in one large key for speed of query. Comment id should be fine. You may need to create an additional index for fast searching on comment id and user name. Will you be doing more insertions/updates or queries? if query intensive, then the index is not an issue.

Joe
username will never change. The PK is not used a FK of other table. There is cost\ovehead of having another index. Every insert\delete would require the other index to be modified. Plus 2nd index will cost disk space as well.
Projapati
insertion\deletion is much less common that selecting\finding comments of specific user
Projapati
posted test result in the question above
Projapati
A: 

Are you sure that you have that CREATE TABLE statement correct? You're using [Channel] in the PK definition, and I don't see that as a column. Did you mean [User].

Do you have a user table someplace? If so, you might save a lot of overhead by keying that on an integer value and putting UserID into the comments table, rather than User.

I would PK on the CommentID and then add a non-clustered index on [UserID, CommentID]. That gives you immediate access to a comment by ID (for deleting, etc) without having to involve the UserID value in the WHERE clause; and it provides quick access to the user's comments. I do not, however, tend to work with table of the size you anticipate.

Larry Lustig
Larry, You got it right. I renamed the table to Comments.
Projapati
I have another table where I store the users with 20 chars username as PK
Projapati
Please consider storing the username in that table attached to an integer PK instead. You'll save a lot of indexing overhead in your comments table.
Larry Lustig
posted test result in the question above
Projapati
A: 

As a rule of thumb, always choose the narrowest PK. Then, to improve performance, you may want to use an integer based User_id, instead of a varchar, and add an index for both columns.

The best approach will depends on the number of users, if you have just a few users the commet_id user_id pk could be better (additionally, parttition by user would be an option); in the other hand, if the number of users are high, a combined Pk will be useless.

MaLo
posted test result in the question
Projapati
A: 

My initial approach would be to make CommentID alone the PK, maybe in descending order so you don't have to do any reordering on select. Then put an index on UserID.

If you use the concatenated key, consider switching CommentID to desc.

ulty4life
posted test result in the question
Projapati