views:

30

answers:

4

I have a table that stores photo information with id as Primary key:

id(PK), title, album_id, posted_by, published, filename, tags, ratings, date_posted

This table will hold infor of 100+ Million photos and I need to run this query like these frequently:

1) get all photos (just id,filename,title columns) of a given album

select id, filename, title from photos where album_id = @AlbumId and published = 1

2) get all published photos of a given user but exclude photos of currently viewing album

select id, filename, title from photos where posted_by='bob' and album_id <>10 and published = 1

I want to avoid index and table scanning. I need to use seek(say 100%) as much as possible.

Can this be done? What type of index and on which columns would help me achieve this?

Thanks

+2  A: 

In reality, you'll only be able to find this out yourself by measuring performance before you tweak, then tweak, and measure again and again.

But based on your query, you should consider (or at least try this first) a non-clustered index like this:

CREATE NONCLUSTERED INDEX IX01_Photos
  ON dbo.Photos(album_id, published, posted_by)
  INCLUDE(id, filename, title)

Reasoning:

  • both your most frequent queries have WHERE clauses using album_id and published - so use these two columns first in your index
  • your second query also includes posted_by in the WHERE clause - put that into that same index as the third column
  • in order to avoid expensive bookmark lookups into the actual data table, you can include the id, filename, title columns in the index

With all these things in place, you should be seeing mostly index seeks on that new non-clustered index to satisfy your queries. But again: lots of other factors also come into play which you probably haven't mentioned in your question and possibly not even thought about yourself - but this approach should give you a good starting point no less.

marc_s
Marc, I liked your approach and reasoning!!Few more questions:Title is the only column that can be changed by the user. How will this impact your proposed non-clustered index?I have another column that stores the view count(how many times) photo was viewed. This column increased by 1 each time someone see the photo.Do you think I am better off Not include that colm in the index?
Projapati
Whether or not the user can change the "Title" doesn't matter. Obviously, if the Title column is changed, the index needs to be updated. The view count: if you need to use it in your queries, to display it - then yes, put then in the list of INCLUDE() columns. If you don't do that, then if you search for a set of photos, and want to display the view count as well, SQL Server has to go from the index (where it found your matching criteria) to the actual data pages to retrieve that view count for each hit. Those "bookmark lookup" tend to be expensive and slow.
marc_s
A: 

You didn't mention if there's a need to use the date_posted or the id as filter criteria in the query, so it might be best to use a CLUSTERED index on a non-chronological column (I'm assuming that the current CLUSTERED index is the PK. Right?).

I would create a CLUSTERED index on the album_id.

If you can't change the CLUSTERED index or there are many other queries that benefit from the existing clustered index, then I support the answer from @marc_s (and will vote accordingly.)

bobs
date_posted and id col will not be used to filter.I already have a clustered index on id column
Projapati
A: 

I would suggest a clustered index on album_id and a secondary index on posted_by, if the former is the one that will get hit most. Invert them if posted_by is hit most. Depending on how many photos there are for each album_id or posted_by, it may be quite feasible to filter on published in the calling code (in other words, don't add it as a restriction in the query, rather filter client-side). If not, you'll have to add that published constraint into the query, but the primary restriction of album_id should mean that only a small scan on published is incurred. But as stated, it may be easier just to filter on published client-side.

cjrh
A: 

Primary key on Id. Make it non-clustered. I'd guess this won't be used much (particularly if all lookups are by album or poster).

Clustered index on AlbumId. Seems like it'd be used in most queries.

Non-clustered index on Posted_By. With AlbumId the clustered index, it will appear at the leaf-level of this index, and so act pretty much like an INCLUDEd column. Depending on usage it might be better to have this as the clustered index... but as a varchar(20), it'd take up more disk space, and performance would be worse than AlbumId (assuming AlbumId is an int).

You cannot have Published as a column in the index, as you cannot index on bit columns. Nor would you want to--with only two possible values across 100M+ rows, SQL would probably never use it for optimizing queries.

I'd recommend normalizing Posted_By (move it to its own table, give it its own surrogate key, and use that as a foreign key in this table). This would significantly reduce storage space in your main table, increase overall performance, and allow you to flip the clustered index to that column if need be. (Also, if "Bob" posts to the table, and then "Bob" from across town also posts, how do you tell Bob from Bob?)

Philip Kelley
bob was used as a place holder. The app won't allow duplicate user names.
Projapati
If Posted_By is a unique column, then make it a unique non-clustered index.
Philip Kelley