views:

49

answers:

3

For my current project we want to present statistical data and rank it. For my case I'm talking about "Favouriting" of an artist, counting the times an artist's track has been played, displaying a count of how many playlists an artists track has been added to a playlist... These are all very domain specific issues, but it's a concrete example of my issue.

The main issue is that I'm going to be returning result sets that are returned in order for all these statistical attributes.

Here are some examples:

  • Music Landing page should display the top 5 artists that have been favourited the most.
  • Music Landing page should display top 5 tracks that have been played the most.

My first thought has determined I need a computed aggregate column. Since I want to order on these values that means a CLUSTERED INDEX would be optimal on each aggregate I want to order by. Secondly, since DML on CLUSTERED INDEX columns can be costly when they are not sequential on insertion I need to make this a scheduled job.

So, for the artist favourite stats, here's the DDL that I have come up with. Noted my T-SQL might be horribly off, but I think the intentions are clear.

CREATE TABLE Stats_ArtistFavourites (
    FavouriteCount INT DEFAULT 0,
    ArtistId INT PRIMARY KEY NONCLUSTERED,
    FOREIGN KEY (ArtistId) REFERENCES Artists
)

CREATED CLUSTERED INDEX IDX_Favourites 
ON Stats_ArtistFavourites (FavouriteCount, ArtistId) DESC

So as you can see, I would need to create a separate table for each stat I want to keep track of, otherwise I would have to ORDER BY columns that aren't in the CLUSTERED INDEX. The fact that this seems ugly makes me think I'm going about it all wrong.

Should I start thinking about integrating OLAP (I have very little experience with OLAP cubes)? Or maybe Lucene?

+2  A: 

Scanning by a plain index is akin to a join, as plain indices contain a indexed value along with a reference to a table block in each leaf. To extract a non-indexed value you need to "join" the table by this block reference.

Clustered indexes, on the contrary, contain the table data itself it each leaf, you get the non-indexed field values right on scan.

As long as you select 5 top records, it's OK to use plain index, as one single table is always more simple to manage.

It will be a little slower than a cluster index, as this will imply the "join" desribed above, but it's only 5 records, you will hardly note any difference.

You may even create your statistics table as such:

CREATE TABLE stats (type INTEGER, score INTEGER, artist INTEGER);
CREATE INDEX ix_stats (type, score);

, this will help you to add new aggregated values more easily.

1 for type here can mean how many time the artist is played, 2 how many times he is favorited and so on. When you need new aggregate, you just make a new type and INSERT 5 new rows into the table instead of changing its definition.

Again, if I understand you task, we are talking about selecting dozens records from this table. In this case, manageability is more important than selecting these top 5 artists 10 milliseconds faster.

Quassnoi
A: 

You might explore indexed views. http://technet.microsoft.com/en-us/library/cc917715.aspx

  • Aggregations can be precomputed and stored in the index to minimize expensive computations during query execution.
  • Tables can be prejoined and the resulting data set stored.
  • Combinations of joins or aggregations can be stored.

The first point looks like what you are after.

Sam
A: 

Have you considered using RANK? You may be surprised by the performance.

Jeff O
Looks interesting but I'm using SQL Server 2000.
Scott Muc