views:

333

answers:

5

I've got a really large table (10+ million rows) that is starting to show signs of performance degradation for queries. Since this table will probably double or triple in size relatively soon I'm looking into partitioning the table to squeeze out some query performance.

The table looks something like this:

CREATE TABLE [my_data] (
    [id] [int] IDENTITY(1,1) NOT NULL,
    [topic_id] [int] NULL,
    [data_value] [decimal](19, 5) NULL
)

So, a bunch of values for any given topic. Queries on this table will always be by topic ID, so there's a clustered index on (id, topic_id).

Anyway, since topic IDs aren't bounded (any number of topics could be added) I'd like to try partitioning this table on a modulus function of the topic IDs. So something like:

topic_id % 4 == 0 => partition 0
topic_id % 4 == 1 => partition 1
topic_id % 4 == 2 => partition 2
topic_id % 4 == 3 => partition 3

However, I haven't seen any way to tell "create partition function" or "create partition scheme" to perform this operation when deciding on a partition.

Is this even possible? How can we make a partition function based on an operation performed on the input value?

A: 

From the documentation, it seems like you have to give values to the function:

To create 4 partitions...

CREATE PARTITION FUNCTION myRangePF1 (int)
AS RANGE LEFT FOR VALUES (1, 100, 1000);

Couldn't you just do your computations above this call and find the proper values to split on? Substitute the values into the call? Or am I missing why you want to use the modulus? Based on the possibility of your ID's having gaps, you may need to use some statistics math to find out where to partition.

CREATE PARTITION FUNCTION myRangePF1 (int)
AS RANGE LEFT FOR VALUES (@low, @Med, @High);
Sam
+2  A: 

Hash partitioning is not available in SQL Server 2005/2008. You must use range partitioning.

That being said, you should be aware that partitioning is primarily a storage option, see Partitioned Table and Index Concepts:

Partitioning makes large tables or indexes more manageable, because partitioning enables you to manage and access subsets of data quickly and efficiently, while maintaining the integrity of a data collection. By using partitioning, an operation such as loading data from an OLTP to an OLAP system takes only seconds, instead of the minutes and hours the operation takes in earlier versions of SQL Server. Maintenance operations that are performed on subsets of data are also performed more efficiently because these operations target only the data that is required, instead of the whole table.

As you can see, the introduction of partitioning in MSDN focuses on maintenance, manageability and data load. In my experience partitioning gives, at best, 0 performance gain. Specially in SQL 2005. Usualy it gives performance degradation. To improve performance you should use a correct clustered index and properly designed non-clustered indexes.

In SQL 2008 there are improvements in the parallel operators in regard to partitions if they are properly distributed from an IO point of view, see Designing Partitions to Improve Query Performance. Their benefit are marginal though and overshadowed by the benefits of a properly designed set of clustered and non-clustered indexes. Case in point a clustered index in (id, topic_id) where id is an identity is usefull solely for single item lookup by id. On the other hand a clustered index by (topic_id, id) would benefit any queries that look for specific topic(s). I don't know your system requirements and the queries you run, but 10M rows performance problems on such a narrow table smell like indexing ands querying issue, no partitioning issue.

Remus Rusanu
Thanks! Changing the index from (id, topic_id) to (topic_id, id) sped up the queries noticeably. I wish I could give two accepted answers here, but Matt's answer was just what I was looking for.
_rusty
A: 

10 million rows isn't that many for SQL server to handle; regular index design would probably solve this without the need for partitioning. As has been noted, try clustering on different sets of columns; clustering on topicid, id seems like something to test out, especially if most queries have topicid as a criterion. A clustered index like that has approximately the same effect as paritioning, at least in that it groups the related rows of data together on disk and allows a range scan to fetch them quickly.

If that design works, all you have to worry about is fragmentation from inserts, but that's manageable. After getting the indexing right, make sure you have enough RAM, and that you don't have a disk bottleneck.

onupdatecascade
+1  A: 

You just need to create your modulus column as a PERSISTED computed column.

Blue Peter style, here's one I made earlier (although I'm not 100% sure I have the partition values clause right):

CREATE PARTITION FUNCTION [PF_PartitonFour] (int)
AS RANGE RIGHT
FOR VALUES (
  0,
  1,
  2)
GO

CREATE PARTITION SCHEME [PS_PartitionFourScheme]
AS PARTITION [PF_PartitonFour]
TO ([TestPartitionGroup1],
    [TestPartitionGroup2],
    [TestPartitionGroup3],
    [TestPartitionGroup4])
GO

CREATE TABLE [my_data] (
  [id] [int] IDENTITY(1,1) NOT NULL,
  [topic_id] [int] NULL,
  [data_value] [decimal](19, 5) NULL
  [PartitionElement] AS [topic_id] % 4 PERSISTED,
) ON [PS_PartitionFourScheme] (PartitionElement);
GO
Matt Whitfield
Thanks! This is exactly what I was looking for.
_rusty
Technically correct, but doesn't this miss the point of the OP's question (performance)?
onupdatecascade
No, it's simply answering the question being asked, not the question that *I think* should be asked. Plus, somebody finding this question from google will not want to find a load of preaching, they will want to find how to partition a table based on a modulus function, which, correct me if i'm wrong, is the answer I gave.
Matt Whitfield
Based on the OP's response, I see no point missed ;)
Rob Cooper
OK, fair enough. I read "... I'm looking into partitioning the table to squeeze out some query performance." and went straight to, "well, partitioning's not what you want to do to fix that problem." :-)
onupdatecascade
A: 

Hi, How does partitioning improve query performance? Is it just enough that we include the partitioning column in the where clause or is there something more to be done to enhance query performance on a partitioned table? Thanks in advance..

Developer
Ask this as a new question. Not an answer to somebody elses!
Martin Smith