I am designing a new laboratory database.

My primary data tables will have at least id (PK NUMBER) and created_on (DATE). Also, for any two entries, the entry with a higher id will have a later created_on date.

I plan to partition by created_on to improve performance on recently entered data. Since the columns increase together, the table would implicitly be partitioned by id as well. However, Oracle wouldn't know about this implied partitioning by id, so it couldn't take advantage of the partitioning for table joins on id.

Two questions:

  1. How do I enforce both columns increasing together?

  2. How can I take advantage of this implicit partitioning for table joins?

+2  A: 

In my opinion, the decision to partition should be based more on the need for table maintenance activities (purging, archiving, etc.) than on performance. In your case I'm guessing you'll probably be performing index range scans on the samples for a date range, so make sure the date index is locally (rather than globally) partitioned as well. This also eliminates the need to rebuild the index if you truncate a partition. I'd also guess that the joins on the PK will be lookups by rowid; those happen after the index range scan, so there's no way partitioning can affect them.
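For illustration, a rough sketch of that layout (the table, column, index, and partition names here are invented, not taken from the question):

    CREATE TABLE samples (
        id          NUMBER        NOT NULL,
        created_on  DATE          NOT NULL,
        payload     VARCHAR2(100)
    )
    PARTITION BY RANGE (created_on) (
        PARTITION p2010 VALUES LESS THAN (DATE '2011-01-01'),
        PARTITION p2011 VALUES LESS THAN (DATE '2012-01-01'),
        PARTITION pmax  VALUES LESS THAN (MAXVALUE)
    );

    -- LOCAL equipartitions the index with the table, so truncating or dropping
    -- a table partition doesn't leave a global index to rebuild.
    CREATE INDEX samples_created_on_ix ON samples (created_on) LOCAL;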

[Edit]

With regard to relating the PK and CREATED_ON columns, I work with a couple of systems that construct the numeric key from a sequence that is prefixed with YYYYMMDD and that works pretty well. You'll have to:

  • Liberally estimate the number of samples you'll have per day

  • Define a sequence that has this as a maximum value and then cycles back to 0

  • Have a function that returns YYYYMMDD || {sequence value left-padded with
    zeros to the appropriate fixed length} that is called from a trigger or application code when the key is needed

Some would disagree with embedding meaning in the key, but in practice it is useful to look at a sample ID and have an idea of when it was processed.
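A rough sketch of those steps (the sequence name, function name, and six-digits-per-day capacity are assumptions for illustration, not anything from those systems):

    -- Liberal estimate: at most 999,999 samples per day; cycles back to 0
    CREATE SEQUENCE sample_seq
        MINVALUE 0
        MAXVALUE 999999
        CYCLE
        NOCACHE;

    CREATE OR REPLACE FUNCTION next_sample_id RETURN NUMBER IS
        v_seq NUMBER;
    BEGIN
        SELECT sample_seq.NEXTVAL INTO v_seq FROM dual;
        -- e.g. 15-Jan-2010 with sequence value 42 -> 20100115000042
        RETURN TO_NUMBER(TO_CHAR(SYSDATE, 'YYYYMMDD')
                         || LPAD(TO_CHAR(v_seq), 6, '0'));
    END next_sample_id;
    /

The function would then be called from a BEFORE INSERT trigger or from the application code that inserts the sample.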

dpbradley
Hmmm, well I profoundly disagree with that. You only need high-performance purging (or loading) on a record once, but you need select performance every time you need to find it. If I had to choose one or the other, I'd partition for query performance.
David Aldridge
Oh, I wrote something a while ago along those lines: http://oraclesponge.wordpress.com/2005/11/14/choosing-partitioning-keys-for-etl-or-reporting/
David Aldridge
Yeah, you're right, I'll edit to make clear it's an opinion... still believe it however...:-)
dpbradley
@David - good article, thanks, but doesn't this apply more to denormalized DW systems?
dpbradley
@dpbradley - partitioning is generally more relevant to data warehouses than it is to OLTP systems.
APC
@APC - Right, I agree - I support a few mostly-OLTP systems that use partitioning so older transactional data can roll off without too much operational impact. Partitioning generally doesn't help our performance, and actually creates some interesting performance problems when data starts going into a new partition before the statistics have "caught up"
dpbradley
+1  A: 

It's pretty tricky, to be honest. Multicolumn partitioning is one option, whereby you create range-based partitions on more than one column. In 11g you can implement this as partitioning on Column A with subpartitioning on Column B, but in 10g you have to partition by range on the two columns together. I think the tricky part is knowing what boundaries to partition on, because you probably want the two partitioning schemes to stay in sync.
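For illustration, a minimal 11g range-range composite sketch (the table name, boundary dates, and id boundaries are invented; in practice the id boundaries would have to be chosen so they line up with the date boundaries):

    CREATE TABLE samples (
        id          NUMBER NOT NULL,
        created_on  DATE   NOT NULL
    )
    PARTITION BY RANGE (created_on)
    SUBPARTITION BY RANGE (id) (
        -- only the ids expected for this date range; a row whose id falls
        -- outside its date partition's id range will fail to insert
        PARTITION p2010h1 VALUES LESS THAN (DATE '2010-07-01') (
            SUBPARTITION p2010h1_ids VALUES LESS THAN (500000)
        ),
        PARTITION p2010h2 VALUES LESS THAN (DATE '2011-01-01') (
            SUBPARTITION p2010h2_ids VALUES LESS THAN (MAXVALUE)
        )
    );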

David Aldridge
+1  A: 

In this case, to speed up joins on "table_id", you should also store the corresponding "created_on" in the tables you will mostly be joining. If you do that you can always join on both "table_id" and "created_on", so your "PARTITION RANGE ALL" turns into "PARTITION RANGE SINGLE". You can measure the speed gains and weigh them against the additional storage cost.
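For example (the table and column names below are made up for illustration):

    -- Carrying created_on into the child table and joining on both columns
    -- lets the optimizer prune to a single partition instead of scanning all.
    SELECT s.id, r.result_value
    FROM   samples s
           JOIN results r
             ON  r.sample_id         = s.id
             AND r.sample_created_on = s.created_on
    WHERE  s.created_on >= DATE '2010-06-01'
    AND    s.created_on <  DATE '2010-07-01';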

Edit:

How to keep both fields increasing together:

ALTER TABLE my_table MODIFY created_on DEFAULT SYSDATE;

And fill the ID from a sequence in all your inserts.
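A minimal sketch (the sequence name and the non-key column are assumptions):

    -- created_on is omitted, so it picks up its DEFAULT of SYSDATE
    INSERT INTO my_table (id, some_column)
    VALUES (my_table_seq.NEXTVAL, 'some value');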

jva
Joining tables on a DATE column doesn't feel right, but I can't put my finger on the exact reason.
Steven
In some RDBMS (Sybase, maybe), DATE types are actually fuzzy and joining them is unpredictable. I'd have the same aversion to joining dates as @Steven, even if there's no rational reason for avoiding it any more.
skaffman
In Oracle, DATE joins work as well as they would with NUMBER. We are doing it on some pretty large tables and it isn't causing any problems. Think of that join as simply a partition pointer.
jva
A: 

How do I enforce both columns increasing together?

  1. Assuming it is a bulk load and the id is sequence-generated at the time of the bulk load, you could ALTER SEQUENCE between loads to get more control over the range of sequence values used for each partition. If the sequence and created_on are assigned prior to the bulk load, you may need a stage in your ETL process to work out what the min/max id is for each created date.

  2. Range-partition on created_on, range-subpartition on id. Each partition should have only one subpartition.

  3. Assuming that, since this is a new DB, you'll have 11g, how about check constraints on virtual columns? A virtual column date_partition:

    CASE WHEN created_on BETWEEN ... AND ... THEN 'PARTITION_1' WHEN created_on BETWEEN ... AND ... THEN 'PARTITION_2' ... END

A similar virtual column id_partition, though you'd have to query to get the minimum/maximum PK for each partition. That should be quick since, being the primary key, there's an index on it.

Then you add a constraint such that id_partition = date_partition.
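A rough sketch of option 3 (the table name, boundary dates, and id boundaries are placeholders; the id boundaries would come from querying the min/max PK per date range):

    ALTER TABLE samples ADD (
        date_partition VARCHAR2(20) GENERATED ALWAYS AS (
            CASE
                WHEN created_on < DATE '2010-07-01' THEN 'PARTITION_1'
                WHEN created_on < DATE '2011-01-01' THEN 'PARTITION_2'
            END
        ) VIRTUAL,
        id_partition VARCHAR2(20) GENERATED ALWAYS AS (
            CASE
                WHEN id < 500000  THEN 'PARTITION_1'
                WHEN id < 1000000 THEN 'PARTITION_2'
            END
        ) VIRTUAL
    );

    -- Rejects any row whose id falls in a different bucket than its created_on
    ALTER TABLE samples ADD CONSTRAINT samples_partition_sync_chk
        CHECK (id_partition = date_partition);

Note that rows beyond the last listed boundary make both virtual columns NULL, so the check doesn't constrain them until new buckets are added.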

Gary
+2  A: 

Hi Steven,

The really important question is: will you ever need to query by a range of IDs? It is unlikely you will need to build a query with ID BETWEEN :A AND :B. Therefore, Oracle wouldn't benefit from a correlated partition scheme. For that matter, you could use a GUID for the primary key and you would get better scalability for INSERTs.

Vincent Malgrat
No. However, I will join on the id (PK) all the time, so it would be useful to quickly know what partition a particular id is in to avoid searching through them all.
Steven
@Steven: The PK index will be global (not partitioned), therefore Oracle will only have to look in one place (the index) to find the rowid. The rowid contains the partition information, and Oracle won't have to scan through all the partitions to find a row -- it will only look in the partition pointed to by the rowid.
Vincent Malgrat
Simply, partition by date range and forget about the rest. Is that what you're saying?
Steven
If it only adds complexity and doesn't help performance, yes, don't bother =)
Vincent Malgrat