views:

182

answers:

1

I am writing a data warehouse, using MySQL as the back-end. I need to partition a table based on two integer IDs and a name string. I have read (parts of) the mySQL documentation regarding partitioning, and it seems the most appropriate partitioning scheme in this scenario would be either a HASH or KEY partitioning.

I have elected for a KEY partitioning because I (chicked out and) dont want to be responsible for providing a 'collision free' hashing algorithm for my fields - instead, I am relying on MySQL hashing to generate the keys required for hashing.

I have included below, a snippet of the schema of the table that I would like to partition based on the COMPOSITE of the following fields:

school id, course_id, ssname (student surname).

BTW, before anyone points out that this is not the best way to store school related information, I'll have to point out that I am only using the case below as an analogy to what I am trying to model.

My Current CREATE TABLE statement looks like this:

CREATE TABLE foobar (
    id         int UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
    school_id  int UNSIGNED NOT NULL,
    course_id  int UNSIGNED NOT NULL,
    ssname     varchar(64) NOT NULL,

    /* some other fields */

    FOREIGN KEY (school_id) REFERENCES school(id) ON DELETE RESTRICT ON UPDATE CASCADE,

    FOREIGN KEY (course_id) REFERENCES course(id) ON DELETE RESTRICT ON UPDATE CASCADE,

    INDEX idx_fb_si (school_id),
    INDEX idx_fb_ci (course_id),
    CONSTRAINT UNIQUE INDEX idx_fb_scs (school_id,course_id,ssname(16))
) ENGINE=innodb;

I would like to know how to modify the statement above so that the table is partitioned using the three fields I mentioned at the begining of this question (namely - school_id, course_id and the starting letter of the students surname).

Another question I would like to ask is this:

What happens in 'edge' situations for example if I attempt to insert a record that contains a valid* school_id, course_id or surname - for which no underlying partitioned table file exists - will mySQL automatically create the underlying file.?

Case in point. I have the following schools: New York Kindergaten, Belfast Elementary and the following courses: Lie Algebra in Infitesmal Dimensions, Entangled Entities

Also assume I have the following students (surnames): Bush, Blair, Hussein

When I add a new school (or course, or student), can I insert them into the foobar table (actually, I cant think why not). The reason I ask is that I forsee adding more schools and courses etc, which means that mySQL will have to create additional tables behind the scenes (as the hash will generate new keys).

I will be grateful if someone with experience in this area can confirm (preferably with links backing their assertion), that my understanding (i.e. no manual administration is required if I add new schools, courses or students to the database), is correct.

I dont know if my second question was well formed (clear) or not. If not, I will be glad to clarify further.

*VALID - by valid, I mean that it is valid in terms of not breaking referential integrity.

+2  A: 

I doubt partitioning is as useful as you think. That said, there are a couple of other problems with what you're asking for (note: the entirety of this answer applies to MySQL 5; version 6 might be different):

  • columns used in KEY partitioning must be a part of the primary key. school_id, course_id and ssname are not part of the primary key.
  • more generally, every UNIQUE key (including the primary key) must include all columns in the partition1. This means you can only partition on the intersection of the columns in the UNIQUE keys. In your example, the intersection is empty.
  • most partitioning schemes (other than KEY) require integer or null values. If not NULL, ssname will not be an integer value.
  • foreign keys and partitioning aren't supported simultaneously2. This is a strong argument not to use partitioning.

Fortunately, collision free hashing is one thing you don't need to worry about, because partitioning is going to result in collisions (otherwise, you'd only have a single row in each partition). If you could ignore the above problems as well as the limitations on functions used in partitioning expressions, you could create a HASH partition with:

CREATE TABLE foobar (
    ...
) ENGINE=innodb
  PARTITION BY HASH (school_id + course_id + ORD(ssname))
  PARTITIONS 2
;

What should work is:

CREATE TABLE foobar (
    id         int UNSIGNED NOT NULL AUTO_INCREMENT,
    school_id  int UNSIGNED NOT NULL,
    course_id  int UNSIGNED NOT NULL,
    ssname     varchar(64) NOT NULL,

    /* some other fields */

    PRIMARY KEY (id, school_id, course_id),
    INDEX idx_fb_si (school_id),
    INDEX idx_fb_ci (course_id),
    CONSTRAINT UNIQUE INDEX idx_fb_scs (school_id,course_id,ssname)
) ENGINE=innodb
      PARTITION BY HASH (school_id + course_id)
      PARTITIONS 2
;

or:

CREATE TABLE foobar (
    id         int UNSIGNED NOT NULL AUTO_INCREMENT,
    school_id  int UNSIGNED NOT NULL,
    course_id  int UNSIGNED NOT NULL,
    ssname     varchar(64) NOT NULL,

    /* some other fields */

    PRIMARY KEY (id, school_id, course_id, ssname),
    INDEX idx_fb_si (school_id),
    INDEX idx_fb_ci (course_id),
    CONSTRAINT UNIQUE INDEX idx_fb_scs (school_id,course_id,ssname)
) ENGINE=innodb
      PARTITION BY KEY (school_id, course_id, ssname)
      PARTITIONS 2
;

As for the files that store tables, MySOL will create them, though it may do it when you define the table rather than when rows are inserted into it. You don't need to worry about how MySQL manages files. Remember, there are a limited number of partitions, defined when you create the table by the PARTITIONS *n* clause.

outis
very informative answer. Thank you outis. You address all my concerns. The reason why I am thinking of implementing partitioning, is the (almost ridicilous) number of rows I will potentially be dealing with. At the last rough count, we are talking of numbers north of 150M rows for the table. If you still think partitioning will not help in this case, I would like to know your reasons. BTW, the table is already in 4NF, and all the fields are required, there is no further optimisation to be had in terms of db design.
Stick it to THE MAN
The problem with partitioning is that it doesn't often help. Some queries will still need to consult every partition (basically, it's when you filter based on a part of the partition key, such as if you were to filter a query on `school_id` but not `course_id` under the above partitioning schemes). Indices will help much more in query optimization (both partition schemes and indices are more matters of queries than table schema). Not to say you shouldn't use partitions, but the foreign keys might be more valuable.
outis
outis: The scenario you describe will not arise, because I will ALWAYS search using the three fields. In fact, before thinking of using partitioning, I was actually thinking of using physically separate table named info_[sid]_[cid]_ssn_records. I then realised that partitioning was a more elegant solution. If there is something I am overlooking though, I would be grateful if you could point out my oversight, before I embark down this road.
Stick it to THE MAN
outis: Actually, I just reread your last post, where you stated "... but the foreign keys might be more valuable...". I then checked the CREATE TABLE statement you proposed, and noticed that the FKs had disappeared. Was this an oversight? - OR am I to infer that the fields used for partitioning MAY NOT be be FKs to another table?
Stick it to THE MAN
It wasn't an oversight. See the 4th point in my answer.
outis
As for partitioning not always helping, it also depends on what limits you place on the fields. A partition will be skipped only if a query can't possibly return any row in the partition; if it might include one row, the partition will be included in the query. Indices, however, can help limit the rows a query will scan, even to the point of skipping a scan altogether. The query optimizer can make use of index key prefixes; the unique key on `(school_id, course_id, ssname)` can help a query filtered on `school_id, course_id` but not `course_id, ssname`.
outis
The thing to do is create a 2nd database with a partitioned version of the table(s) and test it against the non-partitioned version by timing queries run on each table. Use also the EXPLAIN statement to examine how a query will be executed: http://dev.mysql.com/doc/refman/5.1/en/using-explain.html
outis
Lots of useful info here - thanks. FKs being mutually exclusive with Partitioning is a potential spanner in the works. I am at a crossroads since ref integrity may be compromised if I remove the FKs. At the same time, I dont much like the idea of having 150+ million rows in one table. Indices will not add much in my scenario, since ALL queries on the foobar table will at the very minimum, specify the three fields in the composite PK. My problem is simply this: storing 159+ million rows (growing by around 11 million rows a year). On the other hand, I dont want to lose the ability to enforce R.I
Stick it to THE MAN
I dont want to lose the ability to enforce Referential Integrity (due to loss of FKs). If mySQL is capable of storing this volume of data (and growth rate) in one table, then I will simply use indexing, since as I mentioned earlier, ALL queries on this table will have as a bare minimum, the 3 fields used in the composite PK.
Stick it to THE MAN
Take a look at clustering (http://www.mysql.com/products/database/cluster/faq.html), if you have the budget for a few more servers.
outis
Hmm, tough choice. From what you say, indices should help the query engine skip over the records not to be included in the result set. So even though the table has a huge row count, performance will not suffer too badly (hopefully). I think I will use the indices and FKs for now, and maybe use partitioning/clustering later on
Stick it to THE MAN