tags:
views: 89
answers: 1

I'm building a mashup to store metadata on items from multiple REST API datasources. I'd like to be able to generate typical feeds (most recent, top rated, most viewed, etc.) based on data summarized across all the different datasources, and also to add tags (i.e. many-to-many relationships).

My problem is that each datasource has a different way of issuing unique IDs through its REST API. I need suggestions on the best pattern to use for my MySQL data model.

My current solution is to use one table for all items with a composite key, but the joins are verbose and CakePHP doesn't handle composite keys natively:

datasource_id SMALLINT,
datasource_item_id VARCHAR(36), -- some datasources issue alpha keys

Q: Is it OK (or better) to add an auto-increment primary key to my table and translate all my internal joins/indexes from the external UIDs to my internal UIDs?

id int(10) unsigned NOT NULL auto_increment,

Q: Are enums an efficient datatype for storing datasource_id (should have maybe 10 different datasources)?

Q: Are there other approaches that yield better, more scalable results in the long run?

+1  A: 

Mostly I can only confirm the solutions you've already considered.

Since the storage type used in a table schema doesn't have to match the type of the data (which is why SQLite 2 was untyped and SQLite 3 has so few storage classes), my first impulse is the same as your current solution.

Following another school of thought, namely that arbitrary IDs (those not based on attributes of whatever you're modeling) should be kept internal to your own database, suggests the second solution you mention: add an id column. One reason for this school is that you don't want your tables to be dependent on someone else's internals, though that's less of a concern here. Since CakePHP doesn't support composite keys, this seems the most viable option.
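A sketch of that schema, with a hypothetical table name `items`: the internal surrogate key becomes the primary key, while the external composite identifier is kept under a unique constraint so no source item can be stored twice.

```sql
CREATE TABLE items (
    id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    datasource_id SMALLINT NOT NULL,
    datasource_item_id VARCHAR(36) NOT NULL,  -- some datasources issue alpha keys
    PRIMARY KEY (id),
    -- still enforce uniqueness of the external composite identifier
    UNIQUE KEY uq_source_item (datasource_id, datasource_item_id)
) ENGINE=InnoDB;
```

Internal tables (tags, ratings, view counts) then join on the short integer `id`; the external composite is only consulted when syncing with a datasource.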

Another solution would be to have the primary key column be a concatenation of the data from the other composite key columns. That is, add an additional column, as with the auto-incrementing ID, but one that stores a non-arbitrary value. This falls under the category of denormalizing and has all the caveats and warnings that implies.
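A sketch of the concatenated-key variant; the `uid` column and the `:` separator are illustrative choices, and because the value duplicates the two source columns it must be kept in sync with them (the denormalization caveat).

```sql
CREATE TABLE items (
    uid VARCHAR(42) NOT NULL,  -- holds CONCAT(datasource_id, ':', datasource_item_id)
    datasource_id SMALLINT NOT NULL,
    datasource_item_id VARCHAR(36) NOT NULL,
    PRIMARY KEY (uid)
);

INSERT INTO items (uid, datasource_id, datasource_item_id)
    VALUES (CONCAT(3, ':', 'a1b2c3'), 3, 'a1b2c3');
```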

If SQL were a second order logic, you could easily give each datasource its own table. Since SQL is first order, this isn't a very scalable solution.

The first three all share a downside. Each datasource has its own ID type; when storing IDs from different sources in the same column, you need to define additional constraints to enforce type integrity at the database level, probably in the form of triggers (since MySQL parses but doesn't enforce the CHECK clause).
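A sketch of such a trigger on a hypothetical `items` table, assuming (for illustration) that datasource 1 issues purely numeric IDs and that you're on MySQL 5.5 or later, where SIGNAL is available:

```sql
DELIMITER //
CREATE TRIGGER items_id_check BEFORE INSERT ON items
FOR EACH ROW
BEGIN
    -- Per-source format check; CHECK would express this, but MySQL ignores it,
    -- so the trigger raises an error to abort the insert instead.
    IF NEW.datasource_id = 1 AND NEW.datasource_item_id NOT REGEXP '^[0-9]+$' THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'invalid item id for datasource 1';
    END IF;
END//
DELIMITER ;
```

A matching BEFORE UPDATE trigger would be needed to cover updates as well.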

Q: Are enums an efficient datatype for storing datasource_id (should have maybe 10 different datasources)?

The storage requirement for an ENUM is 1 or 2 bytes, depending on how many distinct values there are. With ten datasources, only a single byte is used per row. That still wastes a little over 4 bits per row. Whether that counts as efficient I'll leave up to you.
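For comparison, here is what the ENUM column could look like (the source names are hypothetical); a TINYINT foreign key into a small lookup table of datasources occupies the same single byte and is easier to extend:

```sql
-- ENUM variant: 1 byte per row for up to 255 distinct values
ALTER TABLE items
    MODIFY datasource_id ENUM('flickr', 'youtube', 'vimeo') NOT NULL;
```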

outis