views:

175

answers:

1

This is a simplified version of the problem.

We have customers who send us lots of data and then query it. We are required by them to have several "public" ids they can query our data by. (Most want to query our system via the id they send along with the data, but not always). For simplicity, we'll call them "pid", "crid" and "musicbrainzid". We have an "entity" table which stores this information. It looks something like this (the "authority" is who sent the data):

entity 
-- 
entity_id   
authority  // who sent the data
type       // 'pid', 'crid', 'musicbrainz', etc.
value      // the actual id value

Then we have separate entities such as "episode", "series" and "broadcast" (actually, there's a lot more, but I'm keeping it simple here). Each of these has an entity_id pointing to the entity table.

How can external customers search, via pid or crid and get the appropriate episode or series, along with proper identification of what it is? Given a pid, we can fetch the entity id, but then we need to search the episode, series and broadcast tables for this value. Further, not all ids will necessarily be related to all of the other tables, but any entity (e.g., an "episode") might have several ids (pid, crid, etc.)

Strategies:

  1. Find the entity id for a pid and search every other table for the pid.
  2. Put an "entity_type" column on entity, but what if it's a pid in the episode table but we accidentally set episode.type as series? We don't want to duplicate data and I don't want to put database metadata into column values.

Option number 1 is slow and seems wrong (further, the various tables have different structures making problematic).

Option 2 means duplicate data and this data can get out of synch. We can use triggers for force this, but this seems really nasty and, in any event, bugs in the implementation of mysql triggers have hit us several times. We're using this strategy right now, but without triggers.

What's option 3?

Side note: we know we need to break "authority" out into a separate table because not all authority/type combinations are valid.

+3  A: 

If i've understood your question correctly, I'd go with Option 1.

The query to identity the row based on the entity_id shouldn't be that slow as all that data should be in an index.
If your indexes are configured correctly this shouldn't even access the actual data. (At least in SQL Server it wouldn't.)

One small change I'd make would be to create a small set of tables to identity which id's are valid for which tables.
You would then use this to narrow down which tables you need to search through.

An alternative to Option 1 or 2, might be to change your database structure completely, to store different data on the same table, using entity_id as the primary key, with generic columns containing the data.
This would certainly be more radical, but i've seen it work well for a system like yours where the data and it's structure is quite dynamic.

Bravax