views:

79

answers:

2

I'm struggling to find an efficient and flexible representation for my data. We have a many-to-many relationship between two entities which have arbitrary lifetimes. Let's call these Voter and Candidate. Each relationship has a measurement which we'd like to summarize in various ways. These are timestamped and are guaranteed to be within the lifetime of the two related entities. Let's say the measure is approval rating, or just Rating.

One unusual requirement is that if I'm summarizing a period which has no measurement, I should substitute the latest valid measurement, rather than giving NULL.

Our current solution is to compile a list of valid voters and candidates for each day, then formulate a many-to-many table which records the latest valid measure.

What would your solution be?

Example diagram

This allows me to do a single query to get a daily summary:

   select 
       avg(rating), valid_date, candidate_SSN, candidate_DOB
   from 
       daily_rating natural join rating
   group by
       valid_date, candidate_SSN, candidate_DOB

This might work ok, but It seems inefficient to me. We're repeating a lot of data, especially if nothing happens for a given day. It also is unclear how to do weekly/monthly summaries without compiling even more tables. Since we're dealing with millions of rows (we're not really talking about voter polls...) I'm looking for a more efficient solution.

+1  A: 

I have used data-warehousing technique here, hence the dim and fact table names.

alt text

dimDate is so-called date dimension, one row per a date.

dimCandidate has all candidate data, new and old records. In data-warehousing terms this is called type 2 dimension. One candidate can have several rows in this table, only one of them having r_status = 'current'.

Fields

, r_valid_from date
, r_valid_to   date
, r_version    integer -- (1, 2, 3,..)
, r_status     varchar(10) -- (expired, current)

describe a record (row) status. Each time a candidate status changes, a new row is inserted and the pervious row's r_valid_to and r_status are modified.

CandidateFullName is a business (natural) key and has to uniquely identify a candidate. No two candidates can have the same CandidateFullName. Note that the CandidateKey uniquely identifies a row in the table, while CandidateFullName uniquely identifies a candidate.

dimVoter has voter data, new and old records -- just like the dimCandidate.

dimCampaign describes campaign details, this is so-called type one dimension, does not hold historical data.

factRating has the Rating measure.

Normaly this would be enough, but there is the reqirement to interpolate the missing data for a day; for that, an aggregate table aggDailyRating is introduced. At the end of a day, a scheduled job aggregates ratings for the day. This job takes care of the data-interpolation requirement. This way the aggregate table has one row for each date-(valid) candidate-campaign combination. Note that voter is not included in the combination, data is aggregated over all voters.

alt text

Any reporting is done on the aggregate table, for example

--
-- monthy rating for years 2009-2010
-- for candidate john_smith_256
--
select
     CalendarYear
   , MonthNumber  
   , avg(DailyRating) as AverageRating
from aggDailyRating as f
join dimDate        as d on d.DateKey      = f.DateKey
join dimCandidate   as c on c.CandidateKey = f.CandidateKey
where CandidateFullName = 'john_smith_256'
  and CalendarYear between 2009 and 2010
group by CalendarYear, MonthNumber
order by CalendarYear desc, MonthNumber desc ;
Damir Sudarevic
@PerformanceDBA -- Yes they can be all derived from `fullDate`, but then we end up with functions in the WHERE clause -- the point of this technique is to avoid this. The versioning works just fine, try it. The latest record is always retrieved by using the business key an the `r_status`. This a standard, plain-vanilla warehousing approach -- Kimball star.
Damir Sudarevic
@PerformanceDBA -- see (google) Kimball type 2 dimension.
Damir Sudarevic
@PerformanceDBA -- business key uniquely identifies a `Candidate` (person) -- like in `john_smith_256`. Business key IS NOT unique key of the table, there can be many rows for `john_smith_256`.
Damir Sudarevic
@PerformanceDBA -- do note that the question is flagged as `data-warehouse` too.
Damir Sudarevic
@Damir: The title is "Database", but if the tags include DataWarehouse, I must apologise. I will remove my comments and replace them with something more appropriate, re the distinction between DB and DW.
PerformanceDBA
@Damir: beautiful, and classic DW (but not DB) solution. @Bukzor: you need to decide on whether you want a DB or a DW solution; each has dis/advantages and inherent limitations.
PerformanceDBA
@Bukzor: If you require a DB, this solution will not work for two classic [well-known DB vs DW issue] reasons (1) the distinction re CandidateKey vs CandidateFullName cannot be implemented (unless and additional table is impemented to identify that), beacuse both CandidateKey and CandidateFullName cannot both be unique at the same time). The FKs will fail, because the referenced PF is required to be Unique. Eg. Fk in factRating to the CandidateKey; latter is not Unique.
PerformanceDBA
(2) None of the commercial DBs have problems with functions in WHERE clauses (DW products do). In a DB, all columns in dimDate other than fullDate are redundant, duplicates; they can be easily and readily derived from fullDate. Those columns are required only for DW a solution.
PerformanceDBA
@Damir: Ok, done. If my modified comments are acceptable, you may wish to delete your comments. Also, I need you to edit your post, so that I can vote it up.
PerformanceDBA
+1  A: 

Yes, that is very inefficient and wasteful. It is merely a set of files, not reasonably comparable to a set of "tables" or a "database"; extensions and enhancements to it will compound the duplication and inefficiency. Duplication is the antithesis of a database. In database terms, there are far more efficient and easier ways to implement that.

Assumption

Your post does not provide much info, so I have had to make some assumptions, but I think you can correct my submission quite easily if any of them are incorrect. Otherwise comment, and I will correct my submission.

  1. A Voter is a Person; a Candidate is a Voter; (Candidate = subset of Voter)

  2. A Campaign is related to Candidate (not to a Polling Campaign).

  3. A Poll is a survey of the Voters response to a Candidate's performance, staring on a set date, running over a few days, and completing on an set date.

  4. There are many Measures, such as ApprovalRating, that are surveyed in each Poll.

  5. The Measures of such surveys across all Voters are aggregated at the Poll level.

Limitation

  1. The expiry requirement is unclear, so I am not suggesting I have implemented that. If the model does not provide that for you (if it is not immediately obvious), supply details and I will add to the model. The current model provides exclusion/inclusion capability for what I understand the expiry requirement to be.

  2. The Poll::Measure does not have enough info to be implemented fully; I need further details. The submission is primitive and unconstrained in that area.

  3. Likewise, any Poll::Campaign relation or constraint ("there are many Polls per Campaign, and they are always related to Campaign") has not been implemented.

  4. The arrangement of the key in the child tables is arbitrary for now: if you identify the most common queries, it can be re-arranged, so that the most those obtain the best speed.

Submission

Campaign Poll Data Model

  1. This is just a Relational (Normalised; zero duplication) Database, pure IDEF1X, including provision for the consideration that the child tables will be huge: migration of narrow surrogate keys into the child tables, avoiding migration of wide keys.

  2. It provides "data warehouse" capability as is. In fact, if it does not provide any BI or DSS requirement in a single query, that is only due to lack of detail from you; please provide, and I will happily change it. (Note, your item re "single query" is actually "single file"; joins are pedestrian in a Relational database.)

  3. Keys such as %Code are 2-, 3-, and at most 4-characters. Such keys are just as fast as Integer keys, and very helpful (makes sense) when perusing the tables (without having to join the parent).

  4. Any and all aggregation, either to load the historic rows, or to produce aggregates for the current values, should be possible in a single Relational (set-oriented) command; you should not need to resort to serial (cursor) processing. Again, if you think you need to, please comment and I will provide the set-oriented method.

We implement Versioning in DBs quite differently to the way it is done in DWs, and without limitations. Please identify if you require versioning of (eg) Candidate, and I will provide.

Last, the Null requirement is not unusual. It is catered for here. Again, if you think it isn't ...

PerformanceDBA