views:

861

answers:

2

Our MySQL web analytics database contains a summary table which is updated throughout the day as new activity is imported. We use ON DUPLICATE KEY UPDATE in order that the summarization overwrites earlier calculations, but are having difficulty because one of the columns in the summary table's UNIQUE KEY is an optional FK, and contains NULL values.

These NULLs are intended to mean "not present, and all such cases are equivalent". Of course, MySQL usually treats NULLs as meaning "unknown, and all such cases are not equivalent".

Basic structure is as follows:

An "Activity" table containing an entry for each session, each belonging to a campaign, with optional filter and transaction IDs for some entries.

CREATE TABLE `Activity` (
    `session_id` INTEGER AUTO_INCREMENT
    , `campaign_id` INTEGER NOT NULL
    , `filter_id` INTEGER DEFAULT NULL
    , `transaction_id` INTEGER DEFAULT NULL
    , PRIMARY KEY (`session_id`)
);

A "Summary" table containing daily rollups of total number of sessions in activity table, an d the total number of those sessions which contain a transaction ID. These summaries are split up, with one for every combination of campaign and (optional) filter. This is a non-transactional table using MyISAM.

CREATE TABLE `Summary` (
    `day` DATE NOT NULL
    , `campaign_id` INTEGER NOT NULL
    , `filter_id` INTEGER DEFAULT NULL
    , `sessions` INTEGER UNSIGNED DEFAULT NULL
    , `transactions` INTEGER UNSIGNED DEFAULT NULL
    , UNIQUE KEY (`day`, `campaign_id`, `filter_id`)
) ENGINE=MyISAM;

The actual summarization query is something like the following, counting up the number of sessions and transactions, then grouping by campaign and (optional) filter.

INSERT INTO `Summary` 
    (`day`, `campaign_id`, `filter_id`, `sessions`, `transactions`)
    SELECT `day`, `campaign_id`, `filter_id
        , COUNT(`session_id`) AS `sessions`
        , COUNT(`transaction_id` IS NOT NULL) AS `transactions`
    FROM Activity
    GROUP BY `day`, `campaign_id`, `filter_id`
ON DUPLICATE KEY UPDATE
    `sessions` = VALUES(`sessions`)
    , `transactions` = VALUES(`transactions`)
;

Everything works great, except for the summary of cases where the filter_id is NULL. In these cases, the ON DUPLICATE KEY UPDATE clause does not match the existing row, and a new row is written every time. This is due to the fact that "NULL != NULL". What we need, however, is "NULL = NULL" when comparing the unique keys.

I am looking for ideas for workarounds or feedback on those we have come up with so far. Workarounds we have thought of so far follow.

  1. Delete all summary entries containing a NULL key value prior to running the summarization. (This is what we are doing now) This has the negative side effect of returning results with missing data if a query is executed during the summarization process.

  2. Change the DEFAULT NULL column to DEFAULT 0, which allows the UNIQUE KEY to be matched consistently. This has the negative side effect of overly complicating the development of queries against the summary table. It forces us to use a lot of "CASE filter_id = 0 THEN NULL ELSE filter_id END", and makes for awkward joining since all of the other tables have actual NULLs for the filter_id.

  3. Create a view which returns "CASE filter_id = 0 THEN NULL ELSE filter_id END", and using this view instead of the table directly. The summary table contains a few hundred thousand rows, and I've been told view performance is quite poor.

  4. Allow the duplicate entries to be created, and delete the old entries after summarization completes. Has similar problems to deleting them ahead of time.

  5. Add a surrogate column which contains 0 for NULL, and use that surrogate in the UNIQUE KEY (actually we could use PRIMARY KEY if all columns are NOT NULL).
    This solution seems reasonable, except that the example above is only an example; the actual database contains half a dozen summary tables, one of which contains four nullable columns in the UNIQUE KEY. There is concern by some that the overhead is too much.

Do you have a better workaround, table structure, update process or MySQL best practice which can help?

EDIT: To clarify the "meaning of null"

The data in the summary rows containing NULL columns are considered to belong together only in the sense that of being a single "catch-all" row in summary reports, summarizing those items for which that data point does not exist or is unknown. So within the context of the summary table itself, the meaning is "the sum of those entries for which no value is known". Within the relational tables, on the other hand, these truly are NULL results.

The only reason for putting them into a unique key on the summary table is to allow for automatic update (by ON DUPLICATE KEY UPDATE) when re-calculating the summary reports.

Maybe a better way to describe it is by the specific example that one of the summary tables groups results geographically by the zip code prefix of the business address given by the respondent. Not all respondents provide a business address, so the relationship between the transaction and addresses table is quite correctly NULL. In the summary table for this data, a row is generated for each zip code prefix, containing the summary of data within that area. An additional row is generated to show the summary of data for which no zip code prefix is known.

Altering the rest of the data tables to have an explicit "THERE_IS_NO_ZIP_CODE" 0-value, and placing a special record in the ZipCodePrefix table representing this value, is improper--that relationship truly is NULL.

+4  A: 

I think something along the lines of (2) is really the best bet — or, at least, it would be if you were starting from scratch. In SQL, NULL means unknown. If you want some other meaning, you really ought to use a special value for that, and 0 is certainly an OK choice.

You should do this across the entire database, not just this one table. Then you shouldn't wind up with weird special cases. In fact, you should be able to get rid of a lot of your current ones (example: currently, if you want the summary row where there is no filter, you have the special case "filter is null" as opposed to the normal case "filter = ?".)

You should also go ahead and create a "not present" entry in the referred-to table as well, to keep the FK constraint valid (and avoid special cases).

PS: Tables w/o a primary key are not relational tables and should really be avoided.

edit 1

Hmmm, in that case, do you actually need the on duplicate key update? If you're doing a INSERT ... SELECT, then you probably do. But if your app is supplying the data, just do it by hand — do the update (mapping zip = null to zip is null), check how many rows were changed (MySQL returns this), if 0 do an insert.

derobert
Yes, the summary table is quite explicitly not a relational table. It is simply a convenient container for holding reporting results.My statement that "These NULLs are intended to mean 'not present, and all such cases are equivalent'", is perhaps misleading. In the relational tables containing the normalized data, the filter_id and other nullable relationships I mention as being part of the unique key in the summary table truly have the meaning of "unknown", and are not part of any primary or unique keys. See edit, above.
ryandenki
Exactly right. We use INSERT...SELECT, using the ON DUPLICATE KEY clause there to update entries throughout the day.Actually, the first implementation two years ago was as you suggest--first selecting the data, performing some extra manipulation, then issuing individual INSERTS, with WHERE clauses taking into account the IS NULL case.That approach has the advantage that the locks inserting individual rows are shorter than for the INSERT...SELECT method. But these locks are only on the master using row replication, and we could replace all the app-side code with a single SQL statement.
ryandenki
A: 

Change the DEFAULT NULL column to DEFAULT 0, which allows the UNIQUE KEY to be matched consistently. This has the negative side effect of overly complicating the development of queries against the summary table. It forces us to use a lot of "CASE filter_id = 0 THEN NULL ELSE filter_id END", and makes for awkward joining since all of the other tables have actual NULLs for the filter_id.

Create a view which returns "CASE filter_id = 0 THEN NULL ELSE filter_id END", and using this view instead of the table directly. The summary table contains a few hundred thousand rows, and I've been told view performance is quite poor.

View performance in MySQL 5.x will be fine, as the view does nothing but replace a zero with a null. Unless you use aggregates/sorts in a view, most any query against the view will be re-written by the query optimizer to just hit the underlying table.

And of course, since it's an FK, you'll have to create an entry in the referred-to table with an id of zero.

tpdi