views:

78

answers:

10

I've been given a stack of data where a particular value has been collected sometimes as a date (YYYY-MM-DD) and sometimes as just a year.

Depending on how you look at it, this is either a variance in type or margin of error.

This is a subprime situation, but I can't afford to recover or discard any data.

What's the optimal (eg. least worst :) ) SQL table design that will accept either form while avoiding monstrous queries and allowing maximum use of database features like constraints and keys*?

*i.e. Entity-Attribute-Value is out.

+2  A: 

if it comes in as just a year make it default to 01 for month and date, YYYY-01-01

This way you can still use a date/datetime datatype and don't have to worry about invalid dates

SQLMenace
if it is important to know an actual 01/01 from a defaulted 01/01 then consider adding another column to indicate complete_date_received or similar...
Randy
+1 my thoughts exactly...
marc_s
I really don't like the idea of putting false data in a column. It's unintuitive in a way that could lead to serious mistakes from newcomers if they lack or don't read documentation.
Ian Mackinnon
In that case add another column isincomplete where you would have 0 if complete ,1 if year only 2 if year and month only etc etc..otherwise you can't use constraints
SQLMenace
@Ian Mackinnon - don't think of it as false data, think of it like this: the particular piece of data you are dealing with is somewhat complex and requires two columns to describe completely. This is no worse than having separate fields for address line one and address line two, or first name and last name, or storing dates in a DateTime field and having the time default to midnight or whatever.
no
@no - I disagree. I'd say it's more like occasionally storing a middle name as well in the first name column and having another column includes_middle_name. Swathes of queries for first or middle names now require referring to two columns. SQL includes NULL for exactly the purpose of indicating missing data and this method ignores it along with all the benefits that it provides.
Ian Mackinnon
+1  A: 

Either bring it in as a string unmolested, and modify it so it's consistent in another step, or modify the year-only values during the import like SQLMenace recommends.

Beth
+1  A: 

I'd store the value in a DATETIME type and another value (just an integer will do, or some kind of enumerated type) that signifies its precision.

It would be easier to give more information if you mentioned what kind of queries you will be doing on the data.

Matti Virkkunen
+1  A: 
  • Either fix it, then store it (OK, not an option)
  • Or store it broken with a fixed computed columns

Something like this

CREATE TABLE ...
   ...
   Broken varchar(20),
   Fixed AS CAST(CASE WHEN Broken LIKE '[12][0-9][0-9][0-9]' THEN Broken + '0101' ELSE Broken END AS datetime)

This also allows you to detect good from bad source data

gbn
+4  A: 

You could store the year, month and day components in separate columns. That way, you only need to populate the columns for which you have data.

ninesided
I like that one can semantically correctly specify the unknown columns as NULL this way.
Ian Mackinnon
+1  A: 

If you don't always have a full date, what sort of keys and constraints would you need? Perhaps store two columns of data; a full date, and a year. For data that has only year, the year is stored and date is null. For items with full info, both are populated.

Nikki9696
+1  A: 

I'd put three columns in the table:

  1. The provided value (YYYY-MM-DD or YYYY)
  2. A date column, Date or DateTime data type, which is nullable
  3. A year column, as an integer or char(4) depending upon your needs.

I'd always populate the year column, populate the date column only when the provided value is a date.

And, because you've kept the provided value, you can always re-process down the road if needs change.

Jeff Siver
A: 

I was going to suggest the same solution as @ninesided did above. Additionally, you could have a date field and a field that quantitatively represents your uncertainty. This offers the advantage of being able to represent things like "on or about Sept 23, 2010". The problem is that to represent the case where you only know the year, you'd have to set your date to be the middle of the year, with 182.5 days' uncertainty (assuming non-leap year), which seems ugly.

You could use a similar but distinct approach with a mask that represents what date parts you're confident about - that's what SQLMenace offered in his answer above.

Dathan
+1  A: 

An alternative solution would be to that of a date mask (like in IP). Store the date in a regular datetime field, and insert an additional field of type smallint or something, where you could indicate which is present (could go even binary here):

If you have YYYY-MM-DD, you would have 3 bits of data, which will have the values 1 if data is present and 0 if not.

Example:

Date         Mask
2009-12-05   7 (111)
2009-12-01   6 (110, only year and month are know, and day is set to default 1)
2009-01-20   5 (101, for some strange reason, only the year and the date is known. January has 31 days, so it will never generate an error)

Which solution is better depends on what you will do with it.

This is better when you want to select those with full dates, which are between a certain period (less to write). Also this way it's easier to compare any dates which have masks like 7,6,4. It may also take up less memory (date + smallint may be smaller than int+int+int, and only if datetime uses 64 bit, and smallint uses up as much as int, it will be the same).

Alexander
A: 

+1 each to recommendations from ninesided, Nikki9696 and Jeff Siver - I support all those answers though none was exactly what I decided upon.

My solution:

  • a date column used only for complete dates
  • an int column used for years
  • a constraint to ensure integrity between the two
  • a trigger to populate the year if only date is supplied

Advantages:

  1. can run simple (one-column) queries on the date column with missing data ignored (by using NULL for what it was designed for)
  2. can run simple (one-column) queries on the year column for any row with a date (because year is automatically populated)
  3. insert either year or date or both (provided they agree)
  4. no fear of disagreement between columns
  5. self explanatory, intuitive

I would argue that methods using YYYY-01-01 to signify missing data (when flagged as such with a second explanatory column) fail seriously on points 1 and 5.

Example code for Sqlite 3:

create table events
(
rowid integer primary key,
event_year integer,
event_date date,
check (event_year = cast(strftime("%Y", event_date) as integer))
);

create trigger year_trigger after insert on events
begin
update events set event_year = cast(strftime("%Y", event_date) as integer)
    where rowid = new.rowid and event_date is not null;
end;

-- various methods to insert
insert into events (event_year, event_date) values (2008, "2008-02-23");
insert into events (event_year) values (2009);
insert into events (event_date) values ("2010-01-19");

-- select events in January without expressions on supplementary columns
select rowid, event_date from events where strftime("%m", event_date) = "01";
Ian Mackinnon