I'm playing around with some historical data in which some dates are known exactly (i.e. dd/mm/yyyy), whilst others are just yyyy and others are yyyy? (i.e. the year is uncertain). I've even come across "fl.", which apparently means "flourished".

At the moment I'm using the DateTime type, which doesn't seem to support flagging or representing this kind of uncertainty. Is there a standard way of addressing this problem?

+1  A: 

DateTime? is nullable. That might be your best bet. The other alternative is DateTime.MinValue (or MaxValue).

[Edit] Actually, rereading your question, I think your best bet is to devise a custom class that serves your exact purpose.

pdr
+9  A: 

I would consider creating a class that wraps a DateTime (or DateTimeOffset) and has additional fields to represent which portions of the date are certain and which are not.

You could then expose month, day, and year fields as nullable values to reflect which portions of the date are known.
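
A minimal sketch of that idea, assuming a hypothetical PartialDate type (the name and its members are made up for illustration and are not part of .NET). It stores the components as nullable ints and only produces a DateTime when all three are known:

```csharp
using System;

// Hypothetical type: a date whose year, month and day may each be unknown (null).
public struct PartialDate
{
    public int? Year { get; }
    public int? Month { get; }
    public int? Day { get; }

    public PartialDate(int? year, int? month = null, int? day = null)
    {
        Year = year;
        Month = month;
        Day = day;
    }

    // True only when every component is known.
    public bool IsComplete => Year.HasValue && Month.HasValue && Day.HasValue;

    // Converts to a real DateTime when possible, otherwise returns null.
    // The DateTime constructor throws if the components do not form a valid date.
    public DateTime? ToDateTime() =>
        IsComplete ? new DateTime(Year.Value, Month.Value, Day.Value) : (DateTime?)null;

    // Renders unknown components as question marks, e.g. "??/??/1950".
    public override string ToString()
    {
        string d = Day.HasValue ? Day.Value.ToString("00") : "??";
        string m = Month.HasValue ? Month.Value.ToString("00") : "??";
        string y = Year.HasValue ? Year.Value.ToString("0000") : "????";
        return d + "/" + m + "/" + y;
    }
}
```

With this, new PartialDate(1950) has only its year set, while new PartialDate(1945, 6, 1).ToDateTime() yields an ordinary DateTime.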

LBushkin
I wouldn't just consider this. I would do this. The uncertainty is additional information that needs to be modelled or represented in some way, and DateTime doesn't do that.
Cheeso
This approach isn't so good for many of the common historical dates that you get on old documents or photos like "circa 1950", or "after June 1945". Circa 1950 would map to ?/?/? if all you model is a DateTime with uncertainty on the portions of the date.
Hightechrider
+2  A: 

If the uncertainty is binary (i.e., the date is either known or unknown), then I'd go with a nullable DateTime type. Otherwise, I'd consider creating a wrapper struct with an additional enum property:

public enum DateConfidence
{
    Certain,
    Unknown,
    YearOnly,
    ApproximateYearOnly
}
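
One way the wrapper struct might look, as a rough sketch. FlaggedDate and its factory methods are hypothetical names chosen for illustration; the enum is repeated only so the sketch compiles on its own:

```csharp
using System;
using System.Globalization;

// Repeated from the answer so that this sketch is self-contained.
public enum DateConfidence { Certain, Unknown, YearOnly, ApproximateYearOnly }

// Hypothetical wrapper: Value is only meaningful as far as Confidence says it is.
public struct FlaggedDate
{
    public DateTime Value { get; }
    public DateConfidence Confidence { get; }

    private FlaggedDate(DateTime value, DateConfidence confidence)
    {
        Value = value;
        Confidence = confidence;
    }

    public static FlaggedDate Exact(DateTime value) =>
        new FlaggedDate(value, DateConfidence.Certain);

    // Only the year is known; 1 January is an arbitrary placeholder.
    public static FlaggedDate InYear(int year) =>
        new FlaggedDate(new DateTime(year, 1, 1), DateConfidence.YearOnly);

    // Even the year is a guess ("yyyy?").
    public static FlaggedDate AroundYear(int year) =>
        new FlaggedDate(new DateTime(year, 1, 1), DateConfidence.ApproximateYearOnly);

    public static readonly FlaggedDate Unknown =
        new FlaggedDate(DateTime.MinValue, DateConfidence.Unknown);

    public override string ToString()
    {
        switch (Confidence)
        {
            case DateConfidence.Certain:
                return Value.ToString("dd/MM/yyyy", CultureInfo.InvariantCulture);
            case DateConfidence.YearOnly:
                return Value.Year.ToString();
            case DateConfidence.ApproximateYearOnly:
                return Value.Year + "?";
            default:
                return "unknown";
        }
    }
}
```

One caveat with this layout: default(FlaggedDate) reports Certain, because Certain is the enum's zero value, so it may be worth putting Unknown first in a real implementation.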
Christopher
+4  A: 

There are various academic papers on ways to represent approximate time, e.g. http://www.musiccog.ohio-state.edu/Humdrum/representations/date.rep.html

If you want to handle the full scope of historical documents and the approximate knowledge you'll have for any of them, it's not a simple bool / nullable operation with DateTime values.

I haven't seen a C# library to handle this yet. My own Natural Language Engine for C# can understand all kinds of date time phrases but was designed for a different problem - it can accept an imprecise question and query a database of exact values.

It has classes for a specific date, a range of dates, a known year (but no month/day), a known year+month (but no date), a half-infinite range (e.g. before or after a given date), ... and using these it can construct queries against databases or can enumerate all the possible ranges of dates that could be meant. e.g. you can ask it "who called last year on friday after 4pm" and it can generate the appropriate SQL query.

If you want to do this right it's not easy! If I were you, I would capture a string value with the original text in it alongside whatever representation you choose to use for the DateTime values. That way you can make the representation smarter over time to cover more cases, ultimately being able to handle something like "sometime between 1940 and September 16th 1945".

Initially you might want to store just the string representation and two DateTime values: the earliest possible and the latest possible date. That covers the majority of the cases you will see and it's really easy to query against. You can leave either DateTime value null, or perhaps set it to DateTime.MaxValue or DateTime.MinValue, to represent half-infinite ranges like "after 1900".
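
As a rough illustration of that starting point, here is a sketch with a hypothetical HistoricalDate class (the class and the CouldFallWithin method are invented names). A null bound stands for an open end of the range:

```csharp
using System;

// Hypothetical type: the original text plus the earliest and latest dates it could mean.
public sealed class HistoricalDate
{
    public string OriginalText { get; }
    public DateTime? Earliest { get; }   // null = no lower bound ("before ...")
    public DateTime? Latest { get; }     // null = no upper bound ("after ...")

    public HistoricalDate(string originalText, DateTime? earliest, DateTime? latest)
    {
        if (earliest.HasValue && latest.HasValue && earliest.Value > latest.Value)
            throw new ArgumentException("earliest must not be later than latest");

        OriginalText = originalText;
        Earliest = earliest;
        Latest = latest;
    }

    // True when the date could lie somewhere in the inclusive window [from, to].
    public bool CouldFallWithin(DateTime from, DateTime to)
    {
        DateTime lo = Earliest ?? DateTime.MinValue;
        DateTime hi = Latest ?? DateTime.MaxValue;
        return lo <= to && hi >= from;
    }
}
```

For example, "after 1900" could be stored as new HistoricalDate("after 1900", new DateTime(1901, 1, 1), null); whether "after 1900" starts on 1 January 1901 is itself an interpretation, which is one more reason to keep the original text.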

Hightechrider
+1 Agree that it involves natural language parsing.
Fadrian Sudaman
Like the idea of capturing the original string representation, and thanks for the ref.
Ian Hopkinson
A: 

There's no such class in .NET, so your best option is to create your own class with nullable properties representing all the necessary date fields.

This gives you the most flexibility in future and lets you handle any scenario you may have (and if it doesn't, you just refactor your class and the compiler will help you find the places that need fixing).

Konstantin Spirin
+1  A: 

Radiocarbon dating would be a typical example of this. You need a class with two members: the estimated date and the error estimate. The latter is usually expressed in years, but you're free to pick any unit. Beware that DateTime cannot express a date earlier than 1 January of year 1 CE (DateTime.MinValue), so make it a simple int for the year. Avoid making it any fancier than that; guessing the right month is meaningless for any date before the year 1000.
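
A small sketch of such a type, assuming a hypothetical EstimatedYear struct. The use of astronomical year numbering for BCE years (0 = 1 BCE, -1 = 2 BCE) is my own assumption, not something the answer prescribes:

```csharp
using System;

// Hypothetical type: a year estimate with a symmetric error margin, both in whole years.
public struct EstimatedYear
{
    // Assumption: astronomical numbering, so 0 means 1 BCE and -1 means 2 BCE.
    public int Year { get; }
    public int ErrorYears { get; }

    public EstimatedYear(int year, int errorYears)
    {
        if (errorYears < 0)
            throw new ArgumentOutOfRangeException(nameof(errorYears));

        Year = year;
        ErrorYears = errorYears;
    }

    public int Earliest => Year - ErrorYears;
    public int Latest => Year + ErrorYears;

    // Two estimates are consistent when their error intervals overlap.
    public bool IsConsistentWith(EstimatedYear other) =>
        Earliest <= other.Latest && other.Earliest <= Latest;

    public override string ToString() => Year + " +/- " + ErrorYears;
}
```

For instance, new EstimatedYear(-3000, 150) covers the years -3150 to -2850, which is well outside anything DateTime can hold.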

Hans Passant
Thanks for the tip on the 0 BCE limit, I got caught out by the 1900 limit in Excel...
Ian Hopkinson
A: 

Nope, but that would be useful.

Jonathan Allen
A: 

My preference for such a situation would be to create a date range object with a degree of certainty property.

Something such as:

public struct HistoricalDateRange
{
    public DateTime StartDate { get; }
    public DateTime EndDate { get; }
    public double Confidence { get; } /* range [0.0, 1.0] */
}

I would then have a series of constructors that let me set a year, month range, or a single date, each with a confidence value. The confidence gives me a "rubbery" number for fuzzy comparisons.

If I set a single day, then the StartDate & EndDate should encompass that date.

It's then up to your needs how to determine comparisons between HistoricalDateRange objects. I would expect methods that let me ask if they are distinct, overlapping, etc.
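
To make that a little more concrete, here is a sketch under a separate hypothetical name, FuzzyDateRange, so it does not clash with the struct above. It uses static factory methods rather than overloaded constructors purely as a design choice, and treats both bounds as inclusive, which is an assumption:

```csharp
using System;

// Hypothetical type modelled on the range-plus-confidence idea described above.
public struct FuzzyDateRange
{
    public DateTime StartDate { get; }   // inclusive
    public DateTime EndDate { get; }     // inclusive
    public double Confidence { get; }    // range [0.0, 1.0]

    public FuzzyDateRange(DateTime start, DateTime end, double confidence)
    {
        if (end < start)
            throw new ArgumentException("end must not precede start");
        if (confidence < 0.0 || confidence > 1.0)
            throw new ArgumentOutOfRangeException(nameof(confidence));

        StartDate = start;
        EndDate = end;
        Confidence = confidence;
    }

    // A whole calendar year, 1 January to 31 December.
    public static FuzzyDateRange ForYear(int year, double confidence) =>
        new FuzzyDateRange(new DateTime(year, 1, 1), new DateTime(year, 12, 31), confidence);

    // A single day: start and end both encompass that date.
    public static FuzzyDateRange ForDay(DateTime day, double confidence) =>
        new FuzzyDateRange(day.Date, day.Date, confidence);

    public bool Overlaps(FuzzyDateRange other) =>
        StartDate <= other.EndDate && other.StartDate <= EndDate;

    public bool IsDistinctFrom(FuzzyDateRange other) => !Overlaps(other);
}
```

How Confidence should feed into comparisons is left open here, since that depends entirely on what "fuzzy" needs to mean for your data.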

Hope that helps.

Enigmativity
A: 

A slightly outside-the-box answer to your problem.

If you are dealing with unstructured historical data like you describe, I would actually capture it as a string, exactly as it is. The actual meaning of the data comes from the context in which it is used. You may argue that we are losing the meaning, but forcing such data into a DateTime object with lots of nullable or arbitrary values is just as meaningless. Take these as examples:

  • 1910 - 1929
  • < 1960 or Before 1960
  • > Jul 1950 or After Jul 1950
  • 1950 - Present or 1950 - Now

Unless you can cater for every possibility, mapping the period text early into a structured object like DateTime may lose data. Take Now/Present as an example: it is a relative value that should only be substituted when it is used, not when you parse or convert the value. How would you store "before" and "after" a certain date? Of course, with a lot of modelling work, you can capture all this information in a structured manner for all possibilities.

The period text should be interpreted within the context of when and how it is being used, and you can employ whichever parsing method, or natural language parsing, suits you. If the parsing fails, you can always improve on it, but you should not lose the semantic meaning of the data at the very start when you read or migrate it.
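
A very small sketch of what "interpret at the point of use" could look like. PeriodText and TryGetYearRange are hypothetical names, and the parser deliberately understands only the "yyyy - yyyy" and "yyyy - Present/Now" forms; everything else stays as raw text until you teach it more:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical type: keeps the raw period text and interprets it only on demand.
public sealed class PeriodText
{
    // Matches "1910 - 1929", "1950 - Present" and "1950 - Now" (case-insensitive).
    private static readonly Regex YearRange = new Regex(
        @"^\s*([0-9]{4})\s*-\s*([0-9]{4}|present|now)\s*$",
        RegexOptions.IgnoreCase);

    public string Raw { get; }

    public PeriodText(string raw)
    {
        if (raw == null) throw new ArgumentNullException(nameof(raw));
        Raw = raw;
    }

    // "Present"/"Now" is resolved against asOf at call time, never stored.
    // Returns false for any text this sketch does not understand.
    public bool TryGetYearRange(DateTime asOf, out int fromYear, out int toYear)
    {
        fromYear = 0;
        toYear = 0;

        Match m = YearRange.Match(Raw);
        if (!m.Success) return false;

        fromYear = int.Parse(m.Groups[1].Value);
        string end = m.Groups[2].Value;
        toYear = char.IsDigit(end[0]) ? int.Parse(end) : asOf.Year;
        return true;
    }
}
```

Text such as "< 1960" simply returns false here, but the original string is still intact, so a better parser can be dropped in later without re-importing the data.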

Fadrian Sudaman