Has anyone found a simple but effective way to extract date references from text? I've done a fair amount of searching for temporal extraction tools, but there isn't much out there. There are a few white papers, but the topic seems to be treated as a small corner of the whole semantic web thing and doesn't get much attention.

I'm just looking for something that is 80% effective. There is no need to capture things like "the month after Jan 2009", but basic, common date entities would be nice.

I'm open to all suggestions, even fancy regexes.

Fire away!

(and thanks - Henry)

+1  A: 

One way I have done this is to look for any run of four digits and convert it to a number. If the number falls within the range of years you are interested in, you probably have a year you can use. If you also want months and days, check the adjacent words to see whether they are a month name or a number between 1 and 31. I am confident this would satisfy your 80% requirement.

Regex for years: [0-9]{4} - you will need to convert the match to a number and check that it's within the range of years you consider valid. Adding word boundaries (\b) around the pattern avoids matching four digits inside a longer number.

Regex for months: jan|january|feb|february ... etc for each month

Regex for days of the month: [0-9]{1,2} - you will need to convert the match to a number and check that it is between 1 and 31
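
The heuristic above could be sketched roughly as follows in Python. This is only an illustration, not a tested extractor: the function name `extract_dates`, the year range 1900-2099, and the two-token look-behind window are all assumptions made for the example, and it only looks at words that come before the year.

```python
import re

# Assumed range of "valid" years; adjust for your own data.
MIN_YEAR, MAX_YEAR = 1900, 2099

# Month names and common abbreviations mapped to month numbers.
MONTHS = {
    "jan": 1, "january": 1, "feb": 2, "february": 2, "mar": 3, "march": 3,
    "apr": 4, "april": 4, "may": 5, "jun": 6, "june": 6, "jul": 7, "july": 7,
    "aug": 8, "august": 8, "sep": 9, "sept": 9, "september": 9,
    "oct": 10, "october": 10, "nov": 11, "november": 11,
    "dec": 12, "december": 12,
}


def extract_dates(text):
    """Return a list of (year, month, day); month/day are None if not found."""
    # Split into runs of letters and runs of digits, dropping punctuation.
    tokens = re.findall(r"[A-Za-z]+|[0-9]+", text)
    results = []
    for i, tok in enumerate(tokens):
        # Step 1: a token of exactly four digits is a candidate year.
        if not re.fullmatch(r"[0-9]{4}", tok):
            continue
        year = int(tok)
        if not MIN_YEAR <= year <= MAX_YEAR:
            continue
        # Step 2: inspect up to two tokens before the year (hypothetical window).
        month = day = None
        for word in tokens[max(0, i - 2):i]:
            if word.lower() in MONTHS:
                month = MONTHS[word.lower()]
            elif word.isdigit() and len(word) <= 2 and 1 <= int(word) <= 31:
                day = int(word)
        # A day number without a month is too ambiguous to keep.
        if month is None:
            day = None
        results.append((year, month, day))
    return results
```

Calling `extract_dates("The meeting on March 5, 2009 was moved to 12 Jan 2010, not 2011.")` should yield one tuple per year found. Words such as "may" will produce false positives, which is the kind of error the 80% target tolerates.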

John JJ Curtis
I currently extract the year using a simple regex => /\b((19|20)\d\d)\b/ (I only wanted to focus on years beginning with 19 and 20 to limit false positives). The next step is to look for months, but I still haven't found a way to deal with multiple dates in the same sentence.
henry74
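
One possible way to handle several dates in a single sentence is to extend the year regex from the comment above with an optional month and day in front of it, then iterate over all non-overlapping matches. The sketch below is in Python; the names `MONTH`, `DATE_RE` and `find_dates` are made up for this example, and it only covers "Month [day,] year" ordering.

```python
import re

# Month names with optional long forms, e.g. "jan" or "january".
MONTH = (
    r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
    r"|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
)

# Optional "Month [day[suffix][,]] " prefix, followed by a 19xx/20xx year.
DATE_RE = re.compile(
    r"\b(?:(?P<month>" + MONTH + r")\.?\s+"
    r"(?:(?P<day>\d{1,2})(?:st|nd|rd|th)?,?\s+)?)?"
    r"(?P<year>(?:19|20)\d\d)\b",
    re.IGNORECASE,
)


def find_dates(sentence):
    """Return (month, day, year) strings for every date-like match, in order."""
    return [
        (m.group("month"), m.group("day"), m.group("year"))
        for m in DATE_RE.finditer(sentence)
    ]
```

Because `finditer` yields each match separately, a sentence such as "Sales rose from Jan 2008 to March 3rd, 2009 and again in 2010." produces three independent results, with `None` for any part that was not present.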
A: 

I'm drawing a blank on how to find what to feed it, but this library will parse a wide range of dates and could be used as the "is this a real date" function. (Full disclosure, I'm the author of that lib)

BCS
Looks like the library requires you to send in the actual date terms. I'm looking for something which allows you to feed it sentences and have it extract the date/time entities.
henry74
A: 

I'm working on exactly this at hotdate. I'll un-spaghetti the code later this month :)