Given an arbitrary string (for example, "I'm going to play croquet next Friday" or "Gadzooks, is it 17th June already?"), how would you go about extracting the dates from it?

If this is looking like a good candidate for the too-hard basket, perhaps you could suggest an alternative. I want to be able to parse Twitter messages for dates. The tweets I'd be looking at would be ones which users are directing at this service, so they could be coached into using an easier format; however, I'd like it to be as transparent as possible. Is there a good middle ground you could think of?

+2  A: 

Use PHP's strtotime function.

Of course, you would need to set up some rules to parse them, since you need to get rid of all the extra content in the string, but aside from that, it's a very flexible function that will more than likely help you out here.

For example, it can take strings like "next Friday" and "June 15th" and return the appropriate UNIX timestamp for the date in the string. I guess that if you consider some basic rules like looking for "next X" and week and month names you would be able to do this.

If you could locate the "next Friday" in "I'm going to play croquet next Friday", you could extract the date. Looks like a fun project to do! But keep in mind that strtotime only takes English phrases and will not work with any other language.

For example, a rule that will locate all the "Next weekday" cases would be as simple as:

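$datestring = "I'm going to play croquet next Friday";

// Lowercase weekday names, matched against the lowercased input
$weekdays = array('monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday');

foreach ($weekdays as $weekday) {
    // Rule: the string contains "next <weekday>"
    if (strpos(strtolower($datestring), "next " . $weekday) !== false) {
        echo date("F j, Y, g:i a", strtotime("next " . $weekday));
    }
}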

This prints the date of the next weekday mentioned in the string, as long as it follows the rule! In this particular case, the output was June 18, 2010, 12:00 am. With a few (maybe more than a few!) of those rules you will more than likely extract the correct date in a high percentage of cases, provided that the users spell things correctly.

Like it's been pointed out, with regular expressions and a little patience you can do this. The hardest part of coding is deciding what way you are going to approach your problem, not coding it once you know what!
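To make the "try all the rules against the string" idea a bit more concrete, here is a minimal sketch. The function name extractByRules and the two example patterns are assumptions made up for illustration, not a complete rule set; the first rule that matches wins, and the matched phrase can then be handed to strtotime.

```php
<?php
// Hypothetical helper: returns the first phrase matched by any rule, or null.
function extractByRules(string $text, array $rules): ?string
{
    foreach ($rules as $pattern) {
        if (preg_match($pattern, $text, $match) === 1) {
            return $match[0]; // the matched phrase, e.g. "next Friday"
        }
    }
    return null;
}

// Illustrative rules only; a real service would need many more.
$rules = array(
    '/\bnext\s+(?:mon|tues|wednes|thurs|fri|satur|sun)day\b/i',
    '/\b(?:today|tomorrow|yesterday)\b/i',
);

$phrase = extractByRules("I'm going to play croquet next Friday", $rules);
if ($phrase !== null) {
    echo date("F j, Y", strtotime($phrase));
}
```

Ordering the rules from most to least specific is a design choice: it keeps a longer phrase from being shadowed by a shorter one that happens to match first.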

Not sure about how feasible that is. See for the allowed input formats. Getting rid of the incompatible parts of the string seems non-trivial.
I never said it was trivial, but it's an approach. Setting some rules that will extract the valid date is not impossible. It would take some time for sure, but it's doable. It would be a lot easier if the dates followed a certain pattern, but since it's arbitrary I can't think of a better approach. I'd be more than pleased to see another solution!
@Gordon, exactly - I'm wondering about any interesting approaches to isolating the date part which you could then parse with strtotime.
I've added an example of one of the rules I was talking about. Making a function that will try all the rules against the string shouldn't be too hard, @nickf.
+1  A: 

Something like the following might do it:

$months = array(
                    "01" => "January", 
                    "02" => "February", 
                    "03" => "March", 
                    "04" => "April", 
                    "05" => "May", 
                    "06" => "June", 
                    "07" => "July", 
                    "08" => "August", 
                    "09" => "September", 
                    "10" => "October", 
                    "11" => "November", 
                    "12" => "December"
);

$weekDays = array(
                    "01" => "Monday", 
                    "02" => "Tuesday", 
                    "03" => "Wednesday", 
                    "04" => "Thursday", 
                    "05" => "Friday", 
                    "06" => "Saturday", 
                    "07" => "Sunday"
);

foreach($months as $value){
        // extract and assign as you like...
}

You'd probably do another loop to check for other weekDays or other formats, or just nest them.
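As a rough sketch of what the loop body could look like, the snippet below collects every name from a lookup table that occurs in the string. The helper name findNames is made up for illustration, and the table is shortened to two entries to keep the example small.

```php
<?php
// Hypothetical helper: returns the entries of $names that occur in $text,
// keeping their keys (e.g. "06" => "June"). Matching is case-insensitive.
function findNames(string $text, array $names): array
{
    $found = array();
    foreach ($names as $key => $name) {
        if (stripos($text, $name) !== false) {
            $found[$key] = $name;
        }
    }
    return $found;
}

$someMonths = array("05" => "May", "06" => "June");
$found = findNames("Gadzooks, is it 17th June already?", $someMonths);
print_r($found);
```

Note that a plain substring test like this also matches inside longer words ("May" is found in "maybe"), so a word-boundary regex may be a safer choice for real input.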

+8  A: 

If you have the horsepower, you could try the following algorithm. I'm showing an example, and leaving the tedious work up to you :)

//Attempt to perform strtotime() on each contiguous subset of words...

//1st iteration
strtotime("Gadzooks, is it 17th June already")
strtotime("is it 17th June already")
strtotime("it 17th June already")
strtotime("17th June already")
strtotime("June already")

//2nd iteration
strtotime("Gadzooks, is it 17th June")
strtotime("is it 17th June")
strtotime("17th June") //date!
strtotime("June") //date!

//3rd iteration
strtotime("Gadzooks, is it 17th")
strtotime("is it 17th")
strtotime("it 17th")
strtotime("17th") //date!

//4th iteration
strtotime("Gadzooks, is it")

And we can assume that strtotime("17th June") is more accurate than strtotime("17th") simply because it contains more words... i.e. "next Friday" will always be more accurate than "Friday".

+4  A: 

I would do it this way:

First check if the entire string is a valid date with strtotime(). If so, you're done.

If not, determine how many words are in your string (split on whitespace for example). Let this number be n.

Loop over every n-1 word combination and use strtotime() to see if the phrase is a valid date. If so you've found the longest valid date string within your original string.

If not, loop over every n-2 word combination and use strtotime() to see if the phrase is a valid date. If so you've found the longest valid date string within your original string.

...and so on until you've found a valid date string or searched every single/individual word. By finding the longest matches, you'll get the most informed dates (if that makes sense). Since you're dealing with tweets, your strings will never be huge.
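The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: the function name findLongestDatePhrase and the optional $maxWords cap are invented here, phrases are taken to be runs of consecutive words, and punctuation is left attached to the words.

```php
<?php
// Hypothetical helper: tries every run of consecutive words, longest first,
// and returns the first one that strtotime() accepts (or null if none does).
function findLongestDatePhrase(string $text, int $maxWords = PHP_INT_MAX): ?array
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $count = count($words);

    for ($length = min($maxWords, $count); $length >= 1; $length--) {
        for ($start = 0; $start + $length <= $count; $start++) {
            $phrase = implode(' ', array_slice($words, $start, $length));
            $timestamp = strtotime($phrase);
            if ($timestamp !== false) {
                return array('string' => $phrase, 'unix' => $timestamp);
            }
        }
    }
    return null;
}

$result = findLongestDatePhrase("I'm going to play croquet next Friday");
if ($result !== null) {
    echo $result['string'] . ' => ' . date('F j, Y', $result['unix']);
}
```

One caveat worth checking before relying on this: strtotime() accepts some very short inputs (single letters are read as military time zones, and abbreviations such as "sat" or "may" parse as well), so matches at the one-word level can be false positives.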

Scott Saunders
This is definitely an easy way to start out. The time complexity is quite atrocious, though, so be careful. After a few thousand characters the complexity boils down to O(n^3). At the 140-character mark the savings of the n-1 have a more significant effect, but it still surpasses O(n^2).
@erisco: Agreed. I wouldn't process a book this way. A tweet should never have more than 70 words though, and usually no more than 25, so n will remain fairly small. To optimize further, you could decide that no date will be composed of more than seven words - for example: 'Thursday, June 17th, 2010 at 9:00 a.m.' Then, rather than starting with n-1, you could count down from seven.
Scott Saunders

The majority of the suggested algorithms are in fact pretty lame. I suggest using a nice regex for dates and testing the sentence with it. Use this as an example:

(\d{1,2})? (\d{2,4})?

I skipped months, since I'm not sure I remember them in the right order.

This is the easiest solution, yet it will do the job better than the other, compute-power-based solutions. (And yeah, it's hardly a fail-proof regex, but you get the point.) Then apply the strtotime function to the matched string. This is the simplest and the fastest solution.
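For illustration, here is one way the regex could be filled in with month names. This is a sketch rather than the pattern from the answer: the function name matchDayMonth is made up, only the "day month [year]" order is handled, and the year is assumed to have four digits.

```php
<?php
// Hypothetical helper: returns the first "day month [year]" phrase, or null.
function matchDayMonth(string $text): ?string
{
    $months = 'january|february|march|april|may|june|july|august|'
            . 'september|october|november|december';
    // Day with optional ordinal suffix, month name, optional 4-digit year.
    $pattern = '/\b(\d{1,2})(?:st|nd|rd|th)?\s+(' . $months . ')\b(?:\s+(\d{4}))?/i';

    if (preg_match($pattern, $text, $m) !== 1) {
        return null;
    }
    return $m[0];
}

$phrase = matchDayMonth("Gadzooks, is it 17th June already?");
if ($phrase !== null) {
    echo date('F j, Y', strtotime($phrase));
}
```

Other orders ("June 17th") or numeric formats would each need their own pattern, which is where the rule count starts to grow.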

Mikulas Dite
+1  A: 

What you're looking for is a temporal expression parser. You might look at the Wikipedia article to get started. Keep in mind that these parsers can get pretty complicated, because this is really a language recognition problem. That is a problem commonly tackled by the artificial intelligence/computational linguistics field.

+2  A: 

Following Dolph Mathews' idea and basically ignoring my previous answer, I built a pretty nice function that does exactly that. It returns the string it thinks matches a date, its Unix timestamp, and the date itself, formatted either with a user-specified format or with the predefined one (F j, Y). I wrote a small post about it, "Extracting a date from a string with PHP". As a teaser, here's the output for the two example strings:

Input: “I’m going to play croquet next Friday”

Output: Array ( 
           [string] => "next friday",
           [unix] => 1276844400,
           [date] => "June 18, 2010" 
        )

Input: “Gadzooks, is it 17th June already?”

Output: Array ( 
           [string] => "17th june",
           [unix] => 1276758000,
           [date] => "June 17, 2010" 
        )

I hope it helps someone.