I'm working on the search box for an events website. I've been recording the searches people make, and a lot of people are entering a {date}+{keyword} combo.

Example searches:

jazz 5th november
dj shadow tonight
2nd october live music

So I need to write or find a regex that can match textual dates within a longer string.

I'm thinking the easiest way to do this would be to work from the source code of PHP's strtotime(), assuming it is built on regular expressions.

Can anyone give me tips on obtaining the source, or has anyone come across any good regular expressions for textual dates?

+1  A: 

Expanding on this answer, how about using this to find dates (or things that at least look like dates) within the text and then try parsing those:

\b                     # match a word boundary
(?:                    # either...
 (?:                   # match the following one to three times:
  (?:                  # either
   \d+                 # a number,
   (?:\.|st|nd|rd|th)* # followed by a dot, st, nd, rd, or th (optional)
   |                   # or a month name
   (?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
  )
  [\s./-]*             # followed by a date separator or whitespace (optional)
 ){1,3}                # do this one to three times
|                      # or match a "colloquial" date and capture in backref 1:
(to(?:day|ni(?:te|ght)|morrow)|next\s+(?:week|month|year))
)
\b                     # and end at a word boundary.

So if you have a match, and backref $1 is empty, then a literal date was presumably found; if $1 is not empty, it found a date like "today" or "next week". Of course, this is only going to work with dates in English text, and it's probably not going to be very reliable.

if (preg_match(
    '%\b                   # match a word boundary
    (?:                    # either...
     (?:                   # match the following one to three times:
      (?:                  # either
       \d+                 # a number,
       (?:\.|st|nd|rd|th)* # followed by a dot, st, nd, rd, or th (optional)
       |                   # or a month name
       (?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
      )
      [\s./-]*             # followed by a date separator or whitespace (optional)
     ){1,3}                # do this one to three times
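    |                      # or match a "colloquial" date and capture in backref 1:
    (to(?:day|ni(?:te|ght)|morrow)|next\s+(?:week|month|year))
    )
    \b                    # and end at a word boundary.%ix',
    $subject, $regs)) {
    $result = $regs[0];
    // $regs[1] is only set when the colloquial alternative matched,
    // because preg_match() drops trailing groups that did not participate.
    $colloq = isset($regs[1]) ? $regs[1] : "";
} else {
    $result = "";
    $colloq = "";
}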
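
As a rough sketch of the "find, then parse" idea applied to the example searches from the question, the snippet below wraps the same pattern in a helper that splits a query into a date phrase and the remaining keywords. The function name splitDateQuery() and its return shape are made up for illustration, and mapping "tonight"/"tonite" to "today" is an assumption about what strtotime() accepts, not something tested against every PHP version.

```php
<?php
// Hypothetical helper (not part of any library): splits a search query
// into a date-like phrase and the leftover keywords.
function splitDateQuery(string $query): array
{
    $pattern = '%\b
        (?:
          (?:
            (?:
              \d+(?:\.|st|nd|rd|th)*
              |
              (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*
            )
            [\s./-]*
          ){1,3}
          |
          (to(?:day|ni(?:te|ght)|morrow)|next\s+(?:week|month|year))
        )
        \b%ix';

    if (!preg_match($pattern, $query, $m, PREG_OFFSET_CAPTURE)) {
        return ['date' => null, 'timestamp' => null, 'keywords' => trim($query)];
    }

    // With PREG_OFFSET_CAPTURE each entry is [matched text, byte offset].
    [$text, $offset] = $m[0];
    $datePhrase = trim($text);

    // Everything outside the matched span is treated as the keyword part.
    $before   = (string) substr($query, 0, $offset);
    $after    = (string) substr($query, $offset + strlen($text));
    $keywords = trim(preg_replace('/\s+/', ' ', $before . ' ' . $after));

    // Assumption: strtotime() may not understand "tonight"/"tonite",
    // so those are mapped to "today" before parsing.
    $parseable = preg_replace('/^toni(?:te|ght)$/i', 'today', $datePhrase);
    $timestamp = strtotime($parseable);

    return [
        'date'      => $datePhrase,
        'timestamp' => $timestamp === false ? null : $timestamp,
        'keywords'  => $keywords,
    ];
}

foreach (['jazz 5th november', 'dj shadow tonight', '2nd october live music'] as $q) {
    $r = splitDateQuery($q);
    printf("%-24s date=[%s] keywords=[%s]\n", $q, $r['date'], $r['keywords']);
}
```

A null timestamp means the phrase looked like a date but strtotime() could not parse it, in which case it is probably safest to treat the whole query as keywords.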
Tim Pietzcker
Fantastic tim, thanks!
Haroldo
A: 

strtotime recognizes every format that is explained in Date and Time Formats. You could take the formats right from there and build the regular expression on your own.

Here’s an example for the time formats:

// Used Symbols
// Note: {$var} is required wherever "[" follows a variable; otherwise PHP
// tries to parse it as an array index. Match with the i modifier so that
// "t" also accepts "T".
$frac = "(?:\.[0-9]+)"; // ".21342", ".85"
$hh = "(?:0?[1-9]|1[0-2])"; // "04", "7", "12"
$HH = "(?:[01][0-9]|2[0-4])"; // "04", "07", "19"
$meridian = "(?:[AaPp]\.?[Mm]\.?(?=[ \t]|\z))"; // "A.m.", "pM", "am."; must be followed by a space, a tab, or the end of the string
$MM = "(?:[0-5][0-9])"; // "00", "12", "59"
$II = "(?:[0-5][0-9])"; // "00", "12", "59"
$space = "(?:[ \t])";
$tz = "(?:\(?[A-Za-z]{1,6}\)?|[A-Z][a-z]+(?:[_/][A-Z][a-z]+)+)"; // "CEST", "Europe/Amsterdam", "America/Indiana/Knox"
$tzcorrection = "(?:(?:GMT)?[+-]{$hh}:?{$MM}?)"; // "+0400", "GMT-07:00", "-07:00"

// 12 Hour Notation
$Hour_only_with_meridian = "(?:{$hh}{$space}?{$meridian})"; // "4 am", "5PM"
$Hour_and_minutes_with_meridian = "(?:{$hh}[.:]{$MM}{$space}?{$meridian})"; // "4:08 am", "7:19P.M."
$Hour_minutes_and_seconds_with_meridian = "(?:{$hh}[.:]{$MM}[.:]{$II}{$space}?{$meridian})"; // "4:08:37 am", "7:19:19P.M."
$Hour_minutes_seconds_and_fraction_with_meridian = "(?:{$hh}[.:]{$MM}[.:]{$II}[.:][0-9]+{$meridian})"; // "4:08:39:12313am"

// 24 Hour Notation
$Hour_and_minutes = "(?:t?{$HH}[.:]{$MM})"; // "04:08", "19.19", "T23:43"
$Hour_and_minutes_no_colon = "(?:t?{$HH}{$MM})"; // "0408", "t1919", "T2343"
$Hour_minutes_and_seconds = "(?:t?{$HH}[.:]{$MM}[.:]{$II})"; // "04.08.37", "t19:19:19"
$Hour_minutes_and_seconds_no_colon = "(?:t?{$HH}{$MM}{$II})"; // "040837", "T191919"
$Hour_minutes_seconds_and_timezone = "(?:t?{$HH}[.:]{$MM}[.:]{$II}{$space}?(?:{$tzcorrection}|{$tz}))"; // "04:08:37 CEST", "t19:19:19-0700"
$Hour_minutes_seconds_and_fraction = "(?:t?{$HH}[.:]{$MM}[.:]{$II}{$frac})"; // "04.08.37.81412", "19:19:19.532453"
$Time_zone_information = "(?:{$tz}|{$tzcorrection})"; // "CEST", "Europe/Amsterdam", "+0430", "GMT-06:00"
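
To show how fragments like these might be joined into one usable pattern, here is a minimal, self-contained sketch that rebuilds a few of them and searches a string for 12-hour or 24-hour times. The names $timePattern and findTimes() are hypothetical, the meridian fragment is simplified, and the lookarounds are my own addition to stop the pattern from matching in the middle of a longer number.

```php
<?php
// Building blocks, kept in single quotes so backslashes reach PCRE unchanged.
$hh       = '(?:0?[1-9]|1[0-2])';   // 12-hour hour: "04", "7", "12"
$HH       = '(?:[01][0-9]|2[0-4])'; // 24-hour hour: "04", "07", "19"
$MM       = '(?:[0-5][0-9])';       // minutes: "00", "59"
$space    = '(?:[ \t])';
$meridian = '(?:[ap]\.?m\.?)';      // "am", "p.m." (case handled by the i modifier)

// {$var} keeps PHP from reading a following "[" as an array index.
$twelveHour     = "(?:{$hh}(?:[.:]{$MM})?{$space}?{$meridian})"; // "8 pm", "7:30pm"
$twentyFourHour = "(?:{$HH}[.:]{$MM})";                          // "21:15", "04.08"

// Lookarounds (an assumption of this sketch) avoid matching inside longer tokens.
$timePattern = "/(?<![0-9:.])(?:{$twelveHour}|{$twentyFourHour})(?![0-9a-z])/i";

// Hypothetical helper: returns every time-like substring found in $text.
function findTimes(string $text): array
{
    global $timePattern;
    preg_match_all($timePattern, $text, $m);
    return $m[0];
}

print_r(findTimes('live music 21:15 tonight'));
print_r(findTimes('jazz 7:30pm'));
```

The date formats from the same manual page could be assembled the same way and combined with the time pattern through alternation.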
Gumbo