I want to match dates that have the following format:

2010-08-27 02:11:36

i.e. yyyy-mm-dd hh:mm:ss.

For now I do not care whether the date is actually valid, only that it is in the correct format.

Possible formats that should match are (for this example)

2010
2010-08
2010-08-27
2010-08-27 02
2010-08-27 02:11
2010-08-27 02:11:36

What would be a concise Perl regex for this?

I have this so far (which works, btw):

/\d{4}(-\d{2}(-\d{2}( \d{2}(:\d{2}(:\d{2})?)?)?)?)?/

Can this be improved performance-wise?

+5  A: 

How about something from Regexp::Common::time?

Philip Potter
+3  A: 

I would use the split function:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my @dates = (
'2010',
'2010-08',
'2010-08-27',
'2010-08-27 02',
'2010-08-27 02:11',
'2010-08-27 02:11:36',
);

for (@dates) {
  my @list = split /[ :-]/;
  print Dumper(\@list);
}

Output:

$VAR1 = [
          '2010'
        ];
$VAR1 = [
          '2010',
          '08'
        ];
$VAR1 = [
          '2010',
          '08',
          '27'
        ];
$VAR1 = [
          '2010',
          '08',
          '27',
          '02'
        ];
$VAR1 = [
          '2010',
          '08',
          '27',
          '02',
          '11'
        ];
$VAR1 = [
          '2010',
          '08',
          '27',
          '02',
          '11',
          '36'
        ];
M42
I did not understand. What are you trying to do here?
Lazer
I'm splitting each of the date formats you've given into an array. Once that is done, you can test the values contained in the array. `array[0]` contains the year, `array[1]` contains the month (if present), and so on.
M42
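To make that follow-up step concrete, here is a minimal sketch of testing the split fields. The helper name `looks_like_date` and the width and range tables are assumptions for illustration, not part of the answer above; it checks field count, digit widths, and loose ranges, not calendar validity.

```perl
use strict;
use warnings;

# Hypothetical helper: split the string as in the answer, then test each field.
# The width and range tables below are illustrative assumptions.
sub looks_like_date {
  my ($str) = @_;
  my @fields = split /[ :-]/, $str, -1;   # -1 keeps trailing empty fields
  return 0 if @fields < 1 || @fields > 6;

  my @width = (4, 2, 2, 2, 2, 2);          # yyyy mm dd hh mm ss
  my @min   = (0, 1, 1, 0, 0, 0);
  my @max   = (9999, 12, 31, 23, 59, 59);

  for my $i (0 .. $#fields) {
    my $w = $width[$i];
    return 0 unless $fields[$i] =~ /\A\d{$w}\z/;
    return 0 if $fields[$i] < $min[$i] || $fields[$i] > $max[$i];
  }
  return 1;
}

print "$_: ", (looks_like_date($_) ? "ok" : "rejected"), "\n"
  for '2010-08-27 02:11', '2010-13', '2010:08';
```

Because split discards the separators, this sketch also accepts '2010:08'; checking that the right separator appears in the right place still needs a regex.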
+1  A: 

This matches all of the formats above (but also other input; see the comments) and may be slightly easier to read:

/(\d{4})(-\d{2})?(\w{1}\d{2})?(:\d{2})?/
Dave Everitt
I wouldn't say this is cleaner. Nor that it does the job, actually: it accepts 1234q56, for instance. Also: {0,2}? is superfluous, you can't optionally match zero times.
mscha
Accepted - I only tested against all the given patterns. Thanks for the heads-up about the {0,2}? - cross-brain-infection from something else I was doing. Corrected.
Dave Everitt
The ? in {0,2}? is non-greedy, so it will prefer matching fewer times, presumably not what was intended, but certainly not superfluous.
ysth
Thanks ysth. Was using exactly that in some Ruby regex (hence the spillover into this question).
Dave Everitt
Indeed, I was wrong about {0,2}?, apologies.
mscha
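The two points raised in these comments can be checked directly. This is a small sketch with made-up sample strings: `\w` accepts any word character where a space was intended, and `{0,2}?` is a lazy quantifier that prefers the fewest repetitions.

```perl
use strict;
use warnings;

# \w matches letters, digits and underscore, so a stray 'q' is accepted.
my $loose = qr/(\d{4})(-\d{2})?(\w{1}\d{2})?(:\d{2})?/;
print "1234q56 matches the loose pattern\n" if '1234q56' =~ /\A$loose\z/;

# Greedy {0,2} takes as many repetitions as it can; lazy {0,2}? as few.
my ($greedy) = 'aaa' =~ /(a{0,2})/;
my ($lazy)   = 'aaa' =~ /(a{0,2}?)/;
printf "greedy captured '%s', lazy captured '%s'\n", $greedy, $lazy;
```

The greedy form captures 'aa' and the lazy form captures the empty string, which is why the trailing ? changes behaviour even though it is rarely what one wants here.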
+1  A: 

If you want something faster, look beyond regexes to XS modules; Date::Calc is a good one.

gms8994
+2  A: 

Your regex is just fine except for missing anchors (unless you want to match 2008 in "abc200890"?). Assuming you want to match the whole string:

/^\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?\z/

(?:...) should be used if you don't actually want the captured substrings, which I'd guess to be the case.

ysth
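A short sketch of what the anchors change, using the pattern from this answer (the test strings are illustrative). Note that \z matches only at the very end of the string, so a trailing newline is rejected, whereas $ would allow it.

```perl
use strict;
use warnings;

my $unanchored = qr/\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?/;
my $anchored   = qr/^$unanchored\z/;

for my $s ('2010-08-27', 'abc200890', "2010-08\n") {
  (my $shown = $s) =~ s/\n/\\n/g;   # make the newline visible in the output
  printf "%-12s unanchored=%d anchored=%d\n",
    $shown, ($s =~ $unanchored ? 1 : 0), ($s =~ $anchored ? 1 : 0);
}
```

Only '2010-08-27' passes the anchored pattern; the unanchored one accepts all three strings because it finds four digits somewhere inside each.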
+7  A: 

Based on the lack of a capturing group around the year, I assume you care only whether a date matches.

I tried a few different patterns related to the one from your question, and the change that gave a ten- to fifteen-percent improvement was disabling capturing, i.e.,

/\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?/

The perlre documentation covers (?:...):

(?:pattern)

(?imsx-imsx:pattern)

This is for clustering, not capturing; it groups subexpressions like (), but doesn't make backreferences as () does. So

@fields = split(/\b(?:a|b|c)\b/)

is like

@fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. It's also cheaper not to capture characters if you don't need to.

Any letters between ? and : act as flags modifiers as with (?imsx-imsx). For example,

/(?s-i:more.*than).*million/i

is equivalent to the more verbose

/(?:(?s-i)more.*than).*million/i
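The split behaviour described in the perlre excerpt can be seen with a small sketch. The input string and the extra \s* around the separator are illustrative assumptions, added only to keep the output tidy.

```perl
use strict;
use warnings;

my $text = 'x a y b z';   # illustrative input

my @clustered = split /\s*\b(?:a|b|c)\b\s*/, $text;   # separators dropped
my @captured  = split /\s*\b(a|b|c)\b\s*/,   $text;   # separators returned too

print "clustered: @clustered\n";
print "captured:  @captured\n";
```

With clustering, split returns x, y and z; with a capturing group, the matched separators a and b are interleaved with them.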

Benchmark output:

             Rate      U   U/NC CH/NC/A CH/NC/A/U     CH  CH/NC   null
U         31811/s     --   -32%    -58%      -59%   -61%   -66%   -93%
U/NC      46849/s    47%     --    -38%      -39%   -42%   -50%   -90%
CH/NC/A   76119/s   139%    62%      --       -1%    -6%   -18%   -84%
CH/NC/A/U 76663/s   141%    64%      1%        --    -6%   -17%   -84%
CH        81147/s   155%    73%      7%        6%     --   -13%   -83%
CH/NC     92789/s   192%    98%     22%       21%    14%     --   -81%
null     481882/s  1415%   929%    533%      529%   494%   419%     --

Code:

#! /usr/bin/perl

use warnings;
use strict;

use Benchmark qw/ :all /;

sub option_chain {
  local($_) = @_;
  /\d{4}(-\d{2}(-\d{2}( \d{2}(:\d{2}(:\d{2})?)?)?)?)?/
}

sub option_chain_nocap {
  local($_) = @_;
  /\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?/
}

sub option_chain_nocap_anchored {
  local($_) = @_;
  /\A\d{4}(?:-\d{2}(?:-\d{2}(?: \d{2}(?::\d{2}(?::\d{2})?)?)?)?)?\z/
}

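# Note: despite the "NC" in its benchmark label (CH/NC/A/U), this pattern
# still uses capturing groups.
sub option_chain_anchored_unrolled {
  local($_) = @_;
  /\A\d\d\d\d(-\d\d(-\d\d( \d\d(:\d\d(:\d\d)?)?)?)?)?\z/
}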

sub simple_split {
  local($_) = @_;
  split /[ :-]/;
}

sub unrolled {
  local($_) = @_;
  # Under /x, bare whitespace in the pattern is ignored, so the literal
  # space between date and time is written as [ ].
  grep defined($_), /\A (\d\d\d\d)-(\d\d)-(\d\d) [ ] (\d\d):(\d\d):(\d\d) \z
                    |\A (\d\d\d\d)-(\d\d)-(\d\d) [ ] (\d\d):(\d\d)        \z
                    |\A (\d\d\d\d)-(\d\d)-(\d\d) [ ] (\d\d)               \z
                    |\A (\d\d\d\d)-(\d\d)-(\d\d)                          \z
                    |\A (\d\d\d\d)-(\d\d)                                 \z
                    |\A (\d\d\d\d)                                        \z
                    /x;
}

sub unrolled_nocap {
  local($_) = @_;
  # Same literal-space fix as above; with no captures, a successful match
  # returns (1) in list context.
  grep defined($_), /\A \d\d\d\d-\d\d-\d\d [ ] \d\d:\d\d:\d\d \z
                    |\A \d\d\d\d-\d\d-\d\d [ ] \d\d:\d\d      \z
                    |\A \d\d\d\d-\d\d-\d\d [ ] \d\d           \z
                    |\A \d\d\d\d-\d\d-\d\d                    \z
                    |\A \d\d\d\d-\d\d                         \z
                    |\A \d\d\d\d                              \z
                    /x;
}

sub id { $_[0] }

my @examples = (
  "xyz",
  "2010",
  "2010-08",
  "2010-08-27",
  "2010-08-27 02",
  "2010-08-27 02:11",
  "2010-08-27 02:11:36",
);

cmpthese -1 => {
  "CH"        => sub {                   option_chain $_ for @examples },
  "CH/NC"     => sub {             option_chain_nocap $_ for @examples },
  "CH/NC/A"   => sub {    option_chain_nocap_anchored $_ for @examples },
  "CH/NC/A/U" => sub { option_chain_anchored_unrolled $_ for @examples },
  "U"         => sub {                       unrolled $_ for @examples },
  "U/NC"      => sub {                 unrolled_nocap $_ for @examples },
  "null"      => sub {                             id $_ for @examples },
};
Greg Bacon
thanks a lot for the effort, @gbacon.
Lazer
+20 (if I could) for producing a benchmark vs armchair speculation! +1 anyways...
drewk