Hello SO:

I have this Regular Expression that matches the following strings:

<!-- 09-02-2009 --->
<!-- 09-02-2009 12:00:00 --->
<!-- 09-02-2009 12:00:00 A --->
<!-- 09-02-2009 12:00:00 AM --->

Here is the pattern:

<!-- (?<month>\d{2}?)-(?<day>\d{2}?)-(?<year>\d{4}?)(?:(?: ?\d{2}:?){3}?(?: ?[aApP][mM]?)?)? --->

Updated pattern, per Twisol:

<!-- (?<month>\d{2}?)-(?<day>\d{2}?)-(?<year>\d{4}?)(?<time>(?: ?(?:\d{2}:){2}\d{2})?(?: ?[aApP][mM]?)?)? --->

Is there anything I can do to simplify this pattern?

Thanks!

EDIT

Here is the pattern I came up with from all the comments/answers, with validation built in. It is a bit ugly, but who said regex needs to be pretty? :P

<!-- (?<month>(?:0[1-9]|1[0-2]))-(?<day>(?:0[1-9]|1[0-9]|2[0-9]|3[01]))-(?<year>\d{4})(?<time> (?:0[0-9]|1[0-9]|2[0-3]):(?:[0-5][0-9])(?::[0-5][0-9])?(?: [aApP][mM]?)?)? --->

It will match valid dates in the following formats:

<!-- 09-02-2009 --->
<!-- 09-02-2009 12:00 --->
<!-- 09-02-2009 12:00 A --->
<!-- 09-02-2009 12:00 AM --->
<!-- 09-02-2009 12:00:00 --->
<!-- 09-02-2009 12:00:00 A --->
<!-- 09-02-2009 12:00:00 AM --->
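
As a quick sanity check, the final pattern can be run against the formats listed above. This is only a sketch in Python, which is an assumption since the question does not name a regex engine. Python spells named groups (?P<name>...) rather than (?<name>...), and the names FINAL_PATTERN and matches are made up for illustration.

```python
import re

# The final pattern from the question, translated to Python's named-group
# syntax: (?P<name>...) instead of (?<name>...). Assumption: the original
# targets an engine that accepts (?<name>...), such as .NET.
FINAL_PATTERN = re.compile(
    r"<!-- (?P<month>(?:0[1-9]|1[0-2]))"
    r"-(?P<day>(?:0[1-9]|1[0-9]|2[0-9]|3[01]))"
    r"-(?P<year>\d{4})"
    r"(?P<time> (?:0[0-9]|1[0-9]|2[0-3]):(?:[0-5][0-9])(?::[0-5][0-9])?"
    r"(?: [aApP][mM]?)?)? --->"
)


def matches(text: str) -> bool:
    """Hypothetical helper: True if the whole string fits the pattern."""
    return FINAL_PATTERN.fullmatch(text) is not None


accepted = [
    "<!-- 09-02-2009 --->",
    "<!-- 09-02-2009 12:00 --->",
    "<!-- 09-02-2009 12:00 A --->",
    "<!-- 09-02-2009 12:00 AM --->",
    "<!-- 09-02-2009 12:00:00 --->",
    "<!-- 09-02-2009 12:00:00 A --->",
    "<!-- 09-02-2009 12:00:00 AM --->",
]
rejected = [
    "<!-- 13-02-2009 --->",          # month out of range
    "<!-- 09-32-2009 --->",          # day out of range
    "<!-- 09-02-2009 24:00 --->",    # hour out of range
    "<!-- 09-02-2009 12:0000 --->",  # missing colon before the seconds
]

for sample in accepted + rejected:
    print(sample, matches(sample))
```

The pattern checks each field's range on its own, so a date such as 02-30-2009 still passes.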
+3  A: 
<!-- (?<month>\d\d)-(?<day>\d\d)-(?<year>\d{4})(?: \d\d:\d\d:\d\d(?: [aApP][mM]?)?)? -->

This is as simple as I can make it. Note that this regex isn't exactly the same as the original: there, the timestamp colons were all optional, meaning it would match 01:0203 or 0102:03:, etc. I think my version may be more correct.
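
To make the difference concrete, here is a small sketch that compares only the time portions of the two patterns. It assumes Python's re module (the thread does not name an engine), and the names ORIGINAL_TIME, STRICT_TIME, and results are illustrative.

```python
import re

# Time portion of the original pattern: every colon is optional (":?").
ORIGINAL_TIME = re.compile(r"(?:(?: ?\d{2}:?){3}?(?: ?[aApP][mM]?)?)?")
# Time portion of the simplified pattern: both colons are required.
STRICT_TIME = re.compile(r"(?: \d\d:\d\d:\d\d(?: [aApP][mM]?)?)?")

samples = [" 12:00:00 AM", " 01:0203", " 0102:03:"]

# Map each sample to (matched by original, matched by simplified).
results = {
    s: (
        ORIGINAL_TIME.fullmatch(s) is not None,
        STRICT_TIME.fullmatch(s) is not None,
    )
    for s in samples
}

for sample, (original_ok, strict_ok) in results.items():
    print(f"{sample!r}: original={original_ok}, simplified={strict_ok}")
```

Only the well-formed timestamp is accepted by both; the two malformed ones get through the original time portion alone.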

Basically I removed all the noncapturing groups and quantifiers I could; when a quantifier merely doubles a digit, it makes the pattern less readable, not more. I also removed the lazy modifier (the trailing ?) on the quantifiers, since a fixed-count quantifier always matches exactly 2 or 4 characters whether it is greedy or lazy.
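
The point about the trailing ? can be checked directly. The sketch below assumes Python's re module; the variable names are illustrative. It also includes a variable-count quantifier for contrast, which is the only case where lazy versus greedy matters.

```python
import re

text = "2009"

# Fixed count: only one length is possible, so lazy and greedy agree,
# and both agree with simply writing the digit twice.
fixed_lazy = re.findall(r"\d{2}?", text)
fixed_greedy = re.findall(r"\d{2}", text)
doubled = re.findall(r"\d\d", text)

# Variable count: here the lazy modifier does change the result.
range_lazy = re.findall(r"\d{2,4}?", text)
range_greedy = re.findall(r"\d{2,4}", text)

print(fixed_lazy, fixed_greedy, doubled)
print(range_lazy, range_greedy)
```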

And of course, this will match invalid dates, such as 13-32-0000. To fix that, you will have to decide whether a complex yet correct solution is more desirable than a simple, more understandable one. Basically, it depends on your confidence in the text you will be running this over. If there are likely to be false positives that you want to filter out, go for a more correct solution, even if it is slightly less readable.

Sean Nyman
Ok thanks, I see what you did there. I am going to attempt to modify this pattern to accept valid date values as annakata pointed out under the OP comments.
Anders
Won't both yours and the original also only match on ---> (3 dashes), rather than --> (2 dashes) as most of the original examples contained?
Twisol
@Twisol - oops, I forgot the third dash. That was a typo ;(
Anders
@Twisol: Ooops. Didn't notice that when I copy-pasta'd! I changed it to assume it was a typo in the original regex, and it should be two dashes.
Sean Nyman
On the "this will match invalid dates" bit of your answer: I think of regular expressions like this one as better suited to simple validation of *format*, leaving the actual data validation to be done afterwards. That (a) makes for clearer regular expressions, and (b) is very useful where the "valid data" domain might be dynamic or complex.
Twisol
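
The split described in the comment above (regex for format, a second step for the data) might look like the following sketch. It assumes Python and its standard datetime module, neither of which is mentioned in the thread, and parse_comment_date is a hypothetical helper name. The pattern is the simplified one from the answer, translated to Python's (?P<name>...) syntax.

```python
import re
from datetime import date
from typing import Optional

# Format check only: the simplified pattern from the answer, written with
# Python's (?P<name>...) named groups.
COMMENT_DATE = re.compile(
    r"<!-- (?P<month>\d\d)-(?P<day>\d\d)-(?P<year>\d{4})"
    r"(?: \d\d:\d\d:\d\d(?: [aApP][mM]?)?)? -->"
)


def parse_comment_date(text: str) -> Optional[date]:
    """Hypothetical helper: the date, or None if the format or value is bad."""
    match = COMMENT_DATE.fullmatch(text)
    if match is None:
        return None  # wrong format
    try:
        # Data validation happens here: date() raises ValueError for
        # month 13, day 32, February 30, year 0, and so on.
        return date(int(match["year"]), int(match["month"]), int(match["day"]))
    except ValueError:
        return None  # right format, impossible date


print(parse_comment_date("<!-- 09-02-2009 -->"))
print(parse_comment_date("<!-- 13-32-0000 -->"))
```

The regex stays short, and calendar rules such as month lengths and leap years are left to code that already knows them.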
A: 

Here is my take...

(?<month>\d{2}?)-(?<day>\d{2}?)-(?<year>\d{4})(?:\s\d{2}:\d{2}:\d{2}\s?[aApP]?[mM]?)?

Can't seem to make it any shorter.

J.13.L
Be sure to note that this one will match more invalid data than the original. For example, the time portion could be 01:02:03M, since the whitespace, the a/p character, and the m are each optional independently of one another.
Sean Nyman
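
A short sketch of the case raised in this comment, assuming Python's re module (named groups rewritten as (?P<name>...)). LOOSE is the pattern from this answer; GROUPED is an illustrative variant, not taken from the thread, that makes the space, a/p, and m one optional unit.

```python
import re

# The pattern from this answer: whitespace, a/p, and m are each optional
# on their own, so any combination of them is accepted.
LOOSE = re.compile(
    r"(?P<month>\d{2}?)-(?P<day>\d{2}?)-(?P<year>\d{4})"
    r"(?:\s\d{2}:\d{2}:\d{2}\s?[aApP]?[mM]?)?"
)
# Illustrative variant: the meridiem is a single optional group, so an m
# can only appear after whitespace and an a/p.
GROUPED = re.compile(
    r"(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})"
    r"(?:\s\d{2}:\d{2}:\d{2}(?:\s[aApP][mM]?)?)?"
)

samples = [
    "09-02-2009 01:02:03 AM",  # well formed
    "09-02-2009 01:02:03M",    # m with no space and no a/p
    "09-02-2009 01:02:03 ",    # trailing space only
]

# Map each sample to (matched by LOOSE, matched by GROUPED).
results = {
    s: (LOOSE.fullmatch(s) is not None, GROUPED.fullmatch(s) is not None)
    for s in samples
}

for sample, (loose_ok, grouped_ok) in results.items():
    print(f"{sample!r}: loose={loose_ok}, grouped={grouped_ok}")
```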