How can the regex below be modified to match dates with an ordinal suffix on the day? It currently matches "Jan 1, 2003 | February 29, 2004 | November 02, 3202", but I also need it to match: "Jan 1st, 2003 | February 29th, 2004 | November 02nd, 3202 | March 3rd, 2010"

^(?:(((Jan(uary)?|Ma(r(ch)?|y)|Jul(y)?|Aug(ust)?|Oct(ober)?|Dec(ember)?)\ 31)|((Jan(uary)?|Ma(r(ch)?|y)|Apr(il)?|Ju((ly?)|(ne?))|Aug(ust)?|Oct(ober)?|(Sept|Nov|Dec)(ember)?)\ (0?[1-9]|([12]\d)|30))|(Feb(ruary)?\ (0?[1-9]|1\d|2[0-8]|(29(?=,\ ((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)))))))\,\ ((1[6-9]|[2-9]\d)\d{2}))

Thank you.

+1  A: 

That regex is doing waaaaay too much. You'd be much better off using your language's equivalent of strptime(). However, the regex below will match ordinals:

^(?:(((Jan(uary)?|Ma(r(ch)?|y)|Jul(y)?|Aug(ust)?|Oct(ober)?|Dec(ember)?)\ 31(st)?)|((Jan(uary)?|Ma(r(ch)?|y)|Apr(il)?|Ju((ly?)|(ne?))|Aug(ust)?|Oct(ober)?|(Sept|Nov|Dec)(ember)?)\ (0?[1-9]|([12]\d)|30))(st|nd|rd|th)?|(Feb(ruary)?\ (0?[1-9]|1\d|2[0-8]|(29(th)?(?=,\ ((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00)))))(st|nd|rd|th)?))\,\ ((1[6-9]|[2-9]\d)\d{2}))

Note that it will also match things like "20nd", but that is unlikely enough in real data that it's rarely worth caring about.
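For what it's worth, the strptime() route could look something like the sketch below in Python. It strips the ordinal suffix first and lets the date library decide whether the day exists, leap years included. The helper name `parse_date` and the two format strings are my own assumptions, not something from the question; month names in strptime are locale-dependent, and "Sept" is not a recognized abbreviation, so it would need extra handling.

```python
import re
from datetime import datetime

# Hypothetical helper for illustration; not part of the original regex.
# Removes st/nd/rd/th only when it directly follows a digit.
_ORDINAL = re.compile(r"(?<=\d)(?:st|nd|rd|th)\b", re.IGNORECASE)

def parse_date(text):
    """Return a date for strings like 'Jan 1st, 2003', or None if invalid."""
    cleaned = _ORDINAL.sub("", text.strip())
    # Try the full month name first, then the three-letter abbreviation.
    for fmt in ("%B %d, %Y", "%b %d, %Y"):
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None

print(parse_date("February 29th, 2004"))  # 2004-02-29
print(parse_date("February 29th, 2003"))  # None
```

Like the regex, this accepts a mismatched suffix such as "20nd", since the suffix is discarded before parsing.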

Max Shawabkeh
While I like your answer and it works, I picked Jay's regex since it was smaller. I wish I could select both as correct.
NTulip
+3  A: 

This will depend on your use case, but in the interest of pragmatism, you might do well to just match anything of this form:
(1) any month name or abbreviation;
(2) whitespace;
(3) any one or two digits;
(4) whitespace;
(5) any st,nd,rd,th;
(6) whitespace OR comma + optional whitespace;
(7) any four digits;

I'm not sure what text you're matching against, but if I had Jan 35nd,3001, I think I'd rather capture it now and invalidate it later than skip over it right at the get-go.

Also, depending on your data set, consider case sensitivity issues and common international English variants, like 1 Jan 2004 or 1st Jan, 2004 or January, 2004 etc.

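(Line breaks added for readability; remove them, or enable your engine's ignore-pattern-whitespace option, and match case-insensitively.)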

^(?:j(?:an(?:uary)?|un(?:e)?|ul(?:y)?)?|feb(?:ruary)?|ma(?:r(?:ch)?|y)
|a(?:pr(?:il)?|ug(?:ust)?)|sep(?:t|tember)?|oct(?:ober)?|(?:nov|dec)(?:ember)?)
\s+\d{1,2}(?:st|nd|rd|th)?(?:\s+|,\s*)\d{4}\b
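To make the "capture it now, invalidate it later" idea concrete, here is a rough Python sketch built on the pattern above. The named groups, the `MONTHS` table and the `extract_date` helper are my additions for illustration. I have also dropped the optional marker after the `j` group so that a bare "j" is not accepted, which is an assumption about the intent.

```python
import re
from datetime import date

# Same structure as the pattern above, with named groups added.
DATE_RE = re.compile(
    r"""
    ^(?P<month>j(?:an(?:uary)?|une?|uly?)|feb(?:ruary)?|ma(?:r(?:ch)?|y)
      |a(?:pr(?:il)?|ug(?:ust)?)|sep(?:t|tember)?|oct(?:ober)?
      |(?:nov|dec)(?:ember)?)
    \s+(?P<day>\d{1,2})(?:st|nd|rd|th)?   # day with optional ordinal suffix
    (?:\s+|,\s*)(?P<year>\d{4})\b         # separator, then a four-digit year
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Month number keyed by the first three letters of the name.
MONTHS = {name: number for number, name in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split(), start=1)}

def extract_date(text):
    """Capture loosely, validate afterwards; None if no match or no such day."""
    m = DATE_RE.match(text.strip())
    if m is None:
        return None
    try:
        return date(int(m["year"]), MONTHS[m["month"][:3].lower()], int(m["day"]))
    except ValueError:  # matched the shape, but the calendar says no
        return None

print(bool(DATE_RE.match("Jan 35nd,3001")))  # True: captured by the regex
print(extract_date("Jan 35nd,3001"))         # None: rejected on validation
print(extract_date("January 20, 2020"))      # 2020-01-20
```

Keeping the regex loose and the validation in a date library means the leap-year logic no longer has to live inside the pattern.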

Even more pragmatic (and readable), unless you have a very bizarre dataset, is to allow anything after the common prefixes:

(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*?\s+\d{1,2}(?:[a-z]{2})?(?:\s+|,\s*)\d{4}\b

Would this match "octogenarianism 99xx, 0000"? Yes. Is that likely to be an issue? I doubt it.
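If you want to see what the looser pattern lets through, a quick check along these lines might help. It is only a sketch in Python with the ignore-case flag set; the sample strings are my own picks, partly taken from the question.

```python
import re

# The permissive pattern from above, compiled case-insensitively.
LOOSE_RE = re.compile(
    r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*?"
    r"\s+\d{1,2}(?:[a-z]{2})?(?:\s+|,\s*)\d{4}\b",
    re.IGNORECASE,
)

samples = [
    "Jan 1st, 2003",
    "February 29th, 2004",
    "November 02nd, 3202",
    "March 3rd, 2010",
    "octogenarianism 99xx, 0000",  # nonsense, but it has the right shape
    "1 Jan 2004",                  # day-first variant
]

# Map each sample to whether the pattern finds a match anywhere in it.
hits = {s: bool(LOOSE_RE.search(s)) for s in samples}
for s, matched in hits.items():
    print(f"{matched!s:5}  {s}")
```

The first five samples match, including the nonsense one; the day-first "1 Jan 2004" does not, so variants like that would need a pattern of their own.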

Jay
I agree with you. I know nothing about regex so I had to rely on a sample I had found. I tested your sample against http://regexlib.com/RETester.aspx and it couldn't match January 20, 2020.
NTulip
Sorry, some Perl regex metacharacters snuck in there. I've edited it to match the .NET flavour.
Jay
Thank you. Works great.
NTulip
Very sensible. I just have one quibble: you list `(4) whitespace;` *between* the digits and the ordinal suffix. That doesn't really belong there--and I notice it's not reflected in your regexes, either.
Alan Moore
You're right, Alan. Cut-and-paste editing gone awry ("Foiling computer users since 1974").
Jay