Is there an existing solution for creating regular expressions dynamically from a given date-time format pattern? Which pattern syntax is supported does not matter (Joda DateTimeFormat, java.text.SimpleDateFormat, or others).

That is, for a given date-time format (for example "dd/MM/yyyy hh:mm"), it would generate the corresponding regular expression to match date-times written in that format.

A: 

SimpleDateFormat already does this with the parse() method.

If you need to parse multiple dates from a single string, start with a regex (even if it matches too leniently), and use parse() on all the potential matches found by the regex.
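
A minimal sketch of that two-step approach, assuming the format is "dd/MM/yyyy". The class name `DateScanner`, the method `findDates`, and the candidate regex are illustrative and not part of any library; only `SimpleDateFormat` and `java.util.regex` are real APIs here.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateScanner {
    // Deliberately lenient candidate pattern (assumption: format is dd/MM/yyyy).
    // It also accepts nonsense such as 99/99/2004; parse() filters that out later.
    private static final Pattern CANDIDATE =
            Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b");

    public static List<Date> findDates(String text) {
        SimpleDateFormat format = new SimpleDateFormat("dd/MM/yyyy");
        format.setLenient(false); // reject impossible dates instead of rolling them over

        List<Date> dates = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            try {
                dates.add(format.parse(m.group()));
            } catch (ParseException e) {
                // The regex matched, but the candidate is not a valid date; skip it.
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        String text = "sometext 12/03/2004 sometext 99/99/2004 end";
        System.out.println(findDates(text).size());
    }
}
```

The regex only has to be good enough to locate candidates; the strict `parse()` call decides which of them are real dates.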

Jason Cohen
It only parses if the given text matches the pattern, and returns the Date representation of that string. It doesn't parse if the date is embedded in other text, like "sometext 12/03/2004 sometext".
hakan
That might be undecidable. What if the text contains two dates, e.g. "Between A and B, blah" where A and B are dates...
Jason Cohen
Well, then it will find them as two groups, just like a regex find.
hakan
OK, updated the answer accordingly.
Jason Cohen
A: 

If you're looking for basic date checking, this regex matches the sample data below.

\b(0?[1-9]|[12][0-9]|3[01])[- /.](0?[1-9]|1[012])[- /.](19|20)?[0-9]{2}\b

10/07/2008  
10.07.2008
1-01/2008
10/07/08    
10.07.2008
1-01/08

Code via RegexBuddy
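
For use from Java, the regex above can be compiled with `java.util.regex.Pattern`. This is only a sketch; the class name `BasicDateCheck` and the method `isDate` are made up for illustration. Note that the pattern checks the range of each field independently, so a string like "31/02/2008" still matches.

```java
import java.util.regex.Pattern;

public class BasicDateCheck {
    // The regex from the answer, with backslashes doubled for a Java string literal.
    static final Pattern DATE = Pattern.compile(
            "\\b(0?[1-9]|[12][0-9]|3[01])[- /.](0?[1-9]|1[012])[- /.](19|20)?[0-9]{2}\\b");

    // True when the whole string is a date in day-month-year order.
    public static boolean isDate(String s) {
        return DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"10/07/2008", "10.07.2008", "1-01/2008", "10/07/08", "1-01/08", "32/07/2008"};
        for (String s : samples) {
            System.out.println(s + " -> " + isDate(s));
        }
    }
}
```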

Keng
What I need is a regex generator for a given DateTimeFormat. I don't know the date format used in the given corpus, so the user first has to provide it by saying something like "DDMM hh:mm", and then I find the matching date values in the text. I've created something using JFlex. I will also post it here after I clean it up.
hakan
+3  A: 

I guess you have a limited alphabet that your time formats can be constructed of. That means, "HH" would always be "hours" on the 24-hour clock, "dd" always the day with leading zero, and so on.

Because of the sequential nature of a time format, you could try to tokenize a format string of "dd/mm/yyyy HH:nn" into an array ["dd", "/", "mm", "/", "yyyy", " ", "HH", ":", "nn"]. Then form a pattern string from that array by replacing "HH" with "([01][0-9]|2[0-3])" and so on. Preconstruct these pattern atoms in a lookup table/array. All parts of your array that are not in the lookup table are literals. Escape them according to regex rules and append them to your pattern string.
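
The steps above can be sketched in Java roughly as follows. The token alphabet ("dd", "mm", "yyyy", "HH", "nn") is taken from the example format string and is an assumption, not the SimpleDateFormat alphabet; the class name `FormatToRegex` and the method `toRegex` are hypothetical, and the lookup table is intentionally incomplete.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class FormatToRegex {
    // Lookup table: format token -> regex atom (assumed alphabet, longest tokens first).
    private static final Map<String, String> ATOMS = new LinkedHashMap<>();
    static {
        ATOMS.put("yyyy", "([0-9]{4})");
        ATOMS.put("dd", "(0[1-9]|[12][0-9]|3[01])");
        ATOMS.put("mm", "(0[1-9]|1[0-2])");
        ATOMS.put("HH", "([01][0-9]|2[0-3])");
        ATOMS.put("nn", "([0-5][0-9])");
    }

    // Walks the format string left to right, emitting an atom for each known
    // token and a quoted literal for everything else.
    public static String toRegex(String format) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < format.length()) {
            String matched = null;
            for (String token : ATOMS.keySet()) {
                if (format.startsWith(token, i)) {
                    matched = token;
                    break;
                }
            }
            if (matched != null) {
                regex.append(ATOMS.get(matched));
                i += matched.length();
            } else {
                // Not in the lookup table: treat the character as a literal and escape it.
                regex.append(Pattern.quote(String.valueOf(format.charAt(i))));
                i++;
            }
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(toRegex("dd/mm/yyyy HH:nn"));
        System.out.println(p.matcher("sometext 12/03/2004 10:15 sometext").find());
    }
}
```

`Pattern.quote` handles the escaping of literals, so separators such as "." or "(" in a format string do not need special treatment.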


EDIT: As a side effect of a regex-based solution, when you put all regex "atoms" of your lookup table in parentheses and keep track of their order in a given format string, you can use sub-matches to extract the required components from a match and feed them into a CreateDate function, thus skipping the ParseDate part altogether.
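
A rough illustration of the sub-match idea, assuming the format "dd/mm/yyyy HH:nn" and using `java.time.LocalDateTime.of` in the role of the CreateDate function. The class `GroupExtraction`, the method `extract`, and the `FIELDS` array are hypothetical names; the regex is written out by hand here to keep the example self-contained.

```java
import java.time.DateTimeException;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupExtraction {
    // Hand-written regex for "dd/mm/yyyy HH:nn"; every atom is a capturing group.
    private static final Pattern PATTERN = Pattern.compile(
            "(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/([0-9]{4}) ([01][0-9]|2[0-3]):([0-5][0-9])");

    // Field behind each capturing group, in the order the tokens appear in the format.
    private static final String[] FIELDS = {"day", "month", "year", "hour", "minute"};

    public static List<LocalDateTime> extract(String text) {
        List<LocalDateTime> result = new ArrayList<>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            Map<String, Integer> v = new HashMap<>();
            for (int i = 0; i < FIELDS.length; i++) {
                v.put(FIELDS[i], Integer.parseInt(m.group(i + 1)));
            }
            try {
                result.add(LocalDateTime.of(v.get("year"), v.get("month"), v.get("day"),
                        v.get("hour"), v.get("minute")));
            } catch (DateTimeException e) {
                // Each field was in range on its own, but the combination
                // (for example 31 February) is not a real date; skip it.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("a 12/03/2004 10:15 b 31/02/2004 00:00 c"));
    }
}
```

Because the field order is recorded separately from the pattern, the same extraction loop would work for a format such as "yyyy-mm-dd" once the pattern and the order array are generated from the format string.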

Tomalak
This works decently, but it is rather English-centric. "ddd" might be mapped to (mon|tue|wed|thu|fri|sat|sun) then, but you'd need a locale-dependent mapping. It gets worse when the date format generates non-ASCII digits. See M.Kaplan's blog for far more details on i18n.
MSalters
Yes, that's exactly what I need. I thought of doing something similar since I couldn't find anything that already exists. For parsing the dateTimeFormat, I used JFlex. So if it's "d" it should match 1 or 2 digits, or if it's "ddd" it should match 3 digits, etc. However, I still need to improve it for i18n.
hakan
@MSalters: Could you provide the link to the blog that you mentioned? Thanks.
hakan
For locale-dependent patterns you could abstract that further and construct the lookup table from your language's locale date functions. Prior knowledge of the language your dates are in is of course required, but then it is quite straightforward.
Tomalak
Be aware that "dd" for "2 digits" will yield false positives. "99" would match, but certainly is not valid as a date component other than "two-digit year".
Tomalak
Yes, it's obvious; I just didn't have enough characters left to explain in more detail when adding the comment. The real burden is, as MSalters said, non-ASCII digits like Japanese and Arabic numerals. I think a different regex should be created for each digit set.
hakan
Of course. As I said, I think you can pull most of them out of the locale-specific date functions of your programming language of choice, and some of them might need to be hand-crafted, if you know what you are doing. I18N is hard if you take it to the max.
Tomalak