I want to parse data which might contain mixed patterns like

1-4pm
1pm-5pm
noon to 11pm
noon to midnight
etc.

I want to extract the start and end time. How can I achieve this with a regex? I know I can't support every possible input format, but how can I support as many as possible?

A: 

Without much to go on, it looks like you can split based on "-" or "to".

^(.+) ?(-|to) ?(.+)$

That will capture the start time in the first group and the end time in the third. If you want a more specific syntax, you'll have to specify which variant of regex you intend to use.

Welbog
The greedy "+" is a bit problematic. I think it's better to change at least the first paren to `(.+?)`. But I take it as read that this is what you mean by "you'll have to specify which variant of regex". ;-)
Tomalak
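To illustrate the point about the greedy quantifier, here is a small sketch in Python (the language is an assumption, since the question does not name one). The helper name `split_range` and the sample input are made up for the illustration. The two patterns differ only in the first group.

```python
import re

# Pattern from the answer above: the first group is greedy.
GREEDY = re.compile(r"^(.+) ?(-|to) ?(.+)$")
# Variant suggested in the comment: the first group is lazy.
LAZY = re.compile(r"^(.+?) ?(-|to) ?(.+)$")


def split_range(pattern, text):
    """Return (start, end) as raw strings, or None if there is no separator."""
    m = pattern.match(text)
    return (m.group(1), m.group(3)) if m else None


sample = "noon to 5pm tomorrow"  # "tomorrow" happens to contain "to"
print(split_range(GREEDY, sample))  # ('noon to 5pm ', 'morrow')
print(split_range(LAZY, sample))    # ('noon', '5pm tomorrow')
```

The greedy version splits at the last "to" it can find, the lazy one at the first. Neither version validates that the two halves are actually times.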
+2  A: 

First, define a pattern that matches a single point in time. Given your examples it might be something like:

(noon|midnight|[0-9]+\s?(am|pm)?)

Next, define the separator. Perhaps:

(to|\-)

Finally, combine two of the first with one of the second. Assuming your language supports variables, something like:

set timePattern {(noon|midnight|[0-9]+\s?(am|pm)?)}
set separator {(to|\-)}
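# Brace the variable names so "(" is not read as an array index, and double the backslashes inside double quotes.
set fullPattern "${timePattern}(\\s*${separator}\\s*${timePattern})?"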

Once you pass that through the engine you should be able to get at the parts of the expression that matched. You might need to make some groups non-capturing, but I'll leave that as an exercise for the reader. You'll then likely have to parse the individual parts to figure out the time. For example, split "1pm" into the number 1 and the suffix "pm", and calculate a time from those.
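As a rough sketch of the same build-up in Python (an assumption, the answer itself is language-neutral), the pieces can be composed with non-capturing groups and two named groups. The helpers `to_hour` and `parse_range` are hypothetical names, and letting "1-4pm" borrow the "pm" from the end time is a guess about what the input means, not something the question states.

```python
import re

# One point in time: a keyword, or an hour with an optional am/pm suffix.
TIME = r"(?:noon|midnight|[0-9]+\s?(?:am|pm)?)"
SEP = r"(?:to|-)"
# Named groups capture the two endpoints; the end part is optional.
FULL = re.compile(rf"^(?P<start>{TIME})(?:\s*{SEP}\s*(?P<end>{TIME}))?$")


def to_hour(token, default_suffix=None):
    """Convert 'noon', 'midnight', '1pm' or '4' to an hour in 0-23."""
    token = token.strip()
    if token == "noon":
        return 12
    if token == "midnight":
        return 0
    m = re.fullmatch(r"([0-9]+)\s?(am|pm)?", token)
    hour, suffix = int(m.group(1)), m.group(2) or default_suffix
    if suffix == "pm" and hour < 12:
        hour += 12
    elif suffix == "am" and hour == 12:
        hour = 0
    return hour


def parse_range(text):
    """Return (start_hour, end_hour), or None if the text does not match."""
    m = FULL.match(text.strip())
    if not m:
        return None
    start, end = m.group("start"), m.group("end")
    if end is None:
        return to_hour(start), None
    # Assumption: in "1-4pm" the start inherits the end's am/pm suffix.
    end_suffix = re.search(r"(am|pm)$", end)
    inherit = end_suffix.group(1) if end_suffix else None
    return to_hour(start, inherit), to_hour(end)


for sample in ["1-4pm", "1pm-5pm", "noon to 11pm", "noon to midnight"]:
    print(sample, "->", parse_range(sample))
```

Here "midnight" maps to hour 0. Whether an end time of midnight should instead count as hour 24 depends on how the result is used.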

Once you have it broken down like that it makes it easier to fiddle with the constituent parts and makes the expression a bit more comprehensible. Though, the same thing can be accomplished in some languages that support multiline expressions with comments.
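Python's `re.VERBOSE` flag is one example of a multiline expression with comments. The sketch below writes the same pattern that way. The group names `start` and `end` are chosen for the illustration only.

```python
import re

# In verbose mode, unescaped whitespace and "#" comments inside the pattern are ignored.
RANGE = re.compile(r"""
    ^\s*
    (?P<start> noon | midnight | [0-9]+ \s? (?:am|pm)? )   # start time
    \s* (?: to | - ) \s*                                   # separator
    (?P<end>   noon | midnight | [0-9]+ \s? (?:am|pm)? )   # end time
    \s*$
""", re.VERBOSE)

m = RANGE.match("noon to 11pm")
print(m.group("start"), m.group("end"))  # noon 11pm
```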

Bryan Oakley
You shouldn't escape the dash in the separator expression, that's unnecessary. +1 for the approach.
Tomalak
You are correct that, strictly speaking, it's not necessary. It's a habit I've gotten myself into since you have to treat "-" specially in a range. I tend to automatically protect it everywhere. <shrug>
Bryan Oakley
+1  A: 

Depending on language, you can 'build-up' a matching pattern. Ruby, for example, will allow you to do something like:

time_spec = /noon|midnight|\d{1,2}/
sep = /-|to/
match = /#{time_spec}\s*#{sep}\s*#{time_spec}/

But since this seems like something that will become much more complex as it gets extended, why not create some sort of parser (using flex/yacc?) that will be much easier to maintain than a regex? Once you start supporting a range of inputs such as 1pm/1p/13:00/13, regexes start creating more problems than solutions.
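A full flex/yacc grammar is beyond a short example, so the sketch below swaps in a small hand-written parser in Python to show the general idea: split on the separator first, then interpret each side on its own. The function names `parse_time` and `parse_range` and the exact set of accepted spellings are assumptions made for this illustration.

```python
import re

# Hour, optional ":MM", optional a/p with an optional trailing "m".
TIME_TOKEN = re.compile(r"(\d{1,2})(?::(\d{2}))?\s*(?:([ap])m?)?$")


def parse_time(token):
    """Return (hour, minute) in 24-hour form, or None if not understood."""
    token = token.strip().lower()
    if token == "noon":
        return (12, 0)
    if token == "midnight":
        return (0, 0)
    m = TIME_TOKEN.match(token)
    if not m:
        return None
    hour, minute, half = int(m.group(1)), int(m.group(2) or 0), m.group(3)
    if half == "p" and hour < 12:
        hour += 12
    elif half == "a" and hour == 12:
        hour = 0
    if hour > 23 or minute > 59:
        return None
    return (hour, minute)


def parse_range(text):
    """Split on "-" or the word "to", then parse each side separately."""
    parts = re.split(r"\s*(?:-|\bto\b)\s*", text.strip().lower(), maxsplit=1)
    if len(parts) != 2:
        return None
    start, end = parse_time(parts[0]), parse_time(parts[1])
    return None if start is None or end is None else (start, end)


print(parse_range("1p - 13:30"))  # ((13, 0), (13, 30))
```

Adding a new spelling then means touching only `parse_time`, not one large expression.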

ezpz
A: 

This is my expression:

^((?<dayPart>[a-z]+)?)\s*(?<startTime>[0-9]{1,2}[:]?[0-9]{0,2}\s*[am|pm|a.m|p.m][.])?\s*[-|to|\|/|=]\s((?<endPart>[a-z]+)?|(?<EndTime>[0-9]{1,2}[:]?[0-9]{0,2}\s*[am|pm|a.m|p.m][.]))?$

It covers almost all combinations. I just want to know whether this regex can be optimized. Here dayPart consumes all leading non-digit characters, which handles a time span that starts with noon, midnight, etc., or with a value we can ignore, such as Sunday. startTime then tries to consume a time in any format, if one is present. The same applies to endPart and EndTime.

ZafarYousafi
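One thing worth checking in an expression like the one above: square brackets define a set of single characters, so `[am|pm|a.m|p.m]` matches exactly one of `a`, `m`, `p`, `|` or `.`, not one of the listed suffixes. A short Python sketch of the difference follows. The `SUFFIX` pattern is only a suggestion for am/pm with optional dots, not a drop-in fix for the whole expression.

```python
import re

CHAR_CLASS = re.compile(r"[am|pm|a.m|p.m]")  # one character from the set
ALTERNATION = re.compile(r"(?:am|pm)")       # one of the two suffixes
SUFFIX = re.compile(r"[ap]\.?m\.?")          # am, pm, a.m, a.m., p.m, p.m.

print(bool(CHAR_CLASS.fullmatch("|")))    # True: "|" is just a member of the set
print(bool(CHAR_CLASS.fullmatch("pm")))   # False: the class matches a single character
print(bool(ALTERNATION.fullmatch("pm")))  # True
print(bool(SUFFIX.fullmatch("a.m.")))     # True
```

The same applies to the separator class `[-|to|\|/|=]`, which matches a single character such as `-`, `t` or `o` rather than the word "to".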