One way to do it is via RegexKitLite's Blocks support:
NSString *string = @"Mon-Wed 930-1700 Thu 900-1700 Fri 930-1700\nMon-Wed 930-1700 Thu 900-1700 Fri 930-1700 Sat 900-1200, Home Lending Sat 900-1600\nMon-Thu 900-1600, Fri 900-1700";
NSString *replaced = [string stringByReplacingOccurrencesOfRegex:@"(?<=[[:Pattern_Syntax:][:White_Space:]]|\\A)(\\d{1,2})(\\d{2,2})(?=[[:Pattern_Syntax:][:White_Space:]]|\\z)" usingBlock:^NSString *(NSInteger captureCount, NSString * const capturedStrings[captureCount], const NSRange capturedRanges[captureCount], volatile BOOL * const stop) {
NSInteger hour = [capturedStrings[1] integerValue];
NSString *amOrPMString = @"am";
if(hour >= 12) { amOrPMString = @"pm"; if(hour > 12) { hour -= 12; } }
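return([NSString stringWithFormat:@"%ld:%@%@", (long)hour, capturedStrings[2], amOrPMString]); // NSInteger is cast to long so the format specifier is correct on both 32-bit and 64-bit builds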
}];
NSLog(@"Replaced:\n%@", replaced);
When run, prints out the following:
2010-07-10 17:42:10.650 RegexKitLite[26086:a0f] Replaced:
Mon-Wed 9:30am-5:00pm Thu 9:00am-5:00pm Fri 9:30am-5:00pm
Mon-Wed 9:30am-5:00pm Thu 9:00am-5:00pm Fri 9:30am-5:00pm Sat 9:00am-12:00pm, Home Lending Sat 9:00am-4:00pm
Mon-Thu 9:00am-4:00pm, Fri 9:00am-5:00pm
EDIT 2010/07/11 - Add info per OP's request.
An explanation of the regex used in the example is as follows (broken down into its four most logical chunks):
1: (?<=[[:Pattern_Syntax:][:White_Space:]]|\A)
2: (\d{1,2})
3: (\d{2,2})
4: (?=[[:Pattern_Syntax:][:White_Space:]]|\z)
Part 1
The sequence (?<= ... ) means "a look-behind assertion". In prose, it roughly translates to: "If the next part of the regex (in this case, #2) matches, then the text just before #2 must be matched by the regex enclosed by these parentheses".
The regex "enclosed by these parentheses" in this case is [[:Pattern_Syntax:][:White_Space:]]|\A. In rough prose, this regex says: "Match any character that is in the set of characters that have the Unicode property Pattern_Syntax or White_Space, or match \A", where \A means "Match at the beginning of the input. Differs from ^ in that \A will not match after a new-line within the input."
White_Space characters are characters such as ' ' (a space), '\t' (a tab), new-lines, and so on. Pattern_Syntax characters are characters like '-', ',', '%', etc.
Parts 2 and 3
These parts are fairly obvious. The \d matches a "digit" character, like '0'..'9', and the {x,y} means "Match at least x, but not more than y times".
Part 4
Part 4 is essentially identical to part 1, except it uses a "look-ahead assertion" in the form of (?= ... ), and the meaning should hopefully be obvious from the context of the explanation in part 1. Another difference is the use of \z, which means "Match if the current position is at the end of input."
Why are \A and \z needed? In case the time is the very first thing in the string, or the very last thing in the string, since the [] set of characters to match does not include "or no character if at either the start or end of the text to match". For example, the OP's example string ends with ..., Fri 900-1700. Without the |\z, the regex would not match that last 1700.
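The effect of \A and \z can be checked with a small sketch. It uses Foundation's NSRegularExpression instead of RegexKitLite, on the assumption that it accepts the same pattern because it is also backed by ICU. The helper MatchedTimes and the two pattern constants are hypothetical names made up for this illustration.

```objc
#import <Foundation/Foundation.h>

// The pattern from the example above, with the \A and \z alternatives.
static NSString * const kPatternWithAnchors =
    @"(?<=[[:Pattern_Syntax:][:White_Space:]]|\\A)(\\d{1,2})(\\d{2})(?=[[:Pattern_Syntax:][:White_Space:]]|\\z)";

// The same pattern with the \A and \z alternatives removed.
static NSString * const kPatternWithoutAnchors =
    @"(?<=[[:Pattern_Syntax:][:White_Space:]])(\\d{1,2})(\\d{2})(?=[[:Pattern_Syntax:][:White_Space:]])";

// Hypothetical helper: returns every substring of `text` matched by `pattern`.
static NSArray<NSString *> *MatchedTimes(NSString *pattern, NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];
    if (regex == nil) { return @[]; } // The pattern failed to compile.

    NSMutableArray<NSString *> *found = [NSMutableArray array];
    NSRange whole = NSMakeRange(0, text.length);
    for (NSTextCheckingResult *match in [regex matchesInString:text options:0 range:whole]) {
        [found addObject:[text substringWithRange:match.range]];
    }
    return found;
}
```

Calling MatchedTimes(kPatternWithAnchors, @"900-1700") and MatchedTimes(kPatternWithoutAnchors, @"900-1700") side by side shows the difference: in that string the first time has nothing before it and the second has nothing after it, so only the anchored pattern can match them.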
Why are parts 1 and parts 4 needed? They may not be, depending on the exact format of the text string to be matched. Since I can't say much about the format of the input string, I tried to make it "fairly robust" and tolerant of a wide range of reasonable input. There's definitely more than one way to do this.
What the Block does
The ^{} Block is called each time the regular expression is matched. Details about what was matched are passed as arguments to the Block. The Block then returns a new string that is used to replace all of the text that was matched by the regex. This process is repeated until there are no more matches of the regex in the string.
To be clear, the original string is only scanned once. For example, the regex given essentially matches any "number" in the form of "NNN" or "NNNN". For each match, the Block is called, and then the search for the next match in the original string picks up at the very next character after the last match. It does not "go back" or "start over" in any way.
The original string is not modified in any way. Instead, an entirely new string is constructed. It is built up bit by bit from the "text in between matches" and the replacement strings returned by the Block. When all the replacements are finished, this is the string that is returned.
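The process described above can be sketched by hand. This is not RegexKitLite's implementation, only an illustration of the same idea written against Foundation's NSRegularExpression (assuming it accepts the same ICU pattern). ReplaceTimes is a hypothetical helper name, and the am/pm logic is copied from the first example.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds a new string from the "text in between matches"
// plus a replacement for each match. The input string is never modified.
static NSString *ReplaceTimes(NSString *input) {
    NSString *pattern =
        @"(?<=[[:Pattern_Syntax:][:White_Space:]]|\\A)(\\d{1,2})(\\d{2})(?=[[:Pattern_Syntax:][:White_Space:]]|\\z)";
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];
    if (regex == nil) { return input; } // The pattern failed to compile.

    NSMutableString *output = [NSMutableString string];
    __block NSUInteger copiedUpTo = 0; // End of the last match already handled.

    [regex enumerateMatchesInString:input
                            options:0
                              range:NSMakeRange(0, input.length)
                         usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop) {
        if (match == nil) { return; }

        // 1. Copy the untouched text between the previous match and this one.
        NSRange gap = NSMakeRange(copiedUpTo, match.range.location - copiedUpTo);
        [output appendString:[input substringWithRange:gap]];

        // 2. Append the replacement (same logic as the Block in the first example).
        NSInteger hour = [[input substringWithRange:[match rangeAtIndex:1]] integerValue];
        NSString *minutes = [input substringWithRange:[match rangeAtIndex:2]];
        NSString *suffix = (hour >= 12) ? @"pm" : @"am";
        if (hour > 12) { hour -= 12; }
        [output appendFormat:@"%ld:%@%@", (long)hour, minutes, suffix];

        // 3. The search continues right after this match; it never goes back.
        copiedUpTo = NSMaxRange(match.range);
    }];

    // Copy whatever follows the final match.
    [output appendString:[input substringFromIndex:copiedUpTo]];
    return output;
}
```

For example, ReplaceTimes(@"Mon-Wed 930-1700") walks the string once, copies "Mon-Wed " and "-" through unchanged, and substitutes a replacement for each of the two times.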
EDIT 2010/07/12 - Add some additional information per OP's (additional) request.
Q If I was more confident on the input format being consistent (such as Day-DaySpaceTime OR DaySpaceTime) could I just have some regex something like this? (\s|-?)(\d{1,2})(\d{2,2})(;?|\s?|-?).
A If you were more confident on the input format, the regex could definitely be changed. For example, if you were "absolutely positive" that the input was always going to be in the form of nNNN-nNNN (where the lower case n represents "an optional digit", as in 900-1730 vs 1100-1915) for "times", the code could be changed to something like:
NSString *string = @"Mon-Wed 930-1700 Thu 900-1700 Fri 930-1700\nMon-Wed 930-1700 Thu 900-1700 Fri 930-1700 Sat 900-1200, Home Lending Sat 900-1600\nMon-Thu 900-1600, Fri 900-1700";
NSString *replaced = [string stringByReplacingOccurrencesOfRegex:@"\\b(\\d{1,2})(\\d{2,2})\\-(\\d{1,2})(\\d{2,2})\\b" usingBlock:^NSString *(NSInteger captureCount, NSString * const capturedStrings[captureCount], const NSRange capturedRanges[captureCount], volatile BOOL * const stop) {
NSInteger firstHour = [capturedStrings[1] integerValue], secondHour = [capturedStrings[3] integerValue];
NSString *firstAMorPMString = @"am", *secondAMorPMString = @"am";
if(firstHour >= 12) { firstAMorPMString = @"pm"; if(firstHour > 12) { firstHour -= 12; } }
if(secondHour >= 12) { secondAMorPMString = @"pm"; if(secondHour > 12) { secondHour -= 12; } }
if(firstHour == 0) { firstHour = 12; }
if(secondHour == 0) { secondHour = 12; }
return([NSString stringWithFormat:@"%ld:%@%@-%ld:%@%@", (long)firstHour, capturedStrings[2], firstAMorPMString, (long)secondHour, capturedStrings[4], secondAMorPMString]); // NSInteger values are cast to long to match the %ld format specifier
}];
NSLog(@"Replaced:\n%@", replaced);
This example processes both "times" as a single chunk. The \b present at the beginning and end of the regex means "Match if the current position is a word boundary". This prevents it from matching something like abc123-456def. It is a simpler form of the more complicated [[:Pattern_Syntax:][:White_Space:]] stuff in the original example, but it doesn't necessarily mean exactly the same thing (though it is fairly close "for most purposes").
Another advantage to matching both times as a single chunk is it reduces the number of potential "false matches" that can happen if just matching for one time. For example, the first example would turn a "comment" of "Home econ 101" in to "Home econ 1:01am", which is probably not what you want. :)
I also modified the example so that a "military 24 hour time" of "000" means "12:00am", so it makes an assumption that the time values parsed are always in 24 hour military time format.
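The "false match" point can be checked by counting matches for both patterns. The sketch below uses Foundation's NSRegularExpression rather than RegexKitLite (assuming it accepts the same ICU patterns). CountMatches and the two pattern constants are hypothetical names used only for illustration.

```objc
#import <Foundation/Foundation.h>

// Pattern from the first example: matches a single time such as 930 or 1700.
static NSString * const kSingleTimePattern =
    @"(?<=[[:Pattern_Syntax:][:White_Space:]]|\\A)(\\d{1,2})(\\d{2})(?=[[:Pattern_Syntax:][:White_Space:]]|\\z)";

// Pattern from the second example: matches a whole range such as 930-1700.
static NSString * const kTimeRangePattern =
    @"\\b(\\d{1,2})(\\d{2})\\-(\\d{1,2})(\\d{2})\\b";

// Hypothetical helper: number of times `pattern` matches in `text`.
static NSUInteger CountMatches(NSString *pattern, NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];
    if (regex == nil) { return 0; } // The pattern failed to compile.
    return [regex numberOfMatchesInString:text
                                  options:0
                                    range:NSMakeRange(0, text.length)];
}
```

Comparing CountMatches(kSingleTimePattern, @"Home econ 101") with CountMatches(kTimeRangePattern, @"Home econ 101") shows that only the single-time pattern picks up the "101", while CountMatches(kTimeRangePattern, @"abc123-456def") shows the effect of the \b word boundaries.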
Q Also is the (?<= ... ) look behind syntax part of RegexKitLite or is that standard regex?
A It is part of the regex syntax accepted by the ICU library (which is what RegexKitLite uses to perform the actual regular expression matching). There is no "standard regular expression syntax" per se, though both (?<=...) and (?=...) are accepted by "most" regular expression engines.
Q Sorry, In your regex you had four sets of () does that mean that there is a capturedStrings[0],capturedStrings[1],capturedStrings[2],capturedStrings[3]?
A The (?<=...) and (?=...) patterns are what are known as "zero width assertions". They do not contribute to the text that is matched by the regular expression, but they must succeed in order for the overall regular expression to match. In other words, an assertion only checks the text next to the current position without consuming it, whereas an ordinary part of the pattern consumes the text it matches. This allows you to create regular expressions like (\d+)(?=,), which means "Match and capture one or more digits which must be followed by a ',', but do not include the trailing comma in the match". Look-ahead and look-behind are definitely advanced, non-novice features of regular expressions which are difficult to explain fully in a short post like this.
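Of particular note, however, is that neither (?<=...) nor (?=...) counts as a "capture", unlike (\d{1,2})(\d{2,2}). The full regular expression from the original example contains only two captures even though there are a total of four parenthesized groups. Inside the Block, capturedStrings[0] is the entire match, and capturedStrings[1] and capturedStrings[2] are those two captures, which is how the example code uses them.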
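Both points can be checked with a short sketch. It relies on Foundation's NSRegularExpression instead of RegexKitLite (assuming the same ICU syntax is accepted). CaptureGroupCount and FirstMatch are hypothetical helper names introduced for this illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: how many capture groups the engine sees in `pattern`.
// Look-behind and look-ahead groups are not counted.
static NSUInteger CaptureGroupCount(NSString *pattern) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];
    if (regex == nil) { return 0; } // The pattern failed to compile.
    return regex.numberOfCaptureGroups;
}

// Hypothetical helper: the text of the first match of `pattern` in `text`,
// or nil if there is no match.
static NSString *FirstMatch(NSString *pattern, NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];
    NSTextCheckingResult *match =
        [regex firstMatchInString:text options:0 range:NSMakeRange(0, text.length)];
    if (match == nil) { return nil; }
    return [text substringWithRange:match.range];
}
```

Passing the full regex from the original example to CaptureGroupCount reports the number of real captures, and FirstMatch(@"(\\d+)(?=,)", @"12,34") returns the digits without the comma that the look-ahead required.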