I have a problem with regex, using preg_match_all(), to match something of a variable length.

What I am trying to match is the traffic condition after the word 'Congestion'. What I came up with is this regex pattern:

Congestion\s*:\s*(?P<congestion>.*)

However, it matches from the first instance all the way to the end of the entire subject, since .* matches everything. That's not what I want; I would like it to match the 3 instances separately.

Since the words after Congestion can be of variable length, I can't predict how many words and spaces there are, so I can't come up with a stricter match such as \w*\s*\w*.

Any clues on how I can proceed from here?

Highway : Highway 26
Datetime : 18-Oct-2010 05:18 PM
Congestion : Traffic is slow from Smith St to Alice Springs St

Highway : Princes Highway
Datetime : 18-Oct-2010 05:18 PM
Congestion : Traffic is slow at the Flinders St / Elizabeth St intersection

Highway : Eastern Freeway
Datetime : 18-Oct-2010 05:19 PM
Congestion : Traffic is slow from Prince St to Queen St

EDIT FOR CLARITY

The nicely formatted texts shown here are actually received via a very poorly formatted HTML email. It contains random line breaks here and there, e.g. "Congestion : Traffic\n is slow from Prince\nSt to Queen St".

So while processing the emails, I stripped off all the html codes and the random line breaks, and json_encode() them into one very long single-line string with no line break...

+2  A: 
Congestion\s*:\s*Traffic is\s*(?P<c1>[^\n]*)\s*from\s*(?P<c2>[^\n]*)\s*to\s*(?P<c3>[^\n]*)$
Amarghosh
This doesn't work unfortunately. (?P<c1>.*) would already match till the end of the 3rd instance, for the first match.
blacklotus
@blacklotus Dot shouldn't match newlines unless specified with DOTALL flag - anyway, replace `.` with `[^\n]`
Amarghosh
@amarghosh They are actually all in one single line, that's why. I had to format them nicely here or else you wouldn't be able to see the pattern.
blacklotus
+4  A: 

By default, a regex treats your string as a single line, so ^ and $ anchor at the start and end of the whole subject rather than at each line. You can use the "m" (PCRE_MULTILINE) flag to change that behaviour. Then you can tell PHP to match only to the end of the line:

preg_match('/^Congestion\s*:\s*(?P<congestion>.*)$/m', $subject, $matches);

There are two things to notice: first, the pattern was modified to include line-begin (^) and line-end ($) markers. Secondly, the pattern now carries the m modifier.
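As a rough sketch of how this behaves with several records, the same pattern can be passed to preg_match_all(), which the question mentions. The sample $subject below is copied from the question and assumes each field really sits on its own line, which the asker's actual data may not satisfy.

```php
<?php
// Assumed input: one field per line, as displayed in the question.
$subject = "Highway : Highway 26\n"
    . "Datetime : 18-Oct-2010 05:18 PM\n"
    . "Congestion : Traffic is slow from Smith St to Alice Springs St\n"
    . "\n"
    . "Highway : Princes Highway\n"
    . "Datetime : 18-Oct-2010 05:18 PM\n"
    . "Congestion : Traffic is slow at the Flinders St / Elizabeth St intersection\n";

// With the m modifier, ^ and $ anchor at each line; the dot still stops at "\n".
preg_match_all('/^Congestion\s*:\s*(?P<congestion>.*)$/m', $subject, $matches);

// One entry per Congestion line.
print_r($matches['congestion']);
```

If the input has already been flattened into a single line, the $ anchor only finds the end of the whole string, so this approach would capture everything after the first "Congestion :".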

Konrad Rudolph
+1 This made the question make sense :)
jensgram
They are not actually multiline. They are actually a huge block of text that breaks at random places, which I passed through json_encode() to form a single-line string. I had to format them nicely here or else my question would be even more confusing.
blacklotus
@blacklotus: Pity. Just to check, though: where did you get the text from? JSON-encoding shouldn’t mess with the formatting of the string (once properly decoded).
Konrad Rudolph
It's actually an HTML email that is very poorly formatted. So after stripping off the html codes, and some random "=" and line breaks, I used json_encode() to form them into one huge string again so that the random line breaks wouldn't affect the match. I actually had to match the highway's name and the datetime as well but their patterns are constant so it's a lot easier to match their patterns. Really having a lot of difficulties with this.
blacklotus
+2  A: 

You can try a minimal match:

Congestion\s*:\s*(?P<congestion>.*?)

This would result in returning zero characters in the named group 'congestion' unless you could match something immediately after the congestion string.

So, this could be fixed if "Highway" always starts the traffic condition records:

Congestion\s*:\s*(?P<congestion>.*?)Highway\s*:

If this works (I have not checked it), then the first records are matched but the last record is not! This could be easily fixed by appending the text 'Highway :' at the end of the input string.
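A minimal sketch of the sentinel idea, assuming the input has been flattened to one line as the question's edit describes. The $subject string is made up from the question's sample records, and $padded and $conditions are illustrative names only.

```php
<?php
// Hypothetical single-line input built from the question's sample records.
$subject = 'Highway : Highway 26 Datetime : 18-Oct-2010 05:18 PM '
    . 'Congestion : Traffic is slow from Smith St to Alice Springs St '
    . 'Highway : Princes Highway Datetime : 18-Oct-2010 05:18 PM '
    . 'Congestion : Traffic is slow at the Flinders St / Elizabeth St intersection '
    . 'Highway : Eastern Freeway Datetime : 18-Oct-2010 05:19 PM '
    . 'Congestion : Traffic is slow from Prince St to Queen St';

// Append a sentinel so the last record is also followed by "Highway :".
$padded = $subject . ' Highway :';

// The lazy group stops at the nearest following "Highway :".
preg_match_all('/Congestion\s*:\s*(?P<congestion>.*?)Highway\s*:/', $padded, $matches);

// The capture keeps the whitespace before "Highway", so trim each entry.
$conditions = array_map('trim', $matches['congestion']);

print_r($conditions);
```

This relies on the text "Highway :" (with the colon) never appearing inside a congestion description; if it could, a different delimiter would be needed.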

Uphill_ What '1
Thank you so much! :) This works. I understand that .* matches any number of any characters, but what does ? do when added to the mix?
blacklotus
The question mark (?) after a quantifier (*, +, ?, {n,m}) makes it lazy, i.e. it forces a minimal match. For example, against the string testtesttesttest, the regex (test)+ matches once (the whole string), whereas the regex (test)+? matches four times (one "test" each).
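The difference can be checked with a short sketch using preg_match_all(), which returns the number of full-pattern matches. The variable names here are illustrative.

```php
<?php
$subject = 'testtesttesttest';

// Greedy: (test)+ takes as many repetitions as it can in a single match.
$greedyCount = preg_match_all('/(test)+/', $subject, $greedy);

// Lazy: (test)+? stops after one repetition, so matching resumes three more times.
$lazyCount = preg_match_all('/(test)+?/', $subject, $lazy);

echo "greedy: $greedyCount, lazy: $lazyCount\n"; // greedy: 1, lazy: 4
```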
Uphill_ What '1