Can you rely on the order that regular expression syntax is interpreted?

(Background: I thought it would be fun to write something that parses Wiki Creole markup. The problem I think I have a solution to is telling apart // inside a URL from // used as the opening/closing syntax for italic text.)

My question is slightly compound, so I've tried to break it up under headings.

If there is a substring (S1) that can contain any one of a series of substrings separated by `|`, does the regular expression interpreter simply match the first substring within 'S1' and then move on to the regular expression after 'S1'? Or will it in some instances try to find the best/greediest match?

Here is an example to make my question clearer:
String to search within: String
Regex: /(?:(Str|Strin).*)/ (the 'S1' in my question refers to the non-capturing substring)

I think that the matches from the above should be:
$0 will be String
$1 will be Str and not Strin

Will this always happen, or are there instances (e.g. maybe 'S1' being matched greedily using *) where another matching substring will be used, i.e. Strin in my example?

If the above is correct, then can I/should I rely on this behaviour?

Real world example

/^\/\/(\b((https?|ftp):\/\/|mailto:)([^\s~]*?(?:~(.|$))?)+?(?=\/\/|\s|$)|~(.|$)|[^/]|\/([^/]|$))*\/\//

Should correctly match:

//Some text including a http://url//

With $1 == Some text including a http://url

Note: I've tried to make this relatively language-agnostic, but I will be using PHP.
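One way to check the expectation above is to run the pattern. The sketch below uses Python's `re` module as a stand-in for PHP's preg functions (an assumption: both are backtracking engines, and the pattern uses no syntax specific to either) and assumes the intended test string contains `http://url`. The names `INNER`, `ORIGINAL` and `WRAPPED` are made up for illustration. It suggests that a capturing group repeated with `*` keeps only its last repetition, so an extra outer group is needed to capture the whole italic text.

```python
import re

# The alternatives inside the repeated group, copied from the question.
INNER = (
    r"\b((https?|ftp):\/\/|mailto:)([^\s~]*?(?:~(.|$))?)+?(?=\/\/|\s|$)"
    r"|~(.|$)|[^/]|\/([^/]|$)"
)

# The regex as posted: the capturing group itself is repeated.
ORIGINAL = re.compile(r"^\/\/(" + INNER + r")*\/\/")

# Hypothetical variant: repeat a non-capturing group and wrap the whole
# repetition in one capturing group.
WRAPPED = re.compile(r"^\/\/((?:" + INNER + r")*)\/\/")

text = "//Some text including a http://url//"

original_match = ORIGINAL.match(text)
wrapped_match = WRAPPED.match(text)

print(original_match.group(0))  # the whole input is matched
print(original_match.group(1))  # only the last repetition of the group
print(wrapped_match.group(1))   # everything between the outer // markers
```

With the wrapped variant, the first group holds the full text between the `//` markers, which is what the question expects from $1.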

+3 A:

PHP uses the PCRE regex engine. By default, and the way PHP uses it, PCRE runs as a backtracking matcher with Perl semantics: it scans the subject from left to right, tries the alternatives in the order they are written, and returns the first match it finds. A later alternative is only tried if the rest of the pattern fails after an earlier one. So yes, you can rely on the order in which PHP interprets a regex.

The other mode, provided by the pcre_dfa_exec() function, evaluates all possible matches at a given starting point and returns the longest one first.
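To make the ordering concrete, here is a minimal sketch using Python's `re` module, which is also a backtracking engine with ordered alternation. That PHP's preg_match gives the same results for these two patterns is an assumption, not something tested here. The second pattern shows the one situation where a later alternative is chosen: the rest of the regex cannot match after the first one.

```python
import re

text = "String"

# Alternatives are tried left to right. "Str" succeeds and ".*" can
# consume the rest, so the longer "Strin" is never tried.
first = re.search(r"(?:(Str|Strin).*)", text)
print(first.group(0), first.group(1))

# Here "Str" matches, but the following "g" does not (the next
# character is "i"), so the engine backtracks and tries "Strin".
forced = re.search(r"(Str|Strin)g", text)
print(forced.group(0), forced.group(1))
```

So the order is dependable, but "first alternative wins" really means "first alternative that lets the whole pattern match wins".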

Andomar 2009-12-21 16:14:35

This pcre_dfa_exec() function is not available in my PHP 5.2.11. How do you turn on that mode?

Arno 2009-12-21 16:22:05

It's a C library function provided by PCRE. You could call it from the PHP source code. There's a PHP bug report asking to make it available, but its status is still Assigned: http://bugs.php.net/bug.php?id=34121

Andomar 2009-12-21 16:40:08

In PHP, using the preg extension, you can choose between greedy and non-greedy quantifiers (usually by appending '?' to them).

By the way, in the example you gave, if you want Strin to match, you must swap the alternatives: /(?:(Strin|Str).*)/. I think you should put the most generic alternative last in the alternation.
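A short sketch of both points, again using Python's `re` as a stand-in for preg (an assumption; the variable names are illustrative). Swapping the alternatives changes what the group captures, while greedy versus lazy only affects the quantifier after it.

```python
import re

text = "String"

# Longer alternative listed first: it is tried first and wins.
swapped = re.search(r"(?:(Strin|Str).*)", text)
print(swapped.group(1))

# Greedy ".*" takes as much as it can after the chosen alternative.
greedy = re.search(r"(Str|Strin)(.*)", text)
print(greedy.group(1), repr(greedy.group(2)))

# Lazy ".*?" takes as little as it can, which here is nothing at all.
# The alternation still picks "Str" in both cases.
lazy = re.search(r"(Str|Strin)(.*?)", text)
print(lazy.group(1), repr(lazy.group(2)))
```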

FYI, with the preg engine,

"alternation operator is neither greedy nor lazy but ordered"

Mastering Regular Expressions, J. Friedl, p. 175

If you want alternation to pick the longest match regardless of order, you must use a POSIX-compliant engine (ereg, but it's deprecated).
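If a POSIX engine is not an option, a common workaround on an ordered-alternation engine is to sort the alternatives longest first before building the pattern. The sketch below is in Python, and `longest_first` is a hypothetical helper written for this example. It only handles literal alternatives and does not reproduce full POSIX leftmost-longest semantics.

```python
import re


def longest_first(alternatives):
    """Build a capturing alternation that tries longer literals first."""
    ordered = sorted(alternatives, key=len, reverse=True)
    return "(" + "|".join(re.escape(alt) for alt in ordered) + ")"


# The order in which the alternatives are supplied no longer matters.
pattern = longest_first(["Str", "Strin"]) + ".*"
match = re.search(pattern, "String")

print(pattern)
print(match.group(1))
```

Because the engine still tries alternatives in order, this works only as long as the longest literal is also the one you want whenever several could match.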

Arno 2009-12-21 16:15:35
