(Background: I thought it would be fun to write something that parses Wiki Creole markup. The problem I think I have a solution to is distinguishing between // in a URL and // used as the opening/closing syntax for italic text.)
My question is slightly compound, so I've tried to break it up under headings.
If there is a subpattern (S1) that can contain any one of a series of alternatives separated by |, does the regular expression engine simply match the first alternative within S1 that fits and then move on to the part of the regular expression after S1? Or will it in some instances try to find the best/greediest match?
Here is an example to try to make my question clearer:
String to search within: String
Regex: /(?:(Str|Strin).*)/
(The 'S1' in my question refers to the non-capturing substring.)
I think that the matches from the above should be:
$0 will be String
$1 will be Str
and not Strin
Will this always happen, or are there instances (e.g. maybe S1 being matched greedily using *) where another matching alternative will be used, i.e. Strin in my example?
If the above is correct, then can I/should I rely on this behaviour?
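To make the question concrete, here is a minimal sketch of how I would check this, written in Python only because its re module is, like PCRE, a backtracking engine. The helper name groups is made up for illustration, and I am assuming PHP's preg_match behaves the same way for these simple patterns. It runs my example pattern, the same pattern with the alternatives swapped, and a variant where the text after S1 forces the engine to backtrack into the alternation.

```python
import re

def groups(pattern, text="String"):
    """Return (whole match, capture group 1) for the first match of pattern in text."""
    m = re.search(pattern, text)
    return (m.group(0), m.group(1)) if m else None

# My example: 'Str' is listed first, and '.*' can absorb the rest of the input,
# so the engine never has a reason to go back and try 'Strin'.
print(groups(r"(?:(Str|Strin).*)"))   # ('String', 'Str')

# Same alternatives in the opposite order: the first one that fits still wins.
print(groups(r"(?:(Strin|Str).*)"))   # ('String', 'Strin')

# Here 'Str' fits first, but the following 'g' then fails against 'i',
# so the engine backtracks into the alternation and tries 'Strin'.
print(groups(r"(?:(Str|Strin)g)"))    # ('String', 'Strin')
```

So, at least in this engine, a later alternative seems to be used only when everything after S1 fails with the earlier one, not because it is longer.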
Real world example
/^\/\/(\b((https?|ftp):\/\/|mailto:)([^\s~]*?(?:~(.|$))?)+?(?=\/\/|\s|$)|~(.|$)|[^/]|\/([^/]|$))*\/\//
Should correctly match:
//Some text including a http://url//
With $1 == Some text including a http://url
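The reason I care about the order of the alternatives can be shown with a much simplified stand-in for the pattern above. This is only a sketch: it is not the full regex, it is in Python rather than PHP, and the names URL_FIRST, URL_LAST, italic_body and the sample line (with a well-formed http:// URL) are made up for illustration. The only difference between the two patterns is whether the URL-scheme branch is tried before or after the generic "any non-slash character" branch.

```python
import re

# Simplified, hypothetical stand-ins for the full pattern above.
# URL branch first: the scheme's "://" is consumed as one unit, so its "//"
# is never considered as the closing italic delimiter.
URL_FIRST = re.compile(r"^//((?:(?:https?|ftp)://|[^/]|/(?!/))*)//")

# Same branches, but the generic non-slash branch is tried first, so "http:"
# is eaten character by character and the URL's "//" is left for the delimiter.
URL_LAST = re.compile(r"^//((?:[^/]|/(?!/)|(?:https?|ftp)://)*)//")

def italic_body(pattern, line):
    """Return capture group 1 (the italic text), or None if there is no match."""
    m = pattern.search(line)
    return m.group(1) if m else None

sample = "//Some text including a http://url//"
print(italic_body(URL_FIRST, sample))  # Some text including a http://url
print(italic_body(URL_LAST, sample))   # Some text including a http:
```

With the URL branch listed last, the match ends at the // inside the URL, which is exactly the behaviour I am trying to avoid by relying on the order of the alternatives.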
Note: I've tried to make this relatively language-agnostic, but I will be using PHP.