(.*) means that the capture tolerates any number of repetitions of " SCF SF " before the one that actually marks the next capture. Making it non-greedy does not remove that: you are still allowing for 'SCF SF' to appear inside the capture after 'FF'. I think you are handling a lot of cases you don't need.
The best way to optimize a regular expression sometimes makes it more cryptic, but you can almost always find ways to make the expression fail earlier. (.*?), while not "greedy", is definitely too tolerant.
Below is a more verbose, but faster-failing, alternative to your second capture:
((?:[^S]|S[^C]|SC[^F]|SCF[^ ]|SCF [^S]|SCF S[^F])*)
But you can optimize it even more if you decide that the string \bSCF\b should automatically commit the capture, and that only "\bSCF SF\b" can follow it. Then you could rewrite it as:
((?:[^S]|S[^C]|SC[^F]|SCF\B)*) SCF SF
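To check that the stricter alternations still capture what the lazy version does, here is a minimal sketch in Perl. The sample lines are invented for illustration (the real input is not shown here), the helper capture_with is a hypothetical name, and each pattern is anchored with ^ only to keep the comparison simple.

```perl
use strict;
use warnings;

# Hypothetical sample lines, invented for this sketch.
my @samples = (
    'Hello world SCF SF tail',
    'a SCFX b SCF SF tail',      # SCF inside a longer word is not a delimiter
);

my %pattern = (
    lazy    => qr{^(.*?) SCF SF},
    verbose => qr{^((?:[^S]|S[^C]|SC[^F]|SCF[^ ]|SCF [^S]|SCF S[^F])*) SCF SF},
    word    => qr{^((?:[^S]|S[^C]|SC[^F]|SCF\B)*) SCF SF},
);

# Return the first capture, or undef when the pattern does not match.
sub capture_with {
    my ($name, $line) = @_;
    return $line =~ $pattern{$name} ? $1 : undef;
}

for my $line (@samples) {
    for my $name (sort keys %pattern) {
        my $got = capture_with($name, $line);
        printf "%-8s %s\n", $name, defined $got ? "[$got]" : 'no match';
    }
}
```

On these two lines all three patterns should agree on the capture; the difference only shows up in how much work the engine does on lines that do not match.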
But you can optimize these even more with backtracking control, if you think there is no way in the world that SCF would ever occur as a word on valid input without being followed by SF. To do that, you wrap the capture in another group, delimited by (?> and ):
(?>((?:[^S]|S[^C]|SC[^F]|SCF\B)*))SCF SF
Note that the space before SCF is gone from the pattern: [^S] already consumes it, and the atomic group will not give it back, so the capture ends with that space.
That means that the matching logic will in no way try to reassess what it captured. If the characters after it fail to be "SCF SF ", the whole expression fails. And it fails long before it ever gets to try to accommodate "MV" and other sub-expressions.
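The early failure can be seen in a small Perl sketch. The two input lines are hypothetical, made up to show one valid case and one where the word SCF is not followed by SF; the variable names are mine as well.

```perl
use strict;
use warnings;

# Atomic version: no space before SCF, because [^S] already consumed it.
my $atomic = qr{^(?>((?:[^S]|S[^C]|SC[^F]|SCF\B)*))SCF SF };
my $lazy   = qr{^(.*?) SCF SF };

# Hypothetical lines: one valid, one where the word SCF is not followed by SF.
my $valid   = 'Hello world SCF SF tail';
my $invalid = 'Hello world SCF XX tail SCF SF end';

my ($atomic_valid)   = $valid   =~ $atomic;   # capture keeps the trailing space
my ($atomic_invalid) = $invalid =~ $atomic;   # undef: gives up at the first SCF
my ($lazy_invalid)   = $invalid =~ $lazy;     # lazy version keeps scanning ahead

print defined $atomic_invalid ? "atomic matched\n" : "atomic failed early\n";
print "lazy captured [$lazy_invalid]\n" if defined $lazy_invalid;
```

The lazy pattern happily skips over "SCF XX" and matches at the later delimiter, which is exactly the tolerance the atomic group removes.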
In fact, given certain assumptions about the uniqueness of the delimiters, the fastest version of this expression would probably be:
$text_normal = qr{^(\/F\d+) FF (?>((?:[^S]|S[^C]|SC[^F]|SCF\B)*))SCF SF (?>((?:[^M]|M[^V]|MV\B)*))MV (?>(\((?:[^S]|S[^H]|SH.)*))SH$};
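As a usage sketch, this is how the anchored pattern, with every alternative separated by |, could be applied to one line. The input line is an assumption, invented to fit the pattern, since the real data is not shown here.

```perl
use strict;
use warnings;

my $text_normal = qr{^(\/F\d+) FF (?>((?:[^S]|S[^C]|SC[^F]|SCF\B)*))SCF SF (?>((?:[^M]|M[^V]|MV\B)*))MV (?>(\((?:[^S]|S[^H]|SH.)*))SH$};

# Hypothetical input line, invented to fit the pattern.
my $line = '/F12 FF 10 SCF SF 100 200 MV (Hello) SH';

# In list context the match returns the four captures, or an empty list on failure.
my @fields = $line =~ $text_normal;

# Each atomic capture swallows the space before its delimiter, so trim it.
s/\s+\z// for @fields;

print join('|', @fields), "\n";
```

The trimming step is a design choice of this sketch: the atomic groups cannot hand the separating space back, so it is cheaper to strip it afterwards than to let the engine backtrack for it.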
Additionally, the verbose, exhaustive negative matches can alternatively be expressed with negative lookaheads, but I have no idea how that affects performance. Negative lookaheads would work like this:
((?:(?! SCF).)*) SCF SF
This means that for this capture I want any character that is not the space starting the string " SCF".
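A short Perl sketch of the lookahead form, on a made-up sample line. It also shows why the order matters: the lookahead has to be tested before the dot consumes a character. With the dot first, the character right before the delimiter is rejected and the anchored pattern cannot match.

```perl
use strict;
use warnings;

# Lookahead first, then the dot: consume a character only if " SCF" does not start here.
my $tempered = qr{^((?:(?! SCF).)*) SCF SF};

# Dot first: the last character before " SCF" is refused, so the delimiter is never reached.
my $dot_first = qr{^((?:.(?! SCF))*) SCF SF};

my $line = 'Hello world SCF SF tail';   # hypothetical sample

my ($captured) = $line =~ $tempered;
my $dot_first_matches = $line =~ $dot_first ? 1 : 0;

print "captured [$captured]\n" if defined $captured;
print "dot-first form matches: $dot_first_matches\n";
```

Unlike the atomic versions, this capture does not include the space before SCF, because the lookahead stops right at it.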