I'm attempting to determine the end of an English sentence (only approximately), by looking for "!", "?" or ".", but in the case of "." only when not preceeded by common abbreviations such as Mr. or Dr.
Is there any way to make the following Regular Expression even marginally more efficient? Perhaps by sorting the negative lookbehinds by descending size, or even by alphabetical order?
Here is the regular expression I have now:
((?<!St|Sgt|Rev|Ltd|Inc|Lt|Jr|Sr|Esq|Inst|Hon|Gen|Cpl|Comdr|Col|Corp|Mr|Dr|Gov|Mrs|Ms|[A-Z]|Assn|Capt)(\.)|(!)|(\?))(\s*$|\s+([_$#]|[A-Z][^.]))
The Problem:
The site at http://regex.powertoy.org/ reports: "7 matches 21044 probes (finished)" on even a simple paragraph... That outrageous size of the number 21044 seems intimately tied to the number of negative lookbehinds.
I'm seeking to reduce the computational complexity for the RegEx engine, as I have several GB of data to pass through it.
Is there any way to tigten this up? Is negative lookbehind truly the best/only way to accomplish this? Is there a way to do it as a lookahead instead? Is regex the wrong tool for this task?
EDIT: I am able to use either ActionScript's or PHP's RegEx engine.
EDIT: I cannot count on the number of spaces between sentences. Really!? sigh.
Please don't answer if you have no understanding of the inner workings of the RegEx engine, as pertaining to optimization.
Thanks in advance.