REs are efficient, but they're not magic :-)
It is not the number of lines in an input file that slows you down (3,000 is a very small number). When an RE is compiled, it (conceptually) has to write a program that can parse the input file based on that RE.
The speed of that "program" always depends on the complexity of the RE.
For example, "^$" will be very fast, "[0-9]{1,9}" will be somewhat slower, and anything that involves backtracking (i.e., having to back up in the RE, usually anything involving multiple variable-number-of-elements clauses, of which yours is an example) will be slower still.
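To make the backtracking point concrete, here is a rough sketch in Python (not your RE, and not your engine; the patterns, the input string and the `time_search` helper are all made up for illustration). Both patterns fail on the same input, but the one with nested variable-length clauses has to try a huge number of ways to split the text before giving up.

```python
import re
import time


def time_search(pattern, text):
    """Return (match_found, elapsed_seconds) for a single search of text."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    found = compiled.search(text) is not None
    return found, time.perf_counter() - start


# Hypothetical input: 20 letters, then a character that forces the match to fail.
text = "a" * 20 + "!"

simple = r"^a+$"     # one variable-length clause: gives up after a few steps
nested = r"^(a+)+$"  # nested variable-length clauses: backtracks heavily

simple_found, simple_secs = time_search(simple, text)
nested_found, nested_secs = time_search(nested, text)

print(f"simple: found={simple_found} in {simple_secs:.6f}s")
print(f"nested: found={nested_found} in {nested_secs:.6f}s")
```

The exact timings depend on your machine and RE engine, so treat the numbers as indicative only. With this kind of pattern, each extra character in the input roughly doubles the work.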
Anything you can do to minimize the number of lines beforehand will help to some extent, but as to optimizing the RE itself, that's often considered a black art. One possibility is to first strip out every line that does not fall between the start and the end of an Annotation.
I don't tend to worry too much about optimizing my REs (but they're not usually this complex). My view is that they will take as long as they take. If that's too long, I usually look for another solution which is faster but not so adaptable.
In the case of your RE, where you wanted to get all Annotation XML where the about attribute contains MATCH, I would do it in Perl (or awk for us old-timers :-) since the input file was reasonably fixed format:
- "<Annotation " on first line [a].
- "MATCH" also on first line [a].
- "</Annotation>" on last line and on its own [b].
This would be fast as a simple line scanner: turn echo on when the [a] conditions are met (and print that line), print any other line while echo is on, and turn echo off when the [b] condition is met (after printing that line).
Yes, far less adaptable but almost certainly faster (given your well-formatted input).
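The line scanner described above could look something like this. It is a minimal sketch in Python rather than Perl or awk, and it assumes the layout really is as fixed as the [a] and [b] conditions say. The function name `extract_annotations`, the `needle` parameter and the sample lines (including the `Body` element) are invented for the example.

```python
def extract_annotations(lines, needle="MATCH"):
    """Return the lines of every Annotation block whose first line contains needle."""
    echo = False
    kept = []
    for line in lines:
        # [a]: the opening tag and the needle are both on this line.
        if not echo and "<Annotation " in line and needle in line:
            echo = True
        if echo:
            kept.append(line)
            # [b]: the closing tag sits on a line of its own.
            if line.strip() == "</Annotation>":
                echo = False
    return kept


# Hypothetical sample input in the assumed fixed format.
sample = [
    '<Annotation about="xyz_MATCH_1">',
    "  <Body>keep me</Body>",
    "</Annotation>",
    '<Annotation about="other">',
    "  <Body>skip me</Body>",
    "</Annotation>",
]

result = extract_annotations(sample)
for line in result:
    print(line)
```

Note that it only checks that MATCH appears somewhere on the opening line, not specifically inside the about attribute, which is part of what makes it less adaptable than the RE. For a real file you would pass the open file object instead of a list, so only one line is held in memory at a time.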