.*?id="number"[^>]*?>([^<]+?).*
Is that really the regex you're using? I ask because ([^<]+?) will always match exactly one character, as if you had written ([^<]) instead. The + quantifier has to match at least once, but because it's reluctant it immediately hands off to the next part - .* - which always succeeds. Removing the .* and switching to find() or lookingAt() won't change that behavior, either (although it will probably get the same result a little more quickly). If you want to match all the text up to the next angle bracket, get rid of the question mark and use ([^<]+) instead.
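To see the one-character capture for yourself, here is a minimal sketch. The class name, the capture() helper and the sample input are made up for illustration; only the two regexes come from the discussion above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReluctantCaptureDemo {
    // Hypothetical helper: returns group 1 if the regex matches the whole input, else null.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed sample input, not taken from the question.
        String html = "<span id=\"number\">42</span>";

        // Reluctant +? gives up after one character because the trailing .* takes the rest.
        System.out.println(capture(".*?id=\"number\"[^>]*?>([^<]+?).*", html));

        // Greedy + keeps going until it hits the next '<'.
        System.out.println(capture(".*?id=\"number\"[^>]*?>([^<]+).*", html));
    }
}
```

The first call should print 4 and the second 42; the only difference between the two regexes is the question mark inside the capturing group.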
The [^>]*?> part doesn't make much sense, either. You have to consume as many non-brackets as there are before you can match the bracket, so what's the point of making that quantifier reluctant? In fact, there's no point in letting a greedy quantifier backtrack either; if [^>]* matches as much as it can and the next character isn't '>', you know backtracking won't do any good. You might as well use a possessive quantifier - [^>]*+> - or an atomic group - (?>[^>]*)> - if your regex flavor supports them.
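Java's java.util.regex supports both possessive quantifiers and atomic groups, so the variants can be compared directly. The sketch below only shows that all four forms match the same text (or all fail together); it does not measure the backtracking cost, and the class, method and inputs are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagEndDemo {
    // Reluctant, greedy, possessive and atomic-group forms of "everything up to the next '>'".
    static final String[] VARIANTS = {
        "[^>]*?>",
        "[^>]*>",
        "[^>]*+>",
        "(?>[^>]*)>"
    };

    // Hypothetical helper: the text matched at the start of the input, or null if nothing matches there.
    static String restOfTag(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String closed = " class=\"whatever\">abc123";   // assumed input with a closing bracket
        String unclosed = " class=\"whatever\"";        // assumed input without one
        for (String regex : VARIANTS) {
            System.out.println(regex + " -> " + restOfTag(regex, closed)
                    + " / " + restOfTag(regex, unclosed));
        }
    }
}
```

Because [^>] can never consume a '>', there is nothing for backtracking to recover, which is why the possessive and atomic forms lose nothing here.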
The first quantified portion - .*? - is the only one that's used correctly (if not optimally). Putting that at the beginning of a regex simulates the behavior of find() when you're using lookingAt() or (with a .* at the end) matches(). However, leaving it off and using find() is more efficient, as you've discovered.
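As a rough illustration of how the three Matcher methods relate, the sketch below runs the same core pattern with and without the padding. The class name, helper names and sample input are assumptions for the example.

```java
import java.util.regex.Pattern;

final class AnchoringDemo {
    // find(): the match may start anywhere in the input.
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // lookingAt(): the match must start at the beginning of the input.
    static boolean lookingAt(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    // matches(): the match must cover the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String html = "<div id=\"number\">7</div>";   // assumed sample input
        String core = "id=\"number\"";

        System.out.println(find(core, html));                 // bare pattern, found mid-string
        System.out.println(lookingAt(core, html));            // fails: input starts with "<div"
        System.out.println(lookingAt(".*?" + core, html));    // leading .*? simulates find()
        System.out.println(matches(core, html));              // fails: pattern doesn't span the input
        System.out.println(matches(".*?" + core + ".*", html)); // padded on both sides
    }
}
```

The padded versions succeed, but only by making the regex engine do the scanning that find() already does on its own.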
Reluctant quantifiers are very handy, but lately it seems like they've been getting overexposed. With increasing frequency I see people giving the advice "Use reluctant quantifiers" with no explanation or qualification--just another silver bullet. And I believe regexes like the one in this question are the result. Of the three reluctant quantifiers, one should have been greedy, one should have been possessive, and the other shouldn't have been there at all.
EDIT: Here's an example to illustrate some of what I'm talking about, and to address Stephen C's comment. Given this string:
<div id="number" class="whatever">abc123</div>
...the dynamic parts of the regex match like this:
.*? => '<div '
[^>]*? => ' class="whatever"'
([^<]+?) => 'a'
.* => 'bc123</div>'
Changing all the reluctant quantifiers to greedy doesn't change the overall match (the whole string), and it doesn't change what gets matched by the first two dynamic portions. But the last two get reapportioned:
([^<]+) => 'abc123'
.* => '</div>'
Looking at the original regex, I thought this must be the desired result; why use such a complicated subexpression inside a capturing group if not to capture the whole content, 'abc123'? That's what leads me to believe the reluctant quantifiers were used blindly, as a panacea.
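One way to check the breakdown above is to wrap each dynamic part in its own capturing group and print the groups. This is only a sketch: the extra parentheses, the class name and the parts() helper are additions for illustration and are not part of the original regex.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupBreakdownDemo {
    // Same regex as in the question, with every dynamic part wrapped in a group.
    static final String RELUCTANT = "(.*?)id=\"number\"([^>]*?)>([^<]+?)(.*)";
    // Same again with all reluctant quantifiers made greedy.
    static final String GREEDY = "(.*)id=\"number\"([^>]*)>([^<]+)(.*)";

    // Hypothetical helper: returns groups 1..n of a full match, or null if there is none.
    static String[] parts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div id=\"number\" class=\"whatever\">abc123</div>";
        System.out.println(Arrays.toString(parts(RELUCTANT, html)));
        System.out.println(Arrays.toString(parts(GREEDY, html)));
    }
}
```

Only the last two groups should differ between the two lines of output.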
One other thing: looking back over the thread, I see the OP didn't actually say he had removed the .*? from the front of the regex when he switched to the find() method. @Ben, if you haven't done that, you should; it's just slowing things down now. That would leave you with this regex:
id="number"[^>]*+>([^<]+)
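Used with find(), that regex might be wired up roughly as follows. The class and method names are placeholders, and returning an Optional is just one possible design; adapt it to however your code handles a missing element.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberExtractor {
    // Compiled once and reused; Pattern objects are immutable and thread-safe.
    private static final Pattern NUMBER = Pattern.compile("id=\"number\"[^>]*+>([^<]+)");

    // Hypothetical helper: text following the tag that carries id="number", if any.
    static Optional<String> extract(String html) {
        Matcher m = NUMBER.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<div id=\"number\" class=\"whatever\">abc123</div>";
        System.out.println(extract(html).orElse("(no match)"));
    }
}
```

Note that ([^<]+) requires at least one character, so an element with no text content yields no match rather than an empty string.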
I don't want anyone to think I'm contesting the accepted answer, either. I'm just scratching this itch I have about the overuse/inappropriate use of reluctant quantifiers.