On capturing groups
A capturing group attempts to capture what it matches. This has some important consequences:
- A group that matches nothing can never capture anything.
- A group that only matches an empty string can only capture an empty string.
- A group that captures repeatedly in a match attempt only gets to keep the last capture.
  - This is generally true for most flavors, but .NET regex is an exception (see related question).
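These consequences can be checked directly. Here is a minimal sketch using Python's re module; the choice of Python and the three small patterns are assumptions made for illustration, not something taken from the question:

```python
import re

# A group in an alternative that did not take part in the match captures nothing.
m = re.fullmatch(r"(a)|(b)", "b")
unmatched = m.group(1)   # None: group 1 matched nothing

# A group that can only match the empty string captures the empty string.
m = re.fullmatch(r"x()y", "xy")
empty = m.group(1)       # '': matched, but zero characters long

# A repeated group keeps only the capture from its last iteration.
m = re.fullmatch(r"(?:(\w),?)+", "a,b,c")
last = m.group(1)        # 'c': the earlier captures 'a' and 'b' are overwritten

print(repr(unmatched), repr(empty), repr(last))
```

Note the difference between the first two cases: a group that did not participate reports None, while a group that matched the empty string reports ''.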
Here's a simple pattern that contains 2 capturing groups:
(\d+) (cats|dogs)
\___/ \_________/
  1        2
Given the input
i have 16 cats, 20 dogs, and 13 turtles
there are 2 matches (as seen on rubular.com):
- 16 cats is a match: group 1 captures 16, group 2 captures cats
- 20 dogs is a match: group 1 captures 20, group 2 captures dogs
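The same result can be reproduced outside rubular. A small sketch in Python (an assumed language choice; the variable names are only illustrative):

```python
import re

text = "i have 16 cats, 20 dogs, and 13 turtles"
pattern = re.compile(r"(\d+) (cats|dogs)")

# finditer yields one match object per non-overlapping match, left to right.
results = [(m.group(0), m.group(1), m.group(2)) for m in pattern.finditer(text)]

for whole, count, animal in results:
    print(f"{whole!r}: group 1 = {count!r}, group 2 = {animal!r}")
```

"13 turtles" produces no match, since turtles is not one of the alternatives in group 2.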
Now consider this slight modification on the pattern:
(\d)+ (cats|dogs)
\__/  \_________/
 1         2
Now group 1 matches \d, i.e. a single digit. In most flavors, a group that matches repeatedly (thanks to the + in this case) only gets to keep the last match. Thus, in most flavors, only the last digit that was matched is captured by group 1 (as seen on rubular.com):
- 16 cats is a match: group 1 captures 6, group 2 captures cats
- 20 dogs is a match: group 1 captures 0, group 2 captures dogs
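Python's re module is one of the flavors that behaves this way. A brief sketch (Python is an assumed choice here) showing that the overall match is unchanged while group 1 shrinks to the last digit:

```python
import re

text = "i have 16 cats, 20 dogs, and 13 turtles"

# The quantifier now sits outside the group, so the group is re-entered
# once per digit and each iteration overwrites the previous capture.
repeated = re.compile(r"(\d)+ (cats|dogs)")

whole_matches = [m.group(0) for m in repeated.finditer(text)]
last_digits = [m.group(1) for m in repeated.finditer(text)]

print(whole_matches)
print(last_digits)
```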
On greedy vs reluctant vs negated character class
Now let's consider the problem of matching "everything between A and ZZ". As it turns out, this specification is ambiguous: we will come up with 3 patterns that do this, and they will yield different matches. Which one is "correct" depends on the expectation, which is not properly conveyed in the original statement.
We use the following as input:
eeAiiZooAuuZZeeeZZfff
We use 3 different patterns:
- A(.*)ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
  - This is the greedy variant; group 1 matched and captured iiZooAuuZZeee
- A(.*?)ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
  - This is the reluctant variant; group 1 matched and captured iiZooAuu
- A([^Z]*)ZZ yields 1 match: AuuZZ (as seen on ideone.com)
  - This is the negated character class variant; group 1 matched and captured uu
Here's a visual representation of what they matched:
         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
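The three results can also be reproduced with a short script. This is a sketch in Python (an assumed choice; the dictionary keys are just labels picked for this example):

```python
import re

text = "eeAiiZooAuuZZeeeZZfff"

patterns = {
    "greedy": r"A(.*)ZZ",         # .* takes as much as it can, then backtracks
    "reluctant": r"A(.*?)ZZ",     # .*? takes as little as it can, then expands
    "negated": r"A([^Z]*)ZZ",     # [^Z]* can never step over a Z
}

results = {}
for name, pattern in patterns.items():
    m = re.search(pattern, text)
    results[name] = (m.group(0), m.group(1))
    print(f"{name:9} match = {m.group(0)!r}, group 1 = {m.group(1)!r}")
```

The negated character class variant cannot match from the first A, because ii is followed by a single Z and then o; the match therefore starts at the second A.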
See related question for a more in-depth treatment of the differences between these 3 techniques.
Going back to the question
So let's go back to the question and see what's wrong with the pattern:
<h1>()<br
    \/
    1
Group 1 matches the empty string, therefore the whole pattern overall can only match <h1><br, and group 1 can only match the empty string.
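This is easy to confirm. In the sketch below (Python is an assumed choice, and both sample strings are hypothetical, made up for illustration), the pattern fails as soon as there is any text between the two tags:

```python
import re

pattern = re.compile(r"<h1>()<br")

# Hypothetical inputs: one with heading text, one with the tags adjacent.
with_text = pattern.search("<h1>Title<br>")
adjacent = pattern.search("<h1><br>")

no_match = with_text is None                 # nothing may sit between <h1> and <br
empty_capture = adjacent.group(1)            # '' : group 1 matched the empty string

print(no_match, repr(empty_capture))
```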
One can try to "fix" this in many different ways. The 3 obvious ones to try are:
- <h1>(.*)<br ; greedy
- <h1>(.*?)<br ; reluctant
- <h1>([^<]*)<br ; negated character class
You will find that none of the above "work" all the time; there will be problems with some HTML. This is to be expected: regex is the "wrong" tool for the job. You can try to make the pattern more and more complicated, to get it "right" more often and "wrong" less often. More than likely you'll end up with a horrible mess that no one can understand and/or maintain, and it still probably won't work "right" 100% of the time.
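To see how the three attempts diverge, here is a sketch in Python (an assumed choice). The sample markup and the capture helper are hypothetical, invented for this illustration; other HTML will break the patterns in other ways:

```python
import re

# Hypothetical sample markup, chosen only for illustration.
html = "<h1>First</h1><br><h1>Second</h1><br>"

def capture(pattern, text):
    """Return group 1 of the first match, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

greedy = capture(r"<h1>(.*)<br", html)       # runs on to the last <br
reluctant = capture(r"<h1>(.*?)<br", html)   # stops at the first <br, keeps </h1>
negated = capture(r"<h1>([^<]*)<br", html)   # blocked by the < of </h1>

print(repr(greedy))
print(repr(reluctant))
print(repr(negated))
```

On this input the greedy variant swallows both headings, the reluctant variant drags the closing </h1> into the capture, and the negated character class variant finds no match at all.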