Let's say we have the following input:
<amy>
(bob)
<carol)
(dean>
We also have the following regex:
<(\w+)>|\((\w+)\)
Now we get two matches (as seen on rubular.com):
- <amy> is a match, \1 captures amy, \2 fails
- (bob) is a match, \2 captures bob, \1 fails
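For the java.util.regex flavor, the behavior above can be reproduced with a small sketch like the one below. The class and method names are made up for illustration, and each input line is tested separately with matches() rather than scanning one multi-line string as rubular does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketMatchDemo {
    // The regex from the question; in a Java string literal each backslash is doubled.
    static final Pattern BRACKETED = Pattern.compile("<(\\w+)>|\\((\\w+)\\)");

    // Hypothetical helper: reports what each group holds when the whole input matches.
    static String describe(String input) {
        Matcher m = BRACKETED.matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        // Only the group belonging to the alternate that matched is set; the other is null.
        return "group 1 = " + m.group(1) + ", group 2 = " + m.group(2);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"<amy>", "(bob)", "<carol)", "(dean>"}) {
            System.out.println(s + " -> " + describe(s));
        }
    }
}
```

A group that does not take part in the match comes back as null from Matcher.group(int), which is the "fails" case described above.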
This regex does most of what we want:
- It matches the open and close brackets properly (i.e. no mixing)
- It captures the part we're interested in
However, it does have a few drawbacks:
- The capturing pattern (i.e. the "main" part) is repeated
  - It's only \w+ in this case, but generally speaking this can be quite complex
    - If it involves backreferences, then they must be renumbered for each alternate!
  - Repetition makes maintenance a nightmare! (what if it changes?)
- The groups are essentially duplicated
  - Depending on which alternate matches, we must query different groups
    - It's only \1 or \2 in this case, but generally the "main" part can have capturing groups of its own!
  - Not only is this inconvenient, but there may be situations where this is not feasible (e.g. when we're using a custom regex framework that is limited to querying only one group)
- The situation quickly worsens if we also want to match {...}, [...], etc.
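To make the "query different groups" drawback concrete, here is a minimal Java sketch of the kind of glue code the duplicated groups force on the caller. The helper firstParticipatingGroup is hypothetical, not part of java.util.regex, and it is shown only to illustrate the problem, not as a solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupLookupDemo {
    // Same regex as above: one capturing group per alternate.
    static final Pattern BRACKETED = Pattern.compile("<(\\w+)>|\\((\\w+)\\)");

    // Hypothetical helper: scans every group and returns the first one that is set.
    static String firstParticipatingGroup(Matcher m) {
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) {
                return m.group(i);
            }
        }
        return null;
    }

    // Returns the bracketed word, or null when the input does not match.
    static String capture(String input) {
        Matcher m = BRACKETED.matcher(input);
        return m.matches() ? firstParticipatingGroup(m) : null;
    }

    public static void main(String[] args) {
        System.out.println(capture("<amy>"));
        System.out.println(capture("(bob)"));
        System.out.println(capture("<carol)"));
    }
}
```

This scan only works while each alternate has exactly one group. As soon as the "main" part has capturing groups of its own, the caller has to know which group numbers belong to which alternate, and the option disappears entirely when the framework can query only one group.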
So the question is obvious: how can we do this without repeating the "main" pattern?
Note: for the most part I'm interested in the java.util.regex flavor, but other flavors are welcome.
Appendix
There's nothing new in this section; it only illustrates the problem mentioned above with an example.
Let's take the above example to the next step: we now want to match these:
<amy=amy>
(bob=bob)
[carol=carol]
But not these:
<amy=amy) # non-matching bracket
<amy=bob> # left hand side not equal to right hand side
Using the alternate technique, we have the following that works (as seen on rubular.com):
<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\]
As explained above:
- The main pattern can't simply be repeated; backreferences must be renumbered
- Repetition also means maintenance nightmare if it ever changes
- Depending on which alternate matches, we must query either \1 \2, \3 \4, or \5 \6
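The appendix regex can be checked in java.util.regex with a sketch along these lines. The class and method names are assumptions for illustration; the method reports which outer group (1, 3, or 5) took part in the match, which is exactly the bookkeeping the question would like to avoid.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppendixDemo {
    // One alternate per bracket type; note the renumbered backreferences \2, \4, \6.
    static final Pattern PAIR = Pattern.compile(
            "<((\\w+)=\\2)>"
            + "|\\(((\\w+)=\\4)\\)"
            + "|\\[((\\w+)=\\6)\\]");

    // Hypothetical helper: index of the outer group that matched, or -1 for no match.
    static int matchedOuterGroup(String input) {
        Matcher m = PAIR.matcher(input);
        if (!m.matches()) {
            return -1;
        }
        // The outer groups are 1, 3 and 5; their inner groups are 2, 4 and 6.
        for (int g = 1; g <= 5; g += 2) {
            if (m.group(g) != null) {
                return g;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] inputs = {"<amy=amy>", "(bob=bob)", "[carol=carol]", "<amy=amy)", "<amy=bob>"};
        for (String s : inputs) {
            System.out.println(s + " -> outer group " + matchedOuterGroup(s));
        }
    }
}
```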