I think something like this is what you want.
For alternating characters:
(?=(.)(?!\1)(.))(?:\1\2){2,}
\0 will be the entire alternating sequence; \1 and \2 are the two (distinct) alternating characters.
For a run of N characters followed by a run of M characters, possibly separated by other characters (replace N and M with numbers here):
(?=(.))\1{N}.*?(?=(?!\1)(.))\2{M}
\0 will be the entire match, including the infix. \1 is the character repeated (at least) N times; \2 is the character repeated (at least) M times.
Here's a test harness in Java.
import java.util.regex.*;

public class Regex3 {
    // Builds the run pattern by substituting the numbers N and M into the template.
    static String runNrunM(int N, int M) {
        return "(?=(.))\\1{N}.*?(?=(?!\\1)(.))\\2{M}"
            .replace("N", String.valueOf(N))
            .replace("M", String.valueOf(M));
    }

    // Prints every (non-overlapping) match of pattern in text, with all of its groups.
    static void dumpMatches(String text, String pattern) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        System.out.println(text + " <- " + pattern);
        while (m.find()) {
            System.out.println(" match");
            // Group 0 is the whole match; groups 1 and 2 are the captured characters.
            for (int g = 0; g <= m.groupCount(); g++) {
                System.out.format(" %d: [%s]%n", g, m.group(g));
            }
        }
    }

    public static void main(String[] args) {
        String[] tests = {
            "foobababababaf foobaafoobaaaooo",
            "xxyyyy axxayyyya zzzzzzzzzzzzzz"
        };
        // Alternating characters.
        for (String test : tests) {
            dumpMatches(test, "(?=(.)(?!\\1)(.))(?:\\1\\2){2,}");
        }
        // Run of 3, then run of 3.
        for (String test : tests) {
            dumpMatches(test, runNrunM(3, 3));
        }
        // Run of 2, then run of 4.
        for (String test : tests) {
            dumpMatches(test, runNrunM(2, 4));
        }
    }
}
This produces the following output:
foobababababaf foobaafoobaaaooo <- (?=(.)(?!\1)(.))(?:\1\2){2,}
match
0: [bababababa]
1: [b]
2: [a]
xxyyyy axxayyyya zzzzzzzzzzzzzz <- (?=(.)(?!\1)(.))(?:\1\2){2,}
foobababababaf foobaafoobaaaooo <- (?=(.))\1{3}.*?(?=(?!\1)(.))\2{3}
match
0: [aaaooo]
1: [a]
2: [o]
xxyyyy axxayyyya zzzzzzzzzzzzzz <- (?=(.))\1{3}.*?(?=(?!\1)(.))\2{3}
match
0: [yyyy axxayyyya zzz]
1: [y]
2: [z]
foobababababaf foobaafoobaaaooo <- (?=(.))\1{2}.*?(?=(?!\1)(.))\2{4}
xxyyyy axxayyyya zzzzzzzzzzzzzz <- (?=(.))\1{2}.*?(?=(?!\1)(.))\2{4}
match
0: [xxyyyy]
1: [x]
2: [y]
match
0: [xxayyyy]
1: [x]
2: [y]
Explanation
(?=(.)(?!\1)(.))(?:\1\2){2,} has two parts:
- (?=(.)(?!\1)(.)) establishes \1 and \2 using lookahead
  - Nested negative lookahead ensures that \1 != \2
  - Using lookahead to capture lets \0 have the entire match (instead of just the "tail" end)
- (?:\1\2){2,} matches the \1\2 sequence, which must repeat at least twice.
(?=(.))\1{N}.*?(?=(?!\1)(.))\2{M} has three parts:
- (?=(.))\1{N} captures \1 in a lookahead, and then matches it N times
  - Using lookahead to capture means the repetition can be N instead of N-1
- .*? allows an infix to separate the two runs; it is reluctant, to keep the infix as short as possible
- (?=(?!\1)(.))\2{M}
  - Similar to the first part
  - Nested negative lookahead ensures that \1 != \2
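The "N instead of N-1" point can be checked with a small sketch. It compares the ordinary capture-then-repeat form with the lookahead form for N = 3. The class name RunCountDemo and the helper allMatches are made up for this illustration and are not part of the harness above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunCountDemo {
    // Hypothetical helper: collects group 0 of every non-overlapping match.
    static List<String> allMatches(String text, String pattern) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "foobaafoobaaaooo";
        // Ordinary capture: (.) consumes one character, so a run of 3 needs {2}.
        System.out.println(allMatches(text, "(.)\\1{2}"));
        // Capture inside a lookahead: nothing is consumed, so a run of 3 is {3}.
        System.out.println(allMatches(text, "(?=(.))\\1{3}"));
    }
}
```

Both lines should print [aaa, ooo]; the lookahead form simply lets the number in the braces be the run length itself.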
The run regex also matches inside longer runs, e.g. run(2,2) finds a match in "xxxyyy" (note that \0 stops after the second y):
xxxyyy <- (?=(.))\1{2}.*?(?=(?!\1)(.))\2{2}
match
0: [xxxyy]
1: [x]
2: [y]
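If you would rather have \0 cover the whole of both runs, one possible variation (my assumption, not something the patterns above do) is to make the counts open-ended, i.e. {N,} and {M,}. A minimal sketch, where runAtLeast and firstMatch are hypothetical names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OpenEndedRuns {
    // Hypothetical variant of the run pattern: {n,} and {m,} instead of {n} and {m}.
    static String runAtLeast(int n, int m) {
        return "(?=(.))\\1{" + n + ",}.*?(?=(?!\\1)(.))\\2{" + m + ",}";
    }

    // Returns group 0 of the first match, or null if there is none.
    static String firstMatch(String text, String pattern) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Fixed counts: the match ends after the second y.
        System.out.println(firstMatch("xxxyyy", "(?=(.))\\1{2}.*?(?=(?!\\1)(.))\\2{2}"));
        // Open-ended counts: the greedy quantifiers take the full runs.
        System.out.println(firstMatch("xxxyyy", runAtLeast(2, 2)));
    }
}
```

This should print xxxyy and then xxxyyy. The infix is still matched reluctantly; only the two run quantifiers become greedy.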
Also, it does not allow overlapping matches. That is, only one run(2,3) is found in "xx11yyy222":
xx11yyy222 <- (?=(.))\1{2}.*?(?=(?!\1)(.))\2{3}
match
0: [xx11yyy]
1: [x]
2: [y]
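If overlapping matches are wanted, one way (a sketch, assuming at most one match per start position is good enough) is to restart the search one character past the start of the previous match, using Matcher.find(int). The class OverlapDemo and the helper overlappingMatches are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // Hypothetical helper: finds at most one match per start position,
    // so later matches may overlap earlier ones.
    static List<String> overlappingMatches(String text, String pattern) {
        Matcher m = Pattern.compile(pattern).matcher(text);
        List<String> out = new ArrayList<>();
        int from = 0;
        // find(int) resets the matcher and searches from the given index.
        while (from <= text.length() && m.find(from)) {
            out.add(m.group());
            from = m.start() + 1; // resume just after the start, not the end
        }
        return out;
    }

    public static void main(String[] args) {
        String run23 = "(?=(.))\\1{2}.*?(?=(?!\\1)(.))\\2{3}";
        System.out.println(overlappingMatches("xx11yyy222", run23));
    }
}
```

For "xx11yyy222" this should print [xx11yyy, 11yyy, yyy222, yy222]. Because the infix is reluctant, the search starting at "11" stops at "yyy" and never reaches "222".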