I always thought that a look-behind assertion in Java's regex-API (and many other languages for that matter) must have an obvious length. So, STAR and PLUS quantifiers are not allowed inside look-behinds.
The excellent online resource regular-expressions.info seems to confirm (some of) my assumptions:
"[...] Java takes things a step further by allowing finite repetition. You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. Java recognizes the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. Unfortunately, the JDK 1.4 and 1.5 have some bugs when you use alternation inside lookbehind. These were fixed in JDK 1.6. [...]"
Using the curly brackets works as long as the total length of range of the characters inside the look-behind is smaller or equal to Integer.MAX_VALUE. So these regexes are valid:
"(?<=a{0," +(Integer.MAX_VALUE) + "})B"
"(?<=Ca{0," +(Integer.MAX_VALUE-1) + "})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-2) + "})B"
But these aren't:
"(?<=Ca{0," +(Integer.MAX_VALUE) +"})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-1) +"})B"
However, I don't understand the following:
When I run a test using the * and + quantifier inside a look-behind, all goes well (see output Test 1 and Test 2).
But, when I add a single character at the start of the look-behind from Test 1 and Test 2, it breaks (see output Test 3).
Making the greedy * from Test 3 reluctant has no effect, it still breaks (see Test 4).
Here's the test harness:
public class Main {
private static String testFind(String regex, String input) {
try {
boolean returned = java.util.regex.Pattern.compile(regex).matcher(input).find();
return "testFind : Valid -> regex = "+regex+", input = "+input+", returned = "+returned;
} catch(Exception e) {
return "testFind : Invalid -> "+regex+", "+e.getMessage();
}
}
private static String testReplaceAll(String regex, String input) {
try {
String returned = input.replaceAll(regex, "FOO");
return "testReplaceAll : Valid -> regex = "+regex+", input = "+input+", returned = "+returned;
} catch(Exception e) {
return "testReplaceAll : Invalid -> "+regex+", "+e.getMessage();
}
}
private static String testSplit(String regex, String input) {
try {
String[] returned = input.split(regex);
return "testSplit : Valid -> regex = "+regex+", input = "+input+", returned = "+java.util.Arrays.toString(returned);
} catch(Exception e) {
return "testSplit : Invalid -> "+regex+", "+e.getMessage();
}
}
public static void main(String[] args) {
String[] regexes = {"(?<=a*)B", "(?<=a+)B", "(?<=Ca*)B", "(?<=Ca*?)B"};
String input = "CaaaaaaaaaaaaaaaBaaaa";
int test = 0;
for(String regex : regexes) {
test++;
System.out.println("********************** Test "+test+" **********************");
System.out.println(" "+testFind(regex, input));
System.out.println(" "+testReplaceAll(regex, input));
System.out.println(" "+testSplit(regex, input));
System.out.println();
}
}
}
The output:
********************** Test 1 **********************
testFind : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
testReplaceAll : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
testSplit : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]
********************** Test 2 **********************
testFind : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
testReplaceAll : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
testSplit : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]
********************** Test 3 **********************
testFind : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
^
testReplaceAll : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
^
testSplit : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
^
********************** Test 4 **********************
testFind : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
^
testReplaceAll : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
^
testSplit : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
^
My question may be obvious, but I'll still ask it: Can anyone explain to me why Test 1 and 2 fail, and Test 3 and 4 don't? I would have expected them all to fail, not half of them to work and half of them to fail.
Thanks.
PS. I'm using: Java version 1.6.0_14