tags:

views:

975

answers:

2

I always thought that a look-behind assertion in Java's regex-API (and many other languages for that matter) must have an obvious length. So, STAR and PLUS quantifiers are not allowed inside look-behinds.

The excellent online resource regular-expressions.info seems to confirm (some of) my assumptions:

"[...] Java takes things a step further by allowing finite repetition. You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. Java recognizes the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. Unfortunately, the JDK 1.4 and 1.5 have some bugs when you use alternation inside lookbehind. These were fixed in JDK 1.6. [...]"

-- http://www.regular-expressions.info/lookaround.html

Using the curly brackets works as long as the total length of range of the characters inside the look-behind is smaller or equal to Integer.MAX_VALUE. So these regexes are valid:

"(?<=a{0,"   +(Integer.MAX_VALUE)   + "})B"
"(?<=Ca{0,"  +(Integer.MAX_VALUE-1) + "})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-2) + "})B"

But these aren't:

"(?<=Ca{0,"  +(Integer.MAX_VALUE)   +"})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-1) +"})B"

However, I don't understand the following:

When I run a test using the * and + quantifier inside a look-behind, all goes well (see output Test 1 and Test 2).

But, when I add a single character at the start of the look-behind from Test 1 and Test 2, it breaks (see output Test 3).

Making the greedy * from Test 3 reluctant has no effect, it still breaks (see Test 4).

Here's the test harness:

public class Main {

    private static String testFind(String regex, String input) {
        try {
            boolean returned = java.util.regex.Pattern.compile(regex).matcher(input).find();
            return "testFind       : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testFind       : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testReplaceAll(String regex, String input) {
        try {
            String returned = input.replaceAll(regex, "FOO");
            return "testReplaceAll : Valid   -> regex = "+regex+", input = "+input+", returned = "+returned;
        } catch(Exception e) {
            return "testReplaceAll : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    private static String testSplit(String regex, String input) {
        try {
            String[] returned = input.split(regex);
            return "testSplit      : Valid   -> regex = "+regex+", input = "+input+", returned = "+java.util.Arrays.toString(returned);
        } catch(Exception e) {
            return "testSplit      : Invalid -> "+regex+", "+e.getMessage();
        }
    }

    public static void main(String[] args) {
        String[] regexes = {"(?<=a*)B", "(?<=a+)B", "(?<=Ca*)B", "(?<=Ca*?)B"};
        String input = "CaaaaaaaaaaaaaaaBaaaa";
        int test = 0;
        for(String regex : regexes) {
            test++;
            System.out.println("********************** Test "+test+" **********************");
            System.out.println("    "+testFind(regex, input));
            System.out.println("    "+testReplaceAll(regex, input));
            System.out.println("    "+testSplit(regex, input));
            System.out.println();
        }
    }
}

The output:

********************** Test 1 **********************
    testFind       : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
    testReplaceAll : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
    testSplit      : Valid   -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

********************** Test 2 **********************
    testFind       : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
    testReplaceAll : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
    testSplit      : Valid   -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]

********************** Test 3 **********************
    testFind       : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
    testReplaceAll : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^
    testSplit      : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
      ^

********************** Test 4 **********************
    testFind       : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
    testReplaceAll : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^
    testSplit      : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
       ^

My question may be obvious, but I'll still ask it: Can anyone explain to me why Test 1 and 2 fail, and Test 3 and 4 don't? I would have expected them all to fail, not half of them to work and half of them to fail.

Thanks.

PS. I'm using: Java version 1.6.0_14

+6  A: 

Glancing at the source code for Pattern.java reveals that the '*' and '+' are implemented as instances of Curly (which is the object created for curly operators). So,

a*

is implemented as

a{0,0x7FFFFFFF}

and

a+

is implemented as

a{1,0x7FFFFFFF}

which is why you see exactly the same behaviors for curlies and stars.

Jonathan Feinberg
It didn't occur to me to look at the source of Pattern... Silly me. Thank you! It makes perfect sense now.
Bart Kiers
One of the many reasons I can't live without Eclipse is my itchy ctrl-click finger (which, if you don't use Eclipse, means "open up the source file where this name is defined").
Jonathan Feinberg
Thanks, I've never taken the trouble to attach the source to Eclipse. I will do so now for sure. Thanks.
Bart Kiers
+4  A: 

It's a bug: http://bugs.sun.com/view_bug.do?bug_id=6695369

Pattern.compile() is always supposed to throw an exception if it can't determine the maximum possible length of a lookbehind match.

Alan Moore
Hey Alan, figured you'd come along soon! Thanks, I knew this used to be a bug (in 1.5) but I remembered that the engine always evaluated to false or something. Now it does evaluate the result correctly, so I figured the bug got fixed in 1.6. I must've remembered incorrectly. Thanks for the info.Bart (aka prometheuzz)
Bart Kiers
That bug got fixed; this is a new one that was introduced in 1.6. :/
Alan Moore
Ah, great :|. Thanks for the info.
Bart Kiers