I'm trying to extract strings that are wrapped in double brackets. For example, [[this is one token]] should be matched. To make things more elegant, there should be an escape sequence so that double-bracketed items like \[[this escaped token\]] don't get matched.

The pattern [^\\\\]([\\[]{2}.+[^\\\\][\\]]{2}) with "group 1" to extract the token is close, but there are situations where it doesn't work. The problem seems to be that the leading negated class is evaluated as "any one character except a backslash", so it has to consume a character and cannot match "nothing" (for example, at the very start of the input). So, what would make this pattern match "nothing or any character other than a backslash"?

Here is a unit test to show the desired behavior:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import junit.framework.TestCase;

public class RegexSpike extends TestCase {
    private String regex;
    private Pattern pattern;
    private Matcher matcher;

    @Override
    protected void setUp() throws Exception {
        super.setUp();
        regex = "[^\\\\]([\\[]{2}.+[^\\\\][\\]]{2})";
        pattern = Pattern.compile(regex);
    }

    private String runRegex(String testString) {
        matcher = pattern.matcher(testString);
        return matcher.find() ? matcher.group(1) : "NOT FOUND";
    }

    public void testBeginsWithTag_Passes() {
        assertEquals("[[should work]]", runRegex("[[should work]]"));
    }

    public void testBeginsWithSpaces_Passes() {
        assertEquals("[[should work]]", runRegex("   [[should work]]"));
    }

    public void testBeginsWithChars_Passes() {
        assertEquals("[[should work]]", runRegex("anything here[[should work]]"));
    }

    public void testEndsWithChars_Passes() {
        assertEquals("[[should work]]", runRegex("[[should work]]with anything here"));
    }

    public void testBeginsAndEndsWithChars_Passes() {
        assertEquals("[[should work]]", runRegex("anything here[[should work]]and anything here"));
    }

    public void testFirstBracketsEscaped_Fails() {
        assertEquals("NOT FOUND", runRegex("\\[[should NOT work]]"));
    }

    public void testSingleBrackets_Fails() {
        assertEquals("NOT FOUND", runRegex("[should NOT work]"));
    }

    public void testSecondBracketsEscaped_Fails() {
        assertEquals("NOT FOUND", runRegex("[[should NOT work\\]]"));
    }

}
+1  A: 

You want a "zero-width negative lookbehind assertion", which is (?<!expr). Try:

(?<!\\\\)([\\[]{2}.+[^\\\\][\\]]{2})

Actually, this can be simplified and made more general by dropping the unnecessary character-class brackets and adding a negative lookbehind for the closing brackets, too. (Your version will also fail if there is an escaped bracket in the middle of the string, like [[text\]]moretext]].)

(?<!\\\\)(\\[{2}.*?(?<!\\\\)\\]{2})
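
For what it's worth, here is a minimal sketch of how that second pattern might be exercised outside the unit test. The class name LookbehindSketch and the helper tokens are made up for illustration; only the regex itself comes from the text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindSketch {
    // As a raw regex: (?<!\\)(\[{2}.*?(?<!\\)\]{2})
    // (?<!\\) is zero-width: it only checks that the previous character
    // is not a backslash, so it also succeeds at the start of the input.
    private static final Pattern TOKEN =
            Pattern.compile("(?<!\\\\)(\\[{2}.*?(?<!\\\\)\\]{2})");

    // Hypothetical helper: collects every unescaped [[...]] token, brackets included.
    static List<String> tokens(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }
}
```

Because the lookbehind consumes nothing, the token stays in group 1, so the runRegex helper from the question should not need to change.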
JSBangs
+2  A: 

You can simply use (^|[^\\]), which matches either the beginning of the string (or, if you enable MULTILINE mode, the beginning of any line) or a single character that is not a backslash (including spaces, newlines, etc.).

You'll also want to replace .+ with .+?, because otherwise a string such as "[[one]] and [[two]]" will be seen as a single match, where "one]] and [[two" is considered to be between brackets.
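
A small sketch of the greedy versus reluctant difference, with the escape handling stripped out so only the quantifier changes. The class GreedyVsLazySketch and the method findAll are hypothetical names used just for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazySketch {
    // Hypothetical helper: returns every full match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Greedy: .+ runs to the end of the line and backtracks to the LAST ]].
    static final String GREEDY = "\\[{2}.+\\]{2}";
    // Reluctant: .+? grows one character at a time and stops at the FIRST ]].
    static final String LAZY = "\\[{2}.+?\\]{2}";
}
```

Calling findAll(GREEDY, "[[one]] and [[two]]") gives one match spanning the whole string, while findAll(LAZY, ...) gives the two tokens separately.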

A third point is that you do not have to wrap a single character (even escaped ones such as \[ or \]) in a character class with [].

So that would make the following regex (pardon me removing the double-escapedness for clarity):

(^|[^\\])(\[{2}.+?[^\\]\]{2})

(Also note that this regex gives you no way to escape the escape character itself. Two backslashes before a [ will not be parsed as one escaped backslash followed by an unescaped bracket; they will be read as one unescaped backslash followed by an escaped bracket.)
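
Here is a rough sketch of that regex put back into a Java string literal. AlternationSketch and firstToken are invented names for illustration. One thing to watch for: because (^|[^\\]) is a capturing group that may consume a character, the token moves from group 1 to group 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationSketch {
    // As a raw regex: (^|[^\\])(\[{2}.+?[^\\]\]{2})
    // Group 1 is the start of input or one non-backslash character;
    // group 2 is the bracketed token itself.
    private static final Pattern TOKEN =
            Pattern.compile("(^|[^\\\\])(\\[{2}.+?[^\\\\]\\]{2})");

    // Hypothetical helper mirroring runRegex from the question, but reading group 2.
    static String firstToken(String input) {
        Matcher m = TOKEN.matcher(input);
        return m.find() ? m.group(2) : "NOT FOUND";
    }
}
```

Note that .+?[^\\] needs at least two characters between the brackets, so a one-character token such as [[a]] would not match with this form.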

molf
+1  A: 

What should happen with this string? (Actual string content, not a Java literal.)

foo\\[[blah]]bar

What I'm asking is whether you're supporting escaped backslashes. If you are, the lookbehind won't work. Instead of looking for a single backslash, you would have to check for an odd but unknown number of them, and Java lookbehinds can't be open-ended like that. Also, what about escaped brackets inside a token: is this valid?

foo[[blah\]]]bar
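
To make the two questions concrete, the sketch below runs a plain single-backslash lookbehind pattern against both strings. The pattern is an assumed stand-in for the lookbehind approach in general, and LookbehindLimitSketch and firstToken are names invented here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindLimitSketch {
    // Assumed stand-in for the lookbehind approach:
    // (?<!\\)(\[{2}.*?(?<!\\)\]{2})
    // It inspects exactly one character before each bracket pair.
    private static final Pattern TOKEN =
            Pattern.compile("(?<!\\\\)(\\[{2}.*?(?<!\\\\)\\]{2})");

    // Hypothetical helper: first token with its brackets, or "NOT FOUND".
    static String firstToken(String input) {
        Matcher m = TOKEN.matcher(input);
        return m.find() ? m.group(1) : "NOT FOUND";
    }
}
```

With this pattern, foo\\[[blah]]bar produces no match, because the lookbehind sees a backslash and cannot tell that it is itself escaped. foo[[blah\]]]bar does match, and the escaped bracket ends up inside the token.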

In any case, I suggest you come at the backslash problem from the other direction: match any number of escaped characters (i.e. backslash plus anything) immediately preceding the first bracket as part of the token. Inside the token, match any number of characters other than square brackets or backslashes, or any number of escaped characters. Here's the actual regex:

(?<!\\)(?:\\.)*+\[\[((?:[^\[\]\\]++|\\.)*+)\]\]

...and here it is as a Java string literal:

"(?<!\\\\)(?:\\\\.)*+\\[\\[((?:[^\\[\\]\\\\]++|\\\\.)*+)\\]\\]"
Alan Moore
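
A minimal sketch of how the string literal above might be tried out. EscapeAwareSketch and firstToken are hypothetical names; the regex is copied from the answer. Unlike the pattern in the question, group 1 here holds only the text between the brackets, not the brackets themselves.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeAwareSketch {
    // (?:\\.)*+ possessively eats backslash-plus-character pairs before the
    // token, so an odd number of backslashes always swallows the first bracket.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<!\\\\)(?:\\\\.)*+\\[\\[((?:[^\\[\\]\\\\]++|\\\\.)*+)\\]\\]");

    // Hypothetical helper: returns the token content (without the brackets),
    // or "NOT FOUND" when nothing matches.
    static String firstToken(String input) {
        Matcher m = TOKEN.matcher(input);
        return m.find() ? m.group(1) : "NOT FOUND";
    }
}
```

Under this reading, foo\\[[blah]]bar yields blah (the two backslashes pair up as one escaped backslash), foo\[[blah]]bar yields nothing, and foo[[blah\]]]bar yields blah\] with the escaped bracket kept as part of the content.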