What I'm trying to do: remove innermost unescaped square brackets surrounding a specific, unescaped character (\ is escape)

input: [\[x\]]\]\[[\[y\]]
output when looking for brackets around y: [\[x\]]\]\[\[y\]
output when looking for brackets around x: \[x\]\]\[[\[y\]]

In short, remove only the unescaped set of brackets around the specific character.

I tried this (for y): Regex.Replace(input, @"(?<!\\)\[(.*?(?<!\\)y.*?)(?<!\\)\]", @"$1"), but that seems to pair the first unescaped [ (before the x) with the last ]. I figured I could replace the . wildcards with a negated character class that excludes [ and ], but what I really need to exclude is the unescaped versions of these, and when I try to put a negative lookbehind like (?<!\\) inside the negated character class, I seem to match nothing at all.

Thanks in advance for your time and effort.

edit:

To clarify, the contents of the unescaped square brackets can be anything (except another unescaped square bracket), as long as they contain the unescaped character of interest (y). All the content of the brackets should remain.

+1  A: 

Edited after question was edited

    Regex.Replace(input, @"((?<!\\)\[(?=((\\\[)|[^[])*((?<!\\)y)))|((?<=[^\\]y((\\\]|[^]]))*)(?<!\\)\])","");

We want to match the brackets to be removed:

(?<!\\)\[ - Match is an unescaped left bracket
(?=((\\\[)|[^[])*((?<!\\)y)) - Match is followed by any number of (escaped left brackets or non-left brackets) followed by an unescaped y

| - OR

(?<=[^\\]y((\\\]|[^]]))*) - Match is preceded by unescaped y followed by any number of (escaped right brackets or non-right brackets)
(?<!\\)\] - Match is an unescaped right bracket
mbeckish
Thank you for looking at this! The reason I have `.` in there now is that the contents of the unescaped brackets can be anything (except, of course, another unescaped bracket), as long as there is an unescaped `x` or `y` in there somewhere. So, if the input is `[\[x\]]\]\[[1234(\[abcycba\]\y\y)]`, the output should be `[\[x\]]\]\[1234(\[abcycba\]\y\y)` (**only** the unescaped brackets containing an unescaped `y` are removed). I'll edit the question to clarify.
Jay
Probably some extra parentheses in there that can be removed.
mbeckish
+2  A: 

Writing a regex for this might be overly complex for the problem. While this function is a bit lengthy, it's conceptually simple and does the trick:

    string FixString(char x, string original)
    {
        int i = 0;
        string s = original;
        while (i < s.Length)
        {
            // Only react to an unescaped occurrence of x.
            if ((s[i] == x) &&
                ((i == 0) || (s[i - 1] != '\\')))
            {
                bool found = false;
                for (int j = i + 1; (j < s.Length) && !found; j++)
                {
                    if ((s[j] == ']') &&
                        (s[j-1] != '\\'))
                    {
                        s = s.Remove(j, 1);
                        found = true;
                    }
                }
                if (i > 0)
                {
                    found = false;
                    for (int j = i - 1; (j >= 0) && !found; j--)
                    {
                        if ((s[j] == '[') &&
                            ( (j == 0) ||
                              (s[j - 1] != '\\') ))
                        {
                            s = s.Remove(j, 1);
                            i--;
                            found = true;
                        }
                    }
                }
            }
            i++;
        }

        return s;
    }
Mark Synowiec
+1 for the perspective check. Regex may or may not be a required skill (as some have argued), but every programmer **must** be able to solve a problem like this "longhand", as you did. But I'd use a StringBuilder, and avoid using `Remove` or equivalent. :-)
Alan Moore
+1 because I'm usually the one telling people to chill with the insane application of regex. Thank you.
Jay
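Following up on the StringBuilder suggestion in the comments, here is a minimal single-pass sketch that avoids `Remove` altogether. It is not the code from the answer above: the name `RemoveInnermostBrackets` and the two-buffer approach are assumptions made for illustration, and it assumes the target is an ordinary character (not a bracket or a backslash). Text after an open unescaped `[` is held in a pending buffer until it is clear whether that pair should be kept or dropped.

```csharp
using System;
using System.Text;

// Hypothetical helper: drops the innermost unescaped [ ] pair(s) that
// contain an unescaped occurrence of target. Backslash is the escape.
string RemoveInnermostBrackets(string input, char target)
{
    var output = new StringBuilder(input.Length);
    var pending = new StringBuilder();  // text after the currently open '['
    bool hasOpen = false;               // is an unescaped '[' waiting for its ']'?
    bool seen = false;                  // unescaped target seen since that '['?

    for (int i = 0; i < input.Length; i++)
    {
        char c = input[i];
        StringBuilder sink = hasOpen ? pending : output;

        if (c == '\\' && i + 1 < input.Length)
        {
            // Copy the escape pair verbatim; it never counts as a bracket or target.
            sink.Append(c).Append(input[++i]);
        }
        else if (c == '[')
        {
            // A new '[' makes any earlier open '[' non-innermost: emit it as-is.
            if (hasOpen) output.Append('[').Append(pending.ToString());
            pending.Clear();
            hasOpen = true;
            seen = false;
        }
        else if (c == ']' && hasOpen)
        {
            if (seen) output.Append(pending.ToString());                     // drop both brackets
            else output.Append('[').Append(pending.ToString()).Append(']');  // keep the pair
            pending.Clear();
            hasOpen = false;
        }
        else
        {
            if (hasOpen && c == target) seen = true;
            sink.Append(c);
        }
    }

    // An unclosed '[' at the end of the input is emitted unchanged.
    if (hasOpen) output.Append('[').Append(pending.ToString());
    return output.ToString();
}

Console.WriteLine(RemoveInnermostBrackets(@"[\[x\]]\]\[[\[y\]]", 'y'));
```

Because escape pairs are consumed two characters at a time, an escaped `\y` is never mistaken for the target, and a pair is only dropped when both of its brackets have been found.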
+2  A: 

Lookbehind is the wrong tool for this job. Try this instead:

    Regex r = new Regex(
      @"\[((?>(?:[^y\[\]\\]|\\.)*)y(?>(?:[^\[\]\\]|\\.)*))\]");

    string s1 = @"[\[x\]]\]\[[\[y\]]";
    Console.WriteLine(s1);
    Console.WriteLine(r.Replace(s1, @"%$1%"));

    Console.WriteLine();

    string s2 = @"[\[x\]]\]\[[1234(\[abcycba\]\y\y)]";
    Console.WriteLine(s2);
    Console.WriteLine(r.Replace(s2, @"%$1%"));

result:

    [\[x\]]\]\[[\[y\]]
    [\[x\]]\]\[%\[y\]%

    [\[x\]]\]\[[1234(\[abcycba\]\y\y)]
    [\[x\]]\]\[%1234(\[abcycba\]\y\y)%

(I replaced the brackets with % instead of deleting them to make it easier to see exactly what's getting replaced.)
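To actually delete the brackets, the replacement becomes `$1`. A minimal sketch that also builds the same pattern for an arbitrary target character might look like this. The helper name `RemoveBracketsAround` is made up for illustration, and it assumes the target is an ordinary character (not a bracket or a backslash):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the original answer): the same pattern as
// above with the target character spliced in, keeping only group 1.
// Assumes target is not '[', ']' or '\'.
string RemoveBracketsAround(string input, char target)
{
    // Escape the target so characters like '.' or '^' are taken literally.
    string t = Regex.Escape(target.ToString());
    string pattern =
        @"\[((?>(?:[^" + t + @"\[\]\\]|\\.)*)" + t + @"(?>(?:[^\[\]\\]|\\.)*))\]";
    return Regex.Replace(input, pattern, "$1");
}

Console.WriteLine(RemoveBracketsAround(@"[\[x\]]\]\[[\[y\]]", 'y'));
Console.WriteLine(RemoveBracketsAround(@"[\[x\]]\]\[[\[y\]]", 'x'));
```

For the question's input, this should print `[\[x\]]\]\[\[y\]` and then `\[x\]\]\[[\[y\]]`, which are the two outputs the question asks for.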

(?:[^y\[\]\\]|\\.)* matches zero or more of (1) anything that's not a 'y', a square bracket or a backslash, or (2) a backslash followed by any character. If the next character is a 'y', it gets consumed and (?:[^\[\]\\]|\\.)* matches any remaining characters up to the next unescaped bracket. Including both brackets in the negated character class (along with the backslash) ensures that you only match the innermost set of unescaped brackets.

It's also vital that you use the atomic groups--i.e., (?>...); this prevents backtracking which we know would be useless, and which could cause serious performance problems when the regex is used on strings that contain no matches.
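For anyone who hasn't met atomic groups before, here is a tiny illustration of what "prevents backtracking" means. It uses a deliberately simple toy pattern rather than the bracket regex, so it only shows the semantics, not the performance effect:

```csharp
using System;
using System.Text.RegularExpressions;

// Ordinary group: "bc" is tried first, the trailing 'c' then fails, so the
// engine backtracks into the group, retries with "b", and the match succeeds.
Console.WriteLine(Regex.IsMatch("abc", @"^a(?:bc|b)c$"));   // True

// Atomic group: once the group has matched "bc" it never gives that back,
// so the trailing 'c' has nothing left to match and the whole match fails.
Console.WriteLine(Regex.IsMatch("abc", @"^a(?>bc|b)c$"));   // False

// With an extra 'c' available, no backtracking is needed and both forms match.
Console.WriteLine(Regex.IsMatch("abcc", @"^a(?>bc|b)c$"));  // True
```

In the bracket regex, the same rule means that once a run of non-bracket characters has been consumed, the engine will not go back and try shorter runs that could not lead to a match anyway.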

An alternative approach would use a lookahead to assert the presence of the 'y' and then use the much simpler (?>(?:\\.|[^\[\]\\])*) to consume the characters between the brackets. The problem is that you're now making two passes over the string, and it can be tricky making sure the lookahead doesn't look too far ahead, or not far enough. Doing all the work in one pass makes it much easier to keep track of where you are at each stage of the matching process.
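One possible sketch of that lookahead variant, for comparison. This is an assumption about how it could be written, not a pattern taken from the answer, and the name `RemoveWithLookahead` is hypothetical. The lookahead reuses the 'y'-excluding class so that it cannot scan past an unescaped bracket:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: lookahead-based variant for the character 'y'.
string RemoveWithLookahead(string input)
{
    // Pass 1 (lookahead): an unescaped 'y' must appear before any unescaped bracket.
    // Pass 2 (group 1): walk the same text again, capturing everything up to the ']'.
    const string pattern =
        @"\[(?=(?>(?:\\.|[^y\[\]\\])*)y)((?>(?:\\.|[^\[\]\\])*))\]";
    return Regex.Replace(input, pattern, "$1");
}

Console.WriteLine(RemoveWithLookahead(@"[\[x\]]\]\[[\[y\]]"));
```

The bracket contents are scanned twice here (once inside the lookahead, once by group 1), which is the extra pass mentioned above.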

Alan Moore
Wow. +1 for effort, thoroughness, +1-ing the non-regex solution, and for prompting me to finally get my head around atomic groups (is it fair to say that atomic grouping is the regex equivalent of the `||` operator?). This looks good; I'll run it through my unit tests tomorrow and accept accordingly. Thanks.
Jay
If I were to compare `||` to anything, it would be alternation: in Perl-compatible regex flavors like .NET's, alternation is short-circuiting like `||`, while in DFA or POSIX flavors, all alternatives are always checked. What atomic groups remind me of are the power pellets in Pac-Man. :P
Alan Moore