I made an application designed to prepare files for translation using lists of regexes.

It runs each regex on the file using Regex.Replace. There is also an inspector module which allows the user to see the matches for each regex on the list.

It works well, except that when a regex contains a back-reference, Regex.Replace does not replace anything. The inspector still shows the matches properly, so I know the regex is valid and matches what it should.

sSrcRtf = Regex.Replace(sSrcRtf, sTag, sTaggedTag,
  RegexOptions.Compiled | RegexOptions.Singleline);

sSrcRtf contains the RTF code of the page. sTag contains the regular expression wrapped in parentheses. sTaggedTag contains $1 surrounded by the tag formatting code.

To give an example:

sSrcRtf = Regex.Replace("the little dog", @"((e).*?\1)", "$1",
  RegexOptions.Compiled | RegexOptions.Singleline);

doesn't work. But

sSrcRtf = Regex.Replace("the little dog", "((e).*?e)", "$1", 
  RegexOptions.Compiled | RegexOptions.Singleline);

does. (Of course, in the real code there is some RTF code around $1.)

Any idea why this is?

A: 

You're using a reference to a group inside the group you're referencing.

"((e).*?\1)" // first capturing group
"(e)" // second capturing group

I'm not 100% certain, but I don't think you can reference a group from within that group. For starters, what would you expect the backreference to match, since it's not even complete yet?
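To make the numbering concrete, here is a minimal sketch (assuming C# 9 or later for top-level statements; the variable names are only for illustration). Groups are numbered by the position of their opening parenthesis, so the outer pair is group 1 and the inner (e) is group 2. In .NET a backreference to a group that has not captured anything yet never matches, which would explain why Regex.Replace leaves the input untouched.

```csharp
using System;
using System.Text.RegularExpressions;

const string input = "the little dog";

// Group 1 is the outer pair of parentheses, group 2 is the inner (e).
Match selfRef = Regex.Match(input, @"((e).*?\1)", RegexOptions.Singleline);
Match innerRef = Regex.Match(input, @"((e).*?\2)", RegexOptions.Singleline);

// \1 is used while group 1 is still open, so it has no captured value to match.
Console.WriteLine(selfRef.Success);          // False
Console.WriteLine(innerRef.Success);         // True
Console.WriteLine(innerRef.Groups[1].Value); // e little
Console.WriteLine(innerRef.Groups[2].Value); // e
```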

Welbog
A: 

You technically have two match groups there, the outer and the inner parentheses. Why don't you try addressing the inner set as the second capture, e.g.:

((e).*?\2)

Your parser probably thinks the outer capture is \1, and it doesn't make much sense to backreference it from inside itself.

Also note that your replacement won't do anything, since you are asking to replace the portion that you match with itself. I'm not sure what your intended behavior is, but if you are trying to extract just the match and discard the rest of the string, you want something like:

.*((e).*?\2).*
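A small sketch of the difference between the two patterns, assuming C# 9 or later for top-level statements. The square brackets are only a hypothetical stand-in for whatever formatting code would surround $1 in the real application.

```csharp
using System;
using System.Text.RegularExpressions;

const string input = "the little dog";

// Replacing the match with $1 alone puts the same text back, so nothing changes.
string unchanged = Regex.Replace(input, @"((e).*?\2)", "$1");

// Wrapping $1 in extra text (brackets here, as a placeholder) makes the match visible.
string tagged = Regex.Replace(input, @"((e).*?\2)", "[$1]");

// Matching the whole string and keeping only $1 discards the text around the match.
string extracted = Regex.Replace(input, @".*((e).*?\2).*", "$1");

Console.WriteLine(unchanged); // the little dog
Console.WriteLine(tagged);    // th[e little] dog
Console.WriteLine(extracted); // e little
```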
Adam Bellaire
Thanks. I didn't realize that the outer parentheses would count for a back reference within itself. For the replacement, this is just an example. In the actual code, $1 is surrounded by some RTF code which is generated depending on the type of style required. I didn't post the whole thing because it is a bit long and could distract from the issue at hand.
Sylverdrag
A: 

As others have mentioned, there are some additional groups being captured. Your backreference isn't referencing the correct one, and the replacement has to use the group number you actually want to keep.

Your current regex should be rewritten as (options elided):

Regex.Replace("the little dog", @"((e).*?\2)", "$2")
// or
Regex.Replace("the little dog", @"(e).*?\1", "$1")

Here's another example that matches doubled words and indicates which backreferences work:

Regex.Replace("the the little dog", @"\b(\w+)\s+\1\b", "$1")   // good: "the little dog"
Regex.Replace("the the little dog", @"\b((\w+)\s+\2)\b", "$1") // no good: $1 is the whole match, so nothing changes
Regex.Replace("the the little dog", @"\b((\w+)\s+\2)\b", "$2") // good: "the little dog"
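If counting parentheses gets error-prone, one alternative (not used in the examples above, so treat it as a suggestion) is .NET named groups, where \k<name> is the backreference and ${name} is the substitution. A minimal sketch, assuming C# 9 or later for top-level statements; the group names word, all and ch are made up for illustration, and the brackets are a placeholder for real formatting code:

```csharp
using System;
using System.Text.RegularExpressions;

// Doubled-word example: the name makes it obvious which group is referenced.
string deduped = Regex.Replace(
    "the the little dog", @"\b(?<word>\w+)\s+\k<word>\b", "${word}");

// Nested groups: the inner group is referenced by name, the outer one is substituted.
string tagged = Regex.Replace(
    "the little dog", @"(?<all>(?<ch>e).*?\k<ch>)", "[${all}]");

Console.WriteLine(deduped); // the little dog
Console.WriteLine(tagged);  // th[e little] dog
```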
Ahmad Mageed