views:

732

answers:

9

I need to replace character (say) x with character (say) P in a string, but only if it is contained in a quoted substring. An example makes it clearer:

axbx'cxdxe'fxgh'ixj'k  -> axbx'cPdPe'fxgh'iPj'k

Let's assume, for the sake of simplicity, that quotes always come in pairs.

The obvious way is to just process the string one character at a time (a simple state machine approach);
however, I'm wondering if regular expressions can be used to do all the processing in one go.

My target language is C#, but I guess my question pertains to any language having builtin or library support for regular expressions.

+2  A: 

Not with plain regexp. Regular expressions have no "memory" so they cannot distinguish between being "inside" or "outside" quotes.

You need something more powerful, for example using gema it would be straighforward:

'<repl>'=$0
repl:x=P
Remo.D
+1  A: 

Sorry to break your hopes, but you need a push-down automata to do that. There is more info here: Pushdown Automaton

In short, Regular expressions, which are finite state machines can only read and has no memory while pushdown automaton has a stack and manipulating capabilities.

Edit: spelling...

Tobias Wärre
+8  A: 

I was able to do this with Python:

>>> import re
>>> re.sub(r"x(?=[^']*'([^']|'[^']*')*$)", "P", "axbx'cxdxe'fxgh'ixj'k")
"axbx'cPdPe'fxgh'iPj'k"

What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string.

This relies on your assumption that the quotes are always balanced. This is also not very efficient.

Greg Hewgill
Consider also that it's the re.sub() function that iterates over the string. the regular expression itself only matches the first x within quotes.
Remo.D
I can't imagine how you would even represent the solution to this problem without using something like re.sub(). After all, a regular expression by itself only matches, and the original question asked about replacing.
Greg Hewgill
you're right: it can't be done witough some "extra machinery", that's why many answered that "plain regular expressions" where not powerful enough
Remo.D
I stole your code and converted into C#. I hope you don't mind. Thanks. :)
jop
I guess that's what I was looking for - a way to iterate over regex matches. Thanks!
Cristi Diaconescu
+1  A: 

Similar discussion about balanced text replaces: http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns#133771

Although you can try this in Vim, but it works well only if the string is on one line, and there's only one pair of 's.

:%s:\('[^']*\)x\([^']*'\):\1P\2:gci

If there's one more pair or even an unbalanced ', then it could fail. That's way I included the c a.k.a. confirm flag on the ex command.

The same can be done with sed, without the interaction - or with awk so you can add some interaction.

One possible solution is to break the lines on pairs of 's then you can do with vim solution.

Zsolt Botykai
I think this will only replace the first x that it found inside the quotes. The subsequent x's will be matched by the non-apostrophe pattern [^']*
jop
You are right. But it can be modified to replace all the x's.
Zsolt Botykai
See the comments in http://stackoverflow.com/questions/138552?sort=votes#138615
jop
+8  A: 

I converted Greg Hewgill's python code to C# and it worked!

[Test]
public void ReplaceTextInQuotes()
{
  Assert.AreEqual("axbx'cPdPe'fxgh'iPj'k", 
    Regex.Replace("axbx'cxdxe'fxgh'ixj'k",
      @"x(?=[^']*'([^']|'[^']*')*$)", "P"));
}

That test passed.

jop
+1  A: 
MizardX
+2  A: 

The trick is to use non-capturing group to match the part of the string following the match (character x) we are searching for. Trying to match the string up to x will only find either the first or the last occurence, depending whether non-greedy quantifiers are used. Here's Greg's idea transposed to Tcl, with comments.

set strIn {axbx'cxdxe'fxgh'ixj'k}
set regex {(?x)                     # enable expanded syntax 
                                    # - allows comments, ignores whitespace
            x                       # the actual match
            (?=                     # non-matching group
                [^']*'              # match to end of current quoted substring
                                    ##
                                    ## assuming quotes are in pairs,
                                    ## make sure we actually were 
                                    ## inside a quoted substring
                                    ## by making sure the rest of the string 
                                    ## is what we expect it to be
                                    ##
                (
                    [^']*           # match any non-quoted substring
                    |               # ...or...
                    '[^']*'         # any quoted substring, including the quotes
                )*                  # any number of times
                $                   # until we run out of string :)
            )                       # end of non-matching group
}

#the same regular expression without the comments
set regexCondensed {(?x)x(?=[^']*'([^']|'[^']*')*$)}

set replRegex {P}
set nMatches [regsub -all -- $regex $strIn $replRegex strOut]
puts "$nMatches replacements. "
if {$nMatches > 0} {
    puts "Original: |$strIn|"
    puts "Result:   |$strOut|"
}
exit

This prints:

3 replacements. 
Original: |axbx'cxdxe'fxgh'ixj'k|
Result:   |axbx'cPdPe'fxgh'iPj'k|
Cristi Diaconescu
+1  A: 
#!/usr/bin/perl -w

use strict;

# Break up the string.
# The spliting uses quotes
# as the delimiter.
# Put every broken substring
# into the @fields array.

my @fields;
while (<>) {
    @fields = split /'/, $_;
}

# For every substring indexed with an odd
# number, search for x and replace it
# with P.

my $count;
my $end = $#fields;
for ($count=0; $count < $end; $count++) {
    if ($count % 2 == 1) {
        $fields[$count] =~ s/a/P/g;
    }    
}

Wouldn't this chunk do the job?

+2  A: 

A more general (and simpler) solution which allows non-paired quotes.

  1. Find quoted string
  2. Replace 'x' by 'P' in the string

    #!/usr/bin/env python
    import re
    
    
    text = "axbx'cxdxe'fxgh'ixj'k"
    
    
    s = re.sub("'.*?'", lambda m: re.sub("x", "P", m.group(0)), text)
    
    
    print s == "axbx'cPdPe'fxgh'iPj'k", s
    # ->   True axbx'cPdPe'fxgh'iPj'k
    
J.F. Sebastian