Hello,

I'm looking for a regular expression that replaces every string literal in an input source code with a constant string value such as "string". It must also handle the escaping convention in which a quote character inside a string is written as two adjacent quote characters (e.g. "he said ""hello""").

To clarify, I will provide some examples of input and expected output:

input: print("hello world, how are you?")
output: print("string")

input: print("hello" + "world")
output: print("string" + "string")

# here's the tricky part:
input: print("He told her ""how you doin?"", and she said ""I'm fine, thanks""")
output: print("string")

I'm working in Python, but I guess this is language agnostic.

EDIT: According to one of the answers, this requirement may not be a good fit for a regular expression. I'm not sure that's true, but I'm not an expert. Phrased in words, what I'm looking for is to find runs of characters that are between double quotes, where even-length groups of adjacent double quotes inside the string are disregarded. That sounds to me like something a DFA can recognize.
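The wording above maps onto a small state machine. Here is a minimal sketch in Python, assuming three states; the state names and the function name `replace_strings` are made up for illustration and come from neither the question nor any library:

```python
def replace_strings(source, placeholder='"string"'):
    """Replace each double-quoted literal with `placeholder`.

    Inside a literal, "" counts as one escaped quote character.
    """
    OUTSIDE, INSIDE, QUOTE_SEEN = range(3)  # hypothetical state names
    state = OUTSIDE
    out = []
    literal_start = 0  # index of the opening quote of the current literal

    for i, ch in enumerate(source):
        if state == OUTSIDE:
            if ch == '"':
                state = INSIDE
                literal_start = i
            else:
                out.append(ch)
        elif state == INSIDE:
            if ch == '"':
                # Either the closing quote or the first half of "".
                state = QUOTE_SEEN
            # Any other character is literal content and is dropped.
        else:  # QUOTE_SEEN
            if ch == '"':
                state = INSIDE  # "" was an escaped quote, still inside
            else:
                out.append(placeholder)  # the previous quote closed the literal
                out.append(ch)
                state = OUTSIDE

    if state == QUOTE_SEEN:
        out.append(placeholder)  # input ended right after a closing quote
    elif state == INSIDE:
        out.append(source[literal_start:])  # unterminated: keep it verbatim

    return "".join(out)


print(replace_strings('print("hello" + "world")'))
```

Three states and one pass over the input are enough, which supports the intuition that the language is regular. Keeping an unterminated literal verbatim is a design choice of this sketch, not something the question specifies.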

Thanks.

A: 

Maybe:

re.sub(r"[^\"]\"[^\"].*[^\"]\"[^\"]",'"string"',input)

EDIT:

No that won't work for the final example.

I don't think your requirements are regular: they can't be matched by a regular expression. This is because at the heart of the matter, you need to match any odd number of " grouped together, as that is your delimiter.

I think you'll have to do it manually, counting "s.

Douglas Leeder
A: 

There's a very good string-matching regular expression over at ActiveState. If it doesn't handle your last example as-is, it should be fairly trivial to add a repetition that groups adjacent quoted strings together.

PAG
+3  A: 

If you're parsing Python code, save yourself the hassle and let the standard library's parser module do the heavy lifting.
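For the Python-source case, here is a rough sketch of the idea. Note that it swaps in the standard library's `tokenize` module rather than `parser` (the `parser` module was removed in Python 3.10). The function name `blank_python_strings` and the placeholder are illustrative assumptions, and on Python 3.12+ f-strings are tokenized differently and are not covered here:

```python
import io
import tokenize


def blank_python_strings(source, placeholder='"string"'):
    """Replace every STRING token in Python source, keeping the layout."""
    # Character offset at which each line starts, using the same line
    # splitting that the tokenizer sees through readline().
    starts = [0]
    for line in io.StringIO(source):
        starts.append(starts[-1] + len(line))

    def offset(pos):
        row, col = pos  # tokenize rows are 1-based, columns 0-based
        return starts[row - 1] + col

    # Raises tokenize.TokenError if the source is incomplete.
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    spans = [
        (offset(tok.start), offset(tok.end))
        for tok in tokens
        if tok.type == tokenize.STRING
    ]

    # Replace from the end so earlier offsets stay valid.
    for begin, end in reversed(spans):
        source = source[:begin] + placeholder + source[end:]
    return source


print(blank_python_strings('print("hello" + \'world\')  # "not a string"\n'))
```

Because the tokenizer knows about comments, prefixes, and triple-quoted literals, quoted text inside a comment is left alone, which a plain regex would not guarantee.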

If you're writing your own parser for some custom language, it's awfully tempting to start out by just hacking together a bunch of regexes, but don't do it. You'll dig yourself into an unmaintainable mess. Read up on parsing techniques and do it right (Wikipedia can help).

This regex does the trick for all three of your examples:

re.sub(r'"(?:""|[^"])+"', '"string"', original)
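As a quick check, here is the same pattern run against the three inputs from the question. The names `STRING_LITERAL`, `STRING_LITERAL_OR_EMPTY`, and `strip_strings` are only illustrative, and the `*` variant is a possible variation rather than part of the answer:

```python
import re

# The pattern from the answer: an opening quote, then one or more of
# either a doubled quote ("") or any non-quote character, then a closing quote.
STRING_LITERAL = re.compile(r'"(?:""|[^"])+"')

# Hypothetical variation: * instead of + so the empty literal "" also matches.
STRING_LITERAL_OR_EMPTY = re.compile(r'"(?:""|[^"])*"')


def strip_strings(source, pattern=STRING_LITERAL, placeholder='"string"'):
    """Replace every match of `pattern` in `source` with `placeholder`."""
    return pattern.sub(placeholder, source)


examples = [
    'print("hello world, how are you?")',
    'print("hello" + "world")',
    'print("He told her ""how you doin?"", and she said ""I\'m fine, thanks""")',
]
for example in examples:
    print(strip_strings(example))
```

One caveat: because of the `+`, an empty literal such as `print("")` is left untouched by the original pattern. If empty strings can occur in the input, the `*` variant may be the safer choice.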
Carl Meyer
I'm not parsing Python and I'm aware of the challenges of parsing. I do not intend to parse using regexes, but only strip the strings before parsing to make my parsing simpler.
Roee Adler
Fair enough, added a regex which I think does what you need.
Carl Meyer
@Carl Meyer: For performance, I'd recommend using non-capturing groups and removing the first quantifier, to prevent the quantifiers from "fighting" in ambiguous cases: r'"(?:""|[^"])+"'
Ben Blank
Very good points, @Ben Blank, thanks. Editing to include your suggestions.
Carl Meyer
"strip the strings before parsing to make my parsing simpler"?? Your lexer will still need to recognise "string" as a string constant ... so why not handle all the varieties of string constant forms in your lexer? BTW, what about "embedded \" quote" ?
John Machin