ansaurus

Question

Regular Expression for Stripping Strings from Source Code

Answer 1

A:

Maybe:

re.sub(r"[^\"]\"[^\"].*[^\"]\"[^\"]", '"string"', input)

EDIT:

No, that won't work for the final example.

I don't think your requirements are regular: they can't be matched by a regular expression. This is because at the heart of the matter, you need to match any odd number of " grouped together, as that is your delimiter.

I think you'll have to do it manually, counting "s.
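As a rough illustration of the manual approach, here is a minimal sketch of a scanner that walks the input and tracks whether it is inside a literal. It assumes that a doubled quote ("") inside a literal stands for one escaped quote and that there is no other escape mechanism; the function name strip_strings and the placeholder argument are made up for this example.

```python
def strip_strings(source, placeholder='"string"'):
    """Replace every double-quoted literal in `source` with `placeholder`.

    Assumption: inside a literal, "" is an escaped quote; nothing else
    is treated as an escape.
    """
    out = []
    i = 0
    n = len(source)
    while i < n:
        if source[i] != '"':
            # Ordinary character outside any literal: copy it through.
            out.append(source[i])
            i += 1
            continue
        # Opening quote found: skip ahead to the matching closing quote.
        i += 1
        while i < n:
            if source[i] == '"':
                if i + 1 < n and source[i + 1] == '"':
                    i += 2  # doubled quote, still inside the literal
                    continue
                i += 1  # lone quote, the literal ends here
                break
            i += 1
        # An unterminated literal is swallowed up to the end of the input.
        out.append(placeholder)
    return "".join(out)
```

For instance, strip_strings('s = "say ""hi"" now"; t = ""') yields 's = "string"; t = "string"', so an empty literal is replaced as well.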

Douglas Leeder 2009-05-27 10:13:18

Answer 2

A:

There's a very good string-matching regular expression over at ActiveState. If it doesn't work out of the box for your last example, it should be fairly trivial to add a repetition that groups adjacent quoted strings together.

PAG 2009-05-27 14:28:38

Answer 3

+3 A:

If you're parsing Python code, save yourself the hassle and let the standard library's parser module do the heavy lifting.

If you're writing your own parser for some custom language, it's awfully tempting to start out by just hacking together a bunch of regexes, but don't do it. You'll dig yourself into an unmaintainable mess. Read up on parsing techniques and do it right (Wikipedia can help).

This regex does the trick for all three of your examples:

re.sub(r'"(?:""|[^"])+"', '"string"', original)
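To make the behaviour of that pattern concrete, here is a small self-contained sketch. The sample inputs are invented for illustration (the question's three examples are not reproduced here), and the names STRING_RE and strip_with_regex are hypothetical.

```python
import re

# Same pattern as above: an opening quote, then one or more of either a
# doubled quote ("") or any non-quote character, then a closing quote.
STRING_RE = re.compile(r'"(?:""|[^"])+"')


def strip_with_regex(original):
    """Replace each matched string literal with the placeholder "string"."""
    return STRING_RE.sub('"string"', original)


# Hypothetical sample inputs, not taken from the question.
samples = [
    'print "hello"',
    'x = "a ""quoted"" word"',
    'a = "x" + "y"',
]

for sample in samples:
    print(strip_with_regex(sample))
```

One thing to keep in mind: because the group uses +, a standalone empty literal such as x = "" is not matched and is left unchanged.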

Carl Meyer 2009-05-27 15:04:10

I'm not parsing Python and I'm aware of the challenges of parsing. I do not intend to parse using regexes, but only strip the strings before parsing to make my parsing simpler.

Roee Adler 2009-05-27 16:04:29

Fair enough, added a regex which I think does what you need.

Carl Meyer 2009-05-27 18:44:04

@Carl Meyer: For performance, I'd recommend using non-capturing groups and removing the first quantifier, to prevent the quantifiers from "fighting" in ambiguous cases: r'"(?:""|[^"])+"'

Ben Blank 2009-05-27 18:59:50

Very good points, @Ben Blank, thanks. Editing to include your suggestions.

Carl Meyer 2009-05-27 22:14:48

"strip the strings before parsing to make my parsing simpler"?? Your lexer will still need to recognise "string" as a string constant ... so why not handle all the varieties of string constant forms in your lexer? BTW, what about "embedded \" quote" ?
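If the language escapes quotes with a backslash instead of doubling them, the pattern would need a different alternation. A minimal sketch, assuming backslash escapes only and literals that stay on one line (the names BACKSLASH_STRING_RE and strip_backslash_strings are made up for this example):

```python
import re

# An opening quote, then any number of either a backslash followed by one
# character (an escape) or any character that is neither a quote nor a
# backslash, then the closing quote.
BACKSLASH_STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"')


def strip_backslash_strings(original):
    """Replace backslash-escaped string literals with "string"."""
    return BACKSLASH_STRING_RE.sub('"string"', original)


print(strip_backslash_strings('s = "embedded \\" quote"'))
```

This variant uses * rather than +, so empty literals are replaced too. It does not handle the doubled-quote convention at the same time.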

John Machin 2009-08-18 15:40:44
