ansaurus

Question

Parsing in Python: what's the most efficient way to supress/normalize strings?

Answer 1

+4 A:

Option 1: To sanitize Python source code, try the built-in tokenize module. It can correctly find strings and other tokens in any Python source file.

Option 3: Use pygments with HTML output, and replace anything in blue (etc.) with "string". pygments supports a few dozen languages.

Option 2: For most of the languages, you can build a custom regexp substitution. For example, the following sanitizes Python source code (but it doesn't work if the source file contains """ or '''):

import re
sanitized = re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
    lambda match: match.group(1) or '"string"', source_code)

The regexp above works properly even if the strings contain backslashes (\", \\, \n, \\, \\", \\\" etc. all work fine).

When you are building your regexp, make sure to match comments (so your regexp substitution won't touch strings inside comments) and regular expression literals (e.g. in Perl, Ruby and JavaScript), and pay attention you match backslashes and newlines properly (e.g. in Perl and Ruby a string can contain a newline).

pts 2009-05-11 06:01:22

I thought the built in tokenize is only useful for Python code, and not for C++/Java/etc?

Roee Adler 2009-05-11 06:05:56

Sure, tokenize is only for Python code. Added some more tips for code in other languages.

pts 2009-05-11 06:13:27

The code does not work, I ran it on e.g. print("hello") and it returned print("hello"). Any suggestions?

Roee Adler 2009-05-12 14:36:32

@pts: Sorry, still does not work on 'print("hello")'. Thanks.

Roee Adler 2009-05-12 19:36:56

Edited my answer: changed to match.group(1). Now it should work.

pts 2009-05-15 08:24:34

@pts: Thanks, it works better, solves almost all cases, but does not deal with double "". For example: print("hel""lo") turns into print("string""string") instead of print("string")

Roee Adler 2009-05-15 09:10:38

Answer 2

+1 A:

Nowhere do you mention that you take an approach using a lexer and parser. If in fact you do not, have a look at e.g. the tokenize module (which is probably what you want), or the 3rd party module PLY (Python Lex-Yacc). Your problem needs a systematic approach, and these tools (and others) provide it.

(Note that once you have tokenized the code, you can apply another specialized tokenizer to the contents of the strings to detect special formatting directives such as %s. In this case a regular expression may do the job, though.)

Stephan202 2009-05-11 06:06:40

Answer 3

+1 A:

Use a dedicated parser for each language — especially since people have already done that work for you. Most of the languages you mentioned have a grammar.

a paid nerd 2009-05-11 06:08:02

I would hope they *all* have a grammar ;)

Stephan202 2009-05-11 06:10:36

Short for CFG. Perl doesn't.

a paid nerd 2009-05-11 06:29:29

ansaurus

tags:

views:

answers:

Parsing in Python: what's the most efficient way to supress/normalize strings?

related questions