Hello,

I'm parsing a source file, and I want to "suppress" strings. By this I mean transforming every string like "bla bla bla +/*" into something like "string" that is deterministic and contains no characters that might confuse my parser, because I don't care about the values of the strings. One of the issues here is string formatting with the likes of "%s"; please see my remark about this below.

Take for example the following pseudo-code, which may be the contents of a file I'm parsing. Assume strings are delimited by ", and the " character is escaped by doubling it (""):

print(i)
print("hello**")
print("hel"+"lo**")
print("h e l l o "+
"hello\n")
print("hell""o")
print(str(123)+"h e l l o")
print(uppercase("h e l l o")+"g o o d b y e")

Should be transformed to the following result:

print(i)
print("string")
print("string"+"string")
print("string"
"string")
print("string")
print(str(123)+"string")
print(uppercase("string")+"string")

Currently I treat it as a special case in the code (i.e. detect the beginning of a string and "manually" run until its end, with several sub-special cases along the way). If there's a Python library function I can use, or a nice regex that may make my code more efficient, that would be great.
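
For concreteness, here is a minimal sketch of the kind of manual scan I mean (the name suppress is just illustrative; it assumes the escape-by-doubling rule from the examples above):

def suppress(text, quote='"'):
    # Copy characters through until a quote opens a string, then skip
    # to its end; a doubled quote ("") inside a string is an escape.
    out, i, n = [], 0, len(text)
    while i < n:
        if text[i] != quote:
            out.append(text[i])
            i += 1
            continue
        i += 1                          # consume the opening quote
        while i < n:
            if text[i] == quote and i + 1 < n and text[i + 1] == quote:
                i += 2                  # "" -> still inside the string
            elif text[i] == quote:
                i += 1                  # closing quote
                break
            else:
                i += 1
        out.append(quote + 'string' + quote)
    return ''.join(out)

Running this over the example above gives exactly the output I listed, including collapsing "hell""o" into a single "string".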

A few remarks:

  • I would like the "start-of-string" character to be a variable, e.g. ' vs ".
  • I'm not parsing Python code at this stage, but I plan to, and there the problem obviously becomes more complex because strings can start in several ways and must end in a way corresponding to the start. I'm not attempting to deal with this right now, but if there's any well-established best practice I would like to know about it.
  • The thing bothering me most about this "suppression" is string formatting with the likes of '%s', which are meaningful tokens. I'm currently not dealing with this and haven't completely thought it through, but any suggestions on how to handle it would be great. Note that I'm not interested in the specific type or formatting of the in-string tokens; it's enough for me to know that there are tokens inside the string, and how many. A remark that may be important here: my tokenizer is not nested, because my goal is quite simple (I'm not compiling anything...).
  • I'm not quite sure about the escaping of the start-of-string character. What would you say are the common ways this is implemented in programming languages? Is it enough to assume double occurrence (e.g. "") or a two-character escape sequence (e.g. '\"')? Do I need to treat other cases (think of languages like Java, C/C++, PHP, C#)?

Thanks, Rax

+4  A: 

Option 1: To sanitize Python source code, try the built-in tokenize module. It can correctly find strings and other tokens in any Python source file.
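
For instance, a minimal sketch of that idea (it feeds (type, string) pairs back to untokenize, which ignores the original token positions, so the output's whitespace may differ slightly from the input's):

import io
import tokenize

def suppress_strings(source):
    # Replace every STRING token in the stream, then rebuild the source.
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = '"string"' if tok.type == tokenize.STRING else tok.string
        out.append((tok.type, text))
    return tokenize.untokenize(out)

# Tokenizes the same as print(str(123)+"string"), spacing aside:
print(suppress_strings('print(str(123)+"h e l l o")\n'))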

Option 2: Use pygments with HTML output, and replace anything in blue (etc.) with "string". pygments supports a few dozen languages.
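
A rough sketch of the same idea that drives the lexer directly instead of going through HTML (note that pygments emits a single literal as several consecutive String tokens, so each run of them is collapsed into one "string"):

from pygments.lexers import get_lexer_by_name
from pygments.token import String

def suppress_strings(code, language='python'):
    out, in_string = [], False
    for tok_type, value in get_lexer_by_name(language).get_tokens(code):
        if tok_type in String:
            if not in_string:           # first token of a string run
                out.append('"string"')
            in_string = True
        else:
            out.append(value)
            in_string = False
    return ''.join(out)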

Option 3: For most languages you can build a custom regexp substitution. For example, the following sanitizes Python source code (but it doesn't work if the source file contains """ or '''):

import re
# Group 1 captures comments so they are kept verbatim; any matched
# single- or double-quoted string literal is replaced wholesale.
sanitized = re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
    lambda match: match.group(1) or '"string"', source_code)

The regexp above works properly even if the strings contain backslashes (\", \\, \n, \\", \\\" etc. all work fine).
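
A quick check of the behavior (the sample input is mine):

source_code = 'print("hel" + \'lo\')  # a comment with "quotes"'
print(re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
    lambda match: match.group(1) or '"string"', source_code))
# print("string" + "string")  # a comment with "quotes"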

When you are building your regexp, make sure to match comments (so your substitution won't touch strings inside comments) and regular expression literals (e.g. in Perl, Ruby and JavaScript), and pay attention to matching backslashes and newlines properly (e.g. in Perl and Ruby a string can contain a newline).

pts
I thought the built in tokenize is only useful for Python code, and not for C++/Java/etc?
Roee Adler
Sure, tokenize is only for Python code. Added some more tips for code in other languages.
pts
The code does not work; I ran it on e.g. print("hello") and it returned print("hello") unchanged. Any suggestions?
Roee Adler
@pts: Sorry, still does not work on 'print("hello")'. Thanks.
Roee Adler
Edited my answer: changed to match.group(1). Now it should work.
pts
@pts: Thanks, it works better, solves almost all cases, but does not deal with double "". For example: print("hel""lo") turns into print("string""string") instead of print("string")
Roee Adler
+1  A: 

Nowhere do you mention whether you take an approach using a lexer and parser. If in fact you do not, have a look at e.g. the tokenize module (which is probably what you want) or the third-party module PLY (Python Lex-Yacc). Your problem calls for a systematic approach, and these tools (and others) provide it.
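
To illustrate the PLY route, a minimal two-token lexer (the token names, the backslash-escape rule and the catch-all CODE token are assumptions on my part, not something from your spec):

import ply.lex as lex

tokens = ('STRING', 'CODE')

def t_STRING(t):
    r'"(?:[^"\\]|\\.)*"'
    t.value = '"string"'                # suppress the literal's value
    return t

def t_CODE(t):
    r'[^"]+'
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('print("hel" + "lo")')
print(''.join(tok.value for tok in lexer))  # print("string" + "string")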

(Note that once you have tokenized the code, you can apply another specialized tokenizer to the contents of the strings to detect special formatting directives such as %s. In this case a regular expression may do the job, though.)
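
For instance, a sketch of that second pass which only counts directives, as you said is sufficient (the %[sdif] pattern is an assumption; extend it to whatever directives matter to you):

import re

FORMAT_DIRECTIVE = re.compile(r'%[sdif]')

def summarize(string_body):
    # "x=%s y=%d" -> "string2"; a string with no directives stays "string".
    n = len(FORMAT_DIRECTIVE.findall(string_body))
    return '"string%d"' % n if n else '"string"'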

Stephan202
+1  A: 

Use a dedicated parser for each language — especially since people have already done that work for you. Most of the languages you mentioned have a grammar.

a paid nerd
I would hope they *all* have a grammar ;)
Stephan202
Short for CFG. Perl doesn't.
a paid nerd