Can anyone point me to a program that strips off strings from C source code? Example

#include <stdio.h>
static const char *place = "world";
char * multiline_str = "one \
two \
three\n";
int main(int argc, char *argv[])
{
        printf("Hello %s\n", place);
        printf("The previous line says \"Hello %s\"\n", place);
        return 0;
}

becomes

#include <stdio.h>
static const char *place = ;
char * multiline_str = ;
int main(int argc, char *argv[])
{
        printf(, place);
        printf(, place);
        return 0;
}

What I am looking for is a program very much like stripcmt only that I want to strip strings and not comments.

The reason I am looking for an already developed program, and not just some handy regular expression, is that once you start considering all the corner cases (quotes within strings, multi-line strings, etc.) things typically turn out to be (much) more complex than they first appear. There are also limits on what REs can achieve, and I suspect this task is beyond them. If you do think you have an extremely robust regular expression, feel free to submit it, but please no naive sed 's/"[^"]*"//g'-like suggestions.

(No need for special handling of (possibly un-ended) strings within comments, those will be removed first)

Support for multi-line strings with embedded newlines is not important (they are not legal C), but strings that continue across lines with a trailing \ must be supported.

This is almost the same as some other questions, but I found no reference to any tools in them.

+4  A: 

You can download the source code to StripCmt (.tar.gz - 5kB). It's trivially small, and shouldn't be too difficult to adapt to stripping strings instead (it's released under the GPL).

You might also want to investigate the official lexical language rules for C strings. I found this very quickly, but it might not be definitive. It defines a string as:

stringcon ::= "{ch}", where ch denotes any printable ASCII character (as specified by isprint()) other than " (double quotes) and the newline character.
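
To give an idea of what a character-by-character scanner for this job might look like, here is a minimal sketch in Python. It is not StripCmt's actual code, and the function name strip_strings is made up for illustration. It assumes comments have already been removed (as the question says), treats a backslash inside a literal as escaping whatever character follows (including a newline), and leaves character literals such as '"' untouched.

```python
def strip_strings(src: str) -> str:
    """Remove C string literals from src, keeping character literals.

    Assumption: comments were stripped beforehand, so a quote always
    starts a string or character literal.
    """
    out = []
    state = "code"  # one of "code", "string", "char"
    i = 0
    while i < len(src):
        c = src[i]
        if state == "code":
            if c == '"':
                state = "string"  # drop the opening quote
            else:
                if c == "'":
                    state = "char"
                out.append(c)
        elif c == "\\" and i + 1 < len(src):
            # Inside a literal a backslash escapes the next character,
            # which covers \" as well as backslash-newline continuation.
            if state == "char":
                out.append(src[i:i + 2])
            i += 1  # skip the escaped character
        elif state == "string":
            if c == '"':
                state = "code"  # drop the closing quote
        else:  # state == "char"
            out.append(c)
            if c == "'":
                state = "code"
        i += 1
    return "".join(out)
```

Calling strip_strings on the whole file contents (rather than line by line) is what lets it follow strings continued with a trailing backslash.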
Mark Pim
I had not thought of checking the source of stripcmt. It was simple to modify.
hlovdal
+5  A: 

All of the tokens in C (and most other programming languages) are "regular". That is, they can be matched by a regular expression.

A regular expression for C strings:

"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"

The regex isn't too hard to understand. Basically a string literal is a pair of double quotes surrounding a bunch of:

  • non-special (non-quote/backslash/newline) characters
  • escapes, which start with a backslash and then consist of one of:
    • a simple escape character
    • 1 to 3 octal digits
    • x and 1 or more hex digits

This is based on sections 6.1.4 and 6.1.3.4 of the C89/C90 spec. If anything else was added in C99, this won't catch it, but that shouldn't be hard to fix.

Here's a Python script that filters a C source file, removing string literals:

import re, sys
regex = re.compile(r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"''')
for line in sys.stdin:
  print(regex.sub('', line.rstrip('\n')))

EDIT:

It occurred to me after I posted the above that while it is true that all C tokens are regular, by not tokenizing everything we've got an opportunity for trouble. In particular, if a double quote shows up in what should be another token, we can be led down the garden path. You mentioned that comments have already been stripped, so the only other thing we really need to worry about is character literals (though the approach I'm going to use can easily be extended to handle comments as well). Here's a more robust script that handles character literals:

import re, sys
str_re = r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
chr_re = r"""'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""

regex = re.compile('|'.join([str_re, chr_re]))

def repl(m):
  m = m.group(0)
  if m.startswith("'"):
    return m
  else:
    return ''
for line in sys.stdin:
  print(regex.sub(repl, line.rstrip('\n')))

Essentially we're finding string and character literal tokens, and then leaving char literals alone but stripping out string literals. The char literal regex is very similar to the string literal one.
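
As a small illustration of why the character-literal alternative matters, the sketch below runs both patterns over one line containing '"'. The names naive, robust and keep_chars are only for this demo; the two patterns are copied from the scripts above.

```python
import re

# Patterns copied from the scripts above.
STR_RE = r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
CHR_RE = r"""'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""

naive = re.compile(STR_RE)                       # strings only
robust = re.compile("|".join([STR_RE, CHR_RE]))  # strings or char literals

def keep_chars(m):
    # Keep character literals, drop string literals.
    text = m.group(0)
    return text if text.startswith("'") else ""

line = '''if (c == '"') puts("quote");'''

naive_out = naive.sub("", line)
robust_out = robust.sub(keep_chars, line)

print(naive_out)   # if (c == 'quote");
print(robust_out)  # if (c == '"') puts();
```

The string-only pattern starts matching at the quote inside '"' and swallows everything up to the next double quote, which is exactly the garden path described in the edit.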

Laurence Gonsalves
In this case I think it would be better: ([^"\\\n]|\\.)*
hiena
Your regular expression fails to handle <<char * str = "one \<eol>two \<eol>three\n";>> where <eol> indicates a newline. This is what I meant by corner cases :)
hlovdal
Using \ to join lines is part of preprocessing, and I was ignoring that. (e.g., what if the code is <<char *a = MACRO_THAT_EXPANDS_TO_STRING_LITERAL;>> -- what do you want to do then?) If all you care about is the line-joining, you can add \n to the abfnrtv character class, and replace the for-loop with sys.stdout.write(regex.sub(repl, sys.stdin.read())). You'll also need to tweak chr_re if you're worried about line-joining inside char literals.
Laurence Gonsalves
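
A minimal sketch of the line-joining tweak described in the comment above, assuming the whole input is processed at once instead of line by line. The wrapper function strip_strings and the sample input are illustrative; only STR_RE is changed (a \n added to the simple-escape class), and CHR_RE is left as in the answer.

```python
import re

# "\n" is added to the simple-escape class so that backslash-newline
# is accepted inside a string literal.
STR_RE = r'''"([^"\\\n]|\\(['"?\\abfnrtv\n]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
CHR_RE = r"""'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""

REGEX = re.compile("|".join([STR_RE, CHR_RE]))

def repl(m):
    # Keep character literals, drop string literals.
    text = m.group(0)
    return text if text.startswith("'") else ""

def strip_strings(source):
    # Work on the whole text so a match can span several lines.
    return REGEX.sub(repl, source)

# Raw string: the backslash-newline pairs stay in the text as written.
source = r'''char * multiline_str = "one \
two \
three\n";
'''

print(strip_strings(source), end="")  # char * multiline_str = ;
```

To use it as a filter, sys.stdout.write(strip_strings(sys.stdin.read())) takes the place of the for-loop.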
Another option, depending on what you want this for, would be to run all of the code through the preprocessor first.
Laurence Gonsalves
A: 

In ruby:

#!/usr/bin/ruby
f=open(ARGV[0],"r")
s=f.read
puts(s.gsub(/"(\\(.|\n)|[^\\"\n])*"/,""))
f.close

This prints to standard output.

hiena
A: 

In Python using pyparsing:

from pyparsing import dblQuotedString

source = open(filename).read()
dblQuotedString.setParseAction(lambda : "")
print(dblQuotedString.transformString(source))

Also prints to stdout.

Paul McGuire