ansaurus

Question

Answer 1

+4 A:

You can download the source code to StripCmt (.tar.gz - 5kB). It's trivially small, and shouldn't be too difficult to adapt to stripping strings instead (it's released under the GPL).

You might also want to investigate the official lexical language rules for C strings. I found this very quickly, but it might not be definitive. It defines a string as:

stringcon ::= "{ch}", where ch denotes any printable ASCII character (as specified by isprint()) other than " (double quotes) and the newline character.
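As a rough illustration of why that definition might not be definitive, here is a minimal sketch that transcribes it into a Python 3 regex. The name SIMPLE_STRINGCON is made up here, and the character range assumes that isprint() in the "C" locale accepts 0x20 through 0x7E. Because the rule treats a backslash as an ordinary printable character, an escaped quote ends the string early:

```python
import re

# ch: printable ASCII (0x20-0x7E, assumed to be what isprint() accepts in the
# "C" locale) minus the double quote (0x22). Newline is outside that range.
SIMPLE_STRINGCON = re.compile(r'"[\x20-\x21\x23-\x7e]*"')

def is_simple_stringcon(text):
    """Return True if the whole of text matches the simplified rule."""
    return SIMPLE_STRINGCON.fullmatch(text) is not None

print(is_simple_stringcon('"hello"'))    # plain literal
print(is_simple_stringcon(r'"a\"b"'))    # escaped quote is not understood
```

The second call is rejected because the rule has no notion of escape sequences, so real C code with \" inside a string would be split in the wrong place.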

Mark Pim 2009-08-18 15:02:35

I had not thought of checking the source of stripcmt. It was simple to modify.

hlovdal 2009-08-18 18:04:35

Answer 2

+5 A:

All of the tokens in C (and most other programming languages) are "regular". That is, they can be matched by a regular expression.

A regular expression for C strings:

"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"

The regex isn't too hard to understand. Basically a string literal is a pair of double quotes surrounding a bunch of:

non-special (non-quote/backslash/newline) characters
escapes, which start with a backslash and then consist of one of:
- a simple escape character
- 1 to 3 octal digits
- x and 1 or more hex digits

This is based on sections 6.1.4 and 6.1.3.4 of the C89/C90 spec. If anything else crept in in C99, this won't catch that, but that shouldn't be hard to fix.
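One quick way to see what the expression accepts is to try it against a few candidate literals. This is only a sketch, assuming Python 3; the helper name is_string_literal and the sample strings are made up for illustration:

```python
import re

# The string-literal regex from above, unchanged.
STRING_RE = re.compile(
    r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"''')

def is_string_literal(text):
    """Return True if the whole of text is a single string literal."""
    return STRING_RE.fullmatch(text) is not None

samples = [
    r'"hello"',          # plain characters
    r'"tab\there"',      # simple escape
    r'"\101\x41"',       # octal and hex escapes
    r'"say \"hi\""',     # escaped quotes
    r'"bad \q escape"',  # \q is not in the simple-escape class
    r'"unterminated',    # missing closing quote
]
for s in samples:
    print(s, is_string_literal(s))
```

The first four samples are accepted and the last two are rejected.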

Here's a python script to filter a C source file removing string literals:

import re, sys
regex = re.compile(r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"''')
for line in sys.stdin:
  print(regex.sub('', line.rstrip('\n')))

EDIT:

It occurred to me after I posted the above that while it is true that all C tokens are regular, by not tokenizing everything we've got an opportunity for trouble. In particular, if a double quote shows up in what should be another token, we can be led down the garden path. You mentioned that comments have already been stripped, so the only other thing we really need to worry about is character literals (though the approach I'm going to use can easily be extended to handle comments as well). Here's a more robust script that handles character literals:

import re, sys
str_re = r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
chr_re = r"""'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""

regex = re.compile('|'.join([str_re, chr_re]))

def repl(m):
  m = m.group(0)
  if m.startswith("'"):
    return m
  else:
    return ''
for line in sys.stdin:
  print(regex.sub(repl, line.rstrip('\n')))

Essentially we're finding string and character literal tokens, and then leaving char literals alone but stripping out string literals. The char literal regex is very similar to the string literal one.
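To see why the character-literal branch matters, the two approaches can be compared on a line that contains the char literal '"'. This is a sketch assuming Python 3; the function names strip_strings_naive and strip_strings and the sample line are made up for illustration:

```python
import re

str_re = r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
chr_re = r"""'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""

naive = re.compile(str_re)
robust = re.compile('|'.join([str_re, chr_re]))

def strip_strings_naive(line):
    # Only knows about string literals, so a '"' char literal confuses it.
    return naive.sub('', line)

def strip_strings(line):
    # Matches both token kinds; keeps char literals, drops string literals.
    def repl(m):
        text = m.group(0)
        return text if text.startswith("'") else ''
    return robust.sub(repl, line)

line = '''char q = '"'; puts("hi"); char r = '"';'''
print(strip_strings_naive(line))
print(strip_strings(line))
```

The naive version pairs up the quotes inside the char literals with the quotes of the real string and mangles the line, while the two-token version removes only "hi".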

Laurence Gonsalves 2009-08-18 15:32:25

In this case I think it would be better: ([^"\\\n]|\\.)*
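The trade-off between the two patterns can be sketched as follows, assuming Python 3 (the variable names and sample strings are made up here). The shorter pattern accepts any character after a backslash, so it also matches literals with escapes that the stricter pattern rejects:

```python
import re

strict = re.compile(
    r'''"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))*"''')
permissive = re.compile(r'"([^"\\\n]|\\.)*"')

for text in [r'"a\tb"', r'"a\qb"']:
    # Print whether each pattern matches the whole sample.
    print(text, bool(strict.fullmatch(text)), bool(permissive.fullmatch(text)))
```

For stripping purposes the permissive form is usually enough, since the goal is to find where a literal ends rather than to validate its escapes.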

hiena 2009-08-18 15:53:07

Your regular expression fails to handle <<char * str = "one \<eol>two \<eol>three\n";>> where the <eol> indicates that there is a newline. This is what I meant by corner cases :)

hlovdal 2009-08-18 16:05:34

Using \ to join lines is part of preprocessing, and I was ignoring that. (E.g., what if the code is <<char *a = MACRO_THAT_EXPANDS_TO_STRING_LITERAL;>>? What do you want to do then?) If all you care about is the line-joining, you can add \n to the abfnrtv character class, and replace the for-loop with sys.stdout.write(regex.sub(repl, sys.stdin.read())). You'll also need to tweak chr_re if you're worried about line-joining inside char literals.
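The whole-input variant described here can be sketched as follows, assuming Python 3. The function name strip_string_literals and the sample text are made up for illustration, and chr_re is left untouched, so line-joining inside char literals is still not handled:

```python
import re

# Same patterns as the script above, with \n added to the simple-escape
# class of str_re so that backslash-newline is accepted inside a string.
str_re = r'''"([^"\\\n]|\\(['"?\\abfnrtv\n]|[0-7]{1,3}|x[0-9a-fA-F]+))*"'''
chr_re = r"""'([^'\\\n]|\\(['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]+))'"""
regex = re.compile('|'.join([str_re, chr_re]))

def repl(m):
    text = m.group(0)
    return text if text.startswith("'") else ''

def strip_string_literals(source):
    # Operates on the whole text at once so a literal may span lines.
    return regex.sub(repl, source)

# A string literal continued over three physical lines.
sample = 'char *str = "one \\\ntwo \\\nthree\\n";\n'
print(strip_string_literals(sample), end='')
```

To filter a real file, the sample could be replaced with sys.stdin.read() and the result written with sys.stdout.write, as suggested in the comment.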

Laurence Gonsalves 2009-08-18 16:20:21

Another option, depending on what you want this for, would be to run all of the code through the preprocessor first.

Laurence Gonsalves 2009-08-18 16:21:14

Answer 3

A:

In ruby:

#!/usr/bin/ruby
f=open(ARGV[0],"r")
s=f.read
puts(s.gsub(/"(\\(.|\n)|[^\\"\n])*"/,""))
f.close

Prints to standard output.

hiena 2009-08-18 16:28:13

Answer 4

A:

In Python using pyparsing:

from pyparsing import dblQuotedString

source = open(filename).read()
dblQuotedString.setParseAction(lambda : "")
print(dblQuotedString.transformString(source))

Also prints to stdout.

Paul McGuire 2009-09-04 16:47:48

Removing strings from C source code