views:

543

answers:

3

I'm using a small Python script to generate some binary data that will be used in a C header.

This data should be declared as a char[], and it will be nice if it could be encoded as a string (with the pertinent escape sequences when they are not in the range of ASCII printable chars) to keep the header more compact than with a decimal or hexadecimal array encoding.

The problem is that when I print the repr of a Python string, it is delimited by single quotes, and C doesn't like that. The naive solution is to do:

'"%s"'%repr(data)[1:-1]

but that doesn't work when one of the bytes in the data happens to be a double quote, so I'd need them to be escaped too.

I think a simple replace('"', '\\"') could do the job, but maybe there's a better, more pythonic solution out there.

Extra point:

It would be convenient too to split the data in lines of approximately 80 characters, but again the simple approach of splitting the source string in chunks of size 80 won't work, as each non printable character takes 2 or 3 characters in the escape sequence. Splitting the list in chunks of 80 after getting the repr won't help either, as it could divide escape sequence.

Any suggestions?

+2  A: 

Better not hack the repr() but use the right encoding from the beginning. You can get the repr's encoding directly with the encoding string_escape

>>> "naïveté".encode("string_escape")
'na\\xc3\\xafvet\\xc3\\xa9'
>>> print _
na\xc3\xafvet\xc3\xa9

For escaping the "-quotes I think using a simple replace after escape-encoding the string is a completely unambiguous process:

>>> '"%s"' % 'data:\x00\x01 "like this"'.encode("string_escape").replace('"', r'\"')
'"data:\\x00\\x01 \\"like this\\""'
>>> print _
"data:\x00\x01 \"like this\""
kaizer.se
that doesn't solve my problem, it still shows double quotes unescaped `'quotehere"'.encode("string_escape")` gives `'quotehere"'`
fortran
+2  A: 

If you're asking a python str for its repr, I don't think the type of quote is really configurable. From the PyString_Repr function in the python 2.6.4 source tree:

 /* figure out which quote to use; single is preferred */
 quote = '\'';
 if (smartquotes &&
     memchr(op->ob_sval, '\'', Py_SIZE(op)) &&
     !memchr(op->ob_sval, '"', Py_SIZE(op)))
  quote = '"';

So, I guess use double quotes if there is a single quote in the string, but don't even then if there is a double quote in the string.

I would try something like writing my own class to contain the string data instead of using the built in string to do it. One option would be deriving a class from str and writing your own repr:

class MyString(str):
    __slots__ = []
    def __repr__(self):
        return '"%s"' % self.replace('"', r'\"')

print repr(MyString(r'foo"bar'))

Or, don't use repr at all:

def ready_string(string):
    return '"%s"' % string.replace('"', r'\"')

print ready_string(r'foo"bar')

This simplistic quoting might not do the "right" thing if there's already an escaped quote in the string.

Matt Anderson
+2  A: 

repr() isn't what you want. There's a fundamental problem: repr() can use any representation of the string that can be evaluated as Python to produce the string. That means, in theory, that it might decide to use any number of other constructs which wouldn't be valid in C, such as """long strings""".

This code is probably the right direction. I've used a default of wrapping at 140, which is a sensible value for 2009, but if you really want to wrap your code to 80 columns, just change it.

If unicode=True, it outputs a L"wide" string, which can store Unicode escapes meaningfully. Alternatively, you might want to convert Unicode characters to UTF-8 and output them escaped, depending on the program you're using them in.

def string_to_c(s, max_length = 140, unicode=False):
    ret = []

    # Try to split on whitespace, not in the middle of a word.
    split_at_space_pos = max_length - 10
    if split_at_space_pos < 10:
        split_at_space_pos = None

    position = 0
    if unicode:
        position += 1
        ret.append('L')

    ret.append('"')
    position += 1
    for c in s:
        newline = False
        if c == "\n":
            to_add = "\\\n"
            newline = True
        elif ord(c) < 32 or 0x80 <= ord(c) <= 0xff:
            to_add = "\\x%02x" % ord(c)
        elif ord(c) > 0xff:
            if not unicode:
                raise ValueError, "string contains unicode character but unicode=False"
            to_add = "\\u%04x" % ord(c)
        elif "\\\"".find(c) != -1:
            to_add = "\\%c" % c
        else:
            to_add = c

        ret.append(to_add)
        position += len(to_add)
        if newline:
            position = 0

        if split_at_space_pos is not None and position >= split_at_space_pos and " \t".find(c) != -1:
            ret.append("\\\n")
            position = 0
        elif position >= max_length:
            ret.append("\\\n")
            position = 0

    ret.append('"')

    return "".join(ret)

print string_to_c("testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing testing", max_length = 20)
print string_to_c("Escapes: \"quote\" \\backslash\\ \x00 \x1f testing \x80 \xff")
print string_to_c(u"Unicode: \u1234", unicode=True)
print string_to_c("""New
lines""")
Glenn Maynard
isn't the 'elif "\\\"".find(c) != -1' the same as 'elif c in "\\\""'? In any case I agree, repr() is not the solution here and you have to do something like this.
Andrew Dalke