views:

912

answers:

6

I'm trying to implement string unescaping with Python regex and backreferences, and it doesn't seem to want to work very well. I'm sure it's something I'm doing wrong but I can't figure out what...

>>> import re
>>> mystring = r"This is \n a test \r"
>>> p = re.compile( "\\\\(\\S)" )
>>> p.sub( "\\1", mystring )
'This is n a test r'
>>> p.sub( "\\\\\\1", mystring )
'This is \\n a test \\r'
>>> p.sub( "\\\\1", mystring )
'This is \\1 a test \\1'

I'd like to replace \\[char] with \[char], but backreferences in Python don't appear to follow the same rules they do in every other implementation I've ever used. Could someone shed some light?

+2  A: 

Well, I think you might have missed the r or miscounted the backslashes...

"\\n" == r"\n"

>>> import re
>>> mystring = r"This is \\n a test \\r"
>>> p = re.compile( r"[\\][\\](.)" )
>>> print p.sub( r"\\\1", mystring )
This is \n a test \r
>>>

Which, if I understood is what was requested.

I suspect the more common request is this:

>>> d = {'n':'\n', 'r':'\r', 'f':'\f'}
>>> p = re.compile(r"[\\]([nrfv])")
>>> print p.sub(lambda mo: d[mo.group(1)], mystring)
This is \
 a test \
>>>

The interested student should also read Ken Thompson's Reflections on Trusting Trust", wherein our hero uses a similar example to explain the perils of trusting compilers you haven't bootstrapped from machine code yourself.

Anders Eurenius
A: 

Your example doesn't solve the problem. It takes:

This is \\n a test \\r

and outputs

This is \\n a test \\r

I would like to output, instead,

This is \n a test \r
eplawless
A: 

You are being tricked by Python's representation of the result string. The Python expression:

'This is \\n a test \\r'

represents the string

This is \n a test \r

which is I think what you wanted. Try adding 'print' in front of each of your p.sub() calls to print the actual string returned instead of a Python representation of the string.

>>> mystring = r"This is \n a test \r"
>>> mystring
'This is \\n a test \\r'
>>> print mystring
This is \n a test \r
Greg Hewgill
A: 

The idea is that I'll read in an escaped string, and unescape it (a feature notably lacking from Python, which you shouldn't need to resort to regular expressions for in the first place). Unfortunately I'm not being tricked by the backslashes...

Another illustrative example:

>>> mystring = r"This is \n ridiculous"
>>> print mystring
This is \n ridiculous
>>> p = re.compile( r"\\(\S)" )
>>> print p.sub( 'bloody', mystring )
This is bloody ridiculous
>>> print p.sub( r'\1', mystring )
This is n ridiculous
>>> print p.sub( r'\\1', mystring )
This is \1 ridiculous
>>> print p.sub( r'\\\1', mystring )
This is \n ridiculous

What I'd like it to print is

This is 
ridiculous
eplawless
+6  A: 

Isn't that what Anders' second example does?

In 2.5 there's also a string-escape encoding you can apply:

>>> mystring = r"This is \n a test \r"
>>> mystring.decode('string-escape')
'This is \n a test \r'
>>> print mystring.decode('string-escape')
This is 
 a test 
>>>
markpasc
A: 

Mark; his second example requires every escaped character thrown into an array initially, which generates a KeyError if the escape sequence happens not to be in the array. It will die on anything but the three characters provided (give \v a try), and enumerating every possible escape sequence every time you want to unescape a string (or keeping a global array) is a really bad solution. Analogous to PHP, that's using preg_replace_callback() with a lambda instead of preg_replace(), which is utterly unnecessary in this situation.

I'm sorry if I'm coming off as a dick about it, I'm just utterly frustrated with Python. This is supported by every other regular expression engine I've ever used, and I can't understand why this wouldn't work.

Thank you for responding; the string.decode('string-escape') function is precisely what i was looking for initially. If someone has a general solution to the regex backreference problem, feel free to post it and I'll accept that as an answer as well.

eplawless