Python 3 cleans up Python's handling of Unicode strings. I assume that, as part of this effort, the codecs in Python 3 have become more restrictive, judging by the Python 3 documentation compared with the Python 2 documentation.

For example, codecs that conceptually convert a bytestream to a different form of bytestream have been removed:

  • base64_codec
  • bz2_codec
  • hex_codec

And codecs that conceptually convert Unicode to a different form of Unicode have also been removed (in Python 2 the codec actually converted between Unicode and a bytestream, but conceptually it's really Unicode to Unicode, I reckon):

  • rot_13

My main question is: what is the "right way" in Python 3 to do what these removed codecs used to do? They're not codecs in the strict sense but "transformations", although their interface and implementation would be very similar to those of codecs.

I don't care about rot_13, but I'm interested to know the "best way" to implement a transformation of line ending styles (Unix line endings vs Windows line endings). That should really be a Unicode-to-Unicode transformation done before encoding to a byte stream, especially when UTF-16 is being used, as discussed in this other SO question.

+2  A: 

It looks as though all these non-codec modules are being handled on a case-by-case basis. Here's what I've found so far:

  • base64 is now available via the base64 module
  • bz2 can now be done using the bz2 module
  • hex string encoding/decoding can be done with the hexlify and unhexlify functions of the binascii module (a bit of a hidden feature)
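As a minimal sketch of the three replacements listed above (the sample bytes are made up for illustration), each module offers a bytes-to-bytes round trip:

```python
import base64
import binascii
import bz2

data = b"hello world"  # arbitrary sample input

# base64: bytes in, bytes out
b64 = base64.b64encode(data)
b64_back = base64.b64decode(b64)

# bz2: one-shot compression and decompression of bytes
compressed = bz2.compress(data)
bz2_back = bz2.decompress(compressed)

# hex: hexlify returns the hex digits as a bytes object
hexed = binascii.hexlify(data)
hex_back = binascii.unhexlify(hexed)
```

Note that all three work on bytes rather than str, which matches the "bytestream to bytestream" description in the question.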

I guess that means Python 3 has no standard framework for creating such string/bytearray transformation modules; each one is handled separately.

Craig McQueen
+1  A: 

What specifically is your need for line ending conversion? If it's just for writing to a file or file object, you can specify which line ending to use with the newline parameter of open(), and \n will automatically be converted to that when you write to the file. Admittedly, this only works with files opened in text mode, not binary mode. (You can also specify which encoding to use when writing text to the file, which can be useful sometimes.)

http://docs.python.org/3.1/library/functions.html#open
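A minimal sketch of the open() approach, assuming a scratch file in a temporary directory (the path and sample text are hypothetical). The newline translation happens on the text, before it is encoded, so it also behaves correctly with UTF-16:

```python
import os
import tempfile

text = "first line\nsecond line\n"

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# newline="\r\n" asks the text layer to write "\r\n" for every "\n".
# "utf-16-le" is used so that no byte order mark is written.
with open(path, "w", encoding="utf-16-le", newline="\r\n") as f:
    f.write(text)

# Read the raw bytes back to see what actually reached the file.
with open(path, "rb") as f:
    raw = f.read()

decoded = raw.decode("utf-16-le")
```

Here decoded contains "\r\n" line endings, each encoded as a proper two-byte UTF-16 code unit pair rather than as a single stray byte.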

To do the conversion on regular strings, you can simply use yourstring = yourstring.replace('\n', '\r\n') to go from Linux-style to Windows-style, and yourstring = yourstring.replace('\r\n', '\n') to go from Windows-style to Linux-style. You probably already know this, though, and it's probably not what you're looking for. (And, in fact, if you're writing to a text file on a Windows system, \n is converted to \r\n anyway when the newline argument is left at its default of None.)

As well, if you want to convert between the various Unicode encodings (assuming you're working with byte sequences, as the strings Python uses internally aren't tied to any specific encoding), it's just a matter of decoding the byte sequence using bytes.decode() or bytearray.decode() and then encoding using str.encode(). For a conversion from UTF-8 to UTF-16:

newstring = yourbytes.decode('utf-8')
yourbytes = newstring.encode('utf-16')

There shouldn't be any problems with newline characters not being converted properly between the two Unicode formats when done this way.

There is also str.translate() and str.maketrans(), though I'm not sure if those will prove useful:

http://docs.python.org/3.1/library/stdtypes.html#str.translate
http://docs.python.org/3.1/library/stdtypes.html#str.maketrans

On a side note, rot_13 can be implemented like so:

import string

# Rotate each letter 13 places within its own case (A-Z or a-z).
rot_13 = str.maketrans({
    x: chr((ord(x) - base + 13) % 26 + base)
    for base, letters in ((ord('A'), string.ascii_uppercase),
                          (ord('a'), string.ascii_lowercase))
    for x in letters
})

# Using hard-coded values:

rot_13 = str.maketrans('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', 'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm')

Either way, using S.translate(rot_13) will cause normal strings to become rot_13 and rot_13 strings to become normal ones.
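A short usage sketch of the round trip, with the table rebuilt here so the snippet runs on its own (the sample string is made up). The comparison against codecs.encode() assumes Python 3.2 or later, where the documentation lists rot_13 as a restored text-to-text transform; it is not available through str.encode():

```python
import codecs
import string

upper, lower = string.ascii_uppercase, string.ascii_lowercase
# Same mapping as the hard-coded table: each alphabet shifted by 13 places.
rot_13 = str.maketrans(upper + lower,
                       upper[13:] + upper[:13] + lower[13:] + lower[:13])

encoded = "Hello, World!".translate(rot_13)
decoded = encoded.translate(rot_13)  # applying rot_13 twice restores the text

# Assumption: Python 3.2+, where the rot_13 codec is reachable via the codecs module.
via_codec = codecs.encode("Hello, World!", "rot_13")
```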

JAB
Thanks for your answer. The thing that is missing from these solutions is a framework that allows them to be easily applied as a transformation to a stream, in a similar way to the codec framework. See the stream transformation in http://stackoverflow.com/questions/1169742/bug-with-python-utf-16-output-and-windows-line-endings#answer-1170469 for an example of what I want to do. Does Python 3 have a standard framework for such stream transformations, similar to the codec framework?
Craig McQueen
Apparently you can do it exactly like it shows there; the only difference is that you use `sys.stdout.buffer` rather than `sys.stdout`. You'll still have that `\n` problem, though; I'll look into that in a bit.
JAB
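A minimal sketch of one way to get a stream-level line ending transformation with only the standard library, assuming it is acceptable to wrap the binary stream in io.TextIOWrapper rather than use the wrapper class from the linked answer. An io.BytesIO object stands in for sys.stdout.buffer here so the result can be inspected, and the sample text is made up:

```python
import io

# Stand-in for a binary stream such as sys.stdout.buffer.
binary_stream = io.BytesIO()

# The wrapper translates "\n" to "\r\n" on the text side, then encodes.
# "utf-16-le" is used so that no byte order mark is written.
text_stream = io.TextIOWrapper(binary_stream, encoding="utf-16-le", newline="\r\n")
text_stream.write("one\ntwo\n")
text_stream.flush()  # push buffered text down to the binary stream

written = binary_stream.getvalue()
```

Because the newline translation happens before encoding, the carriage return ends up as a full UTF-16 code unit, which may sidestep the \n problem mentioned above.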
(On a side note, if you do end up using that `CRLFWrapper` class from your other question, I'd recommend using `re.sub()` instead of `str.replace()`, with the pattern to match being `(?<!\r)\n` and the replacement string being `\r\n`; this will avoid repeated carriage returns, which may or may not mess things up.)
JAB
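A minimal sketch of the re.sub() suggestion, using a hypothetical helper name to_crlf and made-up sample text. The negative lookbehind makes the conversion safe to apply to text that already contains some \r\n pairs:

```python
import re

def to_crlf(text):
    """Convert bare "\n" to "\r\n", leaving existing "\r\n" pairs alone."""
    # (?<!\r)\n matches a newline only when it is not preceded by "\r".
    return re.sub(r"(?<!\r)\n", "\r\n", text)

mixed = "a\nb\r\nc\n"                    # mixed Unix and Windows line endings
converted = to_crlf(mixed)
naive = mixed.replace("\n", "\r\n")      # doubles the "\r" on the middle line
```

With this input, the naive str.replace() version produces a doubled carriage return after b, while to_crlf() does not, and applying to_crlf() a second time leaves the text unchanged.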