I'm not exactly sure how to ask this question, and I'm nowhere close to finding an answer, so I hope someone can help me.

I'm writing a Python app that connects to a remote host and receives back byte data, which I unpack using Python's built-in struct module. My problem is with the strings, as they include multiple character encodings. Here is an example of such a string:

"^LThis is an example ^Gstring with multiple ^Jcharacter encodings"

The points where each encoding starts and ends are marked with special escape characters:

  • ^L - Latin1
  • ^E - Central Europe
  • ^T - Turkish
  • ^B - Baltic
  • ^J - Japanese
  • ^C - Cyrillic
  • ^G - Greek

And so on... I need a way to convert this sort of string into Unicode, but I'm really not sure how to do it. I've read up on Python's codecs and string.encode/decode, but I'm none the wiser. I should also mention that I have no control over how the host outputs the strings.

I hope someone can help me with how to get started on this.

+3  A: 

I would write a codec that incrementally scans the string and decodes the bytes as they come along. Essentially, you would have to separate the string into chunks that each have a single consistent encoding, decode each chunk with its own codec, and concatenate the decoded results in order.
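A minimal sketch of that idea, written for Python 3 where the data from the socket arrives as a bytes object. The class name, the `feed` method and the byte-to-codec table are all made up for illustration, and the table assumes the markers are real control bytes (^L = 0x0C and so on) rather than a literal caret followed by a letter. It uses `codecs.getincrementaldecoder` so that a multi-byte character split across two network reads is still decoded correctly.

```python
import codecs

# Assumed mapping from control byte to codec name (hypothetical; extend to match the host).
CONTROL_CODECS = {
    0x0C: "latin-1",    # ^L
    0x07: "iso8859_7",  # ^G
    0x0A: "shift_jis",  # ^J
}


class MultiEncodingDecoder:
    """Decode a byte stream whose encoding is switched by control bytes."""

    def __init__(self, initial="latin-1"):
        self._decoder = codecs.getincrementaldecoder(initial)()

    def feed(self, data):
        """Decode one chunk of bytes and return the text decoded so far."""
        out = []
        start = 0
        for i, byte in enumerate(data):
            if byte in CONTROL_CODECS:
                # Finish the run in the old encoding, then switch decoders.
                out.append(self._decoder.decode(data[start:i], final=True))
                self._decoder = codecs.getincrementaldecoder(CONTROL_CODECS[byte])()
                start = i + 1
        # Decode the tail; an incomplete multi-byte character stays buffered.
        out.append(self._decoder.decode(data[start:]))
        return "".join(out)
```

You would create one decoder per connection and call `feed()` with whatever each read returns; the current encoding carries over between calls.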

Aaron Maenpaa
+4  A: 

There's no built-in functionality for decoding a string like this, since it is really its own custom codec. You simply need to split up the string on those control characters and decode it accordingly.

Here's a (very slow) example of such a function that handles Latin-1 and Shift-JIS:

latin1 = "latin-1"
japanese = "Shift-JIS"

control_l = "\x0c"
control_j = "\n"

encodingMap = {
    control_l: latin1,
    control_j: japanese}

def funkyDecode(s, initialCodec=latin1):
    output = u""
    accum = ""
    currentCodec = initialCodec
    for ch in s:
        if ch in encodingMap:
            output += accum.decode(currentCodec)
            currentCodec = encodingMap[ch]
            accum = ""
        else:
            accum += ch
    output += accum.decode(currentCodec)
    return output

A faster version might use str.split, or regular expressions.
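For what it's worth, here is a rough sketch of the regular-expression variant, written for Python 3 bytes rather than Python 2 str. The function name `funky_decode_split` and the two-entry map are only illustrative assumptions. It relies on `re.split` keeping the captured control byte in the result list, so text and control bytes alternate.

```python
import re

# Assumed control-byte table (hypothetical; only two entries for illustration).
ENCODING_MAP = {
    b"\x0c": "latin-1",    # ^L
    b"\x0a": "shift_jis",  # ^J
}

# One character class matching any known control byte; the capturing group
# makes re.split keep the control byte in the result list.
_CONTROL_RE = re.compile(b"([" + b"".join(re.escape(c) for c in ENCODING_MAP) + b"])")


def funky_decode_split(data, initial_codec="latin-1"):
    parts = _CONTROL_RE.split(data)
    # parts is [text, control, text, control, text, ...]
    output = [parts[0].decode(initial_codec)]
    for control, chunk in zip(parts[1::2], parts[2::2]):
        output.append(chunk.decode(ENCODING_MAP[control]))
    return "".join(output)
```

This decodes each run with a single call instead of accumulating byte by byte, which is where the speedup would come from.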

(Also, as you can see in this example, "^J" is the control character for "newline", so your input data is going to have some interesting restrictions.)

Glyph
+1  A: 

I don't suppose you have any way of convincing whoever runs the other machine to switch to Unicode?

This is one of the reasons Unicode was invented, after all.

R. Bemrose
As I've said, I have no control over the host itself. The host is actually a computer game which my app connects to, and I believe this is how it handles its text-rendering internally.
Fara
+2  A: 

You definitely have to split the string first into the substrings with different encodings, and decode each one separately. Just for fun, the obligatory "one-line" version:

import re

encs = {
    'L': 'latin1',
    'G': 'iso8859-7',
    ...
}

# st is the raw (byte) string received from the host
decoded = ''.join(substr[2:].decode(encs[substr[1]])
             for substr in re.findall('\^[%s][^^]*' % ''.join(encs.keys()), st))

(no error checking, and also you'll want to decide how to handle '^' characters in substrings)
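One possible way to deal with both caveats, sketched for Python 3 bytes. The policy here is my own assumption, not something the host is known to do: a caret only counts as a marker when it is followed by a known letter, any other caret is kept as literal text, and undecodable bytes become U+FFFD instead of raising. `decode_marked` and the two-entry `ENCS` table are illustrative names.

```python
import re

# Assumed marker letters (hypothetical subset; the real table is longer).
ENCS = {
    b"L": "latin-1",
    b"G": "iso8859_7",
}

# A caret counts as a marker only when followed by a known letter.
MARKER_RE = re.compile(b"\\^([" + b"".join(ENCS) + b"])")


def decode_marked(data, default="latin-1"):
    pieces = MARKER_RE.split(data)
    # pieces is [text, letter, text, letter, text, ...]
    out = [pieces[0].decode(default, errors="replace")]
    for letter, chunk in zip(pieces[1::2], pieces[2::2]):
        # errors="replace" substitutes U+FFFD for bytes the codec rejects.
        out.append(chunk.decode(ENCS[letter], errors="replace"))
    return "".join(out)
```

A literal caret that happens to be followed by a marker letter is still ambiguous with this approach. Also note that the second byte of a Shift-JIS character can fall in the ASCII range (0x40 to 0x7E), so a literal caret-plus-letter scheme could misfire inside Japanese text.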

dF
You made exactly the same mistake as me!
+6  A: 

Here's a relatively simple example of how to do it...

# -*- coding: utf-8 -*-
import re

# Test Data
ENCODING_RAW_DATA = (
    ('latin_1',    'L', u'Hello'),        # Latin 1
    ('iso8859_2',  'E', u'dobrý večer'),  # Central Europe
    ('iso8859_9',  'T', u'İyi akşamlar'), # Turkish
    ('iso8859_13', 'B', u'Į sveikatą!'),  # Baltic
    ('shift_jis',  'J', u'今日は'),        # Japanese
    ('iso8859_5',  'C', u'Здравствуйте'), # Cyrillic
    ('iso8859_7',  'G', u'Γειά σου'),   # Greek
)

CODE_TO_ENCODING = dict([(chr(ord(code)-64), encoding) for encoding, code, text in ENCODING_RAW_DATA])
EXPECTED_RESULT = u''.join([line[2] for line in ENCODING_RAW_DATA])
ENCODED_DATA = ''.join([chr(ord(code)-64) + text.encode(encoding) for encoding, code, text in ENCODING_RAW_DATA])

FIND_RE = re.compile('[\x00-\x1A][^\x00-\x1A]*')

def decode_single(bytes):
    return bytes[1:].decode(CODE_TO_ENCODING[bytes[0]])

result = u''.join([decode_single(bytes) for bytes in FIND_RE.findall(ENCODED_DATA)])

assert result==EXPECTED_RESULT, u"Expected %s, but got %s" % (EXPECTED_RESULT, result)
zellyn