ansaurus

Question

Dealing with a string containing multiple character encodings.

Answer 1

+3 A:

I would write a codec that incrementally scans the string and decodes the bytes as they come along. Essentially, you would have to split the string into chunks that each use a single encoding, decode each chunk, and append the result to the text decoded so far.
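A minimal sketch of that incremental idea, written for Python 3 (the rest of this thread uses Python 2). The class name, the `SWITCH_CODES` table, and the choice of control bytes are assumptions for illustration, not the game's actual protocol. It also assumes a control byte never appears inside a multi-byte character, which holds for Shift-JIS but not for every encoding.

```python
import codecs

# Hypothetical table: control byte -> codec name. Adjust to the real protocol.
SWITCH_CODES = {0x0C: "latin-1", 0x0A: "shift_jis"}


class MultiEncodingDecoder:
    """Decode a byte stream whose encoding changes at control bytes."""

    def __init__(self, initial="latin-1"):
        self._decoder = codecs.getincrementaldecoder(initial)()

    def feed(self, data):
        """Decode one chunk of bytes; chunks may end in the middle of a character."""
        out = []
        start = 0
        for i, byte in enumerate(data):
            if byte in SWITCH_CODES:
                # Finish the run before the control byte with the old codec.
                out.append(self._decoder.decode(data[start:i], final=True))
                self._decoder = codecs.getincrementaldecoder(SWITCH_CODES[byte])()
                start = i + 1
        # The tail stays open: the decoder buffers any incomplete character.
        out.append(self._decoder.decode(data[start:]))
        return "".join(out)

    def close(self):
        """Flush the decoder; raises UnicodeDecodeError on a truncated character."""
        return self._decoder.decode(b"", final=True)
```

Feeding data in arbitrary pieces works because the standard incremental decoders hold back a partial multi-byte sequence until the next call.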

Aaron Maenpaa 2008-10-13 14:32:48

Answer 2

+4 A:

There's no built-in functionality for decoding a string like this, since it is really its own custom codec. You simply need to split the string on those control characters and decode each piece with the encoding its control character selects.

Here's a (very slow) example of such a function that handles Latin-1 and Shift-JIS:

latin1 = "latin-1"
japanese = "Shift-JIS"

control_l = "\x0c"  # Ctrl-L (form feed): switch to Latin-1
control_j = "\n"    # Ctrl-J (line feed): switch to Shift-JIS

encodingMap = {
    control_l: latin1,
    control_j: japanese}

def funkyDecode(s, initialCodec=latin1):
    output = u""
    accum = ""
    currentCodec = initialCodec
    for ch in s:
        if ch in encodingMap:
            output += accum.decode(currentCodec)
            currentCodec = encodingMap[ch]
            accum = ""
        else:
            accum += ch
    output += accum.decode(currentCodec)
    return output

A faster version might use str.split, or regular expressions.
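As a rough sketch of the regular-expression variant, here is a Python 3 version that works on `bytes`. The function name `funky_decode_split` is made up for this example; the control bytes and codecs are the same two used above.

```python
import re

# Same two control characters as above, as bytes.
ENCODING_MAP = {b"\x0c": "latin-1", b"\n": "shift_jis"}

# The capturing group makes re.split keep the control bytes in its result.
SWITCH_RE = re.compile(b"([\x0c\n])")


def funky_decode_split(data, initial_codec="latin-1"):
    # parts alternates: chunk, control, chunk, control, ..., chunk
    parts = SWITCH_RE.split(data)
    out = [parts[0].decode(initial_codec)]
    for control, chunk in zip(parts[1::2], parts[2::2]):
        out.append(chunk.decode(ENCODING_MAP[control]))
    return "".join(out)
```

This decodes each run in one call instead of growing strings one character at a time, which is where the slow version loses most of its time.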

(Also, as you can see in this example, "^J" is the control character for "newline", so your input data is going to have some interesting restrictions.)

Glyph 2008-10-13 14:52:14

Answer 3

+1 A:

I don't suppose you have any way of convincing the person who hosts the other machine to switch to Unicode?

This is one of the reasons Unicode was invented, after all.

R. Bemrose 2008-10-13 14:55:31

As I've said, I have no control over the host itself. The host is actually a computer game which my app connects to, and I believe this is how it handles its text-rendering internally.

Fara 2008-10-13 15:06:17

Answer 4

+2 A:

You definitely have to split the string into substrings with different encodings first, and decode each one separately. Just for fun, the obligatory "one-line" version:

import re

encs = {
    'L': 'latin1',
    'G': 'iso8859-7',
    ...
}

decoded = ''.join(substr[2:].decode(encs[substr[1]])
             for substr in re.findall('\^[%s][^^]*' % ''.join(encs.keys()), st))

(no error checking, and also you'll want to decide how to handle '^' characters in substrings)
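For comparison, a longer Python 3 sketch with the error checking added. It makes one possible choice about '^': every caret is treated as a marker, so a caret followed by an unknown code raises an error. The function name `decode_tagged` is hypothetical, and `ENCS` lists only the two codes shown above because the rest of the table is elided.

```python
import re

# Only the two codes shown in the answer; the rest of the table is elided there.
ENCS = {"L": "latin1", "G": "iso8859-7"}

# A caret, one code byte, then everything up to the next caret.
TAGGED_RE = re.compile(rb"\^(.)([^^]*)", re.DOTALL)


def decode_tagged(data):
    out = []
    pos = 0
    for m in TAGGED_RE.finditer(data):
        if m.start() != pos:
            raise ValueError("untagged bytes at offset %d" % pos)
        code = m.group(1).decode("latin1")
        if code not in ENCS:
            raise ValueError("unknown code %r at offset %d" % (code, m.start()))
        out.append(m.group(2).decode(ENCS[code]))
        pos = m.end()
    if pos != len(data):
        # Leading untagged text, or a lone caret at the very end.
        raise ValueError("untagged bytes at offset %d" % pos)
    return "".join(out)
```

If the data can contain literal carets, the protocol needs some escape rule, and this function would have to be changed to match it.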

dF 2008-10-13 15:24:45

You made exactly the same mistake as me!

2008-10-13 15:57:08

Answer 5

+6 A:

Here's a relatively simple example of how to do it...

# -*- coding: utf-8 -*-
import re

# Test Data
ENCODING_RAW_DATA = (
    ('latin_1',    'L', u'Hello'),        # Latin 1
    ('iso8859_2',  'E', u'dobrý večer'),  # Central Europe
    ('iso8859_9',  'T', u'İyi akşamlar'), # Turkish
    ('iso8859_13', 'B', u'Į sveikatą!'),  # Baltic
    ('shift_jis',  'J', u'今日は'),        # Japanese
    ('iso8859_5',  'C', u'Здравствуйте'), # Cyrillic
    ('iso8859_7',  'G', u'Γειά σου'),   # Greek
)

CODE_TO_ENCODING = dict([(chr(ord(code)-64), encoding) for encoding, code, text in ENCODING_RAW_DATA])
EXPECTED_RESULT = u''.join([line[2] for line in ENCODING_RAW_DATA])
ENCODED_DATA = ''.join([chr(ord(code)-64) + text.encode(encoding) for encoding, code, text in ENCODING_RAW_DATA])

FIND_RE = re.compile('[\x00-\x1A][^\x00-\x1A]*')

def decode_single(bytes):
    return bytes[1:].decode(CODE_TO_ENCODING[bytes[0]])

result = u''.join([decode_single(bytes) for bytes in FIND_RE.findall(ENCODED_DATA)])

assert result==EXPECTED_RESULT, u"Expected %s, but got %s" % (EXPECTED_RESULT, result)

zellyn 2008-10-13 15:29:10
