ansaurus

Question

Working out file encoding: I know the string, know the character, what is the encoding?

Answer 1

A:

You could try

 iconv -f latin1 -t utf8 data_clean.csv

if you know the file really is ISO Latin-1 (ISO-8859-1).

In ISO Latin-1, though, \xA5 really is ¥ (the yen sign).
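
If iconv is not available, roughly the same conversion can be sketched in Python. The function name `transcode` and its parameters are made up for this illustration; it assumes the file fits in memory and that you already know the source encoding.

```python
def transcode(src_path, dst_path, src_encoding="latin-1", dst_encoding="utf-8"):
    """Rewrite src_path as dst_path, converting between text encodings.

    Hypothetical helper, roughly equivalent to
    `iconv -f latin1 -t utf8 src > dst`.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
```

If the source encoding is wrong, the output will still be valid UTF-8 but will contain the wrong characters, so it is worth checking a known character afterwards.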

Edit: Actually, this looks like a known problem on the Mac: text written in Word or similar with a font such as Arial (?), then printed or converted to PDF, can run into font-related character issues. You may need to massage the file explicitly first. Does that sound familiar?

nicomen 2010-08-16 15:16:24

Answer 2

A:

I don't know of any general tool, but the Wikipedia page on Mac OS Roman (linked from the page on code page 1252) shows that A5 is a bullet point in the Mac OS Roman code page.
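
A quick way to check a claim like this is to decode the byte under each candidate codec. A minimal sketch, assuming Python 3 and its built-in `latin-1` and `mac_roman` codecs:

```python
raw = b"\xA5"

# The same byte maps to different characters depending on the codec.
as_latin1 = raw.decode("latin-1")        # U+00A5 YEN SIGN
as_mac_roman = raw.decode("mac_roman")   # U+2022 BULLET

for name, char in [("latin-1", as_latin1), ("mac_roman", as_mac_roman)]:
    print(f"{name}: U+{ord(char):04X}")
```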

AakashM 2010-08-16 16:17:21

x.decode("mac_roman") works. Thank you!

AP257 2010-08-16 16:48:36

Answer 3

A:

More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?

You can easily write one in Python. (Examples use 3.x syntax.)

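import encodings.aliases

# Canonical names of the codecs registered in the standard library.
# 'mbcs' only works on Windows and 'tactis' has no codec behind it.
ENCODINGS = set(encodings.aliases.aliases.values()) - {'mbcs', 'tactis'}

def _decode(data, encoding):
    """Return data decoded with encoding, or None if that is not possible."""
    try:
        return data.decode(encoding)
    except (UnicodeError, LookupError):
        # LookupError covers codecs that are unavailable or are not text
        # encodings (base64_codec, rot_13, ...).
        return None

def possible_encodings(encoded, decoded):
    """Return the encodings under which the bytes decode to the given text."""
    return {enc for enc in ENCODINGS if _decode(encoded, enc) == decoded}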

So if you know that your bullet point is U+2022, then

>>> possible_encodings(b'\xA5', '\u2022')
{'mac_iceland', 'mac_roman', 'mac_turkish', 'mac_latin2', 'mac_cyrillic'}
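
If you have a longer byte string and only know that a particular character should appear somewhere in it, a variation on the same idea may help. The sketch below is self-contained; `encodings_containing` is a name invented here, not part of the standard library, and the exact results depend on which codecs your Python version ships.

```python
import encodings.aliases

def encodings_containing(encoded, expected_char):
    """Return codecs under which `encoded` decodes to text containing expected_char."""
    matches = set()
    for enc in set(encodings.aliases.aliases.values()):
        try:
            text = encoded.decode(enc)
        except (UnicodeError, LookupError):
            # Undecodable bytes, a missing codec, or a non-text codec.
            continue
        if expected_char in text:
            matches.add(enc)
    return matches

candidates = encodings_containing(b"\xA5 first item", "\u2022")
print(sorted(candidates))
```

This only narrows the field: several codecs often agree on a single byte, so testing a few more known characters from the same file is a reasonable next step.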

dan04 2010-08-21 03:11:52

Interesting - thanks.

AP257 2010-08-23 11:24:06
