I'm adding data from a CSV file into a database. When I open the CSV file, I can see that some of the entries contain bullet points. The file command reports the encoding as ISO-8859:

    $ file data_clean.csv
    data_clean.csv: ISO-8859 English text, with very long lines, with CRLF, LF line terminators

I read it in as follows and convert it from ISO-8859-1 to UTF-8, which my database requires.

    row = [unicode(x.decode("ISO-8859-1").strip()) for x in row]
    print row[4]    
    description = row[4].encode("UTF-8")
    print description

This gives me the following:

    '\xa5 Research and insight \n\xa5 Media and communications'
    ¥ Research and insight 
    ¥ Media and communications 

Why is the \xa5 bullet character being converted to a yen symbol?

I assume because I'm reading it in as the wrong encoding, but what is the right encoding in this case? It isn't cp1252 either.

More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?

A: 

You could try

    iconv -f latin1 -t UTF-8 data_clean.csv

if you know the file really is ISO-8859-1 (Latin-1).

In ISO-8859-1, though, \xA5 really is a ¥, so the yen sign is the correct result for that encoding.
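One way to check what a byte means under a given encoding is to decode it and ask for the Unicode character name. A minimal Python 3 sketch (the variable names are only for illustration):

```python
import unicodedata

raw = b"\xa5"

# ISO-8859-1 maps every byte 0xNN straight to code point U+00NN.
latin1_char = raw.decode("iso-8859-1")
latin1_name = unicodedata.name(latin1_char)
print(hex(ord(latin1_char)), latin1_name)  # 0xa5 YEN SIGN

# cp1252 agrees with ISO-8859-1 for this byte, so it does not help either.
cp1252_name = unicodedata.name(raw.decode("cp1252"))
print(cp1252_name)  # YEN SIGN
```

So the decoding step is doing what it was asked to do; the byte simply was not written as ISO-8859-1 or cp1252 in the first place.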

Edit: Actually, this seems to be a known problem with files produced on a Mac, for example when using Word or similar with Arial (?) and printing or converting to PDF. It appears to be related to fonts. Maybe you need to convert the file explicitly first. Does that sound familiar?

nicomen
A: 

I don't know of any general tool, but this Wikipedia page (linked from the page on codepage 1252) shows that A5 is a bullet point in the Mac OS Roman codepage.
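Assuming the file really is Mac OS Roman, the fix would look roughly like this in Python 3 (the sample bytes are the ones printed in the question; Python's name for the codec is mac_roman):

```python
import unicodedata

# Sample bytes copied from the question's output.
raw = b"\xa5 Research and insight \n\xa5 Media and communications"

# Decode with the encoding the bytes were actually written in.
text = raw.decode("mac_roman")
print(unicodedata.name(text[0]))  # BULLET

# Re-encode as UTF-8 for the database.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes[:3])  # b'\xe2\x80\xa2'
```

The same decode call should drop into the list comprehension from the question in place of "ISO-8859-1".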

AakashM
x.decode("mac_roman") works. Thank you!
AP257
A: 

More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?

You can easily write one in Python. (Examples use 3.x syntax.)

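    import encodings.aliases

    # Every codec name that Python has an alias for. 'mbcs' only exists on
    # Windows, and 'tactis' has no codec module.
    ENCODINGS = set(encodings.aliases.aliases.values()) - {'mbcs', 'tactis'}

    def _decode(data, encoding):
        """Return data decoded with encoding, or None if that is not possible."""
        try:
            return data.decode(encoding)
        except (UnicodeError, LookupError):
            # LookupError: codec unavailable, or not a text encoding (e.g. hex_codec).
            return None

    def possible_encodings(encoded, decoded):
        """Return the set of encodings under which encoded decodes to decoded."""
        return {enc for enc in ENCODINGS if _decode(encoded, enc) == decoded}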

So if you know that your bullet point is U+2022, then

    >>> possible_encodings(b'\xA5', '\u2022')
    {'mac_iceland', 'mac_roman', 'mac_turkish', 'mac_latin2', 'mac_cyrillic'}
dan04
Interesting - thanks.
AP257