views: 394
answers: 4

Inside my Python script, I get a string back from a function which I didn't write. Its encoding varies. I need to convert it to ASCII format. Is there some fool-proof way of doing this? I don't mind replacing the non-ASCII chars with blanks or something else...

+3  A: 

You say "the encoding of it varies". I guess that by "it" you mean a Python 2.x "string", which is really a sequence of bytes.

Answer part one: if you do not know the encoding of that encoded string, then no, there is no way at all to do anything meaningful with it*. If you do know the encoding, then step one is to convert your str into a unicode:

encoded_string = i_have_no_control()  # a byte string in some encoding
the_encoding = 'utf-8'  # for the sake of example
text = unicode(encoded_string, the_encoding)

Then you can re-encode your unicode object as ASCII, if you like.

ascii_garbage = text.encode('ascii', 'replace')

* There are heuristic methods for guessing encodings, but they are slow and unreliable. Here's one excellent attempt in Python.
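For what it's worth, a minimal sketch of such heuristic guessing, using the third-party chardet library (an assumption on my part; it is not in the standard library, and `i_have_no_control()` is the hypothetical function from above):

import chardet  # third-party heuristic encoding detector; must be installed separately

encoded_string = i_have_no_control()
guess = chardet.detect(encoded_string)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
if guess['encoding'] is not None:
    text = unicode(encoded_string, guess['encoding'], 'replace')
    ascii_garbage = text.encode('ascii', 'replace')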

Jonathan Feinberg
*"no, there is no way at all to do anything meaningful with it"* -- nearly every character set in use today inherits its lower characters from ASCII. In this case, **there is something meaningful** you can do: throw away all non-ASCII characters. This is what the asker wants.The exceptions (UTF-16 and UTF-32) would never be confused with any other character sets, so I believe it's safe to ignore those.
intgr
You're seemingly of the opinion that the only character encodings in the world are defined by Unicode, but that isn't so. There are dozens more commonly used ones, such as Shift-JIS, Windows-1252, etc. What's more, "converting to ascii" usually means "normalizing" characters, such as converting `ä` to `a`, which you certainly can't do by assuming your encoding is one byte per character and masking non-ascii bytes, as you suggest!
Jonathan Feinberg
**Both** Shift-JIS and Windows-1252 inherit the lower ASCII codepoints from ASCII. Thus, stripping all characters with the high bit set (which is what my answer does) works in the common case. This is not ideal, but in many cases sufficient. If you simply do not know the encoding, then **obviously** you cannot normalize it. As for autodetection, some character sets in the ISO-8859-* series have so many overlaps and ambiguities that they are essentially impossible to distinguish.
intgr
A: 

If all you want to do is preserve ASCII-compatible characters and throw away the rest, then in most encodings that boils down to removing all characters that have the high bit set -- i.e., characters with value over 127. This works because nearly all character sets are extensions of 7-bit ASCII.

If it's a normal string (i.e., not unicode), you need to decode it with a character set in which every byte value is valid (such as iso-8859-1) and then encode to ascii, using the ignore or replace option for errors:

>>> orig = '1ä2äö3öü4ü'
>>> orig.decode('iso-8859-1').encode('ascii', 'ignore')
'1234'
>>> orig.decode('iso-8859-1').encode('ascii', 'replace')
'1??2????3????4??'

The decode step is necessary because you need a unicode string in order to use encode. If you already have a Unicode string, it's simpler:

>>> orig = u'1ä2äö3öü4ü'
>>> orig.encode('ascii', 'ignore')
'1234'
>>> orig.encode('ascii', 'replace')
'1??2????3????4??'
intgr
Going directly to ascii (as a unicode object) is also possible: `'1ä2äö3öü4ü'.decode("ascii", "ignore")`. Just because you use a simplified character set doesn't make the unicode type a bad choice for textual strings IMO.
kaizer.se
If your default encoding doesn't happen to be iso-8859-1, then your very first line there will explode when you attempt to decode that source string as iso-8859-1.
Jonathan Feinberg
@Jonathan Feinberg: **Decoding from iso-8859-1 never fails** because any byte sequence has a defined meaning and is legal in ISO-8859-1. What does the default encoding have to do with it? I specify encodings everywhere explicitly.
intgr
@kaizer.se: It works with `'ignore'`, but when you use `'replace'` it would give you a Unicode string with: `u'1\ufffd\ufffd2\ufffd\ufffd\ufffd\ufffd3\ufffd\ufffd\ufffd\ufffd4\ufffd\ufffd'`
intgr
+1  A: 

I'd try to normalize the string, then encode it. What about:

import unicodedata
s = u"éèêàùçÇ"
print unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')  # prints 'eeeaucC'

This works only if you have unicode as input. Therefore, you must know what kind of encoding the function outputs and decode it first. If you don't, there are encoding-detection heuristics, but on short strings they are not reliable.

Of course, you could get lucky: the function's outputs may use various unknown encodings that all use ASCII as a code base, and therefore allocate the same values for the bytes from 0 to 127 (as utf-8 does).

In that case, you can just get rid of the unwanted chars by filtering against the printable ASCII characters:

import string  # string.printable lists the printable ASCII characters

print "".join(char for char in s if char in string.printable)

Or if you want blanks instead:

print("".join(((char if char in  string.printable else " ") for char in s )))

"translate" can help you to do the same.

The only way to know if you are this lucky is to try it out... Sometimes, a big fat lucky day is what any dev needs :-)

e-satis
+1  A: 

If you want an ASCII string that unambiguously represents what you have got, without losing any information, the answer is simple:

Don't muck about with encode/decode; use the repr() function (Python 2.x) or the ascii() function (Python 3.x).
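A minimal sketch of that approach in Python 2 (the byte values are just an assumed example):

>>> s = 'caf\xc3\xa9'  # bytes in some unknown encoding
>>> repr(s)            # a pure-ASCII representation that loses nothing
"'caf\\xc3\\xa9'"
>>> print repr(s)
'caf\xc3\xa9'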

John Machin