Hello.

In Python, strings may be Unicode (both UTF-16 and UTF-8) or single-byte with different encodings (cp1251, cp1252, etc.). Is it possible to check what encoding a string is in? For example,

time.strftime( "%b" )

will return a string with the text name of a month. Under Mac OS the returned string will be UTF-16; under Windows with an English locale it will be a single-byte string in ASCII; and under Windows with a non-English locale it will be encoded in the locale's codepage, for example cp1251. How can I handle such strings?

+1  A: 

Charset encoding detection is very complex.

However, what's your real purpose here? If you just want the value to be in Unicode, simply write

unicode(time.strftime("%b"))

and it should work for all the cases you've mentioned above:

  • Mac OS: unicode(unicode) -> unicode
  • Win/Eng: unicode(ascii) -> unicode
  • Win/non-Eng: unicode(some_cp, codepage) -> unicode -- note that unicode() with no encoding argument assumes ASCII, so here the locale's codepage must be passed explicitly
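
A minimal Python 2 sketch of that branch logic (locale.getpreferredencoding() is used here as one way to obtain the locale's codepage):

import locale
import time

time_str = time.strftime("%b")
if isinstance(time_str, unicode):
    month = time_str  # already unicode (the Mac OS case)
else:
    # byte string: decode explicitly; unicode() with no encoding
    # argument assumes ASCII and fails on e.g. cp1251 bytes
    month = unicode(time_str, locale.getpreferredencoding())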
Francis
+5  A: 

Strings don't store any encoding information; you just have to specify one when you convert to/from Unicode or print to an output device:

import locale

# getdefaultlocale() returns e.g. ('en_US', 'UTF-8')
lang, encoding = locale.getdefaultlocale()
mystring = u"blabla"
# encode the unicode string to the locale's byte encoding for printing
print mystring.encode(encoding)

UTF-8 is not Unicode; it's an encoding of Unicode text into byte strings.

The best practice is to work with Unicode everywhere on the Python side, store your strings with a lossless Unicode encoding such as UTF-8, and convert to the locale's encoding only for user output.
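
For the original strftime example, that decode-early/encode-late pattern might look like the following minimal Python 2 sketch (locale.getpreferredencoding() is used here as a guess at the byte encoding strftime produces):

import locale
import time

locale.setlocale(locale.LC_ALL, "")        # use the user's default locale
encoding = locale.getpreferredencoding()   # e.g. 'cp1251' or 'UTF-8'

# decode the locale-encoded byte string into unicode as early as possible
month = time.strftime("%b").decode(encoding)

# work with unicode internally; encode only at the boundaries:
stored = month.encode("utf-8")   # lossless encoding for storage
print month.encode(encoding)     # locale encoding for user output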

Luper Rouch
+1  A: 

If you have a reasonably long string in an unknown encoding, you can try to guess the encoding, e.g. with the Universal Encoding Detector at http://chardet.feedparser.org/ -- not foolproof, of course, but sometimes it guesses right ;-). But that won't help much with very short strings.
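
A minimal sketch of using that detector (assuming the chardet package is installed; detect() returns a guessed encoding plus a confidence score, and the filename here is just a placeholder):

import chardet

raw = open("some_file.txt", "rb").read()   # chardet wants a byte string
guess = chardet.detect(raw)                # e.g. {'encoding': 'windows-1251', 'confidence': 0.87}
if guess["encoding"] is not None:
    text = raw.decode(guess["encoding"])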

Alex Martelli