I would like to convert this string

foo_utf = u'nästy chäräctörs with å and co.' # unicode

into this

foo_ascii = 'nästy chäräctörs with å and co.' # ASCII


Any idea how to do this in Python (2.6)? I found the unicodedata module, but I have no idea how to do the transformation.

+3  A: 

Try the string's encode method.

>>> u'nästy chäräctörs with å and co.'.encode('latin-1')
'n\xe4sty ch\xe4r\xe4ct\xf6rs with \xe5 and co.'
Eli Bendersky
+4  A: 

I don't think you can. Those "nästy chäräctörs" can't be encoded as ASCII, so you'll have to pick a different encoding (UTF-8 or Latin-1 or Windows-1252 or something).
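
As a minimal sketch of that point (the helper name try_encode is made up for illustration, and the string is written with escapes so the snippet does not depend on the source file's encoding), the ASCII codec rejects the string while Latin-1 and UTF-8 accept it:

```python
text = u'n\xe4sty ch\xe4r\xe4ct\xf6rs with \xe5 and co.'

def try_encode(value, encoding):
    """Return the encoded byte string, or None if the codec cannot represent value."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None

ascii_result = try_encode(text, 'ascii')     # None: the accented letters are outside ASCII
latin1_result = try_encode(text, 'latin-1')  # one byte per character
utf8_result = try_encode(text, 'utf-8')      # two bytes for each accented letter
```

Which of the working encodings is the right one depends on what the consumer of the bytes expects.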

Will McCutchen
This is true. ASCII contains only 128 characters (code points 0 to 127), none of which have diacritical marks. It's possible to convert the string to ANSI if you select the correct code page. In any case, it's best to stick with Unicode unless you have no other choice.
Peter Ruderman
Thanks. That's a good point. I forgot totally about that. :)
bebraw
+2  A: 

You can also use the unicodedata module (http://docs.python.org/library/unicodedata.html) provided with Python to convert many Unicode characters into an ASCII variant, i.e. fix the different " characters and such. Follow that up with the encode() method and you can completely clean up a string.

The function you mainly want from unicodedata is normalize, passing it the NFKC form.
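
A rough sketch of the normalize-then-encode idea follows. Note that it swaps in the NFKD form rather than NFKC: NFKC recomposes characters, so ä stays a single non-ASCII character, whereas NFKD splits it into a plain a plus a combining mark that the ASCII codec can then drop. The function name strip_accents is hypothetical.

```python
import unicodedata

text = u'n\xe4sty ch\xe4r\xe4ct\xf6rs with \xe5 and co.'

def strip_accents(value):
    # NFKD decomposes each accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize('NFKD', value)
    # Encoding with 'ignore' silently drops everything that is not ASCII,
    # which here means the combining marks.
    return decomposed.encode('ascii', 'ignore')

result = strip_accents(text)
```

This is lossy: the accents are gone for good, and characters with no decomposition (ß or ø, for example) are dropped entirely rather than replaced.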

NerdyNick
+2  A: 

There are several options in the codecs module in Python's stdlib, depending on how you want the extended characters handled:

>>> import codecs
>>> u = u'nästy chäräctörs with å and co.'
>>> encode = codecs.get_encoder('ascii')
>>> encode(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)
>>> encode(u, 'ignore')
('nsty chrctrs with  and co.', 31)
>>> encode(u, 'replace')
('n?sty ch?r?ct?rs with ? and co.', 31)
>>> encode(u, 'xmlcharrefreplace')
('n&#228;sty ch&#228;r&#228;ct&#246;rs with &#229; and co.', 31)
>>> encode(u, 'backslashreplace')
('n\\xe4sty ch\\xe4r\\xe4ct\\xf6rs with \\xe5 and co.', 31)

Hopefully one of those will meet your needs. There's more information available in the Python codecs module documentation.
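
The same error-handler names can also be passed as the second argument to the string's own encode method, which returns just the encoded bytes without the length. A small sketch (the variable names are arbitrary):

```python
text = u'n\xe4sty ch\xe4r\xe4ct\xf6rs with \xe5 and co.'

handlers = ['ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace']

# Map each handler name to the ASCII bytes it produces for the same input.
results = dict((name, text.encode('ascii', name)) for name in handlers)
```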

jcdyer
+2  A: 

This really is a Django question, not a Python one. If the string is in one of your .py files, make sure that you have the following line at the top of your file: # -*- coding: utf-8 -*-

Furthermore, your string needs to be of type unicode (u'foobar').

And then make sure that your HTML page declares UTF-8:

<meta http-equiv="content-type" content="text/html;charset=utf-8" />

That should do the whole trick. No encoding/decoding etc. necessary, just make sure that everything is unicode, and you are on the safe side.
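
A minimal sketch of such a source file, assuming it is actually saved as UTF-8. The declaration has to be a comment on the first or second line; the variable names are only for illustration.

```python
# -*- coding: utf-8 -*-

# A unicode literal containing non-ASCII characters typed directly in the source.
greeting = u'nästy chäräctörs with å and co.'

# Keep the value as unicode inside the program and encode only at the
# boundary where bytes are required (for example, when writing a response).
encoded = greeting.encode('utf-8')
```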

mawimawi
Thanks for excellent pointers. I managed to trace the issue down to a str conversion in the code that broke it apart. I found other comments insightful as well. :)
bebraw
Also, actually save the file in utf-8 so it agrees with the coding declaration.
Mark Tolonen