Hi All,

I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front-end form. However, I get this error whenever the input contains a unicode single quote.

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' 
  in position 200: ordinal not in range(128)

I've been trying to convert the unicode to ascii using the code below, but still get a similar error.

UnicodeEncodeError: 'ascii' codec can't encode characters in 
position 0-9: ordinal not in range(128)

I've sifted through dozens of websites and learned a lot about unicode, but I'm still not able to convert this unicode to ascii. I don't care if the algorithm removes the unicode characters. The commented lines show some of the options I've tried, but the error persists.

import csv
import unicodedata

...

#message = unicode( unicodedata.normalize(
#                            'NFKD',contact.message).encode('ascii','ignore'))
#dmessage = (contact.message).encode('utf-8','ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message

csv_writer.writerow([
        dmessage,
])

Does anyone have any advice on removing unicode characters so I can export them to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated. Thanks, Joe

+2  A: 

Encoding is a pain, but if you're working in django have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.

The only other option I've found is to use the built-in python encode() and decode() for strings, but you have to specify the encoding for those and honestly, it's a pain.

waffle paradox
Thanks waffle paradox, I'll give the smart_unicode a shot and let you know how that goes.
Joe J
+1  A: 

You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.

So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you, if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like â€™ instead of ’.

So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).

Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters you can use the ignore argument (or replace to replace them with question marks).

dmessage = contact.message.encode('cp1252', 'ignore')

Alternatively, to give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:

dmessage = contact.message.encode('ascii', 'ignore')
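
As a rough illustration of how the two error handlers differ, here is a minimal sketch. The sample string `message` is hypothetical and simply stands in for `contact.message`; it contains U+2019 (which cp1252 can represent) and a CJK character (which it cannot).

```python
# Hypothetical sample standing in for contact.message.
# u'\u2019' is the right single quotation mark; u'\u4e2d' is a CJK
# character that has no equivalent in code page 1252.
message = u"It\u2019s \u4e2d fine"

# 'ignore' silently drops whatever the target encoding cannot hold.
cp1252_ignore = message.encode('cp1252', 'ignore')

# 'replace' substitutes a question mark instead of dropping the character.
cp1252_replace = message.encode('cp1252', 'replace')

# ASCII cannot hold either character, so both are dropped.
ascii_ignore = message.encode('ascii', 'ignore')

print(repr(cp1252_ignore))
print(repr(cp1252_replace))
print(repr(ascii_ignore))
```

With cp1252 the quote survives as the single byte 0x92 and only the CJK character is lost; with ascii both characters disappear.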
bobince
@bobince: "guess the user's system default code page" ... what problems have you experienced trying to get this authoritatively with `locale.getpreferredencoding()` or `locale.getdefaultlocale()[1]` ?
John Machin
@John: I'm thinking if Django is involved we are talking about a server-side app and there's no guarantee the server's default encoding is anything like the client's. (In the common case that the client is Windows and the server isn't, the encodings will never match.)
bobince
@bobince: The question never specified use though; for all we know the csv file could just be for persistence purposes and will only be used internally.
waffle paradox
@bobince: oh. Next question: so this django gadget has no knowledge of the user's locale and can't obtain it?
John Machin
No, there's no access to the user's default encoding for the webapp. You can guess from a combination of the user's preferred language and, using client-side-scripting, the user's browser install language (and OS language if using IE) and if you want to get really fancy, by loading in an HTML file that could be in any encoding and seeing what encoding the browser guesses. But all of these things are different settings and likely to be wrong often. If you have to support the ‘ANSI’ code page, the only reliable course is to ask the user explicitly.
bobince
Thank you all for your comments here. Retaining special characters (outside of the ascii alphabet) is not a priority for me and if I drop a few characters here or there due to encoding issues, that is OK. The text just needs to be readable and my focus is on English. I've also tried the suggestion regarding the encode('ascii','ignore') but with no success. I still get an error when it tries to convert, when I would have thought that the ignore would do just that. I just realized that I am using python 2.4.3. Do you think that makes a difference in the behavior of the encode/decode ops?
Joe J
`encode(..., 'ignore')` will never raise a UnicodeEncodeError on any Python version. You may get a Unicode**De**codeError if you pass it a byte string with non-ASCII characters in it, as it tries to decode the byte string to Unicode first using the `ascii` encoding, before re-encoding it back to bytes.
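
A small sketch of that failure mode, under the assumption that the message arrives as a UTF-8 byte string rather than a unicode object. The names `byte_string`, `error_name` and `cleaned` are made up for illustration. The decode step is written out explicitly here so the snippet behaves the same on Python 2 and 3; on Python 2, calling `.encode()` directly on a byte string performs that ascii decode implicitly.

```python
# Assumed input: UTF-8 bytes, e.g. what a form submission might deliver.
byte_string = u"It\u2019s".encode('utf-8')

# This mirrors what Python 2 does behind the scenes for
# byte_string.encode('ascii', 'ignore'): decode as ascii first, then encode.
# The 'ignore' handler only applies to the encode step, so the decode fails.
try:
    byte_string.decode('ascii').encode('ascii', 'ignore')
    error_name = None
except UnicodeDecodeError as exc:  # "as" syntax needs Python 2.6 or later
    error_name = type(exc).__name__

# Decoding with the real encoding first avoids the error entirely.
cleaned = byte_string.decode('utf-8').encode('ascii', 'ignore')

print(error_name)
print(repr(cleaned))
```

If this matches the situation in the question, the fix would be to decode with the correct source encoding before encoding to ascii; whether `contact.message` really is a byte string is not confirmed in the thread.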
bobince
+1  A: 

[caveat: I'm not a djangoist; django may have a better solution].

General non-django-specific answer:

If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:

smashcii = {
    0x2019: u"'",  # right single quotation mark -> ASCII apostrophe
    # etc
    #
}

smashed = input_string.translate(smashcii)
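
A slightly fuller sketch of the same idea, for illustration only. The extra table entries and the sample `input_string` are assumptions, not part of the original answer; the final `encode('ascii', 'ignore')` is an optional fallback that drops anything the table does not cover.

```python
# Translation table keyed by Unicode code point. Only 0x2019 comes from the
# answer above; the other entries are illustrative additions.
smashcii = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
    0x2013: u"-",   # en dash
}

# Hypothetical input containing smart quotes, an en dash and an accented letter.
input_string = u"\u201cIt\u2019s fine\u201d \u2013 Joe \u00e9"

# Map the known characters to their ASCII equivalents.
smashed = input_string.translate(smashcii)

# Fallback: drop any remaining non-ASCII characters (here, the accented e).
ascii_bytes = smashed.encode('ascii', 'ignore')

print(repr(smashed))
print(repr(ascii_bytes))
```

Translating first keeps the text readable, since quotes become plain quotes rather than vanishing, and the fallback ensures the result is safe to hand to an ASCII-only writer.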
John Machin
I'll have to give this method a shot. Might get me past this issue at least. Thank you for your suggestion.
Joe J