views:

52

answers:

2

I use the Pylons framework with Mako templates for a web-based application. I had never bothered to look too deeply into how Python handles Unicode strings. Then I had a tense moment when my site crashed while rendering a page, and I later learned it was caused by a Unicode decode error: http://wiki.python.org/moin/UnicodeDecodeError

After seeing the error, I started messing around in my Python code, adding encode and decode calls on strings with the 'ignore' option, but the errors still didn't always go away.

Finally I decoded everything to ASCII with 'ignore' and got the site running without crashes.
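
For illustration, the workaround boils down to something like the following (a simplified sketch, not my exact code; note that 'ignore' silently drops every byte that isn't plain ASCII):

    # decode-to-ascii-with-ignore: the crash goes away, but non-ASCII text is lost
    raw = b'r\xc3\xa9sum\xc3\xa9'                # utf-8 bytes for "resume" with accents
    print(raw.decode('ascii', 'ignore'))         # -> u'rsum' -- the accented chars vanish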

Input to my site comes from many other sites, so I don't control the language or encoding of the incoming text. My site supports international languages along with English. I have feed aggregation that generally doesn't bother about Unicode/ASCII/UTF-8, and when I display the text through a Mako template I display it as-is.

Not being a web expert, what are the best practices for handling strings in a Python project? Should I care only while rendering the text, or in every phase of the application?

+5  A: 

If you have influence on it, this is the painless way:

  • know your input encoding (or decode with 'ignore') and decode(encoding) the data as soon as it hits your app
  • work internally only with unicode (u'something' is unicode), also in the database
  • for rendering, export etc., any time data leaves your app, encode('utf-8') it (a minimal sketch of this pattern follows below)
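
A minimal sketch of that decode-early / unicode-inside / encode-late pattern, assuming Python 2 (to match the u'' literals above); the helper names to_unicode and to_bytes are mine, not part of Pylons or Mako:

    # -*- coding: utf-8 -*-
    # Decode at the edge, keep unicode inside, encode at the edge (Python 2).

    def to_unicode(raw, encoding='utf-8'):
        """Decode incoming bytes as soon as they enter the app."""
        if isinstance(raw, unicode):             # already decoded
            return raw
        return raw.decode(encoding, 'replace')   # or 'ignore', if you must

    def to_bytes(text, encoding='utf-8'):
        """Encode only at the boundary: rendering, export, sockets, files."""
        return text.encode(encoding)

    title = to_unicode('caf\xc3\xa9')            # utf-8 bytes -> u'caf\xe9'
    body = u'everything in between stays unicode'
    html = u'<h1>%s</h1><p>%s</p>' % (title, body)
    response_bytes = to_bytes(html)              # utf-8 bytes go out to the browser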
knitti
+2  A: 

this might not be a viable option for you, but let me say that a big number of encoding-related errors vanish when using python 3, just because the separation between unicode strings and byte objects has been made so much clearer. when i have to use python 2, i opt for version 2.6, where you can declare from __future__ import unicode_literals. disbelievers should actually read the link you posted, as it points out some subtleties in python's en/decoding behaviour that fortunately vanished in python 3.
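
for what it's worth, this is roughly what that future import changes on python 2.6+ (a small sketch, nothing more):

    # -*- coding: utf-8 -*-
    # with unicode_literals, plain string literals become unicode objects on
    # python 2.6+, mirroring the python 3 behaviour described above
    from __future__ import unicode_literals

    s = "blåbærgrød"
    print(type(s))          # <type 'unicode'> on python 2, <class 'str'> on python 3
    print(type(b"bytes"))   # an explicit b"" literal stays a byte string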

you say

I don't control the language or encoding of the incoming text. My site supports international languages along with English. I have feed aggregation that generally doesn't bother about Unicode/ASCII/UTF-8

well, whatever you choose to do, it is clear you do not want your web application to crash just because some dænish bløgger whose feeds you consume chose to encode their posts in an obscure scandinavian encoding scheme. the underlying problem is relevant for all web applications since URLs do not carry encoding information, and because you never know what byte sequences a malicious user might want to send you. in this case i do what i call 'safe chain-decoding': i try to decode as utf-8 first, and if that should fail, try again using cp1252. if that fails, i discard the request (HTTP 404) or something similar.
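
a sketch of that chain-decoding idea; how you signal the rejection (here a plain ValueError) is up to your framework:

    def safe_decode(raw):
        """try utf-8 first, fall back to cp1252, reject anything else."""
        for encoding in ('utf-8', 'cp1252'):
            try:
                return raw.decode(encoding)
            except UnicodeDecodeError:
                continue
        raise ValueError('undecodable input')   # map this to a 4xx response

    print(safe_decode(b'caf\xc3\xa9'))   # valid utf-8
    print(safe_decode(b'caf\xe9'))       # not utf-8, but cp1252 decodes it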

you mention that you process feeds and that 'you' (or the feeds?) do not 'bother' about unicode and encodings. could you clarify that statement? it escapes me how one can successfully build a site that carries text in multiple languages and not care about encodings; clearly ascii-only will not carry you very far.

flow