ansaurus

Question

UTF-8 and upper()

Answer 1

+9 A:

Do not perform actions on encoded strings; decode to unicode first.

>>> mystring = "işğüı"
>>> print mystring.decode('utf-8').upper()
IŞĞÜI

Ignacio Vazquez-Abrams 2010-02-23 01:00:23
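
To see why the decode step matters, here is a minimal sketch of the same operation done both ways. It is written for Python 3, where str is already Unicode and bytes is the encoded form, so the variable names and the explicit .encode call are assumptions for illustration rather than part of the answer above.

```python
# Python 3 sketch: `encoded` stands in for what a plain "..." literal
# holds in Python 2, i.e. the raw UTF-8 bytes.
encoded = "işğüı".encode("utf-8")

# Upper-casing the encoded bytes only changes ASCII letters; the
# multi-byte characters pass through untouched.
wrong = encoded.upper().decode("utf-8")

# Decoding first lets upper() operate on real characters.
right = encoded.decode("utf-8").upper()

print(wrong)
print(right)
```

Only the leading ASCII "i" changes in the first result, because every other character is stored as bytes outside the ASCII range.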

Answer 2

+6 A:

It's actually best, as a general strategy, to always keep your text as Unicode once it's in memory: decode it at the moment it's input, and encode it exactly at the moment you need to output it, if there are specific encoding requirements at input and/or output times.

Even if you don't choose to adopt this general strategy (and you should!), the only sound way to perform the task you require is still to decode, process, encode again -- never to work on the encoded forms. I.e.:

mystring = "işğüı"
print mystring.decode('utf-8').upper().encode('utf-8')

assuming you're constrained to encoded strings at assignment and for output purposes. (The output constraint is unfortunately realistic, the assignment constraint isn't -- just do mystring = u"işğüı", making it unicode from the start, and save yourself at least the .decode call!-)

Alex Martelli 2010-02-23 01:05:43
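
The decode, process, encode pattern can be sketched as a small function with the two conversions at its edges. This is a Python 3 sketch under assumptions: shout is a hypothetical helper name, and the incoming bytes stand in for data read from a file or socket.

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: decode on the way in, encode on the way out."""
    text = raw.decode(encoding)       # input boundary: bytes -> str
    result = text.upper()             # all processing happens on Unicode text
    return result.encode(encoding)    # output boundary: str -> bytes


# Stand-in for bytes arriving from outside the program.
incoming = "işğüı".encode("utf-8")
outgoing = shout(incoming)

print(outgoing.decode("utf-8"))
```

Keeping both conversions at the edges means nothing in between ever has to know which encoding the outside world uses.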

The same strategy is a good idea for dates/times. Convert to UTC (or at least make them TZ-aware) as soon as they are input, and convert back to the correct timezone only when they are output. This solves a bunch of problems with differing timezones and daylight saving.

gnibbler 2010-02-23 01:11:10
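
A rough sketch of that idea with the standard datetime module follows. The fixed offsets and the sample timestamp are assumptions chosen for illustration; fixed offsets do not track daylight saving, so real code would more likely use named zones (for example via zoneinfo in Python 3.9+).

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets, for illustration only.
input_tz = timezone(timedelta(hours=2))
output_tz = timezone(timedelta(hours=-5))

# Input boundary: attach the zone, then normalise to UTC.
local_input = datetime(2010, 2, 23, 3, 11, 10, tzinfo=input_tz)
stored = local_input.astimezone(timezone.utc)

# Processing happens entirely on UTC values.
deadline = stored + timedelta(hours=24)

# Output boundary: convert to the viewer's zone only for display.
shown = deadline.astimezone(output_tz)

print(stored.isoformat())
print(shown.isoformat())
```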

@gnibbler: good point, I agree. And for many financial computations, converting input immediately to decimal (rather than accepting default floats) can avoid many "where did this penny disappear" accounting nightmares;-).

Alex Martelli 2010-02-23 01:17:34
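
The disappearing penny is easy to reproduce. In this small sketch the price strings are made-up sample input; the point is only that converting them to Decimal straight away keeps the sum exact, while going through float does not.

```python
from decimal import Decimal

# Made-up sample input, as it might arrive from a form or a CSV file.
prices = ["0.10", "0.20"]

# Accepting default floats: binary rounding error creeps into the total.
float_total = sum(float(p) for p in prices)

# Converting to Decimal at input time keeps the arithmetic exact.
decimal_total = sum(Decimal(p) for p in prices)

print(float_total == 0.3)
print(decimal_total == Decimal("0.30"))
```

Note that the Decimal values are built from the original strings, not from floats; Decimal(0.1) would carry the float's rounding error along with it.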
