I have this:
>>> print 'example'
example
>>> print 'exámple'
exámple
>>> print 'exámple'.upper()
EXáMPLE
What I need to do to print:
EXÁMPLE
(Where the 'a' keeps its acute accent, but in uppercase.)
I'm using Python 2.6.
I think it's as simple as not converting to ASCII first.
>>> print u'exámple'.upper()
EXÁMPLE
In python 2.x, just convert the string to unicode before calling upper(). Using your code, which is in utf-8 format on this webpage:
>>> s = 'exámple'
>>> s
'ex\xc3\xa1mple' # my terminal is not utf8. c3a1 is the UTF-8 hex for á
>>> s.decode('utf-8').upper()
u'EX\xc1MPLE' # \xc1 is the Unicode code point U+00C1, i.e. Á
The call to decode takes the string from its current encoding to unicode. You can then convert it to some other encoding, such as utf-8, by using encode. If the string were in, say, iso-8859-2 (Czech, among others), you would instead use s.decode('iso-8859-2').upper().
As in my case, if your terminal is not unicode/utf-8 compliant, the best you can hope for is either a hex representation of the characters (like mine) or a lossy conversion via s.decode('utf-8').upper().encode('ascii', 'replace'), which results in 'EX?MPLE'. If you can't make your terminal show unicode, write the output to a file in utf-8 format and open that in your favourite editor.
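The decode → upper → encode round trip described above can be sketched end-to-end (Python 3 shown; in Python 2 the byte literals below are plain str literals):

```python
# Round trip: byte-oriented data -> text -> uppercase -> byte-oriented data.
raw = b'ex\xc3\xa1mple'                       # UTF-8 bytes for 'exámple'
text = raw.decode('utf-8')                    # decode to a real text object
upper = text.upper()                          # Unicode-aware uppercasing: 'EXÁMPLE'
utf8_out = upper.encode('utf-8')              # b'EX\xc3\x81MPLE'
ascii_out = upper.encode('ascii', 'replace')  # lossy fallback: b'EX?MPLE'
```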
I think there's a bit of background we're missing here:
>>> type('hello')
<type 'str'>
>>> type(u'hello')
<type 'unicode'>
As long as you're using "unicode" strings instead of "native" strings, methods like upper() will operate with unicode in mind. FWIW, Python 3 uses unicode strings by default, making the distinction largely irrelevant.
Taking a string from unicode to str and then back to unicode is suboptimal in many ways, and many libraries will produce unicode output if you want it; so try to use only unicode objects for strings internally whenever you can.
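A quick way to see the difference (sketched here in Python 3 terms, where Python 2's "native" string corresponds to bytes):

```python
data = 'exámple'.encode('utf-8')  # byte-oriented data
text = 'exámple'                  # text (unicode)

len(data)      # 8 -- á occupies two bytes in UTF-8
len(text)      # 7 -- á is a single character
data.upper()   # b'EX\xc3\xa1MPLE' -- only ASCII bytes are uppercased
text.upper()   # 'EXÁMPLE'         -- case mapping is Unicode-aware
```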
first off, i only use python 3.1 these days; its central merit is to have disambiguated byte strings from unicode objects. this makes the vast majority of text manipulations much safer than used to be the case. weighing in the trillions of user questions regarding python 2.x encoding problems, the u'äbc' convention of python 2.x was just a mistake; with explicit bytes and bytearray, life becomes so much easier.
secondly, if py3k is not your flavor, then try to go with from __future__ import unicode_literals, as this will mimic py3k's behavior on python 2.6 and 2.7. it would have avoided the (easily committed) blunder you made when saying print 'exámple'.upper(). essentially, that is the same as writing, in py3k, print( 'exámple'.encode( 'utf-8' ).upper() ). compare these versions (for py3k):
print( 'exámple'.encode( 'utf-8' ).upper() )
print( 'exámple'.encode( 'utf-8' ).upper().decode( 'utf-8' ) )
print( 'exámple'.upper() )
the first one is, basically, what you did when you used a bare string 'exámple', provided your default encoding is utf-8 (according to a BDFL pronouncement, setting the default encoding at run time is a bad idea, so in py2 you'd have to trick it by saying import sys; reload( sys ); sys.setdefaultencoding( 'utf-8' ); i present a better solution for py3k below). when you look at the output of these three lines:
b'EX\xc3\xa1MPLE'
EXáMPLE
EXÁMPLE
you can see that when upper() got applied to the first text, it acted on bytes, not on characters. python allows the upper() method on bytes, but it is only defined for the US-ASCII interpretation of bytes. since utf-8 encodes non-ascii characters using byte values outside of US-ASCII (128 up to 255), those won't be affected by upper(), so when we decode back in the second line, we get that lower-case á. finally, the third line does it right, and yes, surprise, python seems to be aware that Á is the upper case letter corresponding to á. i ran a quick test to see what characters python 3 does not convert between upper and lower case:
for cid in range( 3000 ):
    my_chr = chr( cid )
    if my_chr == my_chr.upper() and my_chr == my_chr.lower():
        print( my_chr )
perusing the list reveals very few instances of latin, cyrillic, or greek letters; most of the output is non-european characters and punctuation. the only characters i could find that python got wrong are Ԥ/ԥ (\u0524, \u0525, 'cyrillic {capital|small} letter pe with descender'), so as long as you stay outside of the Latin Extended-X blocks (check out those, they might yield surprises), you might actually use that method. of course, i did not check the correctness of the mappings.
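a caveat on that pair: unicode tables have moved on since python 3.1, and on recent pythons the pe-with-descender characters do map to each other, so the scan's output depends on the python/unicode version. a minimal check:

```python
import unicodedata

cap, small = '\u0524', '\u0525'
# Names confirm which characters these are.
unicodedata.name(cap)    # 'CYRILLIC CAPITAL LETTER PE WITH DESCENDER'
# On recent Pythons the case mapping exists in both directions.
cap.lower() == small
small.upper() == cap
```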
lastly, here is what i put into my py3k application boot section: a method that redefines the encoding sys.stdout uses, with numerical character references (NCRs) as a fallback; this has the effect that printing to standard output will never raise a unicode encoding error. when i work on ubuntu, _sys.stdout.encoding is utf-8; when the same program runs on windows, it might be something quaint like cp850. the output might look strange, but the application runs without raising an exception on those dim-witted terminals.
# (assumes the aliases `import sys as _sys` and `import io as _sys_io`)
#===========================================================================================================
# MAKE STDOUT BEHAVE IN A FAILSAFE MANNER
#-----------------------------------------------------------------------------------------------------------
def _harden_stdout():
  """Ensure that unprintable output to STDOUT does not cause encoding errors; use XML character references
  so any kind of output gets a chance to render in a decipherable way."""
  global _sys_TRM
  _sys.stdout = _sys_TRM = _sys_io.TextIOWrapper(
    _sys.stdout.buffer,
    encoding       = _sys.stdout.encoding,
    errors         = 'xmlcharrefreplace',
    line_buffering = True )
#...........................................................................................................
_harden_stdout()
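the effect of errors='xmlcharrefreplace' can be sketched in isolation with an in-memory buffer standing in for a limited terminal (the names here are illustrative, not from the snippet above):

```python
import io

# An ASCII-only 'terminal': unencodable characters become &#NNN; references
# instead of raising UnicodeEncodeError.
buffer = io.BytesIO()
wrapper = io.TextIOWrapper(buffer, encoding='ascii', errors='xmlcharrefreplace')
wrapper.write('EXÁMPLE')
wrapper.flush()
buffer.getvalue()   # b'EX&#193;MPLE' -- Á is U+00C1, i.e. 193
```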
one more piece of advice: when testing, always try to print repr( x ) or a similar thing that reveals the identity of x. all kinds of misunderstandings can crop up if you just print x in py2 and x is either an octet string or a unicode object. it is very puzzling and prone to cause a lot of head-scratching. as i said, try to move at least to py26 with that from __future__ import unicode_literals incantation.
and to close, quoting a quote: Glyph Lefkowitz says it best in his article Encoding:
"I believe that in the context of this discussion, the term 'string' is meaningless. There is text, and there is byte-oriented data (which may very well represent text, but is not yet converted to it). In Python types, Text is unicode. Data is str. The idea of 'non-Unicode text' is just a programming error waiting to happen."
update: just found python 3 correctly converts ſ LATIN SMALL LETTER LONG S to S when uppercasing. neat!
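a minimal check of that mapping, plus the related one-to-many case Python 3 also handles (and casefold(), available since python 3.3, which is the method meant for caseless comparison):

```python
assert 'ſ'.upper() == 'S'              # LATIN SMALL LETTER LONG S uppercases to plain S
assert 'ſ'.lower() == 'ſ'              # it is already lower case
assert 'straße'.upper() == 'STRASSE'   # ß expands to SS: upper() may change the length
assert 'ſ'.casefold() == 's'           # casefold() normalizes for caseless matching
```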