I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.

Is there any possible way in Python to have a character like ë́ be represented as 1?

I'm using UTF-8 encoding for the actual code and web page it is being outputted to.

edit: Just some background on why I need to do this. I am working on a project that translates English to Seneca (a form of Native American language) and ë́ shows up quite a bit. Some rewrite rules for certain words require knowledge of letter position (itself and surrounding letters) and other characteristics, such as accents and other diacritic markings.

A: 

Rather than guessing, let's work out why you're getting those results. Where are you getting the string from? If you're loading it from a file, what does it look like in binary?

It sounds like it could be due to combining characters - you may be able to normalize to a form which uses a single character for the combined glyph, but I don't think that's always possible.
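
A quick way to check whether normalization helps is to compare lengths before and after. This is only a sketch in Python 3 syntax, and it assumes the string consists of the code points written out below (escapes are used so nothing gets lost in copy and paste):

```python
import unicodedata

# Assumed spelling of the question's string:
# e-diaeresis, combining acute, a, u-acute, l, t
s = "\u00eb\u0301a\u00falt"

nfc = unicodedata.normalize("NFC", s)  # compose where a single character exists
nfd = unicodedata.normalize("NFD", s)  # split everything into base + marks

print(len(s))    # 6 code points as written
print(len(nfc))  # still 6: the acute on the e stays a separate code point
print(len(nfd))  # 8: e-diaeresis and u-acute are each split in two
```

If the NFC length is still larger than the number of visible letters, normalization alone cannot give one position per letter.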

Jon Skeet
+9  A: 

UTF-8 is a Unicode encoding which uses more than one byte for special characters. If you don't want the length of the encoded string, simply decode it and use len() on the unicode object (and not the str object!).

Here are some examples:

>>> # creates a str literal (with utf-8 encoding, if this was
>>> # specified on the beginning of the file):
>>> len('ë́aúlt') 
9
>>> # creates a unicode literal (you should generally use this
>>> # version if you are dealing with special characters):
>>> len(u'ë́aúlt') 
6
>>> # the same str literal (written in an encoded notation):
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt') 
9
>>> # you can convert any str to a unicode object by calling decode() on it:
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt'.decode('utf-8')) 
6

Of course, you can also access single characters in a unicode object just as you would in a str object (both inherit from basestring and therefore have the same methods):

>>> test = u'ë́aúlt'
>>> print test[0]
ë

If you develop localized applications, it's generally a good idea to use only unicode-objects internally, by decoding all inputs you get. After the work is done, you can encode the result again as 'UTF-8'. If you keep to this principle, you will never see your server crashing because of any internal UnicodeDecodeErrors you might get otherwise ;)

PS: Please note, that the str and unicode datatype have changed significantly in Python 3. In Python 3 there are only unicode strings and plain byte strings which can't be mixed anymore. That should help to avoid common pitfalls with unicode handling...
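
For comparison, here is roughly what the same decode-at-the-boundary idea looks like in Python 3. It is a minimal sketch; the byte string is the one from the examples above and the variable names are only illustrative:

```python
# Bytes as they might arrive from a file or a request (UTF-8 encoded)
raw = b"\xc3\xab\xcc\x81a\xc3\xbalt"

text = raw.decode("utf-8")  # bytes -> str, a sequence of code points
print(len(raw))             # 9 bytes
print(len(text))            # 6 code points

# Mixing the two types raises TypeError instead of guessing an encoding
try:
    raw + text
except TypeError:
    print("cannot mix bytes and str")

# Encode again only when writing the result out
encoded = text.encode("utf-8")
```

Note that len() still reports 6 here, because the combining accent remains its own code point after decoding.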

Regards, Christoph

tux21b
+++1 :-) aus .at
Flavius
I think this answer highlights the problem - the accents over the `ea` are different to those in the question :)
gnibbler
Oh, you are right. I think I lost the character while copying it... sorry for that. Unfortunately there seems to be no single Unicode character which can represent both accents. I have never seen something like that before (at least the German umlauts I know can be written both ways, as a single character and as a combined one).
tux21b
+1  A: 

The best you can do is to use unicodedata.normalize() to decompose the character and then filter out the accents.

Don't forget to use unicode and unicode literals in your code.
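
A minimal sketch of that approach, in Python 3 syntax. The helper name `strip_accents` is made up for this example, and the input is assumed to be the string from the question, written with escapes:

```python
import unicodedata

def strip_accents(s):
    """Decompose s (NFD) and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", s)
    # unicodedata.combining() returns 0 for base characters
    return "".join(c for c in decomposed if not unicodedata.combining(c))

plain = strip_accents("\u00eb\u0301a\u00falt")
print(plain)       # eault
print(len(plain))  # 5
```

This gives one position per letter, but it throws the accents away, so it only fits rules that do not depend on the diacritics.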

Ignacio Vazquez-Abrams
+2  A: 

The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.

Yes. That's how code points are defined by Unicode. In general, you can ask Python to combine a letter and a separate ‘combining’ diacritical mark like U+0301 COMBINING ACUTE ACCENT into a single character using Unicode normalisation:

>>> unicodedata.normalize('NFC', u'a\u0301')
u'\xe1' # single character: á

However, there is no single character in Unicode for “e with diaeresis and acute accent” because no language in the world has ever used the letter ‘ë́’. (Pinyin transliteration has “u with diaeresis and acute accent”, but not ‘e’.) Consequently font support is poor; it renders really badly in many cases and is a messy blob on my web browser.

Working out where the ‘editable points’ in a string of Unicode code points are is a tricky job that requires quite a bit of domain knowledge of languages. It's part of the issue of “complex text layout”, an area which also includes issues such as bidirectional text, contextual glyph shaping and ligatures. To do complex text layout you'll need a library such as Uniscribe on Windows, or Pango generally (for which there is a Python interface).

If, on the other hand, you merely want to completely ignore all combining characters when doing a count, you can get rid of them easily enough:

import unicodedata

def withoutcombining(s):
    # combining() returns 0 for base characters, non-zero for combining marks
    return ''.join(c for c in s if unicodedata.combining(c)==0)

>>> withoutcombining(u'ë́aúlt')
u'\xeba\xfalt' # ëaúlt
>>> len(_)
5
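
If the accents have to be kept rather than dropped, a middle ground is to group each base character with the combining marks that follow it. The sketch below uses Python 3 syntax and a made-up helper name, `clusters`. It is a simplification that only looks at combining marks and does not implement the full Unicode text segmentation rules:

```python
import unicodedata

def clusters(s):
    """Split s into base characters, each followed by its combining marks."""
    out = []
    for c in s:
        if out and unicodedata.combining(c):
            out[-1] += c   # attach the mark to the preceding base character
        else:
            out.append(c)  # start a new cluster
    return out

word = clusters("\u00eb\u0301a\u00falt")
print(len(word))                # 5
print([len(c) for c in word])   # [2, 1, 1, 1, 1]
```

Indexing into the resulting list gives one entry per visible letter, and each entry still carries its accents for rules that need them.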
bobince
+1 This answer works. Note that the ë́ in the code section displays wrongly, but I believe that is just a font/browser issue.
gnibbler
A: 

Which Python version are you using? Python 3.1 doesn't have this issue.

>>> print(len("ë́aúlt"))
6

Regards Djoudi

Djoudi
A: 

You said: I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.

The first step in working on any Unicode problem is to know exactly what is in your data; don't guess. In this case your guess is correct; it won't always be.

"Exactly what is in your data": use the repr() built-in function (for lots more things apart from unicode). A useful advantage of showing the repr() output in your question is that answerers then have exactly what you have. Note that your text displays in only FOUR positions instead of 5 with some browsers/fonts -- the 'e' and its diacritics and the 'a' are mangled together in one position.

You can use the unicodedata.name() function to tell you what each component is.

Here's an example:

# coding: utf8
import unicodedata
x = u"ë́aúlt"
print repr(x)
for c in x:
    try:
        name = unicodedata.name(c)
    except ValueError:  # raised when a code point has no name
        name = "<no name>"
    print "U+%04X" % ord(c), repr(c), name

Results:

u'\xeb\u0301a\xfalt'
U+00EB u'\xeb' LATIN SMALL LETTER E WITH DIAERESIS
U+0301 u'\u0301' COMBINING ACUTE ACCENT
U+0061 u'a' LATIN SMALL LETTER A
U+00FA u'\xfa' LATIN SMALL LETTER U WITH ACUTE
U+006C u'l' LATIN SMALL LETTER L
U+0074 u't' LATIN SMALL LETTER T

Now read @bobince's answer :-)

John Machin