ansaurus

Question

Python: How to check if a unicode string contains a cased character?

Answer 1

+5 A:

import unicodedata as ud

def contains_cased(u):
  return any(ud.category(c)[0] == 'L' for c in u)

Alex Martelli 2010-08-18 02:25:05

Argh, Alex, is there anything you don't know?

e-satis 2010-08-18 07:19:03

-1 Treats East Asian characters as "cased". See my answer.

John Machin 2010-08-18 08:10:12

Answer 2

+1 A:

Use the unicodedata module:

unicodedata.category(character)

returns "Ll" for lowercase letters and "Lu" for uppercase ones.

Here you can find a list of Unicode character categories.
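A minimal sketch of how these two categories could be used to scan a string, assuming Python 3 (where `str` is already Unicode). The helper name `contains_upper_or_lower` is made up for illustration and is not part of `unicodedata`.

```python
import unicodedata


def contains_upper_or_lower(text):
    """Return True if any character is in category Ll or Lu.

    Hypothetical helper: it only looks at the two categories named
    above and ignores titlecase (Lt) and every other letter category.
    """
    return any(unicodedata.category(ch) in ("Ll", "Lu") for ch in text)


print(unicodedata.category("a"))           # category of a lowercase letter
print(unicodedata.category("A"))           # category of an uppercase letter
print(contains_upper_or_lower("123 abc"))  # string with lowercase letters
print(contains_upper_or_lower("123"))      # digits only
```

Whether Ll and Lu alone match what you mean by "cased" is a separate question; this only shows the category lookup.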

mykhal 2010-08-18 02:27:46

Answer 3

+1 A:

Here is the full scoop on Unicode character categories.

Letter categories include:

Ll -- lowercase
Lu -- uppercase
Lt -- titlecase
Lm -- modifier
Lo -- other

Note that Ll <-> islower(); similarly for Lu; (Lu or Lt) <-> istitle()

You may wish to read the complicated discussion on casing, which includes some discussion of Lm letters.

Blindly treating all "letters" as cased is demonstrably wrong. The Lo category includes 45301 codepoints in the BMP (counted using Python 2.6). A large chunk of these would be Hangul Syllables, CJK Ideographs, and other East Asian characters -- very hard to understand how they might be considered "cased".
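To see the problem concretely, here is a rough sketch, assuming Python 3. The helper `is_letter` is a made-up name that mirrors the "category starts with L" test. The exact Lo count depends on the Unicode data shipped with your Python, so it will not necessarily match the 2.6 figure quoted above.

```python
import unicodedata


def is_letter(ch):
    """Hypothetical helper: True if the general category starts with 'L'."""
    return unicodedata.category(ch)[0] == "L"


# A Latin letter, a CJK ideograph and a Hangul syllable.
for ch in ("a", "\u4e2d", "\ud55c"):
    unchanged_by_case_ops = ch.upper() == ch == ch.lower()
    print(repr(ch), unicodedata.category(ch), is_letter(ch), unchanged_by_case_ops)

# Count the Lo code points in the BMP for this interpreter's Unicode data.
lo_count = sum(unicodedata.category(chr(cp)) == "Lo" for cp in range(0x10000))
print(lo_count)
```

The ideograph and the Hangul syllable both pass the letter test, yet neither changes under `upper()` or `lower()`.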

You might like to consider an alternative definition, based on the (unspecified) behaviour of "cased characters" that you expect. Here's a simple first attempt:

>>> cased = lambda c: c.upper() != c or c.lower() != c
>>> sum(cased(unichr(i)) for i in xrange(65536))
1970
>>>

Interestingly there are 1216 x Ll and 937 x Lu, a total of 2153 ... scope for further investigation of what Ll and Lu really mean.
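For readers on Python 3, the same "first attempt" might look like the sketch below. The function names are made up for illustration. Python 3 applies full case mappings (for example, `"ß".upper()` gives `"SS"`), and the Unicode data is newer, so the BMP count should not be expected to equal 1970.

```python
def has_case_partner(ch):
    """True if upper() or lower() changes the character."""
    return ch.upper() != ch or ch.lower() != ch


def contains_cased_char(text):
    """True if at least one character in text has a case partner."""
    return any(has_case_partner(ch) for ch in text)


# Count over the BMP; the result depends on the interpreter's Unicode data.
bmp_count = sum(has_case_partner(chr(cp)) for cp in range(0x10000))
print(bmp_count)

print(contains_cased_char("abc"))
print(contains_cased_char("\u4e2d\u6587 123"))  # CJK text plus digits
```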

John Machin 2010-08-18 08:08:26

@John: Wow. Thanks for your explanation. It took me a while to understand it. I took a look at your link, and I think I have to study it more extensively. I have a feeling that what I'm going to find out is going to make me overhaul a lot of my code. Yikes. Thanks!

Albert 2010-08-19 05:40:46

@Albert: Don't panic. As I've hinted, firstly develop a definition of what you mean by "cased". What different treatment will you apply to cased chars as opposed to uncased chars? My example definition was "char which has an uppercase or lowercase 'partner'". Some (maybe all) of the difference between the 1970 chars and the 2153 appears to be due to chars which are classified as `Ll` because they look like a lowercase character, but don't have a `Lu` partner, and vice versa -- you need to decide whether these are "cased" for your purposes. BTW you can change your accepted answer :-)
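One way to look at those partnerless characters is sketched below, assuming Python 3. The function name `lowercase_without_partner` is made up for illustration, and the list you get depends on the Unicode data in your interpreter.

```python
import unicodedata


def lowercase_without_partner(limit=0x10000):
    """Return the Ll characters below limit that upper() leaves unchanged."""
    result = []
    for cp in range(limit):
        ch = chr(cp)
        if unicodedata.category(ch) == "Ll" and ch.upper() == ch:
            result.append(ch)
    return result


orphans = lowercase_without_partner()
print(len(orphans))

# Show a few of them by name; "?" is a fallback for unnamed code points.
print([unicodedata.name(ch, "?") for ch in orphans[:5]])
```

U+0138 LATIN SMALL LETTER KRA is one such character: it is classified as `Ll` but has no uppercase mapping.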

John Machin 2010-08-19 06:07:46

@John: Well, I'm actually making an API for my web service. My web service accepts a key that maps to a specific record in my database. The key is case-sensitive, and the key can be composed of any Unicode character. So in order to normalize all input, I will convert all key queries into lowercase (if they have uppercase equivalents). A consequence of that is when I create the record keys (which my users can customize), I cannot accept any uppercase character that can be converted to a lowercase equivalent by the toLower() function. So I'm trying to make a filter for that. Any suggestions?

Albert 2010-08-20 12:54:35

@Albert: If your keys are case sensitive, why are you normalising them??? "record keys which users can customize" means what??? "any unicode char" vs "cannot accept any uppercase char" ??? To answer your question literally: Looks like you can't accept a character c when `c.lower() != c` which means that you can't accept any key if `key.lower() != key`. I think that you should start a NEW QUESTION, explaining exactly what you are trying to do, with examples. BTW1: don't forget to accept an answer to this question first. BTW2: Python doesn't have a `toLower` function ...
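Taken literally, the `key.lower() != key` rule could be sketched like this, assuming Python 3. The function name `is_acceptable_key` is made up for illustration, and the actual key rules for the web service are not specified in this thread.

```python
def is_acceptable_key(key):
    """Accept a key only if lowercasing it would leave it unchanged."""
    return key.lower() == key


print(is_acceptable_key("abc-123"))       # already lowercase
print(is_acceptable_key("Abc"))           # contains an uppercase letter
print(is_acceptable_key("\u4e2d\u6587"))  # CJK text, unaffected by lower()
```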

John Machin 2010-08-20 22:29:52

@John: My mistake. I meant lower() function. Alright, I'll start a new question. Thanks!

Albert 2010-08-21 00:35:06

@John: I respect your expertise in Unicode. I have a new question. Do you think you could take a look at it, and also at the answers, to see whether they are correct? Thanks! http://stackoverflow.com/questions/3536397/does-python-version-2-5-2-follow-unicode-standards-for-lower-and-upper-functi

Albert 2010-08-21 05:04:27
