Hi!

I'm writing a filter that checks whether a Unicode string (UTF-8 encoded) contains no uppercase characters, in any language. It's fine with me if the string doesn't contain any cased characters at all.

For example: 'Hello!' will not pass the filter, but "!" should pass the filter, since "!" is not a cased character.

I planned to use the islower() method, but in the example above, "!".islower() will return False.

According to the Python Docs, "The python unicode method islower() returns True if the unicode string's cased characters are all lowercase and the string contained at least one cased character, otherwise, it returns False."
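A quick illustration of that documented behaviour, written as a sketch in Python 3 syntax (an assumption; the rest of this thread uses Python 2, where the literals would need a `u` prefix). The sample strings are only examples:

```python
# str.islower() needs at least one cased character to return True.
samples = ["hello!", "Hello!", "!", "123"]

results = {s: s.islower() for s in samples}
for s in samples:
    print(repr(s), results[s])
# 'hello!' True
# 'Hello!' False
# '!' False
# '123' False
```

The last two lines are the case that causes the problem: no uppercase characters are present, yet the result is False.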

Since the method also returns False when the string doesn't contain any cased character, e.g. "!", I want to check first whether the string contains any cased character at all.

Something like this....

string = unicode("!@#$%^", 'utf-8')

# wrapped in a function so that the return statements are valid
def passes_filter(string):
    # check first if it contains cased characters
    if not contains_cased(string):
        return True
    return string.islower()

Any suggestions for a contains_cased() function?

Or probably a different implementation approach?

Thanks!

+5  A: 
import unicodedata as ud

def contains_cased(u):
  return any(ud.category(c)[0] == 'L' for c in u)
Alex Martelli
Argh, Alex, is there something you don't know?
e-satis
-1 Treats East Asian characters as "cased". See my answer.
John Machin
+1  A: 

Use the unicodedata module:

unicodedata.category(character)

returns "Ll" for lowercase letters and "Lu" for uppercase ones.
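For illustration, a minimal sketch of that call on a few sample characters (Python 3 syntax assumed; the sample characters are arbitrary):

```python
import unicodedata

# Map each sample character to its two-letter Unicode general category.
cats = {ch: unicodedata.category(ch) for ch in "aA!1"}

for ch, cat in cats.items():
    print(repr(ch), cat)
# 'a' Ll   (lowercase letter)
# 'A' Lu   (uppercase letter)
# '!' Po   (other punctuation)
# '1' Nd   (decimal digit)
```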

Here you can find a list of Unicode character categories.

mykhal
+1  A: 

Here is the full scoop on Unicode character categories.

Letter categories include:

Ll -- lowercase
Lu -- uppercase
Lt -- titlecase
Lm -- modifier
Lo -- other

Note that Ll <-> islower(); similarly for Lu; (Lu or Lt) <-> istitle()

You may wish to read the complicated discussion on casing, which includes some discussion of Lm letters.

Blindly treating all "letters" as cased is demonstrably wrong. The Lo category includes 45301 codepoints in the BMP (counted using Python 2.6). A large chunk of these would be Hangul Syllables, CJK Ideographs, and other East Asian characters -- very hard to understand how they might be considered "cased".
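To make the point concrete, here is a small sketch (Python 3 syntax assumed; `contains_letter` is a hypothetical name for a test that accepts any "L" category). The three sample characters are a CJK ideograph, a Hangul syllable and a Hiragana letter:

```python
import unicodedata

def contains_letter(u):
    # Category-based test: any character whose category starts with "L" counts.
    return any(unicodedata.category(c)[0] == "L" for c in u)

text = "\u4e2d\ud55c\u3042"  # CJK ideograph, Hangul syllable, Hiragana letter

categories = [unicodedata.category(c) for c in text]
print(categories)             # ['Lo', 'Lo', 'Lo']
print(contains_letter(text))  # True, although none of these has a case
print(text.upper() == text and text.lower() == text)  # True: case mapping changes nothing
```

A filter that then falls back to islower() would reject this string, even though it contains no uppercase characters.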

You might like to consider an alternative definition, based on the (unspecified) behaviour of "cased characters" that you expect. Here's a simple first attempt:

>>> cased = lambda c: c.upper() != c or c.lower() != c
>>> sum(cased(unichr(i)) for i in xrange(65536))
1970
>>>

Interestingly there are 1216 x Ll and 937 x Lu, a total of 2153 ... scope for further investigation of what Ll and Lu really mean.
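Building on that first-attempt definition, one possible sketch of contains_cased() and the surrounding filter follows. It assumes Python 3 syntax, and `is_cased` and `has_no_uppercase` are hypothetical names; whether "has an uppercase or lowercase partner" is the right definition of "cased" is still for you to decide.

```python
def is_cased(c):
    # "Cased" here means: upper() or lower() maps the character to something else.
    return c.upper() != c or c.lower() != c

def contains_cased(u):
    return any(is_cased(c) for c in u)

def has_no_uppercase(u):
    # Strings with no cased characters pass; otherwise defer to islower().
    return not contains_cased(u) or u.islower()

print(has_no_uppercase("!"))       # True
print(has_no_uppercase("hello!"))  # True
print(has_no_uppercase("Hello!"))  # False
print(has_no_uppercase("\u4e2d"))  # True (CJK ideograph, not cased under this definition)
```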

John Machin
@John: Wow. Thanks for your explanation. It took me a while to understand it. I took a look at your link, and I think I have to study it more extensively. I have a feeling that what I'm going to find out is going to make me overhaul a lot of my code. Yikes. Thanks!
Albert
@Albert: Don't panic. As I've hinted, firstly develop a definition of what you mean by "cased". What different treatment will you apply to cased chars as opposed to uncased chars? My example definition was "char which has an uppercase or lowercase 'partner'". Some (maybe all) of the difference between the 1970 chars and the 2153 appears to be due to chars which are classified as `Ll` because they look like a lowercase character, but don't have a `Lu` partner, and vice versa -- you need to decide whether these are "cased" for your purposes. BTW you can change your accepted answer :-)
John Machin
@John: Well, I'm actually making an API for my web service. My web service accepts a key that maps to a specific record in my database. The key is case-sensitive, and the key can be composed of any Unicode character. So in order to normalize all input, I will convert all key queries into lowercase (if they have uppercase equivalents). A consequence of that is that when I create the record keys (which my users can customize), I cannot accept any uppercase character that can be converted to a lowercase equivalent by the toLower() function. So I'm trying to make a filter for that. Any suggestions?
Albert
@Albert: If your keys are case sensitive, why are you normalising them??? "record keys which users can customize" means what??? "any unicode char" vs "cannot accept any uppercase char" ??? To answer your question literally: Looks like you can't accept a character c when `c.lower() != c` which means that you can't accept any key if `key.lower() != key`. I think that you should start a NEW QUESTION, explaining exactly what you are trying to do, with examples. BTW1: don't forget to accept an answer to this question first. BTW2: Python doesn't have a `toLower` function ...
John Machin
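The literal answer in the comment above can be sketched as follows (Python 3 syntax assumed; `is_acceptable_key` and `normalize_query` are hypothetical names, not part of any API mentioned in this thread):

```python
def is_acceptable_key(key):
    # A key is rejected if lowercasing would change it.
    return key.lower() == key

def normalize_query(query):
    # Incoming queries are lowercased before the lookup.
    return query.lower()

print(is_acceptable_key("caf\u00e9!"))  # True
print(is_acceptable_key("Caf\u00e9"))   # False
print(is_acceptable_key("!@#"))         # True
print(normalize_query("ABC"))           # abc
```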
@John: My mistake. I meant the lower() function. Alright, I'll start a new question. Thanks!
Albert
@John: I respect your expertise in unicode. I have a new question, do you think you can take a look at it, and also at the answers, if they are correct. Thanks! http://stackoverflow.com/questions/3536397/does-python-version-2-5-2-follow-unicode-standards-for-lower-and-upper-functi
Albert