In the "string" module of the standard library,

string.ascii_letters ## Same as string.ascii_lowercase + string.ascii_uppercase

is

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

Is there a similar constant that would include everything that is considered a letter in Unicode?

A: 

No, there isn't. Such a constant would have to be very large, given the sheer number of characters that could be considered letters. Then you have the problem of who gets to decide which symbols are letters, which is not a straightforward question for many of the world's writing systems.

The Unicode standard explicitly defines which symbols are letters by assigning each character a "category".
Max Shawabkeh
+1  A: 

That would be a pretty massive constant. Unicode currently covers more than 100,000 characters, so the answer is no.

The question is why you would need it. Whatever your actual problem is, there may be another way to solve it, for example with the unicodedata module.

Update: You can download files with all Unicode code point names and other information from ftp://ftp.unicode.org/, and do loads of interesting stuff with that.
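As an illustration (a sketch in Python 3 syntax, where all strings are Unicode), the unicodedata module already exposes the names and categories from those data files directly, so you rarely need to parse the files yourself:

```python
import unicodedata

# Every assigned code point has an official name and a category
# in the Unicode character database bundled with Python.
print(unicodedata.name('ф'))                           # CYRILLIC SMALL LETTER EF
print(unicodedata.lookup('CYRILLIC SMALL LETTER EF'))  # ф
print(unicodedata.category('ф'))                       # Ll
```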

Lennart Regebro
+7  A: 

There's no such string constant, but you can check whether a character is a letter using the unicodedata module, in particular its category() function.

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'A')
'Lu'
>>> unicodedata.category(u'5')
'Nd'
>>> unicodedata.category(u'ф') # Cyrillic f.
'Ll'
>>> unicodedata.category(u'٢') # Arabic-indic numeral for 2.
'Nd'

Ll means "Letter, lowercase", Lu means "Letter, uppercase", and Nd means "Number, decimal digit".
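The first character of the category code is the major class, so a test for "any kind of letter" only needs to check that prefix; this also catches Lt, Lm and Lo (titlecase, modifier and other letters). A minimal sketch in Python 3 syntax:

```python
import unicodedata

text = 'a5ф٢A'
# Keep only characters whose category starts with 'L' (any kind of letter);
# digits such as '5' (Nd) and '٢' (Nd) are dropped.
letters = [c for c in text if unicodedata.category(c).startswith('L')]
print(letters)  # ['a', 'ф', 'A']
```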

Max Shawabkeh
Just to make the answer complete, here is a list of all Unicode categories: http://www.fileformat.info/info/unicode/category/index.htm
Lukáš Lalinský
A: 

As mentioned in the other answers, the string would indeed be far too long, so you have to target one or more specific languages.
[EDIT: I realized this was the case for my original intended use, and probably for most uses. However, in the meantime, Mark Tolonen gave a good answer to the question as it was asked, so I chose his answer, although I used the following solution]

This is easily done with the "locale" module:

import locale
import string
code = 'fr_FR' ## Do NOT specify encoding (see below)
locale.setlocale(locale.LC_CTYPE, code)
encoding = locale.getlocale()[1]
letters = string.letters.decode(encoding)

with "letters" being a 117-character-long unicode string.

Apparently, string.letters depends on the default encoding for the selected language code rather than on the language itself. Setting the locale to fr_FR, de_DE or es_ES yields the same value for string.letters, since all three default to the ISO8859-1 encoding.

If you add an encoding to the language code (e.g. de_DE.UTF-8), that encoding is used for string.letters instead of the default, and the rest of the code above would then raise a UnicodeDecodeError.

emm
+3  A: 

You can construct your own constant of Unicode upper and lower case letters with:

import unicodedata as ud
all_unicode = ''.join(unichr(i) for i in xrange(65536))
unicode_letters = ''.join(c for c in all_unicode
                          if ud.category(c) in ('Lu', 'Ll'))

This makes a string 2153 characters long (on a narrow-Unicode Python build). For membership tests such as letter in unicode_letters, it would be faster to use a set instead:

unicode_letters = set(unicode_letters)
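Under Python 3, unichr and xrange are gone and chr covers the full code point range, so a rough equivalent of the same idea (a sketch; the exact letter count depends on the Unicode version your Python ships with) would be:

```python
import unicodedata as ud

# Python 3 sketch: scan every code point (all planes, not just the BMP)
# and keep the upper- and lowercase letters. Unassigned code points and
# surrogates fall into non-letter categories, so they are skipped naturally.
unicode_letters = set(
    chr(i) for i in range(0x110000)
    if ud.category(chr(i)) in ('Lu', 'Ll')
)
```

Membership tests like 'ф' in unicode_letters are then constant-time.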
Mark Tolonen
Good answer to the question as I asked it. However, I found another solution which was better suited to my needs (see my own answer below)
emm