I have a Python Unicode string. I want to make sure it only contains letters from the Roman alphabet (A through Z), as well as letters commonly found in European alphabets, such as ß, ü, ø, é, à, and î. It should not contain characters from other alphabets (Chinese, Japanese, Korean, Arabic, Cyrillic, Hebrew, etc.). What's the best way to go about doing this?

Currently I am using this bit of code, but I don't know if it's the best way:

def only_roman_chars(s):
    try:
        s.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:  # encoding a unicode string raises UnicodeEncodeError, not UnicodeDecodeError
        return False

(I am using Python 2.5. I am also doing this in Django, so if the Django framework happens to have a way to handle such strings, I can use that functionality -- I haven't come across anything like that, however.)

A: 

Check the code in `django.template.defaultfilters.slugify`:

import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')

This is what you are looking for; you can then compare the resulting string with the original.
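A minimal sketch of that comparison (Python 3 syntax; on the OP's Python 2.5, use `u''` literals). Note one limitation: letters with no ASCII decomposition, such as ß, are dropped entirely and so get flagged as non-Roman:

```python
import unicodedata

def only_roman_chars(value):
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops the marks, and drops
    # characters from non-Latin scripts entirely.
    stripped = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
    # If only combining marks were lost, the lengths still match.
    return len(stripped) == len(value)
```

This treats each accented Latin letter as one surviving base letter, while a Cyrillic or Greek letter contributes nothing to `stripped`, so the length comparison catches it.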

Claude Vedovini
I don't want to turn everything lowercase or convert spaces to dashes, just check to see if a string has the unwanted characters. I changed the question to avoid using the word "filter".
mipadi
A: 

You can use `unicodedata` to check a character's category:

import unicodedata
unicodedata.category(u'a') # returns 'Ll'
unicodedata.category(u'א') # returns 'Lo'
Ofri Raviv
That doesn't tell you what script it's from.
dan04
@Ofri Raviv: WRONG. A lower-case Cyrillic letter will return 'Ll' -- FAIL. The difference that you observed is caused by the Hebrew script being caseless. Case has no relevance to the OP's problem.
John Machin
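To illustrate the point above (a quick interactive check, Python 3 syntax): the category is 'Ll' for lowercase letters of any cased script, while the character's Unicode *name* does reveal the script:

```python
import unicodedata

# Lowercase 'a'-like letters in three scripts: Latin, Cyrillic, Greek.
for ch in ('a', '\u0430', '\u03b1'):
    print(unicodedata.category(ch), unicodedata.name(ch))
# All three are category 'Ll'; only the names differ:
# Ll LATIN SMALL LETTER A
# Ll CYRILLIC SMALL LETTER A
# Ll GREEK SMALL LETTER ALPHA
```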
+1  A: 

For what you say you want to do, your approach is about right. If you are running on Windows, I'd suggest using cp1252 instead of iso-8859-1. You might also allow cp1250, which would cover eastern European countries where the alphabet is Latin-based: Poland, the Czech Republic, Slovakia, Romania, Slovenia, Hungary, Croatia, etc. Other cp125x code pages would cover Turkish and Maltese ...
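A sketch of that check (the helper name is made up for illustration): 'œ' (U+0153) and '€' (U+20AC) are in cp1252 but not in iso-8859-1, so the choice of code page matters:

```python
def encodable(s, encoding):
    # True if every character in s exists in the target code page.
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(encodable('œuvre: €5', 'iso-8859-1'))  # False
print(encodable('œuvre: €5', 'cp1252'))      # True
```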

You may also like to consider transliteration from Cyrillic to Latin; as far as I know there are several systems, one of which may be endorsed by the UPU (Universal Postal Union).

I'm a little intrigued by your comment "Our shipping department doesn't want to have to fill out labels with, e.g., Chinese addresses" ... three questions: (1) do you mean "addresses in country X" or "addresses written in X-ese characters" (2) wouldn't it be better for your system to print the labels? (3) how does the order get shipped if it fails your test?

John Machin
(1) The latter (addresses written in X-ese characters). (2) Perhaps. Right now, it doesn't. The form is part of a web app; the data gets shunted to another system entirely, which handles the management of orders, etc. (3) The form fails validation and prompts the user to enter an appropriate address.
mipadi
Stephen P
@StephenP: The OP already has Unicode strings; I was suggesting that he consider the strong possibility that the characters he needs to watch out for could be found in the cp125x character sets; Windows *USERS* incorrigibly have data encoded in cp125x. This is a fact of life. The ancient ISO-8859-x encodings, although sanctified by standards, are even more limited and should be avoided in code; use UTF-8, UTF-16, or GB18030. If one has Unicode data with code points 0080 to 009F, probability(C1 controls) == 0.1%, prob(cp125x-encoded data decoded as latin1) == 99.9%
John Machin
A: 

Checking for ISO-8859-1 would miss reasonable Western characters like 'œ' and '€'. The solution depends on how you define "Western", and how you want to handle non-letters. Here's one approach:

import unicodedata

def is_permitted_char(char):
    cat = unicodedata.category(char)[0]
    if cat == 'L': # Letter
        return 'LATIN' in unicodedata.name(char, '').split()
    elif cat == 'N': # Number
        # Only DIGIT ZERO - DIGIT NINE are allowed
        return '0' <= char <= '9'
    elif cat in ('S', 'P', 'Z'): # Symbol, Punctuation, or Space
        return True
    else:
        return False

def is_valid(text):
    return all(is_permitted_char(c) for c in text)
dan04
(1) `return unicodedata.name(char, '').startswith('LATIN ')` should suffice (2) memoising the function results might be a good idea, which could be made better by preloading the usual suspects [-A-Za-z0-9,./ '] etc into the memo (3) symbol/punctuation is rather wide (4) should category Space be replaced by '\x20'?
John Machin
A: 

Maybe this will do if you're a Django user?

from django.template.defaultfilters import slugify

def justroman(s):
    return len(slugify(s)) == len(s)
+2  A: 
import unicodedata as ud

latin_letters= {}

def is_latin(uchr):
    try:
        return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    return all(is_latin(uchr)
           for uchr in unistr
           if uchr.isalpha()) # isalpha suggested by John Machin

>>> only_roman_chars(u"ελληνικά means greek")
False
>>> only_roman_chars(u"frappé")
True
>>> only_roman_chars(u"hôtel lœwe")
True
>>> only_roman_chars(u"123 ångstrom ð áß")
True
>>> only_roman_chars(u"russian: гага")
False
ΤΖΩΤΖΙΟΥ
Consider `uchr.isalpha()` instead of `unicodedata.category(uchr).startswith('L')`. Consider using a set constructed at module load time: `okletters = set(unichr(i) for i in xrange(sys.maxunicode+1) if unicodedata.name(unichr(i), "").startswith('LATIN '))` i.e. use `uchr in okletters` instead of `'LATIN' in unicodedata.name(uchr)`
John Machin
@John: uchr.isalpha is a better suggestion, thank you; I will update my answer. For the optimization suggestion, I'd go with a *memoized*-style function.
ΤΖΩΤΖΙΟΥ
For the `is_latin` function, a subclass of `defaultdict` appropriately overriding `__missing__` would also be a nice solution.
ΤΖΩΤΖΙΟΥ
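A sketch of that last idea: since `defaultdict`'s factory takes no arguments, a plain `dict` subclass overriding `__missing__` (which *does* receive the key) is the natural fit here:

```python
import unicodedata

class LatinCache(dict):
    # On a cache miss, __missing__ is called with the key; it computes
    # the answer once, stores it, and returns it.
    def __missing__(self, uchr):
        result = 'LATIN' in unicodedata.name(uchr, '')
        self[uchr] = result
        return result

latin_letters = LatinCache()

def only_roman_chars(unistr):
    return all(latin_letters[uchr] for uchr in unistr if uchr.isalpha())
```

After the first lookup of a given character, subsequent lookups are plain dict hits, which is the same memoisation the `setdefault` version achieves, just expressed declaratively.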