In ASCII, validating a name isn't too difficult: just make sure all the characters are alphabetical.

But what about Unicode (UTF-8)? How can I make sure a given string contains no commas or underscores, including those outside the ASCII range?

(ideally in Python)

A: 

Something like:

def is_asciibetical(s):
    # True only if every character in s can be encoded as ASCII.
    try:
        s.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

might do what you're looking for.

Daniel Watkins
Try is_asciibetical("!"). It returns True, so this is not helpful for the OP.
ΤΖΩΤΖΙΟΥ
+1  A: 

Depending on how you define "name", you could go with checking it against this regex:

^\w+$

However, this will allow numbers and underscores. To rule them out, you can do a second test against:

[\d_]

and make your check fail on match. These two could be combined as follows:

^(?:(?![\d_])\w)+$

But for regex performance reasons, I would rather do two separate checks.

From the docs:

\w

When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
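
The two separate checks described above could be sketched roughly as follows. This is only an illustration: the function name looks_like_name and the two pattern names are made up here, and the re.UNICODE flag is passed explicitly so that \w and \d use the Unicode character database.

```python
import re

# The first pattern from the answer: one or more "word" characters.
WORD_ONLY = re.compile(r"^\w+$", re.UNICODE)
# The second pattern from the answer: any digit or underscore.
DIGIT_OR_UNDERSCORE = re.compile(r"[\d_]", re.UNICODE)


def looks_like_name(text):
    """Apply the two regex checks in sequence (hypothetical helper)."""
    # Check 1: the string must consist of word characters.
    if not WORD_ONLY.match(text):
        return False
    # Check 2: fail if a digit or underscore appears anywhere.
    return DIGIT_OR_UNDERSCORE.search(text) is None
```

Note that spaces, hyphens and apostrophes are not word characters, so a string such as "Jean-Luc" is rejected by this check.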

Tomalak
A: 

The letters attribute of the string module (string.letters) should give you what you want. It is locale-specific, so as long as you know the language of the text being passed to you, you can call locale.setlocale() and validate against those characters.

http://docs.python.org/library/string.html#module-string

As you point out, though, in a truly "unicode" world, there's no way at all to know what characters are "alphabetical" unless you know the language. If you don't know the language, you could either default to ASCII, or run through the locales for common languages.

Jarret Hardie
You mean, "there is no *known way to me* to know what characters…". unicodedata.category(u"\u0393") tells you it's an uppercase letter.
ΤΖΩΤΖΙΟΥ
While the general category property in the Unicode Character Database is often good for determining this primary characteristic, many characters have multiple uses depending on language and context; not all cases are covered by unicodedata.category()
Jarret Hardie
See http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values and http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf
Jarret Hardie
But you are right... after all, gotta pick a categorization from somewhere.
Jarret Hardie
+5  A: 

The unicodedata module may be useful for this task, especially its category() function. For the list of existing Unicode categories, see unicode.org. You can then filter out punctuation characters and so on.
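
As a rough sketch of that idea, the snippet below lists every character whose general category does not start with "L" (letter). The helper name non_letter_chars is hypothetical, and treating only the "L" categories as acceptable is an assumption you would adjust to your own definition of a name.

```python
import unicodedata


def non_letter_chars(text):
    """Return (character, category) pairs for each non-letter character."""
    return [
        (ch, unicodedata.category(ch))
        for ch in text
        # General categories Lu, Ll, Lt, Lm and Lo all start with "L".
        if not unicodedata.category(ch).startswith("L")
    ]
```

For example, non_letter_chars(u"a,b_c") reports the comma as category Po and the underscore as Pc, and the fullwidth comma U+FF0C and fullwidth low line U+FF3F fall into the same two categories.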

unbeknown
+5  A: 

Just convert the bytestring (your UTF-8 data) to a unicode object and check whether all characters are alphabetic:

s.isalpha()

This method is locale-dependent for bytestrings.
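
A minimal sketch of the decode-then-check step, assuming Python 3, where bytes.decode() returns a str (the unicode object of Python 2). The function name is_alphabetic_utf8 is made up for illustration.

```python
def is_alphabetic_utf8(raw):
    """Decode UTF-8 bytes and report whether every character is a letter."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so it cannot be a valid name.
        return False
    # str.isalpha() is False for the empty string and for any
    # non-letter character such as a space, comma or underscore.
    return text.isalpha()
```

Because isalpha() also rejects spaces and hyphens, this only suits single-word names unless you split the input first.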

zgoda
+1  A: 

This might be a step towards a solution:

import unicodedata

# Characters allowed regardless of category, e.g. O'Rourke, Franklin D. Roosevelt
EXCEPTIONS = frozenset(u"'.")
# Letters (upper, lower, title case), dash punctuation and space separators
CATEGORIES = frozenset(('Lu', 'Ll', 'Lt', 'Pd', 'Zs'))

def test_unicode_name(unicode_name):
    return all(
        uchar in EXCEPTIONS
        or unicodedata.category(uchar) in CATEGORIES
        for uchar in unicode_name)

>>> test_unicode_name(u"Michael O'Rourke")
True
>>> test_unicode_name(u"Χρήστος Γεωργίου")
True
>>> test_unicode_name(u"Jean-Luc Géraud")
True

Add exceptions and any further checks that I may have missed.
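
One further check that may be worth considering: an accented letter can arrive in decomposed form (a base letter followed by a combining mark of category Mn), which the category test above would reject. The sketch below, which assumes NFC normalization is acceptable for your data and uses the made-up name check_normalized_name, composes the string first and then applies the same category test.

```python
import unicodedata

EXCEPTIONS = frozenset("'.")
CATEGORIES = frozenset(("Lu", "Ll", "Lt", "Pd", "Zs"))


def check_normalized_name(name):
    """Compose the name to NFC, then apply the category whitelist."""
    # NFC merges a base letter and its combining marks into a single
    # precomposed character where one exists, e.g. "e" + U+0301 -> U+00E9.
    composed = unicodedata.normalize("NFC", name)
    return all(
        ch in EXCEPTIONS or unicodedata.category(ch) in CATEGORIES
        for ch in composed
    )
```

Combinations that have no precomposed form remain as separate combining marks after normalization and are still rejected by this sketch.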

ΤΖΩΤΖΙΟΥ