Hello,

For an internationalised project, I have to validate the overall syntax of a name (first, last) in Python. But the lack of support for Unicode character classes is really making things harder.

Is there a regex or library that can do that?

Examples:

Björn, Anne-Charlotte, توماس, 毛, or מיק must be accepted. -Björn, Anne--Charlotte, Tom_ or entries like that should be rejected.

Is there a simple way to do that?

Thanks.

+5  A: 

Python does support Unicode in regular expressions if you specify the re.UNICODE flag (in Python 3, str patterns are Unicode-aware by default). You can probably use something like this:

r'^[^\W_]+(-[^\W_]+)?$'

Test code:

# -*- coding: utf-8 -*-
import re

names = [
            u'Björn',
            u'Anne-Charlotte',
            u'توماس',
            u'毛',
            u'מיק',
            u'-Björn',
            u'Anne--Charlotte',
            u'Tom_',
        ]

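# Compile once, outside the loop. re.U makes \W Unicode-aware on Python 2.
regex = re.compile(r'^[^\W_]+(-[^\W_]+)?$', re.U)

for name in names:
    # print(...) with a single argument behaves the same on Python 2 and 3.
    print(u'{0:20} {1}'.format(name, regex.match(name) is not None))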

Result:

Björn                True
Anne-Charlotte       True
توماس                True
毛                    True
מיק                  True
-Björn               False
Anne--Charlotte      False
Tom_                 False

If you also want to disallow digits in names then change [^\W_] to [^\W\d_] in both places.
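
As a rough sketch of that digit-free variant, assuming Python 3 (where str patterns are Unicode-aware without re.U); the helper name is_valid_name is made up for illustration:

```python
import re

# Assumption: Python 3, so str patterns are Unicode-aware by default.
# [^\W\d_] matches any word character that is neither a digit nor an underscore.
NAME_RE = re.compile(r'^[^\W\d_]+(-[^\W\d_]+)?$')


def is_valid_name(name):
    """Hypothetical helper: letters, optionally followed by one hyphenated part."""
    return NAME_RE.match(name) is not None


for candidate in ['Anne-Charlotte', 'Bj0rn', 'Tom_', 'Anne--Charlotte']:
    print('{0:20} {1}'.format(candidate, is_valid_name(candidate)))
```

With this pattern a name such as Bj0rn should be rejected, while it would pass the original [^\W_] version.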

Mark Byers
You might want to add a space to the allowed characters though.
poke
Modified to `^[^\W0-9_]+([ \-'‧][^\W0-9_]+)*?$`, to support most names. Will be tested as extensively as possible. Thanks a lot =)
Pierre
@Pierre: Use `\Z`, not `$`, otherwise "Fred\n" will be regarded as valid. Perhaps you are assuming that the input has already been sanitised to the extent of stripping leading and trailing whitespace and replacing all internal runs of whitespace by a single space. `\d` as suggested by Mark is NOT the same as `0-9` ... is your change deliberate?
John Machin
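
A small sketch of the two points in the comment above, assuming Python 3; the simplified single-word patterns and variable names are only for illustration:

```python
import re

# Assumption: Python 3, so str patterns are Unicode-aware by default.
with_dollar = re.compile(r'^[^\W\d_]+$')
with_z = re.compile(r'^[^\W\d_]+\Z')

# `$` also matches just before a trailing newline; `\Z` matches only at the very end.
dollar_accepts_newline = with_dollar.match('Fred\n') is not None
z_accepts_newline = with_z.match('Fred\n') is not None
print(dollar_accepts_newline)  # True
print(z_accepts_newline)       # False

# `\d` covers every Unicode decimal digit; `0-9` covers only the ASCII digits.
arabic_indic_three = '\u0663'
print(re.match(r'\d', arabic_indic_three) is not None)      # True
print(re.match(r'[0-9]', arabic_indic_three) is not None)   # False

# So excluding only 0-9 still lets non-ASCII digits through as word characters.
ascii_only_accepts = re.match(r'^[^\W0-9_]+\Z', 'Tom' + arabic_indic_three) is not None
unicode_digits_accepts = re.match(r'^[^\W\d_]+\Z', 'Tom' + arabic_indic_three) is not None
print(ascii_only_accepts)      # True
print(unicode_digits_accepts)  # False
```

Whether non-ASCII digits should be allowed in a name is a design decision; the sketch only shows that the two spellings behave differently.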