I have a list of simple names such as Márquez.

Because of the á, the pattern (?<name>[a-zA-Z]+) doesn't seem to work!

Help would be very much appreciated!

+1  A: 

This regex guide knows all about it, take a look. : )

http://www.regular-expressions.info/unicode.html

rlb.usa
I don't see a word about Python in that link.
SilentGhost
Regular expression usage differs between implementations. An article that says nothing about Python's regexp implementation won't give reliable information about using regexps in Python.
Jacek Konieczny
omg! The link explains some Unicode problems I had in the past... thanks
RadiantHex
+3  A: 

You could use

\w+

with the re.UNICODE (re.U) flag. Since \w also matches digits and underscores, I assume there's no risk of having those in your names.

>>> import re
>>> re.findall(r'\w+', 'Márquez', re.U)
['Márquez']

You also seem to be missing the P after the question mark; Python's syntax for a named group is (?P<name>[a-zA-Z]+)
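
Putting both points together, here is a minimal sketch, assuming Python 3, where str patterns are Unicode-aware by default. The helper name find_names is made up for illustration, and the class [^\W\d_] is one common way to say "letters only"; it is not something the question asked for.

```python
import re

# Python 3: str patterns are Unicode-aware by default, so \w matches "á".
# [^\W\d_] means "a word character that is neither a digit nor an
# underscore", i.e. letters only.
NAME_RE = re.compile(r"(?P<name>[^\W\d_]+)")

def find_names(text):
    """Return every run of letters found in text (hypothetical helper)."""
    return [m.group("name") for m in NAME_RE.finditer(text)]

print(find_names("Márquez, García"))  # ['Márquez', 'García']
```

Note that this assumes precomposed characters; a name stored as "a" plus a combining accent would probably need normalizing first.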

SilentGhost
This doesn't look good for Python < 3.0 – I don't like the idea of using the UNICODE option with regular (non-unicode) strings.
Jacek Konieczny
@Jacek: why is your non-ASCII string not unicode?
SilentGhost
In Python 2.x the str type and a "xxxx" literal hold just a string of bytes. These bytes may represent characters in a specific encoding, but without the encoding information they are just bytes. 'Márquez' is (77, 195, 161, 114, 113, 117, 101, 122) (8 bytes) if your encoding is UTF-8, or (77, 225, 114, 113, 117, 101, 122) (7 bytes) if the encoding is ISO 8859-1. And those same 7 bytes would be "Mαrquez" when decoded as ISO 8859-7. The same bytes, different characters. Without knowing the encoding we cannot talk about Unicode characters.
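
The byte values above can be checked with a short sketch. It assumes Python 3, where the bytes type plays the role that str played in Python 2.x; the variable names are only for illustration.

```python
name = "Márquez"

# The same text produces different byte sequences under different encodings.
utf8_bytes = name.encode("utf-8")
latin1_bytes = name.encode("iso-8859-1")

print(list(utf8_bytes))    # [77, 195, 161, 114, 113, 117, 101, 122]
print(list(latin1_bytes))  # [77, 225, 114, 113, 117, 101, 122]

# The same bytes read with a different encoding give different characters.
greek = latin1_bytes.decode("iso-8859-7")
print(greek)               # Mαrquez
```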
Jacek Konieczny
@SilentGhost: thanks for the great reply. But what if I wanted to include spaces, '.'s and ','s ?
RadiantHex
A: 

For Python < 3 you may want to enable locale:

import locale
locale.setlocale(locale.LC_ALL, '')

And then use the re.LOCALE option with your regexes:

re.findall(r'\w+', 'Márquez', re.LOCALE)

Unicode is probably the better way to go, though it requires decoding the data from your local encoding before matching and encoding it back afterwards.
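
A rough sketch of that decode, match, encode round trip, written for Python 3 rather than 2.x. The function name find_words and the default encoding are assumptions made for illustration; use whatever encoding your data is actually in.

```python
import re

def find_words(raw, encoding="utf-8"):
    """Decode raw bytes, match on text, return the matches re-encoded.

    Hypothetical helper: `encoding` must be the encoding the bytes
    were really written in, otherwise the matches will be wrong.
    """
    text = raw.decode(encoding)                  # bytes -> Unicode text
    words = re.findall(r"\w+", text, re.UNICODE)  # match on characters
    return [w.encode(encoding) for w in words]   # text -> bytes again

raw = "Márquez".encode("utf-8")
print(find_words(raw))  # [b'M\xc3\xa1rquez']
```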

Jacek Konieczny