Usually, the best practice in Python when working with international text is to use unicode: convert any input to unicode as early as possible, and encode to a byte string (UTF-8 most of the time) as late as possible.
But when I need to run a regex on unicode, I don't find the process very friendly. For example, if I need to find the 'é' character followed by one or more spaces, I have to write (note: my shell and Python files are set to UTF-8):
re.match('(?u)\xe9\s+', unicode)
So I have to write the unicode code point of 'é'. That's not really convenient, and if I need to build the regex from a variable, things start to get ugly. Example:
word_to_match = 'Élisa™'.decode('utf-8')  # returns a unicode object
regex = '(?u)%s\s+' % word_to_match
re.match(regex, unicode)
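When the variable part can contain regex metacharacters, escaping it before building the pattern keeps it literal. The following is only a minimal sketch, assuming the text is already unicode; `text` is a made-up sample value, and `u''` literals are used so the snippet should run on both Python 2 and Python 3.3+:

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample values; in Python 2 they would typically come
# from decoding UTF-8 input, e.g. some_bytes.decode('utf-8').
word_to_match = u'Élisa™'
text = u'Élisa™   was here'

# re.escape neutralises any regex metacharacters in the variable part,
# and re.UNICODE makes \s follow the Unicode definition of whitespace.
pattern = re.compile(re.escape(word_to_match) + u'\\s+', re.UNICODE)

match = pattern.match(text)
matched = match.group(0) if match else None
```

Compiling the pattern once also helps when the same word has to be matched against many strings.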
And that is only a simple case. When there are many regexes containing special characters to run one after another, I find it easier and more natural to run them on a string encoded in UTF-8. Example:
re.match('Élisa\s+', string)
re.match('Geneviève\s+', string)
re.match('DrØshtit\s+', string)
Is there something I'm missing? What are the drawbacks of the UTF-8 approach?
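One difference between the two approaches that is easy to check: on a UTF-8 byte string the regex engine works byte by byte, so `.` and `\w` no longer correspond to whole characters. A small sketch, with `text` as a made-up sample value, written to run on both Python 2 and Python 3.3+:

```python
# -*- coding: utf-8 -*-
import re

text = u'Élisa  '                 # hypothetical sample value
utf8_bytes = text.encode('utf-8')

# On unicode text, \w with re.UNICODE covers accented letters.
unicode_match = re.match(u'\\w+', text, re.UNICODE)

# On UTF-8 bytes, \w only knows ASCII word characters, so the
# two bytes that encode 'É' are not matched.
bytes_match = re.match(b'\\w+', utf8_bytes)

# '.' consumes one code point on unicode, but only one byte on UTF-8 data.
one_char = re.match(u'.', text, re.UNICODE).group(0)
one_byte = re.match(b'.', utf8_bytes).group(0)
```

Literal words such as the names above still match on UTF-8 bytes; the differences show up with character classes, `.`, and quantifiers applied to a non-ASCII character.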
UPDATE
OK, I found the problem. I was running my tests in IPython, which unfortunately seems to mess up the encoding. Example:
In the Python shell:
>>> string_utf8 = 'Test « with theses » quotes Éléments'
>>> string_utf8
'Test \xc2\xab with theses \xc2\xbb quotes \xc3\x89l\xc3\xa9ments'
>>> print string_utf8
Test « with theses » quotes Éléments
>>>
>>> unicode_string = u'Test « with theses » quotes Éléments'
>>> unicode_string
u'Test \xab with theses \xbb quotes \xc9l\xe9ments'
>>> print unicode_string
Test « with theses » quotes Éléments
>>>
>>> unicode_decoded_from_utf8 = string_utf8.decode('utf-8')
>>> unicode_decoded_from_utf8
u'Test \xab with theses \xbb quotes \xc9l\xe9ments'
>>> print unicode_decoded_from_utf8
Test « with theses » quotes Éléments
In IPython:
In [1]: string_utf8 = 'Test « with theses » quotes Éléments'
In [2]: string_utf8
Out[2]: 'Test \xc2\xab with theses \xc2\xbb quotes \xc3\x89l\xc3\xa9ments'
In [3]: print string_utf8
Test « with theses » quotes Éléments
In [4]: unicode_string = u'Test « with theses » quotes Éléments'
In [5]: unicode_string
Out[5]: u'Test \xc2\xab with theses \xc2\xbb quotes \xc3\x89l\xc3\xa9ments'
In [6]: print unicode_string
Test « with theses » quotes Ãléments
In [7]: unicode_decoded_from_utf8 = string_utf8.decode('utf-8')
In [8]: unicode_decoded_from_utf8
Out[8]: u'Test \xab with theses \xbb quotes \xc9l\xe9ments'
In [9]: print unicode_decoded_from_utf8
Test « with theses » quotes Éléments
As you can see, IPython mangles the encoding when the u'' notation is used. That was the source of my problems. The bug is mentioned here: https://bugs.launchpad.net/ipython/+bug/339642
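The Out[5] value above looks like what you get when the UTF-8 bytes of the literal are each taken as a separate code point. This sketch only reproduces that effect in plain Python; it is an illustration of the symptom, not a description of what IPython does internally, and the variable names are made up:

```python
# -*- coding: utf-8 -*-

original = u'Éléments'

# Encode to UTF-8, then wrongly interpret each byte as a Latin-1 code point.
mangled = original.encode('utf-8').decode('latin-1')

# Reversing the two steps recovers the intended text.
repaired = mangled.encode('latin-1').decode('utf-8')
```

Here `mangled` has the same `\xc3\x89l\xc3\xa9ments` shape as the Out[5] line, while `repaired` is equal to `original` again.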