Usually, the best practice in Python when working with international text is to use unicode throughout: convert any input to unicode as early as possible, and encode to a byte string (UTF-8 most of the time) as late as possible.

But when I need to run a regex on unicode, I don't find the process very friendly. For example, if I need to find the 'é' character followed by one or more spaces, I have to write (note: my shell and Python files are set to UTF-8):

re.match('(?u)\xe9\s+', unicode)

So I have to write the Unicode code point of 'é'. That's not really convenient, and if I need to build the regex from a variable, things start to get ugly. Example:

word_to_match = 'Élisa™'.decode('utf-8') # returns a unicode object
regex = '(?u)%s\s+' % word_to_match
re.match(regex, unicode)

And this is a simple example. So if you have a lot of regexes to run one after another with special characters in them, I find it easier and more natural to run the regex on a string encoded in UTF-8. Example:

re.match('Élisa\s+', string)
re.match('Geneviève\s+', string)
re.match('DrØshtit\s+', string)

Is there something I'm missing? What are the drawbacks of the UTF-8 approach?

UPDATE

OK, I found the problem. I was running my tests in ipython, but unfortunately it seems to mess up the encoding. Example:

In the python shell

>>> string_utf8 = 'Test « with theses » quotes Éléments'
>>> string_utf8
'Test \xc2\xab with theses \xc2\xbb quotes \xc3\x89l\xc3\xa9ments'
>>> print string_utf8
Test « with theses » quotes Éléments
>>>
>>> unicode_string = u'Test « with theses » quotes Éléments'
>>> unicode_string
u'Test \xab with theses \xbb quotes \xc9l\xe9ments'
>>> print unicode_string
Test « with theses » quotes Éléments
>>>
>>> unicode_decoded_from_utf8 = string_utf8.decode('utf-8')
>>> unicode_decoded_from_utf8
u'Test \xab with theses \xbb quotes \xc9l\xe9ments'
>>> print unicode_decoded_from_utf8
Test « with theses » quotes Éléments

In ipython

In [1]: string_utf8 = 'Test « with theses » quotes Éléments'

In [2]: string_utf8
Out[2]: 'Test \xc2\xab with theses \xc2\xbb quotes \xc3\x89l\xc3\xa9ments'

In [3]: print string_utf8
Test « with theses » quotes Éléments

In [4]: unicode_string = u'Test « with theses » quotes Éléments'

In [5]: unicode_string
Out[5]: u'Test \xc2\xab with theses \xc2\xbb quotes \xc3\x89l\xc3\xa9ments'

In [6]: print unicode_string
Test « with theses » quotes Éléments

In [7]: unicode_decoded_from_utf8 = string_utf8.decode('utf-8')

In [8]: unicode_decoded_from_utf8
Out[8]: u'Test \xab with theses \xbb quotes \xc9l\xe9ments'

In [9]: print unicode_decoded_from_utf8
Test « with theses » quotes Éléments

As you can see, ipython mangles the encoding when the u'' notation is used: the literal ends up holding the UTF-8 bytes as if each byte were a separate character. That was the source of my problems. The bug is reported here: https://bugs.launchpad.net/ipython/+bug/339642

+1  A: 

If you're using UTF-8 in your Python source, you can just write:

u'Élisa'

and that would be a unicode string, equivalent to writing:

u'\xc9lisa'

So the 'u' prefix makes the decode step unnecessary. If you leave out the 'u' and write:

'Élisa'

Then you'd have a (utf-8 encoded) bytestring, equivalent to:

'\xc3\x89lisa'
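As a quick check of the equivalence described above, the following sketch compares the two forms. It assumes the source file is saved as UTF-8, and it uses the b'' prefix (available in Python 2.6+ and 3.x) to make the byte string explicit. The variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
text = u'Élisa'                   # unicode string: 5 code points
encoded = text.encode('utf-8')    # byte string: 'É' takes 2 bytes in UTF-8

same_text = (text == u'\xc9lisa')             # literal and escape are the same string
same_bytes = (encoded == b'\xc3\x89lisa')     # UTF-8 bytes of the same text
lengths = (len(text), len(encoded))           # characters vs. bytes
round_trip = encoded.decode('utf-8')          # decoding gives the unicode string back
```

The length difference (5 characters, 6 bytes) is the reason the two forms behave differently once a regex starts counting or classifying characters.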
Steven
That's what I did at first but it didn't work. But I just found why: I was doing my tests in ipython and it seems to mess with encoding and decoding. I added some info at the bottom of my question.
Etienne
+1  A: 

You're using Python 2.x? If so, it's generally considered rather bad form to leave your non-ASCII characters in byte strings. Just use Unicode strings the whole way through:

re.match(u'Élisa™\\s+', unicodestring)

It may look a bit funny writing ‘u’ at the start of your string literals, but that goes away in Python 3.x, and it's really not that bad.
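When the pattern has to be built from a variable, the same all-Unicode approach still applies. Here is a minimal sketch, assuming the word is already a unicode object and the source file is saved as UTF-8. The use of re.escape is an addition that is not part of the answer above; it is there so that any regex metacharacters in the word are matched literally. The names word_to_match and text are hypothetical.

```python
# -*- coding: utf-8 -*-
import re

word_to_match = u'Élisa™'         # already unicode, no .decode() needed
text = u'Élisa™   Dupont'         # hypothetical input, decoded early

# Escape the word so it is matched literally, then append the regex part.
regex = re.escape(word_to_match) + u'\\s+'
pattern = re.compile(regex, re.UNICODE)

match = pattern.match(text)
matched = match.group(0) if match else None
```

Keeping both the pattern and the subject as unicode means no byte-level escapes have to be written by hand.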

Matching UTF-8 strings with regex works for a limited subset of expressions. But if you want to use case-insensitive matches, or non-ASCII characters in a [group], or length-sensitive expressions, it'll go wrong. Best stick with Unicode.
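The three failure cases can be reproduced with a short sketch. It compares the same pattern applied to a unicode string and to its UTF-8 encoding. The names are illustrative, and escapes are used instead of literal characters so the result does not depend on the source encoding.

```python
# -*- coding: utf-8 -*-
import re

text = u'\xc9lisa  rest'        # u'Élisa  rest'
data = text.encode('utf-8')     # the same text as UTF-8 bytes

# 1. Length-sensitive: '.' is one character in unicode, but one byte in UTF-8.
dot_unicode = re.match(u'^.lisa', text) is not None
dot_bytes = re.match(b'^.lisa', data) is not None

# 2. Character class: [éÉ] encoded as UTF-8 becomes a class of single bytes.
cls = u'[\xe9\xc9]lisa'
cls_unicode = re.match(cls, text) is not None
cls_bytes = re.match(cls.encode('utf-8'), data) is not None

# 3. Case-insensitive: 'é' matches 'É' only when the regex sees characters.
low = u'\xe9lisa'
ci_unicode = re.match(low, text, re.IGNORECASE | re.UNICODE) is not None
ci_bytes = re.match(low.encode('utf-8'), data, re.IGNORECASE) is not None
```

In each pair the unicode version matches and the UTF-8 version does not, because the byte-level regex operates on the two bytes of 'É' separately.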

(You probably don't especially need the (?u) if you are only using \s; it only brings in some of the more unusual spaces that you may not want to match anyway. It is useful for case-insensitive matching on Unicode strings, though.)
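To illustrate which "unusual spaces" the flag brings in, here is a small sketch using two non-ASCII whitespace characters. It only asserts the behavior with the flag set. On Python 2, the same calls without the flag would match only [ \t\n\r\f\v], while on Python 3 text patterns are Unicode-aware by default. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import re

nbsp = u'\xa0'        # NO-BREAK SPACE
em_space = u'\u2003'  # EM SPACE

# With re.UNICODE, \s matches non-ASCII whitespace too.
nbsp_is_space = re.match(u'\\s', nbsp, re.UNICODE) is not None
em_is_space = re.match(u'\\s', em_space, re.UNICODE) is not None

# The inline form (?u) at the start of the pattern sets the same flag.
inline_is_space = re.match(u'(?u)\\s', nbsp) is not None
```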

bobince
Sorry for not mentioning that I was using python 2.x. That's what I did at first but it didn't work. But I just found why: I was doing my tests in ipython and it seems to mess with encoding and decoding. I added some info at the bottom of my question.
Etienne
Ah, yeah... pushing unencoded non-ASCII characters through the console is always a bit dodgy, unfortunately, especially on Windows. Should be fine in your scripts with the encoding line at the top though.
bobince
`ur'Élisa™\s+'`
Glenn Maynard