ansaurus

Question

For regular expressions, what are the advantages and disadvantages of using a UTF-8 encoded byte string instead of a unicode string?

Answer 1

+1 A:

If your Python source file is saved as UTF-8 (with the matching encoding line at the top), you can just write:

u'Élisa'

and that would be a unicode string, equivalent to writing:

u'\xc9lisa'

So the 'u' prefix makes an explicit decode step unnecessary. If you leave out the 'u' and write:

'Élisa'

Then you'd have a (utf-8 encoded) bytestring, equivalent to:

'\xc3\x89lisa'
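
A small sketch of the same distinction, assuming Python 3 (where plain literals are already unicode, so the 'u' prefix is optional and the byte string has to be produced with an explicit encode). The variable names are illustrative only:

```python
# Assumes Python 3 and a source file saved as UTF-8.
text = 'Élisa'                      # unicode string (u'Élisa' in Python 2)
same_text = '\xc9lisa'              # the same string written with an escape

utf8_bytes = text.encode('utf-8')   # what a bare 'Élisa' literal holds in Python 2
same_bytes = b'\xc3\x89lisa'        # the same bytes written with escapes

texts_equal = (text == same_text)          # 'É' is one code point
bytes_equal = (utf8_bytes == same_bytes)   # 'É' is two bytes in UTF-8
lengths = (len(text), len(utf8_bytes))     # characters vs. bytes
```

Calling utf8_bytes.decode('utf-8') gives back the unicode string, which is the step the 'u' prefix saves.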

Steven 2009-09-25 16:24:47

That's what I did at first but it didn't work. But I just found out why: I was running my tests in IPython, which seems to interfere with encoding and decoding. I added some info at the bottom of my question.

Etienne 2009-09-25 17:27:59

Answer 2

+1 A:

You're using Python 2.x? If so, it's generally considered rather bad form to leave your non-ASCII characters in byte strings. Just use Unicode strings the whole way through:

re.match(u'Élisa™\\s+', unicodestring)

It may look a bit funny writing ‘u’ at the start of your string literals, but that goes away in Python 3.x, and it's really not that bad.

Matching UTF-8 strings with regex works for a limited subset of expressions. But the regex engine then treats every byte of a multi-byte character as a character of its own, so if you want to use case-insensitive matches, or non-ASCII characters in a [character class], or length-sensitive expressions, it'll go wrong. Best stick with Unicode.
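
The three failure cases can be sketched as follows. This assumes Python 3, where the UTF-8 byte strings are built with an explicit encode; the sample strings and variable names are made up for illustration.

```python
import re

name = 'Élisa'
name_utf8 = name.encode('utf-8')

# 1. Case-insensitive matching: byte patterns only fold ASCII letters.
ci_unicode = re.match('élisa', name, re.IGNORECASE) is not None
ci_utf8 = re.match('élisa'.encode('utf-8'), name_utf8, re.IGNORECASE) is not None

# 2. Non-ASCII characters in a class: as bytes, [ÉÀ] is a set of single
#    bytes (0xC3, 0x89, 0x80), so it cannot consume the two-byte 'É'.
cls_unicode = re.fullmatch('[ÉÀ]lisa', name) is not None
cls_utf8 = re.fullmatch('[ÉÀ]lisa'.encode('utf-8'), name_utf8) is not None

# 3. Length-sensitive expressions: '.' is one character in a unicode
#    string but only one byte in a UTF-8 byte string.
len_unicode = re.fullmatch('.{5}', name) is not None
len_utf8 = re.fullmatch(b'.{5}', name_utf8) is not None
```

In each pair the unicode variant matches and the UTF-8 variant does not.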

(You probably don't especially need the (?u) if you are only using \s; for \s it only brings in some of the more unusual spaces that you may not want to match anyway. It is useful for case-insensitive matching on Unicode strings though.)
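
To see which extra spaces are meant, here is a sketch assuming Python 3, where the situation is inverted: str patterns are Unicode-aware by default (as if (?u) were always on), and re.ASCII selects the narrow behaviour that Python 2 used without the flag. The variable names are illustrative.

```python
import re

nbsp = '\u00a0'   # NO-BREAK SPACE, one of the "more unusual spaces"

# Unicode-aware \s (the Python 3 default for str patterns; (?u) in Python 2).
wide_matches_nbsp = re.match(r'\s', nbsp) is not None

# ASCII-only \s: just space, \t, \n, \r, \f and \v.
narrow_matches_nbsp = re.match(r'\s', nbsp, re.ASCII) is not None
narrow_matches_tab = re.match(r'\s', '\t', re.ASCII) is not None
```

Only the Unicode-aware \s accepts the no-break space; ordinary whitespace such as a tab matches either way.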

bobince 2009-09-25 16:37:22

Sorry for not mentioning that I was using Python 2.x. That's what I did at first but it didn't work. But I just found out why: I was running my tests in IPython, which seems to interfere with encoding and decoding. I added some info at the bottom of my question.

Etienne 2009-09-25 17:27:24

Ah, yeah... pushing unencoded non-ASCII characters through the console is always a bit dodgy, unfortunately, especially on Windows. Should be fine in your scripts with the encoding line at the top though.

bobince 2009-09-25 18:10:02

`ur'Élisa™\s+'`

Glenn Maynard 2009-09-25 20:09:15
