ansaurus

Question

UTF in Python Regex

Answer 1

+4 A:

You have to escape the character in question (–) and prefix the string literal with u to make it a Unicode string.

So, for example, this:

re.compile("–")

becomes this:

re.compile(u"\u2013")
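As a minimal sketch of the escaped form in use (the sample text and variable names below are hypothetical, not from the original question), the pattern can be compiled once and searched against a Unicode string:

```python
import re

# u"\u2013" spells the EN DASH as an escape, so the pattern does not
# depend on how the source file itself is encoded.
pattern = re.compile(u"\u2013")

# Hypothetical sample text: three letters, an en dash, one more letter.
sample = u"xxx\u2013x"

match = pattern.search(sample)
print(match.start())  # index of the en dash, counted in characters
```

Because both the pattern and the subject are Unicode strings, the reported position is a character offset rather than a byte offset.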

Patrick McElhaney 2008-12-16 18:01:19

I was putting an r before the string to make it a raw string.

Teifion 2008-12-16 18:31:24

Answer 2

+1 A:

After a quick test and a visit to PEP 0263: Defining Python Source Code Encodings, I see you may need to tell Python the whole file is UTF-8 encoded by adding a comment like this as the first (or second) line.

# encoding: utf-8

Here's the test file I created and ran on Python 2.5.1 / OS X 10.5.6:

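# encoding: utf-8
import re
# Both literals are byte strings here, so matching happens on raw UTF-8 bytes.
x = re.compile("–")
# Python 2 print statement; shows the offset of the first match.
print x.search("xxx–x").start()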
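A possible variant of this test, sketched under the assumption that the file really is saved as UTF-8, combines the encoding declaration with u"..." literals so the literal dash is decoded to a single code point. The variable names are illustrative only. On Python 3 the declaration is optional, since UTF-8 is already the default source encoding.

```python
# -*- coding: utf-8 -*-
import re

# With the declaration above and a UTF-8 saved file, the literal en dash
# inside a u"..." string is decoded to the single code point U+2013.
pattern = re.compile(u"–")
text = u"xxx–x"

# The literal form and the escaped form denote the same one-character string.
literal_matches_escape = (u"–" == u"\u2013")

# Character offset of the first en dash in the text.
start = pattern.search(text).start()
```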

Patrick McElhaney 2008-12-16 18:20:34

Answer 3

+2 A:

Don't use UTF-8 encoded byte strings in a regular expression. UTF-8 is a multibyte encoding in which some Unicode code points are encoded as two or more bytes, so the regex engine sees individual bytes and may match parts of a character that you didn't plan to match. Instead, use Unicode strings as suggested.
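To illustrate the kind of partial match this warns about, here is a small sketch (the names are hypothetical) in which a character class built from the UTF-8 bytes of an en dash also matches the first byte of an em dash, while the Unicode version of the same class does not:

```python
import re

EN_DASH = u"\u2013"
EM_DASH = u"\u2014"

# Byte-level pattern: the class contains the three UTF-8 bytes of the
# en dash (0xE2 0x80 0x93), each treated as a separate alternative.
byte_pattern = re.compile(b"[" + EN_DASH.encode("utf-8") + b"]")

# Text-level pattern: the class contains exactly one code point.
text_pattern = re.compile(u"[" + EN_DASH + u"]")

# An em dash encodes to 0xE2 0x80 0x94, sharing its first two bytes
# with the en dash.
em_dash_bytes = EM_DASH.encode("utf-8")

false_hit = byte_pattern.search(em_dash_bytes)  # matches a lone byte
text_hit = text_pattern.search(EM_DASH)         # no match
```

Here `false_hit` is a match on a single byte of a different character, which is the unintended behaviour, whereas `text_hit` is `None`.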

unbeknown 2008-12-16 18:27:38
