I'm aware that Python 3 fixes a lot of Unicode issues, but I can't use Python 3; I'm on 2.5.1.

I'm trying to run a regex over a document, but the document contains Unicode en dashes (–, U+2013) rather than plain hyphens (-). Python can't match these, and if I put them in the regex it throws a wobbly.

How can I force Python to use a Unicode string, or in some other way match a character like that?

Thanks for your help

+4  A: 

You have to write the character in question (–) as an escape sequence (\u2013) and put a u in front of the string literal to make it a unicode string.

So, for example, this:

re.compile("–")

becomes this:

re.compile(u"\u2013")
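The pattern only helps if the text being searched is also a unicode string, so the document's bytes need decoding first. A minimal sketch, assuming the document is UTF-8 encoded; the sample text and the variable names are made up for illustration:

```python
import re

# Hypothetical sample: UTF-8 bytes as they might come out of a file.
raw = u"pages 10\u201320".encode("utf-8")

# Decode to a unicode string before matching (assumes UTF-8 input).
text = raw.decode("utf-8")

# Match either an en dash (U+2013) or a plain hyphen.
# The trailing "-" inside the class is a literal hyphen.
dash = re.compile(u"[\u2013-]")

match = dash.search(text)
normalized = dash.sub(u"-", text)  # replace every dash variant with "-"
```

Here match.start() is a character offset into text, and normalized holds the same text with the en dash replaced by a plain hyphen.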
Patrick McElhaney
I was putting an r before the string to make it a raw string.
Teifion
+1  A: 

After a quick test and a visit to PEP 263 (Defining Python Source Code Encodings), I see you may need to tell Python the whole file is UTF-8 encoded by adding a comment like this on the first or second line.

# encoding: utf-8

Here's the test file I created and ran on Python 2.5.1 / OS X 10.5.6

# encoding: utf-8
import re
x = re.compile("–") 
print x.search("xxx–x").start()
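The test above matches UTF-8 byte strings against each other, so the position it reports is a byte offset. A rough variant, assuming the same encoding declaration, that uses unicode literals and compares the two kinds of offset (the sample text is invented for illustration):

```python
# -*- coding: utf-8 -*-
import re

# Unicode literals: each character is one unit, whatever its byte length.
text = u"é–x"
dash = re.compile(u"–")
char_offset = dash.search(text).start()

# The same search on the UTF-8 encoded bytes counts bytes instead.
byte_text = text.encode("utf-8")
byte_dash = re.compile(u"–".encode("utf-8"))
byte_offset = byte_dash.search(byte_text).start()

# The two offsets differ because "é" occupies two bytes in UTF-8.
offsets_differ = char_offset != byte_offset
```

The two offsets disagree as soon as any multibyte character precedes the match, which matters if the result is used to slice the text.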
Patrick McElhaney
+2  A: 

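Don't use UTF-8 byte strings in a regular expression. UTF-8 is a multibyte encoding in which some Unicode code points are encoded as two or more bytes, and the regex engine treats each of those bytes as a separate character. A character class or quantifier then applies to individual bytes, so you may match parts of your string that you didn't plan to match. Instead, use unicode strings as suggested.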
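To illustrate the kind of accidental match this can cause, here is a small sketch with a character class. The sample text is hypothetical; it contains an em dash (U+2014), whose UTF-8 encoding shares its first two bytes with the en dash (U+2013):

```python
import re

# Text containing an em dash (U+2014), NOT an en dash (U+2013).
text = u"a\u2014b"

# Byte-level class: "[...]" here holds the three separate bytes
# E2, 80 and 93, so any one of them matches on its own.
byte_class = re.compile(u"[\u2013]".encode("utf-8"))
byte_hit = byte_class.search(text.encode("utf-8"))

# Unicode-level class: holds the single code point U+2013.
unicode_class = re.compile(u"[\u2013]")
unicode_hit = unicode_class.search(text)

false_positive = byte_hit is not None   # matched the em dash's first byte
correct_miss = unicode_hit is None      # no en dash in the text
```

The byte pattern reports a match on the first byte of the em dash, while the unicode pattern correctly finds nothing.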

unbeknown