ansaurus

Question

how do i regex search for weird non-ascii characters in python?

Answer 1

A:

After telling Python that your source file uses UTF-8 encoding, did you actually make sure that your editor is saving the file using UTF-8 encoding? The error you get indicates that your editor is probably not using UTF-8.

What text editor are you using?

Greg Hewgill 2010-01-11 02:51:10

I'm using notepad++

Incognito 2010-01-11 02:58:05

Here's how to configure UTF-8 in Notepad++: http://superuser.com/questions/21135/how-can-i-edit-unicode-text-in-notepad

Greg Hewgill 2010-01-11 03:08:15

Answer 2

A:

\x{c0de}

In a regex will match the Unicode character at code point c0de.

Python uses PCRE, right? (If it doesn't, it's probably \uC0DE instead...)

Anon. 2010-01-11 02:51:32

Answer 3

+3 A:

You need to find out what encoding your editor is using, and set that per PEP263; or, make things more stable and portable (though alas perhaps a bit less readable) and use escape sequences in your string literal, i.e., use u'(\xdb|\xb2|\xb0|\xb1|\xc9|\xb9|\xcd)' as the parameter to the re.compile call.

Alex Martelli 2010-01-11 02:52:49

I wouldn't use the byte representation of an Unicode char as it's really locale dependent. I'd escape as Unicode, and thus avoiding any complication.

Paulo Santos 2010-01-11 03:10:30

@Paulo, Unicode encodings are **not** necessarily locale-dependent -- you can use `'utf-8'` or other locale-independent encodings. But in any case, the form I give above is what a `print repr(thestring)` emits, does **not** rely on any encoding, and cannot possibly cause any complications (it's just the same as using `\u00db` and so on, guaranteed to produce absolutely identical Unicode objects, just more concise as it saves you typing the unchanging `00` parts!-).

Alex Martelli 2010-01-11 03:17:59

ansaurus

tags:

views:

answers:

how do i regex search for weird non-ascii characters in python?

related questions