ansaurus

Question

Regex and unicode

Answer 1

+4 A:

Use a subrange of [\u0000-\uFFFF] for what you want.

You can also use the re.UNICODE compile flag. The docs say that if UNICODE is set, \w will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

See also http://coding.derkeiler.com/Archive/Python/comp.lang.python/2004-05/2560.html.

Mark Cidade 2008-08-18 09:43:10

Answer 2

A:

\X seems to be available as a generic word-character in some languages, it allows you to match a single character disregarding of how many bytes it takes up. Might be useful.

grapefrukt 2008-08-18 09:53:13

Answer 3

+1 A:

In Mastering Regular Expressions from Jeffrey Friedl (great book) it is mentioned that you could use \p{Letter} which will match unicode stuff that is considered a letter.

Peter Stuifzand 2008-08-18 10:17:35

ansaurus

tags:

views:

answers:

Regex and unicode

related questions