ansaurus

Question

Answer 1

+4 A:

In Python 2, you need to specify the UNICODE flag (`re.U`); otherwise `\w` is just equivalent to `[a-zA-Z0-9_]`, which does not include the character 'ç'.

>>> import re
>>> re.compile(r"^\w*$", re.U).match(u"Fran\xe7ais")
<_sre.SRE_Match object at 0x101474168>

KennyTM 2010-08-31 08:36:54
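
For readers on Python 3, here is a minimal sketch of the same distinction. It assumes Python 3, where `str` patterns are Unicode-aware by default, so the `re.ASCII` flag is used to reproduce the ASCII-only `\w`. The variable names are illustrative and not part of the original answer.

```python
import re

word = "Français"

# Unicode-aware \w: the default for str patterns on Python 3
# (the behaviour re.U switches on in Python 2).
unicode_match = re.compile(r"^\w*$").match(word)

# ASCII-only \w, i.e. [a-zA-Z0-9_]; 'ç' is not a word character here,
# so the anchored pattern cannot reach the end of the string.
ascii_match = re.compile(r"^\w*$", re.ASCII).match(word)

print(unicode_match is not None)  # True
print(ascii_match is None)        # True
```

With the anchors in place, the ASCII-only pattern fails outright instead of returning a partial match.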

Why does this work then: `>>> re.compile(r"\w*").match(u"Français")`?

ak 2010-08-31 08:37:51

@ak: Are you sure the match returns `Français` rather than just `Fran`? Note that without the `$`, the regex does not have to match through to the end of the string.

KennyTM 2010-08-31 08:38:32

`\w*` will successfully match against absolutely any string. `*` matches 0 or more times, so an empty match is enough.

Turtle 2010-08-31 08:39:42

Oh, dumb me... thanks a bunch!

ak 2010-08-31 08:41:37
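
The point made in the comments, that an unanchored `\w*` always succeeds, can be sketched roughly as follows. This assumes Python 3 and uses `re.ASCII` as a stand-in for the flag-less Python 2 behaviour discussed in the thread. All names are hypothetical.

```python
import re

text = "Français"

# Unanchored, ASCII-only \w*: stops at the first non-word character.
ascii_word = re.compile(r"\w*", re.ASCII)
partial = ascii_word.match(text)
print(partial.group())  # 'Fran'

# Zero repetitions also count, so a string that starts with a
# non-ASCII letter still yields a (empty) match rather than None.
empty = ascii_word.match("çà")
print(repr(empty.group()))  # ''

# Anchoring with ^...$ forces the whole string to consist of word
# characters, so the ASCII-only pattern now fails.
anchored = re.compile(r"^\w*$", re.ASCII).match(text)
print(anchored)  # None
```

Checking `match.group()` rather than just the truthiness of the match object makes the partial match visible.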

Python regex with unicode characters bug?