Perl and some other current regex engines support Unicode properties, such as the general category, in a regex. For example, in Perl you can use \p{Ll} to match any lowercase letter, or \p{Zs} to match any space separator. I don't see support for this in either the 2.x or the 3.x line of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.

+2  A: 

You're right that Unicode property classes are not supported by the Python regex parser.

If you wanted to do a nice, generally useful hack, you could create a preprocessor that scans a pattern string for such class tokens (\p{M} or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M} would become something like [\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F], and \P{M} would become [^\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F].
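A minimal sketch of such a preprocessor for Python 3, assuming the standard unicodedata module is an acceptable source for the categories. The names category_class and expand_properties are made up for this illustration. It only knows general categories (no scripts or blocks), and it does not cope with \p{...} inside an existing [...] class or after an escaped backslash.

```python
import functools
import re
import sys
import unicodedata


def _esc(cp):
    # Spell every code point as an escape so the pattern stays pure ASCII.
    return "\\u%04x" % cp if cp <= 0xFFFF else "\\U%08x" % cp


@functools.lru_cache(maxsize=None)
def category_class(prefix):
    """Return the inside of a [...] class covering all code points whose
    general category starts with `prefix` (e.g. "Ll", "Zs", or just "M")."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    if not ranges:
        raise ValueError("unknown or empty category: %r" % prefix)
    return "".join(
        _esc(a) if a == b else _esc(a) + "-" + _esc(b) for a, b in ranges
    )


# Hypothetical token syntax: \p{Name} or \P{Name}, outside any [...] class.
_PROP = re.compile(r"\\([pP])\{(\w+)\}")


def expand_properties(pattern):
    """Replace \\p{X} / \\P{X} tokens with explicit character classes."""
    def repl(match):
        opener = "[^" if match.group(1) == "P" else "["
        return opener + category_class(match.group(2)) + "]"
    return _PROP.sub(repl, pattern)


lower_word = re.compile(expand_properties(r"\p{Ll}+"))
space_sep = re.compile(expand_properties(r"\p{Zs}"))
```

Because the classes are generated from the interpreter's own Unicode database rather than typed in by hand, the table stays in sync with whatever Unicode version that Python ships; the first call per category scans every code point, hence the cache.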

People would thank you. :)

Jonathan Feinberg
Right, creating character classes crossed my mind. But with roughly 40 categories you end up producing 80 classes (a plain and a negated one for each), and that's not counting Unicode scripts, blocks, planes and whatnot. Might be worth a little open source project, but still a maintenance nightmare. I just discovered that re.VERBOSE doesn't apply inside character classes, so no comments or whitespace there to help readability...
ThomasH
+4  A: 

Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.

joeforker
Very nice, thanks for the link!
ThomasH
+1  A: 

Note that while \p{Ll} has no equivalent in Python regular expressions, \p{Zs} should be covered by '(?u)\s'. The (?u) flag, as the docs say, will “Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.”, and \s matches any whitespace character.

ΤΖΩΤΖΙΟΥ
You're right. Problem is, '(?u)\s' matches more than '\p{Zs}' does, including e.g. newline. So if you really want to match only space separators, the former overmatches.
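To see how much wider \s is, the two sets can be compared directly. A small sketch for Python 3, where str patterns are Unicode-aware by default (so (?u) is implied); the variable names are only for illustration.

```python
import re
import sys
import unicodedata

whitespace = re.compile(r"\s")

# Every code point whose general category is exactly Zs (space separator).
zs = {chr(cp) for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Zs"}

# Every single character that \s accepts.
ws = {chr(cp) for cp in range(sys.maxunicode + 1)
      if whitespace.match(chr(cp))}

# Characters matched by \s that are not space separators.
extra = sorted(ws - zs)
for ch in extra:
    print("U+%04X" % ord(ch), unicodedata.category(ch))
```

The leftovers are things like tab, newline and carriage return (category Cc), which is exactly the overmatching in question.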
ThomasH