Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll}
to match an arbitrary lower-case letter, or p{Zs}
for any space separator. I don't see support for this in either the 2.x nor 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.
views:
238answers:
3You're right that Unicode property classes are not supported by the Python regex parser.
If you wanted to do a nice hack, that would be generally useful, you could create a preprocessor that scans a string for such class tokens (\p{M}
or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M}
would become [\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F]
, and \P{M}
would become [^\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F]
.
People would thank you. :)
Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian}
to match Armenian characters. \p{Ll}
or \p{Zs}
work too.
Note that while \p{Ll}
has no equivalent in Python regular expressions, \p{Zs}
should be covered by '(?u)\s'
.
The (?u)
, as the docs say, “Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.” and \s
means any spacing character.