ansaurus

Question

Match unicode in ply's regexes

Answer 1

A:

Perhaps POSIX character classes would work for you?

Tomalak 2008-10-26 16:37:58

They don't exist in Python's regex engine

Vinko Vrsalovic 2008-10-26 16:49:24

Answer 2

+1 A:

Check the answers to this question:

http://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python

You'd just need to use the other Unicode character categories instead.
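A minimal sketch of that idea, assuming Python 3 and the standard unicodedata module: it collects the characters whose general category is a letter (L*) or a decimal digit (Nd) and builds character classes from them. The helper name chars_in_categories and the limit of 0x3000 code points are assumptions made for illustration, not something taken from the linked answers.

```python
import re
import unicodedata


def chars_in_categories(prefixes, limit=0x3000):
    """Collect every character below `limit` whose Unicode general
    category starts with one of `prefixes` (e.g. 'L' for letters)."""
    return ''.join(
        chr(cp) for cp in range(limit)
        if unicodedata.category(chr(cp)).startswith(prefixes)
    )


# Hypothetical identifier rules: start with a letter or underscore,
# continue with letters, underscores or decimal digits.
start_chars = chars_in_categories(('L',)) + '_'
rest_chars = start_chars + chars_in_categories(('Nd',))

identifier = re.compile(
    '[%s][%s]*' % (re.escape(start_chars), re.escape(rest_chars))
)

for candidate in ['ödipus', 'a9', '9abc']:
    print(candidate, identifier.fullmatch(candidate) is not None)
```

The resulting character classes are large, so this mostly makes sense when the rules need categories that \w does not express.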

Vinko Vrsalovic 2008-10-26 16:58:56

Answer 3

+1 A:

Solved it with the help of Vinko.

I realised that getting a Unicode range is plain dumb. So I'll do this instead:

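import re

# ASCII punctuation: printable characters that are neither letters nor digits.
symbols = re.escape(''.join([chr(i) for i in range(33, 127) if not chr(i).isalnum()]))
# Punctuation plus digits: characters that may not start an identifier.
symnums = re.escape(''.join([chr(i) for i in range(33, 127) if not chr(i).isalpha()]))

t_IDENTIFIER = "[^%s](\\.|[^%s])*" % (symnums, symbols)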

I don't know about Unicode character classes. If this Unicode stuff starts getting too complicated, I can just put the original pattern back in place. UTF-8 support still covers the STRING tokens, which is more important.

Edit: On the other hand, I'm starting to understand why there's not much Unicode support in programming languages. This is an ugly hack, not a satisfying solution.

Cheery 2008-10-26 17:19:46

Answer 4

+2 A:

The re module supports the \w syntax, which the documentation describes as follows:

If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

Therefore the following example shows how to match Unicode identifiers:

>>> import re
>>> m = re.compile(r'(?u)[^\W0-9]\w*')
>>> m.match('a')
<_sre.SRE_Match object at 0xb7d75410>
>>> m.match('9')
>>> m.match('ab')
<_sre.SRE_Match object at 0xb7c258e0>
>>> m.match('a9')
<_sre.SRE_Match object at 0xb7d75410>
>>> m.match('unicöde')
<_sre.SRE_Match object at 0xb7c258e0>
>>> m.match('ödipus')
<_sre.SRE_Match object at 0xb7d75410>

So the expression you look for is: (?u)[^\W0-9]\w*
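As a rough sketch of how the expression behaves, assuming Python 3: [^\W0-9] is a double negative that reads "a word character that is not a digit", so it accepts letters and the underscore as the first character, and \w* then allows digits as well. The helper name is_identifier is hypothetical and used only for this illustration. In Python 3, str patterns are Unicode-aware by default, so (?u) is kept here only so that the flag travels with the pattern string.

```python
import re

# Inline (?u) keeps the Unicode flag inside the pattern itself, which helps
# when another library compiles the pattern on your behalf.
IDENTIFIER = re.compile(r'(?u)[^\W0-9]\w*')


def is_identifier(text):
    """Return True if the whole of `text` is a single identifier."""
    return IDENTIFIER.fullmatch(text) is not None


for candidate in ['a', '9', 'a9', 'unicöde', 'ödipus', '_private', 'two words']:
    print(candidate, is_identifier(candidate))
```

Using fullmatch instead of match checks the entire string, which is convenient for testing the pattern outside a lexer; a lexer would normally match at the current position and continue from where the token ends.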

Florian Bösch 2008-10-26 21:18:53

Now. This is a satisfying solution!

Cheery 2008-10-26 21:22:00

The quote from the Python documentation is correct, but the examples are misleading. You can simply use the UNICODE flag with \w instead of the unnecessarily long expression given: `re.match(r'\w+', "ünıcodê", re.UNICODE)`

Walter 2008-10-26 21:53:05

Walter, you have not properly read the question: 1) the goal is an identifier in a programming language, which usually does not start with 0-9. 2) the parser (ply) takes care of parsing, and you can't control how it will invoke match, therefore (?u) is required.

Florian Bösch 2008-10-27 07:35:26
