ansaurus

Question

Python regex \w doesn't match combining diacritics?

Answer 1

+1 A:

You can use unicodedata.normalize to compose each base character and its combining diacritics into a single Unicode character (where a precomposed form exists).

>>> import re
>>> from unicodedata import normalize
>>> re.match(u"a\w\w\wz", normalize("NFC", u"aoo\u0301oz"), re.UNICODE)
<_sre.SRE_Match object at 0x00BDCC60>

I know you said you didn't want to normalize, but I don't think this solution will cause a problem: you're only normalizing the string you match against, and you don't have to change the filename itself.
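For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where str patterns are Unicode-aware by default, so the u prefix and the re.UNICODE flag are not needed. The variable names are only illustrative.

```python
import re
from unicodedata import normalize

# "o" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible letter.
decomposed = "aoo\u0301oz"

pattern = re.compile(r"a\w\w\wz")

# \w does not match the combining mark on its own, so the raw string does not match.
raw_match = pattern.match(decomposed)

# NFC composes "o" + U+0301 into the single code point U+00F3, which \w does match.
composed = normalize("NFC", decomposed)
nfc_match = pattern.match(composed)

print(raw_match)          # None
print(nfc_match.group())  # the NFC text, not the original code points
```

Note that the match object refers to the normalized copy, so any text pulled out of it is in NFC form rather than in the form the input used.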

Steven 2010-06-29 15:41:38

Yes, that will tell me if I have a match, but after doing the match, I pull out the matching groups and then do stuff with them. If I used your approach, the bytes I have afterwards would not be the same as the bytes in the filename.

Rory 2010-06-30 08:57:22

I see. Do you know if the strings are consistent in their use of combining diacritics (always combining, or at least always combining or never combining within a single string)? If so, you could normalize the results to NFC or NFD again as needed. Otherwise, I think you will have to resort to tricks: detect the positions of the combining diacritics in the original string, and use that information to decompose only the characters that need it (which would of course be more work than just decomposing everything or nothing).
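A rough sketch of the "normalize the results again" idea, under the assumption that each input string is consistently NFC or consistently NFD. The helper name groups_in_original_form is hypothetical, not something from the standard library.

```python
import re
from unicodedata import normalize

def groups_in_original_form(pattern, text):
    """Match against the NFC form of `text`, then return the groups
    converted back to the form `text` itself used.

    Assumes `text` is entirely NFC or entirely NFD (hypothetical helper).
    """
    # A string that NFD leaves unchanged is already fully decomposed.
    original_form = "NFD" if normalize("NFD", text) == text else "NFC"
    match = re.match(pattern, normalize("NFC", text))
    if match is None:
        return None
    # Groups that did not participate in the match are None; leave them alone.
    return tuple(
        g if g is None else normalize(original_form, g)
        for g in match.groups()
    )

groups = groups_in_original_form(r"a(\w\w\w)z", "aoo\u0301oz")
```

This only gives back the original code points when the consistency assumption holds. For input that mixes precomposed and combining forms within one string, it would not.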

Steven 2010-06-30 15:38:50

Or maybe just change the expression and use the ranges for the combining diacritics you're interested in, and use something like \w[\u0300-\u036F]? instead of just \w
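A minimal sketch of that pattern, assuming Python 3 (where the re module understands \uXXXX escapes inside a raw pattern string). The names LETTER and pattern are only illustrative.

```python
import re

# One "letter": a word character, optionally followed by one mark from the
# Combining Diacritical Marks block (U+0300 to U+036F).
LETTER = r"\w[\u0300-\u036F]?"
pattern = re.compile("a(" + LETTER * 3 + ")z")

decomposed = "aoo\u0301oz"  # "o" + combining acute accent
composed = "ao\u00f3oz"     # precomposed o-acute

m1 = pattern.match(decomposed)
m2 = pattern.match(composed)

# Each group is a slice of the string that was passed in, so its code points
# are exactly those of the input.
print(repr(m1.group(1)))
print(repr(m2.group(1)))
```

Because no normalization happens, the captured groups keep whatever form the input had. Using * instead of ? would also cover letters carrying more than one combining mark.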

Steven 2010-06-30 16:17:04

No, the input is not consistent in how it uses combining diacritics. Some strings use the combined character, some use the combining diacritic.

Rory 2010-07-06 11:45:23

Answer 2

+1 A:

I've just noticed a new "regex" package on PyPI. (If I understand correctly, it is a test version of a new package that will someday replace the stdlib re package.)

It seems to have (among other things) more possibilities with regard to Unicode. For example, it supports \X, which matches a single grapheme (whether or not it uses combining marks). It also supports matching on Unicode properties, blocks and scripts, so you can use \p{M} to refer to combining marks. The \X mentioned before is equivalent to \P{M}\p{M}* (a character that is NOT a combining mark, followed by zero or more combining marks).

Note that this makes \X more or less the Unicode equivalent of ., not of \w, so in your case, \w\p{M}* is what you need.
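A small sketch of both constructs, assuming the third-party regex package is installed (pip install regex) and Python 3 is used. The sample strings are illustrative.

```python
import regex  # third-party package from PyPI, not the stdlib re module

# Three "letters", each a word character followed by zero or more combining marks.
pattern = regex.compile(r"a((?:\w\p{M}*){3})z")

decomposed = "aoo\u0301oz"  # "o" + combining acute accent
composed = "ao\u00f3oz"     # precomposed o-acute

m1 = pattern.match(decomposed)
m2 = pattern.match(composed)

# \X splits text into graphemes, keeping a base character with its marks.
graphemes = regex.findall(r"\X", "oo\u0301o")

print(repr(m1.group(1)), repr(m2.group(1)), graphemes)
```

As with the explicit range approach, the groups are taken from the unmodified input, so precomposed and combining forms both come back exactly as they were given.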

It is (for now) a non-stdlib package, and I don't know how ready it is (and it doesn't come in a binary distribution), but you might want to give it a try, as it seems to be the easiest/most "correct" answer to your question. (Otherwise, I think you're down to explicitly using character ranges, as described in my comment on the previous answer.)

See also this page on Unicode regular expressions, which might contain some useful information for you (and can serve as documentation for some of the things implemented in the regex package).

Steven 2010-07-06 12:54:27
