I have a UTF-8 string with combining diacritics. I want to match it with the \w regex sequence. It matches precomposed accented characters, but not a Latin character followed by a combining diacritic.

>>> re.match("a\w\w\wz", u"aoooz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> print u"ao\u00F3oz"
aoóoz
>>> re.match("a\w\w\wz", u"ao\u00F3oz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> re.match("a\w\w\wz", u"aoo\u0301oz", re.UNICODE)
>>> print u"aoo\u0301oz"
aóooz

(Looks like the SO markdown processor is having trouble with the combining diacritics above, but there is a combining acute accent (U+0301) on the last line)

Is there any way to match combining diacritics with \w? I don't want to normalise the text, because it comes from a filename and I don't want to tackle full filename Unicode normalization yet. This is Python 2.5.

+1  A: 

You can use unicodedata.normalize to compose the combining diacritics into one unicode character.

>>> import re
>>> from unicodedata import normalize
>>> re.match(u"a\w\w\wz", normalize("NFC", u"aoo\u0301oz"), re.UNICODE)
<_sre.SRE_Match object at 0x00BDCC60>

I know you said you didn't want to normalize, but I don't think that will be a problem here: you're only normalizing the string you match against, not changing the filename itself.

Steven
Yes, that will tell me if I have a match, but after matching I pull out the groups and do stuff with them. With your approach, the bytes I'd have afterwards would not be the same bytes as are in the filename
Rory
I see. Do you know if the strings are consistent in their use of combining diacritics (always combining, or at least always combining or not within a single string)? If so, you could normalize the results to NFC or NFD again as needed. Otherwise, I think you will have to resort to tricks with detecting the position of combining diacritics in the original string, and try and use that information to decompose only the needed characters (which would of course be more work than just decomposing everything or not at all).
Steven
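Steven's re-normalization idea could be sketched like this (Python 3 syntax for brevity; the pattern and strings are just for illustration):

```python
import re
from unicodedata import normalize

original = "aoo\u0301oz"  # decomposed: 'o' + U+0301 COMBINING ACUTE ACCENT

# Match against the NFC-composed form so \w sees single code points...
m = re.match(r"a(\w\w\w)z", normalize("NFC", original))
group = m.group(1)  # 'oóo' in precomposed form

# ...then, if the source is known to be decomposed, convert the
# extracted group back with NFD so it matches the original string.
print(normalize("NFD", group) in original)  # True
```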
Or maybe just change the expression and use the ranges for the combining diacritics you're interested in, and use something like \w[\u0300-\u036F]? instead of just \w
Steven
No, the input is not consistent in how it uses combining diacritics. Some filenames use the precomposed character, some use the combining diacritic
Rory
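The range-based suggestion from the comments could look something like this (a sketch, using `*` rather than `?` so stacked marks are also covered):

```python
import re

# One "letter" = a word character plus any trailing marks from the
# Combining Diacritical Marks block (U+0300-U+036F).
letter = r"\w[\u0300-\u036f]*"
pattern = re.compile("a" + letter * 3 + "z")

print(pattern.match("ao\u00f3oz") is not None)   # precomposed: True
print(pattern.match("aoo\u0301oz") is not None)  # decomposed: True
```

In Python 2.5 the pattern would need to be a unicode literal (e.g. `u"\w[\u0300-\u036F]*"`) compiled with `re.UNICODE`.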
+1  A: 

I've just noticed a new "regex" package on PyPI. (If I understand correctly, it is a test version of a new package that will someday replace the stdlib re package.)

It seems to have (among other things) more possibilities with regard to Unicode. For example, it supports \X, which is used to match a single grapheme (whether it uses combining characters or not). It also supports matching on Unicode properties, blocks and scripts, so you can use \p{M} to refer to combining marks. The \X mentioned before is equivalent to \P{M}\p{M}* (a character that is NOT a combining mark, followed by zero or more combining marks).

Note that this makes \X more or less the unicode equivalent of ., not of \w, so in your case, \w\p{M}* is what you need.
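A minimal sketch of that pattern, assuming the third-party regex package is installed (pip install regex):

```python
import regex  # third-party package, not the stdlib re module

# \w\p{M}* = a word character followed by any number of combining marks.
pattern = regex.compile(r"a(?:\w\p{M}*){3}z")

print(pattern.match("ao\u00f3oz") is not None)   # precomposed: True
print(pattern.match("aoo\u0301oz") is not None)  # decomposed: True
```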

It is (for now) a non-stdlib package, and I don't know how mature it is (it doesn't come in a binary distribution), but you might want to give it a try, as it seems to be the easiest/most "correct" answer to your question. (Otherwise, I think you're down to explicitly using character ranges, as described in my comment on the previous answer.)

See also this page with information on Unicode regular expressions, which might contain some useful information for you (and can serve as documentation for some of the things implemented in the regex package).

Steven