ansaurus

Question

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

Answer 1

A:

One thing to remember is that UTF-8 encodes every character outside ASCII as multiple bytes. I don't know whether the string $page_title is being treated as Unicode text or as a plain byte string. If it is a byte string, a single dot will not catch the ā; you will need two dots, or .{1,4}, in its place. Even then you would have to verify that the one to four bytes you grab between the M and the o form a single valid UTF-8 character. This is all moot if PHP handles Unicode properly; I haven't used it in years, so I can't vouch for it.
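For PHP specifically, strings are byte sequences, and the preg_ functions match byte by byte unless the pattern carries the u modifier. A minimal sketch of the difference, assuming $page_title (the variable name from the question) holds UTF-8 text; the ā is written as explicit bytes so the result does not depend on how the file is saved:

```php
<?php
// "ā" (U+0101) is the two bytes 0xC4 0x81 in UTF-8.
$page_title = "M\xC4\x81ori";

// Without /u, "." matches a single byte, so it cannot span the two-byte "ā".
$byte_mode = preg_match('/^M.ori$/', $page_title);

// With /u, pattern and subject are treated as UTF-8 and "." matches one code point.
$utf8_mode = preg_match('/^M.ori$/u', $page_title);

// Prints int(0), int(1), int(6): no match, match, and six bytes for five characters.
var_dump($byte_mode, $utf8_mode, strlen($page_title));
```

With the u modifier there should be no need for double dots or .{1,4}; PCRE also rejects subjects that are not valid UTF-8 in that mode, which covers the validity check.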

The other issue to consider is that ā can be constructed in two ways: as a single character (U+0101), or as two Unicode characters (a plain 'a' followed by a combining macron, U+0304, from the combining diacritics block that starts at U+0300). You're likely to only ever get the former, but be aware that the latter is also possible.
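One way to cover both spellings is to list them explicitly in the pattern. The sketch below is an illustration only: mentions_maori is a hypothetical helper name, not something from the question, and it also accepts the unaccented "Maori" on the assumption that titles are sometimes typed without the macron.

```php
<?php
// Hypothetical helper: true when $text contains the word "Māori" written with
// a precomposed ā, with "a" + combining macron, or with a plain "a".
function mentions_maori(string $text): bool
{
    // \x{0101} is the precomposed ā; a\x{0304} is "a" followed by COMBINING MACRON.
    // /u enables UTF-8 mode, /i ignores case, \b anchors on word boundaries.
    $pattern = '/\bM(?:\x{0101}|a\x{0304}|a)ori\b/iu';
    return preg_match($pattern, $text) === 1;
}

$composed   = mentions_maori("M\xC4\x81ori language");   // precomposed form
$decomposed = mentions_maori("Ma\xCC\x84ori language");  // combining form
$unrelated  = mentions_maori("Mooring and Memorial");    // should not match

var_dump($composed, $decomposed, $unrelated);
```

Because the vowel is spelled out instead of matched with a wildcard, unrelated words such as "Mooring" or "Memorial" are not picked up.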

The only language I know of that handles this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexps.

Jason Scheirer 2010-08-04 21:56:26

preg_match('/\m(.{1,4})ori/i',$page_title) is better, but it still matches words such as "Mooring", "Monitoring" and "Memorial". I think both you and Wrikken are right about multi-byte versus Unicode.

Tama 2010-08-10 22:06:05

Answer 2

A:

Use the /u modifier for UTF-8 mode in regexes.
On the whole, you're better off doing an iconv('utf-8','ascii//TRANSLIT',$string) on both the name and the search term and comparing the results.
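What ASCII//TRANSLIT produces depends on the iconv implementation and the current locale, so the sketch below swaps in an explicit lookup table for the five Māori long vowels instead of calling iconv. fold_macrons is a hypothetical helper name, and the table is an assumption about which characters matter here, not a general transliteration.

```php
<?php
// Hypothetical helper: replaces macron vowels with their plain ASCII letters.
// An explicit table stands in for iconv's ASCII//TRANSLIT in this sketch.
function fold_macrons(string $text): string
{
    return strtr($text, [
        "\xC4\x81" => 'a', "\xC4\x80" => 'A', // ā Ā
        "\xC4\x93" => 'e', "\xC4\x92" => 'E', // ē Ē
        "\xC4\xAB" => 'i', "\xC4\xAA" => 'I', // ī Ī
        "\xC5\x8D" => 'o', "\xC5\x8C" => 'O', // ō Ō
        "\xC5\xAB" => 'u', "\xC5\xAA" => 'U', // ū Ū
        "\xCC\x84" => '',                     // combining macron (U+0304)
    ]);
}

// Fold both the name and the search term, then compare case-insensitively.
$name   = "M\xC4\x81ori Language Week";
$search = "maori";
$found  = stripos(fold_macrons($name), fold_macrons($search)) !== false;

var_dump($found);
```

Folding both sides means a search typed without macrons still finds titles that have them, and the reverse.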

Wrikken 2010-08-04 22:06:35

Thanks Wrikken - I'm going to have to take a good look at how PHP/regex copes with multi-byte versus Unicode.

Tama 2010-08-10 22:06:48
