I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.

For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.

The problem I'm running into is that some of our page names include macrons, specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).

Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".

The autosuggestions are sourced from a cached array built from all the pages and keywords. To try to locate Māori I've been trying various regular expressions like:

preg_match('/\m(.{1})ori/i',$page_title)

This also matches page titles containing "Moorings" but not "Māori". How do preg_match and preg_replace see characters like "ā", and how should I construct the regex to pick them up?

Cheers Tama

A: 

One thing you need to remember is that UTF-8 encodes every character outside ASCII as multiple bytes. I don't know whether the string $page_title is being treated as a Unicode string or as a plain byte string. If it's a byte string, you're going to need two dots there to catch it (ā is two bytes in UTF-8), or {1,4} to cover the general case. Even then you'll have to verify that the up to four bytes you grab between the M and the o form a single valid UTF-8 character. This is all moot if PHP handles Unicode properly; I haven't used it in years, so I can't vouch for it.
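To make the byte-versus-character point concrete, here is a small sketch. It assumes the title is UTF-8 and writes ā as the escape "\u{0101}" (PHP 7 or later) so the result doesn't depend on how the file is saved; the variable names are only for illustration.

```php
<?php
// "Māori History", with ā written as an escape so file encoding doesn't matter.
$title = "M\u{0101}ori History";

// ā is U+0101, which UTF-8 encodes as two bytes (0xC4 0x81).
$bytes = strlen("\u{0101}");

// Without /u the subject is treated as bytes: one "." consumes one byte,
// so a single dot cannot span ā and the pattern fails.
$oneDotBytes = preg_match('/m.ori/i', $title);

// Two dots happen to cover the two bytes of ā.
$twoDotsBytes = preg_match('/m..ori/i', $title);

// With /u the subject is treated as UTF-8: one "." consumes one code point.
$oneDotUtf8 = preg_match('/m.ori/iu', $title);
```

Note that a bare "." still matches any letter, so "Moorings" keeps matching either way; the dot has to be replaced by something narrower to get rid of the false positives.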

The other issue to consider is that ā can be constructed in two ways: as a single character (U+0101), or as two Unicode characters (a plain 'a' followed by the combining macron, U+0304). You're likely only ever going to get the former, but be aware that the latter is also possible.
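As a rough illustration of handling both spellings, the sketch below uses a hand-written pattern (an assumption for this example, not something from the question) that accepts a plain "a", the precomposed ā, or "a" followed by the combining macron. It relies on the /u modifier so that \x{...} refers to code points.

```php
<?php
// Hypothetical pattern: "m", then one of
//   ā as a single code point (U+0101), or
//   "a" optionally followed by the combining macron (U+0304),
// then "ori". /i makes it case-insensitive, /u enables UTF-8 mode.
$pattern = '/m(?:\x{0101}|a\x{0304}?)ori/iu';

$precomposed = "M\u{0101}ori History";   // ā as one code point
$decomposed  = "Ma\u{0304}ori History";  // "a" + combining macron
$plain       = "maori";                  // what a user is likely to type

$matchesPrecomposed = preg_match($pattern, $precomposed);
$matchesDecomposed  = preg_match($pattern, $decomposed);
$matchesPlain       = preg_match($pattern, $plain);
$matchesMoorings    = preg_match($pattern, 'Moorings');
```

Because the vowel slot only accepts forms of "a", titles such as "Moorings" or "Memorial" no longer match.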

The only language I know of that handles this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexps.

Jason Scheirer
preg_match('/\m(.{1,4})ori/i',$page_title) is better but still matches with words such as "Mooring", "Monitoring" and "Memorial". I think both you and Wrikken are right about multi-byte versus Unicode.
Tama
A: 
  1. Use the /u modifier for UTF-8 mode in regexes.
  2. You're better off on the whole doing an iconv('utf-8','ascii//TRANSLIT',$string) on both the name and the search term, and comparing those.
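The "fold both sides, then compare" idea from point 2 could be sketched as below. Note the swap: instead of iconv with //TRANSLIT, whose output depends on the iconv implementation and the current locale, this example uses an explicit lookup table for the Māori macron vowels. The function names fold_macrons and suggest_titles are hypothetical, and the sketch assumes precomposed UTF-8 input.

```php
<?php
// Hypothetical helper: map macron vowels to their plain ASCII vowels.
// The answer's iconv('utf-8', 'ascii//TRANSLIT', $s) aims at the same result,
// but its output varies between systems, so a fixed table is used here.
function fold_macrons(string $s): string
{
    return strtr($s, [
        "\u{0100}" => 'A', "\u{0101}" => 'a',
        "\u{0112}" => 'E', "\u{0113}" => 'e',
        "\u{012A}" => 'I', "\u{012B}" => 'i',
        "\u{014C}" => 'O', "\u{014D}" => 'o',
        "\u{016A}" => 'U', "\u{016B}" => 'u',
    ]);
}

// Hypothetical helper: return the titles whose folded form contains the
// folded search term, ignoring ASCII case.
function suggest_titles(array $titles, string $term): array
{
    $needle = fold_macrons($term);
    return array_values(array_filter(
        $titles,
        fn (string $title): bool => stripos(fold_macrons($title), $needle) !== false
    ));
}

$titles = ["M\u{0101}ori History", 'Moorings', 'Rubbish & Recycling'];
$hits = suggest_titles($titles, 'maori');
```

Since the suggestions come from a cached array, the folded form of each title could be computed once when the cache is built rather than on every keystroke.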
Wrikken
Thanks Wrikken - I'm going to have to take a good look at how PHP/regex copes with multi-byte versus Unicode.
Tama