I have a search feature that fetches data from an InnoDB table (utf8_spanish_ci collation) and displays it in an HTML document (UTF-8 charset). The user types a substring and obtains a list of matches where the first occurrence of the substring is highlighted, e.g.:

Matches for "AL":

Álava
<strong>Al</strong>bacete
<strong>Al</strong>mería
Ciudad Re<strong>al</strong>
Málaga

As you can see from the example, the search ignores both case and accent differences (MySQL takes care of it automatically). However, the code I'm using to highlight matches fails to do the latter:

<?php

private static function highlightTerm($full_string, $match){
    $start = mb_stripos($full_string, $match);
    $length = mb_strlen($match);

    return
        htmlspecialchars( mb_substr($full_string, 0, $start)) .
        '<strong>' . htmlspecialchars( mb_substr($full_string, $start, $length) ) . '</strong>' .
        htmlspecialchars( mb_substr($full_string, $start+$length) );
}

?>
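
To make the failure concrete, here is a minimal check, assuming the mbstring extension is loaded and strings are UTF-8. mb_stripos() folds case but treats accented and unaccented letters as different characters, so it finds nothing in rows that MySQL matched through the collation:

```php
<?php
mb_internal_encoding('UTF-8');

// Case differences are folded: "AL" is found at character position 0.
$plain = mb_stripos('Albacete', 'AL');

// Accent differences are not: "Á" (U+00C1) is never equal to "A", so the search fails.
$accented = mb_stripos("\xC3\x81lava", 'AL');   // "\xC3\x81" is "Á" encoded as UTF-8

var_dump($plain);     // int(0)
var_dump($accented);  // bool(false)
```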

Is there a sensible way to fix this that doesn't involve hard-coding all possible variations?

Update: System specs are PHP/5.2.14 and MySQL/5.1.48

A: 

You could use the Normalizer class to convert the string to Normalization Form KD (NFKD), in which characters are decomposed: Á (U+00C1) becomes the letter A (U+0041) followed by the combining acute accent ́ (U+0301):

$str = Normalizer::normalize($str, Normalizer::FORM_KD);

Then you modify the search pattern to match those optional marks:

$pattern = '/('.preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')).')/ui';

The replacement is then done with preg_replace:

preg_replace($pattern, '<strong>$0</strong>', htmlspecialchars($str))

So the full method is:

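private static function highlightTerm($str, $term) {
    // Decompose accented letters into base letter + combining mark (needs the intl extension)
    $str = Normalizer::normalize($str, Normalizer::FORM_KD);
    // Let each letter of the term be followed by an optional combining mark;
    // $term itself is not normalized, so precomposed accented letters in it will not match
    $pattern = '/('.preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')).')/ui';
    // Wrap every match in <strong>; the text is escaped first, so matching runs on the escaped HTML
    return preg_replace($pattern, '<strong>$0</strong>', htmlspecialchars($str));
}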
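
A variation on the method above, offered only as a sketch: it assumes the intl extension (PHP 5.3+) and valid UTF-8 input. The function names highlightFirstMatch() and escapeFragment() are hypothetical. It differs from the method above in four ways: it normalizes the term as well, matches against the raw text before escaping (so a term such as "amp" cannot hit the inside of "&amp;"), highlights only the first occurrence as in the question, and uses NFD rather than NFKD so that recomposing each fragment to NFC returns canonically equivalent text.

```php
<?php
// Hypothetical helper: recompose a text fragment to NFC and escape it for HTML.
function escapeFragment($text)
{
    return htmlspecialchars(Normalizer::normalize($text, Normalizer::FORM_C), ENT_QUOTES, 'UTF-8');
}

// Hypothetical function name; requires the intl extension (Normalizer class).
function highlightFirstMatch($str, $term)
{
    // Decompose both inputs so that accented letters become letter + combining marks.
    $str  = Normalizer::normalize($str, Normalizer::FORM_D);
    $term = Normalizer::normalize($term, Normalizer::FORM_D);

    // Drop the marks from the term so that "Á" and "A" build the same pattern.
    $term = preg_replace('/\p{Mn}+/u', '', $term);
    if ($term === '' || $term === null) {
        return escapeFragment($str);
    }

    // Allow any number of combining marks after each letter of the term.
    $pattern = '/' . preg_replace('/\p{L}/u', '$0\p{Mn}*', preg_quote($term, '/')) . '/ui';

    // Match on the unescaped text; PREG_OFFSET_CAPTURE reports byte offsets.
    if (!preg_match($pattern, $str, $m, PREG_OFFSET_CAPTURE)) {
        return escapeFragment($str);
    }
    $start  = $m[0][1];
    $length = strlen($m[0][0]);

    // Byte offsets pair with substr(); each fragment is recomposed and escaped on its own.
    return escapeFragment(substr($str, 0, $start))
        . '<strong>' . escapeFragment($m[0][0]) . '</strong>'
        . escapeFragment(substr($str, $start + $length));
}
```

Called as highlightFirstMatch('Málaga', 'ALA'), this sketch wraps the "ála" part in <strong>, and rows without a match come back escaped but otherwise unchanged.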
Gumbo
Nice approach, thank you for the idea. Unfortunately, its availability is (PHP 5 >= 5.3.0, PECL intl >= 1.0.0) and the production server runs PHP/5.2.14, so using this would mean installing the corresponding PECL library :(
Álvaro G. Vicario
A: 

For the record, I composed this following advice by Ross McKay (good ideas are his, bad code is mine):

<?php

mb_internal_encoding('UTF-8');

$full_string = 'Málaga';
$match = 'ALA';

echo Foo::highlightTerm($full_string, $match);

class Foo{
    public static function highlightTerm($full_string, $match){
        $full_string_ascii = preg_replace_callback('/[\w]+/ui', array('self', 'callbackHighlightTerm'), $full_string);
        $match_ascii = preg_replace_callback('/[\w]+/ui', array('self', 'callbackHighlightTerm'), $match);

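        // Multibyte-aware search: mb_substr() below expects character offsets, not byte offsets
        $start = mb_stripos($full_string_ascii, $match_ascii);

        if($start===FALSE){
            // Escape the unmatched string too, as the highlighted branch does
            return htmlspecialchars($full_string);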
        }else{
            $length = mb_strlen($match);

            return
                htmlspecialchars( mb_substr($full_string, 0, $start)) .
                '<strong>' . htmlspecialchars( mb_substr($full_string, $start, $length) ) . '</strong>' .
                htmlspecialchars( mb_substr($full_string, $start+$length) );
        }
    }


    private static function callbackHighlightTerm($matches){
        return preg_replace('/[^\w]/i', '', iconv('UTF-8', 'ASCII//TRANSLIT', $matches[0]));
    }
}

?> 

The benefit is that it works on older systems.
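
One caveat with the code above is that the offset is found in the transliterated copy but applied to the original string, which only lines up if every character transliterates to exactly one character. The following sketch, still limited to functions available in PHP 5.2, folds one character at a time so that positions stay aligned. The names foldChar(), foldString() and highlightAligned() are hypothetical, and the result of iconv's //TRANSLIT depends on the platform and locale, so accented letters are folded only where iconv actually produces a letter.

```php
<?php
mb_internal_encoding('UTF-8');

// Hypothetical helper: map one UTF-8 character to exactly one character.
function foldChar($char)
{
    // //TRANSLIT output is platform-dependent, so any failure simply means "no folding".
    $ascii = function_exists('iconv') ? @iconv('UTF-8', 'ASCII//TRANSLIT', $char) : false;
    $ascii = ($ascii === false) ? '' : preg_replace('/[^A-Za-z0-9]/', '', $ascii);
    // Keep the first letter or digit, or fall back to the original character.
    return ($ascii === '') ? $char : $ascii[0];
}

// Hypothetical helper: fold a whole string one character at a time.
function foldString($text)
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return ($chars === false) ? $text : implode('', array_map('foldChar', $chars));
}

// Hypothetical name; same inputs and output as highlightTerm() above.
function highlightAligned($full_string, $match)
{
    // Both folded strings keep one character per original character,
    // so the position found here is valid in $full_string as well.
    $start = ($match === '') ? false : mb_stripos(foldString($full_string), foldString($match));
    if ($start === false) {
        return htmlspecialchars($full_string);
    }
    $length = mb_strlen($match);

    return htmlspecialchars(mb_substr($full_string, 0, $start))
        . '<strong>' . htmlspecialchars(mb_substr($full_string, $start, $length)) . '</strong>'
        . htmlspecialchars(mb_substr($full_string, $start + $length));
}
```

The trade-off is that multi-character transliterations are truncated to their first letter, which is lossy but keeps the highlight positions consistent.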

Álvaro G. Vicario
A: 

Use PEAR's I18N_UnicodeNormalizer-1.0.0:

include('…');

echo preg_replace(
 '/(\P{L})/ui', // replace all except members of Unicode class "letters", case insensitive
 '', // with nothing → drop accents
 I18N_UnicodeNormalizer::toNFKD('ÅÉÏÔÙåéïôù') // ù → u + `
);

→ AEIOUaeiou

eleg