ansaurus

Question

Answer 1

+2 A:

You could use the Normalizer class to convert the string to Normalization Form KD (NFKD), which decomposes characters. For example, Á (U+00C1) is decomposed into the letter A (U+0041) followed by the combining acute accent ́ (U+0301):

$str = Normalizer::normalize($str, Normalizer::FORM_KD);
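To see the decomposition concretely, here is a small sketch that prints the UTF-8 bytes of Á after normalization. It assumes the intl extension is loaded; the variable names are only illustrative.

```php
<?php
// Sketch only: assumes the intl extension (Normalizer class) is available.
$composed   = "\xC3\x81"; // Á (U+00C1) encoded as UTF-8
$decomposed = Normalizer::normalize($composed, Normalizer::FORM_KD);

// 41 is the letter A (U+0041), cc81 is the combining acute accent (U+0301).
echo bin2hex($decomposed), "\n"; // prints 41cc81
```

The single precomposed character becomes two code points, which is what lets a pattern treat the accent as an optional extra.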

Then you modify the search pattern to match those optional marks:

$pattern = '/('.preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')).')/ui';

The replacement is then done with preg_replace:

preg_replace($pattern, '<strong>$0</strong>', htmlspecialchars($str))

So the full method is:

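private static function highlightTerm($str, $term) {
    // Decompose accented characters into base letter + combining mark.
    $str = Normalizer::normalize($str, Normalizer::FORM_KD);
    // Allow one optional combining mark (\p{Mn}) after every letter of the term.
    $pattern = '/('.preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')).')/ui';
    // Escape the text for HTML, then wrap each match in <strong>.
    return preg_replace($pattern, '<strong>$0</strong>', htmlspecialchars($str));
}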
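One limitation of the method above is that only the subject string is decomposed, so a search term that itself contains accents (say 'málaga' against 'MALAGA') does not match, and a term such as 'amp' can match inside an entity produced by htmlspecialchars. Below is a minimal sketch of one way to handle both, again assuming the intl extension and valid UTF-8 input. The names stripMarks and highlightTermLoose are hypothetical and not part of the original answer.

```php
<?php
// Sketch only: assumes the intl extension and valid UTF-8 input.
// stripMarks() and highlightTermLoose() are hypothetical helper names.

function stripMarks($text) {
    // Decompose to NFKD, then drop every combining mark (category M).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
    return preg_replace('/\p{M}+/u', '', $decomposed);
}

function highlightTermLoose($str, $term) {
    $str  = Normalizer::normalize($str, Normalizer::FORM_KD);
    $base = stripMarks($term);
    if ($base === '') {
        return htmlspecialchars($str, ENT_QUOTES, 'UTF-8');
    }

    // Each base character of the term may be followed by any number of marks.
    $pattern = '';
    foreach (preg_split('//u', $base, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $pattern .= preg_quote($char, '/') . '\p{M}*';
    }

    // Split on the matches first and escape each piece separately, so the
    // term can never match inside an entity such as &amp;.
    $parts = preg_split('/(' . $pattern . ')/ui', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $i => $part) {
        $escaped = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        // With one capture group, odd indexes hold the matched text.
        $out .= ($i % 2 === 1) ? '<strong>' . $escaped . '</strong>' : $escaped;
    }

    // Recompose so the caller gets precomposed characters back.
    return Normalizer::normalize($out, Normalizer::FORM_C);
}
```

With this sketch, highlightTermLoose('Málaga', 'ALA') should give M<strong>ála</strong>ga, and an accented term also matches unaccented text. Note that NFKD is a compatibility decomposition, so characters such as ligatures do not come back in their original form after recomposition.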

Gumbo 2010-08-27 09:58:10

Nice approach, thank you for the idea. Unfortunately, its availability is **(PHP 5 >= 5.3.0, PECL intl >= 1.0.0)** and the production server runs PHP 5.2.14, so using it would mean installing the corresponding PECL extension :(

Álvaro G. Vicario 2010-08-27 10:04:31

Answer 2

A:

For the record, I put this together following advice from Ross McKay (the good ideas are his, the bad code is mine):

<?php

mb_internal_encoding('UTF-8');

$full_string = 'Málaga';
$match = 'ALA';

echo Foo::highlightTerm($full_string, $match);

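class Foo{
    public static function highlightTerm($full_string, $match){
        // Fold both strings to plain ASCII so that "á" and "a" compare equal.
        $full_string_ascii = preg_replace_callback('/[\w]+/ui', array('self', 'callbackHighlightTerm'), $full_string);
        $match_ascii = preg_replace_callback('/[\w]+/ui', array('self', 'callbackHighlightTerm'), $match);

        // Case-insensitive search on the ASCII versions.
        $start = stripos($full_string_ascii, $match_ascii);

        if($start===FALSE){
            // No match: still escape, so the output is always HTML-safe.
            return htmlspecialchars($full_string);
        }else{
            // The offset found in the ASCII string is reused on the original,
            // which assumes each character maps to exactly one ASCII character.
            $length = mb_strlen($match);

            return
                htmlspecialchars( mb_substr($full_string, 0, $start)) .
                '<strong>' . htmlspecialchars( mb_substr($full_string, $start, $length) ) . '</strong>' .
                htmlspecialchars( mb_substr($full_string, $start+$length) );
        }
    }


    // Transliterates a run of word characters to ASCII and drops any
    // non-word characters that iconv may emit in place of the accents.
    private static function callbackHighlightTerm($matches){
        return preg_replace('/[^\w]/i', '', iconv('UTF-8', 'ASCII//TRANSLIT', $matches[0]));
    }
}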

?>

The benefit is that it also works on older systems that lack the intl extension.

Álvaro G. Vicario 2010-09-01 07:22:38

Answer 3

A:

Use the PEAR package I18N_UnicodeNormalizer-1.0.0:

include('…');

echo preg_replace(
 '/(\P{L})/ui', // replace all except members of Unicode class "letters", case insensitive
 '', // with nothing → drop accents
 I18N_UnicodeNormalizer::toNFKD('ÅÉÏÔÙåéïôù') // ù → u + `
);

→ AEIOUaeiou
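Note that the \P{L} pattern in the snippet above removes everything that is not a letter, so spaces, digits and punctuation disappear along with the accents. If only the accents should go, one option is to strip just the nonspacing marks. The sketch below swaps the PEAR class for the intl Normalizer (so it needs PHP 5.3+ with intl, unlike the snippet above); stripAccents is a hypothetical name.

```php
<?php
// Sketch only: uses the intl Normalizer instead of the PEAR package.
// stripAccents() is a hypothetical helper name.

function stripAccents($text) {
    // Decompose, e.g. ù becomes u + combining grave accent (U+0300).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
    // Remove only nonspacing marks; letters, digits and spaces are kept.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

echo stripAccents('ÅÉÏÔÙ åéïôù 2010'), "\n"; // prints AEIOU aeiou 2010
```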

eleg 2010-10-18 22:38:14


Accent-insensitive substring matching
