ansaurus

Question

PHP Multi Byte str_replace?

Answer 1

+2 A:

Looks like the string was not replaced because your input encoding and the file encoding do not match.

phsiao 2009-09-20 14:45:07

Aye, a UTF-8 file run on the CLI with its output written to a text file (don't output to an ISO terminal) works.

OIS 2009-09-20 14:52:54

So how can I change my Input encoding then?

Ian 2009-09-20 14:58:49

I checked my text editor, its file encoding is set to UTF-8.

Ian 2009-09-20 15:56:16

If you set $str = "Ørjan Nilsen" at the beginning and print $str out at the end, does it give you the right answer? If you read from the CLI to initialize $str, it may not be set with the proper encoding.

phsiao 2009-09-20 16:17:52
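
The encoding mismatch discussed in this answer can be checked directly. The following is a minimal sketch, assuming the mbstring extension is available and that the mismatched input arrives as ISO-8859-1; the helper name ensure_utf8 and that source encoding are assumptions for illustration, not something stated in the thread.

```php
<?php
// Hypothetical helper: return $str unchanged if it is already valid UTF-8,
// otherwise convert it from an assumed source encoding.
function ensure_utf8($str, $assumedSource = 'ISO-8859-1') {
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    return mb_convert_encoding($str, 'UTF-8', $assumedSource);
}

// Byte escapes are used so the example does not depend on how this file is saved.
$utf8   = "\xC3\x98rjan Nilsen"; // "Ørjan Nilsen" encoded as UTF-8
$latin1 = "\xD8rjan Nilsen";     // the same name encoded as ISO-8859-1

var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false): not valid UTF-8
var_dump(ensure_utf8($latin1) === $utf8);      // bool(true) after conversion
var_dump(ensure_utf8($utf8) === $utf8);        // bool(true): already UTF-8, untouched
```

If the search string and the subject are both passed through a check like this first, a byte-wise comparison sees the same bytes for the same characters.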

Answer 2

A:

Try this function definition:

if (!function_exists('mb_str_replace')) {
    function mb_str_replace($search, $replace, $subject) {
        // Recurse into array subjects, passing $search through unchanged
        // (casting it to string would turn an array into "Array").
        if (is_array($subject)) {
            foreach ($subject as $key => $val) {
                $subject[$key] = mb_str_replace($search, $replace, $val);
            }
            return $subject;
        }
        // Match each search string as a whole (alternation) in UTF-8 mode;
        // a character class would only work for single-character searches.
        $quoted = array();
        foreach ((array)$search as $s) {
            $quoted[] = preg_quote($s, '/');
        }
        $pattern = '/'.implode('|', $quoted).'/u';
        if (is_array($search) && is_array($replace)) {
            $len = min(count($search), count($replace));
            $table = array_combine(array_slice($search, 0, $len), array_slice($replace, 0, $len));
            // A closure is used here because create_function() was removed in PHP 8.0.
            $f = function ($match) use ($table) {
                return array_key_exists($match[0], $table) ? $table[$match[0]] : $match[0];
            };
            return preg_replace_callback($pattern, $f, $subject);
        }
        return preg_replace($pattern, (string)$replace, $subject);
    }
}

Gumbo 2009-09-20 15:01:08
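
The core technique in the function above, a regex in UTF-8 mode combined with a lookup table inside a callback, can be sketched in isolation. The replacement table and the sample name below are made up for illustration, and the sketch assumes PHP 5.3 or later (for closures) and a source file saved as UTF-8.

```php
<?php
// Hypothetical replacement table, for illustration only.
$table = array('Ø' => 'O', 'ø' => 'o', 'é' => 'e');

// Quote each search string and join them into one alternation.
$quoted = array_map(function ($s) {
    return preg_quote($s, '/');
}, array_keys($table));

// The /u modifier makes PCRE treat pattern and subject as UTF-8.
$pattern = '/' . implode('|', $quoted) . '/u';

// Look up every match in the table and substitute its replacement.
$result = preg_replace_callback($pattern, function ($m) use ($table) {
    return $table[$m[0]];
}, 'Ørjan Nilsén');

echo $result, "\n"; // Orjan Nilsen
```

Because every match is one of the table's keys, the callback needs no fallback branch here; a general-purpose function would still want one.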

Answer 3

+1 A:

It's possible to remove diacritics using Unicode normalization form D (NFD) and Unicode character properties.

NFD converts something like the "ü" umlaut from "LATIN SMALL LETTER U WITH DIAERESIS" (which is a letter) to "LATIN SMALL LETTER U" (letter) and "COMBINING DIAERESIS" (not a letter).

header('Content-Type: text/plain; charset=utf-8');

$test = implode('', array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'));

$test = Normalizer::normalize($test, Normalizer::FORM_D);

// Remove everything that's not a "letter" or a space (e.g. diacritics)
// (see http://de2.php.net/manual/en/regexp.reference.unicode.php)
$pattern = '/[^\pL ]/u';

echo preg_replace($pattern, '', $test);

Output:

aaaaªaaAAAAAeeeeEEEEiiiiIIIIœooooºøØOOOOuuuUUUcCNn
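
The decomposition step can also be inspected at the byte level. This is a small sketch, assuming the intl extension is loaded; the helper name strip_marks is hypothetical. It also shows a narrower pattern that removes only nonspacing marks (\p{Mn}), so digits and punctuation survive, which the letters-and-spaces pattern above would strip.

```php
<?php
// Requires the intl extension for the Normalizer class.
$composed   = "\xC3\xBC"; // "ü" as UTF-8 (U+00FC)
$decomposed = Normalizer::normalize($composed, Normalizer::FORM_D);

echo bin2hex($composed), "\n";   // c3bc
echo bin2hex($decomposed), "\n"; // 75cc88: "u" followed by U+0308 COMBINING DIAERESIS

// Hypothetical helper: decompose, then drop only the combining marks.
function strip_marks($str) {
    $nfd = Normalizer::normalize($str, Normalizer::FORM_D);
    return preg_replace('/\p{Mn}+/u', '', $nfd);
}

// "Crème brûlée, 2 pcs." written with byte escapes to avoid file-encoding issues.
echo strip_marks("Cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e, 2 pcs."), "\n"; // Creme brulee, 2 pcs.
```

Characters without a canonical decomposition, such as "ø" or "œ", are left as they are by either pattern, which matches the output shown above.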

The Normalizer class is part of the intl extension (available from PECL and bundled with PHP as of 5.3). (The algorithm itself isn't very complicated but needs to load a lot of character mappings afaik. I wrote a PHP implementation a while ago.)

(I'm adding this two months late because I think it's a nice technique that's not known widely enough.)

Marc Ermshaus 2009-11-18 22:30:41

Thanks, that's actually pretty useful. Though I don't really want to use that in this instance because it results in the loss of accents.

Ian 2009-11-20 16:10:49

I thought that getting rid of accents was what you were trying to do?

Marc Ermshaus 2009-11-25 15:59:31
