I'm having a problem where PHP (5.2) cannot find the character 'Â' in a string, though it is clearly there.

I realize the underlying problem has to do with character encoding, but unfortunately I have no control over the source content. I receive it as UTF-8, with those characters already in the string.

I would simply like to remove it from the string, but strpos(), str_replace(), preg_replace(), trim(), etc. cannot correctly identify it.

My string is this:

"Â  Â  Â  A lot of couples throughout the World "

If I do this:

$string = str_replace('Â','',$string);

I get this:

"� � � A lot of couples throughout the World"

I even tried utf8_encode() and utf8_decode() before the str_replace, with no luck.

What's the solution? I've been throwing everything I can find at it...

A: 

I use this:

function replaceSpecial($str){
    $chunked = str_split($str, 1);
    $str = "";
    foreach ($chunked as $chunk) {
        $num = ord($chunk);
        // Remove non-ascii & non html characters
        if ($num >= 32 && $num <= 123) {
            $str .= $chunk;
        }
    }
    return $str;
}
akellehe
You can expand this to allow all ascii characters by changing 32 to 0 and 123 to 255.
akellehe
This will remove MANY more characters than just accents.
shamittomar
right, all non-pretty, non-ascii characters
akellehe
First off, the only ASCII overlap is between 0 and 127. If you allow character 128 or higher, you'll break the encoding (this is due to the multi-byte nature of UTF-8). However, this is quite a dirty method of doing that. What I would do if I were you is simply use the [`iconv`](http://us3.php.net/manual/en/book.iconv.php) function if you need to convert to ASCII... `$str = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string)`, especially since it'll transliterate characters for you...
ircmaxell
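A minimal sketch of the point about bytes 0 to 127, assuming the input is UTF-8: every byte of a multi-byte UTF-8 character is 128 or higher, so dropping all such bytes leaves plain ASCII without splitting a character. The helper name `stripNonAscii` and the sample input are hypothetical. The `iconv` call is the one quoted in the comment; its transliteration result depends on the platform's iconv implementation, so no output is shown for it.

```php
<?php
// Hypothetical helper: drop every byte outside the ASCII range 0-127.
// Without the /u modifier, preg_replace() matches individual bytes.
function stripNonAscii($str) {
    return preg_replace('/[\x80-\xFF]/', '', $str);
}

// "\xC2\xA0" is a UTF-8 non-breaking space, written as hex escapes so the
// example does not depend on how this file is saved.
$input = "\xC2\xA0 A lot of couples";
$clean = stripNonAscii($input);   // " A lot of couples"

// Alternative from the comment above; needs the iconv extension.
if (function_exists('iconv')) {
    $translit = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
}
```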
+1up. Thanks for the tip :)
akellehe
Ahh.. I think I understand the solution, but I'm still not clear why PHP doesn't recognize the characters? I think I'll use something like this, but only strip a few specific chars. Thanks!
Travis
+3  A: 
$string = str_replace('Â','',$string);

How is this 'Â' encoded? If your script file is saved as ISO-8859-1, the string 'Â' is encoded as the single byte xC2, while its UTF-8 representation is the two-byte sequence xC3 x82. PHP's str_replace() works at the byte level, i.e. it only "knows" single-byte characters, so the needle and the subject must use the same encoding for the bytes to match.

see http://docs.php.net/intro.mbstring

VolkerK
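The byte-level difference described above can be sketched as follows. The sample string is assumed for illustration, and the characters are written as hex escapes so the result does not depend on how the script file is saved.

```php
<?php
// Hex escapes keep the example independent of the script file's encoding.
$latin1 = "\xC2";      // 'Â' in ISO-8859-1: one byte
$utf8   = "\xC3\x82";  // 'Â' in UTF-8: two bytes

$string = $utf8 . " A lot of couples";

// The ISO-8859-1 byte does not occur in the UTF-8 string: no match
$foundLatin1 = strpos($string, $latin1);   // false
// The UTF-8 byte sequence matches at offset 0
$foundUtf8 = strpos($string, $utf8);       // 0

echo bin2hex($latin1), "\n";  // c2
echo bin2hex($utf8), "\n";    // c382
```

Because strpos() can return 0 for a match at the start, compare its result with `=== false` rather than treating it as a boolean.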
+1, you can therefore write the replace as: `str_replace(chr(195) . chr(130), '', $string)`... (where `195` and `130` are `xC3` and `x82` converted from hex to decimal, respectively)... Or, since PHP supports hex numbers: `str_replace(chr(0xC3) . chr(0x82), '', $string)`...
ircmaxell
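A short sketch of that replacement, assuming the input really contains the UTF-8 bytes for 'Â' (the sample string is made up for illustration):

```php
<?php
// Build the two-byte UTF-8 sequence for 'Â' from byte values.
$needle = chr(0xC3) . chr(0x82);   // same bytes as chr(195) . chr(130)

// Sample input assumed for illustration: UTF-8 'Â' followed by text.
$string = "\xC3\x82 A lot of couples";

// str_replace() compares bytes, so the needle now matches exactly.
$result = str_replace($needle, '', $string);
// $result is " A lot of couples"
```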
I also found that mb_ereg_replace() didn't seem to work properly; isn't this its purpose? Your information is extremely useful and I'll be sure to read the documentation you linked. Thanks!
Travis
@Travis: The parameters you pass to the mbstring functions have to be encoded properly as well. If you have a string literal in your script (like 'Â') then the encoding depends on how you've saved the script file.
VolkerK
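One way to act on this, sketched under the assumption that the mbstring extension is loaded and that the script file is saved as ISO-8859-1: convert the literal to UTF-8 before passing it to the string functions. The variable names and sample input are illustrative only.

```php
<?php
// Assumes the mbstring extension is loaded.
// Simulate a literal 'Â' from a script file saved as ISO-8859-1.
$literal = "\xC2";

// Convert the needle to UTF-8 so it matches the encoding of the data.
$needle = mb_convert_encoding($literal, 'UTF-8', 'ISO-8859-1'); // "\xC3\x82"

// Sample UTF-8 input, assumed for illustration.
$string = "\xC3\x82 A lot of couples";

$pos    = mb_strpos($string, $needle, 0, 'UTF-8');   // 0
$result = str_replace($needle, '', $string);         // " A lot of couples"
```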