Hello,

Our content editors have been writing in Word and pasting the text into our old Unicode system. I'm now trying to move everything to UTF-8.

However, upon importing the data there are characters I cannot get rid of.

I have tried the functions from the following snippet and Stack Overflow thread, and none of them fix this string: http://snipplr.com/view.php?codeview&id=11171 / http://stackoverflow.com/questions/1262038/how-to-replace-microsoft-encoded-quotes-in-php

String: Danâ??s back for more!!

Help?

+2  A: 

In this kind of situation, I generally start with the string exactly as copy-pasted from Word:

$str = 'Danâ’s back !';
var_dump($str);


Then, going through it byte by byte, I output the hexadecimal code of each byte:

for ($i=0 ; $i<strlen($str) ; $i++) {
    $byte = $str[$i];
    $char = ord($byte);
    printf('%s:0x%02x ', $byte, $char);
}

Which gives output like this:

D:0x44 a:0x61 n:0x6e �:0xc3 �:0xa2 �:0xe2 �:0x80 �:0x99 s:0x73 :0x20 b:0x62 a:0x61 c:0x63 k:0x6b :0x20 !:0x21 


Then, with a bit of guessing, luck, and trial and error, you'll find that:

  • â is a character encoded on two bytes: 0xc3 0xa2
  • and the special quote is a character encoded on three bytes: 0xe2 0x80 0x99

Hint: it's easier when you don't have two special characters following each other ;-)
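
To take some of the guessing out of this step, here is a minimal sketch that groups the bytes by character instead of printing them one at a time. It relies on the UTF-8 rule that a lead byte in the range 0xc2-0xdf starts a two-byte character and one in 0xe0-0xef starts a three-byte character, and it assumes the input is valid UTF-8. The name dump_utf8_chars is a hypothetical helper, not something from the code above.

```php
<?php
// Hypothetical helper: list each UTF-8 character next to the bytes that encode it.
function dump_utf8_chars(string $str): array
{
    // The /u modifier makes PCRE split on code points rather than bytes.
    // preg_split() returns false when $str is not valid UTF-8.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return [];
    }

    $out = [];
    foreach ($chars as $char) {
        // bin2hex() yields two hex digits per byte; chunk_split() puts a
        // space after each pair, and trim() drops the trailing one.
        $out[] = $char . ':' . trim(chunk_split(bin2hex($char), 2, ' '));
    }
    return $out;
}

// Same bytes as the problem part of the example string.
echo implode("\n", dump_utf8_chars("Dan\xc3\xa2\xe2\x80\x99s")), "\n";
```

With this, the â shows up on one line as "c3 a2" and the special quote on the next as "e2 80 99", even though they follow each other directly.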


After that, it's only a matter of using str_replace to replace the right sequence of bytes with another character; for example, to replace the special quote with a plain one:

var_dump(str_replace("\xe2\x80\x99", "'", $str));

Which gives you:

string 'Danâ's back !' (length=14)
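
If there are several such characters to clean up, the same idea can be sketched with one lookup table and a single strtr call instead of a chain of str_replace calls. The function name replace_smart_punctuation and the choice of ASCII replacements are assumptions for illustration; the byte sequences are the standard UTF-8 encodings of the listed code points.

```php
<?php
// Hypothetical helper: swap common Word "smart" punctuation for plain ASCII.
function replace_smart_punctuation(string $str): string
{
    // Keys are raw UTF-8 byte sequences, values are the ASCII stand-ins.
    $map = [
        "\xe2\x80\x98" => "'",   // U+2018 left single quote
        "\xe2\x80\x99" => "'",   // U+2019 right single quote
        "\xe2\x80\x9c" => '"',   // U+201C left double quote
        "\xe2\x80\x9d" => '"',   // U+201D right double quote
        "\xe2\x80\x93" => '-',   // U+2013 en dash
        "\xe2\x80\x94" => '-',   // U+2014 em dash
        "\xe2\x80\xa6" => '...', // U+2026 horizontal ellipsis
    ];

    // strtr() with an array does all replacements in one pass over $str.
    return strtr($str, $map);
}

var_dump(replace_smart_punctuation("Dan\xe2\x80\x99s back !"));
```

Note that this only touches the quote; a stray â left in front of it, as in the example string, still has to be removed separately once you are sure it is not a legitimate character.
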
Pascal MARTIN
Thank you! Going byte by byte I've managed to replace the following:

$str = str_replace("'", "'", $str);
$str = str_replace("\xc3\xa2\xc2\x80\xc2\x99", "'", $str);
$str = str_replace("\xc3\xa2\xc2\x80\xc2\x93", ' - ', $str);
$str = str_replace("\xc3\xa2\xc2\x80\xc2\x9d", '"', $str);
$str = str_replace("\xc3\xa2\x3f\x3f", "'", $str);
azz0r
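
A side note on sequences like "\xc3\xa2\xc2\x80\xc2\x99": they look like the UTF-8 bytes 0xe2 0x80 0x99 read as ISO-8859-1 and then encoded to UTF-8 a second time. If that is what happened during the import, the step can be reversed in one go rather than sequence by sequence. The sketch below is a guess at that situation, not a confirmed diagnosis; it assumes the mbstring extension is available, and undo_double_utf8 is a hypothetical name.

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 read as ISO-8859-1, re-encoded as UTF-8".
function undo_double_utf8(string $str): string
{
    // Map every code point back to the single ISO-8859-1 byte it came from.
    $decoded = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');

    // Re-encoding must give back the input exactly; otherwise some character
    // had no ISO-8859-1 equivalent and the string was not double-encoded.
    $roundTrips = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') === $str;

    // The recovered bytes must themselves be valid UTF-8.
    if ($roundTrips && mb_check_encoding($decoded, 'UTF-8')) {
        return $decoded;
    }
    return $str; // leave anything else untouched
}

var_dump(bin2hex(undo_double_utf8("\xc3\xa2\xc2\x80\xc2\x99")));
```

This cannot help with "\xc3\xa2\x3f\x3f": there the last two bytes have already been replaced by literal question marks, so the original character is lost and a plain str_replace like the one above is the only option.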