I have received a database full of people's names and other data in French, which means it uses characters such as é, è, ö, û, etc. There are around 3,000 entries.
Apparently, the data inside has sometimes been encoded with utf8_encode() and sometimes not. This results in messed-up output: in some places the characters show up fine, in others they don't.
At first I tried to track down every place in the UI where these issues arise and apply utf8_decode() where necessary, but that's really not a practical solution.
I did some testing and there is no reason to use utf8_encode() in the first place, so I'd rather remove all of that and just work in UTF-8 everywhere - at the browser, middleware and database levels. So I need to clean the database, replacing all mis-encoded data with its cleaned-up version.
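For what it's worth, once I have a reliable conversion function (which is what this question is about), the cleanup itself would just be a one-off script along these lines. This is only a sketch: the table and column names ("people", "id", "name") are made up, I'm assuming a MySQL database accessed through PDO, and fix_double_encoding() stands for the function I'm looking for:

    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');
    $pdo->exec('SET NAMES utf8'); // talk to MySQL in UTF-8

    $rows   = $pdo->query('SELECT id, name FROM people')->fetchAll(PDO::FETCH_ASSOC);
    $update = $pdo->prepare('UPDATE people SET name = :name WHERE id = :id');

    foreach ($rows as $row) {
        $fixed = fix_double_encoding($row['name']); // the function I'm asking about
        if ($fixed !== $row['name']) {
            $update->execute(array('name' => $fixed, 'id' => $row['id']));
        }
    }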
Question: would it be possible to write a function in PHP that checks whether a UTF-8 string is correctly encoded (stored without utf8_encode()) or not (stored with utf8_encode()), and, if it was double-encoded, converts it back to its original state?
In other words: I would like to know how I could tell UTF-8 content that has been utf8_encode()d apart from UTF-8 content that has not been utf8_encode()d.
**UPDATE: EXAMPLE**
Here is a good example: take a string full of special characters, then take a copy of that string and utf8_encode() it. The function I'm dreaming of takes both strings, leaves the first one untouched, and turns the second one into the same string as the first.
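To put that in code, the contract I'm after looks like this; fix_double_encoding() is just a placeholder name for the function I'm looking for:

    $good = "éèöûêïà";          // correctly stored UTF-8
    $bad  = utf8_encode($good); // the same text, double-encoded

    var_dump(fix_double_encoding($good) === $good); // should print bool(true): left untouched
    var_dump(fix_double_encoding($bad)  === $good); // should print bool(true): converted back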
I tried this:
    $loc_fr = setlocale(LC_ALL, 'fr_BE.UTF8', 'fr_BE@euro', 'fr_BE', 'fr', 'fra', 'fr_FR');

    $str1 = "éèöûêïà ";
    $str2 = utf8_encode($str1);

    function convert_charset($str) {
        $charset = mb_detect_encoding($str);
        if ($charset == "UTF-8") {
            return utf8_decode($str);
        }
        else {
            return $str;
        }
    }

    function correctString($str) {
        echo "\nbefore: $str";
        $str = convert_charset($str);
        echo "\nafter: $str";
    }

    correctString($str1);
    echo('<hr/>'."\n");
    correctString($str2);
And that gives me:
before: éèöûêïà after: �������
before: éèöûêïà after: éèöûêïà
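One idea I'm toying with, though I have no idea how robust it is: a string that was utf8_encode()d on top of UTF-8 should still look like valid UTF-8 after a single utf8_decode(), whereas a correctly encoded string with accented characters generally shouldn't. Something like this sketch:

    function fix_double_encoding($str) {
        $decoded = utf8_decode($str);
        // if the string still validates as UTF-8 after one utf8_decode(),
        // it was presumably utf8_encode()d on top of UTF-8, so keep the decoded version
        if (mb_check_encoding($decoded, 'UTF-8')) {
            return $decoded;
        }
        // otherwise it looks like it was already correct UTF-8: leave it alone
        return $str;
    }

Would that be reliable for data like mine, or are there cases where it misfires (plain ASCII values, strings mixing both encodings, etc.)?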
Thanks,
Alex