I am currently scraping some data from the internet and converting it into XML documents.
- The document being scraped is UTF-8 according to its meta tags.
The problem is that some of the data contains non-ASCII characters, and I cannot find a reliable way to convert them into XML / UTF-8 friendly entities. The examples below are the cases I have found so far by reading through the output; ideally I would like a solution that works every time.
Example 1 works correctly, example 2 fails. My research fixed example 1, but it does not seem to be a blanket solution.
Côte d'Ivoire Côte d'Ivoire (correct)
I managed to get the ô to parse correctly by applying the following function to my XPath result:
$w->text(charset_decode_utf_8((string)$match->a));
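function charset_decode_utf_8($string) {
    // Return the string unchanged when it contains no bytes in 0x80-0x9F or 0xA1-0xFF.
    // Note: ereg() and the preg_replace() /e modifier were removed in PHP 7.
    if(@!ereg("[\200-\237]",$string) && @!ereg("[\241-\377]",$string)) {
        return $string;
    }
    // Three-byte UTF-8 sequences (lead byte 0xE0-0xEF) become decimal numeric entities.
    $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e","'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",$string);
    // Two-byte UTF-8 sequences (lead byte 0xC0-0xDF) become decimal numeric entities.
    $string = preg_replace("/([\300-\337])([\200-\277])/e","'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",$string);
    return $string;
}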
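For comparison, here is a minimal sketch of the same byte-to-entity conversion written without `ereg()` or the `/e` modifier. It assumes the mbstring extension is available and that the input really is valid UTF-8. The helper name `utf8_to_numeric_entities` is hypothetical and not part of my original code.

```php
<?php
// Hypothetical helper (not from the original code): encode every non-ASCII
// character of a UTF-8 string as a decimal numeric entity.
// Assumes the mbstring extension is loaded and the input is valid UTF-8.
function utf8_to_numeric_entities(string $string): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    // This covers everything from U+0080 up to U+10FFFF.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($string, $map, 'UTF-8');
}

// Byte escapes are used so the example does not depend on the file's encoding.
echo utf8_to_numeric_entities("C\xC3\xB4te d'Ivoire"), "\n";  // C&#244;te d'Ivoire
echo utf8_to_numeric_entities("\xC3\x96FB Stiegl Cup"), "\n"; // &#214;FB Stiegl Cup
```

ASCII characters pass through unchanged, so XML-special characters such as `&` and `<` would still need separate escaping.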
ÖFB Stiegl Cup ÖFB Stiegl Cup (wrong)
Unfortunately, the Ö gets converted into a double entity. I have no idea how to make it convert to a proper HTML entity.
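To illustrate what I mean by a double entity, here is a small sketch of one way it can arise. This is only an assumption about the cause, not something I have confirmed in my own pipeline: if the two UTF-8 bytes of Ö are treated as ISO-8859-1 at some step, each byte is encoded as its own character. The sketch assumes the mbstring extension is available.

```php
<?php
// Assumes the mbstring extension is loaded.
// Conversion map: [first code point, last code point, offset, mask].
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// "Ö" (U+00D6) as UTF-8: one character stored as two bytes, 0xC3 0x96.
$o_umlaut = "\xC3\x96";

// Declared as UTF-8: the two bytes form one character, so one entity.
$as_utf8 = mb_encode_numericentity($o_umlaut, $map, 'UTF-8');

// Declared as ISO-8859-1: each byte counts as its own character, so two entities.
$as_latin1 = mb_encode_numericentity($o_umlaut, $map, 'ISO-8859-1');

echo $as_utf8, "\n";   // &#214;
echo $as_latin1, "\n"; // &#195;&#150;
```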
I have tried:
- using ISO-8859-1 encoding when creating my XML document
- using htmlentities with UTF-8 encoding
Any help would be greatly appreciated, as I am tearing my hair out trying to get things to save correctly.