I am currently scraping some data from the internet and converting it into XML documents.

  • The document being scraped is UTF-8 according to its meta tags

The problem is that some of the data contains foreign characters, and I cannot find a way of reliably converting them into XML / UTF-8 friendly entities. The examples below show what I have found so far; ideally I would like a solution that works every time.

Example 1 works correctly, example 2 fails. My research fixed example 1, but it does not seem to be a blanket solution.

Côte d'Ivoire  
Côte d'Ivoire (correct)  

I managed to get the - ô - parsing correctly using the following function on my xpath.

$w->text(charset_decode_utf_8((string)$match->a));

function charset_decode_utf_8($string) {
    // Return early when the string has no high-bit bytes (0x80-0x9F or 0xA1-0xFF).
    if(@!ereg("[\200-\237]",$string) && @!ereg("[\241-\377]",$string)) {
        return $string;
    }
    // Three-byte UTF-8 sequences become decimal numeric entities.
    $string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e","'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",$string);
    // Two-byte UTF-8 sequences become decimal numeric entities.
    $string = preg_replace("/([\300-\337])([\200-\277])/e","'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",$string);
    return $string;
}

ÖFB Stiegl Cup  
ÖFB Stiegl Cup (wrong)  

Unfortunately, the - Ö - gets converted into a double entity. I have no idea how to make it convert to a proper HTML entity.

I have tried:

  • using iso-8859-1 encoding when creating my xml document
  • using htmlentities with utf-8 encoding

Any help would be greatly appreciated, as I am tearing my hair out trying to get things to save correctly.

+2  A: 

UTF-8 can store any character (proof: it already stores them in the web pages you are scraping), so why encode some of them as entities?
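As a rough illustration of writing the text as plain UTF-8 instead of entities, here is a minimal sketch using PHP's XMLWriter. The question's `$w->text(...)` call suggests XMLWriter, but that is an assumption; the element names and sample values are made up for the example.

```php
<?php
// Hypothetical sample values standing in for scraped text (assumed to be valid UTF-8).
$names = ["Côte d'Ivoire", "ÖFB Stiegl Cup"];

$w = new XMLWriter();
$w->openMemory();
$w->startDocument('1.0', 'UTF-8');   // declare UTF-8 in the XML prolog
$w->startElement('competitions');    // element names are illustrative only
foreach ($names as $name) {
    $w->startElement('name');
    // text() escapes XML markup characters such as & and <;
    // the non-ASCII characters are written as UTF-8 bytes, not as entities.
    $w->text($name);
    $w->endElement();
}
$w->endElement();
$w->endDocument();

$xml = $w->outputMemory();
echo $xml;
```

No manual entity conversion is involved here: the string goes in as UTF-8 and the document is labeled as UTF-8.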

If you are opening the XML documents and see encoding problems, check your editor's settings: does it try to read the document as UTF-8? (Some editors don't by default. If you open a document from your hard disk in a browser, it might fail to recognize it as UTF-8, because there is no server to send a header saying it is UTF-8.)
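One way to tell "the file is fine, the viewer misreads it" apart from "the bytes really are damaged" is to look at the bytes directly. The sketch below is only an illustration: `is_valid_utf8` is a hypothetical helper name, and the sample strings are hand-written rather than taken from the question's data.

```php
<?php
// Hypothetical helper: true when $bytes is well-formed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    // With the /u modifier PCRE validates the subject as UTF-8;
    // preg_match() returns false when the subject is malformed.
    return preg_match('//u', $bytes) === 1;
}

$good   = "\xC3\x96FB Stiegl Cup";         // "Ö" as UTF-8 (bytes C3 96)
$latin1 = "\xD6FB Stiegl Cup";             // "Ö" as a single ISO-8859-1 byte
$double = "\xC3\x83\xC2\x96FB Stiegl Cup"; // UTF-8 bytes re-encoded as if they were Latin-1

var_dump(is_valid_utf8($good));   // bool(true)
var_dump(is_valid_utf8($latin1)); // bool(false)
var_dump(is_valid_utf8($double)); // bool(true): double encoding is still valid UTF-8

// A hex dump shows the difference that a validity check cannot.
echo bin2hex(substr($good, 0, 2)), "\n";   // c396
echo bin2hex(substr($double, 0, 4)), "\n"; // c383c296
```

If the hex dump shows `c396` for the Ö, the data is correct UTF-8 and the display problem is in whatever is reading the file.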

If that is not the problem, can you upload an example of a problematic XML document somewhere?

Pascal MARTIN
omg!!! i was just preparing to upload the documents and found that i was using tidy, and that was what was destroying the characters. as you and others have rightly mentioned, it passes through correctly now. what a plonker :( thank you very much for your help.
esryl
no problem :-) and thanks for indicating what the problem was (could be useful to someone else ;-) )
Pascal MARTIN
A: 

Don't bother with entity encoding. Use CDATA blocks instead.

PHP's strings don't understand UTF-8. PHP treats them as a byte stream, and it is best to treat them that way yourself. You're shuttling bytes around, and all you need to do is make sure they don't get parsed and that they're labeled correctly.
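A small sketch of that idea, assuming XMLWriter is the writer in use (the question does not say for certain) and using a made-up element name: the bytes go into a CDATA section untouched, and the prolog labels them as UTF-8.

```php
<?php
$name = "\xC3\x96FB Stiegl Cup"; // "ÖFB Stiegl Cup" as UTF-8 bytes

// strlen() counts bytes, not characters: "Ö" occupies two bytes in UTF-8.
$byteLength = strlen("\xC3\x96");

$w = new XMLWriter();
$w->openMemory();
$w->startDocument('1.0', 'UTF-8'); // label the byte stream as UTF-8
$w->startElement('name');          // illustrative element name
$w->writeCdata($name);             // wraps the bytes in <![CDATA[ ... ]]>
$w->endElement();
$w->endDocument();

$xml = $w->outputMemory();
echo $xml;
```

One caveat: a CDATA section cannot contain the sequence `]]>`, so text that might include it needs to be split or written with `text()` instead.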

Joeri Sebrechts