Hello,

I am scraping webpages (using PHP's cURL functions) that contain accented characters (like "é"). In the source of those webpages, the characters are written as raw UTF-8 bytes (they are not HTML-encoded as entities).

However, when I output the result fetched with the following code, I get question marks instead of the accented characters.

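// $website holds the URL of the page to scrape
$ch = curl_init();
$timeout = 5; // connection timeout in seconds
curl_setopt($ch, CURLOPT_URL, $website);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file = curl_exec($ch);
curl_close($ch);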

The headers returned by the scraped webpage say the Content-Type is "text/html", with no charset, so nothing indicates that the page is UTF-8 encoded. I've tried using the CURLOPT_HTTPHEADER option to change the text encoding, but that has no effect on the result.

What am I missing?

A:

Have a look at the answer to my own question: http://stackoverflow.com/questions/1277552/characters-changed-in-a-curl-request Dominic Rodger's reply there saved my day.
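In case it helps, here is a minimal sketch of one common way to handle this kind of problem. It is not taken from the linked answer. The helper name body_to_utf8 is made up for illustration, it assumes the mbstring extension is loaded, and the ISO-8859-1 fallback is only a guess for pages that declare no charset. The idea is to normalise whatever cURL fetched to UTF-8, and then to tell the browser that your own page is UTF-8 as well.

```php
<?php
// Hypothetical helper (not from the original post): convert a fetched body to UTF-8.
// $contentType is the value of the response's Content-Type header, or null if unknown.
function body_to_utf8(string $body, ?string $contentType): string
{
    if ($contentType !== null && preg_match('/charset=["\']?([\w-]+)/i', $contentType, $m)) {
        // The server declared a charset, so trust it.
        $charset = $m[1];
    } elseif (mb_check_encoding($body, 'UTF-8')) {
        // No declared charset, but the bytes are already valid UTF-8.
        return $body;
    } else {
        // Assumption: fall back to ISO-8859-1 when nothing better is known.
        $charset = 'ISO-8859-1';
    }

    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $body;
    }
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

// Possible usage with the cURL code from the question (call before curl_close):
//   $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
//   $file = body_to_utf8($file, $type ?: null);
//   header('Content-Type: text/html; charset=utf-8'); // declare UTF-8 for your own output
//   echo $file;
```

If the fetched bytes are already UTF-8, as described in the question, the function returns them unchanged, and the header() line on your own page is the part most likely to matter: a page served without a UTF-8 charset can show question marks even though the data is fine.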

Regards Fons

Fons