Hello,
I am scraping webpages (using PHP's cURL) that contain accented characters (like "é"). In the source of those webpages, the characters are written as raw UTF-8 (they are not HTML-encoded).
However, when I fetch a page with the following code and output the result, I get question marks instead of the accented characters:
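$ch = curl_init();
$timeout = 5; // connection timeout, in seconds
curl_setopt($ch, CURLOPT_URL, $website); // $website holds the URL to scrape
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file = curl_exec($ch); // raw response bytes, or false on failure
curl_close($ch);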
The headers returned by the scraped webpage give the Content-Type as "text/html" with no charset, so nothing indicates that the page is UTF-8 encoded. I've tried using the CURLOPT_HTTPHEADER option to change the text encoding, but that has no effect.
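For reference, here is a minimal sketch of how the declared charset could be checked and the body normalised to UTF-8. It is only an illustration, not something I know to be the fix: charsetFromContentType() and toUtf8() are hypothetical helper names (they are not part of cURL), the ISO-8859-1 fallback is an assumption, and the code assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not part of cURL): pull the charset parameter out of a
// Content-Type value such as "text/html; charset=ISO-8859-1".
function charsetFromContentType(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null; // no charset declared, as with the page described above
}

// Hypothetical helper: return $body as UTF-8, converting only when needed.
// $fallback is an assumed encoding for pages that declare nothing.
function toUtf8(string $body, ?string $charset, string $fallback = 'ISO-8859-1'): string
{
    if ($charset === null) {
        // Keep the bytes untouched if they already form valid UTF-8.
        $charset = mb_check_encoding($body, 'UTF-8') ? 'UTF-8' : $fallback;
    }
    return $charset === 'UTF-8'
        ? $body
        : mb_convert_encoding($body, 'UTF-8', $charset);
}

// Possible use with the handle from the code above (before curl_close):
// $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
// $file = toUtf8($file, charsetFromContentType($type ?: null));
```

If the bytes already pass the UTF-8 check, the question marks may instead come from how the result is output, for example a page served without a UTF-8 charset declaration, but that is a guess on my part.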
What am I missing?