Don't use utf8_decode. If your text is in UTF-8, pass it as such.
Unfortunately, DOMDocument defaults to Latin-1 (ISO-8859-1) when parsing HTML. The behavior seems to be this:
- When fetching a remote document, it should deduce the encoding from the HTTP headers.
- If the header wasn't sent or the file is local, it looks for the corresponding meta http-equiv tag.
- Otherwise, it defaults to Latin-1.
Example of it working:
<?php
$s = <<<HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
Sans doute parce qu’il vient d’atteindre une date déterminante
dans son spectaculaire cheminement
</body>
</html>
HTML;
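// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$d = new DOMDocument;
// The meta tag in $s tells the parser that the input is UTF-8.
$d->loadHTML($s);
echo $d->textContent;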
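If the HTML you receive is only a fragment with no meta tag of its own, one option that follows from the lookup order above is to wrap it in a document that declares the charset before parsing. This is a minimal sketch, not the only way to do it: loadUtf8Fragment is a hypothetical helper name, and it assumes both the fragment and the source file are already valid UTF-8.

```php
<?php
// Hypothetical helper (not part of DOMDocument): wraps a UTF-8 HTML fragment
// in a document whose meta tag declares the charset, then parses it.
function loadUtf8Fragment(string $fragment): DOMDocument
{
    $html = '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
        . '</head><body>' . $fragment . '</body></html>';

    // Silence parser warnings while loading, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $d = new DOMDocument;
    $d->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $d;
}

$d = loadUtf8Fragment('<p>une date déterminante</p>');
echo $d->documentElement->textContent;
```

The accented characters should come back intact, because the parser sees the meta tag before it reaches the text.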
And with XML (default is UTF-8):
<?php
// No XML declaration, so the parser assumes UTF-8.
$s = '<x>Sans doute parce qu’il vient d’atteindre une date déterminante ' .
    'dans son spectaculaire cheminement</x>';
libxml_use_internal_errors(true);
$d = new DOMDocument;
$d->loadXML($s);
echo $d->textContent;
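The UTF-8 default for XML only applies when nothing says otherwise. As a small sketch of the opposite case, the input below is assumed to be Latin-1 bytes and says so in its XML declaration. The text you read back from the DOM is UTF-8 either way.

```php
<?php
// "\xE9" is the single Latin-1 byte for "é"; the declaration names that encoding.
$s = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<x>d\xE9terminante</x>";

$d = new DOMDocument;
$d->loadXML($s);

// DOM strings come back as UTF-8 regardless of the input encoding,
// so "é" is now the two bytes "\xC3\xA9".
$text = $d->documentElement->textContent;
var_dump($text === "d\xC3\xA9terminante");
```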
Artefacto
2010-08-19 15:45:25