A: 

Don't use utf8_decode. If your text is in UTF-8, pass it as such.

Unfortunately, DOMDocument defaults to LATIN1 for HTML. The behavior seems to be this:

  • If fetching a remote document, it should deduce the encoding from the headers
  • If the header wasn't sent or the file is local, look for the corresponding meta http-equiv tag
  • Otherwise, default to LATIN1.

Example of it working:

<?php
$s = <<<HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
Sans doute parce qu’il vient d’atteindre une date déterminante
dans son spectaculaire cheminement
</body>
</html>
HTML;

libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadHTML($s);

echo $d->textContent;
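
When the input is only a fragment with no <head>, the meta lookup described above has nothing to find. A minimal sketch of one workaround, assuming that prepending the same meta tag is acceptable for your markup (the $fragment and $meta names are illustrative, and the file itself must be saved as UTF-8):

```php
<?php
// A UTF-8 fragment with no <head>, so no charset declaration of its own.
$fragment = '<p>Sans doute parce qu’il vient d’atteindre une date déterminante</p>';

// Assumption: putting the meta tag first lets the parser pick up the
// charset before it reaches the text.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadHTML($meta . $fragment);

$text = $d->textContent;
echo $text, "\n";
```

The extra meta element ends up in the parsed document, so it may need to be skipped if the document is serialized again.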

And with XML (default is UTF-8):

<?php
$s = '<x>Sans doute parce qu’il vient d’atteindre une date déterminante '.
    'dans son spectaculaire cheminement</x>';
libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadXML($s);

echo $d->textContent;
Artefacto
A: 

I finally managed to resolve my problem. I was able to confirm that it came from the Microsoft Word apostrophe (Word automatically transforms any apostrophe into a typographic apostrophe, ' to ’). It seems that manipulating my text with DOMDocument transformed those typographic apostrophes into ?.

How I made it work:

  1. Find any apostrophe or similar character and transform it using str_replace
  2. Encode my UTF-8 text into Latin-1 (since DOMDocument seems to default to Latin-1)
  3. Load the text and do my stuff with DOMDocument
  4. Re-encode my string back into UTF-8

The code:

<?php
// Smart punctuation to look for
$find[] = '“';  // left side double smart quote
$find[] = '”';  // right side double smart quote
$find[] = '‘';  // left side single smart quote
$find[] = '’';  // right side single smart quote
$find[] = '…';  // ellipsis
$find[] = '—';  // em dash
$find[] = '–';  // en dash

// Plain ASCII replacements, in the same order as $find
$replace[] = '"';
$replace[] = '"';
$replace[] = "'";
$replace[] = "'";
$replace[] = "...";
$replace[] = "-";
$replace[] = "-";

$row->text = str_replace($find, $replace, $row->text);
$row->text = utf8_decode($row->text);
$dom = new DOMDocument();
$dom->loadHTML($row->text);

// SOME DOM MANIPULATION HERE

// Reinsert the HTML, and make sure to remove the DOCTYPE, html and body tags that get added automatically.
$row->text = preg_replace(
    '/^<!DOCTYPE.+?>/',
    '',
    str_replace(array('<html>', '</html>', '<body>', '</body>'), '', $dom->saveHTML())
);
$row->text = utf8_encode($row->text);
?>

Credit goes to the str_replace documentation for the odd-character find/replace list.
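
As an alternative to stripping the added DOCTYPE, html and body tags with preg_replace and str_replace, the children of the body element can be serialized one by one. This is only a sketch: it assumes PHP 5.3.6 or later (where saveHTML() accepts a node argument), uses an ASCII-only sample string, and the $html and $inner names are illustrative.

```php
<?php
$html = '<p>Hello <b>world</b></p><p>Second paragraph</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

// loadHTML() wraps the fragment in <html><body>; keep only what is inside <body>.
$body = $dom->getElementsByTagName('body')->item(0);

$inner = '';
foreach ($body->childNodes as $child) {
    // Serialize each child node on its own, without the document wrapper.
    $inner .= $dom->saveHTML($child);
}

echo $inner, "\n";
```

Unlike a global str_replace, this leaves alone any literal "<body>" or "<html>" text that happens to appear inside the content itself.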

Kyrotomia
DOMDocument did not transform anything. The problem is in `utf8_decode`. `’` is a Unicode character that doesn't exist in LATIN1, so `utf8_decode` turns it into `?`. Worse, you're using str_replace to replace high-bit characters, which may break your UTF-8 input. -1
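
The lossy conversion can be checked in isolation. A small sketch, with the characters written as explicit UTF-8 byte sequences so the result does not depend on the file's encoding (the variable names are illustrative; utf8_decode is deprecated as of PHP 8.2, hence the @):

```php
<?php
$apostrophe = "\xE2\x80\x99"; // U+2019 (typographic apostrophe) as UTF-8 bytes
$eAcute     = "\xC3\xA9";     // U+00E9 (e with acute accent) as UTF-8 bytes

// @ silences the PHP 8.2+ deprecation notice for this demonstration only.
$lossy = @utf8_decode($apostrophe); // no Latin-1 equivalent exists
$kept  = @utf8_decode($eAcute);     // has a Latin-1 equivalent (byte 0xE9)

var_dump($lossy === '?');     // bool(true)
var_dump($kept === "\xE9");   // bool(true)
```

Once the apostrophe has become a literal ?, re-encoding with utf8_encode cannot bring it back.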
Artefacto