ansaurus

Question

DOMDocument encoding problems / characters transformed

Answer 1

A:

Don't use utf8_decode. If your text is in UTF-8, pass it as such.

Unfortunately, DOMDocument defaults to LATIN1 for HTML. The behavior seems to be this:

If fetching a remote document, it should deduce the encoding from the HTTP headers.
If the header wasn't sent or the file is local, it looks for the corresponding meta http-equiv tag.
Otherwise, it defaults to LATIN1.

Example of it working:

<?php
$s = <<<HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
Sans doute parce qu’il vient d’atteindre une date déterminante
dans son spectaculaire cheminement
</body>
</html>
HTML;

libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadHTML($s);

echo $d->textContent;

And with XML (default is UTF-8):

<?php
// Note the trailing space before the concatenation, so the two words do not run together.
$s = '<x>Sans doute parce qu’il vient d’atteindre une date déterminante '.
    'dans son spectaculaire cheminement</x>';
libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadXML($s);

echo $d->textContent;
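
When the input is only an HTML fragment with no head section, one way to apply the meta lookup described above is to prepend the meta tag yourself before calling loadHTML. The following is a minimal sketch, not a definitive recipe: the fragment, the variable names and the decision to prepend the tag are assumptions for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical UTF-8 fragment with no <head>, so the parser gets no charset hint.
$fragment = '<p>Sans doute parce qu’il vient d’atteindre une date déterminante</p>';
$expected = 'Sans doute parce qu’il vient d’atteindre une date déterminante';

libxml_use_internal_errors(true);

// Loaded as-is: the parser has to fall back to its default encoding.
$plain = new DOMDocument();
$plain->loadHTML($fragment);
$plainText = $plain->getElementsByTagName('p')->item(0)->textContent;

// Loaded with a meta tag prepended: the parser is told the bytes are UTF-8.
$hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$hinted = new DOMDocument();
$hinted->loadHTML($hint . $fragment);
$hintedText = $hinted->getElementsByTagName('p')->item(0)->textContent;

// Compare both results against the original text.
var_dump($plainText === $expected);
var_dump($hintedText === $expected);
```

The text stays UTF-8 from start to finish, so no call to utf8_decode or utf8_encode is needed.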

Artefacto 2010-08-19 15:45:25

Answer 2

A:

I finally managed to resolve my problem. I was able to confirm that it came from the Microsoft Word apostrophe (Word automatically transforms any apostrophe into a typographic apostrophe, ' to ’). It seems that manipulating my text with DOMDocument transformed those typographic apostrophes into ?.

How I made it work:

Find any apostrophe or similar character and replace it using str_replace
Encode my UTF-8 text into LATIN1 (since DOMDocument seems to default to LATIN1)
Load it and do my stuff with DOMDocument
Re-encode my string back into UTF-8

The code:

<?php     
          $find[] = 'â€œ';  // left side double smart quote
          $find[] = 'â€';  // right side double smart quote
          $find[] = 'â€˜';  // left side single smart quote
          $find[] = 'â€™';  // right side single smart quote
          $find[] = 'â€¦';  // ellipsis
          $find[] = 'â€”';  // em dash
          $find[] = 'â€“';  // en dash

          $replace[] = '"';
          $replace[] = '"';
          $replace[] = "'";
          $replace[] = "'";
          $replace[] = "...";
          $replace[] = "-";
          $replace[] = "-";
          $row->text = str_replace($find, $replace,  $row->text);
          $row->text = utf8_decode($row->text);
          $dom = new DOMDocument();
          $dom->loadHTML($row->text);

          // SOME DOM MANIPULATION HERE

          // Reinsert the HTML, removing the DOCTYPE, html and body tags that get added automatically.
          $row->text = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace(array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML()));
          $row->text = utf8_encode($row->text);
?>

Credit goes to str_replace doc for the odd character find / replace.
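
For the last step (getting the markup back without the DOCTYPE, html and body wrappers), a possible alternative to the preg_replace/str_replace combination is to serialize only the children of the body element. This is a sketch under assumptions: the input string and variable names are hypothetical, the charset meta tag is prepended only so the UTF-8 text is parsed correctly, and saveHTML is called with a node argument, which requires PHP 5.3.6 or later.

```php
<?php
// Hypothetical UTF-8 input standing in for $row->text.
$text = '<p>Une date déterminante</p><p>Second paragraph</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// The meta tag is an assumption added here so the parser reads the input as UTF-8.
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $text);

// DOM manipulation would go here.

// Serialize each child of <body> instead of stripping wrappers from the full document.
$body = $dom->getElementsByTagName('body')->item(0);
$inner = '';
foreach ($body->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

echo $inner;
```

Depending on the libxml version, the serializer may add line breaks between block elements or emit some characters as entities, so the output is not guaranteed to be byte-identical to the input.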

Kyrotomia 2010-08-19 18:17:07

DOMDocument did not transform anything. The problem is in `utf8_decode`. `’` is a Unicode character that doesn't exist in LATIN1, so `utf8_decode` turns it into `?`. Worse, you're using str_replace to replace high-bit characters, which may break your UTF-8 input. -1
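
The lossy conversion can be reproduced in isolation. This sketch uses mb_convert_encoding to ISO-8859-1 as a stand-in for `utf8_decode` (an assumption made here because `utf8_decode` is deprecated in recent PHP versions); it requires the mbstring extension and assumes the default substitute character has not been changed.

```php
<?php
// U+2019 (right single quotation mark) has no equivalent in ISO-8859-1.
$utf8 = "qu\u{2019}il";
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
var_dump($latin1); // the apostrophe is replaced by the substitute character

// U+00E9 (é) does exist in ISO-8859-1, so it survives as the single byte 0xE9.
$accent = mb_convert_encoding("\u{00E9}", 'ISO-8859-1', 'UTF-8');
var_dump(bin2hex($accent));
```

Once the character has been replaced, re-encoding to UTF-8 afterwards cannot bring it back.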

Artefacto 2010-08-20 01:06:57
