First of all, my database uses Windows-1250 as its native charset, and I output the data as UTF-8. I use the iconv() function all over my website to convert Windows-1250 strings to UTF-8 strings, and it works perfectly.

The problem appears when I use PHP DOM to parse some HTML stored in the database (the HTML is output from a WYSIWYG editor and is not a complete document: it has no html, head or body tags).

The HTML could look something like this, for example:

<p>Hello</p>

Here is the method I use to parse HTML from the database:

 private function ParseSlideContent($slideContent)
 {
        var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $parts = explode('data/media/video/', trim($e->getAttribute('src')));
    $path = end($parts); // end() takes its argument by reference, so it needs a variable
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }

The output from the method above is garbage, with all special characters replaced by weird stuff like Ú�.

One more thing. It does work on my development server.

It does not work on the production server though.

Any suggestions?

PHP version of the production server: PHP Version 5.2.0RC4-dev

PHP version of the development server: PHP Version 5.2.13


UPDATE:

I'm working on a solution myself. I took inspiration from this PHP bug report (not really a bug though): http://bugs.php.net/bug.php?id=32547

This is my proposed solution. I will try it tomorrow and let you know if it works:

 private function ParseSlideContent($slideContent)
 {
        var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  // this might work
  // it basically just adds head and meta tags to the document
  $html = $doc->getElementsByTagName('html')->item(0);
  $head = $doc->createElement('head', '');
  $meta = $doc->createElement('meta', '');
  $meta->setAttribute('http-equiv', 'Content-Type');
  $meta->setAttribute('content', 'text/html; charset=utf-8');
  $head->appendChild($meta);
  $body = $doc->getElementsByTagName('body')->item(0);
  $html->removeChild($body);
  $html->appendChild($head);
  $html->appendChild($body);

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $parts = explode('data/media/video/', trim($e->getAttribute('src')));
    $path = end($parts); // end() takes its argument by reference, so it needs a variable
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }
+1  A: 

Two solutions.

You can either set the encoding as a header:

<?php header("Content-Type: text/html; charset=utf-8"); ?>

Or you can set it as a META tag:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

EDIT: in the event that both of these are set correctly, do the following:

  • Create a small page that has a UTF-8 character in it.
  • Write the page in the same method that you already have.
  • Use Fiddler or Wireshark to examine the raw bytes being transferred in your DEV and PROD environments. You can also double check the headers using Fiddler/Wireshark.

If you are confident that the correct header is being sent, then your best chance of finding the error is to start looking at raw bytes. Identical bytes sent to an identical browser will yield the same result, so you need to start looking for why they are not identical. Fiddler/Wireshark will help with that.
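
If Fiddler or Wireshark is not at hand, a rough first check can be done from PHP itself by dumping the bytes of the string just before it is echoed, on both servers. The following is a minimal sketch; describeBytes() and isValidUtf8() are hypothetical helper names, not part of the question's code.

```php
<?php
// Hypothetical helpers for comparing output between DEV and PROD.

// Returns the byte length and a space-separated hex dump of a string.
function describeBytes($s)
{
    return 'len=' . strlen($s) . ' hex=' . implode(' ', str_split(bin2hex($s), 2));
}

// preg_match() with the /u modifier fails on byte sequences that are not valid UTF-8.
function isValidUtf8($s)
{
    return preg_match('//u', $s) === 1;
}

// "Ú" is the two bytes c3 9a in UTF-8, but the single byte da in Windows-1250.
echo describeBytes("\xC3\x9A"), "\n"; // len=2 hex=c3 9a
echo describeBytes("\xDA"), "\n";     // len=1 hex=da
var_dump(isValidUtf8("\xC3\x9A"));    // bool(true)
var_dump(isValidUtf8("\xDA"));        // bool(false)
```

Running the same dump on the value returned by the parsing method on each server should show whether the bytes already differ before they leave PHP.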

Stargazer712
I don't think this will solve the problem if it really works with var_dump.
Cem Kalyoncu
He mentions that it does work on his development server, meaning that it is highly likely that the bytes are being _written_ correctly. The most likely problem from there is that the bytes are not being _read_ correctly, and this should fix that.
Stargazer712
The header is sent correctly. There is also the correct meta tag.
Richard Knop
@Richard Knop - See edits.
Stargazer712
@Stargazer712 OK, I will try using Fiddler. By the way, I think the problem is caused by PHP DOM. I think it is messing up the Eastern European UTF-8 characters. Do you know of any alternative to PHP DOM that I could use to parse HTML?
Richard Knop
+1  A: 

Your "hack" doesn't make sense.

You are converting a Windows-1250 HTML file into UTF-8 and then prepending <?xml encoding="UTF-8">. This won't work. The DOM extension, for HTML files:

  • Takes the charset specified in a meta http-equiv for "content-type".
  • Otherwise assumes ISO-8859-1

I suggest you instead convert from Windows-1250 into ISO-8859-1 and prepend nothing.

EDIT The suggestion is not very good because Windows-1250 has characters that are not in ISO-8859-1. Since you're dealing with fragments without meta elements for content-type, you can add your own to force interpretation as UTF-8:

<?php
//script and output are in UTF-8

/* Simulate HTML fragment in Windows-1250 */
$html = <<<XML
<p>ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p>
XML;
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert

/* Prepend meta header to force UTF-8 interpretation and convert into UTF-8 */
$htmlInterm =
    "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />" .
    iconv("Windows-1250", "UTF-8", $htmlInterm);

/* Omit libxml warnings */
libxml_use_internal_errors(true);

/* Build DOM */
$d = new DOMDocument();
$d->loadHTML($htmlInterm);
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8

gives:

string(79) "ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"
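
Applied to a fragment like the one in the question, the same idea might look like the sketch below. It is only an illustration under some assumptions: rewriteFragment() is a hypothetical name, only the img rewriting from the question is reproduced, and the body's children are serialized with saveHTML($node), which needs PHP 5.3.6 or later (on older versions the explode() approach from the question would have to stay).

```php
<?php
// Hypothetical helper: takes a Windows-1250 HTML fragment, returns the rewritten fragment.
function rewriteFragment($fragmentCp1250)
{
    // Convert the stored Windows-1250 bytes to UTF-8.
    $utf8 = iconv('Windows-1250', 'UTF-8', $fragmentCp1250);
    if ($utf8 === false || trim($utf8) === '') {
        return '';
    }

    // Prepend a meta element so the HTML parser reads the input as UTF-8.
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';

    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($meta . $utf8);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Same image rewriting as in the question.
    foreach ($doc->getElementsByTagName('img') as $img) {
        $path = trim($img->getAttribute('src'));
        $img->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
    }

    // Serialize only the children of <body>, leaving out the added meta element.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Serializing the body's child nodes avoids splitting the full document on the '<body>' string, which breaks if the body tag ever carries attributes.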
Artefacto
If you have worked with non English data (cp1250 or other) you will know that this hack is sometimes the only way to make PHP DOM preserve the UTF-8 special characters. It is also mentioned in the PHP documentation. You can try making a cp1250 database, fetching some data from there and parsing it with PHP DOM. It's a real pain.
Richard Knop
@Rich "It is also mentioned in the PHP documentation." Link please. User notes are not part of the documentation.
Artefacto
@Artefacto It is a user comment here (http://www.php.net/manual/en/domdocument.loadhtml.php). It's the third comment from the top. I know it's not official, but it's sometimes the only way. This is not the only time the Windows-1250 + PHP DOM combination has given me headaches. Nevertheless, I just slept for a while and I have an idea about how to solve this (not sure it will work though). I will try it tomorrow; if it doesn't work, I will probably start a bounty for this question.
Richard Knop
@Artefacto I have an idea about what could be the problem from here: http://bugs.php.net/bug.php?id=32547 But I'm going to sleep now.
Richard Knop
@Artefacto If I solve this I will probably add a comment to PHP documentation for the first time :D
Richard Knop
@Rich As to the bug report – that's what I was saying. Unless you put a meta there, it assumes it's ISO-8859-1. However, my solution is also insufficient because Windows-1250 has characters that ISO-8859-1 doesn't have.
Artefacto
@Artefacto Check my updated question. I have no way to test it from home, but I will test it tomorrow at work. And by the way, the "$doc->loadHTML('<?xml encoding="UTF-8">' . $html);" hack is needed because the HTML from the database is not valid. It is output from a WYSIWYG editor, something like '<p>Hello</p>', so it has no html, head or body tags. If I use the loadHTML() method on invalid HTML like that, PHP DOM will go crazy and mess up all UTF-8 characters.
Richard Knop
@Rich See my edit.
Artefacto
Will test tomorrow and accept your answer if it works ;)
Richard Knop
@Artefacto Works ;)
Richard Knop