ansaurus

Question

PHP DOM UTF-8 problem

Answer 1

+1 A:

Two solutions.

You can either set the encoding as a header:

<?php header("Content-Type: text/html; charset=utf-8"); ?>

Or you can set it as a META tag:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

EDIT: in the event that both of these are set correctly, do the following:

1. Create a small page that has a UTF-8 character in it.
2. Write the page out using the same method you already use.
3. Use Fiddler or Wireshark to examine the raw bytes being transferred in your DEV and PROD environments. You can also double-check the headers using Fiddler/Wireshark.

If you are confident that the correct header is being sent, then your best chance of finding the error is to start looking at raw bytes. Identical bytes sent to an identical browser will yield the same result, so you need to start looking for why they are not identical. Fiddler/Wireshark will help with that.
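As a complement to Fiddler/Wireshark, the bytes can also be inspected on the PHP side before they are sent. The following is only a minimal sketch: describeBytes is a hypothetical helper name, not something from the question, and it assumes the string you pass in is the one your page is about to output.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the thread):
// summarise the raw bytes of a string so DEV and PROD output can be compared.
function describeBytes(string $s): string
{
    // bin2hex() gives two hex digits per byte; split them into pairs.
    $hex = implode(' ', str_split(bin2hex($s), 2));

    // With the /u modifier, preg_match() fails on input that is not valid UTF-8.
    $isUtf8 = preg_match('//u', $s) === 1;

    return sprintf('len=%d hex=%s utf8=%s', strlen($s), $hex, $isUtf8 ? 'yes' : 'no');
}

// "Ľ" encoded as UTF-8 is the two bytes C4 BD.
echo describeBytes("\xC4\xBD"), "\n"; // prints "len=2 hex=c4 bd utf8=yes"

// The same character in Windows-1250 is the single byte BC, which is not valid UTF-8.
echo describeBytes("\xBC"), "\n";     // prints "len=1 hex=bc utf8=no"
```

If the two environments print different hex for the same source data, the difference was introduced before the response left PHP, not by the browser.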

Stargazer712 2010-08-23 15:21:32

I don't think this will solve the problem, if it is really working with var_dump

Cem Kalyoncu 2010-08-23 15:53:47

He mentions that it does work on his development server, meaning that it is highly likely that the bytes are being _written_ correctly. The most likely problem from there is that the bytes are not being _read_ correctly, and this should fix that.

Stargazer712 2010-08-23 16:00:37

The header is sent correctly. There is also the correct meta tag.

Richard Knop 2010-08-23 16:14:41

@Richard Knop - See edits.

Stargazer712 2010-08-23 16:29:27

@Stargazer712 OK, I will try using Fiddler. By the way, I think the problem is caused by PHP DOM. I think it is messing up the Eastern European UTF-8 characters. Do you know any alternative to PHP DOM that I could use to parse HTML?

Richard Knop 2010-08-23 16:33:13

Answer 2

+1 A:

Your "hack" doesn't make sense.

You are converting a Windows-1250 HTML file into UTF-8 and then prepending <?xml encoding="UTF-8">. This won't work. The DOM extension, for HTML files:

Takes the charset specified in a meta http-equiv for "content-type".
Otherwise assumes ISO-8859-1

I suggest you instead convert from Windows-1250 into ISO-8859-1 and prepend nothing.

EDIT The suggestion is not very good because Windows-1250 has characters that are not in ISO-8859-1. Since you're dealing with fragments without meta elements for content-type, you can add your own to force interpretation as UTF-8:

<?php
//script and output are in UTF-8

/* Simulate HTML fragment in Windows-1250 */
$html = <<<XML
<p>ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p>
XML;
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert

/* Prepend meta header to force UTF-8 interpretation and convert into UTF-8 */
$htmlInterm =
    "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />" .
    iconv("Windows-1250", "UTF-8", $htmlInterm);

/* Omit libxml warnings */
libxml_use_internal_errors(true);

/* Build DOM */
$d = new DOMDocument();
$d->loadHTML($htmlInterm);
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8

gives:

string(79) "ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"
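If the fragments are already UTF-8 (for example after the iconv step above), the same idea can be wrapped in a small function. This is only a sketch under assumptions: loadUtf8Fragment is a hypothetical name, the DOM extension is assumed to be enabled, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper (the name is an assumption): parse a UTF-8 HTML fragment
// by prepending a content-type meta element so the DOM extension reads it as UTF-8.
function loadUtf8Fragment(string $fragment): DOMDocument
{
    // Collect libxml warnings internally instead of emitting them,
    // and remember the previous setting so it can be restored.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">' . $fragment
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// "\xC4\xBD" is "Ľ" in UTF-8; written as escapes so the file encoding does not matter.
$doc  = loadUtf8Fragment("<p>\xC4\xBDubica</p>");
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

// Dump the bytes rather than the characters, to rule out terminal encoding issues.
echo bin2hex($text), "\n";
```

The meta element ends up in the generated head, so reading from the p or body elements returns only the fragment's own text.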

Artefacto 2010-08-23 18:43:39

If you have worked with non-English data (cp1250 or other) you will know that this hack is sometimes the only way to make PHP DOM preserve the UTF-8 special characters. It is also mentioned in the PHP documentation. You can try making a cp1250 database, fetching some data from there and parsing it with PHP DOM. It's a real pain.

Richard Knop 2010-08-23 19:43:00

@Rich "It is also mentioned in the PHP documentation." Link please. User notes are not part of the documentation.

Artefacto 2010-08-23 19:55:18

@Artefacto It is a user comment here (http://www.php.net/manual/en/domdocument.loadhtml.php). It's the third comment from the top. I know it's not official but it's sometimes the only way. This is not the only time the Windows-1250 + PHP DOM combination has given me headaches. Nevertheless, I just slept for a while and I have an idea about how to solve this (not sure it will work though). I will try it tomorrow; if it doesn't work I will probably start a bounty for this question.

Richard Knop 2010-08-23 20:04:22

@Artefacto I have an idea about what could be the problem from here: http://bugs.php.net/bug.php?id=32547 But I'm going to sleep now.

Richard Knop 2010-08-23 20:05:16

@Artefacto If I solve this I will probably add a comment to PHP documentation for the first time :D

Richard Knop 2010-08-23 20:06:25

@Rich As to the bug report – that's what I was saying. Unless you put a meta there, it assumes it's ISO-8859-1. However, my solution is also insufficient because Windows-1250 has characters that ISO-8859-1 doesn't have.

Artefacto 2010-08-23 20:08:51

@Artefacto Check my updated question. I have no way to test it from home but I will test it tomorrow at work. And by the way, the "$doc->loadHTML('<?xml encoding="UTF-8">' . $html);" hack is needed because the HTML from the database is not valid. It is an output from a WYSIWYG editor which is something like '<p>Hello</p>'. So it has no html, head, body tags. And if I use loadHTML() method on an invalid HTML like that, PHP DOM will go crazy and mess up all UTF-8 characters.

Richard Knop 2010-08-23 20:15:47

@Rich See my edit.

Artefacto 2010-08-23 20:34:52

Will test tomorrow and accept your answer if it works ;)

Richard Knop 2010-08-23 20:41:26

@Artefacto Works ;)

Richard Knop 2010-08-24 06:35:58
