I'm writing some RSS feeds in PHP and struggling with character-encoding issues. Should I call utf8_encode() before or after htmlentities()? For example, I've got both ampersands and Chinese characters in a description element, and I'm not sure which of these is proper:

$output = utf8_encode(htmlentities($source)); or
$output = htmlentities(utf8_encode($source));

And why?

+1  A: 

You want to do $output = htmlentities(utf8_encode($source));. This is because you want to convert your international characters into proper UTF-8 first, and then have ampersands (and possibly some of the UTF-8 characters as well) turned into HTML entities. If you do the entities first, then some of the international characters may not be handled properly.

If none of your international characters are going to be changed by utf8_encode, then it doesn't matter which order you call them in.

SoapBox
+1  A: 

It's important to pass the character set to the htmlentities function, as the default is ISO-8859-1:

utf8_encode(htmlentities($source,ENT_COMPAT,'utf-8'));

You should apply htmlentities first so that utf8_encode can encode the entities properly.

(EDIT: I changed from my opinion before that the order didn't matter based on the comments. This code is tested and works well).

Eran Galperin
Order does matter! utf8_encode before htmlentities() will change how it behaves. Compare string urldecode('%E2%82%AC') with and without applying utf8_encode() first.
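
To make the comparison above concrete, here is a minimal sketch. It assumes the source file and the input are already UTF-8, and it passes 'UTF-8' to htmlentities() explicitly rather than relying on the default. Note that utf8_encode() is deprecated as of PHP 8.2, so newer versions will emit a deprecation notice.

```php
<?php
// urldecode('%E2%82%AC') yields the three UTF-8 bytes of the euro sign.
$euro = urldecode('%E2%82%AC');

// utf8_encode() assumes ISO-8859-1 input, so it re-encodes each of the
// three bytes separately and produces a double-encoded string.
$double = utf8_encode($euro);

$direct  = htmlentities($euro, ENT_COMPAT, 'UTF-8');
$mangled = htmlentities($double, ENT_COMPAT, 'UTF-8');

echo strlen($euro), "\n";    // 3
echo strlen($double), "\n";  // 6
echo $direct, "\n";          // &euro;
echo $mangled, "\n";         // no longer the euro entity
```

If the input is already UTF-8, calling utf8_encode() on it is what corrupts the text, whichever position it takes in the chain.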
porneL
You are right, however it appears that using htmlentities first is the correct method (tested it). Changed my post to reflect it.
Eran Galperin
+1  A: 

Don't use htmlentities()!

Simply use UTF-8 characters. Just make sure you declare encoding of the feed in HTTP headers (Content-Type:application/xml;charset=UTF-8) or failing that, in the feed itself using <?xml version="1.0" encoding="UTF-8"?> on the first line.
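
A rough sketch of what that could look like in PHP. The build_feed() helper is hypothetical and the channel is deliberately incomplete (no link element, for instance). It also assumes that the XML markup characters (&, <, >, quotes) still have to be escaped, which this sketch does with htmlspecialchars() and the ENT_XML1 flag (available since PHP 5.4); everything else stays as raw UTF-8.

```php
<?php
// Hypothetical helper: builds a tiny RSS document as a UTF-8 string.
function build_feed(string $title, string $description): string
{
    // Escape only the characters XML treats as markup;
    // Chinese text and other non-ASCII characters pass through untouched.
    $esc = function (string $s): string {
        return htmlspecialchars($s, ENT_QUOTES | ENT_XML1, 'UTF-8');
    };

    return '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
        . '<rss version="2.0"><channel>'
        . '<title>' . $esc($title) . '</title>'
        . '<description>' . $esc($description) . '</description>'
        . '</channel></rss>';
}

// In a web context, send the header before any output:
// header('Content-Type: application/xml; charset=UTF-8');
echo build_feed('News & Updates', '中文 description'), "\n";
```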

porneL
+1  A: 

It might be easier to forget htmlentities and use a CDATA section. It works for the title element, which doesn't seem to support encoded HTML characters in Firefox's RSS viewer:

<title><![CDATA[News & Updates  " > » ☂ ☺ ☹ ☃  Test!]]></title>
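
One caveat with CDATA is that the text itself must not contain "]]>", since that sequence ends the section. A minimal sketch of a wrapper that handles this by splitting the sequence across two sections (the cdata() function name is made up for illustration):

```php
<?php
// Hypothetical helper: wraps text in a CDATA section.
function cdata(string $text): string
{
    // "]]>" would close the section early, so end the section after "]]"
    // and reopen a new one before ">".
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);
    return '<![CDATA[' . $safe . ']]>';
}

echo '<title>' . cdata('News & Updates " > » ☂') . '</title>', "\n";
```

The text inside the section still has to be in the encoding the feed declares; CDATA only removes the need to escape markup characters.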
+4  A: 

First: The utf8_encode function converts from ISO 8859-1 to UTF-8. So you only need this function if your input encoding/charset is ISO 8859-1. But why don’t you use UTF-8 in the first place?

Second: You don’t need htmlentities. You just need htmlspecialchars to replace the special characters with character references. htmlentities would replace “too many” characters that can be encoded directly using UTF-8. It is important to use the ENT_QUOTES quote style so that single quotes are replaced as well.

So my proposal:

// if your input encoding is ISO 8859-1
htmlspecialchars(utf8_encode($string), ENT_QUOTES)

// if your input encoding is UTF-8
htmlspecialchars($string, ENT_QUOTES, 'UTF-8')
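
A small sketch of the difference, assuming the script file is saved as UTF-8 (the sample string is made up). The named entity that htmlentities() produces for é is an HTML entity; XML itself predefines only amp, lt, gt, quot and apos, so a feed parser may reject it.

```php
<?php
$text = "Tom & Jerry's <café>";

// Escapes only &, <, > and both quote styles.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Also converts every character that has a named HTML entity.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";   // Tom &amp; Jerry&#039;s &lt;café&gt;
echo $entities, "\n";  // Tom &amp; Jerry&#039;s &lt;caf&eacute;&gt;
```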
Gumbo
A: 

After much trial and error, I finally found a way to properly display a string from a UTF-8-encoded database value, through an XML file, to an HTML page:

$output = '<![CDATA['.utf8_encode(htmlentities($string)).']]>';

I hope this helps someone.

katy lavallee