I want to run text through a filter to ensure it is all UTF-8 encoded. What is the recommended way to do this with PHP?

+2  A: 

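Your question is unclear. Are you trying to encode something? If so, utf8_encode is your friend, provided the input is ISO-8859-1, which is the only source encoding it handles. Are you trying to determine whether the text needs to be encoded at all? utf8_encode can still help, because you can check whether the result is the same as the input. Note that the two only match when the input is pure ASCII, so this check cannot recognize text that is already valid UTF-8 and contains multi-byte characters.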

Don Neufeld
+1  A: 

Check the multi-byte string (mbstring) functions in the PHP manual.
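
As a minimal sketch of what that could look like, mb_check_encoding() from the mbstring extension reports whether a byte string is valid in a given encoding. The helper name is_valid_utf8_mb is made up for illustration, and the sketch assumes PHP 7 or later with mbstring enabled.

```php
<?php
// Hypothetical helper name; mb_check_encoding() itself comes from mbstring.
function is_valid_utf8_mb(string $text): bool
{
    // True only if every byte sequence in $text is well-formed UTF-8.
    return mb_check_encoding($text, 'UTF-8');
}

var_dump(is_valid_utf8_mb("caf\xC3\xA9")); // "é" as a two-byte UTF-8 sequence
var_dump(is_valid_utf8_mb("caf\xE9"));     // lone ISO-8859-1 byte for "é"
```

This only validates; it does not tell you what encoding invalid input was actually in.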

Bahadır
A: 

You need to know in what character set your input string is encoded, or this will go nowhere fast.

If you want to do it correctly, this article may be helpful: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Thomas
A: 

Given a stream of bytes, you have to know what encoding it is in to begin with. Email uses MIME headers to specify the encoding, and HTTP uses HTTP headers. You can also specify the encoding in a meta tag in a web page, but it is not always honored.

Anyway, once you know what encoding you want to convert from, use iconv to convert it to UTF-8. Look at the iconv section of the PHP docs; there's lots of good info there.
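
A rough sketch of that conversion, assuming the source encoding is already known (here ISO-8859-1, chosen only as an example) and PHP 7 or later. The wrapper name to_utf8 is hypothetical; iconv() is the real function.

```php
<?php
// Hypothetical wrapper: the caller must supply the real source encoding.
function to_utf8(string $bytes, string $sourceEncoding): string
{
    // iconv() returns false if the input is not valid in $sourceEncoding.
    $converted = iconv($sourceEncoding, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new InvalidArgumentException("Input is not valid $sourceEncoding");
    }
    return $converted;
}

$latin1 = "caf\xE9"; // "café" in ISO-8859-1
echo bin2hex(to_utf8($latin1, 'ISO-8859-1')), "\n"; // prints 636166c3a9
```

If the declared source encoding is wrong, iconv will often still produce output, just with the wrong characters, which is why knowing the input encoding matters.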

Ah, Thomas posted the link I was looking for. A must read.

DGM
A: 

The easiest way to check for UTF-8 validity:

  1. If only one line is allowed:

    preg_match('/^.*$/Du', $value)

  2. If multiple lines are allowed:

    preg_match('/^.*$/sDu', $value)

This works for PHP >= 4.3.5 and does not require any non-default PHP modules.
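
The two patterns above could be wrapped in one function, roughly like this. The function name and the $allowNewlines parameter are invented for the sketch, and the type declarations assume PHP 7 or later rather than the 4.3.5 minimum mentioned above.

```php
<?php
// Hypothetical wrapper around the two patterns from the answer.
function is_valid_utf8_pcre(string $value, bool $allowNewlines = false): bool
{
    // With the u modifier, preg_match() fails on malformed UTF-8,
    // so only a result of exactly 1 counts as valid.
    $pattern = $allowNewlines ? '/^.*$/sDu' : '/^.*$/Du';
    return preg_match($pattern, $value) === 1;
}

var_dump(is_valid_utf8_pcre("caf\xC3\xA9"));           // valid UTF-8, one line
var_dump(is_valid_utf8_pcre("caf\xE9"));               // malformed UTF-8
var_dump(is_valid_utf8_pcre("line1\nline2", true));    // valid UTF-8, newlines allowed
```

In single-line mode a false result can mean either malformed UTF-8 or an embedded newline; checking preg_last_error() against PREG_BAD_UTF8_ERROR would distinguish the two.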

Tometzky