Well, the subject says everything. I'm using json_encode to convert some UTF-8 data to JSON, and I need to pass it through a layer that is currently ASCII-only. So I wonder whether I need to make that layer UTF-8 aware, or whether I can leave it as it is.

Looking at the JSON RFC, UTF-8 is also a valid charset for JSON output, although not recommended, i.e. some implementations can leave UTF-8 data inside. The question is whether PHP's implementation dumps everything as ASCII or opts to leave something as UTF-8.

A: 

Well, json_encode returns a string. According to the PHP documentation for string:

A string is series of characters. Before PHP 6, a character is the same as a byte. That is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode. See utf8_encode() and utf8_decode() for some basic Unicode functionality.

So for the time being you do not need to worry about making it UTF-8 aware. Of course you still might want to think about this anyway, to future-proof your code.

Justin Ethier
And if you are using PHP 6, welcome back from the future!
salathe
@salathe: I think you mean "welcome back from SVN".
Ignacio Vazquez-Abrams
I cannot use utf8_encode and utf8_decode, because PHP is not on the other side. I need to dump the data from PHP to JSON, pass it through a layer that only understands ASCII, and finally use it via JavaScript at the destination. Unless JavaScript has UTF-8 functions identical to PHP's, that approach is not usable.
Milan Babuškov
@Milan it's possible to get a `urldecode()` equivalent in JS. Alternatively, if you can live with the 33% bloat, consider base64 encoding.
Pekka
@Ignacio Vazquez-Abrams: Nope, I meant _the future_ :-)
salathe
+1  A: 

According to the JSON article in Wikipedia, Unicode characters in strings are always

double-quoted Unicode with backslash escaping

The examples in the PHP Manual on json_encode() seem to confirm this.

So any UTF-8 character outside ASCII should be escaped in the form \uXXXX, for example \u00e9 for é (note, as @Ignacio points out in the comments, that this is the recommended way to deal with those characters, not a required one)
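To see the escaping for yourself, here is a minimal sketch, assuming a default `json_encode()` call with no option flags and valid UTF-8 input. The sample name is just an illustration; the non-ASCII letter is written as raw UTF-8 bytes so the snippet does not depend on the source file's encoding.

```php
<?php
// Hypothetical sample value: "Babuškov", with š (U+0161) given as its UTF-8 bytes.
$name = "Babu\xC5\xA1kov";

// Default call, no flags: the non-ASCII character comes out as a \uXXXX escape.
$json = json_encode($name);

echo $json, "\n"; // "Babu\u0161kov"
```

The two-byte UTF-8 sequence turns into one six-character ASCII escape, so the output contains no bytes above 0x7F.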

However, I suppose json_decode() will convert the characters back to their byte values? You may get in trouble there.
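A quick round trip shows what happens on the decoding side. This is only a sketch with made-up sample data, assuming valid UTF-8 input and default encoder settings; a JavaScript JSON parser at the destination should resolve the same escapes in the same way.

```php
<?php
// Hypothetical sample data: "Zürich", with ü (U+00FC) given as its UTF-8 bytes.
$original = array('city' => "Z\xC3\xBCrich");

$json    = json_encode($original);   // ASCII-only text containing a \u00fc escape
$decoded = json_decode($json, true); // true => decode objects as associative arrays

// The escape is turned back into the original UTF-8 bytes, so nothing is lost.
var_dump($decoded === $original); // bool(true)
```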

If you need to be sure, take a look at iconv(), which could convert your UTF-8 string into ASCII (dropping any unsupported characters) beforehand.

Pekka
I don't want to drop the unsupported characters. I need to preserve them.
Milan Babuškov
@Milan then you should see to it that they stay converted in `\uXXXX` form. Hang on, I'll check whether I can find out how to do that.
Pekka
From RFC 4627, section 3, "Encoding": "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8." Just because Unicode escapes *can* be used doesn't mean that they are *required*.
Ignacio Vazquez-Abrams
@Milan I just remember a similar question: How to convert the names of uploaded files (which may contain any UTF-8 characters) so they can be stored in the local file system, regardless of what character sets that system supports. A great solution somebody had was `urlencode()` ing the strings. That would preserve all UTF-8 characters (don't forget to specify the encoding) but be easily storable in ASCII. Decoding is a simple `urldecode()`. Does that help?
Pekka
@Ignacio thanks for the clarification. I edited the answer to point that out.
Pekka
@Pekka: I really need it to be JSON because of the interface on the other end that is JavaScript based and requires JSON. This also explains why I don't care about PHP's json_decode.
Milan Babuškov
@Milan if you need to be 1000% sure, why not roll your own encoder that converts UTF-8 characters into `\uxxxx` (= does what `json_encode()` seems to be doing right now, although nobody seems able to confirm it 100%). Alternatively, take a look at `json_encode()`'s implementation. I admit those are imperfect solutions, though.
Pekka
@Pekka: I use json_encode to encode complex objects and arrays, so writing my own json_encode is not a task I'm willing to take on. I'd rather add UTF-8 awareness to my middle layer instead.
Milan Babuškov
@Milan I don't think you would have to re-implement json_encode() - just do an additional conversion of any UTF-8 characters to their `\uxxxx` counterparts after `json_encode()`, but before sending away the data. Anyway, it is probably going to be cleaner to work on the middle layer.
Pekka
Just for clarification, at the time of writing this, PHP's `json_encode` _will escape_ the non-ASCII characters that you are concerned about. While JSON can happily house UTF-8 characters, PHP's implementation currently escapes them.
salathe
@salathe: that's the kind of answer I'm looking for. Why don't you write it as an answer instead of a comment so we can vote it up?
Milan Babuškov
+3  A: 

Unlike JSON support in other languages, json_encode() does not have the ability to generate anything other than ASCII.
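If you would rather verify this than take it on faith, a cheap guard before handing the data to the ASCII-only layer could look like the sketch below. The helper name `isAsciiOnly()` and the payload are hypothetical. Note that this concerns the default call with no flags: PHP 5.4 and later accept a `JSON_UNESCAPED_UNICODE` flag that switches the escaping off, which is one more reason to keep a check like this around.

```php
<?php
// Hypothetical helper: true when the string contains no byte above 0x7F.
function isAsciiOnly($text)
{
    return preg_match('/[\x80-\xFF]/', $text) === 0;
}

// Hypothetical payload with non-ASCII characters given as UTF-8 bytes.
$payload = array(
    'name' => "Milan Babu\xC5\xA1kov", // š = U+0161
    'note' => "caf\xC3\xA9",           // é = U+00E9
);

$json = json_encode($payload); // default call, no flags

var_dump(isAsciiOnly($payload['note'])); // bool(false): raw UTF-8 bytes
var_dump(isAsciiOnly($json));            // bool(true): only \uXXXX escapes remain
```

If the check ever fails, the middle layer is about to receive raw UTF-8, and you can stop there instead of corrupting the data downstream.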

Ignacio Vazquez-Abrams
Thanks Ignacio, this is the kind of answer I'm looking for. Can you provide some website or other reference to back this up?
Milan Babuškov
I cannot. All I can do is point out the lack of arguments or options in `json_encode()` to produce anything else.
Ignacio Vazquez-Abrams
I guess that is sufficient.
Milan Babuškov