views:

6610

answers:

3

I am writing a webservice that uses json to represent its resources, and I am a bit stuck thinking about the best way to encode the json. Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8. But the rfc also describes a string escaping mechanism for specifying characters. I assume this would generally be used to escape non-ascii characters, thereby making the result pure ascii (which is still valid utf-8).

So let's say I have a json string that contains unicode characters (code-points) that are non-ascii. Should my webservice just utf-8 encode that and return it, or should it escape all those non-ascii characters and return pure ascii?

I'd like browsers to be able to execute the results using jsonp or eval. Does that affect the decision? My knowledge of various browsers' javascript support for utf-8 is lacking.

EDIT: I wanted to clarify that my main concern about how to encode the results is really about browser handling of the results. What I've read indicates that browsers may be sensitive to the encoding when using JSONP in particular. I haven't found any really good info on the subject, so I'll have to start doing some testing to see what happens. Ideally I'd like to only escape those few characters that are required and just utf-8 encode the results.
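To make the two options concrete, here is a small sketch in Python (the question mentions python simplejson; the stdlib `json` module exposes the same choice through its `ensure_ascii` flag — the data used is just an illustration):

```python
import json

data = {"greeting": "héllo wörld"}

# Option 1: escape all non-ASCII characters -> pure ASCII output,
# which is also valid UTF-8.
print(json.dumps(data, ensure_ascii=True))
# {"greeting": "h\u00e9llo w\u00f6rld"}

# Option 2: emit the characters directly and UTF-8 encode the result.
# Smaller output, but relies on correct charset handling downstream.
print(json.dumps(data, ensure_ascii=False).encode("utf-8"))
```

Both outputs decode to the same value; the question is which form survives transport and browser handling more reliably.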

+3  A: 

ASCII isn't in it any more. Using UTF-8 encoding means that you aren't using ASCII encoding. What you should use the escaping mechanism for is what the RFC says:

All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)
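A quick Python sketch of that rule: even with ASCII-escaping turned off, a JSON encoder still escapes the characters the RFC requires (quotation mark, reverse solidus, and control characters), while ordinary non-ASCII characters pass through:

```python
import json

s = 'quote: " backslash: \\ newline: \n é'
out = json.dumps(s, ensure_ascii=False)
print(out)
# The " becomes \", the backslash becomes \\, the newline becomes \n,
# but the é is emitted as-is.
```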

chaos
If you read that quote you provided, you'll see that you are not required to escape all unicode characters, only a few special characters. But you are required to encode the results (preferably with utf-8). So the question is: "Why bother escaping normal unicode characters if you're utf-8 encoding the results?"
schickb
Also, an ascii encoded string is a pure subset of utf-8. If I use json's escaping for all non-ascii characters, the result is ascii -- and therefore utf-8. Various json libraries (like python simplejson) have modes to force ascii results. I presume for a reason, like perhaps execution in browsers.
schickb
When you bother escaping normal unicode characters is in contexts where they're metacharacters, like strings. (The RFC chunk I quoted is about strings; sorry, wasn't clear about that.) You don't need to do ASCII output all the time; I'd think that's more for debugging with broken browsers.
chaos
+2  A: 

JSON implementations can handle the safe numeric encodings just as well as UTF-8. Some frameworks, including PHP's implementation of JSON, always use the safe numeric encodings for everything.

As has been mentioned, it can be done for maximum compatibility with buggy browsers, etc. But it is also a form of choice. JSON has more uses than Javascript; it can be used as a lightweight generic data interchange format when you don't need all the features of XML. The ability to encode everything in ascii-safe numeric encodings just makes it more flexible with other programming languages, transports, and methods of storage which may not be UTF-8 aware or even binary-safe.

So, I guess you just could decide based on:

  • UTF-8 is more compact
  • ascii-safe encoding is more flexible in applications that aren't as UTF-8 or 8-bit friendly as Javascript on modern browsers is.
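The compactness trade-off is easy to verify; a small Python sketch (the city name is just sample data):

```python
import json

data = {"city": "Zürich"}

utf8 = json.dumps(data, ensure_ascii=False).encode("utf-8")
ascii_safe = json.dumps(data, ensure_ascii=True).encode("ascii")

# "ü" costs 2 bytes in UTF-8 but 6 bytes as the escape \u00fc,
# so the UTF-8 form is smaller; both decode to identical data.
print(len(utf8), len(ascii_safe))
```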
thomasrutter
+2  A: 

I had a problem there. When I JSON encode a string with a character like "é", every browser returns the same "é", except IE, which returns "\u00e9".

Then PHP's json_decode() fails when it finds "é", so for Firefox, Opera, Safari and Chrome I have to call utf8_encode() before json_decode().

Note: in my tests, IE and Firefox are using their native JSON object; the other browsers are using json2.js.
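Worth noting: both spellings are equally valid JSON and decode to the same string, so a compliant decoder should accept either form (a Python sketch for illustration):

```python
import json

# The escaped form IE produces and the raw form other browsers
# produce are the same string once decoded.
print(json.loads('"\\u00e9"') == json.loads('"é"'))
```

The failure described above is therefore a charset issue on the PHP side (json_decode() expects UTF-8 input), not a difference in the JSON itself.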

Olivier