Maybe this is just my unfamiliarity with unicode, so please correct me if I'm mistaken.

Looking at http://json.org/, the spec says that a string can include "any UNICODE character", but this confuses me.

  • JSON is a communication format, correct? At its core, everything must translate down to bytes.
  • In contrast, UNICODE is a logical format and must be encoded to be able to transmit it, right?

So what did they mean there?

+3  A: 

JSON is a serialization format whose strings can include Unicode characters. What is sent over the wire is the byte representation of that Unicode text, normally over HTTP, which uses HTTP headers to tell the client which encoding is in use (usually UTF-8).

Darin Dimitrov
+3  A: 

From the RFC:

3.  Encoding

   JSON text SHALL be encoded in Unicode.  The default encoding is
   UTF-8.

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8
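
The table above can be turned into a small detector. This is only a sketch of the rule quoted from the RFC, not an official implementation: the function name `detect_json_encoding` is made up for illustration, and it assumes the input has no byte order mark and starts with two ASCII characters, as the RFC text describes.

```python
import json


def detect_json_encoding(data: bytes) -> str:
    """Guess the UTF of a JSON octet stream from the nulls in its first four octets."""
    head = data[:4]
    if len(head) < 4:
        # Too short to apply the table; fall back to the default encoding.
        return "utf-8"
    nulls = tuple(octet == 0 for octet in head)
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Any other pattern (including no nulls at all) is treated as UTF-8.
    return patterns.get(nulls, "utf-8")


# Usage: encode the same JSON text in UTF-16LE, then detect and decode it.
raw = '{"name": "Zoë"}'.encode("utf-16-le")
encoding = detect_json_encoding(raw)
value = json.loads(raw.decode(encoding))
```
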
cobbal
+2  A: 

Reasonable question. JSON is oriented towards serialization/communication but, at its core, it is a text format. Hence it is correctly specified in terms of characters (units of text), not bytes.

The conversion of that text to/from bytes, that is, the charset encoding, is outside JSON itself. However, since JSON must support any Unicode text, a Unicode charset encoding should be used (normally UTF-8).
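
To make the text/bytes distinction concrete, here is a minimal Python sketch (the sample value and variable names are only for illustration). Serializing yields a string of characters; choosing an encoding is a separate step, and different encodings give different bytes for the same JSON text.

```python
import json

# Serializing produces text (a str of Unicode characters), not bytes.
text = json.dumps({"greeting": "héllo"}, ensure_ascii=False)

# Encoding is a separate step; the same characters give different bytes.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Decoding with the matching encoding recovers the same JSON text.
from_utf8 = json.loads(utf8_bytes.decode("utf-8"))
from_utf16 = json.loads(utf16_bytes.decode("utf-16-le"))

# With the default ensure_ascii=True, non-ASCII characters are written
# as \uXXXX escapes, so the text itself contains only ASCII characters.
escaped = json.dumps({"greeting": "héllo"})
```

The escaped form is one way to sidestep encoding problems, since ASCII-only text has the same bytes in ASCII and UTF-8.
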

leonbloy
A: 

You're correct that everything must translate into bytes, and that usually occurs through a UTF (Unicode Transformation Format). The JSON RFC explains in section 3 how to tell which UTF is being used.

Matthew Flaschen