ansaurus

Question

How should escaped unicode be handled by json parsers and encoders?

Answer 1

+1 A:

What do you mean by “restricted codepoint”? What spec are you looking at that uses that language? (I can't find any such.)

If you are talking about the surrogates then yes: JavaScript knows almost nothing(*) about surrogates and treats any sequence of UTF-16 code units as valid, paired or not. JSON, being limited to what JavaScript supports, does the same.

*: the only part of JS I can think of that does anything special with surrogates is the encodeURIComponent function, as it uses UTF-8 encoding, in which an attempt to encode an invalid surrogate sequence cannot work. If you try to:

encodeURIComponent('\ud834\udd1e'.substring(0, 1))

you will get an exception (a URIError).
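A minimal sketch of how you might guard against that, assuming you would rather get null back than an exception. The helper name tryEncode is made up for illustration; only encodeURIComponent and URIError are real built-ins.

```javascript
// Hypothetical helper (not a built-in): returns the percent-encoded string,
// or null when the input contains a lone surrogate that UTF-8 cannot encode.
function tryEncode(s) {
  try {
    return encodeURIComponent(s);
  } catch (e) {
    if (e instanceof URIError) return null; // malformed surrogate sequence
    throw e; // anything else is unexpected, so re-throw it
  }
}

// U+1D11E MUSICAL SYMBOL G CLEF, written as a surrogate pair.
const clef = '\ud834\udd1e';

console.log(tryEncode(clef));                 // '%F0%9D%84%9E'
console.log(tryEncode(clef.substring(0, 1))); // null (lone high surrogate)
```

The complete pair encodes fine; only the half-pair trips the error.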

(Gah! SO seems not to allow characters from outside the Basic Multilingual Plane to be posted directly. Tsk.)

bobince 2009-10-04 14:19:06

Answer 2

+3 A:

When you decode, it seems that this would be an appropriate use for the Unicode replacement character, U+FFFD.

From the Unicode Character Database:

used to replace an incoming character whose value is unknown or unrepresentable in Unicode
compare the use of U+001A as a control character to indicate the substitute function
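As a rough sketch of that idea in JavaScript: decode with JSON.parse as usual, then swap any unpaired surrogate for U+FFFD. The function name replaceLoneSurrogates is hypothetical, and doing the replacement as a separate pass after parsing is just one possible design, not something the JSON spec requires.

```javascript
// Hypothetical post-processing step: replace every unpaired surrogate
// code unit in a decoded string with U+FFFD, leaving valid pairs intact.
function replaceLoneSurrogates(s) {
  let out = '';
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {
      // High surrogate: valid only if a low surrogate follows immediately.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += s[i] + s[i + 1];
        i++; // skip the low half we just consumed
      } else {
        out += '\ufffd';
      }
    } else if (c >= 0xdc00 && c <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      out += '\ufffd';
    } else {
      out += s[i];
    }
  }
  return out;
}

// JSON.parse accepts the lone escape and yields an ill-formed string.
const decoded = JSON.parse('"a\\ud834b"');
console.log(replaceLoneSurrogates(decoded) === 'a\ufffdb'); // true
```

Newer JavaScript engines also provide String.prototype.toWellFormed(), which performs the same replacement, so the loop above is mainly useful where that method is unavailable.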

Adam Goode 2009-10-31 00:03:58
