views:

3562

answers:

3

The usual method of URL-encoding a unicode character is to split it into 2 %HH codes. (\u4161 => %41%61)

But, how is unicode distinguished when decoding? How do you know that %41%61 is \u4161 vs. \x41\x61 ("Aa")?

Are 8-bit characters, that require encoding, preceded by %00?

Or, is the point that unicode characters are supposed to be lost/split?

+4  A: 

According to Wikipedia:

Current standard

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary data or as character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.

Non-standard implementations

There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The third edition of ECMA-262 still includes an escape(string) function that uses this syntax, but also an encodeURI(uri) function that converts to UTF-8 and percent-encodes each octet.

So, it looks like its entirely up to the person writing the unencode method...Aren't standards fun?

FlySwat
A: 

What I've always done is first UTF-8 encode a Unicode string to make it a series of 8-bit characters before escaping any of those with %HH.

P.S. - I can only hope the non-standard implementations (%uxxxx) are few and far between.

Neil C. Obremski
A: 

Since URI's were introduced before unicode was around, or atleast in wide use, I imagine this is a very implementation specific question. UTF-8 encoding your text, then escaping that per normal sounds like the best idea, since that's completely backwards compatible with any ASCII/ANSI systems in place, though you might get the odd wierd character or two.

On the other end, to decode, you'd unescape your text, and get a UTF-8 string. If someone using an older system tries to send yours some data in ASCII/ANSI, there's no harm done, that's (almost) UTF-8 encoded already.

Matthew Scharley
This is exactly what should be used. The characters you mention may seem weird, but none of them will be control characters (that's how UTF-8 works) and this is really good.
Paweł Dyda