Efficient way to ASCII encode UTF-8

I'm looking for a simple and efficient way to store UTF-8 strings in 7-bit ASCII. By efficient I mean the following:

all ASCII alphanumeric chars in the input should stay the same ASCII alphanumeric chars in the output
the resulting string should be as short as possible
the operation needs to be reversible without any data loss
the resulting ASCII string should be case-insensitive
there should be no restriction on the input length
the whole UTF-8 range should be allowed

My first idea was to use Punycode (IDNA) as it fits the first four requirements, but it fails at the last two.

Can anyone recommend an alternative encoding scheme? Even better if there's some code available to look at.

+4 A:

UTF-7, or, slightly less transparent but more widespread, quoted-printable.

all ASCII chars in the input should stay ASCII chars in the output

(Obviously not fully possible as you need at least one character to act as an escape.)
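As a rough illustration of both suggestions and of the escape-character point, here is a minimal Python sketch using the standard library's `utf-7` codec and the `quopri` module. The helper names (`to_utf7`, `from_utf7`, `to_qp`, `from_qp`) are made up for this example.

```python
import quopri

def to_utf7(text: str) -> str:
    # Python ships a built-in "utf-7" codec; the result is pure ASCII.
    return text.encode("utf-7").decode("ascii")

def from_utf7(encoded: str) -> str:
    return encoded.encode("ascii").decode("utf-7")

def to_qp(text: str) -> str:
    # Quoted-printable works on bytes, so encode to UTF-8 first.
    return quopri.encodestring(text.encode("utf-8")).decode("ascii")

def from_qp(encoded: str) -> str:
    return quopri.decodestring(encoded.encode("ascii")).decode("utf-8")

sample = "Grüße, 日本"
print(to_utf7(sample))
print(to_qp(sample))

# The escape character itself cannot pass through unchanged:
print(to_utf7("a+b"))  # a+-b
print(to_qp("a=b"))    # a=3Db
```

Both round trips are lossless, but in each scheme one ASCII character ('+' or '=') has to be rewritten because it introduces an escape sequence.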

bobince 2010-04-02 15:02:31

You're reading the requirement as saying that ASCII chars in the input stay as *the same* ASCII chars in the output. That may be what he intended (in which case you're clearly correct) but it's not what he actually said -- and an encoding that fits the stated requirement is certainly possible.

Jerry Coffin 2010-04-02 15:16:54

Heh. Yes I meant ASCII chars should stay the same char. UTF-7 looks like a good candidate. Thanks for the hint.

Andreas Gohr 2010-04-02 15:28:47

@Andreas Gohr - UTF-7 does not leave the whole ASCII range unmodified (a literal '+', for example, is encoded as '+-').

Jeffrey L Whitledge 2010-04-02 15:37:50

UTF-7 seems to be case-sensitive which I'd like to avoid.
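The case sensitivity can be checked with a small Python sketch. The function name `survives_case_change` is hypothetical; it encodes a string as UTF-7, changes the case of the encoded form, and reports whether decoding still gives back the original.

```python
def survives_case_change(text: str) -> bool:
    """Return True if the UTF-7 form of `text` still decodes to `text`
    after being lower-cased and after being upper-cased."""
    encoded = text.encode("utf-7").decode("ascii")
    for variant in (encoded.lower(), encoded.upper()):
        try:
            if variant.encode("ascii").decode("utf-7") != text:
                return False
        except UnicodeDecodeError:
            # The case-changed Base64 run is no longer valid UTF-7.
            return False
    return True

print(survives_case_change("123"))  # True: digits have no case
print(survives_case_change("é"))    # False: the Base64 run is case-sensitive
```

The non-ASCII parts of UTF-7 are modified Base64, which uses upper- and lower-case letters as distinct digits, so any case folding corrupts them.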

Andreas Gohr 2010-04-02 15:39:16

+1 A:

Since ASCII already uses every 7-bit value, no encoding scheme can preserve all ASCII characters, use only 7 bits per character, and still encode the full Unicode range.

Edited to add:

I think I understand your requirements now. You are looking for a way to encode UTF-8 strings in a seven-bit code, in which, if that encoded string were interpreted as ASCII text, then the case of the alphabetic characters may be arbitrarily modified, and yet the decoded string will be byte-for-byte identical to the original.

If that's the case, then your best bet would probably be just to encode the binary representation of the original as a string of hexadecimal digits. I know you are looking for a more compact representation, but that's a pretty tall order given the other constraints of the system, unless some custom encoding is devised.

Since the hexadecimal representation can encode arbitrary binary values, it might be possible to shrink the string by compressing the original bytes before converting them to hex.
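A minimal Python sketch of the hex idea, with optional compression. The function names and the choice of `zlib` are assumptions made for illustration; the answer does not name a particular compressor.

```python
import zlib

def encode_hex(text: str, compress: bool = False) -> str:
    raw = text.encode("utf-8")
    if compress:
        # zlib is only one possible choice of compressor.
        raw = zlib.compress(raw)
    return raw.hex()

def decode_hex(encoded: str, compressed: bool = False) -> str:
    # bytes.fromhex accepts upper- and lower-case digits alike,
    # so the encoded form survives arbitrary case changes.
    raw = bytes.fromhex(encoded)
    if compressed:
        raw = zlib.decompress(raw)
    return raw.decode("utf-8")

original = "Grüße, 日本"
encoded = encode_hex(original)
print(encoded)
print(decode_hex(encoded.upper()) == original)  # True
```

Plain hex doubles the byte length and does not keep ASCII alphanumerics readable. Compression tends to pay off only for longer inputs; for short strings the zlib header and checksum usually make the result larger.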

Jeffrey L Whitledge 2010-04-02 15:02:54

URL encoding or numeric character references are two possible options.
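Both options are available in Python's standard library; the following sketch shows what they produce for a short sample string (the variable names are arbitrary).

```python
import html
from urllib.parse import quote, unquote

text = "naïve café"

# URL (percent) encoding of the UTF-8 bytes; safe="" escapes everything
# except unreserved characters.
url_encoded = quote(text, safe="")
print(url_encoded)  # na%C3%AFve%20caf%C3%A9

# Numeric character references for everything outside ASCII.
ncr_encoded = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ncr_encoded)  # na&#239;ve caf&#233;

print(unquote(url_encoded) == text)        # True
print(html.unescape(ncr_encoded) == text)  # True
```

Two caveats: the ASCII letters that pass through unchanged are still case-sensitive, and the character-reference variant is only reversible if the input does not already contain text that looks like a reference (such as a literal `&#233;` or `&amp;`), because `xmlcharrefreplace` does not escape '&'.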

toscho 2010-04-02 15:05:39

It depends on the distribution of characters in your strings.

Quoted-printable is good for mostly-ASCII strings because there's no overhead except with '=' and control characters. However, non-ASCII characters take an inefficient 6-12 bytes each, so if you have a lot of those, you'll want to consider UTF-7 or Base64 instead.
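To see how the distribution of characters affects the size, the encoded lengths can be compared directly. This is a small Python sketch; the helper name `encoded_sizes` and the sample strings are chosen only for illustration.

```python
import base64
import quopri

def encoded_sizes(text: str) -> dict:
    """Length in ASCII characters of `text` under each encoding."""
    raw = text.encode("utf-8")
    return {
        "utf-8 bytes": len(raw),
        "quoted-printable": len(quopri.encodestring(raw)),
        "utf-7": len(text.encode("utf-7")),
        "base64": len(base64.b64encode(raw)),
    }

for sample in ("hello world", "日本語"):
    print(sample, encoded_sizes(sample))
```

For the pure-ASCII sample, quoted-printable adds nothing. For the three CJK characters it needs 27 characters (three per UTF-8 byte), while Base64 needs 12 and UTF-7 is shorter still.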

dan04 2010-04-03 04:35:34
