If you have binary data that you need to encode, what encoding scheme do you use?

I know about:

  • Hex encoding. Very simple, but quite verbose, expands one byte to two.
  • Base 64. Most common, not so verbose, expands three bytes to four.
  • Base 85. Not common, less verbose again, expands four bytes to five.
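
To make the expansion ratios above concrete, here is a small sketch using only the Python standard library. The 12-byte sample input is an arbitrary choice (divisible by both 3 and 4, so no padding muddies the numbers), and `base64.b85encode` is just one Base 85 variant among several.

```python
import base64
import binascii

# Arbitrary sample input: 12 bytes, divisible by both 3 and 4,
# so neither Base64 nor Base85 needs padding here.
data = bytes(range(12))

hex_encoded = binascii.hexlify(data)   # 1 byte  -> 2 characters
b64_encoded = base64.b64encode(data)   # 3 bytes -> 4 characters
b85_encoded = base64.b85encode(data)   # 4 bytes -> 5 characters

# prints: 12 24 16 15
print(len(data), len(hex_encoded), len(b64_encoded), len(b85_encoded))
```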

Are there any other encoding schemes in common use? If so, what are their advantages and disadvantages?

Edit: This is useful, for example, when trying to store arbitrary data in a cookie. Cookies can only store text, not arbitrary data, so you need to convert it in some way, preferably with a way to convert it back. Further, assume that you are using a stateless server so that you cannot save the state on the server and just put an identifier into the cookie. Of course, if you do this you would also need some way of verifying that what the user is passing back to you is what you passed to the user, for example a signature.
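
A rough sketch of the signing idea, in Python. Everything named here is hypothetical: `SECRET_KEY`, `sign_value` and `verify_value` are illustrative names, HMAC-SHA256 is just one possible signature scheme, and hex is used for the text form only because it is the simplest of the encodings listed above.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical server-side secret; a real one would come from configuration.
SECRET_KEY = b"example-secret-key"
MAC_LENGTH = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def sign_value(payload: bytes) -> str:
    """Prefix the payload with an HMAC and return it as hex text."""
    mac = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return (mac + payload).hex()


def verify_value(token: str) -> Optional[bytes]:
    """Return the original payload, or None if the token was altered."""
    try:
        raw = bytes.fromhex(token)
    except ValueError:
        return None  # not valid hex at all
    mac, payload = raw[:MAC_LENGTH], raw[MAC_LENGTH:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing differences
    return payload if hmac.compare_digest(mac, expected) else None
```

The server only needs to keep the key; the state itself travels in the cookie, and a value that does not verify is simply rejected.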

Also, since the current consensus is that you should use base64 since it is widespread, I will also point out that this is what I use... I am just curious if anyone used anything else, and if so, why.

Edit: Just in case someone stumbles across this, if you do want to use Base64 to store data in a cookie, you need to use a modified Base64 implementation. See this answer for the reason why.

+2  A: 

Once upon a time, there was UTF-7. It's officially deprecated, but it still works as an ACE (ASCII Compatible Encoding). Now there's IDN.

bmargulies
+1 for pointing out IDN and UTF-7 as alternative ways of making Unicode safe for ASCII-only transports.
Paul Wagland
+1  A: 

Base64 is the de facto standard. Using anything else is asking for trouble.

shoosh
+1 for pointing out uuencode.
Paul Wagland
`uuencoding` takes me back to the 1992 usenet days :) Actually, `uuencoding` was largely supplanted in usenet usage by `yenc` (http://www.yenc.org/).
skaffman
yenc. Blimey, that takes you back... then you start thinking about xmodem/ymodem and zmodem to get it from the shell server to your home machine ;-)
Paul Wagland
+4  A: 

For encoding cookie values, you need to be careful. See this older answer:

With Version 0 cookies, values should not contain white space, brackets, parentheses, equals signs, commas, double quotes, slashes, question marks, at signs, colons, and semicolons. Empty values may not behave the same way on all browsers.

Base64 encoding can generate = symbols for certain inputs, and this technically is not permitted in cookies (version 0 cookies, anyway, which are the most widely supported). In practice, I suspect the = will actually work fine, but maybe not.

To be absolutely sure that your encoded binary is cookie-compatible, I would suggest basic hex encoding as the safest option (e.g. in Java).

edit: As @Paul helpfully pointed out, there is a modified version of Base 64 that is "URL safe" (and, I assume, "cookie safe"). Using a modified version of a standard algorithm rather dilutes its charm, mind you.
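
As a quick sanity check, here is a Python sketch that compares each encoding's output alphabet against the Version 0 restrictions quoted above. The `FORBIDDEN` set is my own transcription of that quote, and reading "brackets" as `[]` and "slashes" as both `/` and `\` is an assumption.

```python
import string

# Characters the quoted Version 0 rule says to avoid (my transcription;
# treating "brackets" as [] and "slashes" as / and \ is an assumption).
FORBIDDEN = set(' \t[]()=,"/\\?@:;')

hex_alphabet = set("0123456789abcdef")
b64_alphabet = set(string.ascii_letters + string.digits + "+/=")
urlsafe_alphabet = set(string.ascii_letters + string.digits + "-_=")

print(sorted(hex_alphabet & FORBIDDEN))      # []
print(sorted(b64_alphabet & FORBIDDEN))      # ['/', '=']
print(sorted(urlsafe_alphabet & FORBIDDEN))  # ['=']
```

So by this reading, hex is clean, standard Base64 has two problem characters, and the URL-safe variant still leaves the `=` padding to deal with.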

edit: @shoosh pointed out that the = is only used to denote the end of the base64 string, so you could trim the =, set the cookie, then reattach the = again when you need to decode it.
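
The trim-and-reattach approach could look something like this in Python. The function names are made up for illustration, and I have used the URL-safe alphabet as well so that `/` does not appear either. Since Base64 output is always padded to a multiple of four characters, the number of `=` to restore can be worked out from the length alone.

```python
import base64


def encode_cookie_value(data: bytes) -> str:
    """URL-safe Base64 with the trailing '=' padding trimmed off."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_cookie_value(text: str) -> bytes:
    """Reattach the padding that was trimmed, then decode."""
    padding = "=" * (-len(text) % 4)  # 0, 1 or 2 characters for valid input
    return base64.urlsafe_b64decode(text + padding)


value = encode_cookie_value(b"\xfb\xff\x00binary")
assert decode_cookie_value(value) == b"\xfb\xff\x00binary"
```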

skaffman
+1, nice warning about `=` character, ty
Rubens Farias
In base64 the '=' is used only for padding the last bytes. You can either ensure that they are not emitted or just change them to something else (and then back again)
shoosh
You might want to consider updating your answer to refer to http://en.wikipedia.org/wiki/Base64#URL%5Fapplications - a version of Base64 specifically for use in the HTTP environment.
Paul Wagland
@Paul, thanks, edited. @Shoosh, this is true, yes, so perhaps it would be OK to manually trim the trailing `=` before setting the cookie, and reattach it before decoding.
skaffman
I accepted this answer, since even though my cookie was only "an example" and not what I actually use this for, there is useful warning information here for others. Added to that, the consensus is definitely to go for Base64, and that is what the answer recommends.
Paul Wagland
+2  A: 

Base64 wins because it's so common that I don't have to ever worry about rolling my own encoder/decoder. I haven't run into any applications where I've been worried about saving bandwidth or filespace in encoded binary data.

jball
upvoted, since you were the first to say this, in the comments on the question.
Paul Wagland
+1, here you go, jball =)
Rubens Farias