views: 42
answers: 2
I'm trying to understand what the input requirements are for base64 encoding. Nicholas Zakas, whom I have tremendous respect for, has an article where he quotes a specification saying that an error should be thrown if the input contains any character with a code higher than 255: Zakas Article on base64

Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters. Since base64 encoding requires eight bits per input character, any character with a code higher than 255 cannot be accurately represented. The specification indicates that an error should be thrown in this case:

    if (/([^\u0000-\u00ff])/.test(text)) {
        throw new Error("Can't base64 encode non-ASCII characters.");
    }

Elsewhere in the article he links to RFC 3548, but I don't see any input requirement there other than:

Implementations MUST reject the encoding if it contains characters outside the base alphabet when interpreting base encoded data, unless the specification referring to this document explicitly states otherwise.

I'm not sure what "base alphabet" means, but perhaps this is what Zakas is referring to. By saying implementations must reject the encoding, though, the RFC seems to be describing data that has already been encoded rather than the input (of course, if the input is invalid that will also show up in the encoding, so perhaps the point is moot).
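(For reference, the "base alphabet" in that quote is just the 64 characters A-Z, a-z, 0-9, + and /, plus = for padding. A minimal sketch of the rejection rule the RFC describes, applied when decoding; the function name is my own, and this checks only the alphabet, not lengths or exact padding placement:)

    // Illustrative sketch (not from the article): reject base64 data
    // containing characters outside the base alphabet.
    function checkBase64Alphabet(encoded) {
        if (!/^[A-Za-z0-9+\/]*={0,2}$/.test(encoded)) {
            throw new Error("Character outside the base64 alphabet.");
        }
    }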

I'm a bit confused about what the standard actually requires.

+2  A: 

Fundamentally, it's a mistake to talk about "base64 encoding a string" where "string" is meant in terms of text.

Base64 encoding is applied to binary data (a sequence of bytes, or octets if you want to be even more picky), and the result is text. Every character in the output is printable ASCII. The whole point of base64 is to provide a safe way of converting arbitrary binary data into a text format which can be reliably embedded in other text, transported, etc. ASCII is compatible with almost all character sets, so you're very unlikely to be unable to encode ASCII text as part of something else.

When someone talks about "base64 encoding a string", they're really talking about encoding text as binary using some existing encoding (e.g. UTF-8), then applying base64 encoding to the result. When decoding, you'd need to decode the base64 back to binary, and then decode that binary data with the original encoding to get the original text (see the sketch below).

Jon Skeet
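As an illustrative sketch of those two steps (my example, not from the answer, assuming a modern JavaScript environment with `TextEncoder`, `TextDecoder`, `btoa` and `atob`):

    // Encode: text -> UTF-8 bytes -> base64 text.
    function base64EncodeText(text) {
        const bytes = new TextEncoder().encode(text);  // text -> UTF-8 bytes
        let binary = "";
        for (const b of bytes) {
            binary += String.fromCharCode(b);          // one char per byte
        }
        return btoa(binary);                           // bytes -> base64
    }

    // Decode: base64 text -> UTF-8 bytes -> text.
    function base64DecodeText(encoded) {
        const binary = atob(encoded);                  // base64 -> bytes
        const bytes = Uint8Array.from(binary, c => c.charCodeAt(0));
        return new TextDecoder().decode(bytes);        // UTF-8 bytes -> text
    }

    // base64EncodeText("héllo") === "aMOpbGxv"
    // base64DecodeText("aMOpbGxv") === "héllo"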
+2  A: 

For me the (first) linked article has a fundamental problem:

Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters

You don't base64 encode strings. You base64 encode byte sequences. And when you're dealing with any kind of encoding work, it's extremely important to keep this difference in mind.

Also, his check for 'ASCII' actually lets through everything from 0x80 to 0xFF, which isn't ASCII; ASCII is only 0x00 to 0x7F.

Now, if you have a string which you have checked is pure ASCII, you can then safely treat it as the sequence of bytes given by the ASCII values of its characters - but this is a separate, earlier step, nothing strictly to do with the act of base64 encoding.
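A sketch of that separate step (my illustration, not the article's code): check against \u007f rather than \u00ff, and only then treat the char codes as bytes:

    // Sketch: a genuine ASCII check uses \u007f as the upper bound.
    function asciiStringToBytes(text) {
        if (/[^\u0000-\u007f]/.test(text)) {
            throw new Error("Not pure ASCII.");
        }
        // Every char code now fits in one byte.
        return Uint8Array.from(text, c => c.charCodeAt(0));
    }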

(I should say that I do like his repeated reminders to the reader that base64 encoding is not in any shape or form encryption.)

AakashM
Thanks for the reply. Are there any prerequisites for these byte sequences, or can any arbitrary sequence of bytes be base64 encoded?
Rob
@Rob any byte sequence at all. The article is actually quite good at explaining how the 24 bits in any 3 bytes are split into 4 groups of 6 bits, which are then mapped to characters in the base64 alphabet (sketched below).
AakashM
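A minimal sketch of that split (my illustration, ignoring the padding needed when the input length isn't a multiple of 3):

    const ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Pack 3 bytes into 24 bits, then read them back as 4 six-bit indexes.
    function encodeGroup(b1, b2, b3) {
        const bits = (b1 << 16) | (b2 << 8) | b3;
        return ALPHABET[(bits >> 18) & 63] +
               ALPHABET[(bits >> 12) & 63] +
               ALPHABET[(bits >> 6) & 63] +
               ALPHABET[bits & 63];
    }

    // encodeGroup(77, 97, 110) === "TWFu"   // the bytes of "Man"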
@AakashM Ok thanks. So then the test from his code could/should be omitted?
Rob
@Rob that starts to get into a different issue. His code uses `charCodeAt` to pull characters from the string - now, my Javascript is really not good enough to be able to say how that will handle character encoding issues. I *think* Javascript strings are always internally UTF-16, but don't trust that. He needs this test because his subsequent code treats `cur` as a byte; but that's the only reason for the test, I think. If you code for `cur` to be anything from 0 to 65535, then the test can go (see the illustration below).
AakashM
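For illustration (my addition): `charCodeAt` does return a UTF-16 code unit in the range 0 to 65535, so the test really is guarding the code's treat-it-as-a-byte assumption:

    "A".charCodeAt(0);  // 65    - fits in a byte
    "é".charCodeAt(0);  // 233   - still fits in a byte
    "€".charCodeAt(0);  // 8364  - a UTF-16 code unit, not a byte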
@AakashM Yeah, looking at the code again, he's definitely counting on each value being a byte and then using bitwise operations to turn three 8-bit groups into four 6-bit groups. So it looks like this code only works for a limited character set. I don't even think you can manipulate bits in Javascript directly (besides the bitwise operations he's using). Perhaps this is the limitation. I need to look for some other examples.
Rob