Is it expected behavior that two encodings can map to the same decoding? I'm trying to troubleshoot a digital signature issue by doing sanity checks on base64-encoded intermediate strings.

For example, the following base64 encoding:

R0VUDQoNCg0KRnJpLCAwNCBTZXAgMjAwOSAxMTowNTo0OSBHTVQrMDA6MDANCi8=

and:

R0VUCgoKRnJpLCAwNCBTZXAgMjAwOSAxMDozMzoyOCBHTVQrMDA6MDAKLw==

both decode to:

GET


Fri, 04 Sep 2009 11:05:49 GMT+00:00
/

(With the characters escaped, this is: GET\n\n\nFri, 04 Sep 2009 11:05:49 GMT+00:00\n/)

The first encoding comes from testing two online base64 encoders.

The second encoding comes from an Objective-C base64 encoder available here.

Is there something wrong with the result I'm generating with the Obj-C encoder?

+7  A: 

The two encoded strings match wherever they correspond to alphanumeric characters and differ wherever they correspond to line breaks. So somewhere along the encode/decode path, the software handles line breaks (CR (\r), LF (\n), or CRLF (\r\n)) differently, and that is why you get different results.

Other than that, there are no two different ways to encode a given byte sequence into Base64, and no two different ways to decode valid Base64-encoded data.
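To make that concrete, here is a minimal Python sketch. It assumes the text from the question is plain ASCII and that only the line-break convention changes between the two runs; the variable names are just for illustration.

```python
import base64

# The request text from the question, with "\n" standing in for each line break.
text = "GET\n\n\nFri, 04 Sep 2009 11:05:49 GMT+00:00\n/"

# Same characters, two line-break conventions, therefore two byte sequences.
lf_bytes = text.encode("ascii")
crlf_bytes = text.replace("\n", "\r\n").encode("ascii")

lf_encoded = base64.b64encode(lf_bytes).decode("ascii")
crlf_encoded = base64.b64encode(crlf_bytes).decode("ascii")

print(crlf_encoded)  # R0VUDQoNCg0KRnJpLCAwNCBTZXAgMjAwOSAxMTowNTo0OSBHTVQrMDA6MDANCi8=
print(lf_encoded)
print(crlf_encoded == lf_encoded)  # False
```

The CRLF variant reproduces the first string from the question exactly, which suggests the online encoders were fed CRLF line breaks.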

sharptooth
+4  A: 

Actually, they don't decode to the same thing.

$ echo 'R0VUCgoKRnJpLCAwNCBTZXAgMjAwOSAxMDozMzoyOCBHTVQrMDA6MDAKLw==' | base64 -d | hexdump 
0000000 4547 0a54 0a0a 7246 2c69 3020 2034 6553
0000010 2070 3032 3930 3120 3a30 3333 323a 2038
0000020 4d47 2b54 3030 303a 0a30 002f          
000002b
$ echo 'R0VUDQoNCg0KRnJpLCAwNCBTZXAgMjAwOSAxMTowNTo0OSBHTVQrMDA6MDANCi8=' | base64 -d | hexdump
0000000 4547 0d54 0d0a 0d0a 460a 6972 202c 3430
0000010 5320 7065 3220 3030 2039 3131 303a 3a35
0000020 3934 4720 544d 302b 3a30 3030 0a0d 002f
000002f
moonshadow
+2  A: 

The key is that Base64 strings decode to sequences of bytes, not characters. Comparing the byte arrays produced by each of your Base64 strings shows that the difference lies in line termination: wherever the first has a 13 (CR) followed by a 10 (LF), the second has just a 10. This is the standard Windows-vs-Unix line termination difference.
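A quick way to look at the raw byte values, sketched in Python 3 (the names `first` and `second` are only illustrative):

```python
import base64

first = base64.b64decode(
    "R0VUDQoNCg0KRnJpLCAwNCBTZXAgMjAwOSAxMTowNTo0OSBHTVQrMDA6MDANCi8="
)
second = base64.b64decode(
    "R0VUCgoKRnJpLCAwNCBTZXAgMjAwOSAxMDozMzoyOCBHTVQrMDA6MDAKLw=="
)

# "GET" is 71, 69, 84; after that the line terminators show up as 13/10 pairs or bare 10s.
print(list(first[:9]))   # [71, 69, 84, 13, 10, 13, 10, 13, 10]
print(list(second[:6]))  # [71, 69, 84, 10, 10, 10]

# Count CRLF pairs in each decoded byte string.
print(first.count(b"\r\n"), second.count(b"\r\n"))  # 4 0
```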

AakashM
+3  A: 

As @sharptooth suggested, the line breaks are \r\n in the first one, \n in the second one.

>>> import base64
>>> base64.b64decode("R0VUDQoNCg0KRnJpLCAwNCBTZXAgMjAwOSAxMTowNTo0OSBHTVQrMDA6MDANCi8=")
b'GET\r\n\r\n\r\nFri, 04 Sep 2009 11:05:49 GMT+00:00\r\n/'
>>> base64.b64decode("R0VUCgoKRnJpLCAwNCBTZXAgMjAwOSAxMDozMzoyOCBHTVQrMDA6MDAKLw==")
b'GET\n\n\nFri, 04 Sep 2009 10:33:28 GMT+00:00\n/'
Mark Rushakoff
+12  A: 

Another example to prove that the strings are not equal, using Python:

>>> from base64 import b64decode as d
>>> a = "R0VUDQoNCg0KRnJpLCAwNCBTZXAgMjAwOSAxMTowNTo0OSBHTVQrMDA6MDANCi8="
>>> b = "R0VUCgoKRnJpLCAwNCBTZXAgMjAwOSAxMDozMzoyOCBHTVQrMDA6MDAKLw=="
>>> d(a)
b'GET\r\n\r\n\r\nFri, 04 Sep 2009 11:05:49 GMT+00:00\r\n/'
>>> d(b)
b'GET\n\n\nFri, 04 Sep 2009 10:33:28 GMT+00:00\n/'
>>> d(a) == d(b)
False

The longer string uses CRLF line breaks, the shorter one plain LFs.
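If you want to check what is left once line endings are ignored, one possible sketch is to compare the decoded payloads line by line. `bytes.splitlines()` treats both \r\n and \n as line boundaries; the variable names here are only illustrative.

```python
import base64

a = base64.b64decode(
    "R0VUDQoNCg0KRnJpLCAwNCBTZXAgMjAwOSAxMTowNTo0OSBHTVQrMDA6MDANCi8="
)
b = base64.b64decode(
    "R0VUCgoKRnJpLCAwNCBTZXAgMjAwOSAxMDozMzoyOCBHTVQrMDA6MDAKLw=="
)

print(len(a), len(b))  # 47 43

# Collect the lines that still differ after the terminators are stripped.
mismatches = [
    (line_a, line_b)
    for line_a, line_b in zip(a.splitlines(), b.splitlines())
    if line_a != line_b
]
for line_a, line_b in mismatches:
    print(line_a, "!=", line_b)
```

For these two strings the only remaining mismatch is the date line (11:05:49 versus 10:33:28), so the two encoders were also given different timestamps.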

Ferdinand Beyer
+1 for the clear answer, and for whipping up an answer in Python. :-)
Quinn Taylor