Hello from Germany

My problem is probably very simple. I need to decode Base64 repeatedly until no Base64 is left. I check with a regex whether the text is still Base64, but I have no idea how to keep decoding until none remains.

In this short code I can decode the Base64 until none is left, but only because the final text is known in advance: the loop keeps decoding until the result equals "Hello World".

# Import Libraries
from base64 import b64decode
import re

# Text & Base64 String
strText = "Hello World"
strEncode = "VmxSQ2ExWXlUWGxUYTJoUVVqSlNXRlJYY0hOT1ZteHlXa1pLVVZWWE9EbERaejA5Q2c9PQo=".encode("utf-8")

# Decode
objRgx = re.search('^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$', strEncode.decode("utf-8"))

strDecode = b64decode(objRgx.group(0).encode("utf-8"))

print(strDecode.decode("utf-8"))

while strDecode != strText.encode("utf-8"):
    strDecode = b64decode(strDecode)

    print(strDecode.decode("utf-8"))

Does anyone have an idea how I can keep decoding the Base64 until only the real text is left (no more Base64)?

P.S. Sorry for my bad English.

+2  A: 

As a heuristic, you could compute the average word length in the result. Natural language is made of short words separated by spaces, as in "As a heuristic, you could look at word length." A string that is still Base64-encoded will have few if any spaces and long runs of characters between them.

As another heuristic, you could calculate the proportion of vowels (a, e, i, o, u) to consonants, or count the capital letters in the middle of words.
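The heuristics above could be sketched roughly as follows. The function names and the threshold of 10 characters are illustrative assumptions, not tuned values, and short Base64 strings can still slip through.

```python
import re

VOWELS = set("aeiouAEIOU")


def average_word_length(text: str) -> float:
    """Mean length of the whitespace-separated words in text."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


def vowel_ratio(text: str) -> float:
    """Fraction of the alphabetic characters that are vowels."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in VOWELS for ch in letters) / len(letters)


def mid_word_capitals(text: str) -> int:
    """Count capital letters that directly follow another letter."""
    return len(re.findall(r"(?<=[A-Za-z])[A-Z]", text))


def looks_like_plain_text(text: str, max_avg_word_len: float = 10.0) -> bool:
    """Guess (only guess) whether text is natural language.

    max_avg_word_len is an assumed threshold chosen for illustration.
    """
    return average_word_length(text) <= max_avg_word_len
```

A decoding loop could stop as soon as `looks_like_plain_text` returns true, keeping in mind that this is a guess and not a proof.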

Mark Lutton
A: 

You can't, not in the general case. The problem is simply that normal, everyday words can ALSO be valid BASE64, so there's no real way to tell the two apart.

BASE64 doesn't have a terminator other than length. It CAN be terminated with = or == but does not HAVE to be. The = characters are just padding: if no padding is needed, there is no =. So it's possible that the BASE64 will end and some text will begin, without you being able to detect it.
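To make the ambiguity concrete, here is a small sketch using Python's standard `base64` module. The helper name `is_valid_base64` is made up for illustration. It shows that a strict validity check accepts some ordinary words, so validity alone cannot tell text from encoded data.

```python
import base64


def is_valid_base64(text: str) -> bool:
    """Return True if text passes strict Base64 validation."""
    try:
        base64.b64decode(text, validate=True)
        return True
    except ValueError:
        # binascii.Error (bad character or bad padding) is a ValueError
        # subclass; non-ASCII input raises ValueError directly.
        return False


# Four-letter words made only of alphabet characters are valid Base64.
print(is_valid_base64("word"))         # True
print(is_valid_base64("test"))         # True
# A space is outside the Base64 alphabet, so this one is rejected.
print(is_valid_base64("Hello World"))  # False
```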

Edit for "So there is really no way to do what i want?":

No, not deterministically, not reliably. Even with a heuristic, there will be cases where it fails and you end up consuming too many characters, resulting in garbage at the end of your binary block and loss of characters from the following text stream.

Now this is for an arbitrary BASE64 block. If you KNOW what the binary data is, then perhaps there's hope.

For example, most binary formats "know" when they are "done". I don't know of a valid binary format that says "read until you reach EOF". They're typically laced with internal descriptors saying "this is how much data the next chunk has" or with terminators saying "I'm done".

In these cases you can treat the BASE64 as a stream. BASE64 is basically pretty simple: it takes 3 bytes and converts them into 4 characters.

So, a B64 stream reader needs to simply read 4 chars and return the 3 bytes they represent.
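A minimal sketch of such a reader in Python might look like this. The generator name `iter_decoded_chunks` is hypothetical. It assumes the input contains no whitespace, and that a final short group, if any, is padded with `=`.

```python
import base64
from typing import Iterator


def iter_decoded_chunks(encoded: str) -> Iterator[bytes]:
    """Yield decoded bytes, one 4-character Base64 group at a time.

    Each full group yields 3 bytes; a padded final group yields 1 or 2.
    A trailing fragment shorter than 4 characters is ignored.
    Invalid groups raise binascii.Error.
    """
    for start in range(0, len(encoded) - 3, 4):
        group = encoded[start:start + 4]
        yield base64.b64decode(group, validate=True)


# A consumer can pull bytes as it needs them and stop whenever it is done.
for chunk in iter_decoded_chunks("SGVsbG8gV29ybGQ="):
    print(chunk)
```

A format-aware consumer, such as the PNG reader mentioned here, would stop pulling chunks once its own end marker has been seen.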

If you have, say, a PNG reader, it can start reading the converted stream. And when it is "done", it "closes" the stream, and your original text is "at the end of the BASE64".

It can also work if you know the size of the original attachment. If someone sent "10,000 bytes", then you use your BASE64 stream decoder and simply read "10,000" bytes from it.
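When the size is known, the split point can be computed directly. The sketch below is one possible way to do it. The function name `split_known_size` is an assumption for illustration, and it assumes the encoded data starts at the beginning of the string with no embedded whitespace.

```python
import base64
from typing import Tuple


def split_known_size(stream: str, payload_size: int) -> Tuple[bytes, str]:
    """Decode exactly payload_size bytes from the front of stream.

    Returns (payload, remaining_text). Works whether or not the sender
    included the trailing '=' padding.
    """
    full_groups, remainder = divmod(payload_size, 3)
    # 1 leftover byte needs 2 characters, 2 leftover bytes need 3.
    data_chars = full_groups * 4 + (remainder + 1 if remainder else 0)
    pad_chars = (4 - data_chars % 4) % 4

    # Re-add the padding ourselves so the strict decoder accepts the input.
    payload = base64.b64decode(stream[:data_chars] + "=" * pad_chars,
                               validate=True)

    end = data_chars
    # Skip the padding in the stream only if it is actually present.
    if stream[end:end + pad_chars] == "=" * pad_chars:
        end += pad_chars
    return payload, stream[end:]


payload, rest = split_known_size("SGVsbG8gV29ybGQ=and then text", 11)
print(payload)  # b'Hello World'
print(rest)     # and then text
```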

More often than not, you will have BASE64 with a = or == terminator. It's the cases where you don't that are the problem. The stream decoder works either way.

If you don't know the original size of the attachment, or the format of the encoded binary, then you're pretty much out of luck.

Will Hartung
So there is really no way to do what i want?
A: 

So you're dealing with a block of data that may have been repeatedly base64-encoded? Why not just loop the string through b64decode() until it errors, then?
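A rough sketch of that loop follows. The function name `decode_repeatedly` and the `max_rounds` safety cap are assumptions added for illustration. It uses strict validation so that text containing spaces or punctuation stops the loop. Note the caveat from the other answer: plain text that happens to be valid Base64 would be decoded one round too far.

```python
import base64
import binascii


def decode_repeatedly(data: bytes, max_rounds: int = 50) -> bytes:
    """Base64-decode data until decoding fails, then return the last result.

    max_rounds is an arbitrary cap to guard against pathological input.
    """
    for _ in range(max_rounds):
        try:
            # strip() drops a trailing newline left by some encoders.
            decoded = base64.b64decode(data.strip(), validate=True)
        except binascii.Error:
            break  # not valid Base64 any more, so keep the previous layer
        if not decoded:
            break  # nothing left to decode
        data = decoded
    return data


# Wrap a message in three layers of Base64, then peel them off again.
wrapped = b"Hello World"
for _ in range(3):
    wrapped = base64.b64encode(wrapped)
print(decode_repeatedly(wrapped))  # b'Hello World'
```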

Also I think you probably don't need to sprinkle quite so many .encode("utf-8") around.

Zack
I think he means that he doesn't necessarily know where the base64 data ends, not that the data has been encoded an indeterminate number of times.
hughdbrown