views:

217

answers:

6

If an encrypted file exists and someone wants to decrypt it, there are several methods do try. For example, if you would chose a brute force attack, that's easy: just try all possible keys and you will find the correct one. For this question, it doesn't matter that this might take too long. But trying keys means the following steps:

  1. Chose key
  2. Decrypt data with key
  3. Check if decryption was successful

Besides the problem that you would need to know the algorithm that was used for the encryption, I cannot imagine how one would do #3.

Here is why: After decrypting the data, I get some "other" data. In case of an encrypted plain text file in a language that I can understand, I can now check if the result is a text in that langauge. If it would be a known file type, I could check for specific file headers.

But since one tries to decrypt something secret, it is most likely unknown what kind of information there will be if correctly decrypted.

How would one check if a decryption result is correct if it is unknown what to look for?

A: 

are you seriously asking questions like this? well if it was known whats inside then you would not need to decrypt it anywayz right?

somehow this has nothing to do with programming question its more mathematical. I took some encryption math classes at my university.

And you can not confirm without a lot of data points. Sure if your result makes sense and its clear it is meaningful in plain english (or whatever language is used) but to answer your question.

If you were able to decrypt you should be able to encrypt as well. So encrypt the result using reverse process of decryption and if you get same results you might be golden...if not something is possibly wrong.

grobartn
i love -2 without explenation
grobartn
Well the question is clearly related to programming - practically all of cryptography is - and it's obvious that his problem is that he doesn't know the key, not that he doesn't know the algorithm.
jprete
Correct, even if you know the algorithm, you would by using cryptanalysis only be able to find anything if the original data was a known file format for you. If this is the only chance for a decrypter, why would the person encrypting make such a mistake in chosing a known file format?
Holgerwa
@grobartn: I think it is very much related to programming. If I use encryption/decryption algorithms, the more you know about it, the better. Of course, that's the same for all subjects :)
Holgerwa
A: 

One of the ways is compressing the source data with some standard algorithm like zip. If after decryption you can unzip the result - it's decrypted right. Compression is almost usually done by encryption programs prior to encryption - because it's another step the bruteforcer will need to repeat for each trial and lose time on it and because encrypted data is almost surely uncompressible (size doesn't decrease after compression with a chained algorithm).

sharptooth
+4  A: 

You could use heuristics like the unix

file

command does to check for a known file type. If you have decrypted data that has no recognizable type, decrypting it won't help you anyway, since you can't interpret it, so it's still as good as encrypted.

balpha
Very good point. Existential cryptography - If you don't know what you're looking for, how will you know when you find it? :)
Tom
"Existencial cryptography" -- I like that.
balpha
Yes, exactly! But that means if you don't know what you are looking for, because it is a file format you don't know, you have no chance of ever decrypting it.There must be a flaw here, it would be an unbreakable encryption; besides getting the key from someone who knows it.
Holgerwa
@Holgerwa: Using a secret file format is equivalent to using a secret encryption method, and the latter has long been accepted by most people to be a flawed idea ("security by obscurity"). See http://en.wikipedia.org/wiki/Kerckhoffs%27_principle
balpha
A: 

Without a more clearly defined scenario, I can only point to cryptanalysis methods. I would say it's generally accepted that validating the result is an easy part of cryptanalysis. In comparison to decrypting even a known cypher, a thorough validation check costs little cpu.

Rob Elliott
I don't think you're correct. When you're doing a brute force attack, you need to validate whether you've found the correct one for each key that you try, and it could easily take longer than the decryption and thus multiply the overall time.
Michael Borgwardt
But always under the assumption that the data you are trying to decrypt is in a format you know. Knowing this, wouldn't a person trying to keep a secret just for this reason make up his/her own file format?
Holgerwa
@Holgerwa: That isn't really practical. In most systems, you have access to a plaintext message somewhere (think publically deployed cryptosystems, for example), and if you don't, after the first break you have one, and all other breaks become much easier.
Nick Johnson
+3  A: 

Like you suggest, one would expect the plaintext to be of some know format, e.g., a JPEG image, a PDF file, etc. The idea would be that it is very unlikely that a given ciphertext can be decrypted into both a valid JPEG image and a valid PDF file (but see below).

But it is actually not that important. When one talks about a cryptosystem being secure, one (roughly) talks about the odds of you being able to guess the plaintext corresponding to a given ciphertext. So I pick a random message m and encrypts it c = E(m). I give you c and if you cannot guess m, then we say the cryptosystem is secure, otherwise it's broken.

This is just a simple security definition. There are other definitions that require the system to be able to hide known plaintexts (semantic security): you give me two messages, I encrypt one of them, and you will not be able to tell which message I chose.

The point is, that in these definitions, we are not concerned with the format of the plaintexts, all we require is that you cannot guess the plaintext that was encrypted. So there is no step 3 :-)

By not considering your step 3, we make the question of security as clear as possible: instead of arguing about how hard it is to guess which format you used (zip, gzip, bzip2, ...) we only talk about the odds of breaking the system compared to the odds of guessing the key. It is an old principle that you should concentrate all your security in the key -- it simplifies things dramatically when your only assumption is the secrecy of the key.

Finally, note that some encryption schemes makes it impossible for you to verify if you have the correct key since all keys are legal. The one-time pad is an extreme example such a scheme: you take your plaintext m, choose a perfectly random key k and compute the ciphertext as c = m XOR k. This gives you a completely random ciphertext, it is perfectly secure (the only perfectly secure cryptosystem, btw).

When searching for an encryption key, you cannot know when you've found the right one. This is because c could be an encryption of any file with the same length as m: if you encrypt the message m' with the key *k' = c XOR m' you'll see that you get the same ciphertext again, thus you cannot know if m or m' was the original message.

Instead of thinking of exclusive-or, you can think of the one-time pad like this: I give you the number 42 and tell you that is is the sum of two integers (negative, positive, you don't know). One integer is the message, the other is the key and 42 is the ciphertext. Like above, it makes no sense for you to guess the key -- if you want the message to be 100, you claim the key is -58, if you want the message to be 0, you claim the key is 42, etc. One time pad works exactly like this, but on bit values instead.

About reusing the key in one-time pad: let's say my key is 7 and you see the ciphertexts 10 and 20, corresponding to plaintexts 3 and 13. From the ciphertexts alone, you now know that the difference in plaintexts is 10. If you somehow gain knowledge of one of the plaintext, you can now derive the other! If the numbers correspond to individual letters, you can begin looking at several such differences and try to solve the resulting crossword puzzle (or let a program do it based on frequency analysis of the language in question).

Martin Geisler
I was in the middle of writing a similar answer... It's really a question of - *I get a file, how do I know it's ok*. One important point about the one-time pad encryption is that you can find a key that would decrypt your message into any other message you want - making it a perfect encryption (aside from human factors, like bribe or espionage).
Kobi
@Kobi: exactly, the one-time pad is perfect, except for the fact that you have to distribute tons of keys (each key can only be used once), guard against bribery, etc... :-)
Martin Geisler
"When searching for an encryption key, you cannot know when you've found the right one." - doesn't this basically mean that trying to decrypt anything is senseless? I you cannot tell that you found what you were looking for, why even start? (I think of file formats you don't know, not plain text or anything usual)If I encrypt an already encrypted file, nobody could ever decrypt it if they would not get the key from me, because there would be no decrypting method. Sounds too safe to be true.
Holgerwa
@Holderwa: you're right, it makes no sense to begin searching for the key. I've updated the answer with another explanation of how the one-time pad works. It's not too safe to be true -- the problem lies in the fact that you must distribute a key securely to the receiver, and that you must use a *new* key for each message.
Martin Geisler
@Martin: Thanks a lot, very good explanations. Now that also means, that if you can get rid of the key distribution problem, you are all set. Specifically, if you don't want to encrypt for a transmission to someone else, but for keeping your own, personal or business information safe where you are the only one having the key.
Holgerwa
@Holgerwa: well, if you don't transmit the message, it would be easier to keep the message secret instead of encrypting it and keeping the key secret. You can not just keep a short key secret and use it for several documents, if you do that you lose the security -- I've added a bit about that to the answer.
Martin Geisler
+2  A: 

I wrote a tool a little while ago that checked if a file was possibly encrypted by simply checking the distribution of byte values, since encrypted files should be indistinguishable from random noise. The assumption here then is that an improperly decrypted file retains the random nature, while a properly decrypted file will exhibit structure.

#!/usr/bin/env python

import math
import sys
import os

MAGIC_COEFF=3

def get_random_bytes(filename):
        BLOCK_SIZE=1024*1024
        BLOCKS=10

        f=open(filename)
        bytes=list(f.read(BLOCK_SIZE))

        if len(bytes) < BLOCK_SIZE:
                return bytes

        f.seek(0, 2)
        file_len = f.tell()
        index = BLOCK_SIZE
        cnt=0
        while index < file_len and cnt < BLOCKS:
                f.seek(index)
                more_bytes = f.read(BLOCK_SIZE)
                bytes.extend(more_bytes)
                index+=ord(os.urandom(1))*BLOCK_SIZE
                cnt+=1

        return bytes

def failed_n_gram(n,bytes):
        print "\t%d-gram analysis"%(n)
        N = len(bytes)/n
        states = 2**(8*n)
        print "\tN: %d states: %d"%(N, states)

        if N < states:
                print "\tinsufficient data"
                return False

        histo = [0]*states
        P = 1.0/states

        expected = N/states * 1.0
        # I forgot how this was derived, or what it is suppose to be
        magic = math.sqrt(N*P*(1-P))*MAGIC_COEFF
        print "\texpected: %f magic: %f" %(expected, magic)

        idx=0
        while idx<len(bytes)-n:
                val=0
                for x in xrange(n):
                        val = val << 8
                        val = val | ord(bytes[idx+x])

                histo[val]+=1
                idx+=1

                count=histo[val]
                if count - expected > magic:
                        print "\tfailed: %s occured %d times" %( hex(val), count)
                        return True

        # need this check because the absence of certain bytes is also
        # a sign something is up
        for i in xrange(len(histo)):
                count = histo[i]
                if expected-count > magic:
                        print "\tfailed: %s occured %d times" %( hex(i), count)
                        return True

        print ""

        return False

def main():
        for f in sys.argv[1:]:
                print f
                rand_bytes = get_random_bytes(f)

                if failed_n_gram(3,rand_bytes):
                        continue

                if failed_n_gram(2,rand_bytes):
                        continue

                if failed_n_gram(1,rand_bytes):
                        continue


if __name__ == "__main__":
        main()

I find this works reasonable well:

$ entropy.py ~/bin/entropy.py entropy.py.enc entropy.py.zip 
/Users/steve/bin/entropy.py
        1-gram analysis
        N: 1680 states: 256
        expected: 6.000000 magic: 10.226918
        failed: 0xa occured 17 times
entropy.py.enc
        1-gram analysis
        N: 1744 states: 256
        expected: 6.000000 magic: 10.419895

entropy.py.zip
        1-gram analysis
        N: 821 states: 256
        expected: 3.000000 magic: 7.149270
        failed: 0x0 occured 11 times

Here .enc is the source ran through:

openssl enc -aes-256-cbc -in entropy.py -out entropy.py.enc

And .zip is self-explanatory.

A few caveats:

  1. It doesn't check the entire file, just the first KB, then random blocks from the file. So if a file was random data appended with say a jpeg, it will fool the program. The only way to be sure if to check the entire file.

  2. In my experience, the code reliably detects when a file is unencrypted (since nearly all useful data has structure), but due to its statistical nature may sometimes misdiagnose an encrypted/random file.

  3. As it has been pointed out, this kind of analysis will fail for OTP, since you can make it say anything you want.

  4. Use at your own risk, and most certainly not as the only means of checking your results.

freespace
More sophisticated variants of this are a good way to check if the file was decrypted properly. Generally, only the correct distribution will appear non-random. On the downside, if the file was compressed before encryption, or is a compressed filetype like a jpeg or png image, it may appear sufficiently random that your statistical analysis can't tell the difference.
Nick Johnson
I would actually wager compressed files is more likely fail the analysis because they are making the data more ordered, in a sense, in order to store more using fewer bytes. If nothing else structures required to decompress properly would be sufficiently structured. Though lossless compression like JPG it might do the trick.
freespace
If you've decrypted but not uncompressed the data (particularly if you don't know it was compressed), statistical analysis is likely to fail, was my point. Obviously if you decompress the data first, that doesn't apply.
Nick Johnson
Intuition tells me compression doesn't remove structure, it adds structure. Thus X.zip is more likely to be picked up then X after decryption.
freespace