views:

882

answers:

5

Devise a simple algorithm which creates a file which contains nothing but its own checksum.

Let's say it is CRC-32, so this file must be 4 bytes long.

+1  A: 

Lacking any specific guidance to the contrary, I'd define the checksum of nonexistent data as a nonexistent checksum, so creating an empty file would fulfill the requirement.

Another typical method is a negative checksum -- i.e. after the data you write a value that makes the checksum of the whole file (including the checksum) come out to zero. In this case, you write a checksum of 0, and it all works out.

Jerry Coffin
I've specified CRC-32
psihodelia
The exact checksum you use is (mostly) irrelevant -- it's primarily about how you apply it.
Jerry Coffin
+26  A: 

There might be some smart mathematical way of finding it out (or proving that none exists), if you know how the algorithm works.

But since I'm lazy and CRC32 has only 2^32 values, I would brute force it. While waiting for the algorithm to go through all 2^32 values, I would use Google and Stack Overflow to find whether somebody has a solution to it.

In case of SHA-1, MD5 and other more-or-less cryptographically secure algorithms, I would get intimidated by the mathematicians who designed those algorithms and just give up.

EDIT 1: Brute forcing... This far I've found one; CC4FBB6A in big-endian encoding. There might still be more. I'm checking 4 different encodings: ASCII uppercase and lowercase, and binary big-endian and little-endian.

EDIT 2: Brute force done. Here are the results:

CC4FBB6A (big-endian)
FFFFFFFF (big-endian & little-endian)
32F3B737 (uppercase ASCII)

The code is here. On my overclocked C2Q6600 that takes about 1.5 hours to run. Now that program is single-threaded, but it would be easy to make it multi-threaded, which would give a nice linear scalability.

Esko Luontola
So you're smart, huh? I thought your head would be bigger. :)
psihodelia
+1 CC4FBB6A :-)
Sam
well, is seems that nobody has presented any proper answer (offering smth. better than a brute force), but I will give my voice to you because you have highest num of upvotes; anyway, thank you for your efforts!
psihodelia
+9  A: 

Aside from Jerry Coffin and Esko Luontola's good answers to an unusual problem, I'd like to add:

Mathematically, we're looking for X such that F(X) = X, where F is the checksum function, and X is the data itself. Since the checksum's output is of fixed size, and the input we are looking for is of the same size, there is no guarantee that such an X even exists! It could very well be that every input value of the fixed size is correlated with a different value of that size.

EDIT: Your question didn't specify the exact way the checksum is supposed to be formatted within the file, so I assumed you mean the byte-representation of the checksum. When strings and encodings and formatted-strings come to play, things become more complex.

M.A. Hanin
In fact for a good checksum algorithm wouldn't you want to ensure that X != F(X) to stop a whole bunch of collision attacts
Martin Beckett
No if you made that assumption it is actually more vulnerable. This was by far biggest weakness in famous Enigma. At lest i can tell that is something wrong if you can prove that property.
ralu
@ralu: suppose I take the MD5 function, and define a new hash function to consist of the MD5 sum of its input, preceded by the bit 0 if the first bit of input is 1, and 1 if it is 0. This new hash function provably has no fixed point, and provably is "nearly" as strong as MD5 (if knowing the first bit of a message allows you to crack MD5 in time X, you can crack it without the first bit in time 2X). So I don't think that non-existence of a fixed point is a problem. Enigma had a rather stronger property than not having a fixed point: it never enciphered even a single character to itself.
Steve Jessop
... of course by "crack" I mean reverse, I can't immediately prove that my new hash is as resistant as MD5 other attacks. But I applied a very blunt instrument to avoid a fixed point, giving up an entire bit of information about the input. I could have been more subtle, given that fixed points of MD5 (if any exist) are certainly very rare. It's not immediately obvious to me how flipping one bit of a very few MD5sums would fatally weaken its typical use on data which cannot possibly be a fixed point due to length. You couldn't even tell by examining the data which algorithm I was using.
Steve Jessop
Is there any weakness if fixed point exist, but is unknown?
ralu
A: 

Brute force it. CRC-32 gives you a string of length 8 containing digits and letters of A-F (in other words, it's a hexadecimal number). Try every combination, giving you 168 = many possibilities. Then hash each possibility and see if it gives you the original string.

You can try optimizing it by assuming the solution will use each character no more than two or three times, this might make it finish faster.

If you have access to a CRC32 implementation, you can also try to break the algorithm and find a solution much faster, but I have no idea how you'd do this.

IVlad
"CRC-32 gives you a string of length 8 containing digits and letters of A-F" - no, CRC32 returns a 32 bit integer. Many programs simply represent it in hexadecimal.
Nick Johnson
you can try cksum somefile.bin in your terminal, it will print a string, representing a decimal integer of type integer32
psihodelia
+1  A: 

Brute force. This is Adler32, which I haven't implemented before, and didn't bother testing, so it's quite likely I've messed it up. I wouldn't expect a corrected version to run significantly slower, though, unless I've done something colossally wrong.

This assumes that the 32bit checksum value is written to the file little-endian (I didn't find a fixed point with it big-endian):

#include <iostream>
#include <stdint.h>
#include <iomanip>

const int modulus = 65521;

void checkAllAdlers(uint32_t sofar, int depth, uint32_t a, uint32_t b) {
    if (depth == 4) {
        if ((b << 16) + a == sofar) {
            std::cout << "Got a fixed point: 0x" << 
                std::hex << std::setw(8) << std::setfill('0') << 
                sofar << "\n";
        }
        return;
    }
    for (uint32_t i = 0; i < 256; ++i) {
        uint32_t newa = a + i;
        if (newa >= modulus) newa -= modulus;
        uint32_t newb = b + a;
        if (newb >= modulus) newb -= modulus;

        checkAllAdlers(sofar + (i << (depth*8)), depth + 1, newa, newb);
    }
    return;
}

int main() {
    checkAllAdlers(0, 0, 1, 0);
}

Output:

$ g++     adler32fp.cpp   -o adler32fp -O3 && time ./adler32fp
Got a fixed point: 0x03fb01fe

real    0m31.215s
user    0m30.326s
sys     0m0.015s

[Edit: several bugs fixed already, I have no confidence whatever in the correctness of this code ;-) Anyway, you get the idea: a 32 bit checksum which uses each byte of input only once is very cheap to brute force. Checksums are usually designed to be fast to compute, whereas hashes are usually much slower, even though they have superficially similar effects. If your checksum was "2 rounds of Adler32" (meaning that the target checksum was the result of computing the checksum and then computing the checksum of that checksum) then my recursive approach wouldn't help so much, there'd be proportionally less in common between inputs with a common prefix. MD5 has 4 rounds, SHA-512 has 80.]

Steve Jessop