views:

393

answers:

6

I am looking for a way to compute SHA-1 checksums of very large files without having to fully load them into memory at once.

I don't know the details of the SHA-1 implementation and therefore would like to know if it is even possible to do that.

If you know the SAX XML parser, then what I look for would be something similar: Computing the SHA-1 checksum by only always loading a small part into memory at a time.

All the examples I found, at least in Java, always depend on fully loading the file/byte array/string into memory.

If you even know implementations (any language), then please let me know!

+2  A: 

Yes, it can be used to hash streams since it's iterative: you go 512 bits each iteration, and you obtain a new 512 bits block that you can use for the next one.

Here you can find the pseudocode: link. It should be quite easy to implement in Java. You just need to do some padding when you encounter last block and some bitwise operations!

WARNING: the only thing is that usually unsigned ints are needed while Java offers just signed one, you should do some tricks to avoid problems..

Jack
+3  A: 

Yes, it's entirely possible. SHA-1 is a block algorithm, so it operates on one block at a time. The exact block size varies with the size of hash you're producing, but it's always quite small (e.g., 20 - 50 bytes or so). That, of course, is assuming you mean to include SHA-1 256, 384 and/or 512 (aka SHA-256, SHA-384 and SHA-512). If you're only including the original 160-bit version, then the block size is always 20 bytes (160 bits).

Edit: oops -- rereading the spec, the block sizes are 512 bits for SHA-1, SHA-224, SHA-256, and 1024 bits for SHA-384 and SHA-512. Thanks Charles!

Edit2: I almost forgot about the part where he's looking for source code, not just advice. Here's some code. First a header:

// Sha.h:
#ifndef SHA_1_H_INCLUDED
#define SHA_1_H_INCLUDED

// This is a relatively straightforward implementation of SHA-1. It makes no particular
// attempt at optimization, instead aiming toward easy verification against the standard.
// To that end, many of the variable names are identical to those used in FIPS 180-2 and
// FIPS 180-3. 
//
// The code should be fairly portable, within a few limitations:
// 1. It requires that 'char' have 8 bits. In theory this is avoidable, but I don't think
// it's worth the bother.
// 2. It only deals with inputs in (8-bit) bytes. In theory, SHA-1 can deal with a number of 
// bits that's not a multiple of 8, but I've never needed it. Since the padding always results
// in a byte-sized stream, the only parts that would need changing would be reading and padding
// the input. The main hashing portion would be unaffected.
//
// Compiles cleanly with:
//    MS VC++ 9.0SP1 (x86 or x64): -W4 -Za
//    gc++ 3.4: -ansi -pedantic -Wall
//    comeau 4.3.3: --vc71 
// Appears to work corectly in all cases. 
// You can't use maximum warnings with Comeau though -- this code itself doesn't give problems
// (that I know of) but Microsoft's headers give it *major* heartburn.
// 
//
// Written by Jerry Coffin, February 2008
// 
// You can use this software any way you want to, with following limitations
// (shamelessly stolen from the Boost software license):
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
// SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
// FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
// ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
// DEALINGS IN THE SOFTWARE.
// 
// If you put this to real use, I'd be happy to hear about it. If you find a bug, 
// I'd be interested in hearing about that too. There's even a pretty good chance 
// that I'll try to fix it, though I certainly can't guarantee that.
// 
#include <algorithm>
#include <vector>
#include <string>
#include <assert.h>
#include <iostream>
#include <sstream>
#include <iomanip>

#if defined(_MSC_VER) && _MSC_VER < 1600
typedef unsigned int uint32_t;
typedef unsigned __int64 uint64_t;
#else
#include <stdint.h>
#endif

namespace crypto { 
namespace {
    struct ternary_operator { 
        virtual uint32_t operator()(uint32_t x, uint32_t y, uint32_t z) = 0;
    };
}

class sha1 { 
    static const size_t hash_size = 5;
    static const size_t min_pad = 64;
    static const size_t block_bits = 512;
    static const size_t block_bytes = block_bits / 8;
    static const size_t block_words = block_bytes / 4;

    std::vector<uint32_t> K;
    std::vector<uint32_t> H;
    std::vector<uint32_t> W;
    std::vector<ternary_operator *> fs;
    uint32_t a, b, c, d, e, T;
    static const size_t block_size = 16;
    static const size_t bytes_per_word = 4;
    size_t total_size;

    // hash a 512-bit block of input.
    //
    void hash_block(std::vector<uint32_t> const &block);

    // Pad the input to a multiple of 512 bits, and add the length
    // in binary to the end.
    static std::string pad(std::string const &input, size_t size);

    // Turn 64 bytes into a block of 16 uint32_t's.
    std::vector<uint32_t> make_block(std::string const &in);

public:

    // Construct a SHA-1 object. More expensive that typical 
    // ctor, but not expected to be copied a lot or anything
    // like that, so it should be fairly harmless.
    sha1();

    // The two ways to provide input for hashing: as a stream or a string.
    // Either way, you get the result as a vector<uint32_t>. It's a fairly
    // small vector, so even if your compiler doesn't do return-value 
    // optimization, the time taken for copying it isn't like to be 
    // significant.
    //
    std::vector<uint32_t> operator()(std::istream &in);
    std::vector<uint32_t> operator()(std::string const &input);

    friend std::ostream &operator<<(std::ostream &os, sha1 const &s);
};
}

#endif

And the implementation:

// Sha1.cpp:
#include "sha.h"
// Please see comments in sha.h for licensing information, etc.
//

// Many people don't like the names I usually use for namespaces, so I've kept this one
// short and simple.
//
namespace crypto {
namespace {  
uint32_t ROTL(uint32_t const &value, unsigned bits) { 
    uint32_t mask = (1 << bits) - 1;
    return value << bits | (value >> (32-bits))&mask;
}

struct f1 : ternary_operator {
    uint32_t operator()(uint32_t x, uint32_t y, uint32_t z) { 
        return (x & y) ^ (~x&z);
    }
};

struct f2 : ternary_operator {
    uint32_t operator()(uint32_t x, uint32_t y, uint32_t z) {
        return x ^ y ^ z;
    }
};

struct f3 : ternary_operator {
    uint32_t operator()(uint32_t x, uint32_t y, uint32_t z) {
        return (x&y) ^ (x&z) ^ (y&z);
    }
};

uint32_t word(int a, int b, int c, int d) {
    a &= 0xff;
    b &= 0xff;
    c &= 0xff;
    d &= 0xff;
    int val =  a << 24 | b << 16 | c << 8 | d;
    return val;
}
}

// hash a 512-bit block of input.
//
void sha1::hash_block(std::vector<uint32_t> const &block) {
    assert(block.size() == block_words);

    int t;
    std::copy(block.begin(), block.end(), W.begin());
    for (t=16; t<80; t++) {
        W[t] = ROTL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
    }

    a = H[0]; b = H[1]; c = H[2]; d = H[3]; e = H[4];

    for (t=0; t<80; t++) {
        T = ROTL(a, 5) + (*fs[t])(b, c, d) + e + K[t] + W[t];
        e = d;
        d = c;
        c = ROTL(b, 30);
        b = a;
        a = T;
    }
    H[0] += a; H[1] += b; H[2] += c; H[3] += d; H[4] += e;
}

// Pad the input to a multiple of 512 bits, and put the length
// in binary at the end. 
std::string sha1::pad(std::string const &input, size_t size) {
    size_t length = size * 8 + 1;
    size_t remainder = length % block_bits;
    size_t pad_len = block_bits-remainder;

    if (pad_len < min_pad)
        pad_len += block_bits;
    ++pad_len;

    pad_len &= ~7;
    std::string padding(pad_len/8, '\0');

    for (size_t i=0; i<sizeof(padding.size()); i++)
        padding[padding.size()-i-1] = (length-1) >> (i*8) & 0xff;
    padding[0] |= (unsigned char)0x80;

    std::string ret(input+padding);
    return ret;
}

// Turn 64 bytes into a block of 16 uint32_t's.
std::vector<uint32_t> sha1::make_block(std::string const &in) { 
    assert(in.size() >= block_bytes);

    std::vector<uint32_t> ret(block_words);

    for (size_t i=0; i<block_words; i++) {
        size_t s = i*4;
        ret[i] = word(in[s], in[s+1], in[s+2], in[s+3]);
    }
    return ret;
}


// Construct a SHA-1 object. More expensive that typical 
// ctor, but not expected to be copied a lot or anything
// like that, so it should be fairly harmless.
sha1::sha1() : K(80), H(5), W(80), fs(80), total_size(0) {
    static const uint32_t H0[] = { 
        0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0
    };
    static const uint32_t Ks[] = {
        0x5a827999, 0x6ed9eba1, 0x8f1bbcdc, 0xca62c1d6
    };

    std::copy(H0, H0+hash_size, H.begin());

    std::fill_n(K.begin()+00, 20, Ks[0]);
    std::fill_n(K.begin()+20, 20, Ks[1]);
    std::fill_n(K.begin()+40, 20, Ks[2]);
    std::fill_n(K.begin()+60, 20, Ks[3]);

    static f1 sf1;
    static f2 sf2;
    static f3 sf3;

    std::fill_n(fs.begin()+00, 20, &sf1);
    std::fill_n(fs.begin()+20, 20, &sf2);
    std::fill_n(fs.begin()+40, 20, &sf3);
    std::fill_n(fs.begin()+60, 20, &sf2);
}

// The two ways to provide input for hashing: as a stream or a string.
// Either way, you get the result as a vector<uint32_t>. It's a fairly
// small vector, so even if your compiler doesn't do return-value 
// optimization, the time taken for copying it isn't like to be 
// significant.
// 
std::vector<uint32_t> sha1::operator()(std::string const &input) { 
    std::string temp(pad(input, total_size + input.size()));
    std::vector<uint32_t> block(block_size);

    size_t num = temp.size()/block_bytes;

    for (unsigned block_num=0; block_num<num; block_num++) {
        size_t s;
        for (size_t i=0; i<block_size; i++) {
            s = block_num*block_bytes+i*4;
            block[i] = word(temp[s], temp[s+1], temp[s+2], temp[s+3]);
        }
        hash_block(block);  
    }
    return H;
}

std::vector<uint32_t> sha1::operator()(std::istream &in) { 
    char raw_block[65];

    while (in.read(raw_block, block_bytes)) {
        total_size += block_bytes;
        std::string b(raw_block, in.gcount());
        hash_block(make_block(b));
    }
    std::string x(raw_block, in.gcount());
    return operator()(x);
}

std::ostream &operator<<(std::ostream &os, sha1 const &s) { 
    // Display a SHA-1 result in hex.
    for (size_t i=0; i<(s.H).size(); i++)
        os << std::fixed << std::setprecision(8) << std::hex << std::setfill('0') << (s.H)[i] << " ";
    return os << std::dec << std::setfill(' ') << "\n";
}

}

#ifdef TEST
#include <iostream>
#include <iomanip>
#include <string>
#include <sstream>

// A minimal test harness to check that it's working correctly. Strictly black-box
// testing, with no attempt at things like coverage analysis. Nonetheless, I believe
// it should cover most of the code -- the core hashing code all gets used for every
// possible value. The padding code should be tested fairly thoroughly as well -- the
// first test is a fairly simple case, and the second the more complex one (where the 
// padding requires adding another block).
class tester {
    bool verify(uint32_t *test_val, std::vector<uint32_t> const &hash, std::ostream &os) {
        // Verify that a result matches a test value and report result.
        for (size_t i=0; i<hash.size(); i++)
            if (hash[i] != test_val[i]) {
                os << "Mismatch. Expected: " << test_val[i] << ", but found: " << hash[i] << "\n";
                return false;
            }
            os << "Message digest Verified.\n\n";
            return true;
    }

public:

    bool operator()(uint32_t *test_val, std::string const &input) {
        std::cout << "Testing hashing from string:\n\"" << input << "\"\n";
        crypto::sha1 hasher1;
        std::vector<uint32_t> hash = hasher1(input);
        std::cout << "Message digest is:\n\t" << hasher1;
        bool verified = verify(test_val, hash, std::cerr);

        crypto::sha1 hasher2;
        std::cout << "Testing hashing from Stream:\n";
        std::istringstream buf(input);
        hash = hasher2(buf);
        std::cout << "Message digest is:\n\t" << hasher2;

        return verified & verify(test_val, hash, std::cerr);
    }
};

int main() {
    // These test values and results come directly from the FIPS pub.
    //
    char const *input1 = "abc";
    char const *input2 = "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq";
    uint32_t result1[] = {0xA9993E36, 0x4706816A, 0xBA3E2571, 0x7850C26C, 0x9CD0D89D};
    uint32_t result2[] = {0x84983E44, 0x1C3BD26E, 0xBAAE4AA1, 0xF95129E5, 0xE54670F1 };
    bool correct = tester()(result1, input1);
    correct &= tester()(result2, input2);
    if (correct)
        std::cerr << "All Tests passed!\n";
    else
        std::cerr << "Test Failed!\n";
}
#elif defined(MAIN) 

#include <sstream>
#include <fstream>
#include <iostream>

int main(int argc, char **argv) { 
    if (argc < 2) {
        std::cerr << "Usage: sha1 [filename]\n";
        return EXIT_FAILURE;
    }
    for (int i=1; i<argc; i++) {
        crypto::sha1 hash;
        std::ifstream in(argv[i], std::ios_base::binary);
        if (in.good()) {
            hash(in);
            std::cout << "SHA-1(" << argv[i] << ") = " << hash << "\n";
        }
    }
    return 0;
}
#endif
Jerry Coffin
IIRC, SHA-1 (as defined in FIPS 180-3) always has a block size of 512 bits / 64 bytes and always produces a 160 bit (20 bytes) digest. Other sized SHAs are all called something different.
Charles Bailey
It is important to use the features in the standard runtime library and duplicate the code simply because of confusion or ignorance of its existence. I don't downvote this answer because it is correct and asks for what the OP wanted, but it is clear that the OP does not understand or know of MessageDigest.update().
GregS
+2  A: 

Yes it is. You only need to read blocks of 512 bits (64 bytes) at a time to calculate a SHA-1 hash.

You need to keep track of the stream length and preform the correct padding in the last one or two blocks but yes, it's perfectly feasible.

I've written such an implementation in C++ before, but I'm afraid I'm not free to distribute it.

Charles Bailey
+1  A: 

SHA-1 processes data in blocks, so you can process your file in stream. I'm pretty confident the bouncy castle crypto library implements SHA-1 at a low enough level that you can stream your data. I have done very similar using other block cyphers with the bouncy castle crypto library.

http://www.bouncycastle.org/

AaronLS
SHA1 is not a block cipher
GregS
My mistake, it is not a cypher. I guess I was thinking of the cases where it was used as the basis of a block cypher. Never the less my point is that it processes the data in blocks and I have personally written code that used input and output streams with bouncycastle, and it's usually as easy as changing a few parameters to switch from one standard to another. I hopefully ill have time soon to find the code.
AaronLS
+7  A: 

The Java docs say to use a MessageDigest class to compute SHA-1 on any arbitrary size data.

Robert Harvey
Uh oh! Why is this the least reputed answer? Java's built-in MessageDigest does exactly what he needs.
alex
@alex: the only concern I can think of is whether "SHA" is guaranteed to be available, or whether it's up to the Spi. Probably a risk worth taking, though: most applications which handle files are already going to be making assumptions about platform capabilities, and it's not exactly a difficult thing to test.
Steve Jessop
SHA1 is available in the Sun JCE providers since at least 1.4.2 and probably earlier. I don't know about IBM providers.
GregS
+2  A: 

You can do exactly this, transparently and almost effortlessly, by using the DigestInputStream or DigestOutputStream classes. Or you can use MessageDigest manually and it is almost as easy.

GregS