Hi

I am trying to compute the SHA-1 hash of a number of files. What I currently do is iterate over the files in a given path, open each file separately, read its contents into a buffer, and pass that buffer to OpenSSL's SHA1() function to get the hash. The code looks something like this:

void ReadHashFile(LPCTSTR name)
{
 FILE * pFile;
 long lSize;
 char * buffer;
 size_t result;

 pFile = _tfopen ( name , L"rb" );
 if (pFile==NULL) {fputs ("File error",stderr); return;}

 // obtain file size:
 fseek (pFile , 0 , SEEK_END);
 lSize = ftell (pFile);
 rewind (pFile);

 if(lSize == -1){fputs ("Read Error",stderr);return;}

 // allocate memory to contain the whole file:
 buffer = (char*) malloc (sizeof(char)*lSize);
 if (buffer == NULL) {fputs ("Memory error",stderr); return;}

 // copy the file into the buffer:
 result = fread (buffer,1,lSize,pFile);
 if (result != lSize) {fputs ("Reading error",stderr); return;}

 /* the whole file is now loaded in the memory buffer. */

 // terminate
 fclose (pFile);

 //Do what ever with buffer
 unsigned char ibuf[] = "compute sha1";
 unsigned char obuf[20];

 SHA1((const unsigned char*)buffer, strlen((const char*)buffer), obuf);
 fwprintf(stderr, L"file %s\n", name);
 int i;
 for (i = 0; i < 20; i++) {
  printf("%02x ", obuf[i]);
 }
 printf("\n");


 free(buffer);
}

Some files seem to be unreadable: some give me a size of -1, and for others I can only read the first 2-3 bytes, so many files end up with the same SHA-1 even though they are different.

I would appreciate it if someone could help me with this, or if anyone has experience with file hashing. Also, is there a way of getting a file's SHA-1 without loading the entire file into memory first? For large files, this solution won't work.

Regards

+6  A: 

If you have trouble reading the file contents before the hash function is even invoked, then your problem is not related to hashing.

You should use the standard fopen() function rather than _tfopen(). In C, things which begin with an underscore character are often best avoided, especially since _tfopen() seems to map to either fopen() or the Windows-specific _wfopen() depending on whether so-called "unicode support" is activated. Alternatively, in a purely Windows application, you may rely on Win32 functions such as CreateFile().

Reading the whole file into memory and then hashing it is crude. It will fail to process files which are larger than available RAM, for instance. Also, in order to know the file size, you have to seek into it, which is not reliable (there may be pseudo-files which are actually pipes into some data-generating process, for which seeking is not possible). Hash functions can process data in chunks; you should use a small buffer (8 kB is the traditional size) and employ the SHA1_Init(), SHA1_Update() and SHA1_Final() functions.

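fread() does not necessarily read as much data as you requested, and a short count is not by itself an error: it also happens when the end of the file is reached. Use ferror() to tell the two cases apart.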
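To illustrate the point about fread(), here is a minimal sketch that reads a stream in chunks and only decides after the loop whether it stopped because of end of file or because of an error. The helper name count_bytes() and its chunk parameter are made up for this example; it is not part of any library.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: count the bytes remaining in stream f by reading
 * it in chunks of at most "chunk" bytes (capped to the local buffer size).
 * Returns the byte count, or -1 if an I/O error was reported by ferror().
 */
long
count_bytes(FILE *f, size_t chunk)
{
    unsigned char buf[4096];
    long total = 0;

    assert(f != NULL);
    if (chunk == 0 || chunk > sizeof buf)
        chunk = sizeof buf;
    for (;;) {
        size_t len = fread(buf, 1, chunk, f);

        total += (long)len;     /* a short count still carries valid data */
        if (len < chunk)
            break;              /* end of file or error: decided below */
    }
    return ferror(f) ? -1 : total;
}
```

The last chunk of a file is almost always shorter than the buffer, so treating "result != requested size" as a failure would reject perfectly good reads.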

When you call SHA1(), you use strlen() on your buffer, which is bogus. strlen() returns the length of a character string; in plain words, the number of bytes until the next byte of value zero. Many files contain bytes of value 0. And if the file does not, then there is no guarantee that your buffer contains any byte of value 0, so that the call to strlen() may end up reading memory outside of the allocated buffer (this is bad). Since you went to the trouble of obtaining the file length and allocating a buffer that big, you should at least use that length instead of trying to recompute it with a function which does not do that.
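The strlen() problem is easy to see on binary data. The sketch below uses a hypothetical helper, strlen_would_see(), which reports how many bytes strlen() would count in a buffer of known length; it uses memchr() so that the scan itself stays inside the buffer.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: return the length strlen() would report for a
 * buffer holding len bytes of file data, i.e. the offset of the first
 * zero byte. If the buffer contains no zero byte, len is returned; the
 * real strlen() would instead keep reading past the end of the buffer.
 */
size_t
strlen_would_see(const unsigned char *data, size_t len)
{
    const unsigned char *zero;

    assert(data != NULL);
    zero = memchr(data, 0, len);
    return zero != NULL ? (size_t)(zero - data) : len;
}
```

For a six-byte buffer such as { 0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00 } (illustrative values), the helper returns 3: only the first three bytes would be hashed. Any two files that share those leading bytes would then get the same SHA-1, which matches the symptom described in the question.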

To sum up: your code should look like this (untested):

#include <stdio.h>
#include <openssl/sha.h>

/*
 * Hash a file whose name is given. The hash output is written into
 * buffer "out[]" and consists of exactly 20 bytes.
 * On success, 0 is returned; on error, the returned value is -1 and
 * out[] is unaltered.
 */
int
do_sha1_file(const char *name, unsigned char *out)
{
    FILE *f;
    unsigned char buf[8192];
    SHA_CTX sc;
    int err;

    f = fopen(name, "rb");
    if (f == NULL) {
        /* do something smart here: the file could not be opened */
        return -1;
    }
    SHA1_Init(&sc);
    for (;;) {
        size_t len;

        len = fread(buf, 1, sizeof buf, f);
        if (len == 0)
            break;
        SHA1_Update(&sc, buf, len);
    }
    err = ferror(f);
    fclose(f);
    if (err) {
        /* some I/O error was encountered; report the error */
        return -1;
    }
    SHA1_Final(out, &sc);
    return 0;
}

And do not forget to include the relevant headers: <stdio.h> and OpenSSL's <openssl/sha.h>.
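To print the result the way the question does, the 20 output bytes can be turned into a hex string. This is only a sketch: sha1_to_hex() and the SHA1_LEN constant are names invented for this example, and the code does not depend on OpenSSL.

```c
#include <assert.h>
#include <stddef.h>

#define SHA1_LEN 20     /* SHA-1 output size in bytes */

/*
 * Hypothetical helper: write the lowercase hexadecimal form of a 20-byte
 * digest into hex[], which must have room for 2 * SHA1_LEN + 1 chars
 * (40 hex digits plus the terminating zero).
 */
void
sha1_to_hex(const unsigned char *digest, char *hex)
{
    static const char digits[] = "0123456789abcdef";
    int i;

    assert(digest != NULL && hex != NULL);
    for (i = 0; i < SHA1_LEN; i++) {
        hex[2 * i]     = digits[digest[i] >> 4];    /* high nibble */
        hex[2 * i + 1] = digits[digest[i] & 0x0f];  /* low nibble */
    }
    hex[2 * SHA1_LEN] = '\0';
}
```

After do_sha1_file(name, out) returns 0, calling sha1_to_hex(out, hex) with a char hex[41] buffer gives a string that can be printed with a single "%s".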

Thomas Pornin
+1 for the analytical explanation.
Thomas Matthews