Using .NET, I would like to hash the first N bytes of potentially large files, but I can't find a way of doing it.

The ComputeHash function (I'm using SHA1) takes a byte array or a stream, but a stream seems like the best way of doing it, since I would prefer not to load a potentially large file into memory.

To be clear: I don't want to load a potentially large piece of data into memory if I can help it. If the file is 2GB and I want to hash the first 1GB, that's a lot of RAM!

+1  A: 

Open the file as a FileStream, copy the first n bytes into a MemoryStream, then hash the MemoryStream.

Toby
But wouldn't the memory stream occupy n bytes? Say I want to hash the first gigabyte. Wouldn't this consume a gig of RAM?
The second paragraph mentions it, though apparently not clearly enough. I'll update it.
@freelookenstein: Yes, the MemoryStream uses a byte array as storage, so you might as well use the byte array directly instead.
Guffa
+1  A: 
fileStream.Read(array, 0, N); 

http://msdn.microsoft.com/en-us/library/system.io.filestream.read.aspx

Andrey
You need more code than that to read N bytes from the file. You have to get the return value from the call to determine how many bytes were *actually* read into the buffer, and repeat the call until you have read all N bytes. This is a common mistake when using the Read method, so you should not show an example that simply adds to the confusion.
Guffa
@Guffa this is a direction, not the complete solution. I agree about the need to capture the actual number of bytes read, but why should I repeat the call until I have N of them? If the file has 20 bytes and N is 10, why would `FileStream.Read` return fewer than 10?
Andrey
@Andrey: Because there is no guarantee that it returns the maximum number of bytes that you request. Read the documentation: http://msdn.microsoft.com/en-us/library/system.io.filestream.read.aspx
Guffa
@Guffa, the docs are vague on this point: "This might be less than the number of bytes requested if that number of bytes are not currently available". My understanding of "not available" is "has not been written yet", so the tail is shorter than your buffer, which makes this code legit.
Andrey
@Andrey: It's up to the underlying system to determine what data is currently available. The system may, for example, return the data that is in the cache while it's getting more data from disk. You shouldn't assume that the Read method always returns all the data that you request just because the data is available somewhere. The documentation is very clear in saying that the number of bytes returned may be less than the number of bytes requested, so that's what you have to code for.
Guffa
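To make the point of this comment thread concrete, here is a minimal sketch of a read loop that keeps calling Read until N bytes have arrived or the stream ends, and then hashes only the bytes actually read. The names `PrefixReader`, `ReadUpTo` and `HashPrefix` are made up for illustration. Note that this still allocates N bytes, so it only suits a reasonably small N.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper names, for illustration only.
public static class PrefixReader
{
    // Fills buffer[0..n) from the stream, looping because a single Read call
    // may return fewer bytes than requested. Returns the number of bytes
    // actually read, which is less than n only if the stream ended early.
    public static int ReadUpTo(Stream stream, byte[] buffer, int n)
    {
        int total = 0;
        while (total < n)
        {
            int read = stream.Read(buffer, total, n - total);
            if (read == 0)
                break; // end of stream
            total += read;
        }
        return total;
    }

    // Hashes the first n bytes of the file (or the whole file if it is shorter).
    public static byte[] HashPrefix(string path, int n)
    {
        byte[] buffer = new byte[n];
        using (FileStream fs = File.OpenRead(path))
        using (SHA1 sha1 = SHA1.Create())
        {
            int count = ReadUpTo(fs, buffer, n);
            return sha1.ComputeHash(buffer, 0, count);
        }
    }
}
```

Calling `PrefixReader.HashPrefix(path, 1024)` would hash the first kilobyte of the file at `path`; for a prefix too large to hold in memory, a streaming approach is the better fit.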
+1  A: 

As others have pointed out, you should read the first few bytes into an array.

It should also be noted that you don't want to make a single call to Read and assume that all the bytes have been read.

Rather, you want to check that the number of bytes returned equals the number of bytes you requested, and call Read again if it doesn't.

Also, if you have rather large streams, you will want to create a proxy for the Stream class: pass it the underlying stream (the FileStream in this case) and override the Read method to forward calls to the underlying stream until the number of bytes you need has been read. Once that many bytes have been returned, Read should return 0 to indicate that there are no more bytes to be read (-1 is the end-of-stream value for ReadByte, not for Read).

casperOne
+5  A: 

You can hash large volumes of data using a CryptoStream - something like this should work:

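var sha1 = SHA1.Create();

// "Whatever" stream you need; path is a placeholder for the file to hash.
FileStream fs = File.OpenRead(path);
using (var cs = new CryptoStream(fs, sha1, CryptoStreamMode.Read))
{
    byte[] buf = new byte[16];
    long totalBytesRead = 0;

    // Never request more than the bytes still needed, so that at most
    // maxBytesToHash bytes pass through the hash.
    while (totalBytesRead < maxBytesToHash)
    {
        int toRead = (int)Math.Min(buf.Length, maxBytesToHash - totalBytesRead);
        int bytesRead = cs.Read(buf, 0, toRead);
        if (bytesRead == 0)
            break; // end of file reached before maxBytesToHash
        totalBytesRead += bytesRead;
    }
}

// On .NET Framework, disposing the CryptoStream finalizes the hash; newer
// runtimes may not do so in read mode unless the stream was read to its end.
byte[] hash = sha1.Hash;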
Lee
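As a variation on the same idea, the hash can also be fed incrementally without a CryptoStream, using the `TransformBlock` and `TransformFinalBlock` methods of `HashAlgorithm`. This is only a sketch under stated assumptions: the class and method names (`IncrementalHasher`, `HashFirstBytes`) and the 80 KB buffer size are arbitrary choices, not something from the answer above.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, for illustration only.
public static class IncrementalHasher
{
    // Hashes at most maxBytesToHash bytes from the current position of input,
    // holding only one small buffer in memory at a time.
    public static byte[] HashFirstBytes(Stream input, long maxBytesToHash)
    {
        using (SHA1 sha1 = SHA1.Create())
        {
            byte[] buf = new byte[81920]; // arbitrary chunk size
            long remaining = maxBytesToHash;

            while (remaining > 0)
            {
                // Request no more than what is still needed.
                int toRead = (int)Math.Min(buf.Length, remaining);
                int read = input.Read(buf, 0, toRead);
                if (read == 0)
                    break; // stream ended before the limit was reached

                // Feed exactly the bytes that were read into the hash.
                sha1.TransformBlock(buf, 0, read, null, 0);
                remaining -= read;
            }

            // Finalize with an empty last block, then fetch the result.
            sha1.TransformFinalBlock(buf, 0, 0);
            return sha1.Hash;
        }
    }
}
```

With a `FileStream fs`, `IncrementalHasher.HashFirstBytes(fs, 1L << 30)` would hash the first gigabyte while only ever holding one chunk in memory.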
+1  A: 

If you are concerned about keeping too much data in memory, you can create a stream wrapper that throttles the maximum number of bytes read.

Without doing all the work, here's some sample boilerplate you could use to get started.

Edit: Please review comments for recommendations to improve this implementation. End edit

public class LimitedStream : Stream
{
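    // long rather than int, so limits above 2 GB are possible.
    private long current = 0;
    private long limit;
    private Stream stream;
    public LimitedStream(Stream stream, long n)
    {
        this.limit = n;
        this.stream = stream;
    }

    public override int ReadByte()
    {
        if (current >= limit)
            return -1;

        // Read from the wrapped stream directly: base.ReadByte() would call
        // the overridden Read below and the byte would be counted twice.
        var value = this.stream.ReadByte();
        if (value >= 0)
            current++;

        return value;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Never hand out more than the bytes remaining under the limit;
        // once the limit is reached this reads 0 bytes, signalling end of stream.
        count = (int)Math.Min(count, limit - current);
        var numread = this.stream.Read(buffer, offset, count);
        current += numread;
        return numread;
    }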

    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotImplementedException();
    }

    public override void SetLength(long value)
    {
        throw new NotImplementedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotImplementedException();
    }

    public override bool CanRead
    {
        get { return true; }
    }

    public override bool CanSeek
    {
        get { return false; }
    }

    public override bool CanWrite
    {
        get { return false; }
    }

    public override void Flush()
    {
        // Nothing to flush on a read-only wrapper, so this is a no-op.
    }

    public override long Length
    {
        get { throw new NotImplementedException(); }
    }

    public override long Position
    {
        get { throw new NotImplementedException(); }
        set { throw new NotImplementedException(); }
    }

    protected override void Dispose(bool disposing)
    {
        // Dispose the wrapped stream only on an explicit Dispose call.
        if (disposing && this.stream != null)
        {
            this.stream.Dispose();
        }
        base.Dispose(disposing);
    }
}

Here is an example of the stream in use, wrapping a file stream, but throttling the number of bytes read to the specified limit:

using (var stream = new LimitedStream(File.OpenRead(@".\test.xml"), 100))
{
    var bytes = new byte[1024];
    stream.Read(bytes, 0, bytes.Length);
}
kbrimington
The `Read` method returns the number of bytes actually placed in the buffer. If you ignore the return value, you risk getting a buffer containing less data than you think (in this case less than 100 bytes). You have to loop until you have gotten all the data that you requested, or until the `Read` method returns zero.
Guffa
@Guffa: Thanks! I'll edit the post to refer to your comment.
kbrimington
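To tie the wrapper idea back to the original question, here is a hedged sketch of how a limiting stream could be handed straight to `ComputeHash(Stream)`, which keeps reading until Read returns 0 and so never holds more than a small buffer in memory. To keep the example self-contained it uses a compact, hypothetical `PrefixStream` rather than the class above; `PrefixHash.Sha1OfPrefix` is likewise a made-up name.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical compact limiting wrapper, for illustration only.
public sealed class PrefixStream : Stream
{
    private readonly Stream inner;
    private long remaining;

    public PrefixStream(Stream inner, long limit)
    {
        this.inner = inner;
        this.remaining = limit;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (remaining <= 0)
            return 0; // limit reached: report end of stream

        int read = inner.Read(buffer, offset, (int)Math.Min(count, remaining));
        remaining -= read;
        return read;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override void Flush() { }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
            inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class PrefixHash
{
    // Hashes the first n bytes of the file (or the whole file if it is shorter).
    public static byte[] Sha1OfPrefix(string path, long n)
    {
        using (var limited = new PrefixStream(File.OpenRead(path), n))
        using (SHA1 sha1 = SHA1.Create())
        {
            // ComputeHash loops over Read internally until it returns 0.
            return sha1.ComputeHash(limited);
        }
    }
}
```

Because the looping happens inside `ComputeHash`, the caller does not need its own read loop here; the wrapper only has to report end of stream once the limit is reached.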