ansaurus

Question

Answer 1

+1 A:

Open the file as a FileStream, copy the first n bytes into a MemoryStream, then hash the MemoryStream.

Toby 2010-08-09 15:07:56

But wouldn't the memory stream occupy n bytes? Say I want to hash the first Gigabyte. Wouldn't this consume a gig of RAM?

2010-08-09 15:10:41

The second paragraph mentions it, though apparently not clearly enough. I'll update it.

2010-08-09 15:15:57

@freelookenstein: Yes, the MemoryStream uses a byte array as storage, so you might as well use the byte array directly instead.

Guffa 2010-08-09 15:17:47
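
The memory concern raised above can be avoided by feeding the hash one small chunk at a time instead of buffering all n bytes. The following is a minimal sketch, not code from any answer in this thread: the names `PrefixHasher` and `HashFirstBytes` and the 80 KB buffer size are assumptions made for illustration. It uses the standard `HashAlgorithm.TransformBlock` and `TransformFinalBlock` methods.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class PrefixHasher
{
    // Hashes at most maxBytes bytes from the current position of the stream,
    // keeping only one fixed-size buffer in memory regardless of maxBytes.
    public static byte[] HashFirstBytes(Stream input, long maxBytes, HashAlgorithm algorithm)
    {
        byte[] buffer = new byte[81920]; // buffer size is an arbitrary choice
        long remaining = maxBytes;

        while (remaining > 0)
        {
            int toRead = (int)Math.Min(buffer.Length, remaining);
            int read = input.Read(buffer, 0, toRead);
            if (read == 0)
                break; // the stream ended before maxBytes were read

            // Feed only the bytes actually read into the running hash.
            algorithm.TransformBlock(buffer, 0, read, null, 0);
            remaining -= read;
        }

        // Finalize with an empty block, then fetch the computed hash.
        algorithm.TransformFinalBlock(buffer, 0, 0);
        return algorithm.Hash;
    }
}
```

Under these assumptions, hashing the first gigabyte of a file would look like `PrefixHasher.HashFirstBytes(fileStream, 1L << 30, SHA1.Create())`, and memory use stays at the size of the buffer.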

Answer 2

+1 A:

fileStream.Read(array, 0, N);

http://msdn.microsoft.com/en-us/library/system.io.filestream.read.aspx

Andrey 2010-08-09 15:09:31

You need more code than that to read N bytes from the file. You have to get the return value from the call to determine how many bytes were *actually* read into the buffer, and repeat the call until you have read all N bytes. This is a common mistake when using the Read method, so you should not show an example that simply adds to the confusion.

Guffa 2010-08-09 15:16:33

@Guffa this is a direction, not the complete solution. I agree about the need to capture the actual number of bytes read, but why should I repeat the call until I have N of them? If the file has 20 bytes and N is equal to 10, why would `FileStream.Read` return less than 10?

Andrey 2010-08-09 15:42:42

@Andrey: Because there is no guarantee that it returns the maximum number of bytes that you request. Read the documentation: http://msdn.microsoft.com/en-us/library/system.io.filestream.read.aspx

Guffa 2010-08-09 16:25:35

@Guffa, the docs are vague on this point: "This might be less than the number of bytes requested if that number of bytes are not currently available". My understanding of "not available" is "have not been written yet", so it only applies when the tail of the file is shorter than your buffer, which makes this code legit.

Andrey 2010-08-09 16:37:45

@Andrey: It's up to the underlying system to determine what data is currently available. The system may, for example, return the data that is in the cache while it's getting more data from disk. You shouldn't assume that the Read method always returns all the data that you request just because the data is available somewhere. The documentation is very clear in saying that the number of bytes returned may be less than the number of bytes requested, so that's what you have to code for.

Guffa 2010-08-09 17:19:55
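
The loop that this comment thread argues for can be sketched as follows. This is an illustrative helper, not code posted by either commenter; the names `StreamReading` and `ReadUpTo` are made up for the example.

```csharp
using System;
using System.IO;

public static class StreamReading
{
    // Reads up to count bytes into buffer, calling Read repeatedly because
    // a single call may return fewer bytes than requested.
    // Returns the total read, which is less than count only at end of stream.
    public static int ReadUpTo(Stream stream, byte[] buffer, int count)
    {
        int total = 0;
        while (total < count)
        {
            int read = stream.Read(buffer, total, count - total);
            if (read == 0)
                break; // end of stream
            total += read;
        }
        return total;
    }
}
```

The filled part of the buffer can then be hashed with `ComputeHash(buffer, 0, total)`, passing the returned total rather than N in case the file turned out to be shorter than N bytes.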

Answer 3

+1 A:

As others have pointed out, you should read the first few bytes into an array.

It should also be noted that you don't want to make a single call to Read and assume that all the bytes have been read.

Rather, you want to check that the number of bytes returned is the number of bytes that you requested, and make another call to Read if it isn't.

Also, if you have rather large streams, you will want to create a proxy for the Stream class: pass it the underlying stream (the FileStream in this case) and override the Read method to forward calls to the underlying stream until the number of bytes you need has been read. Once that many bytes have been returned, Read should return 0 to indicate that there are no more bytes to be read (ReadByte is the method that returns -1 at the end of the stream).

casperOne 2010-08-09 15:15:43

Answer 4

+5 A:

You can hash large volumes of data using a CryptoStream - something like this should work:

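var sha1 = SHA1.Create();

FileStream fs = File.OpenRead(path); // "whatever": path is a placeholder for your file
using (var cs = new CryptoStream(fs, sha1, CryptoStreamMode.Read))
{
    byte[] buf = new byte[16];
    long totalBytesRead = 0;

    // Request no more than the bytes still wanted, rather than always a
    // full buffer, so the loop stops at maxBytesToHash instead of past it.
    while (totalBytesRead < maxBytesToHash)
    {
        int toRead = (int)Math.Min(buf.Length, maxBytesToHash - totalBytesRead);
        int bytesRead = cs.Read(buf, 0, toRead);
        if (bytesRead == 0)
            break; // end of file reached first
        totalBytesRead += bytesRead;
    }
}

// As in the original answer, this relies on the CryptoStream having
// finalized the hash by the time it is disposed.
byte[] hash = sha1.Hash;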

Lee 2010-08-09 15:45:19

Answer 5

+1 A:

If you are concerned about keeping too much data in memory, you can create a stream wrapper that throttles the maximum number of bytes read.

Without doing all the work, here's some sample boilerplate you could use to get started.

Edit: Please review comments for recommendations to improve this implementation. End edit

public class LimitedStream : Stream
{
    private int current = 0;
    private int limit;
    private Stream stream;
    public LimitedStream(Stream stream, int n)
    {
        this.limit = n;
        this.stream = stream;
    }

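    public override int ReadByte()
    {
        if (current >= limit)
            return -1;

        // Read from the wrapped stream directly: base.ReadByte() would call
        // our Read override and count the same byte twice.
        var numread = this.stream.ReadByte();
        if (numread >= 0)
            current++;

        return numread;
    }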

    public override int Read(byte[] buffer, int offset, int count)
    {
        count = Math.Min(count, limit - current);
        var numread = this.stream.Read(buffer, offset, count);
        current += numread;
        return numread;
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotImplementedException();
    }

    public override void SetLength(long value)
    {
        throw new NotImplementedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotImplementedException();
    }

    public override bool CanRead
    {
        get { return true; }
    }

    public override bool CanSeek
    {
        get { return false; }
    }

    public override bool CanWrite
    {
        get { return false; }
    }

    public override void Flush()
    {
        // Nothing to flush on a read-only stream; a no-op keeps the
        // wrapper compatible with code that flushes any stream.
    }

    public override long Length
    {
        get { throw new NotImplementedException(); }
    }

    public override long Position
    {
        get { throw new NotImplementedException(); }
        set { throw new NotImplementedException(); }
    }

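    protected override void Dispose(bool disposing)
    {
        // Release the wrapped stream only on an explicit Dispose,
        // not when called from a finalizer.
        if (disposing && this.stream != null)
        {
            this.stream.Dispose();
        }
        base.Dispose(disposing);
    }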
}

Here is an example of the stream in use, wrapping a file stream, but throttling the number of bytes read to the specified limit:

using (var stream = new LimitedStream(File.OpenRead(@".\test.xml"), 100))
{
    var bytes = new byte[1024];
    stream.Read(bytes, 0, bytes.Length);
}

kbrimington 2010-08-09 15:56:27

The `Read` method returns the number of bytes actually placed in the buffer. If you ignore the return value, you risk getting a buffer containing less data than you think (in this case less than 100 bytes). You have to loop until you have gotten all the data that you requested, or until the `Read` method returns zero.

Guffa 2010-08-09 17:57:16

@Guffa: Thanks! I'll edit the post to refer to your comment.

kbrimington 2010-08-09 18:00:09
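
One way to avoid writing the read loop by hand is to give the limiting stream to `HashAlgorithm.ComputeHash(Stream)`, which keeps reading until `Read` returns zero. The sketch below is a compact variant written for illustration, not kbrimington's class: `BoundedReadStream`, `BoundedHashing` and `HashPrefix` are hypothetical names, and SHA-1 is used only because it appears elsewhere in this thread.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical minimal read-only wrapper that reports end of stream
// once `limit` bytes have been read from the inner stream.
public sealed class BoundedReadStream : Stream
{
    private readonly Stream inner;
    private long remaining;

    public BoundedReadStream(Stream inner, long limit)
    {
        this.inner = inner;
        this.remaining = limit;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (remaining <= 0)
            return 0; // limit reached: behave like end of stream

        int read = inner.Read(buffer, offset, (int)Math.Min(count, remaining));
        remaining -= read;
        return read;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}

public static class BoundedHashing
{
    // Hashes at most `limit` bytes of `source`; ComputeHash does the looping.
    public static byte[] HashPrefix(Stream source, long limit)
    {
        using (var sha1 = SHA1.Create())
        {
            return sha1.ComputeHash(new BoundedReadStream(source, limit));
        }
    }
}
```

In this sketch the wrapper does not dispose the inner stream, so the caller remains responsible for closing the file.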


How do I hash first N bytes of a file?
