I am building a library that allows a user to download files from a URL. One of the options I am considering is letting the user specify the expected MD5 checksum for the file; the library's GetFile(string url) function ensures that the checksum for the downloaded stream matches the one specified by the user.

Being aware that the NetworkStream returned by HttpWebResponse.GetResponseStream() is not seekable, I found a way to duplicate the Stream thanks to the answers to this question: http://stackoverflow.com/questions/147941/how-can-i-read-an-http-response-stream-twice-in-c. Before I went any further though, I wanted to figure out what the memory implications of this duplication would be; unfortunately, multiple searches on Google and MSDN have come to naught.

The library imposes no restriction on the size of the file to be downloaded. My question is, if the user selects a 2GB file, is the MemoryStream implementation in .NET 2.0 smart enough to use the page file and RAM efficiently enough that the system doesn't start to crawl due to a VM crunch? Also, Jon Skeet's comment on another question gave me something to think about: he averred that even after disposing a MemoryStream, the memory is not 100% freed. How and when can I ensure that the memory is actually released? Will it be released based on the system's requirements (and necessity)?

Thanks, Manoj

A: 

I'm pretty sure you'll get an OutOfMemoryException. An easy way to test this is to try reading a DVD ISO image or something similar into memory using a MemoryStream. If you can read the whole thing, then you should be fine. If you get an exception, well, there you go.

Chris
+3  A: 

You're saving it to a file, right? Why not save it chunk by chunk, updating a hash as you go, and then just check the hash at the end? I don't think you need to read the response twice, nor buffer it. As another answer points out, that would fail when you got over 1GB anyway.

Don't forget that as well as the current size of the MemoryStream, any time it has to grow you'll end up with (temporarily) the new array plus the old array in memory at the same time. Of course that wouldn't be a problem if you knew the content length beforehand, but it would still be nicer to just write it to disk and hash as you go.
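
A minimal sketch of the hash-as-you-go idea, assuming the destination is any writable Stream (a file, or whatever the caller supplies). The class name HashingCopy and method name CopyAndHash are made up for illustration, and a MemoryStream stands in for the HTTP response stream in the demo. It relies on HashAlgorithm.TransformBlock and TransformFinalBlock so that only one buffer is held in memory at a time:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class HashingCopy
{
    // Copies source to destination in fixed-size chunks and returns the MD5
    // of everything that was copied.
    public static byte[] CopyAndHash(Stream source, Stream destination)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] buffer = new byte[16 * 1024];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Feed this chunk into the running hash. A null output
                // buffer is allowed and skips the copy-out.
                md5.TransformBlock(buffer, 0, read, null, 0);
                destination.Write(buffer, 0, read);
            }
            // Finalize with an empty block so that md5.Hash becomes valid.
            md5.TransformFinalBlock(buffer, 0, 0);
            return md5.Hash;
        }
    }

    static void Main()
    {
        // Stand-in for the HTTP response stream (an assumption for the demo).
        byte[] payload = Encoding.ASCII.GetBytes("hello world");
        using (MemoryStream source = new MemoryStream(payload))
        using (MemoryStream destination = new MemoryStream())
        {
            byte[] streamed = CopyAndHash(source, destination);
            Console.WriteLine("MD5 (base64): " + Convert.ToBase64String(streamed));
        }
    }
}
```

The chunked result should be identical to what ComputeHash returns for the same bytes in one call; the difference is only that the full content never has to sit in memory.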

Jon Skeet
Not saving the stream to a file; the stream is passed on to the user to do whatever they deem fit with the data. The option to compute the hash is provided to the user as a nice-to-have, but judging from the complexity of the task at hand, I will reconsider the addition of this feature to the library. Thanks!
+2  A: 

MemoryStream is backed by an array. Even if you have a 64 bit OS this isn't going to work for more than 1GB as the framework won't allocate a larger array.
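
To see the array-backed growth for yourself, here is a small sketch that writes 16 MB into a MemoryStream in 64 KB chunks and reports each time Capacity changes. The class and method names are made up for illustration, and the exact growth steps are an implementation detail that may differ between framework versions:

```csharp
using System;
using System.IO;

static class MemoryStreamGrowth
{
    // Writes `chunks` blocks of `chunkSize` bytes and returns how many times
    // the stream had to switch to a larger backing array.
    public static int CountReallocations(int chunkSize, int chunks)
    {
        byte[] chunk = new byte[chunkSize];
        int reallocations = 0;
        using (MemoryStream ms = new MemoryStream())
        {
            int lastCapacity = ms.Capacity;
            for (int i = 0; i < chunks; i++)
            {
                ms.Write(chunk, 0, chunk.Length);
                if (ms.Capacity != lastCapacity)
                {
                    // A capacity change means a new, larger byte[] was
                    // allocated and the old contents were copied into it.
                    Console.WriteLine("Length {0,12:N0}  Capacity {1,12:N0}",
                        ms.Length, ms.Capacity);
                    lastCapacity = ms.Capacity;
                    reallocations++;
                }
            }
        }
        return reallocations;
    }

    static void Main()
    {
        int n = CountReallocations(64 * 1024, 256); // 16 MB in total
        Console.WriteLine("Backing array replaced {0} times", n);
    }
}
```

Note that MemoryStream.Capacity is an int, so the type itself cannot describe a buffer beyond roughly 2GB regardless of how much memory the machine has.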

Joshua
A: 

AFAIK the CLR managed heap will not allocate anything bigger than 2GB, and MemoryStream is backed by a live, contiguous byte[]. The Large Object Heap doesn't handle allocations over 2GB, not even on x64.

But to store an entire file in memory just to compute a hash seems pretty low tech. You can compute the hash as you receive the bytes, chunk by chunk. After each IO completion you can hash the received bytes, then submit the write to the file. At the end, you have the hash computed and the file downloaded, hurray.

BTW, If you seek code to manipulate files, steer clear of any sample that contains the words ReadToEnd...

using System;
using System.IO;
using System.Net;
using System.Security.Cryptography;
using System.Threading;

class Program
{
    private static AutoResetEvent done = new AutoResetEvent(false);
    private static AsyncCallback _callbackReadStream;
    private static AsyncCallback _callbackWriteFile;

    static void Main(string[] args)
    {
        try
        {
            _callbackReadStream = new AsyncCallback(CallbackReadStream);
            _callbackWriteFile = new AsyncCallback(CallbackWriteFile);
            string url = "http://...";
            WebRequest request = WebRequest.Create(url);
            request.Method = "GET";
            request.BeginGetResponse(new AsyncCallback(
                CallbackGetResponse), request);
            // Block until the download finishes or a callback reports an error.
            done.WaitOne();
        }
        catch (Exception e)
        {
            Console.Error.WriteLine(e.Message);
        }
    }

    // Per-download state handed from one async callback to the next.
    private class State
    {
        public Stream ResponseStream { get; set; }
        public HashAlgorithm Hash { get; set; }
        public Stream FileStream { get; set; }
        private byte[] _buffer = new byte[16379];
        public byte[] Buffer { get { return _buffer; } }
        public int ReadBytes { get; set; }
        public long FileLength { get; set; }
    }

    static void CallbackGetResponse(IAsyncResult ar)
    {
        try
        {
            WebRequest request = (WebRequest)ar.AsyncState;
            WebResponse response = request.EndGetResponse(ar);

            State s = new State();
            s.ResponseStream = response.GetResponseStream();
            s.FileStream = new FileStream("download.out"
                , FileMode.Create
                , FileAccess.Write
                , FileShare.None);
            s.Hash = HashAlgorithm.Create("MD5");

            // Start the first read; the read and write callbacks then alternate.
            s.ResponseStream.BeginRead(
                s.Buffer
                , 0
                , s.Buffer.Length
                , _callbackReadStream
                , s);
        }
        catch (Exception e)
        {
            Console.Error.WriteLine(e.Message);
            done.Set();
        }
    }

    private static void CallbackReadStream(IAsyncResult ar)
    {
        try
        {
            State s = (State)ar.AsyncState;
            s.ReadBytes = s.ResponseStream.EndRead(ar);
            // Feed the chunk into the running hash. ComputeHash would restart
            // the hash on every call, so TransformBlock is used instead.
            s.Hash.TransformBlock(s.Buffer, 0, s.ReadBytes, null, 0);
            s.FileStream.BeginWrite(
                s.Buffer
                , 0
                , s.ReadBytes
                , _callbackWriteFile
                , s);
        }
        catch (Exception e)
        {
            Console.Error.WriteLine(e.Message);
            done.Set();
        }
    }

    static private void CallbackWriteFile(IAsyncResult ar)
    {
        try
        {
            State s = (State)ar.AsyncState;
            s.FileStream.EndWrite(ar);

            s.FileLength += s.ReadBytes;

            if (0 != s.ReadBytes)
            {
                // More data may follow: issue the next read.
                s.ResponseStream.BeginRead(
                    s.Buffer
                    , 0
                    , s.Buffer.Length
                    , _callbackReadStream
                    , s);
            }
            else
            {
                // End of stream: finalize the hash before reading it,
                // then release both streams.
                s.Hash.TransformFinalBlock(s.Buffer, 0, 0);
                s.FileStream.Close();
                s.ResponseStream.Close();
                Console.Out.WriteLine("Downloaded {0} bytes. Hash(base64):{1}",
                    s.FileLength, Convert.ToBase64String(s.Hash.Hash));
                done.Set();
            }
        }
        catch (Exception e)
        {
            Console.Error.WriteLine(e.Message);
            done.Set();
        }
    }
}
Remus Rusanu
I don't have the liberty to write the contents of the stream to disk (even temporarily) because the library is not authorized to write data to the user's file-system.
I don't understand then. You said your component downloads the file(s). How can it download it w/o writing to disk? Are you saying you want to let the browser handle the download, yet somehow get the download stream too for the MD5 checksum? Or do you want to download it twice, once by the browser and once by your code for the checksum? To compute the checksum you don't have to write it anywhere: just feed each buffer to the hash (TransformBlock per buffer, TransformFinalBlock at the end) and then discard the buffer.
Remus Rusanu
In order to retrieve an object from the network URI, the library provides the following interface: Stream GetObject(string uri); One of the options being considered was to overload GetObject with a flag that specified that an MD5 digest be computed for the stream on the user's behalf: Stream GetObject(string uri, bool fVerifyDigest); The network file server provides the MD5 digest for the object being downloaded as one of the HttpWebResponse headers. The idea was to compute the hash of the stream and compare it with the value returned by the server. Make sense?
I see. You should return your own class derived from Stream that implements this (computes the hash on the fly), similar to how DeflateStream, GZipStream, CryptoStream and other Stream-derived classes work. You construct your Stream to wrap the HTTP response stream. You override Read, and in the implementation you request the bytes from the HTTP stream, add them to the hash, then return them to the caller.
Remus Rusanu
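
The wrapper stream suggested in the last comment could look roughly like the sketch below. The class name HashingReadStream and its Hash property are hypothetical, not part of the framework or of the asker's library, and a MemoryStream again stands in for the HTTP response stream. The hash only becomes available once the caller has read to the end of the stream:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical read-only wrapper: hashes every byte the caller reads.
sealed class HashingReadStream : Stream
{
    private readonly Stream _inner;
    private readonly HashAlgorithm _hash;
    private bool _finished;

    public HashingReadStream(Stream inner, HashAlgorithm hash)
    {
        _inner = inner;
        _hash = hash;
    }

    // Null until Read has reported end of stream.
    public byte[] Hash { get { return _finished ? _hash.Hash : null; } }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int read = _inner.Read(buffer, offset, count);
        if (read > 0)
        {
            // Add exactly the bytes handed back to the caller.
            _hash.TransformBlock(buffer, offset, read, null, 0);
        }
        else if (!_finished && count > 0)
        {
            // End of stream: finalize so that _hash.Hash becomes valid.
            _hash.TransformFinalBlock(new byte[0], 0, 0);
            _finished = true;
        }
        return read;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }

    protected override void Dispose(bool disposing)
    {
        // The wrapper owns the inner stream; the hash object stays with the caller.
        if (disposing) { _inner.Dispose(); }
        base.Dispose(disposing);
    }
}

static class Demo
{
    static void Main()
    {
        byte[] payload = Encoding.ASCII.GetBytes("hello world");
        using (MD5 md5 = MD5.Create())
        using (HashingReadStream s = new HashingReadStream(new MemoryStream(payload), md5))
        {
            byte[] buffer = new byte[4];
            while (s.Read(buffer, 0, buffer.Length) > 0) { }
            Console.WriteLine(Convert.ToBase64String(s.Hash));
        }
    }
}
```

One consequence of this design is that a digest mismatch can only be detected after the caller has already consumed all the data, so the library would need some way to report the failure at end of stream (for example, by throwing from the final Read).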