views:

108

answers:

4

I'm hashing a file with one or more hash algorithms. When I tried to parametrize which hash types I want, it got a lot messier than I was hoping.

I think I'm missing a chance to make better use of generics or LINQ. I also don't like that I have to use a Type[] as the parameter instead of limiting it to a more specific set of type (HashAlgorithm descendants), I'd like to specify types as the parameter and let this method do the constructing, but maybe this would look better if I had the caller new-up instances of HashAlgorithm to pass in?

public List<string> ComputeMultipleHashesOnFile(string filename, Type[] hashClassTypes)
        {
            var hashClassInstances = new List<HashAlgorithm>();
            var cryptoStreams = new List<CryptoStream>();

            FileStream fs = File.OpenRead(filename);
            Stream cryptoStream = fs;

            foreach (var hashClassType in hashClassTypes)
            {
                object obj = Activator.CreateInstance(hashClassType);
                var cs = new CryptoStream(cryptoStream, (HashAlgorithm)obj, CryptoStreamMode.Read);

                hashClassInstances.Add((HashAlgorithm)obj);
                cryptoStreams.Add(cs);

                cryptoStream = cs;
            }

            CryptoStream cs1 = cryptoStreams.Last();

            byte[] scratch = new byte[1 << 16];
            int bytesRead;
            do { bytesRead = cs1.Read(scratch, 0, scratch.Length); }
            while (bytesRead > 0);

            foreach (var stream in cryptoStreams)
            {
                stream.Close();
            }

            foreach (var hashClassInstance in hashClassInstances)
            {
                Console.WriteLine("{0} hash = {1}", hashClassInstance.ToString(), HexStr(hashClassInstance.Hash).ToLower());
            }
        }
+3  A: 

Why are you supplying the types as Types and creating them rather than just allowing the user to pass in instances of HashAlgorithm? It seems like that would alleviate the problem altogether.

If this is a requirement, then what you have is really the only solution, since you can't specify a variable number of type parameters on a generic type or function (which it seems like you'd need, since it's an array now), and you can't enforce the types passed in to be of a particular inheritance line (any more than you can enforce that an integer parameter be between 1 and 10). That sort of validation can only be done at runtime.

Adam Robinson
+1  A: 

Just a minor point here, nothing ground breaking. Whenever you foreach over a list you can use linq. It's especially nice for one liners:

cryptoStreams.ForEach(s => s.Close());
hashClassInstances.ForEach(h => CW("{0} ...", h.ToString()...);
Joel
And as another minor point, this has nothing to do with LINQ. The `List<T>` class provides a `ForEach` method. This isn't a part of the `System.Linq` namespace.
Adam Robinson
And as another minor point, even if you wrote the `ForEach` extension method yourself on `IEnumerable<T>`, it doesn't offer any improvement whatsoever in terms of readability, but *does* result in a minor performance hit and a program that is much harder to test or debug.
Aaronaught
+1  A: 

What about something like this?

    public string ComputeMultipleHashesOnFile<T>(string filename, T hashClassType)
        where T : HashAlgorithm
    {

    }

The where clause restricts the T parameter to be of HashAlgorithm type. So you can create a class inheriting from HashAlgorithm and implement the abstract class members:

public class HA : HashAlgorithm
{
    protected override void HashCore(byte[] array, int ibStart, int cbSize)
    {
        throw new NotImplementedException();
    }

    protected override byte[] HashFinal()
    {
        throw new NotImplementedException();
    }

    public override void Initialize()
    {
        throw new NotImplementedException();
    }
}
CRice
This doesn't allow for multiple algorithms. His original code used an array for the `Type`s.
Adam Robinson
Well then, make another method (which handles array of [string, HashAlgorithm]) call this one
CRice
+1  A: 

Let's start by breaking the problem down. Your requirement is that you need to compute several different kinds of hashes on the same file. Assume for the moment that you don't need to actually instantiate the types. Start with a function that has them already instantiated:

public IEnumerable<string> GetHashStrings(string fileName,
    IEnumerable<HashAlgorithm> algorithms)
{
    byte[] fileBytes = File.ReadAllBytes(fileName);
    return algorithms
        .Select(a => a.ComputeHash(fileBytes))
        .Select(b => HexStr(b));
}

That was easy. If the files might be large and you need to stream it (keeping in mind that this will be much more expensive in terms of I/O, just cheaper for memory), you can do that too, it's just a little more verbose:

public IEnumerable<string> GetStreamedHashStrings(string fileName,
    IEnumerable<HashAlgorithm> algorithms)
{
    using (Stream fileStream = File.OpenRead(fileName))
    {
        return algorithms
            .Select(a => {
                fileStream.Position = 0;
                return a.ComputeHash(fileStream);
            })
            .Select(b => HexStr(b));
    }
}

It's a little gnarly and in the second case it's highly questionable whether or not the Linq-ified version is any better than an ordinary foreach loop, but hey, we're having fun, right?

Now that we've disentangled the hash-generation code, instantiating them first isn't really that much more difficult. Again we'll start with code that's clean - code that uses delegates instead of types:

public IEnumerable<string> GetHashStrings(string fileName,
    params Func<HashAlgorithm>[] algorithmSelectors)
{
    if (algorithmSelectors == null)
        return Enumerable.Empty<string>();
    var algorithms = algorithmSelectors.Select(s => s());
    return GetHashStrings(fileName, algorithms);
}

Now this is much nicer, and the benefit is that it allows instantiation of the algorithms within the method, but doesn't require it. We can invoke it like so:

var hashes = GetHashStrings(fileName,
    () => new MD5CryptoServiceProvider(),
    () => new SHA1CryptoServiceProvider());

If we really, really, desperately need to start from the actual Type instances, which I'd try not to do because it breaks compile-time type checking, then we can do that as the last step:

public IEnumerable<string> GetHashStrings(string fileName,
    params Type[] algorithmTypes)
{
    if (algorithmTypes == null)
        return Enumerable.Empty<string>();
    var algorithmSelectors = algorithmTypes
        .Where(t => t.IsSubclassOf(typeof(HashAlgorithm)))
        .Select(t => (Func<HashAlgorithm>)(() =>
            (HashAlgorithm)Activator.CreateInstance(t)))
        .ToArray();
    return GetHashStrings(fileName, algorithmSelectors);
}

And that's it. Now we can run this (bad) code:

var hashes = GetHashStrings(fileName, typeof(MD5CryptoServiceProvider),
    typeof(SHA1CryptoServiceProvider));

At the end of the day, this seems like more code but it's only because we've composed the solution effectively in a way that's easy to test and maintain. If we wanted to do this all in a single Linq expression, we could:

public IEnumerable<string> GetHashStrings(string fileName,
    params Type[] algorithmTypes)
{
    if (algorithmTypes == null)
        return Enumerable.Empty<string>();
    byte[] fileBytes = File.ReadAllBytes(fileName);
    return algorithmTypes
        .Where(t => t.IsSubclassOf(typeof(HashAlgorithm)))
        .Select(t => (HashAlgorithm)Activator.CreateInstance(t))
        .Select(a => a.ComputeHash(fileBytes))
        .Select(b => HexStr(b));
}

That's all there really is to it. I've skipped the delegated "selector" step in this final version because if you're writing this all as one function you don't need the intermediate step; the reason for having it as a separate function earlier is to give as much flexibility as possible while still maintaining compile-time type safety. Here we've sort of thrown it away to get the benefit of terser code.


Edit: I will add one thing, which is that although this code looks prettier, it actually leaks the unmanaged resources used by the HashAlgorithm descendants. You really need to do something like this instead:

public IEnumerable<string> GetHashStrings(string fileName,
    params Type[] algorithmTypes)
{
    if (algorithmTypes == null)
        return Enumerable.Empty<string>();
    byte[] fileBytes = File.ReadAllBytes(fileName);
    return algorithmTypes
        .Where(t => t.IsSubclassOf(typeof(HashAlgorithm)))
        .Select(t => (HashAlgorithm)Activator.CreateInstance(t))
        .Select(a => {
            byte[] result = a.ComputeHash(fileBytes);
            a.Dispose();
            return result;
        })
        .Select(b => HexStr(b));
}

And again we're kind of losing clarity here. It might be better to just construct the instances first, then iterate through them with foreach and yield return the hash strings. But you asked for a Linq solution, so there you are. ;)

Aaronaught
Any particular reason you're chaining the `Select` statements in this fashion? I don't see a reason that you couldn't instantiate, use, then `Dispose` the instance from within the same block.
Adam Robinson
@Adam Robinson: Readability, mainly. It's either chain the `Select` statements or chain a bunch of method calls, and the former formats nicer in SO's narrow little code block. As I mentioned a few times, I'd probably just use a single `foreach` loop if I were writing this for myself, but he asked for Linq, so that's what I supplied. :P
Aaronaught
@Aaron: What I'm saying is that the two select statements could be combined into one with basically no change to the code...
Adam Robinson
@Adam: So? This isn't code golf. I would rather represent the two operations as separate and distinct as opposed to chaining them as `HexStr(ComputeHash(fileBytes))`. I could write the entire function on one line if I wanted to but that isn't going to make it significantly faster and definitely doesn't make it easier to understand. If the OP wants to flatten it then that's up to him; as you've pointed out here, it's fairly self-evident how to do that.
Aaronaught
@Aaron: No offense was intended, I just thought that the extraneous `Select` statements made the code more difficult to read.
Adam Robinson
@Adam: No offense taken. I suppose we will have to agree to disagree on this point of code style. :P
Aaronaught
Great! I like where you went in the middle with Delegates instead of Types, interesting idea. The linq examples are also what I was looking for, although you are correct that it loses some clarity. I am hitting large files so I need the cryptoStreams to chain together so that the file is only read through once and the hashes are done simultaneously. I'm adapting in your other ideas though. Thanks.
DanO