I need to determine relatively quickly whether a set of files on a user's machine has already been processed by my application. The app uploads the files to a server on the user's behalf, and if the files have been uploaded before, it skips the upload. My plan so far is to hash the files and store the results along with an identifier of how they were uploaded to the server. The problem I expect to run into is that storing this data could become cumbersome because of the length of the hashes. I'm expecting around 30-40 files right now, but that could double or (hypothetically) even triple.

Would it be feasible to store this in a Dictionary, with the hashes as the keys and the server information as the values? I would then store that Dictionary in the app's Properties.Settings.Default object. Is that workable with the settings system, or will I run into some sort of problem there? Note that due to the nature of the application, there is no chance of two users ever having the same set of data, so I don't need to compare uploads between users. Additionally, what would the performance be like for this type of operation? Users are expected to have at least a Pentium-M 1.5 GHz processor with 1 GB of RAM.

+2  A: 

I probably wouldn't put the dictionary into the app.config file, although I guess you could, depending on the server information. I'd probably just put it in a text file on its own unless you found that to be more of a problem for some reason. It feels like it's more data for the application than configuration of the application.

Performance shouldn't be an issue at all - dictionaries are designed to still be efficient with millions of entries, let alone the tens or hundreds you're talking about.
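
A minimal sketch of that idea, written in Python for brevity rather than C#: a plain dictionary keyed by hash, persisted to its own small text file instead of the application's configuration. The file format (JSON), the helper names and the shape of the "server information" value are all assumptions made for illustration, not something specified in the question.

```python
import json
from pathlib import Path


def load_upload_index(path):
    """Return the saved {hash: server_info} mapping, or {} if no file exists yet."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding="utf-8"))


def save_upload_index(path, index):
    """Write the whole mapping back to disk as human-readable JSON."""
    Path(path).write_text(
        json.dumps(index, indent=2, sort_keys=True), encoding="utf-8"
    )


def already_uploaded(index, digest):
    """Dictionary membership test; average cost does not grow with the entry count."""
    return digest in index
```

With a few dozen or a few hundred entries, rewriting the whole file on every change is unlikely to be noticeable.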

Jon Skeet
I actually wasn't going to store it in the app.config file but in the user.config file. Your point is a good one, though, and I'll probably keep it separate. No need for the user.config file to balloon! As for the Dictionary, is there a limit on the size of the key it can store? If I just concatenate the hashes together, will that work? And for performance, I'm worried about the hashing of the files. Will the laptops these users have be able to do this in a reasonable period of time?
jasonh
There's no need to start concatenating the hashes - each hash will be fairly short, and dictionaries can cope with long keys anyway. And yes, laptops should be absolutely fine for hashing - most hashes are relatively computationally cheap; the bulk of the time will be taken just reading the file.
Jon Skeet
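To illustrate the point about reading dominating the cost, here is a rough sketch (in Python, standing in for the .NET hashing classes) that hashes a file in fixed-size chunks so memory use stays small regardless of file size. The function name and the 64 KB chunk size are arbitrary choices for illustration.

```python
import hashlib


def sha1_of_file(path, chunk_size=64 * 1024):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The result is a 40-character hex string, which is short enough to use directly as a dictionary key.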
I think I missed a vital piece of information. The files go together as a set, so it wouldn't really make sense to create one dictionary entry per file, would it?
jasonh
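One possible way to treat the files as a set, sketched in Python under the assumption that each file has already been hashed individually: sort the per-file digests, join them, and hash the result to get a single fixed-length key for the whole set. The function name is hypothetical, and this is only one of several reasonable schemes.

```python
import hashlib


def set_key(file_digests):
    """Combine per-file hex digests into one key for the whole set.

    Sorting first makes the key independent of the order in which
    the files were enumerated.
    """
    joined = "\n".join(sorted(file_digests))
    return hashlib.sha1(joined.encode("ascii")).hexdigest()
```

Note that this scheme ignores file names; if renaming a file should count as a different set, the names would need to be included in what gets hashed.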
+1  A: 

In reference to getting the hash values, I thought I'd mention this...

Using a hash value is good, so long as you get the same result each time without fail. I've read that .GetHashCode() isn't guaranteed to return the same value across different versions of .NET, so if you're planning on saving the hash in a persistent state, I'd avoid .GetHashCode(). If everything happens within a single run, .GetHashCode() is fine as a quick check, keeping in mind that equal hash codes don't by themselves prove two things are identical.

If you need to persist the hash, there are hashing classes available in .NET under System.Security.Cryptography. I'm admittedly not an expert with this, but the SHA1 class there has a ComputeHash method.
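
The same distinction exists in other runtimes, so here is a small Python sketch of it (not .NET code): a cryptographic digest is a pure function of the input bytes and is safe to persist, whereas Python's built-in `hash()` for strings is randomized per process by default, much like the caveat about .GetHashCode() above. The function name is made up for illustration.

```python
import hashlib


def stable_digest(data: bytes) -> str:
    """SHA-1 hex digest: identical on every run, machine and runtime version."""
    return hashlib.sha1(data).hexdigest()


# By contrast, hash("abc") can differ from one interpreter run to the next
# unless PYTHONHASHSEED is fixed, so it should never be written to disk.
```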

Hugoware
When it comes to files, a hash usually refers to SHA1, MD5 etc - not GetHashCode(). I certainly *assumed* that was what the OP meant...
Jon Skeet
It probably was, but then again, some people don't know that, so I thought I'd throw it out there anyway.
Hugoware
Yes, that was exactly what I meant. Thanks Mr. Skeet. :)
jasonh
A: 

Why not compare each file's last-modified DateTime instead? For this, you would need to save the modified date on the server.
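
A rough sketch of the comparison itself, in Python rather than .NET. Where the recorded timestamp comes from (the server, as suggested here, or anywhere else) is left open, and the function name is hypothetical.

```python
import os


def file_changed(path, recorded_mtime):
    """True if the file's current modification time differs from the recorded one."""
    return os.stat(path).st_mtime != recorded_mtime
```

Keep in mind that a modification time can change without the content changing (and, less commonly, the reverse), so this is a cheaper but weaker test than comparing hashes.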

Vivek
I'd rather not do it that way. I have SCP access to the server, but I'd like to keep network traffic to a minimum, hence the need to avoid re-uploading the same data. Pulling down a catalog of what the server has could become exceedingly slow as the userbase grows and extremely wasteful given the impossibility of file collisions between users.
jasonh