I want to create a "temporary cache lookup" to speed up a file lookup on my webserver.

What I have now is a folder full of images; if a user requests an image, I use File.Exists(...) to check whether the image exists. If it doesn't, I download it from another server, and either way I redirect the user to it.

The problem with this is that many simultaneous requests cause File.Exists() to hang. I would like to keep a quick and dirty HashSet of filenames known to be in the local folder: if a user requests a file and it exists in the HashSet, just redirect to it without doing the File.Exists(); if it doesn't exist in the HashSet, do the File.Exists() and then add it.
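A minimal sketch of this check-then-cache flow (class and method names are hypothetical; a plain `lock` is used here for thread safety, and the answers below discuss alternatives):

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical cache of filenames known to exist in the local image folder.
public class KnownFileCache
{
    private readonly HashSet<string> _known = new HashSet<string>();
    private readonly object _sync = new object();
    private readonly string _folder;

    public KnownFileCache(string folder) { _folder = folder; }

    // Returns true if the file is known (or found) locally.
    // The expensive File.Exists call only runs on a cache miss.
    public bool ExistsLocally(string fileName)
    {
        lock (_sync)
        {
            if (_known.Contains(fileName))
                return true;                  // fast path: no disk hit
        }

        // Do the slow I/O outside the lock so other requests aren't blocked.
        bool onDisk = File.Exists(Path.Combine(_folder, fileName));

        if (onDisk)
        {
            lock (_sync)
            {
                _known.Add(fileName);         // Add is a no-op for duplicates
            }
        }
        return onDisk;
    }
}
```

On a miss, the caller would download the file from the other server and then either add the name directly or let the next request's `File.Exists` pick it up.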

I know that the HashSet would get blown away if the server ever gets restarted, but I'm not worried about that, because it would quickly "rebuild" itself with the most requested images using the above scenario.

The main question is, since this object would be getting called from multiple users, and various requests would all be adding items to the set, would this cause a problem? Up to now all I've used in the static global sense on webservers is DB connection strings or an email address to send alerts to, etc.

Edit with regards to race conditions:
Yes, I was thinking of race conditions. But is a race condition even possible on a single HashSet? Since it only allows unique values, wouldn't the second attempt to add a value fail? At which point I would just ignore the error and continue on.

+3  A: 

It makes sense. However, make sure you take care of possible race conditions. You could use the ReaderWriterLockSlim class to control access to the object from different threads.

UPDATE:

You absolutely need to lock the object appropriately, as the Add method is not an atomic operation. You can even leave the object in an inconsistent state if two threads add items simultaneously.

Mehrdad Afshari
You mentioned in another answer that `lock {}` would not be performance friendly. What advantages does ReaderWriterLockSlim offer over `lock {}`?
Neil N
If you use `lock`, you'll have to use it **every time** you read or write to the object, and only one thread can access the object at a time. By using a reader-writer lock, only a single writer can access the object, but multiple readers can read from it simultaneously. However, if you can keep the locking regions small enough, the performance difference can be negligible, and the `lock` statement is much easier to implement.
Mehrdad Afshari
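To illustrate the difference, here is a sketch of the same kind of lookup guarded by a `ReaderWriterLockSlim` (names hypothetical): many threads may hold the read lock at once, while a writer gets exclusive access.

```csharp
using System.Collections.Generic;
using System.Threading;

public class RwlCache
{
    private readonly HashSet<string> _known = new HashSet<string>();
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    public bool Contains(string fileName)
    {
        _lock.EnterReadLock();       // shared: many threads may read together
        try { return _known.Contains(fileName); }
        finally { _lock.ExitReadLock(); }
    }

    public void Add(string fileName)
    {
        _lock.EnterWriteLock();      // exclusive: blocks all readers and writers
        try { _known.Add(fileName); }
        finally { _lock.ExitWriteLock(); }
    }
}
```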
@Mehrdad: Why would a `ReaderWriterLockSlim` be any better than a plain `lock` block in this situation? To avoid race conditions you'd always need to enter the `RWLS` in upgradeable mode, check the `HashSet` and then, if necessary, upgrade to write mode and update the `HashSet` once you've downloaded the file etc. Since only one thread can be in upgradeable mode at any time, this wouldn't really be much different to a plain `lock`, or would it?
LukeH
Luke, the problem you are thinking of is addressed by the ability to upgrade a read lock to a write lock. So you can still 'quickly' verify that a file exists.
Henk Holterman
I actually do exactly this, and despite claims by other posters about the system getting out of sync, if you know your environment it is a completely acceptable approach, IMHO.
Noon Silk
@Henk: You can't upgrade a plain read lock to a write lock, you'd need to exit the read lock first and then enter the write lock, which introduces the (slim) possibility of a race. Only an *upgradeable* read lock can be bumped up to a write lock, and only one thread can hold an upgradeable lock at a time (ie, it blocks all other upgradeable locks).
LukeH
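The upgradeable pattern being debated here looks roughly like this (a sketch with hypothetical names; note that only one thread at a time may hold the upgradeable lock, though it does not block plain readers — which is the serialization LukeH is pointing out):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading;

public class UpgradeableCache
{
    private readonly HashSet<string> _known = new HashSet<string>();
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    public bool CheckOrProbe(string folder, string fileName)
    {
        _lock.EnterUpgradeableReadLock();   // read access, upgrade allowed
        try
        {
            if (_known.Contains(fileName))
                return true;                // quick check, no write lock taken

            if (File.Exists(Path.Combine(folder, fileName)))
            {
                _lock.EnterWriteLock();     // upgrade: now exclusive
                try { _known.Add(fileName); }
                finally { _lock.ExitWriteLock(); }
                return true;
            }
            return false;
        }
        finally { _lock.ExitUpgradeableReadLock(); }
    }
}
```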
If this were my application I would use a plain old lock. The semantics are easier to understand, and I doubt the ReaderWriterLockSlim is going to be so much faster that it matters. As we all know, dealing with thread synchronization issues is hard enough. There is no need to complicate something unnecessarily.
Brian Gideon
@Brian G: But in reality, after the initial build-up of the cache I will be doing 99% reads and only 1% writes, so it makes sense to allow multiple threads to read.
Neil N
+4  A: 

In addition to the race condition problems people have mentioned, you are asking for a number of other problems.

First off, a cache without a cache lifetime policy has a name -- we call caches that grow without bound and never release their data "memory leaks". They have a way of bringing servers to their knees. Think hard about how you're going to know when your cache is getting too big and it's time to throw it away.
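One crude way to bound the cache, in the spirit of the question's "quick and dirty" approach, is to throw the whole set away once it passes a size cap and let it rebuild itself (a sketch; the cap value is an arbitrary example):

```csharp
using System.Collections.Generic;

// Hypothetical set that resets itself when it grows past a fixed cap,
// trading occasional re-probing for a bounded memory footprint.
public class CappedSet
{
    private readonly object _sync = new object();
    private HashSet<string> _items = new HashSet<string>();
    private readonly int _cap;

    public CappedSet(int cap) { _cap = cap; }

    public void Add(string item)
    {
        lock (_sync)
        {
            if (_items.Count >= _cap)
                _items = new HashSet<string>();   // reset; cache rebuilds itself
            _items.Add(item);
        }
    }

    public bool Contains(string item)
    {
        lock (_sync) { return _items.Contains(item); }
    }
}
```

A real lifetime policy (LRU, sliding expiration, or ASP.NET's built-in cache) would be gentler, but even this crude reset prevents unbounded growth.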

Second, caches are worse than useless if they are liars. Your cache alleges to be a speedup over the file system, but exactly what mechanism do you propose to keep the cache and the file system consistent? Suppose a user accesses a file, the information gets cached, and then some other operation deletes that file. Now the cache is out of sync with the thing it is supposed to be abstracting over; this sounds like a potential source of bugs. The next time someone accesses that file, the cache is going to say that it exists even though it no longer does.
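One way to tackle this staleness problem in .NET is a FileSystemWatcher that evicts entries when files are deleted or renamed (a sketch only; error handling, buffer-overflow events, and watcher disposal are omitted, and watcher events are asynchronous, so a short window of inconsistency remains):

```csharp
using System.Collections.Generic;
using System.IO;

public class WatchedCache
{
    private readonly HashSet<string> _known = new HashSet<string>();
    private readonly object _sync = new object();
    private readonly FileSystemWatcher _watcher;

    public WatchedCache(string folder)
    {
        _watcher = new FileSystemWatcher(folder);
        _watcher.Deleted += (s, e) => Remove(e.Name);     // evict on delete
        _watcher.Renamed += (s, e) => Remove(e.OldName);  // evict old name on rename
        _watcher.EnableRaisingEvents = true;
    }

    public void Add(string fileName)
    {
        lock (_sync) { _known.Add(fileName); }
    }

    public bool Contains(string fileName)
    {
        lock (_sync) { return _known.Contains(fileName); }
    }

    private void Remove(string fileName)
    {
        lock (_sync) { _known.Remove(fileName); }
    }
}
```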

These are of course solvable problems; the question is not whether they are solvable, but rather, whether the expense of solving them is going to be paid for by the increase in performance you eke out.

Or, you could ignore these problems and hope for the best. You might, for example, decide that the cost of running into a memory leak or a cache inconsistency is simply a cost of doing business that the increase in performance pays for. If that's the case, then it would be wise to make the decision to take a brittle, low-cost solution intentionally, rather than doing so accidentally and being surprised later when things break.

Eric Lippert