views:

117

answers:

2

I built an app that performs work on thousands of files, then writes modified copies of these files to the disk. I am using a ThreadPool but it was spawning so many threads the pc was becoming unresponsive 260 total), so i changed the max from the default of 250 down to 50, this solved that issue (app only spawns about 60 threads total), however now that the files are becoming ready so quickly, its tying up the UI to the point where the pc is unresponsive.

Is there a way to limit the amount of I/O - i mean, i like using 50 threads to perform the work on the files, but not 50 threads writing at the same time when they are processed. I would rather not re-architect the writing of the files part if i can keep from it - i was hoping i could limit the amount of I/O (simultaneous) the threads from this pool could consume.

+7  A: 

Use a semaphore to limit no. of threads wanting to write to disk simultaneously.

http://msdn.microsoft.com/en-us/library/system.threading.semaphore.aspx

Limits the number of threads that can access a resource or pool of resources concurrently.

Hasan Khan
Awesome, thanks
schmoopy
+4  A: 

You really don't need so many threads. A disk can only support its maximum read and write throughput, which a single thread can easily max-out if it is dedicated to IO i.e. reading or writing. You also cannot read and write to a hard disk simultaneously (although this is complicated with OS caching layers, etc), so having concurrent threads reading and writting can be very counter-productive. There is also little to be gained from having more threads than processors\cores for your non-IO tasks as any additional threads will spend much of their time waiting for a core to become available e.g. if you have 50 threads and 4 cores, a minimum of 46 of the threads will be idle at any given time. The wasted threads will contribute to both memory consumption also incur performance overhead as they will all be fighting to get a crack at some time on a core, and the OS has to arbitrate this fight.

A more straightforward approach would be have a single thread whose job it is to read in the files, and then add the data to a blocking queue (e.g. see ConcurrentQueue), meanwhile have a number of worker threads that are waiting on file data in the queue (e.g. a number threads equal to the number of processors\cores). These worker threads will munch their way through the queue as items are added, and block when it is empty. When a worker thread finishes a piece of work, it can add that to another blocking queue which is being monitored either by the reader thread or a dedicated writer thread. Its job is to write the files out.

This pattern seeks to balance IO and CPU amongst a much smaller bunch of co-operating threads, where the number of IO threads is limited to what is physically capable by a hard drive, and a number of CPU worker threads that is sensible for the number of processors\cores you have. In essence it separates IO and CPU work so that things behave more predictably.

Further to this, if IO really is the problem (and not a huge amount of threads all fighting each other), then you can place some pauses (e.g. Thread.Sleep) in your file reading and writing threads to limit how much work they do.

Update

Perhaps it is worth explaining why there are so many threads being generated in the first place. This is a degenerative case for threadpool use, and is centred around queueing workitems that have a component of IO in them.

The threadpool executes work items from its queue and monitors how long executing work items are taking. If currently executing workitems are taking a long time to complete (I think half a second from memory) then it will start adding more threads to the pool as it believes this will get the queue processed quicker\more fairly. However, if the additional concurrent workitems are also performing work IO against a shared disk, then performance of the disk will actually reduce, meaning that workitems will take even longer to execute. Because workitems are taking longer to execute, the threadpool adds more threads. This is the degenerative case, where performance gets worse and worse as more threads are added.

The use of a semaphore as suggested would have to be done carefully, as the semaphore could cause blocking of threadpool threads, the threadpool would see workitems taking a long time to execute, and it will still start adding more threads.

chibacity
The ThreadPool performs complex calculates based on the contents of each file and using the ThreadPool sped this part of the process up LOTS :-)
schmoopy
@schmoopy I'm not sure I understand your comment. I'm familiar with the threadpool and file processing, hence I answered your question. In detail I might add...
chibacity