views: 768
answers: 11
  1. I have a method that uses a connection (e.g. a method that downloads a page).
  2. I have to execute this method multiple times (e.g. download 1000 pages).
  3. Doing it the synchronous and sequential way takes a long time.
  4. I have limited resources (max 8 threads and/or max 50 simultaneous connections).
  5. I want to use all resources to accelerate it.
  6. I know that parallelization (PLINQ, Parallel Extensions, etc.) could solve the problem, but I have already tried it and that approach fails because of the limited resources.
  7. I don't want to reinvent the wheel of parallelizing this kind of task while managing the resources; someone must have done it before and provided a library/tutorial for it.

Can anyone help?

Update Things get much more complicated when you start to mix asynchronous calls with parallelization for maximum performance. Several downloaders implement this, like the Firefox download manager: it runs 2 downloads simultaneously and, when one of them completes, starts the next file, and so on. It may seem very simple to implement, but when I implemented it I had, and still have, trouble making it generic (useful for both WebRequest and DbCommand) and dealing with problems (i.e. timeouts).

Bounty Hunters The bounty will be granted to the first one that links a reliable and free ($$) .NET library that provides a simple C# way to parallelize async tasks such as HttpWebRequest.BeginGetResponse and SqlCommand.BeginExecuteNonQuery. The parallelization must not wait for N tasks to complete before starting the next N; it must start a new task as soon as one of the N initial ones finishes. The method must also provide timeout handling.

+3  A: 

You could use the .NET System.Threading.ThreadPool class. You can set the maximum number of threads to be active at any one time using ThreadPool.SetMaxThreads().
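For illustration, a minimal sketch of that approach, assuming a hypothetical FetchPage method and a placeholder URL list (note that SetMaxThreads caps the single process-wide pool, which is what the comments below warn about):

using System;
using System.Threading;

class ThreadPoolThrottleSketch
{
    static void Main()
    {
        // Cap the shared pool at 8 worker threads and 8 completion-port threads.
        // This limit applies to everything in the process that uses the thread pool.
        ThreadPool.SetMaxThreads(8, 8);

        string[] urls = { "http://example.com/1", "http://example.com/2" }; // placeholder list
        int pending = urls.Length;
        using (ManualResetEvent allDone = new ManualResetEvent(false))
        {
            foreach (string url in urls)
            {
                string u = url; // capture a copy for the closure
                ThreadPool.QueueUserWorkItem(delegate
                {
                    try { FetchPage(u); }                  // hypothetical download method
                    finally
                    {
                        if (Interlocked.Decrement(ref pending) == 0)
                            allDone.Set();                 // last item finished
                    }
                });
            }
            allDone.WaitOne();                             // block until all work items complete
        }
    }

    static void FetchPage(string url) { /* download the page here */ }
}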

Kev
And what about connections? How do I manage them?
Jader Dias
I am thinking about that one :)
Kev
tie connections to the threads, so each thread manages n connections, up to your max. i.e. 1 thread could do 50, or 2 threads could do 25 each.
Darryl Braaten
Q : "Is it max 50 http connections per thread or across all threads?" A: Is across all threads.
Jader Dias
How do you plan on taking advantage of multiple connections in a single thread. Don't asynchronous HTTP requests use a new thread for the callback?
mbeckish
Eek! Don't throttle the master thread-pool... you'll cripple anything else running in the runtime. Very very bad idea. I've seen apps deadlock because of this.
Marc Gravell
Yep, the more I thought about this, the more I thought it was wrong.
Kev
+4  A: 

Look into a counting semaphore for the connections. http://en.wikipedia.org/wiki/Semaphore_(programming)

EDIT: To answer your comment the .NET Framework has one already. http://msdn.microsoft.com/en-us/library/system.threading.semaphore.aspx
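For example, a minimal sketch using System.Threading.Semaphore to cap the number of simultaneous connections at 50 (FetchPage is a hypothetical placeholder for the real download method):

using System.Threading;

class ConnectionThrottleSketch
{
    // At most 50 connection "slots" may be held at any one time.
    static readonly Semaphore ConnectionSlots = new Semaphore(50, 50);

    static void FetchPage(string url)
    {
        ConnectionSlots.WaitOne();      // block until a slot is free
        try
        {
            // open the connection and download the page here
        }
        finally
        {
            ConnectionSlots.Release();  // return the slot, even if the download failed
        }
    }
}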

Jeremy Wilde
OK, I have written a couple of classes that do it. But I wonder if it is already implemented in the .NET Framework or in someone's library.
Jader Dias
+2  A: 

I would strongly recommend staying away from the thread pool except for very short tasks. If you choose to use a semaphore, make sure that you only block in the code that is queuing the work items, not at the start of the work item code, or you will quickly deadlock the thread pool if your (semaphore max count * 2) is greater than the max pool threads.

In practice you can really never safely acquire a lock on a pool thread, nor can you safely make calls to most async APIs (or sync APIs like HttpWebRequest.GetResponse, as it also performs async operations under the covers on the thread pool).
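A rough sketch of that advice, assuming a hypothetical FetchPage method and a cap of 50 concurrent work items: the semaphore is acquired on the queuing thread, never on a pool thread, and released when the work item finishes.

using System.Threading;

class QueueSideThrottleSketch
{
    static readonly Semaphore Slots = new Semaphore(50, 50);

    // Runs on the caller's own thread, not on a pool thread.
    static void QueueDownloads(string[] urls)
    {
        foreach (string url in urls)
        {
            Slots.WaitOne();                       // block here, in the queuing code
            string u = url;                        // capture a copy for the closure
            ThreadPool.QueueUserWorkItem(delegate
            {
                try { FetchPage(u); }              // the work item itself never waits on the semaphore
                finally { Slots.Release(); }
            });
        }
    }

    static void FetchPage(string url) { /* download the page here */ }
}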

Matt Davison
+2  A: 
  1. Create a data structure to keep track of which pages have been fetched and which still need to be fetched, e.g. a queue.

  2. Using the Producer/Consumer Queue pattern, dispatch 8 consumer threads to do your fetches. That way, you know that you will never exceed your 8 thread limit.

See here for a good example.
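As a rough sketch of that pattern (since the full list of pages is known up front, a plain queue behind a lock is enough; FetchPage stands in for the real download method):

using System.Collections.Generic;
using System.Threading;

class FetchQueueSketch
{
    static readonly Queue<string> Pages = new Queue<string>();   // 1. what still needs to be fetched
    static readonly object QueueLock = new object();

    static void Main()
    {
        for (int i = 1; i <= 1000; i++)
            Pages.Enqueue("http://example.com/page" + i);

        Thread[] consumers = new Thread[8];                      // 2. never more than 8 threads
        for (int i = 0; i < consumers.Length; i++)
        {
            consumers[i] = new Thread(ConsumeLoop);
            consumers[i].Start();
        }
        foreach (Thread t in consumers) t.Join();                // wait for every page to be fetched
    }

    static void ConsumeLoop()
    {
        while (true)
        {
            string url;
            lock (QueueLock)
            {
                if (Pages.Count == 0) return;                    // nothing left, the consumer exits
                url = Pages.Dequeue();
            }
            FetchPage(url);
        }
    }

    static void FetchPage(string url) { /* download the page here */ }
}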

mbeckish
In response to your question: I plan to use multiple connections per thread by using multiple short-lived threads to start multiple long-duration async requests. I say short-lived from my application's point of view, even if the underlying library uses an active thread to wait for the async callback.
Jader Dias
I already implemented a library that could do that, but it's not as robust as I need.
Jader Dias
"I plan to use multiple connections per threading using multiple short-lived threads"So then the "8 max threads" isn't a strict requirement?
mbeckish
"but its not robust as I need"Please elaborate. From your description above, all you are looking to do is 1000 HTTP fetches in parallel without exceeding 8 threads and 50 connections.
mbeckish
I don't understand. You require a max of eight threads, but you are launching async requests. This will spin up many more threads in the ThreadPool. You are going to use 8 threads to call an API that already effectively creates threads to do the work? You'll only need one thread to do this.
spender
@mbeckish: "So then the 8 max threads isn't a strict requirement?" It isn't strict as long as the threads are not CPU bound.
Jader Dias
@mbeckish: "not robust" means: "my code has lot of bugs, I can't fix them and I am searching for someone's else code".
Jader Dias
@mbeckish: "all you are looking to do is 1000 HTTP fetches in parallel without exceeding 8 threads and 50 connections." Yep, that'll do if it is flexible(generic) enough.
Jader Dias
+2  A: 

Jeffrey Richter has a Power Threading Library that might help you. It's chock-full of samples and is pretty powerful. I couldn't find a quick sample involving connections, but there are plenty of examples that might work for you with regard to coordinating multiple asynchronous operations.

It can be downloaded from here and there are several articles and samples here. Also, this link has a detailed article from Jeffrey explaining concurrent asynchronous operations.

Sailing Judo
I strongly agree with this. AsyncEnumerator is a fantastic thread management tool. And while it doesn't manage connections, it makes it ridiculously easy to just create connection management objects.
Mystere Man
+5  A: 

Can you give more information on why Parallel LINQ won't work?

From my point of view, your task is best suited to PLINQ. If you run on an 8-core machine, PLINQ will split the work into 8 tasks and queue all remaining tasks for you.

Here is draft code:

PagesToDownload.AsParallel().ForAll(DownloadMethodWithLimitConnections);

I don't understand why PLINQ uses up your resources. Based on my tests, PLINQ performance is even better than using the ThreadPool.
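As a slightly fuller sketch of the draft above, the degree of parallelism can be capped explicitly; this assumes the .NET 4 PLINQ surface (the Parallel Extensions CTP available at the time exposed the limit differently), and Download is a placeholder for the real method:

using System.Linq;

class PlinqThrottleSketch
{
    static void Main()
    {
        var pagesToDownload = System.Linq.Enumerable.Range(1, 1000)
                                                    .Select(i => "http://example.com/page" + i);

        pagesToDownload
            .AsParallel()
            .WithDegreeOfParallelism(8)   // never more than 8 concurrent workers
            .ForAll(Download);
    }

    static void Download(string url) { /* download with your own connection limit here */ }
}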

chaowman
Perhaps I haven't used it the right way, or I have lied about using it; I'll recheck my tests to see what I did wrong.
Jader Dias
Are you saying that on a single-core machine, the above statement won't be run concurrently? If the Download method is IO-bound rather than CPU-bound, it would make sense to use more threads than CPUs...
Rob Fonseca-Ensor
@Fonseca-Ensor: Parallel LINQ is CPU-bound. For this question, we need to create tasks for each core first. Then you can use Future<T>, which supports IO-bound work, to create connections. For more info, please read http://blogs.msdn.com/pfxteam/archive/2008/03/16/8272833.aspx
chaowman
+3  A: 

See the CCR (Concurrency and Coordination Runtime). This is the 'right' way to do it, although you may find the library's learning curve a bit too much...

Matt Davison
wow! I'm impressed. This article http://msdn.microsoft.com/en-us/magazine/cc163556.aspx helps more than the MSDN documentation.
Jader Dias
I'm currently trying to get my head around the CCR. This task is being made all the harder because I'm also learning C# at the same time!
Harry
I feel for you... It takes a different mindset to get your head wrapped around it.
Matt Davison
Getting there. C# hasn't been too bad (from a VB.NET perspective). CCR is DEFINITELY the way to manage multiple concurrent operations.
Harry
Here is an example in CCR which is very similar to your problem: http://social.msdn.microsoft.com/Forums/en-US/roboticsccr/thread/d75b73b6-7b99-4ffb-a2e0-7aa29f26f1e8/
Harry
+1  A: 

The async WebRequest methods can appear sluggish because they block while performing the DNS lookup, then switch to asynchronous behaviour. Having followed this path myself, it seems inefficient to spin up eight threads to feed requests into an API that already spins up threads to do the bulk of the work. You might reconsider some of your approaches bearing in mind this shortcoming of the async WebRequest API. Our solution eventually involved using the synchronous API, each call on its own thread. I'd be interested in anyone commenting on the correctness of this approach.
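A rough sketch of that synchronous-call-per-thread approach (the URL list is a placeholder, and the thread/connection caps and error handling from the question are left out for brevity):

using System;
using System.IO;
using System.Net;
using System.Threading;

class SyncPerThreadSketch
{
    static void Fetch(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 30000;    // the synchronous API honours Timeout, unlike BeginGetResponse
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            string body = reader.ReadToEnd();
            Console.WriteLine("{0}: {1} chars", url, body.Length);
        }
    }

    static void Main()
    {
        string[] urls = { "http://example.com/a", "http://example.com/b" }; // placeholder list
        Thread[] workers = new Thread[urls.Length];
        for (int i = 0; i < urls.Length; i++)
        {
            string u = urls[i];                        // capture a copy for the lambda
            workers[i] = new Thread(() => Fetch(u));   // one dedicated thread per request
            workers[i].Start();
        }
        foreach (Thread t in workers) t.Join();
    }
}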

spender
+1 for the DNS lookup blocking the thread
Jader Dias
your approach is correct but I don't know if it will consume more CPU resources than the async way.
Jader Dias
"You'll only need one thread to do this." Good point!
Jader Dias
+2  A: 

Here's what I don't get: you say max 50 connections, but only 8 threads. Each connection by definition "occupies" / runs in a thread. I mean, you're not using DMA or any sort of other magic to take the load off the CPU, so each transfer needs an execution context. If you can launch 50 async requests at once, fine, great, do that -- you should be able to launch them all from the same thread, since calling an async read function takes essentially no time at all. If you e.g. have 8 cores and want to make sure an entire core is dedicated to each transfer (that would probably be dumb, but it's your code, so...), you can only run 8 transfers at once.

My suggestion is to just launch 50 async requests, inside a sync block so that they all start before you allow any of them to complete (simplifies the math). Then, use a counting semaphore as suggested by Jeremy or a synchronized Queue as suggested by mbeckish to keep track of the work remaining. At the end of your async-complete callback, launch the next connection (if appropriate). That is, start 50 connections, then when one finishes, use the "completed" event handler to launch the next one, until all the work is done. This shouldn't need any kind of additional libraries or frameworks.
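A rough sketch of that idea with HttpWebRequest.BeginGetResponse: all URLs go into a synchronized queue, the first 50 requests are launched up front, and each completion callback launches the next one. Timeout handling (e.g. via request.Abort) is deliberately left out here.

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Threading;

class ChainedAsyncSketch
{
    static readonly Queue<string> Pending = new Queue<string>();
    static readonly object QueueLock = new object();
    static readonly ManualResetEvent AllDone = new ManualResetEvent(false);
    static int inFlight;

    static void Main()
    {
        ServicePointManager.DefaultConnectionLimit = 50;   // the default is only 2 per host
        for (int i = 1; i <= 1000; i++)
            Pending.Enqueue("http://example.com/page" + i);

        for (int i = 0; i < 50; i++)   // start the first 50; completions start the rest
            StartNext();
        AllDone.WaitOne();
    }

    static void StartNext()
    {
        string url;
        lock (QueueLock)
        {
            if (Pending.Count == 0)
            {
                if (inFlight == 0) AllDone.Set();          // queue drained and nothing running
                return;
            }
            url = Pending.Dequeue();
            inFlight++;
        }
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.BeginGetResponse(OnResponse, request);
    }

    static void OnResponse(IAsyncResult ar)
    {
        HttpWebRequest request = (HttpWebRequest)ar.AsyncState;
        try
        {
            using (WebResponse response = request.EndGetResponse(ar))
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                reader.ReadToEnd();                        // process the page here
            }
        }
        catch (WebException) { /* log or retry as needed */ }
        finally
        {
            lock (QueueLock) { inFlight--; }
            StartNext();                                   // one finished, so launch the next
        }
    }
}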

Coderer
It sounded easy the first time I tried it. And I have been trying for at least two months. My code has bugs I can't debug, and that's why I'm calling for help. Maybe some library has a more robust way to handle this.
Jader Dias
Could you describe the bugs you're getting, or at least elaborate what you've tried and failed? I haven't tried the *scale* that you're trying, but I've done bits and pieces of the same thing, which is where my suggestion came from. You might get better help with more information.
Coderer
I abandoned the code a month ago, but soon I'll work on it again and then I'll explain my bugs better.
Jader Dias
+1  A: 

This is how you'd do it with the base class library in .NET 3.5. The call to SetMinThreads is optional - see what happens with and without it.

You should handle timeouts within your replacement for DoSomethingThatsSlow.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public class ThrottledParallelRunnerTest
{
    public static void Main()
    {
        //since the process is just starting up, we need to boost this
        ThreadPool.SetMinThreads(10, 10);

        IEnumerable<string> args = from i in Enumerable.Range(1, 100)
                                   select "task #" + i;
        ThrottledParallelRun(DoSomethingThatsSlow, args, 8);
    }

    public static void DoSomethingThatsSlow(string urlOrWhatever)
    {
        Console.Out.WriteLine("{1}: began {0}", urlOrWhatever, DateTime.Now.Ticks);
        Thread.Sleep(500);
        Console.Out.WriteLine("{1}: ended {0}", urlOrWhatever, DateTime.Now.Ticks);
    }

    private static void ThrottledParallelRun<T>(Action<T> action, IEnumerable<T> args, int maxThreads)
    {
        //this thing looks after the throttling
        Semaphore semaphore = new Semaphore(maxThreads, maxThreads);

        //wrap the action in a try/finally that releases the semaphore
        Action<T> releasingAction = a =>
                                        {
                                            try
                                            {
                                                action(a);
                                            }
                                            finally
                                            {
                                                semaphore.Release();
                                            }
                                        };

        //store all the IAsyncResult - will help prevent method from returning before completion
        List<IAsyncResult> results = new List<IAsyncResult>();
        foreach (T a in args)
        {
            semaphore.WaitOne();
            results.Add(releasingAction.BeginInvoke(a, null, null));
        }

        //now let's make sure everything's returned. Maybe collate exceptions here?
        foreach (IAsyncResult result in results)
        {
            releasingAction.EndInvoke(result);
        }
    }
}
Rob Fonseca-Ensor
Don't throttle the master thread-pool... you'll cripple anything else running in the runtime. Very very bad idea. I've seen apps deadlock because of this.
Marc Gravell
Hi Marc, that's why I left the WaitOne call outside of the releasingAction - this method will use at most "maxThreads" threads (plus the thread that calls the method). Put some more Console messages in there and run it if you don't believe me.
Rob Fonseca-Ensor
+1  A: 

You should take a look at F# asynchronous workflows.

You really don't want your code to be parallel, but asynchronous.

Asynchronous refers to programs that perform some long-running operations that don't necessarily block the calling thread, for example accessing the network, calling web services, or performing any other I/O operation in general.

This is a very interesting article on this concept explained using C# iterators.

This is a great book about F# and asynchronous programming.

The learning curve is steep (a lot of odd stuff: F# syntax, the Async<'a> type, monads, etc.), but it is a VERY powerful approach and can be used in real life with great C# interop.

The main idea here is continuations: while you're waiting for some I/O operation, let your threads do something else!

Luca Martinetti