Consider this problem: I have a program which should fetch (let's say) 100 records from the database, and then for each one it needs to get updated information from a web service. There are two ways to introduce parallelism in this scenario:

  1. I start each request to the web service on a new thread. The number of simultaneous threads is controlled by some external parameter (or dynamically adjusted somehow).

  2. I create smaller batches (let's say of 10 records each) and launch each batch on a separate thread (so taking our example, 10 threads).

Which is a better approach, and why do you think so?

A: 

Dynamic/configurable, since the optimum number depends entirely on the environment and on what the bottleneck actually is.

Stu
+3  A: 

This sounds like a job for a ThreadPool. Just queue up the jobs, and let .net handle the rest.
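A minimal sketch of that approach; the record IDs and the web-service call are placeholders, and a CountdownEvent (which requires .NET 4; on 2.0/3.5 you could count down with Interlocked and a ManualResetEvent instead) is used to wait for all queued items:

```csharp
using System;
using System.Threading;

class PoolExample
{
    public static int Processed;

    public static void Main()
    {
        // placeholder for the 100 records fetched from the database
        int[] recordIds = { 1, 2, 3, 4, 5 };

        // CountdownEvent lets us wait until every queued item has run
        using (var done = new CountdownEvent(recordIds.Length))
        {
            foreach (int id in recordIds)
            {
                // pass the id as the state object; the pool decides
                // how many threads actually run at once
                ThreadPool.QueueUserWorkItem(state =>
                {
                    int recordId = (int)state;
                    // the web-service call for recordId would go here
                    Interlocked.Increment(ref Processed);
                    done.Signal();
                }, id);
            }
            done.Wait(); // block until every queued item has finished
        }
        Console.WriteLine(Processed + " records processed");
    }
}
```

The nice part is that the pool already throttles itself, so you don't pick a thread count at all.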

Patrick
A: 

This sounds like a job for a ThreadPool. Just queue up the jobs, and let .net handle the rest.

Well, I was thinking of the ThreadPool in terms of dynamically controlling the number of threads. But I guess I am trying to figure out if there is any performance difference between the two approaches (ThreadPool could be used in both, actually).

And if not performance, is there any best practice one should follow?

Vaibhav
+2  A: 

Two things to consider.

1. How long will it take to process a record?

If record processing is very quick, the overhead of handing off records to threads can become a bottleneck. In this case, you would want to bundle records so that you don't have to hand them off so often.

If record processing is reasonably long-running, the difference will be negligible, so the simpler approach (1 record per thread) is probably the best.

2. How many threads are you planning on starting?

If you aren't using a threadpool, I think you either need to manually limit the number of threads, or you need to break the data into big chunks. Starting a new thread for every record will leave your system thrashing if the number of records gets large.
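The "bundle records" variant (option 2) can be sketched like this; the record IDs, the batch size of 10, and the processing body are all placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class Batcher
{
    // split the records into fixed-size chunks so each thread is
    // handed a whole batch instead of a single record
    public static List<List<int>> Chunk(List<int> records, int batchSize)
    {
        var batches = new List<List<int>>();
        for (int i = 0; i < records.Count; i += batchSize)
            batches.Add(records.GetRange(i, Math.Min(batchSize, records.Count - i)));
        return batches;
    }

    public static void Main()
    {
        var records = new List<int>();
        for (int i = 1; i <= 100; i++) records.Add(i);

        var threads = new List<Thread>();
        foreach (List<int> batch in Chunk(records, 10))
        {
            List<int> myBatch = batch; // capture a fresh variable per iteration
            var t = new Thread(() =>
            {
                foreach (int record in myBatch)
                {
                    // process one record (the web-service call would go here)
                }
            });
            threads.Add(t);
            t.Start();
        }
        threads.ForEach(t => t.Join()); // 10 threads, 10 records each
    }
}
```

One thread handoff per batch instead of one per record is exactly the overhead saving described above.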

Derek Park
+1  A: 

Two things to consider.

Yes, those are useful considerations. Since this is calling a public web service, I guess we might want to run some tests to see if the overhead is more than the job itself (I doubt it).

And yes, the use of ThreadPool is something that we would have considered definitely.

Vaibhav
A: 

The computer running the program is probably not the bottleneck, so: remember that the HTTP protocol has a keep-alive header that lets you send several GET requests over the same socket, which saves you the TCP/IP handshake. Unfortunately I don't know how to use that in the .net libraries. (Should be possible.)

There will probably also be a delay in answering your requests. You could try making sure that you always have a given number of outstanding requests to the server.
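A sketch of that "fixed number of outstanding requests" idea, using a Semaphore as the throttle. The actual HttpWebRequest is left as a comment (KeepAlive defaults to true on it, so pooled connections are reused), and ServicePointManager.DefaultConnectionLimit is raised because .NET defaults to 2 connections per host:

```csharp
using System;
using System.Net;
using System.Threading;

class Throttled
{
    public static int MaxObserved;
    static int inFlight;

    public static void Main()
    {
        // raise the default 2-connections-per-host limit to match the throttle
        ServicePointManager.DefaultConnectionLimit = 10;

        var gate = new Semaphore(10, 10); // at most 10 outstanding requests
        using (var done = new CountdownEvent(100))
        {
            for (int i = 0; i < 100; i++)
            {
                gate.WaitOne(); // block until a request slot is free
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    int now = Interlocked.Increment(ref inFlight);
                    // record the high-water mark of concurrent requests
                    int seen;
                    do { seen = MaxObserved; }
                    while (now > seen &&
                           Interlocked.CompareExchange(ref MaxObserved, now, seen) != seen);

                    // the actual HttpWebRequest/GetResponse would go here
                    Thread.Sleep(10); // stand-in for server latency

                    Interlocked.Decrement(ref inFlight);
                    gate.Release(); // free the slot for the next request
                    done.Signal();
                });
            }
            done.Wait();
        }
        Console.WriteLine("Max outstanding: " + MaxObserved);
    }
}
```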

Hugo
A: 

Get the Parallel FX library and look at BlockingCollection. Use one thread to feed it batches of records, and 1 to n threads pulling records off the collection to service them. You can control the rate at which the collection is fed and the number of threads that call the web services. Make it configurable via a ConfigSection, and make it generic by feeding the collection Action delegates, and you'll have a nice little batcher you can reuse to your heart's content.
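A sketch of that shape, assuming the System.Collections.Concurrent version of BlockingCollection that shipped with .NET 4 (Parallel FX was the CTP of it); the web-service call inside each Action is a placeholder:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

class Pipeline
{
    public static int Processed;

    public static void Main()
    {
        // bounded so the producer can't run away from the consumers
        var queue = new BlockingCollection<Action>(100);

        var consumers = new Thread[4]; // the "1 to n" service threads
        for (int i = 0; i < consumers.Length; i++)
        {
            consumers[i] = new Thread(() =>
            {
                // blocks until work arrives; exits after CompleteAdding
                foreach (Action work in queue.GetConsumingEnumerable())
                    work();
            });
            consumers[i].Start();
        }

        // producer: feed the collection Action delegates, one per record
        for (int id = 1; id <= 100; id++)
        {
            int recordId = id;
            queue.Add(() =>
            {
                // the web-service call for recordId would go here
                Interlocked.Increment(ref Processed);
            });
        }
        queue.CompleteAdding(); // tell the consumers to drain and stop

        foreach (Thread t in consumers) t.Join();
        Console.WriteLine("Processed " + Processed + " records");
    }
}
```

The bounded capacity is what gives you the feed-rate control mentioned above.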

Will
+5  A: 

Option 3 is the best:

Use Async IO.

Unless your request processing is complex and heavy, your program is going to spend 99% of its time waiting for the HTTP requests.

This is exactly what Async IO is designed for - Let the windows networking stack (or .net framework or whatever) worry about all the waiting, and just use a single thread to dispatch and 'pick up' the results.

Unfortunately the .NET framework makes it a right pain in the ass. It's easier if you're just using raw sockets or the Win32 api. Here's a (tested!) example using C#3 anyway:

using System; // Console
using System.Diagnostics; // Debug.Assert
using System.Net; // WebRequest and friends

// need to declare a class so we can cast our state object back out
class RequestState
{
 public WebRequest Request { get; set; }
}

static void Main( string[] args )
{
 // stupid cast necessary to create the request
 HttpWebRequest request = WebRequest.Create( "http://www.stackoverflow.com" ) as HttpWebRequest;

 request.BeginGetResponse(
  (asyncResult) => { /* callback to be invoked when finished */
   var state = (RequestState)asyncResult.AsyncState; // fetch the request object out of the AsyncState
   var webResponse = state.Request.EndGetResponse( asyncResult ) as HttpWebResponse;

   Debug.Assert( webResponse.StatusCode == HttpStatusCode.OK ); // there we go;

   Console.WriteLine( "Got Response from server:" + webResponse.Server );
  },
  new RequestState { Request = request } /* pass the request through to our callback */ );

 // blah
 Console.WriteLine( "Waiting for response. Press a key to quit" );
 Console.ReadKey();
}

EDIT:

In the case of .NET, the 'completion callback' actually gets fired in a ThreadPool thread, not in your main thread, so you will still need to lock any shared resources, but it still saves you all the trouble of managing threads.
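For example, if each callback appends its result to a shared list, a lock around the list is enough. This sketch simulates the completion callbacks with ThreadPool work items (CountdownEvent requires .NET 4):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class SharedResults
{
    static readonly object ResultsLock = new object();
    public static readonly List<int> Results = new List<int>();

    public static void Main()
    {
        using (var done = new CountdownEvent(10))
        {
            for (int i = 0; i < 10; i++)
            {
                // each work item stands in for one completion callback
                ThreadPool.QueueUserWorkItem(state =>
                {
                    lock (ResultsLock) // guard the shared list
                    {
                        Results.Add((int)state);
                    }
                    done.Signal();
                }, i);
            }
            done.Wait();
        }
        Console.WriteLine(Results.Count + " results collected");
    }
}
```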

Orion Edwards
Do you really need to pass the request using the state object, or can you use the request as a closure-bound variable?
zvikara
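For what it's worth, the lambda can capture request directly, so the state object isn't strictly needed. An untested variant of the example above, rewritten with a closure:

```csharp
using System;
using System.Diagnostics;
using System.Net;

class ClosureVersion
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create( "http://www.stackoverflow.com" );

        // the lambda captures 'request', so no RequestState class is needed
        request.BeginGetResponse(
            asyncResult =>
            {
                var webResponse = (HttpWebResponse)request.EndGetResponse( asyncResult );
                Debug.Assert( webResponse.StatusCode == HttpStatusCode.OK );
                Console.WriteLine( "Got Response from server:" + webResponse.Server );
            },
            null /* nothing to pass through any more */ );

        Console.WriteLine( "Waiting for response. Press a key to quit" );
        Console.ReadKey();
    }
}
```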