I'm using Parallel LINQ, and I'm trying to download many URLs concurrently using essentially code like this:

int threads = 10;
Dictionary<string, string> results = urls.AsParallel( threads ).ToDictionary( url => url, url => GetPage( url ) );

Since downloading web pages is network bound rather than CPU bound, using more threads than my number of processors/cores is very beneficial, since most of the time in each thread is spent waiting for the network to catch up. However, judging from the fact that running the above with threads = 2 has the same performance as threads = 10 on my dual core machine, I'm thinking that the thread count passed to AsParallel is capped at the number of cores.

Is there any way to override this behavior? Is there a similar library available that doesn't have this limitation?

(I've found such a library for python, but need something that works in .Net)

+5  A: 

Do the URLs refer to the same server? If so, it could be that you are hitting the HTTP connection limit instead of the threading limit. There's an easy way to tell - change your code to:

int threads = 10;
Dictionary<string, string> results = urls.AsParallel(threads)
    .ToDictionary(url => url, 
                  url => {
                      Console.WriteLine("On thread {0}",
                                        Thread.CurrentThread.ManagedThreadId);
                      return GetPage(url);
                  });

EDIT: Hmm. I can't get ToDictionary() to parallelise at all with a bit of sample code. It works fine for Select(url => GetPage(url)) but not ToDictionary. Will search around a bit.

EDIT: Okay, I still can't get ToDictionary to parallelise, but you can work around that. Here's a short but complete program:

using System;
using System.Collections.Generic;
using System.Threading;
using System.Linq;
using System.Linq.Parallel;

public class Test
{

    static void Main()
    {
        var urls = Enumerable.Range(0, 100).Select(i => i.ToString());

        int threads = 10;
        Dictionary<string, string> results = urls.AsParallel(threads)
            .Select(url => new { Url=url, Page=GetPage(url) })
            .ToDictionary(x => x.Url, x => x.Page);
    }

    static string GetPage(string x)
    {
        Console.WriteLine("On thread {0} getting {1}",
                          Thread.CurrentThread.ManagedThreadId, x);
        Thread.Sleep(2000);
        return x;
    }
}

So, how many threads does this use? 5. Why? Goodness knows. I've got 2 processors, so that's not it - and we've specified 10 threads, so that's not it. It still uses 5 even if I change GetPage to hammer the CPU.
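For anyone reading this against the released .NET 4 version of PLINQ rather than the preview used above: the AsParallel(int) overload is not there, and the degree is requested with WithDegreeOfParallelism instead. Here is a minimal sketch of the same Select/ToDictionary workaround in that form. The class and method names are made up for illustration, GetPage is the same sleeping stub as in the program above, and the degree is an upper bound, so PLINQ may still choose to use fewer threads.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class PlinqDegreeDemo
{
    // Hypothetical helper: fetch every URL via PLINQ, asking for at most 'threads' workers.
    public static Dictionary<string, string> Fetch(
        IEnumerable<string> urls, int threads, Func<string, string> getPage)
    {
        return urls.AsParallel()
            .WithDegreeOfParallelism(threads)   // upper bound, not a guarantee
            .Select(url => new { Url = url, Page = getPage(url) })
            .ToDictionary(x => x.Url, x => x.Page);
    }

    public static void Main()
    {
        var urls = Enumerable.Range(0, 20).Select(i => i.ToString());
        var results = Fetch(urls, 10, GetPage);
        Console.WriteLine("Fetched {0} pages", results.Count);
    }

    // Stand-in for a real download (assumption: same stub idea as above).
    static string GetPage(string x)
    {
        Console.WriteLine("On thread {0} getting {1}",
                          Thread.CurrentThread.ManagedThreadId, x);
        Thread.Sleep(200);
        return x;
    }
}
```

The thread IDs printed by GetPage show how many workers were actually used on a given machine.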

If you only need to use this for one particular task - and you don't mind slightly smelly code - you might be best off implementing it yourself, to be honest.
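As a rough idea of what "implementing it yourself" could look like, here is a minimal sketch that starts a fixed number of dedicated threads and lets each one pull URLs from a shared queue. It assumes .NET 4 or later for the concurrent collections. ManualFanOut and Run are hypothetical names, GetPage is again a sleeping stub, and there is no error handling, so an exception in a worker would take down the process.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class ManualFanOut
{
    // Runs 'work' over 'items' on exactly 'threadCount' dedicated threads.
    public static Dictionary<string, string> Run(
        IEnumerable<string> items, int threadCount, Func<string, string> work)
    {
        var queue = new ConcurrentQueue<string>(items);
        var results = new ConcurrentDictionary<string, string>();

        var workers = Enumerable.Range(0, threadCount)
            .Select(_ => new Thread(() =>
            {
                string item;
                // Each worker keeps taking items until the queue is empty.
                while (queue.TryDequeue(out item))
                {
                    results[item] = work(item);
                }
            }))
            .ToList();

        workers.ForEach(t => t.Start());
        workers.ForEach(t => t.Join());   // wait for every worker to finish

        return results.ToDictionary(kv => kv.Key, kv => kv.Value);
    }

    public static void Main()
    {
        var urls = Enumerable.Range(0, 20).Select(i => i.ToString());
        var results = Run(urls, 10, GetPage);
        Console.WriteLine("Fetched {0} pages", results.Count);
    }

    // Stand-in for a real download (assumption, mirrors the stub above).
    static string GetPage(string x)
    {
        Thread.Sleep(200);
        return x;
    }
}
```

Unlike the PLINQ version, the thread count here is exactly what you ask for, which is the point for network-bound work.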

Jon Skeet
I'm getting the same symptom. I ran your analysis and got only 1 thread. I guess the performance increase from 1 to 2 threads I saw was in my head.
Tristan Havelick
@DrFredEdison: So what happens if you use the Select/ToDictionary form as in the sample instead?
Jon Skeet
I'm seeing pretty much the same result as you. I get about 5 threads used for each test run now. Thanks for getting me this far. I think it will work for what I need at the moment.
Tristan Havelick
A: 

Monitor your network traffic. If the URLs are all on the same domain, the server may be limiting the bandwidth, in which case more connections might not provide any speed-up.

Ben S
+2  A: 

By default, .NET has a limit of 2 concurrent connections to a service endpoint (IP:port). That's why you would not see a difference if all URLs point to one and the same server.

It can be controlled using the ServicePointManager.DefaultConnectionLimit property (ServicePointManager.DefaultPersistentConnectionLimit is the constant that holds the default value of 2).
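A minimal sketch of raising the limit, assuming the classic .NET Framework networking stack (HttpWebRequest/WebClient). The settable property is ServicePointManager.DefaultConnectionLimit. The class name and the value 10 are arbitrary choices for illustration. On newer .NET versions this API is treated as legacy and, as far as I know, does not affect HttpClient, so take this as specific to the older stack.

```csharp
using System;
using System.Net;

public static class ConnectionLimitDemo
{
    // Hypothetical helper: raise the per-endpoint connection limit.
    // Call this once at startup, before any requests are issued.
    public static int Raise(int limit)
    {
        ServicePointManager.DefaultConnectionLimit = limit;
        return ServicePointManager.DefaultConnectionLimit;
    }

    public static void Main()
    {
        // The built-in default for persistent connections is a constant.
        Console.WriteLine("Default persistent limit: {0}",
                          ServicePointManager.DefaultPersistentConnectionLimit);

        // 10 matches the thread count in the question (an arbitrary choice).
        Console.WriteLine("Limit is now: {0}", Raise(10));
    }
}
```

The limit only applies to connections opened after it is set, so it needs to happen before the downloads start.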

Sunny
A: 

Hi, I think there are already good answers to the question, but I'd like to make one important point. Using PLINQ for tasks that are not CPU bound is in principle the wrong design. That's not to say it won't work (it will), but using multiple threads when it is unnecessary can cause trouble.

Unfortunately, there is no good way to solve this problem in C#. In F# you could use asynchronous workflows that run in parallel but don't block the thread when performing asynchronous calls (under the covers, they use the BeginOperation and EndOperation methods). You can find more information here:

The same idea can to some extent be used in C#; it looks a bit weird, but it is more efficient. I wrote an article about that, and there is also a library that should be slightly more evolved than my original idea:
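Separately from the article and library referred to above (this is not their code), here is a rough sketch of the same non-blocking idea using async/await, which was added to C# after this answer was written. It assumes C# 5 / .NET 4.5 or later. AsyncFanOut, RunAsync and FakeGetPageAsync are hypothetical names, the URLs are assumed to be distinct, and the "download" is simulated with Task.Delay so the example stays self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncFanOut
{
    // Starts all downloads without blocking threads, but lets at most
    // 'maxConcurrent' of them be in flight at any one time.
    public static async Task<Dictionary<string, string>> RunAsync(
        IEnumerable<string> urls, int maxConcurrent,
        Func<string, Task<string>> getPageAsync)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();           // wait for a free slot
                try
                {
                    var page = await getPageAsync(url);
                    return new KeyValuePair<string, string>(url, page);
                }
                finally
                {
                    gate.Release();               // free the slot
                }
            }).ToList();

            var pairs = await Task.WhenAll(tasks);
            return pairs.ToDictionary(p => p.Key, p => p.Value);
        }
    }

    public static void Main()
    {
        var urls = Enumerable.Range(0, 20).Select(i => i.ToString());
        var results = RunAsync(urls, 10, FakeGetPageAsync).GetAwaiter().GetResult();
        Console.WriteLine("Fetched {0} pages", results.Count);
    }

    // Stand-in for a real asynchronous download (assumption).
    static async Task<string> FakeGetPageAsync(string x)
    {
        await Task.Delay(200);
        return x;
    }
}
```

No thread sits idle while a request is pending here, so the concurrency limit is about how many requests are outstanding, not how many threads exist.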

Tomas Petricek