+4  Q: 

Multiple Threads

I post a lot here about multithreading, and the great Stack Overflow community has helped me a lot in understanding it.

All the examples I have seen online only deal with one thread.

My application is a scraper for an insurance company (a family company, so all free of charge). The user can select how many threads to run. For example, the user might have the application scrape 5 sites at a time, and later in the day choose 20 threads because the computer isn't doing anything else and has resources to spare.

Basically the application builds a list of say 1000 sites to scrape. A thread goes off and does that and updates the UI and builds the list.

When that's finished, another thread is called to start the scraping. It creates as many threads as the user has configured.

What's the best way to create these threads? Should I create 1000 threads in a list and loop through them? If the user has set 5 threads to run, it would loop through them 5 at a time.

I understand threading, but it's the application logic which is catching me out.

Any ideas or resources on the web that can help me out?

+3  A: 

You could consider using a thread pool for that:

using System;
using System.Threading;

public class Example
{
    public static void Main()
    {
        ThreadPool.SetMaxThreads(100, 10);

        // Queue the task.
        ThreadPool.QueueUserWorkItem(new WaitCallback(ThreadProc));

        Console.WriteLine("Main thread does some work, then sleeps.");

        Thread.Sleep(1000);

        Console.WriteLine("Main thread exits.");
    }

    // This thread procedure performs the task.
    static void ThreadProc(Object stateInfo)
    {
        Console.WriteLine("Hello from the thread pool.");
    }
}
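Applied to the scraper, the same idea might look like the sketch below: one work item is queued per site, and a `CountdownEvent` lets the caller wait until all of them have run. The names `PoolScrapeSketch`, `ProcessAll`, and the `scrape` delegate are hypothetical, not part of the question. Note that `ThreadPool.SetMaxThreads` refuses values below the number of processors, so the pool may not be able to honor a small user-chosen thread count.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class PoolScrapeSketch
{
    // Queues one pool work item per site and blocks until every item has run.
    // Returns how many sites were scraped without throwing.
    public static int ProcessAll(IList<string> sites, Action<string> scrape)
    {
        if (sites.Count == 0) return 0;

        int processed = 0;
        using (var done = new CountdownEvent(sites.Count))
        {
            foreach (var site in sites)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try
                    {
                        scrape((string)state);
                        Interlocked.Increment(ref processed);
                    }
                    catch (Exception)
                    {
                        // Assumption: a real scraper would record the failure.
                        // Swallowed here so one bad site cannot crash the process.
                    }
                    finally
                    {
                        done.Signal(); // always count the item as finished
                    }
                }, site);
            }
            done.Wait(); // block until all queued items have signalled
        }
        return processed;
    }
}
```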
John Gietzen
A: 

You might want to take a look at the ProcessQueue article on CodeProject.

Essentially, you'll want to create (and start) the appropriate number of threads; in your case that number comes from the user. Each of these threads should process a site, then find the next site that needs processing. Even if you don't use the object itself (it sounds like it would suit your purposes pretty well, though I'm obviously biased!), it should give you good insight into how this sort of thing is done.

Adam Robinson
A: 

The basic logic is:

Put the URLs to scrape into a single queue object that every thread has access to, then create your threads. Each thread runs this loop:

  1. lock the queue
  2. check if there are items in the queue, if not, unlock queue and end thread
  3. dequeue first item in the queue
  4. unlock queue
  5. process item
  6. invoke an event that updates the UI (remember that UI controls must be updated on the UI thread, e.g. via Control.Invoke)
  7. return to step 1
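The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name `QueueScraperSketch` and the `processItem` and `reportProgress` delegates are hypothetical placeholders for the actual scraping and UI-update code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class QueueScraperSketch
{
    private readonly Queue<string> queue;
    private readonly object queueLock = new object();
    private readonly Action<string> processItem;     // hypothetical: scrapes one URL
    private readonly Action<string> reportProgress;  // hypothetical: raises the UI update

    public QueueScraperSketch(IEnumerable<string> urls,
                              Action<string> processItem,
                              Action<string> reportProgress)
    {
        queue = new Queue<string>(urls);
        this.processItem = processItem;
        this.reportProgress = reportProgress;
    }

    // Starts the requested number of workers and waits until the queue is drained.
    public void Run(int threadCount)
    {
        var threads = new List<Thread>();
        for (int i = 0; i < threadCount; i++)
        {
            var thread = new Thread(WorkerLoop) { IsBackground = true };
            threads.Add(thread);
            thread.Start();
        }
        foreach (var thread in threads)
        {
            thread.Join();
        }
    }

    private void WorkerLoop()
    {
        while (true)
        {
            string url;
            lock (queueLock)                      // steps 1 and 4: lock/unlock the queue
            {
                if (queue.Count == 0) return;     // step 2: nothing left, end the thread
                url = queue.Dequeue();            // step 3: take the first item
            }
            processItem(url);                     // step 5: work outside the lock
            reportProgress(url);                  // step 6: caller handles UI-thread rules
        }                                         // step 7: loop again
    }
}
```

The lock is held only while touching the queue, so slow downloads in one thread never block the others from picking up work.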

Just let the Threads do the "get stuff from the queue" part (pulling the jobs) instead of giving them the urls (pushing the jobs), that way you just say

YourThreadManager.StartThreads(numberOfThreadsTheUserWants);

and everything else happens automagically. See the other replies to find out how to create and manage the threads.

dbemerlin
+1  A: 

I think this example is basically what you need.

using System;
using System.Collections.Generic;

public class WebScraper
{
    private readonly int totalThreads;
    private readonly List<System.Threading.Thread> threads;
    private readonly List<Exception> exceptions;
    private readonly object locker = new object();
    private volatile bool stop;

    public WebScraper(int totalThreads)
    {
        this.totalThreads = totalThreads;
        threads = new List<System.Threading.Thread>(totalThreads);
        exceptions = new List<Exception>();

        for (int i = 0; i < totalThreads; i++)
        {
            var thread = new System.Threading.Thread(Execute);
            thread.IsBackground = true; 
            threads.Add(thread);
        }
    }

    public void Start()
    {
        foreach (var thread in threads)
        {
            thread.Start();
        }
    }

    public void Stop()
    {
        stop = true;
        foreach (var thread in threads)
        {
            if (thread.IsAlive)
            {
                thread.Join();                      
            }
        }
    }

    private void Execute()
    {
        try
        {
            while (!stop)
            {
                // Scrape away!
            }
        }
        catch (Exception ex)
        {
            lock (locker)
            {
                // You could have a thread checking this collection and
                // reporting it as you see fit.
                exceptions.Add(ex);
            }
        }
    }
}
ChaosPandion
What does thread priority have to do with whether the thread keeps the application alive? You need to set whether the thread is a background thread or not (with the Thread.IsBackground property).
Oliver Hanappi
Yeah, I'm +0 here because of that.
John Gietzen
@Oliver, @John: Thanks guys, this is a case of bad information not being verified. :(
ChaosPandion
+2  A: 

This scraper, does it use a lot of CPU when it's running?

If it does a lot of communication with these 1000 remote sites, downloading their pages, that may be taking more time than the actual analysis of the pages.

And how many CPU cores does your user have? If they have 2 (which is common these days), then beyond two simultaneous threads performing analysis, they aren't going to see any speedup.

So you probably need to "parallelize" the downloading of the pages. I doubt you need to do the same for the analysis of the pages.

Take a look into asynchronous IO, instead of explicit multi-threading. It lets you launch a bunch of downloads in parallel and then get called back when each one completes.
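As a rough sketch of that idea, assuming a .NET version with `async`/`await` (newer than what the question may have been using): the downloads are started without dedicating a thread to each one, and a `SemaphoreSlim` caps how many are in flight at once. The names `AsyncDownloadSketch`, `DownloadAllAsync`, and the `downloadAsync` delegate are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncDownloadSketch
{
    // Downloads every URL with at most maxConcurrent requests in flight.
    // Results come back in the same order as the input URLs.
    public static async Task<string[]> DownloadAllAsync(
        IEnumerable<string> urls,
        int maxConcurrent,
        Func<string, Task<string>> downloadAsync)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();          // wait for a free slot
                try
                {
                    return await downloadAsync(url);
                }
                finally
                {
                    gate.Release();              // free the slot for the next URL
                }
            }).ToList();                         // ToList starts all the tasks

            return await Task.WhenAll(tasks);
        }
    }
}
```

With `HttpClient`, the delegate could be something like `url => client.GetStringAsync(url)`; the analysis of each page can then run as the results come back.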

Daniel Earwicker
There will be huge speedups using more threads than cores if network IO is what takes time though.
leeeroy
@leeeroy - why?
Daniel Earwicker
Network IO is slow, so threads spend a lot of time waiting for a response to a request. That time can be used by other threads to send other requests or process results. For IO-heavy applications (especially networking), multithreading is the biggest speedup you can gain: it can take up to a second to get a web page, a second that another thread could have used to process 10 responses from fast pages.
dbemerlin
@dbemerlin - check out the last paragraph of my answer.
Daniel Earwicker
A: 

If you really just want the application, use something someone else already spent time developing and perfecting:

http://arachnode.net/

arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages.

Whether interested or involved in screen scraping, data mining, text mining, research or any other application where a high-performance crawling application is key to the success of your endeavors, arachnode.net provides the solution you need for success.

If you also want to write one yourself because it's a fun thing to write (I wrote one not long ago, and yes, it is a lot of fun), you can refer to this PDF provided by arachnode.net, which explains in detail the theory behind a good web crawler:

http://arachnode.net/media/Default.aspx?Sort=Downloads&PageIndex=1

Download the pdf entitled: "Crawling the Web" (second link from top). Scroll to Section 2.6 entitled: "2.6 Multi-threaded Crawlers". That's what I used to build my crawler, and I must say, I think it works quite well.

BFree
A: 

I solved a similar problem by creating a worker class that uses a callback to signal the main app that a worker is done. Then I create a queue of 1000 threads and then call a method that launches threads until the running thread limit is reached, keeping track of the active threads with a dictionary keyed by the thread's ManagedThreadId. As each thread completes, the callback removes its thread from the dictionary and calls the thread launcher.

If a connection is dropped or times out, the callback reinserts the thread into the queue. Put locks around accesses to the queue and the dictionary. I create threads rather than using the thread pool because the overhead of creating a thread is insignificant compared to the connection time, and it allows me to have a lot more threads in flight. The callback also provides a convenient place to update the user interface, and even lets you change the thread limit while it's running. I've had over 50 open connections at one time. Remember to increase the maxconnection setting in your app.config (the default is two).
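For reference, the connection limit in app.config lives under `system.net`/`connectionManagement`. A minimal fragment might look like this; the value 50 and the wildcard address are only illustrative assumptions:

```xml
<configuration>
  <system.net>
    <connectionManagement>
      <!-- Allow up to 50 concurrent connections per host (illustrative value) -->
      <add address="*" maxconnection="50" />
    </connectionManagement>
  </system.net>
</configuration>
```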

ebpower
A: 

Consider using the event-based asynchronous pattern (the AsyncOperation and AsyncOperationManager classes).

Islam Ibrahim
A: 

I would use a queue and a condition variable and mutex, and start just the requested number of threads, for example, 5 or 20 (and not start 1,000).

Each thread blocks on the condition variable. When woken up, it dequeues the first item, unlocks the queue, works with the item, locks the queue and checks for more items. If the queue is empty, sleep on the condition variable. If not, unlock, work, repeat.

While the mutex is locked, it can also check if the user has requested the count of threads to be reduced. Just check if count > max_count, and if so, the thread terminates itself.

Any time you have more sites to queue, just lock the mutex and add them to the queue, then broadcast on the condition variable. Any threads that are not already working will wake up and take new work.

Any time the user increases the requested thread count, just start them up and they will lock the queue, check for work, and either sleep on the condition variable or get going.

Each thread will be continually pulling more work from the queue, or sleeping. You don't need more than 5 or 20.
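In C#, the mutex and condition variable described here can both be played by a single lock object with `Monitor.Wait` and `Monitor.PulseAll`. The following is a minimal sketch under that assumption; `BlockingWorkQueueSketch`, `SetThreadCount`, and the `work` delegate are hypothetical names, and error handling inside `work` is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class BlockingWorkQueueSketch
{
    private readonly Queue<string> queue = new Queue<string>();
    private readonly object gate = new object(); // mutex and condition variable in one
    private readonly Action<string> work;        // hypothetical: scrapes one site
    private int threadCount;                     // workers currently alive
    private int maxCount;                        // count requested by the user

    public BlockingWorkQueueSketch(Action<string> work)
    {
        this.work = work;
    }

    // Adds sites and broadcasts so that sleeping workers wake up.
    public void Enqueue(IEnumerable<string> sites)
    {
        lock (gate)
        {
            foreach (var site in sites) queue.Enqueue(site);
            Monitor.PulseAll(gate);
        }
    }

    // Raises or lowers the worker count while the scraper is running.
    public void SetThreadCount(int requested)
    {
        lock (gate)
        {
            maxCount = requested;
            while (threadCount < maxCount)
            {
                threadCount++;
                new Thread(WorkerLoop) { IsBackground = true }.Start();
            }
            Monitor.PulseAll(gate); // wake sleepers so surplus workers can exit
        }
    }

    private void WorkerLoop()
    {
        while (true)
        {
            string site;
            lock (gate)
            {
                while (true)
                {
                    // Too many workers for the current setting: this one terminates.
                    if (threadCount > maxCount) { threadCount--; return; }
                    if (queue.Count > 0) break;
                    Monitor.Wait(gate); // releases the lock while sleeping
                }
                site = queue.Dequeue();
            }
            work(site); // do the slow part outside the lock
        }
    }
}
```

Calling `SetThreadCount(5)` and later `SetThreadCount(20)` only ever starts the requested number of threads, which matches the point that 1,000 threads are never needed.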

Jim Flood