I'm currently writing a sitemap generator that scrapes a site for URLs and builds an XML sitemap. As most of the time is spent waiting on requests to URIs, I'm using threading, specifically the built-in ThreadPool.

To let the main thread wait for an unknown number of threads to complete, I have implemented the following setup. I don't feel this is a good solution, though. Can any threading gurus point out problems with it, or suggest a better way to implement it?

The EventWaitHandle is set to EventResetMode.ManualReset
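
For reference, the fields used by these snippets look roughly like this (simplified; the exact collection type and regex pattern are not important here):

private EventWaitHandle _eventWaitHandle;            // created with EventResetMode.ManualReset
private WaitCallback _waitCallback;                  // points at CrawlUri below
private List<Uri> _uriCollection = new List<Uri>();  // URIs discovered so far
private Regex _regex;                                // matches candidate URLs in a page
private int _threadCount;                            // number of work items currently running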

Here is the thread method

protected void CrawlUri(object o)
{
    try
    {
        Interlocked.Increment(ref _threadCount);
        Uri uri = (Uri)o;

        foreach (Match match in _regex.Matches(GetWebResponse(uri)))
        {
            Uri newUri = new Uri(uri, match.Value);

            if (!_uriCollection.Contains(newUri))
            {
                _uriCollection.Add(newUri);
                ThreadPool.QueueUserWorkItem(_waitCallback, newUri);
            }
        }
    }
    catch
    {
        // Handle exceptions
    }
    finally
    {
        Interlocked.Decrement(ref _threadCount);
    }

    // If there are no more threads running then signal the waithandle
    if (_threadCount == 0)
        _eventWaitHandle.Set();
}

Here is the main thread method

// Request first page (based on host)
Uri root = new Uri(context.Request.Url.GetLeftPart(UriPartial.Authority));

// Begin threaded crawling of the Uri
ThreadPool.QueueUserWorkItem(_waitCallback, root);
Thread.Sleep(5000); // TEMP SOLUTION: Sleep for 5 seconds
_eventWaitHandle.WaitOne();

// Serve the XML sitemap
context.Response.ContentType = "text/xml";
context.Response.Write(GetXml().OuterXml);

Any ideas are much appreciated :)

A: 

When doing this kind of logic I generally try to make an object representing each asynchronous task and the data it needs to run. I typically add this object to a collection of tasks to be done. The thread pool gets these tasks scheduled, and I let each object remove itself from the "to be done" collection when its task finishes, possibly signalling on the collection itself.

So you're finished when the "to be done" collection is empty; the main thread is probably awoken once by each task that finishes.
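
A rough sketch of what I mean (all names here are just for illustration, not from the question):

using System;
using System.Collections.Generic;
using System.Threading;

public class CrawlTask
{
    private readonly Uri uri;
    private readonly List<CrawlTask> toBeDone;   // the shared "to be done" collection

    public CrawlTask(Uri uri, List<CrawlTask> toBeDone)
    {
        this.uri = uri;
        this.toBeDone = toBeDone;
    }

    // Matches the WaitCallback signature so it can be queued on the ThreadPool.
    public void Run(object state)
    {
        try
        {
            // Fetch this.uri and parse its links; for each new link create a
            // CrawlTask, add it to toBeDone under the lock and queue its Run method.
        }
        finally
        {
            lock (this.toBeDone)
            {
                this.toBeDone.Remove(this);        // the task removes itself when finished
                Monitor.PulseAll(this.toBeDone);   // signal on the collection itself
            }
        }
    }
}

The main thread adds and queues the root task, then waits until the collection drains:

lock (toBeDone)
{
    while (toBeDone.Count > 0)
        Monitor.Wait(toBeDone);   // awoken once by each task that finishes
}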

krosenvold
+1  A: 

Well, first off you can create a ManualResetEvent that starts unset, so you don't have to sleep before waiting on it. Second, you're going to need to put thread synchronization around your Uri collection. You could get a race condition where two threads pass the "this Uri does not exist yet" check and both add duplicates. Another race condition is that two threads could pass the if (_threadCount == 0) check and both set the event.
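
For example, one way to stop two threads from both passing the zero check is to fold the test into the decrement itself, along these lines:

finally
{
    // Decrement and test atomically: only the thread that takes the count
    // to zero signals the wait handle.
    if (Interlocked.Decrement(ref _threadCount) == 0)
        _eventWaitHandle.Set();
}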

Last, you can make the whole thing much more efficient by using the asynchronous BeginGetResponse. Your solution right now keeps a thread around to wait for every request. If you use async methods and callbacks, your program will use less memory (about 1 MB of stack per thread) and won't need to do nearly as many thread context switches.

Here's an example that should illustrate what I'm talking about. Out of curiosity, I did test it out (with a depth limit) and it does work.

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;
using System.Threading;

public class CrawlUriTool
{
    private Regex regex;
    private int pendingRequests;
    private List<Uri> uriCollection;
    private object uriCollectionSync = new object();
    private ManualResetEvent crawlCompletedEvent;

    // The regex that extracts candidate links from a page is supplied by the caller.
    public CrawlUriTool(Regex regex)
    {
        this.regex = regex;
    }
    public List<Uri> CrawlUri(Uri uri)
    {
        this.pendingRequests = 0;
        this.uriCollection = new List<Uri>();
        this.crawlCompletedEvent = new ManualResetEvent(false);
        this.StartUriCrawl(uri);
        this.crawlCompletedEvent.WaitOne();

        return this.uriCollection;
    }

    private void StartUriCrawl(Uri uri)
    {
        Interlocked.Increment(ref this.pendingRequests);

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);

        request.BeginGetResponse(this.UriCrawlCallback, request);
    }

    private void UriCrawlCallback(IAsyncResult asyncResult)
    {
        HttpWebRequest request = asyncResult.AsyncState as HttpWebRequest;

        try
        {
            // Dispose the response so the underlying connection is released promptly.
            using (HttpWebResponse response = (HttpWebResponse)request.EndGetResponse(asyncResult))
            {
                string responseText = this.GetTextFromResponse(response); // not included

                foreach (Match match in this.regex.Matches(responseText))
                {
                    Uri newUri = new Uri(response.ResponseUri, match.Value);

                    lock (this.uriCollectionSync)
                    {
                        if (!this.uriCollection.Contains(newUri))
                        {
                            this.uriCollection.Add(newUri);
                            this.StartUriCrawl(newUri);
                        }
                    }
                }
            }
        }
        catch (WebException exception)
        {
            // handle exception
        }
        finally
        {
            if (Interlocked.Decrement(ref this.pendingRequests) == 0)
            {
                this.crawlCompletedEvent.Set();
            }
        }
    }
}
RandomEngy
This looks very interesting, thank you. I'll hopefully get the chance to try it out over the weekend. I can't believe I forgot about asynchronous requests :o
WDuffy
A: 

You could look into the CTP of the Task Parallel Library, which should make this simpler for you. What you're doing can be divided into "tasks" (chunks or units of work), and the TPL can parallelize this for you if you supply the tasks. It uses a thread pool internally as well, but it's easier to use and comes with a lot of options, like waiting for all tasks to finish. Check out this Channel9 video, where the possibilities are explained and a demo is shown of traversing a tree recursively in parallel, which seems very applicable to your problem.

However, it's still a preview and won't be released until .NET 4.0, so it comes with no warranties. You'll also have to manually include the supplied System.Threading.dll (found in the install folder) in your project, and I don't know if that's an option for you.
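
As a rough idea of the shape, the recursive traversal could look something like this with the Task API (I'm using Task.Factory.StartNew and Task.WaitAll here; the CTP's names may differ slightly, and GetChildUris is just a placeholder for your link extraction):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ParallelCrawl
{
    // Placeholder: fetch the page at 'uri' and return the new links found on it.
    // De-duplication of already-visited URIs is omitted from this sketch.
    static IEnumerable<Uri> GetChildUris(Uri uri) { throw new NotImplementedException(); }

    static void CrawlNode(Uri uri)
    {
        var children = new List<Task>();

        foreach (Uri child in GetChildUris(uri))
        {
            Uri current = child;   // local copy for the closure
            children.Add(Task.Factory.StartNew(() => CrawlNode(current)));
        }

        Task.WaitAll(children.ToArray());   // wait for this whole subtree
    }

    static void Crawl(Uri root)
    {
        Task.Factory.StartNew(() => CrawlNode(root)).Wait();
    }
}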

JulianR
That does look really interesting Julian. I won't be able to use it in this implementation but I'll defo be looking into it. Thanks :)
WDuffy