views: 130
answers: 3

I am writing a crawler in C# that starts with a set of known URLs in a file. I want to pull the pages down asynchronously. My question is: what is the best pattern for this? For example, should I read the file into a List/Array of URLs and create an array to store completed URLs? Should I create a two-dimensional array to track thread status and completion? Some other considerations are retries (if the first request is slow or dead) and automatic restarts (after an app/system crash).

+1  A: 
using System;
using System.IO;
using System.Net;

foreach (var url in File.ReadAllLines("urls.txt"))
{
    var client = new WebClient();
    client.DownloadStringCompleted += (sender, e) =>
    {
        if (e.Error == null)
        {
            // e.Result contains the downloaded HTML
        }
        else
        {
            // some error occurred: inspect the e.Error property
        }
    };
    client.DownloadStringAsync(new Uri(url));
}
Darin Dimitrov
Ouch. Firing all requests off in one go isn't going to scale very well. With a sufficiently large list there will be several points of failure, including timeouts at this end, routers getting overloaded, port exhaustion, and so on. The requests will *need* scheduling to go beyond a toy crawler.
spender
Then just use a ThreadPool.
Jan Jongboom
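
A rough sketch of the kind of throttling spender is asking for, in the ThreadPool direction Jan mentions: synchronous downloads are queued to the ThreadPool while a Semaphore caps how many run at once. The ThrottledCrawler class, the limit of 10, the error handling, and the CountdownEvent used to keep Main alive are illustrative assumptions, not anything from this thread (urls.txt just follows Darin's snippet).

using System;
using System.IO;
using System.Net;
using System.Threading;

class ThrottledCrawler
{
    static void Main()
    {
        string[] urls = File.ReadAllLines("urls.txt");
        var throttle = new Semaphore(10, 10);          // at most 10 requests in flight
        var allDone = new CountdownEvent(urls.Length); // signals when every URL has been handled

        foreach (string url in urls)
        {
            throttle.WaitOne();                        // block until a slot is free
            ThreadPool.QueueUserWorkItem(state =>
            {
                var u = (string)state;
                try
                {
                    using (var client = new WebClient())
                    {
                        string html = client.DownloadString(u);
                        // TODO: persist html and mark u as crawled
                    }
                }
                catch (WebException ex)
                {
                    // TODO: record the failure so u can be retried later
                    Console.Error.WriteLine("{0}: {1}", u, ex.Status);
                }
                finally
                {
                    throttle.Release();                // free the slot
                    allDone.Signal();
                }
            }, url);
        }

        allDone.Wait();                                // keep Main alive until the batch finishes
    }
}
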
A: 

Here is my opinion on storing the data.

I would suggest using a relational database to store the page list, because it will make these tasks easier:

  • retrieving the pages to crawl (basically the N pages with the oldest LastSuccessfulCrawlDate; see the sketch after the table definition below)
  • adding the newly discovered pages
  • marking pages as crawled (setting the LastSuccessfulCrawlDate)
  • in case of a program crash, your data is already safe
  • you could add columns to store the number of retries, so you can automatically discard pages that failed more than N times ...

An example of a relational model would be:

//this would contain all the crawled pages
table Pages {
    Id bigint,
    Url nvarchar(2000),
    Created DateTime,
    LastSuccessfulCrawlDate DateTime,
    NumberOfRetry int,      //increment this on each failure; if it reaches 10, set Ignored to True
    Title nvarchar(200),    //this is where you would put the page title
    Content nvarchar(max),  //this is where you would put the html
    Ignored bool            //set it to True to ignore this page
}
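
As a rough illustration of the first and third list items above (fetching the N stalest pages and stamping a successful crawl), here is a sketch using ADO.NET. It assumes SQL Server / System.Data.SqlClient syntax to match the T-SQL-looking model above; the PageStore class, the method names, the connection string, and the batch-size parameter are mine, and the SQL would need adapting (e.g. TOP vs. LIMIT) for other databases.

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class PageStore
{
    readonly string connectionString;

    public PageStore(string connectionString)
    {
        this.connectionString = connectionString;
    }

    // "retrieving the pages to crawl": the N pages with the oldest LastSuccessfulCrawlDate
    public List<KeyValuePair<long, string>> GetPagesToCrawl(int batchSize)
    {
        var pages = new List<KeyValuePair<long, string>>();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            @"SELECT TOP (@batchSize) Id, Url
              FROM Pages
              WHERE Ignored = 0
              ORDER BY LastSuccessfulCrawlDate", connection))
        {
            command.Parameters.AddWithValue("@batchSize", batchSize);
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    pages.Add(new KeyValuePair<long, string>(reader.GetInt64(0), reader.GetString(1)));
            }
        }
        return pages;
    }

    // "marking pages as crawled": stamp the date and store the downloaded content
    public void MarkCrawled(long id, string title, string content)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            @"UPDATE Pages
              SET LastSuccessfulCrawlDate = GETUTCDATE(), Title = @title, Content = @content
              WHERE Id = @id", connection))
        {
            command.Parameters.AddWithValue("@id", id);
            command.Parameters.AddWithValue("@title", title);
            command.Parameters.AddWithValue("@content", content);
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}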

You could also handle referrers with a table with this structure:

//this links a parent page to the pages it references
table Referer {
    ParentId bigint,
    ChildId bigint
}

It could allow you to implement your very own Page Rank :p

Manitra Andriamitondra
Storing metadata in the DB probably makes the most sense; the actual HTML content should be stored in the filesystem. I'm using MySQL.
traderde
+1  A: 

I recommend that you pull from a Queue and fetch each URL in a separate thread, peeling URLs off the Queue until you reach the maximum number of simultaneous threads you want to allow. Each thread invokes a callback method that reports whether it finished successfully or encountered a problem.

As you start each thread, put its ManagedThreadId into a Dictionary, with the key being the id and the value being the thread's status. The callback method reports the thread's id and completion status. Delete each thread from the Dictionary as it completes and launch the next waiting thread. If a URL didn't finish successfully, add it back to the queue.

The Dictionary's Count property tells you how many threads are in flight, and the callback can also be used to update your UI or check for a pause or halt signal. If you need to persist your results in case of a system crash, then consider using database tables in lieu of memory-resident collections, such as Manitra describes.

This approach has worked very well for me for lots of simultaneous threads.
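
A rough sketch of how this queue-plus-Dictionary pattern might be wired up. The QueueCrawler class, the Fetch/OnCompleted method names, the WebClient download, and the limit of 10 threads are illustrative choices rather than ebpower's code, and the Dictionary here tracks the URL each thread is working on (so failures can be requeued) rather than a status flag.

using System;
using System.Collections.Generic;
using System.Net;
using System.Threading;

class QueueCrawler
{
    const int MaxSimultaneousThreads = 10;   // illustrative limit

    readonly object sync = new object();
    readonly Queue<string> pending = new Queue<string>();
    readonly Dictionary<int, string> inFlight = new Dictionary<int, string>(); // ManagedThreadId -> URL

    public void Run(IEnumerable<string> urls)
    {
        foreach (var url in urls)
            pending.Enqueue(url);

        lock (sync)
        {
            // launch the first batch; Run returns once it is started,
            // so a real app would also wait for the queue to drain
            while (inFlight.Count < MaxSimultaneousThreads && pending.Count > 0)
                LaunchNext();
        }
    }

    // must be called while holding the lock
    void LaunchNext()
    {
        string url = pending.Dequeue();
        var worker = new Thread(() => Fetch(url));
        inFlight[worker.ManagedThreadId] = url;
        worker.Start();
    }

    void Fetch(string url)
    {
        bool succeeded;
        try
        {
            using (var client = new WebClient())
            {
                string html = client.DownloadString(url);
                // TODO: persist html somewhere durable
                succeeded = true;
            }
        }
        catch (WebException)
        {
            succeeded = false;
        }
        OnCompleted(Thread.CurrentThread.ManagedThreadId, url, succeeded);
    }

    // the "callback": report id and status, drop the thread from the Dictionary,
    // requeue failures, and launch the next waiting URL
    void OnCompleted(int threadId, string url, bool succeeded)
    {
        lock (sync)
        {
            inFlight.Remove(threadId);
            if (!succeeded)
                pending.Enqueue(url);        // naive retry; a real crawler would cap retries
            if (pending.Count > 0)
                LaunchNext();
            // inFlight.Count is the number of threads still running (e.g. for UI updates)
        }
    }
}

Usage would be something like new QueueCrawler().Run(File.ReadAllLines("urls.txt")), with the body of Fetch swapped for whatever persistence you settle on.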

ebpower
Thanks, this is the implementation I was pondering. Scheduling and thread/request limits will definitely be needed.
traderde
Glad to help. If this does it for you, then please mark as answered.
ebpower