views: 130
answers: 3

I am writing a crawler in C# that starts with a set of known URLs in a file. I want to pull the pages down asynchronously. My question is: what is the best pattern for this? For example, should I read the file into a List/Array of URLs and create an array to store completed URLs? Should I create a two-dimensional array to track thread status and completion? Some other considerations are retries (if the first request is slow or dead) and automatic restarts (after an app/system crash).

+1  A: 
using System;
using System.IO;
using System.Net;

foreach (var url in File.ReadAllLines("urls.txt"))
{
    var client = new WebClient();
    client.DownloadStringCompleted += (sender, e) =>
    {
        if (e.Error == null)
        {
            // e.Result contains the downloaded HTML
        }
        else
        {
            // some error occurred: inspect the e.Error property
        }
    };
    client.DownloadStringAsync(new Uri(url));
}
Darin Dimitrov
Ouch. Firing all requests off in one go isn't going to scale very well. With a sufficiently large list there will be several points of failure, including timeouts at this end, routers getting overloaded, port exhaustion, and so on. The requests will *need* scheduling to go beyond a toy crawler.
spender
Then just use a ThreadPool.
Jan Jongboom
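
A rough sketch of the kind of throttling spender is asking for, in the ThreadPool direction Jan mentions: synchronous downloads are queued to the ThreadPool while a Semaphore caps how many run at once. The ThrottledCrawler class, the limit of 10, the error handling, and the CountdownEvent used to keep Main alive are illustrative assumptions, not anything from this thread (urls.txt just follows Darin's snippet).

using System;
using System.IO;
using System.Net;
using System.Threading;

class ThrottledCrawler
{
    static void Main()
    {
        string[] urls = File.ReadAllLines("urls.txt");
        var throttle = new Semaphore(10, 10);          // at most 10 requests in flight
        var allDone = new CountdownEvent(urls.Length); // signals when every URL has been handled

        foreach (string url in urls)
        {
            throttle.WaitOne();                        // block until a slot is free
            ThreadPool.QueueUserWorkItem(state =>
            {
                var u = (string)state;
                try
                {
                    using (var client = new WebClient())
                    {
                        string html = client.DownloadString(u);
                        // TODO: persist html and mark u as crawled
                    }
                }
                catch (WebException ex)
                {
                    // TODO: record the failure so u can be retried later
                    Console.Error.WriteLine("{0}: {1}", u, ex.Status);
                }
                finally
                {
                    throttle.Release();                // free the slot
                    allDone.Signal();
                }
            }, url);
        }

        allDone.Wait();                                // keep Main alive until the batch finishes
    }
}
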
A: 

Here is my opinion on storing the data.

I would suggest using a relational database to store the page list, because it will make these tasks easier:

  • retrieving the pages to crawl (basically the N pages with the oldest LastSuccessfulCrawlDate; see the sketch after the table definition below)
  • adding the newly discovered pages
  • marking pages as crawled (setting the LastSuccessfulCrawlDate)
  • in case of a program crash, your data is already safe
  • you could add columns to store the number of retries, so you can automatically discard pages that failed more than N times ...

An example of a relational model would be:

//this would contain all the crawled pages
table Pages {
    Id bigint,
    Url nvarchar(2000),
    Created DateTime,
    LastSuccessfulCrawlDate DateTime,
    NumberOfRetry int,      //increment this on each failure; if it reaches 10, set Ignored to True
    Title nvarchar(200),    //this is where you would put the page title
    Content nvarchar(max),  //this is where you would put the html
    Ignored bool            //set it to True to ignore this page
}
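
As a rough illustration of the first and third list items above (fetching the N stalest pages and stamping a successful crawl), here is a sketch using ADO.NET. It assumes SQL Server / System.Data.SqlClient syntax to match the T-SQL-looking model above; the PageStore class, the method names, the connection string, and the batch-size parameter are mine, and the SQL would need adapting (e.g. TOP vs. LIMIT) for other databases.

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class PageStore
{
    readonly string connectionString;

    public PageStore(string connectionString)
    {
        this.connectionString = connectionString;
    }

    // "retrieving the pages to crawl": the N pages with the oldest LastSuccessfulCrawlDate
    public List<KeyValuePair<long, string>> GetPagesToCrawl(int batchSize)
    {
        var pages = new List<KeyValuePair<long, string>>();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            @"SELECT TOP (@batchSize) Id, Url
              FROM Pages
              WHERE Ignored = 0
              ORDER BY LastSuccessfulCrawlDate", connection))
        {
            command.Parameters.AddWithValue("@batchSize", batchSize);
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    pages.Add(new KeyValuePair<long, string>(reader.GetInt64(0), reader.GetString(1)));
            }
        }
        return pages;
    }

    // "marking pages as crawled": stamp the date and store the downloaded content
    public void MarkCrawled(long id, string title, string content)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            @"UPDATE Pages
              SET LastSuccessfulCrawlDate = GETUTCDATE(), Title = @title, Content = @content
              WHERE Id = @id", connection))
        {
            command.Parameters.AddWithValue("@id", id);
            command.Parameters.AddWithValue("@title", title);
            command.Parameters.AddWithValue("@content", content);
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}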

You could also handle referrers with a table with this structure:

//this links a parent page to the pages it references
table Referer {
    ParentId bigint,
    ChildId bigint
}

It could allow you to implement your very own Page Rank :p

Manitra Andriamitondra
Storing metadata in the DB probably makes the most sense; the actual HTML content should be stored in the filesystem. I'm using MySQL.
traderde
+1  A: 

I recommend that you pull from a Queue and fetch each URL in a separate thread, peeling URLs off the Queue until you reach the maximum number of simultaneous threads you want to allow. Each thread invokes a callback method that reports whether it finished successfully or encountered a problem.

As you start each thread, put its ManagedThreadId into a Dictionary, with the key being the id and the value being the thread's status. The callback method reports the thread's id and completion status. Delete each thread from the Dictionary as it completes and launch the next waiting thread. If a URL didn't finish successfully, add it back to the queue.

The Dictionary's Count property tells you how many threads are in flight, and the callback can also be used to update your UI or check for a pause or halt signal. If you need to persist your results in case of a system crash, then consider using database tables in lieu of memory-resident collections, such as Manitra describes.

This approach has worked very well for me for lots of simultaneous threads.
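
A rough sketch of how this queue-plus-Dictionary pattern might be wired up. The QueueCrawler class, the Fetch/OnCompleted method names, the WebClient download, and the limit of 10 threads are illustrative choices rather than ebpower's code, and the Dictionary here tracks the URL each thread is working on (so failures can be requeued) rather than a status flag.

using System;
using System.Collections.Generic;
using System.Net;
using System.Threading;

class QueueCrawler
{
    const int MaxSimultaneousThreads = 10;   // illustrative limit

    readonly object sync = new object();
    readonly Queue<string> pending = new Queue<string>();
    readonly Dictionary<int, string> inFlight = new Dictionary<int, string>(); // ManagedThreadId -> URL

    public void Run(IEnumerable<string> urls)
    {
        foreach (var url in urls)
            pending.Enqueue(url);

        lock (sync)
        {
            // launch the first batch; Run returns once it is started,
            // so a real app would also wait for the queue to drain
            while (inFlight.Count < MaxSimultaneousThreads && pending.Count > 0)
                LaunchNext();
        }
    }

    // must be called while holding the lock
    void LaunchNext()
    {
        string url = pending.Dequeue();
        var worker = new Thread(() => Fetch(url));
        inFlight[worker.ManagedThreadId] = url;
        worker.Start();
    }

    void Fetch(string url)
    {
        bool succeeded;
        try
        {
            using (var client = new WebClient())
            {
                string html = client.DownloadString(url);
                // TODO: persist html somewhere durable
                succeeded = true;
            }
        }
        catch (WebException)
        {
            succeeded = false;
        }
        OnCompleted(Thread.CurrentThread.ManagedThreadId, url, succeeded);
    }

    // the "callback": report id and status, drop the thread from the Dictionary,
    // requeue failures, and launch the next waiting URL
    void OnCompleted(int threadId, string url, bool succeeded)
    {
        lock (sync)
        {
            inFlight.Remove(threadId);
            if (!succeeded)
                pending.Enqueue(url);        // naive retry; a real crawler would cap retries
            if (pending.Count > 0)
                LaunchNext();
            // inFlight.Count is the number of threads still running (e.g. for UI updates)
        }
    }
}

Usage would be something like new QueueCrawler().Run(File.ReadAllLines("urls.txt")), with the body of Fetch swapped for whatever persistence you settle on.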

ebpower
Thanks, this is the implementation I was pondering. Scheduling and thread/request limits will definitely be needed.
traderde
Glad to help. If this does it for you, then please mark as answered.
ebpower