I am writing a crawler in C# that starts with a set of known URLs in a file. I want to pull the pages down asynchronously. My question is: what is the best pattern for this? For example, should I read the file into a List/Array of URLs and create an array to store the completed URLs? Should I create a two-dimensional array to track the status of threads and their completion? Other considerations are retries (if the first request is slow or dead) and automatic restarts (after an app/system crash).
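    using System;
    using System.IO;
    using System.Net;

    foreach (var url in File.ReadAllLines("urls.txt"))
    {
        // One WebClient per URL: a single instance does not support
        // several asynchronous operations running at the same time.
        var client = new WebClient();
        client.DownloadStringCompleted += (sender, e) =>
        {
            if (e.Error == null)
            {
                // e.Result will contain the downloaded HTML
            }
            else
            {
                // some error occurred: analyze the e.Error property
            }
        };
        // Returns immediately; the handler above runs when the download ends
        client.DownloadStringAsync(new Uri(url));
    }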
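On the retries mentioned in the question, the error branch above is where a retry would be triggered. Below is a minimal sketch of one way to wrap a download in a retry loop. The names `RetryingFetcher` and `FetchWithRetriesAsync`, the `download` delegate, the limit of 3 attempts and the linear back-off delay are all assumptions made for illustration. They are not part of the snippet above.

```csharp
using System;
using System.Threading.Tasks;

public static class RetryingFetcher
{
    // Calls `download` until it succeeds or maxAttempts is reached.
    // `download` is a hypothetical delegate supplied by the caller,
    // e.g. a wrapper around whatever HTTP client you use.
    public static async Task<string> FetchWithRetriesAsync(
        string url,
        Func<string, Task<string>> download,
        int maxAttempts = 3)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await download(url);
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Assumed policy: wait a little longer before each new attempt.
                // On the last attempt the filter is false, so the exception
                // propagates to the caller.
                await Task.Delay(TimeSpan.FromMilliseconds(100 * attempt));
            }
        }
    }
}
```

Passing the download as a delegate keeps the retry policy independent of the HTTP client, so the same wrapper could be used with `WebClient` or any other client.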
Here is my opinion about storing the data.
I would suggest using a relational database to store the page list, because it makes the following tasks easier:
- retrieving the pages to crawl (basically the N pages with the oldest LastSuccessfullCrawlDate)
- adding newly discovered pages
- marking pages as crawled (set the LastSuccessfullCrawlDate column)
- surviving a program crash, since your data is already safely stored
- storing the number of retries in a column, so that you can automatically discard pages that failed more than N times
An example of a relational model would be:
    // this table would contain all the crawled pages
    table Pages {
        Id bigint,
        Url nvarchar(2000),
        Created DateTime,
        LastSuccessfullCrawlDate DateTime,
        NumberOfRetry int,     // increment this when a failure occurs; if it reaches 10, set Ignored to True
        Title nvarchar(200),   // this is where you would put the page title
        Content nvarchar(max), // this is where you would put the HTML
        Ignored Bool           // set it to True to ignore this page
    }
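The scheduling and retry rules described above can be sketched in memory before committing to a database. This is only an illustration under stated assumptions: the `Page` class mirrors a subset of the columns of the `Pages` table, while `CrawlSchedule`, `NextToCrawl`, `RecordFailure` and `RecordSuccess` are hypothetical names. Treating never-crawled pages as the oldest and resetting the retry counter on success are my own choices, not something the model prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Mirrors a subset of the columns of the Pages table.
public class Page
{
    public long Id { get; set; }
    public string Url { get; set; }
    public DateTime? LastSuccessfullCrawlDate { get; set; } // null = never crawled
    public int NumberOfRetry { get; set; }
    public bool Ignored { get; set; }
}

public static class CrawlSchedule
{
    public const int MaxRetries = 10;

    // Returns the n non-ignored pages with the oldest successful crawl date.
    // Assumption: pages that were never crawled sort first.
    public static List<Page> NextToCrawl(IEnumerable<Page> pages, int n)
    {
        return pages
            .Where(p => !p.Ignored)
            .OrderBy(p => p.LastSuccessfullCrawlDate ?? DateTime.MinValue)
            .Take(n)
            .ToList();
    }

    // Increments the retry counter and ignores the page once it reaches MaxRetries.
    public static void RecordFailure(Page page)
    {
        page.NumberOfRetry++;
        if (page.NumberOfRetry >= MaxRetries)
        {
            page.Ignored = true;
        }
    }

    // Marks the page as crawled. Assumption: a success resets the retry counter.
    public static void RecordSuccess(Page page, DateTime now)
    {
        page.LastSuccessfullCrawlDate = now;
        page.NumberOfRetry = 0;
    }
}
```

With a real database, `NextToCrawl` would become a query that filters on `Ignored`, orders by `LastSuccessfullCrawlDate` and takes the first N rows.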
You could also handle referrers with a table with this structure:
    // this table would contain the links between pages (the parent page links to the child page)
    table Referer {
        ParentId bigint,
        ChildId bigint
    }
It could allow you to implement your very own Page Rank :p
I recommend that you pull from a Queue and fetch each URL in a separate thread, dequeuing URLs until you reach the maximum number of simultaneous threads that you want to allow. Each thread invokes a callback method that reports whether it finished successfully or encountered a problem.
As you start each thread, put its ManagedThreadId into a Dictionary, with the key being the id and the value being the thread status. The callback method should report the thread's id and completion status. Remove each thread from the Dictionary as it completes and launch the next waiting thread. If a thread didn't finish successfully, add its URL back to the queue.
The Dictionary's Count property tells you how many threads are in flight, and the callback can also be used to update your UI or check for a pause or halt signal. If you need to persist your results in case of a system crash, then you should consider using database tables in lieu of memory-resident collections, such as manitra describes.
This approach has worked very well for me for lots of simultaneous threads.
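As a rough illustration of the queue-plus-dictionary pattern described above, here is a minimal sketch. It is not the code I actually used. `ThreadedCrawler`, the `fetch` delegate (which returns true on success) and the `Completed` and `Failed` lists are hypothetical names, and the cap on attempts per URL is an added assumption so that a permanently dead URL is not requeued forever.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class ThreadedCrawler
{
    private readonly object _gate = new object();
    private readonly Queue<string> _pending;
    // Key: ManagedThreadId of a running thread; value: its status.
    private readonly Dictionary<int, string> _inFlight = new Dictionary<int, string>();
    private readonly Dictionary<string, int> _attempts = new Dictionary<string, int>();
    private readonly Func<string, bool> _fetch; // hypothetical: true = success
    private readonly int _maxThreads;
    private readonly int _maxAttempts;

    public List<string> Completed { get; } = new List<string>();
    public List<string> Failed { get; } = new List<string>();

    public ThreadedCrawler(IEnumerable<string> urls, Func<string, bool> fetch,
                           int maxThreads = 4, int maxAttempts = 3)
    {
        if (maxThreads < 1) throw new ArgumentOutOfRangeException(nameof(maxThreads));
        _pending = new Queue<string>(urls);
        _fetch = fetch;
        _maxThreads = maxThreads;
        _maxAttempts = maxAttempts;
    }

    // Blocks until the queue is empty and no thread is in flight.
    public void Run()
    {
        lock (_gate)
        {
            while (_pending.Count > 0 || _inFlight.Count > 0)
            {
                // Dequeue until the maximum number of simultaneous threads is reached.
                while (_pending.Count > 0 && _inFlight.Count < _maxThreads)
                {
                    StartWorker(_pending.Dequeue());
                }
                // Releases the lock and sleeps until a callback pulses.
                Monitor.Wait(_gate);
            }
        }
    }

    // Must be called while holding _gate.
    private void StartWorker(string url)
    {
        var thread = new Thread(() =>
        {
            bool ok;
            try { ok = _fetch(url); }
            catch (Exception) { ok = false; }
            OnFinished(Thread.CurrentThread.ManagedThreadId, url, ok);
        });
        _inFlight[thread.ManagedThreadId] = "running";
        thread.Start();
    }

    // The callback: receives the thread id and its completion status.
    private void OnFinished(int threadId, string url, bool ok)
    {
        lock (_gate)
        {
            _inFlight.Remove(threadId);

            int attempts;
            _attempts.TryGetValue(url, out attempts);
            attempts++;
            _attempts[url] = attempts;

            if (ok) Completed.Add(url);
            else if (attempts < _maxAttempts) _pending.Enqueue(url); // back in the queue
            else Failed.Add(url);

            Monitor.Pulse(_gate); // wake Run() so it can launch the next thread
        }
    }
}
```

A caller would construct it with the URL list and a fetch function, call `Run()`, and then read `Completed` and `Failed`. The callback is also the place where a UI update or a check for a pause or halt signal could go.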