Hi, I am writing a program to crawl websites. The crawl function is recursive and can take a long time to complete, so I used multithreading to crawl multiple websites. What I actually need is for the next website (which should be waiting in a queue) to be crawled only after the previous one has finished, instead of several websites being crawled at the same time. I am using C# and ASP.NET.
Put all your URLs in a queue, and pop one off the queue each time you are done with the previous one.
You could also put the recursive links in the queue, to better control how many downloads you are executing at a time.
You could set up X worker threads that each take a URL off the queue, so that more are processed at a time. Either way, you can throttle it yourself.
You can use ConcurrentQueue<T> in .NET to get a thread-safe queue to work with.
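As a rough illustration of the queue-plus-workers idea, here is a minimal sketch. The names `QueueCrawler`, `Run` and `Crawl` are made up for this example, and `Crawl` is only a placeholder that records the URL instead of downloading anything. It also assumes that no new links are added to the queue while the workers are running.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class QueueCrawler
{
    // Hypothetical stand-in for the real crawl logic: it only records the URL.
    private static void Crawl(string url, ConcurrentQueue<string> visited)
    {
        visited.Enqueue(url);
    }

    // Drains the queue with a fixed number of worker threads and
    // returns the URLs in the order they were picked up.
    public static List<string> Run(IEnumerable<string> urls, int workerCount)
    {
        var pending = new ConcurrentQueue<string>(urls);
        var visited = new ConcurrentQueue<string>();
        var workers = new List<Thread>();

        for (int i = 0; i < workerCount; i++)
        {
            var worker = new Thread(() =>
            {
                string url;
                // TryDequeue returns false once the queue is empty,
                // which ends this worker.
                while (pending.TryDequeue(out url))
                {
                    Crawl(url, visited);
                }
            });
            worker.Start();
            workers.Add(worker);
        }

        // Wait until every worker has finished.
        foreach (Thread worker in workers)
        {
            worker.Join();
        }
        return new List<string>(visited);
    }
}
```

With `workerCount` set to 1 the sites are handled strictly one after another in queue order, which is what the question asks for. Raising the number is the throttle for how many run at once.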
I don't usually think positive thoughts when it comes to web crawlers...
You want to use a threadpool.
ThreadPool.QueueUserWorkItem(new WaitCallback(CrawlSite), (object)s);
You simply 'push' your workload into the queue, and let the thread pool manage it.
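For completeness, a small sketch of how the caller might wait for the queued work items to finish. `PoolCrawler` and `CrawlAll` are hypothetical names, and `CrawlSite` is a placeholder. Note that the thread pool runs items concurrently, so on its own this does not crawl one site at a time.

```csharp
using System;
using System.Threading;

public static class PoolCrawler
{
    // Hypothetical placeholder for the real crawl logic.
    private static void CrawlSite(object state)
    {
        string site = (string)state;
        Console.WriteLine("Crawling " + site);
    }

    // Queues one work item per site and blocks until all have run.
    // Returns the number of sites that completed without throwing.
    public static int CrawlAll(string[] sites)
    {
        int completed = 0;
        using (var done = new CountdownEvent(sites.Length))
        {
            foreach (string s in sites)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try
                    {
                        CrawlSite(state);
                        Interlocked.Increment(ref completed);
                    }
                    finally
                    {
                        // Signal even on failure so Wait() cannot hang.
                        done.Signal();
                    }
                }, s);
            }
            done.Wait();
        }
        return completed;
    }
}
```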
I have to say - I'm not a Threading expert and my C# is quite rusty - but considering the requirements I would suggest something like this:
1. Define a queue for the websites.
2. Define a pool of Crawler threads.
3. The main process iterates over the website queue and retrieves the next site address.
4. Retrieve an available thread from the pool, assign it the website address, and start it running. Set an indicator in the thread object that it should wait for all subsequent threads to finish (so that you do not continue to the next site).
5. Once all the threads have ended, the main thread (started in step 4) ends and control returns to the main loop of the main process, which continues with the next website.
The Crawler behavior should be something like this:
- Investigate the content of the current address.
- Retrieve the hierarchy below the current level.
- For each child of the current node of the site tree, pull a new Crawler thread from the pool and start it running in the background with the address of the child node.
- If the pool is empty, wait until a thread becomes available.
- If the thread is marked to wait, wait for all the other threads to finish.
I think there are some challenges here, but as a general flow I believe it can do the job.
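One possible way to sketch this flow is shown below. Instead of a "wait" flag on the thread object it uses a counter of outstanding pages to detect when a site is finished, so it is a variation on the idea rather than a literal implementation of it. `SiteCrawler`, `CrawlSite` and the `getChildren` delegate (which stands in for downloading a page and extracting its links) are all assumptions made for this example, and it assumes `getChildren` does not throw.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class SiteCrawler
{
    // Crawls a single site with up to maxWorkers threads and returns only
    // after every page reachable from root has been processed.
    public static List<string> CrawlSite(
        string root,
        Func<string, IEnumerable<string>> getChildren,
        int maxWorkers)
    {
        var seen = new ConcurrentDictionary<string, bool>();
        var processed = new ConcurrentQueue<string>();

        using (var work = new BlockingCollection<string>())
        {
            int pending = 1; // pages that are queued or being processed
            seen.TryAdd(root, true);
            work.Add(root);

            var crawlers = new List<Thread>();
            for (int i = 0; i < maxWorkers; i++)
            {
                var crawler = new Thread(() =>
                {
                    foreach (string url in work.GetConsumingEnumerable())
                    {
                        processed.Enqueue(url);
                        foreach (string child in getChildren(url))
                        {
                            // Queue each page only once.
                            if (seen.TryAdd(child, true))
                            {
                                Interlocked.Increment(ref pending);
                                work.Add(child);
                            }
                        }
                        // When the counter hits zero nothing is queued or
                        // in progress, so the site is finished.
                        if (Interlocked.Decrement(ref pending) == 0)
                        {
                            work.CompleteAdding();
                        }
                    }
                });
                crawler.Start();
                crawlers.Add(crawler);
            }

            foreach (Thread crawler in crawlers)
            {
                crawler.Join();
            }
        }
        return new List<string>(processed);
    }
}
```

Calling `CrawlSite` in a plain `foreach` loop over the website queue then gives the one-site-at-a-time behavior, because each call blocks until its site is done.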
The standard practice for doing this is to use a blocking queue. If you are using .NET 4.0, you can take advantage of the BlockingCollection class; otherwise you can use Stephen Toub's implementation.
What you will do is spin up as many worker threads as you feel necessary and have them go around in an infinite loop, dequeueing items as they appear in the queue. Your main thread will be enqueueing the items. A blocking queue is designed to wait/block on the dequeue operation until an item becomes available.
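using System.Threading;

public class Program
{
    // BlockingQueue<T> is the custom blocking queue mentioned above;
    // it is not part of the .NET Framework.
    private static BlockingQueue<string> m_Queue = new BlockingQueue<string>();

    public static void Main()
    {
        // Start two worker threads that consume from the queue.
        var thread1 = new Thread(Process);
        var thread2 = new Thread(Process);
        thread1.Start();
        thread2.Start();

        // The main thread acts as the producer.
        while (true)
        {
            // GetNextUrl is assumed to be defined elsewhere.
            string url = GetNextUrl();
            m_Queue.Enqueue(url);
        }
    }

    public static void Process()
    {
        while (true)
        {
            // Blocks until an item becomes available.
            string url = m_Queue.Dequeue();
            // Do whatever with the url here.
        }
    }
}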
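For .NET 4.0 and later, the same pattern could be written with the built-in BlockingCollection<T>, roughly as in the sketch below. `BlockingCollectionDemo` and `ProcessAll` are names invented for this example, the bounded capacity of 100 is arbitrary, and it assumes `workerCount` is at least 1. Unlike the infinite loops above, it shuts the workers down cleanly by calling `CompleteAdding`.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class BlockingCollectionDemo
{
    // Feeds the URLs to workerCount consumer threads and returns
    // how many items were processed.
    public static int ProcessAll(IEnumerable<string> urls, int workerCount)
    {
        int processed = 0;
        using (var queue = new BlockingCollection<string>(100))
        {
            var workers = new Thread[workerCount];
            for (int i = 0; i < workerCount; i++)
            {
                workers[i] = new Thread(() =>
                {
                    // Blocks while the queue is empty and ends once
                    // CompleteAdding has been called and it is drained.
                    foreach (string item in queue.GetConsumingEnumerable())
                    {
                        // Do whatever with the url here.
                        Interlocked.Increment(ref processed);
                    }
                });
                workers[i].Start();
            }

            // The calling thread is the producer. Add blocks when the
            // bounded capacity is reached.
            foreach (string url in urls)
            {
                queue.Add(url);
            }
            queue.CompleteAdding();

            foreach (Thread worker in workers)
            {
                worker.Join();
            }
        }
        return processed;
    }
}
```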