views:

350

answers:

4

Hi all

I am working on a web crawler, using the WebBrowser control for this purpose. I have a list of URLs stored in a database, and I want to traverse those URLs one by one and parse the HTML of each page.

I used the following logic:

            foreach (string href in hrefs)
            {
                webBrowser1.Url = new Uri(href);
                webBrowser1.Navigate(href);
            }

I want to do some work in the webBrowser1_DocumentCompleted event once each page has loaded completely. But webBrowser1_DocumentCompleted never gets control while the loop is running; it only fires after the last URL in hrefs has been navigated to and execution exits the loop.

What's the best way to handle such a problem?

+2  A: 

Store the list somewhere in your state, as well as the index of where you've got to. Then in the DocumentCompleted event, parse the HTML and then navigate to the next page.
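
For illustration, a minimal sketch of that pattern (the field names and the ParseHtml helper are hypothetical, not from the answer):

// Hypothetical sketch: keep the URL list and the current index as state,
// and advance from inside DocumentCompleted.
private List<string> hrefs; // loaded from the database
private int current;

private void StartCrawl()
{
    current = 0;
    webBrowser1.Navigate(hrefs[current]);
}

private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    if (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
        return; // DocumentCompleted fires per frame; wait for the full document

    ParseHtml(webBrowser1.DocumentText); // hypothetical parsing helper

    current++;
    if (current < hrefs.Count)
        webBrowser1.Navigate(hrefs[current]);
}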

(Personally I wouldn't use the WebBrowser control for web crawling... I know it means it'll handle the JavaScript for you, but it'll be a lot harder to parallelize nicely than using multiple WebRequest or WebClient objects.)
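
A minimal sketch of that WebClient route, assuming hrefs holds the URL list and JavaScript execution is not needed (requires System.Net and System.Threading.Tasks):

// Fetch pages in parallel with WebClient; no browser, so scripts do not run.
Parallel.ForEach(hrefs, href =>
{
    using (var client = new WebClient())
    {
        string html = client.DownloadString(href);
        // parse the HTML here
    }
});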

Jon Skeet
A: 

While I agree with Jon Skeet with regard to using WebClient instead of a full-blown WebBrowser, I have a solution which should do what you need. I ran into a few problems while writing the code:

1. DocumentCompleted is fired multiple times for a single page (I believe it is due to frames etc.), so I had to add a check.
2. Because I use ManualResetEventSlim (you can use ManualResetEvent if you're earlier than .NET 4.0), another thread was necessary.

I'm pretty sure this code can be improved but this is at least a start.

var wait = new ManualResetEventSlim();

webBrowser1.DocumentCompleted += (o, e0) =>
{
    // DocumentCompleted fires once per frame; only signal when the
    // whole document has finished loading.
    if (webBrowser1.ReadyState == WebBrowserReadyState.Complete)
        wait.Set();
};

// Loop on a separate thread so wait.Wait() doesn't block the UI message loop.
var t = new Thread(() =>
{
    foreach (string href in hrefs)
    {
        webBrowser1.Navigate(href);

        wait.Wait();
        wait.Reset();

        // Do your work with the url here, or just after wait.Set();
        Console.WriteLine(href + " completed.");
    }
});

t.Start();

A better version, inspired by Akash, without the unnecessary complexity of threading:

var queue = new Queue<string>(hrefs);

Action navigate = () =>
{
    if (queue.Count != 0)
        webBrowser1.Navigate(queue.Dequeue());
};

webBrowser1.DocumentCompleted += (o, e0) =>
{
    if (webBrowser1.ReadyState == WebBrowserReadyState.Complete)
    {
        // Do something with the loaded page, then move on.
        navigate();
    }
};

// Kick off the first navigation; the rest chain from DocumentCompleted.
navigate();
lasseespeholt
I have my doubts; you cannot access the web browser control from any thread other than the UI thread. And first of all, this makes no sense, because you will end up in a deadlock: you are using the same wait handle to block across multiple threads, and only one instance of the browser to navigate all the URLs!!
Akash Kava
I will improve on it. But it actually works here. Why would I end up in a deadlock? I'm only iterating one URL at a time, and each iteration waits until the flag is up (document completed).
lasseespeholt
Your flag will never be up, because you are setting a new URL on the same browser; instead the navigation will be cancelled and your wait.Set will never fire at all!!
Akash Kava
I'm not a pro at concurrency, so you probably have a point in the things you are saying. But try it - it works (at least the times I have tried it). It blocks while a document is not yet completed.
lasseespeholt
I should point out that this is not intended to run in parallel. But that is not what the question was about. As I read it, he just wants to load a page and do something with it when it is done. His code just iterates through all the URLs in milliseconds and only fully loads the last one.
lasseespeholt
+1  A: 

First of all, you are setting a new URL on the same web browser control before it has loaded anything, so you will simply see the last URL in your browser. The browser will certainly take some time to load each URL, so I guess navigation is cancelled well before DocumentCompleted can be fired.

There is only one way to do this simultaneously:

You have to use a tab control and open a new tab item for every URL; each tab item will have its own web browser control, and you can set its URL.

foreach (string href in hrefs)
{
    TabPage page = new TabPage(href);
    WebBrowser wb = new WebBrowser();
    wb.Dock = DockStyle.Fill;
    wb.DocumentCompleted += wb_DocumentCompleted;
    wb.Url = new Uri(href);
    page.Controls.Add(wb);
    tabControl1.TabPages.Add(page);
}

private void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // do your stuff...
}

In order to improve on the above method, you could look into creating the tab items on different UI threads; it's a pretty long topic to discuss here, but it is still possible.

Another method is to use a queue...

private static Queue<string> queue = new Queue<string>();

foreach (string href in hrefs)
{
    queue.Enqueue(href);
}

// Start the first navigation; DocumentCompleted keeps the chain going.
webBrowser1.Url = new Uri(queue.Dequeue());

private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    if (queue.Count > 0)
    {
        webBrowser1.Url = new Uri(queue.Dequeue());
    }
}
Akash Kava
+1 Inspired by your queue approach. I'm not sure he wants to run it in parallel - I just think he wants a loop which waits on "complete" before moving on. Best regards
lasseespeholt
A: 

Hi lasseespeholt,

Thanks for the code examples. I'm using the queue example. My requirement is to print the web page as it loads each URL. I'm getting JavaScript errors while navigating through each URL. Can you please tell me how to suppress these errors? Thank you.
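
(A common way to hide those script error dialogs with the WinForms WebBrowser control is its ScriptErrorsSuppressed property; set it before navigating:)

webBrowser1.ScriptErrorsSuppressed = true;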