Is there a way to view the generated source of a web page (the code after all AJAX calls and JavaScript DOM manipulations have taken place) from a C# application without opening up a browser from the code?

Viewing the initial page using a WebRequest or WebClient object works OK, but if the page makes extensive use of JavaScript to alter the DOM on page load, then these don't provide an accurate picture of the page.

I have tried using the Selenium and WatiN UI testing frameworks and they work perfectly, supplying the generated source as it appears after all JavaScript manipulations are completed. Unfortunately, they do this by opening up an actual web browser, which is very slow. I've implemented a Selenium server which offloads this work to another machine, but there is still a substantial delay.

Is there a .NET library that will load and parse a page (like a browser) and spit out the generated code? Clearly, Google and Yahoo aren't opening up browsers for every page they want to spider (of course they may have more resources than me...).

Is there such a library or am I out of luck unless I'm willing to dissect the source code of an open source browser?

SOLUTION

Well, thank you everyone for your help. I have a working solution that is about 10x faster than Selenium. Woo!

Thanks to this old article from beansoftware I was able to use the System.Windows.Forms.WebBrowser control to download the page and parse it, then give me the generated source. Even though the control is in Windows.Forms, you can still run it from ASP.NET (which is what I'm doing), just remember to add System.Windows.Forms to your project references.

There are two notable things about the code. First, the WebBrowser control is called in a new thread. This is because it must run in a single-threaded apartment (STA).

Second, the GeneratedSource variable is set in two places. This is not due to an intelligent design decision :) I'm still working on it and will update this answer when I'm done. wb_DocumentCompleted() is called multiple times. First when the initial HTML is downloaded, then again when the first round of JavaScript completes. Unfortunately, the site I'm scraping has 3 different loading stages. 1) Load initial HTML 2) Do first round of JavaScript DOM manipulation 3) pause for half a second then do a second round of JS DOM manipulation.

For some reason, the second round isn't caught by the wb_DocumentCompleted() function, but it is always caught when wb.ReadyState == Complete. So why not remove it from wb_DocumentCompleted()? I'm still not sure why it isn't caught there, and that's where the beansoftware article recommended putting it. I'm going to keep looking into it. I just wanted to publish this code so anyone who's interested can use it. Enjoy!

using System.Threading;
using System.Windows.Forms;

public class WebProcessor
{
    private string GeneratedSource { get; set; }
    private string URL { get; set; }

    public string GetGeneratedHTML(string url)
    {
        URL = url;

        Thread t = new Thread(new ThreadStart(WebBrowserThread));
        t.SetApartmentState(ApartmentState.STA);
        t.Start();
        t.Join();

        return GeneratedSource;
    }

    private void WebBrowserThread()
    {
        WebBrowser wb = new WebBrowser();

        // Subscribe before navigating (the conventional order for event handlers)
        wb.DocumentCompleted +=
            new WebBrowserDocumentCompletedEventHandler(
                wb_DocumentCompleted);

        wb.Navigate(URL);

        while (wb.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();

        // Added this line because the final HTML takes a while to show up
        GeneratedSource = wb.Document.Body.InnerHtml;

        wb.Dispose();
    }

    private void wb_DocumentCompleted(object sender, 
        WebBrowserDocumentCompletedEventArgs e)
    {
        WebBrowser wb = (WebBrowser)sender;
        GeneratedSource = wb.Document.Body.InnerHtml;
    }
}
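
One possible way to cope with the three loading stages described above, where a later round of JavaScript changes the DOM after the page already reports Complete, is to keep pumping messages until the HTML stops changing for a short while. The following is only a sketch under that assumption, not something taken from the beansoftware article: the StableSource class, the WaitUntilStable method, and the quiet-period and timeout values are all hypothetical names and numbers chosen for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper (not part of the original solution).
public static class StableSource
{
    // Polls readSource until its value has not changed for quietPeriod,
    // or until timeout elapses, and returns the last value seen.
    // pump is invoked on every iteration so that a caller on an STA
    // thread can pass Application.DoEvents to keep the control alive.
    public static string WaitUntilStable(
        Func<string> readSource,
        Action pump,
        TimeSpan quietPeriod,
        TimeSpan timeout)
    {
        Stopwatch total = Stopwatch.StartNew();
        Stopwatch sinceChange = Stopwatch.StartNew();
        string last = readSource();

        while (sinceChange.Elapsed < quietPeriod && total.Elapsed < timeout)
        {
            pump();
            Thread.Sleep(10); // short pause so the loop does not spin the CPU

            string current = readSource();
            if (current != last)
            {
                // The DOM changed again, so restart the quiet-period clock.
                last = current;
                sinceChange.Restart();
            }
        }

        return last;
    }
}
```

Inside WebBrowserThread(), after the ReadyState loop, this might be used as `GeneratedSource = StableSource.WaitUntilStable(() => wb.Document.Body.InnerHtml, Application.DoEvents, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(15));`. The quiet period has to be longer than the longest pause between the page's JavaScript rounds (half a second in the case described above), so the values would need tuning per site.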
+2  A: 

It is possible using an instance of a browser (in your case: the IE control). You can easily use it in your app and open a page. The control will then load it and process any JavaScript. Once this is done, you can access the control's DOM object and get the "interpreted" code.

Niko
That's what WatiN does.
orip
Wouldn't this still have the same speed problems as opening the browser?
Michael La Voie
Since you want your code to be interpreted and parsed, the speed "problem" would be pretty much the same (maybe a little less CPU if you don't display the window, plus you have a little less overhead). As far as I remember, you can also prevent the control from loading images, thus reducing the load time even more. But that's the only way you can accomplish what you want, I am afraid.
Niko
Thanks for your help. I posted my final answer, but yours was what sent me in that direction. :D
Michael La Voie
+1  A: 

Theoretically yes, but, at present, no.

I don't think there is currently a product or OSS project that does this. Such a product would need to have its own JavaScript interpreter and be able to accurately emulate the run-time environment and quirks of every browser it supports.

Given that you need something that accurately emulates the server + browser environment in order to produce the final page code, in the long run, I think that using a browser instance is the best way to accurately generate the page in its final state. This is especially true when you consider that, after the page load completes, the page source can still change over time in the browser from AJAX/JavaScript.

Jeff Leonard
You may be right, and thanks for the thought. I did find a Java library that may be what I need, but I'm still hoping for a .NET solution. Surely someone else has needed this before me: http://stackoverflow.com/questions/857515/screen-scraping-from-a-web-page-with-a-lot-of-javascript/857630#857630
Michael La Voie
A: 

I also have similar requirements. Did you find a solution to your question? If so, please let me know as well.

Thanks .Net Developer

Net developer
Yes, about halfway through my question, I updated it and posted my solution. Also, please don't post answers that aren't answers.
Michael La Voie