I'm downloading a web page using WebClient:

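// Requires: System, System.IO, System.Net, System.Text, System.Windows.Forms.
// "client" and "eUrl" are members of the form (a WebClient field and a TextBox).
public void download()
{
    client = new WebClient();
    client.DownloadStringCompleted += new DownloadStringCompletedEventHandler(client_DownloadStringCompleted);
    client.Encoding = Encoding.UTF8;
    client.DownloadStringAsync(new Uri(eUrl.Text));
}

void client_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
    // e.Result throws if the request failed or was cancelled, so check first.
    if (e.Cancelled || e.Error != null)
        return;

    SaveFileDialog sd = new SaveFileDialog();
    if (sd.ShowDialog() == DialogResult.OK)
    {
        // "using" closes the file even if Write throws.
        using (StreamWriter writer = new StreamWriter(sd.FileName, false, Encoding.Unicode))
        {
            writer.Write(e.Result);
        }
    }
}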

This works fine, but I can't read content that is loaded via AJAX, like this:

<div class="center-box-body" id="boxnews" style="width:768px;height:1167px; ">
    loading .... </div>

<script language="javascript">
    ajax_function('boxnews',"ajax/category/personal_notes/",'');
    </script>

This "ajax_function" fetches the data from the server on the client side, after the page has loaded.

How can I download the full HTML of the page, including the content loaded by AJAX?

A: 

I think you'd need to use a WebBrowser control to do this, since the JavaScript on the page actually has to run to complete the page load. Depending on your application this may or may not be possible for you; note that it's a Windows Forms control (System.Windows.Forms).

tvanfosson
WebBrowser downloads all the pictures. I don't need the pictures; I only need the HTML text.
ebattulga
+1  A: 

To do so, you would need a JavaScript runtime hosted inside a full-blown web browser. Unfortunately, WebClient isn't capable of doing this.

Your only option would be automation of a WebBrowser control. You would need to send it to the URL, wait until both the main page and any AJAX content have been loaded (including triggering that load if user action is required to do so), then scrape the entire DOM.
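A minimal sketch of that approach, assuming a WinForms application. The class name, the 3-second settle delay and the PageReady event are illustrative choices, not part of any API; the control gives no general "AJAX finished" signal, so a fixed delay after DocumentCompleted is only a crude stand-in for a site-specific check.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical helper form: loads a page, waits for scripts to run,
// then reads the HTML of the live DOM.
public class RenderedPageForm : Form
{
    private readonly WebBrowser browser = new WebBrowser();
    private readonly System.Windows.Forms.Timer settleTimer = new System.Windows.Forms.Timer();

    // Raised once with the HTML of the DOM as it stands after the delay.
    public event Action<string> PageReady;

    public RenderedPageForm(string url)
    {
        browser.Dock = DockStyle.Fill;
        browser.ScriptErrorsSuppressed = true;   // no script error dialogs
        browser.DocumentCompleted += OnDocumentCompleted;
        Controls.Add(browser);

        settleTimer.Interval = 3000;             // assumption: AJAX finishes within 3 s
        settleTimer.Tick += OnSettled;

        browser.Navigate(url);
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // DocumentCompleted also fires for frames; react to the top-level page only.
        if (!e.Url.Equals(browser.Url))
            return;

        // Restart the countdown that gives the page's scripts time to run.
        settleTimer.Stop();
        settleTimer.Start();
    }

    private void OnSettled(object sender, EventArgs e)
    {
        settleTimer.Stop();

        // Read the <html> element of the live DOM rather than the downloaded source.
        HtmlElementCollection roots = browser.Document.GetElementsByTagName("html");
        string html = roots.Count > 0 ? roots[0].OuterHtml : string.Empty;

        if (PageReady != null)
            PageReady(html);
    }
}
```

The form has to live on an STA thread with a message loop (for example via Application.Run), because the WebBrowser control wraps an ActiveX component. A more robust version would poll the DOM for the specific element the AJAX call fills in, such as the "boxnews" div in the question, instead of waiting a fixed time.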

If you are only scraping a particular site, you are probably better off just pulling the AJAX URL yourself (simulating all required parameters), rather than pulling the web page that calls for it.
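As a rough sketch of that shortcut: the code below assumes that ajax_function simply issues a GET for its second argument, resolved relative to the page URL, and that the server returns an HTML fragment. Neither is confirmed by the question, so check the real request in the browser's network tools first. The class and method names are made up for illustration.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper: fetches the AJAX fragment directly instead of the page.
public static class AjaxFragmentDownloader
{
    // Resolves the relative AJAX path against the URL of the page that calls it.
    public static Uri ResolveAjaxUrl(string pageUrl, string relativeAjaxPath)
    {
        return new Uri(new Uri(pageUrl), relativeAjaxPath);
    }

    // Downloads the fragment the page would otherwise load via JavaScript.
    public static string DownloadFragment(string pageUrl, string relativeAjaxPath)
    {
        using (WebClient client = new WebClient())
        {
            client.Encoding = Encoding.UTF8;
            // Assumption: some servers only answer AJAX endpoints when this
            // header is present; it is harmless if the server ignores it.
            client.Headers["X-Requested-With"] = "XMLHttpRequest";
            return client.DownloadString(ResolveAjaxUrl(pageUrl, relativeAjaxPath));
        }
    }
}
```

For the markup in the question, the call would look something like AjaxFragmentDownloader.DownloadFragment(eUrl.Text, "ajax/category/personal_notes/"). If the site expects cookies, POST data or extra query parameters, those have to be reproduced as well.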

richardtallent
How do I set up the WebBrowser control so that it doesn't download pictures?
ebattulga
@ebattulga - that's really a different question. It was already asked here: http://stackoverflow.com/questions/1260615/disable-image-loading-on-webbrowser-control-c-net-2-0 - but there was no answer. It's a good question, though.
Jon B
A: 

When you visit a page in a browser, it

1. downloads a document from the requested URL

2. downloads anything referenced by an img, link, script, etc. tag (anything that references an external file)

3. executes JavaScript where applicable.

The WebClient class only performs step 1. It encapsulates a single HTTP request and response. It does not contain a script engine, and does not, as far as I know, find image tags, etc. that reference other files and initiate further requests to obtain those files.

If you want to get a page once it's been modified by an AJAX call and handler, you'll need to use a class that has the full capabilities of a web browser, which pretty much means using a web browser that you can somehow automate server-side. The WebBrowser control does this, but it's for WinForms only, I think. I shudder to think of the security issues here, or the demand that would be placed on the server if multiple users are taking advantage of this facility simultaneously.

A better question to ask yourself is: why are you doing this? If the data you're really interested in is being obtained via AJAX (probably through a web service), why not skip the page download with WebClient and go straight to the source?

David Lively