I am trying to screen-scrape a page of a web app that contains only text and is hosted by a third party. It's not a properly formed HTML page, but the text that is displayed tells us whether the web app is up or down.

When I try to scrape the screen, the WebRequest call fails with the error "The remote server returned an error: (500) Internal Server Error."

public void ScrapeScreen()
{
    try
    {
        var url = textBox1.Text;
        var request = WebRequest.Create(url);

        // Dispose the response, stream, and reader even if reading fails.
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            richTextBox1.Text = reader.ReadToEnd();
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

Any ideas how I can get the text from the page?

A: 

First, try this:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

However, if you're just looking for text and don't need to POST any data to the server, you may want to look at the WebClient class. It more closely resembles a real browser, and takes care of a lot of the HTTP header handling that you may otherwise have to tweak if you stick with the HttpWebRequest class.
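As a rough sketch of the WebClient route, something like the following should be enough for a plain-text page. The class and method names (`StatusPage`, `Download`) are made up for illustration, and the sketch assumes the URL is an ordinary http or https address.

```csharp
using System;
using System.Net;

static class StatusPage
{
    // Hypothetical helper: downloads the page body as a string.
    public static string Download(string url)
    {
        // WebClient is disposable, so wrap it in a using block.
        using (var client = new WebClient())
        {
            // DownloadString issues a GET and decodes the response body.
            return client.DownloadString(url);
        }
    }
}
```

Note that DownloadString still throws a WebException if the server answers with a 500, so the existing try/catch remains useful.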

Adam Barney
+1  A: 

Some sites don't like the default UserAgent. Consider changing it to something real, like:

((HttpWebRequest)request).UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4";
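Putting that together, here is a minimal sketch of the whole request with a user agent set. It also goes one step beyond what is suggested here: if the server still replies with an error status, it reads the body from the WebException, since the status text may be in that response. The names `StatusScraper`, `Fetch`, and `ReadBody` and the user agent string are placeholders, and the cast assumes an http or https URL.

```csharp
using System;
using System.IO;
using System.Net;

static class StatusScraper
{
    // Hypothetical helper: returns the page text, including the body of an error reply.
    public static string Fetch(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);

        // Placeholder value; a real browser string can be used instead.
        request.UserAgent = "Mozilla/5.0 (compatible; StatusChecker/1.0)";

        try
        {
            using (var response = request.GetResponse())
            {
                return ReadBody(response);
            }
        }
        catch (WebException ex)
        {
            // No response at all (DNS failure, timeout, ...): nothing to read.
            if (ex.Response == null)
            {
                throw;
            }

            // For error statuses such as 500 the reply is exposed on the exception.
            using (var errorResponse = ex.Response)
            {
                return ReadBody(errorResponse);
            }
        }
    }

    // Reads the full response body as text.
    private static string ReadBody(WebResponse response)
    {
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In the original form this could be called as `richTextBox1.Text = StatusScraper.Fetch(textBox1.Text);`.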
Kirk Woll
+1 This has often been the case when I've tried screen-scraping before.
Noldorin
The default user agent is null, by the way; specifying almost anything will usually work.
Noldorin