views:

1508

answers:

6

I have a project at work the requires me to be able to enter information into a web page, read the next page I get redirected to and then take further action. A simplified real-world example would be something like going to google.com, entering "Coding tricks" as search criteria, and reading the resulting page.

Small coding examples like the ones linked to at http://www.csharp-station.com/HowTo/HttpWebFetch.aspx tell how to read a web page, but not how to interact with it by submitting information into a form and continuing on to the next page.

For the record, I'm not building a malicious and/or spam related product.

So how do I go read web pages that require a few steps of normal browsing to reach first?

+1  A: 

What you need to do is keep retrieving and analyzing the html source for each page in the chain. For each page, you need to figure out what the form submission will look like and send a request that will match that to get the next page in the chain.

What I do is build a custom class the wraps System.Net.HttpWebRequest/HttpWebResponse, so retrieving pages is as simple as using System.Net.WebClient. However, my custom class also keeps the same cookie container across requests and makes it a little easier to send post data, customize the user agent, etc.

Joel Coehoorn
A: 

Depending on how the website works you can either manipulate the url to perform what you want. e.g to search for the word "beatles" you could just open a request to google.com?q=beetles and then just read the results.

Alternatively if the website does not use querystring values (url) to process page actions then you will need to work on a webrequest which posts the required values to the website instead. Search in Google for working with WebRequest and webresponse.

Kaius
A: 

In your Google example, you shouldn't enter anything into the search criteria but instead go directly to the page where the search button takes you.

For your example: http://www.google.com/search?hl=en&q=coding%20tricks

Austin Salonen
+1  A: 

You can programmatically create an Http request and retrieve the response:

 string uri = "http://www.google.com/search";
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "POST";
        request.ContentType = "application/x-www-from-urlencoded";

        // encode the data to POST:
        string postData = "q=searchterm&hl=en";
        byte[] encodedData = new ASCIIEncoding().GetBytes(postData);
        request.ContentLength = encodedData.Length;

        Stream requestStream = request.GetRequestStream();
        requestStream.Write(encodedData, 0, encodedData.Length);

        // send the request and get the response
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {

            // Do something with the response stream. As an example, we'll
            // stream the response to the console via a 256 character buffer
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                Char[] buffer = new Char[256];
                int count = reader.Read(buffer, 0, 256);
                while (count > 0)
                {
                    Console.WriteLine(new String(buffer, 0, count));
                    count = reader.Read(buffer, 0, 256);
                }
            } // reader is disposed here
        } // response is disposed here

Of course, this code will return an error since Google uses GET, not POST, for search queries.

This method will work if you are dealing with specific web pages, as the URLs and POST data are all basically hard-coded. If you needed something that was a little more dynamic, you'd have to:

  1. Capture the page
  2. Strip out the form
  3. Create a POST string based on the form fields

FWIW, I think something like Perl or Python might be better suited to that sort of task.

Chris Lawlor
A: 

I've had very good luck with this product:

IMacros

http://www.iopus.com/

I have an app that's been running for many months, maybe over a year using their product.

The top-level product has a GUI that you can use to record and edit macros, and C# libraries that you can call from .Net code.

IMHO, this is one of those programming areas that seems simple when you start ("I'll just GET the HTML for the page, process the string, then GET the next page...") but in practice it turns out to be a real PITA.

+2  A: 

You might try Selenium. Record the actions in Firefox using Selenium IDE, save the script in C# format, then play them back using the Selenium RC C# wrapper. As others have mentioned you could also use System.Net.HttpWebRequest or System.Net.WebClient. If this is a desktop application see also System.Windows.Forms.WebBrowser.

Addendum: Similar to Selenium IDE and Selenium RC, which are Java-based, WatiN Test Recorder and WatiN are .NET-based.

Axl