I am trying to read the HTML of a page that contains a non-delayed redirect. The following snippet (C#) will give me the destination/redirected page, not the initial one I need to see:

using System.Net;
using System.Text;

public class SomeClass {
    public static void Main() {
        byte[] data = new WebClient().DownloadData("http://SomeUrl.com");
        System.Console.WriteLine(Encoding.ASCII.GetString(data));
    }
}

Is there a way to get the HTML of a redirecting page? (I prefer .NET but a snippet in Java or Python would be fine too. Thx!)

+5  A: 

Unless the redirect is done on the client side, you can't. If the redirect is done server-side, no HTML is actually generated for the client; the response header simply directs the client to the new URL.

Joel Etherton
Interesting. I guess I've only seen client-side-script based redirects before, didn't know about the server kind. (Web dev is not my forte. ;-) +1, thanks
Paul Sasik
He wants to get the source of the page that's doing the redirect, not the one being redirected to.
CyberDude
Some lousy programmers make web pages that send a redirect header but forget to stop execution, so although the browser or HTTP client will follow the redirect, content bytes will still go over the wire. (Usually a web server would output some HTML with a link to the target page, for old clients and the like.)
aularon
@CyberDude, the whole point of the answer is that no HTML source is sent to the client for the page *doing* a server-side redirect. The server sends a 302 response code to the browser with a new URL to request, and the browser then requests the new URL.
Anthony Pegram
@CyberDude - he asked for the html of the page doing the redirect. A page doing a redirect produces no html to the client. This is what I said in my answer.
Joel Etherton
@Joel it _does_ produce/send html content, but that content is not important and can be ignored.
aularon
A: 

Simplest answer would be to add the current page onto the QueryString component of the redirect when redirecting, for instance:

Response.Redirect(newPage + "?FromPage=" + Server.UrlEncode(Request.Url.ToString()));

Then the new page could see where you came from by simply looking at Request.QueryString["FromPage"].

KeithS
The question is about possibly seeing the HTML of the redirecting page, not the URL.
Anthony Pegram
+1  A: 

It would take more work, but rather than using WebClient, use HttpWebRequest and set the AllowAutoRedirect property to false. The redirect is then not followed: the 3xx response itself is returned to you, and you can read any response text (some pages do send response text along with the redirect) from that response object. Should the call throw a WebException instead, the same response is available from the exception's Response property. After that, you can issue another HttpWebRequest for the redirect URL (specified in the Location response header).
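
Since the question also welcomes Python, here is a minimal sketch of the same idea in that language, not the .NET code itself. It relies on the fact that the standard library's http.client never follows redirects, so the redirecting response is returned as-is. The function name fetch_without_redirect and the 10-second timeout are assumptions made for illustration.

```python
import http.client
from urllib.parse import urlsplit


def fetch_without_redirect(url):
    """Fetch url once and return (status, Location header or None, body bytes).

    http.client does not follow redirects, so a 301/302 response is
    returned as-is, including any body the server sent along with it.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=10)
    try:
        # Rebuild the request target (path plus optional query string).
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location"), resp.read()
    finally:
        conn.close()
```

If the returned status is in the 300 range, the body (possibly empty) is whatever the redirecting page sent, and the Location value is the URL to request next if you also want the destination page.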

You might be able to do something similar with WebClient if you create a derived class, MyWebClient, where you override the GetWebRequest method and set the AllowAutoRedirect property on the request it returns. I don't know what kind of exception, if any, the DownloadData method will throw if you do something like that.
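
A rough Python analogue of deriving from the high-level client, offered only as a sketch: urllib.request lets you subclass HTTPRedirectHandler and refuse every redirect. In that case urllib does raise an exception (HTTPError) for the 3xx response, and the redirecting page's body can be read from the exception object. The names NoRedirectHandler and download_without_redirect are hypothetical.

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that declines to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to build a follow-up request;
        # the 3xx response then surfaces as an HTTPError.
        return None


def download_without_redirect(url):
    """Return (status, Location header or None, body bytes) without following redirects."""
    opener = urllib.request.build_opener(NoRedirectHandler)
    try:
        with opener.open(url, timeout=10) as resp:
            # Not a redirect: an ordinary response.
            return resp.status, None, resp.read()
    except urllib.error.HTTPError as err:
        if 300 <= err.code < 400:
            # The exception wraps the redirect response, body included.
            return err.code, err.headers.get("Location"), err.read()
        raise
```

Errors other than redirects (404, 500 and so on) are re-raised unchanged, so the caller still sees them as exceptions.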

As somebody said previously, this will only work for redirects that the client is told to carry out, i.e. HTTP redirect responses (typically 301 or 302). If the redirection happens entirely on the server, without a redirect response being sent, you'd never know it.

Jim Mischel