I'm trying to download and parse the HTML of a web page. Recently, the source website moved from having all of its information on one page to hiding part of it behind javascript. There's a "Show All" check box that needs to be activated in order to view the whole page.

Here's the website: Source Website

Essentially, I'm looking to automate retrieving that page after the check box has been clicked. Currently, we have a C program that downloads the web page and handles our parsing. I'm not sure whether it can accept javascript in the URL, or whether that could even solve this problem (I've tried using a bookmarklet to call the javascript from the URL, but I wasn't able to get it to handle the check box). It can handle files, though, if it's easier to write a C# program that handles this.

I would prefer a way to code this myself rather than use a third party program to avoid having to install anything on the server this runs on. Any help is greatly appreciated.


Edit: Basically, how can I automate the call to the javascript that is linked to that "Show All" checkbox so I can grab the HTML page containing everything that's displayed after clicking the checkbox?


Edit 2: Here's the output from Fiddler2:

__EVENTTARGET ctl00$ContentPlaceHolder1$GenericWebUserControl$ShowAllCheckBox
__EVENTARGUMENT
__LASTFOCUS
__VIEWSTATE (REMOVED DUE TO LENGTH)
__EVENTVALIDATION (REMOVED DUE TO LENGTH)
ctl00$ContentPlaceHolder1$GenericWebUserControl$Organization0 ALL
ctl00$ContentPlaceHolder1$GenericWebUserControl$Initial or Amendment1 ALL
ctl00$ContentPlaceHolder1$GenericWebUserControl$Relief Requested2 ALL
ctl00$ContentPlaceHolder1$GenericWebUserControl$Country3 ALL
ctl00$ContentPlaceHolder1$GenericWebUserControl$Status4 ALL
ctl00$ContentPlaceHolder1$GenericWebUserControl$StartDate5  
ctl00$ContentPlaceHolder1$GenericWebUserControl$EndDate5    
ctl00$ContentPlaceHolder1$GenericWebUserControl$ShowAllCheckBox on

I'm currently getting 500 errors from the server. Do I need to include all of those GenericWebUserControl fields in the POST request as well? Also, do I need to include the __EVENTVALIDATION?


EDIT 3: Here's the latest code. I'm still getting server 500 errors.

private void CreateRequest()
{
    HttpWebRequest httpWebRequest;
    HttpWebResponse httpWebResponse;
    StreamWriter streamWriter;
    Stream webResponseStream;
    StreamReader streamReader;
    string postData;
    string outputHTML;

    postData = String.Format(
        "&__EVENTTARGET={0}" +
        "&__VIEWSTATE={1}" +
        "&__EVENTVALIDATION=(2)" +
        "&ctl00$ContentPlaceHolder1$GenericWebUserControl$ShowAllCheckBox=on" +
        "&ctl00$ContentPlaceHolder1$GenericWebUserControl$Organization0=ALL" +
        "&ctl00$ContentPlaceHolder1$GenericWebUserControl$Initial+or+Amendment1=ALL" +
        "&ctl00$ContentPlaceHolder1$GenericWebUserControl$Relief+Requested2=ALL" +
        "&ctl00$ContentPlaceHolder1$GenericWebUserControl$Country3=ALL" +
        "&ctl00$ContentPlaceHolder1$GenericWebUserControl$Status4=ALL",
        EVENTTARGET, VIEWSTATE, EVENTVALIDATION);

    httpWebRequest = (HttpWebRequest)WebRequest.Create("http://services.cftc.gov/sirt/sirt.aspx?Topic=ForeignPart30Exemptions");
    httpWebRequest.Method = "POST";
    httpWebRequest.ContentType = "application/x-www-form-urlencoded";
    httpWebRequest.ContentLength = postData.Length;

    streamWriter = new StreamWriter(httpWebRequest.GetRequestStream(), System.Text.Encoding.ASCII);
    streamWriter.Write(postData);
    streamWriter.Close();

    httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse();

    webResponseStream = httpWebResponse.GetResponseStream();
    streamReader = new StreamReader(webResponseStream);
    outputHTML = streamReader.ReadToEnd();

    Console.WriteLine(outputHTML);
}


EDIT 4: I've determined that it's the postData string that's causing the server 500 error. If I make it an empty string, it outputs the entire webpage. Does anyone know if I'm correct in having to put everything that came from Fiddler2 that had a value into the postData string? Also, that __VIEWSTATE is an incredibly long string. Are there length limits or anything else I should be aware of?


EDIT 5: I ran all of the strings used in postData through a URL encoder, but I'm still getting server 500 errors. Is there any way for me to debug why that post body is invalid?


SOLUTION: Ok, I couldn't get my postData string correct, but when I pasted in the raw POST body it worked. This looks like it will be good enough, but my concern is whether it will continue to work.

+3  A: 

That's an asp.net page. Clicking the checkbox causes the page to be posted back to the server. So rather than trying to simulate the javascript what you want to do instead is simulate the post request.

This is notoriously tricky with ASP.Net pages, because you usually need to populate the hidden __ViewState input. I recommend using a packet sniffer like Fiddler to view the actual request as it's sent. You should be able to copy the ViewState from there.
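
Here is a minimal sketch of building the body for such a simulated postback. It assumes .NET 4.5 or later for `WebUtility.UrlEncode`, and `FormBody` and `Build` are hypothetical names, not part of any library. The point is that every field name and value gets URL-encoded before being joined, because the control names contain `$` and a Base64 __VIEWSTATE can contain `+`, `/` and `=`. It may also be worth checking the `String.Format` call in the question: it contains `(2)` where a `{2}` placeholder was probably intended, so the literal text `(2)` is sent as the __EVENTVALIDATION value.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class FormBody
{
    // Joins name/value pairs into application/x-www-form-urlencoded text.
    // Names and values are both encoded; a null value is sent as empty.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var sb = new StringBuilder();
        foreach (var field in fields)
        {
            if (sb.Length > 0)
            {
                sb.Append('&'); // separator only between pairs, no leading '&'
            }
            sb.Append(WebUtility.UrlEncode(field.Key));
            sb.Append('=');
            sb.Append(WebUtility.UrlEncode(field.Value ?? string.Empty));
        }
        return sb.ToString();
    }
}
```

The fields would be the ones the sniffer shows (__EVENTTARGET, __VIEWSTATE, __EVENTVALIDATION and the control values), and the resulting string is what gets written to the request stream. Since the encoded body is pure ASCII, its `Length` matches the byte count used for `ContentLength`.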

Joel Coehoorn
Ok, so I ran the sniffer and got the ViewState input. I assume I can just run a C# HttpWebRequest now to simulate the post request. The only other question I have is: will that ViewState ever change?
Tony Trozzo
It might. ViewState might store things like user tokens, breadcrumbs, etc. You can put a lot of stuff in there. But the main thing you care about is that it's 'valid' in case EnableEventValidation is set to true (default) and that your checkbox input has the correct value.
Joel Coehoorn
Any idea why I'd be getting server 500 errors? I'm assuming something is wrong with the post mechanism.
Tony Trozzo
You'll have to play with it. Try setting a user agent for your request.
Joel Coehoorn
No luck with the user agent. If you check out my code up above, any idea if that postData string looks correct? I also posted the Fiddler2 output, and I just added each one that had a value into the postData string.
Tony Trozzo
Again: asp.net pages are notoriously tricky to scrape correctly. There's a certain amount of voodoo and trial and error involved in figuring out exactly what the server is expecting.
Joel Coehoorn
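
For the trial and error described in this thread, it can help to read the body of the 500 response instead of only seeing the exception. `HttpWebRequest.GetResponse()` throws a `WebException` for error status codes, and its `Response` property carries whatever the server sent back. The sketch below is one way to get at that text; `ErrorBody` and `Read` are hypothetical names, and whether the server includes any useful detail depends on its error-page configuration, which is an assumption here.

```csharp
using System;
using System.IO;
using System.Net;

public static class ErrorBody
{
    // Returns the response body attached to a WebException,
    // or null when the failure produced no response at all
    // (for example a DNS or connection error).
    public static string Read(WebException ex)
    {
        if (ex.Response == null)
        {
            return null;
        }
        using (Stream stream = ex.Response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In the question's code this would mean wrapping the `GetResponse()` call in `try { ... } catch (WebException ex) { Console.WriteLine(ErrorBody.Read(ex)); }`.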
+1  A: 

It looks like the JavaScript initiates a POST to the same page. Firebug shows the following in the POST data.

__EVENTTARGET: ctl00$ContentPlaceHolder1$GenericWebUserControl$ShowAllCheckBox

That's probably a good place to start looking.

Andrew Mason
So can I do this using a C# HttpWebRequest POST method? I'm not really that familiar with web programming. Will I need to use the packet sniffing approach to get the information I need?
Tony Trozzo
Firebug would work, too, but there's a lot of javascript to follow. You just need _something_ that will show you what data is ultimately posted to the server, so you can simulate the same request.
Joel Coehoorn
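
Once the posted fields are known, the values that can change between page loads, such as __VIEWSTATE and __EVENTVALIDATION, could be read from a fresh GET of the page rather than pasted in from the sniffer. The following is only a sketch: `HiddenFields` and `Extract` are hypothetical names, and the regular expression assumes the usual ASP.NET rendering where the `name` attribute comes before `value` and both use double quotes. A real HTML parser would be more robust.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HiddenFields
{
    // Returns the value attribute of the <input> with the given name,
    // or null if no such input is found in the HTML.
    public static string Extract(string html, string name)
    {
        string pattern =
            "<input[^>]*\\bname=\"" + Regex.Escape(name) + "\"[^>]*\\bvalue=\"([^\"]*)\"";
        Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        if (!match.Success)
        {
            return null;
        }
        // Attribute values are HTML-encoded in the page source.
        return WebUtility.HtmlDecode(match.Groups[1].Value);
    }
}
```

The HTML would come from a plain GET of the same URL (for example the empty-body request mentioned in the question), and the extracted values would then go into the POST body in place of the copied ones.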