I often go to a site to look stuff up. I thought to myself: "Hold on. I can program. Why am I going to this site manually when I can write a piece of software that does it for me?"

And so I started. I'm using C#, so I found WebClient and Uri.

I've managed to download the page source, but the specific data I'm looking for is loaded via AJAX after the page itself has loaded.

So that's my problem. How can I get that data if it has to be requested via an AJAX call first?

+1  A: 

The general approach is this:

  1. Using a tool like Fiddler, find out which HTTP requests the browser makes in order to fetch the data you're looking for.
  2. Use WebClient to issue the same HTTP request(s) yourself.
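
As a rough illustration of step 2, here is a minimal sketch that replays a form POST with WebClient. The class name `AjaxFetch`, the method name `PostForm`, and the form-encoded content type are assumptions for illustration; use whatever URL, verb, and body Fiddler actually shows for your request.

```csharp
using System;
using System.Net;

static class AjaxFetch
{
    // Replays a form POST and returns the raw response body as a string.
    // The content type below is an assumption; copy the real one from Fiddler.
    public static string PostForm(string url, string formBody)
    {
        using (var client = new WebClient())
        {
            client.Headers[HttpRequestHeader.ContentType] =
                "application/x-www-form-urlencoded";

            // UploadString sends the data with the POST verb by default.
            return client.UploadString(url, formBody);
        }
    }
}
```

A call might look like `string data = AjaxFetch.PostForm("http://example.com/ajax/data", "id=42");`, where both the URL and the body are placeholders. If the request Fiddler shows is a GET, `client.DownloadString(url)` would be the equivalent call.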

Take a look at my answer to this question for more details about HTML screen scraping and how to work around various issues you may run across.

For #1 above, here's how to use Fiddler to understand how a specific request is being made:

First, find the request you care about (the one whose response contains the data you want). You can do this by double-clicking each request in Fiddler's left pane and looking inside the "TextView" tab in the lower-right pane. You can also use CTRL+F to search for content across multiple requests, but some responses are compressed, so make sure the "AutoDecode" button in the toolbar is selected before making your requests if you want to be able to text-search across all of them.

Once you've found the request you want, double-click it in Fiddler and select the "Headers" tab in the upper-right pane. Those are the headers being sent. If your client sends exactly these headers to the server, you should get back the same data. Usually not all of the headers are needed, though, so you'll want to figure out which ones are. You do this using Fiddler's Request Builder tab in the upper-right pane. Select that tab and drag your data request over from the left pane onto the Request Builder. Then submit the request to validate that it returns the correct results. Then start deleting headers, one at a time, re-submitting the request after each deletion. If the request stops working, you know that header was required, so put it back. Repeat for each header until you know which ones are required.
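
If you'd rather not do the elimination by hand, the same one-header-at-a-time idea can be sketched in code. This is only an illustration of the procedure, not something Fiddler provides: `HeaderMinimizer`, `FindRequired`, and the `works` callback are hypothetical names, and `works` is assumed to send the request with the given headers and report whether the response still contains the data you expect.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class HeaderMinimizer
{
    // Returns the names of headers that cannot be dropped without breaking the request.
    // 'works' is a hypothetical callback: it should send the request with the
    // supplied headers and return true if the response is still correct.
    public static List<string> FindRequired(
        IDictionary<string, string> headers,
        Func<IDictionary<string, string>, bool> works)
    {
        var current = new Dictionary<string, string>(headers);
        var required = new List<string>();

        foreach (var name in headers.Keys.ToList())
        {
            // Try the request with this one header removed.
            var trial = new Dictionary<string, string>(current);
            trial.Remove(name);

            if (works(trial))
                current = trial;      // still works: the header was not needed
            else
                required.Add(name);   // request broke: keep this header
        }

        return required;
    }
}
```

Each call to `works` costs one real HTTP request, so for a handful of headers the manual approach in the Request Builder is usually just as quick.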

Then you'll need to write code to generate the right headers. Don't worry about the Host: header; it's generated automatically for you. For the Cookie: header, you'll need to generate it using the CookieContainer class. The other headers (e.g. User-Agent:, Accept:, etc.) you can generally copy and add to your request as-is.
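
Here's a minimal sketch of what that code might look like. It uses HttpWebRequest rather than WebClient, because HttpWebRequest exposes a CookieContainer property directly. The class name `AjaxRequest`, the header values, and the X-Requested-With header are all placeholders and assumptions; substitute whatever Fiddler showed as required for your request.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class AjaxRequest
{
    // Builds (but does not send) a POST request carrying the required headers.
    public static HttpWebRequest Build(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.CookieContainer = cookies;   // supplies the Cookie: header

        // User-Agent, Accept and Content-Type are set through properties
        // on HttpWebRequest. The values are placeholders; copy yours from Fiddler.
        request.UserAgent = "Mozilla/5.0";
        request.Accept = "application/json, text/javascript, */*";
        request.ContentType = "application/x-www-form-urlencoded";

        // Assumption: include this only if Fiddler shows the site needs it.
        request.Headers.Add("X-Requested-With", "XMLHttpRequest");
        return request;
    }

    // Writes the form body, sends the request, and returns the response text.
    public static string Send(HttpWebRequest request, string formBody)
    {
        byte[] payload = Encoding.UTF8.GetBytes(formBody);
        request.ContentLength = payload.Length;

        using (Stream body = request.GetRequestStream())
            body.Write(payload, 0, payload.Length);

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }
}
```

Usage would be along the lines of `AjaxRequest.Send(AjaxRequest.Build(url, cookies), "id=42")`, with the URL and body taken from the request you found in Fiddler.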

Justin Grant
I don't really understand. Fiddler shows 1 request only, that is the POST request sent. Nothing happens beyond that. I don't see how the data is requested, and no headers are appearing? What should it look like?
WebDevHobo
I'm assuming the POST you're talking about actually contains the data you're trying to fetch programmatically. If so, I expanded my answer to include more details about how to use fiddler to find the right request, to understand which headers are being sent, and to find which headers are required. Is this the info you were looking for?
Justin Grant
Okay, I got what I needed. The only thing that's still a mystery is the CookieContainer thing. I looked it up, but most tutorials were about ASP.NET. I'm using a small C# app for this, not ASP.NET.
WebDevHobo
CookieContainer is a simple class that stores the cookies which come in from one request and then sends those cookies out in subsequent requests, just like a browser does. CookieContainer works the same way regardless of whether your client is an ASP.NET app, a C# console app, a WinForms app, or any other client app type.
Justin Grant
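
To make the CookieContainer behaviour concrete, here is a small sketch that runs without any network access. The cookie name and value, the URLs, and the `CookieDemo` helper are hypothetical; in a real client you would assign the same CookieContainer instance to each HttpWebRequest, and the server's Set-Cookie responses would fill it for you.

```csharp
using System;
using System.Net;

static class CookieDemo
{
    // Returns the Cookie: header value the container would send for this URL.
    public static string HeaderFor(CookieContainer jar, string url)
    {
        return jar.GetCookieHeader(new Uri(url));
    }

    // Simulates a cookie that an earlier response from the site would have set.
    public static CookieContainer JarWithSession(string siteUrl, string value)
    {
        var jar = new CookieContainer();
        jar.Add(new Uri(siteUrl), new Cookie("session", value, "/"));
        return jar;
    }
}
```

For example, `CookieDemo.HeaderFor(CookieDemo.JarWithSession("http://example.com/", "abc123"), "http://example.com/ajax/data")` gives the header value that would accompany a later request to the same site. None of this depends on ASP.NET; the class lives in System.Net and behaves the same in a console or WinForms app.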