views:

599

answers:

2

Hello,

I'm using a C# WebClient to post login details to a page and read the all the results.

The page I am trying to load includes flash (which, in the browser, translates into HTML). I'm guessing it's flash to avoid being picked up by search engines???

The flash I am interested in is just text (not an image/video) etc and when I "View Selection Source" in firefox I do actually see the text, within HTML, that I want to see.

(Interestingly when I view the source for the whole page I do not see the text, within HTML, that I want to see. Could this be related?)

Currently after I have posted my login details, and loaded the HTML back, I see the page which does NOT show the flash HTML (as if I had viewed source for the whole page).

Thanks in advance,

Jim

PS: I should point out that the POST is actually working, my log in is successful.

+5  A: 

Fiddler (or similar tool) is invaluable to track down screen-scraping problems like this. Using a normal browser and with fiddler active, look at all the requests being made as you go through the login and navigation process to get to the data you want. In between, you will likley see one or more things that your code is doing differently which the server is responding to and hence showing you different HTML than a real client.

The list of stuff below (think of it as "scraping 101") is what you want to look for. Most of the stuff below is probably stuff you're already doing, but I included everything for completeness.

In order to scrape effectively, you may need to deal with one or more of the following:

  1. cookies and/or hidden fields. when you show up at any page on a site, you'll typically get a session cookie and/or hidden form field which (in a normal browser) would be propagated back to the server on all subsequent requests. You will likley also get a persistent cookie. On many sites, if a requests shows up without a proper cookie (or form field for sites using "cookieless sessions"), the site will redirect the user to a "no cookies" UI, a login page, or another undesirable location (from the scraper app's perspective). always make sure you capture the cookies set on the initial request and faithfully send them back to the server on subsequent requests, except if one of those subsequent requests changes a cookie (in which case propagate that new cookie instead).
  2. authentication tokens a special case of above is forms-authentication cookies or hidden fields. make sure you're capturing the login token (usually a cookie) and sending it back.
  3. POST vs. GET this is obvious, but make sure you're using the same HTTP method that a real browser does.
  4. form fields (esp. hidden ones!) i'm sure you're doing this already, but make sure to send all form fields that a real browser does, not just the visible fields. make sure fields are HTML-encoded properly.
  5. HTTP headers. you already checked this, but it may make sense to check again just to make sure the (non-cookie) headers are identical. I always start with the exact same headers and then start pulling out headers one by one, and only keep the ones that cause the request to fail or return bogus data. this approach simplifies your scraping code.
  6. redirects. These can either come from the server, or from client script (e.g. "if user doesn't have flash plugin loaded, redirect to a non-flash page"). See http://stackoverflow.com/questions/1455567/webrequest-how-to-find-a-postal-code-using-a-webrequest-against-this-contenttype for a crazy example of how redirection can trip up a screen-scraper. Note that if you're using .NET for scraping, you'll need to use HttpWebRequest (not WebClient) for redirect-dependent scraping, because by default WebClient doesn't provide a way for your code to attach cookies and headers to the second (post-redirect) request. See the thread above for more details.
  7. sub-requests (frames, ajax, flash, etc.) - often, page elements (not the main HTTP requests) will end up fetching the data you want to scrape. you'll be able to figure this out by looking which HTTP response contains the text you want, and then working backwards until you find what on the page is actually making the request for that content. A few sites do really crazy things in sub-requests, like requesting compressed or encrypted text via ajax, and then using client-side script to decrypt it. if this is the case, you'll need to do a bit more work like reverse-engineering what the client script is doing.
  8. ordering - this one is obvious: make HTTP requests in the same order that a browser client does. that doesn't mean you need to make every request (e.g. images). Typically you only need to make the requests which return text/html content type, unless the data you want is not in the HTML and is in an ajax/flash/etc. request.
Justin Grant
+1, awesome answer
orip
A: 

(Interestingly when I view the source for the whole page I do not see the text, within HTML, that I want to see. Could this be related?)

This usually means that the discrepancy is caused by some DOM manipulations via javascript after the page has loaded. Try turning off javascript and see what it looks like.

recursive