So here's the deal. I'm creating a spider bot for a website that scans all the product pages and records the product data. I'm using C# and the WebClient class to download the HTML string. The site I'm crawling must be specially made, because the HTML returned by WebClient.DownloadString() is different from the HTML I see when I view the page source in a browser. This seems intentional, because the only info I can't get is the price.

Does anyone know a workaround for this problem or can anyone explain what is happening? Thanks.

+1  A: 

It is probably using the user agent string to decide what content to send. The example here shows how to set the user agent header.
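
A minimal sketch of setting the user agent header on a WebClient before downloading, in case it helps. The helper names `CreateClient` and `UserAgentExample`, the URL, and the user agent value are placeholders for illustration, not anything taken from the site in question.

```csharp
using System;
using System.Net;

public static class UserAgentExample
{
    // Hypothetical helper: builds a WebClient that sends the given User-Agent header.
    public static WebClient CreateClient(string userAgent)
    {
        var client = new WebClient();
        client.Headers[HttpRequestHeader.UserAgent] = userAgent;
        return client;
    }

    public static void Main()
    {
        // Placeholder values; substitute the real product URL and a
        // user agent string copied from your own browser.
        string url = "http://example.com/product?productId=3519";
        string userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

        using (WebClient client = CreateClient(userAgent))
        {
            string html = client.DownloadString(url);
            Console.WriteLine(html.Length);
        }
    }
}
```

If the HTML that comes back still differs from the browser's, the user agent is probably not what the site is keying on.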

Ben Robinson
OP here. I was able to find out exactly why. Apparently the website uses an AJAX function to get sensitive data. When I do a screen scrape with WebClient.DownloadString(), I get the HTML document, but instead of the sensitive info, I get a segment of AJAX where it should be. Does this help? I will post the AJAX code that is included in the HTML.
Ryan Fuentes
<!-- AJAX Product Details Panel Begins --> <div id="product_details" style="position:relative"></div> <script language="javascript"> var rQsp = 'productId=3519 showProductDetails(rQsp); </script>
Ryan Fuentes
The AJAX call is being made by the function showProductDetails. You need to look at the source code of that function to find out how to scrape the data.
Ben Robinson
That function is not present in the HTML document.
Ryan Fuentes
It must be within a referenced JavaScript file. Look for <script> tags and pull down the .js file referenced by the src attribute. It will be in one of them.
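
A rough sketch of that search, assuming the scripts are referenced with ordinary quoted src attributes. `ScriptFinder`, `FindScriptUrls`, and the page URL are made-up names for illustration, and the regular expression is a simplification rather than a full HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class ScriptFinder
{
    // Hypothetical helper: collects the src of every <script src="..."> tag,
    // resolved against the page URL so relative paths become absolute.
    public static List<Uri> FindScriptUrls(string html, Uri pageUrl)
    {
        var urls = new List<Uri>();
        var pattern = new Regex(
            @"<script[^>]*\ssrc\s*=\s*[""']([^""']+)[""']",
            RegexOptions.IgnoreCase);
        foreach (Match match in pattern.Matches(html))
        {
            urls.Add(new Uri(pageUrl, match.Groups[1].Value));
        }
        return urls;
    }

    public static void Main()
    {
        // Placeholder URL; substitute the real product page.
        var pageUrl = new Uri("http://example.com/product?productId=3519");
        using (var client = new WebClient())
        {
            string html = client.DownloadString(pageUrl);
            foreach (Uri scriptUrl in FindScriptUrls(html, pageUrl))
            {
                // Download each referenced script and report the ones
                // that mention the function name.
                string js = client.DownloadString(scriptUrl);
                if (js.Contains("showProductDetails"))
                {
                    Console.WriteLine("Found in: " + scriptUrl);
                }
            }
        }
    }
}
```

Once the file is found, the body of showProductDetails should show which URL the AJAX request goes to, and that URL can then be requested directly with WebClient.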
Ben Robinson