views: 320
answers: 4

I have researched spidering and think it is a little too complex for the fairly simple app I am trying to make. Some of the data on a web page is not visible in the source, because it is only rendered by the browser.

If I wanted to get a value from a specific web page that I display in a WebBrowser control, is there any method to read values from the contents of that control?

If not, does anyone have any suggestions on how they might approach this?

A: 

Check out this example: http://www.example-code.com/csharp/spider.asp (It was the first hit on Google).

I think writing such an application is quite useful for getting more familiar with C# (as it seems you want to write the application for learning purposes).

0xA3
+2  A: 

You’re not looking for spidering, you’re looking for screen scraping.

Bombe
A: 

Because the browser simply renders the underlying content, the most flexible approach would be to parse that underlying content (HTML/CSS/JS/whatever) yourself.

I would create a parsing engine that looks for the things your spider application needs.

This could be a basic string-searching algorithm that looks for href="", for example, and reads the values in order to produce new requests and continue spidering. Your engine could be written to look only for the things it is interested in and be extended that way for more functionality.
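As a rough sketch of that idea (the ExtractLinks helper and surrounding class are hypothetical, just for illustration, not any library API):

using System;
using System.Collections.Generic;

class LinkExtractor
{
    // Scan raw HTML for href="..." attributes with basic string
    // searching, as described above, and collect the URL values.
    static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        const string marker = "href=\"";
        int pos = 0;
        while ((pos = html.IndexOf(marker, pos, StringComparison.OrdinalIgnoreCase)) != -1)
        {
            int start = pos + marker.Length;
            int end = html.IndexOf('"', start);
            if (end == -1) break;                      // unterminated attribute; stop scanning
            links.Add(html.Substring(start, end - start));
            pos = end + 1;                             // continue after the closing quote
        }
        return links;
    }
}

Each extracted link could then be fed back in as a new request, which is the "continue spidering" part.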

Martin
+2  A: 

I'd have to agree with Bombe; it sounds more like you want HTML screen scraping. It requires a lot of parsing, and if the page you're scraping ever changes, your app will break. However, here's a small example of how to do it:

using System.Net;
using System.Text;

// Download the page's raw bytes, then decode them as UTF-8.
WebClient webClient = new WebClient();
const string strUrl = "http://www.yahoo.com/";
byte[] reqHTML = webClient.DownloadData(strUrl);
string html = Encoding.UTF8.GetString(reqHTML);

Now the html variable has the entire HTML in it, and you can start parsing away.
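For instance, a minimal sketch of one way to "parse away" (a regular expression is assumed here purely for illustration; like any screen scraping, it will break on unusual markup):

using System;
using System.Text.RegularExpressions;

// Pull the page title out of the downloaded HTML.
Match m = Regex.Match(html, @"<title>\s*(.*?)\s*</title>",
                      RegexOptions.IgnoreCase | RegexOptions.Singleline);
if (m.Success)
{
    Console.WriteLine(m.Groups[1].Value);
}

The same pattern applies to whatever value you actually need off the page.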

BFree