views: 52

answers: 1

Hi, I want to read the content of a website and store it in a file using C# and ASP.NET. I know we can read it using HttpWebRequest. But is it also possible to read the data behind all of the available links?

Ex: suppose I want to read http://www.msn.com. I can give the URL directly and read the home page data; that is no issue. But the msn.com home page contains many links, and I want to read the content of those pages as well. Is it possible?
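
A minimal sketch of that single-page read with HttpWebRequest (the URL and output path below are just placeholders):

    using System;
    using System.IO;
    using System.Net;

    class PageReader
    {
        static void Main()
        {
            // Placeholder URL and output path -- substitute your own.
            string url = "http://www.msn.com";
            string outputPath = "page.html";

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                // Read the whole response body and save it to disk.
                File.WriteAllText(outputPath, reader.ReadToEnd());
            }
            Console.WriteLine("Saved " + url + " to " + outputPath);
        }
    }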

Can somebody give me a starting point for this?

Thanks in advance

+1  A: 
  1. Define a queue of URLs.

  2. Add the main page URL to the queue.

  3. While the queue is not empty:

     3.1 currentUrl = Dequeue()

     3.2 Read the current URL.

     3.3 Extract all URLs from the current page using a regexp.

     3.4 Add all the extracted URLs to the queue.

You will have to limit the URLs in the queue to some depth or to some domain, otherwise you will try to download the entire internet :) A minimal sketch of this loop is below.
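
A C# sketch of the loop above, assuming WebClient for the fetch and a crude href regex for step 3.3; the start URL and page limit are placeholder choices, and a visited set is folded in so no page is read twice:

    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Text.RegularExpressions;

    class SimpleCrawler
    {
        static void Main()
        {
            // Placeholder start URL and crawl limit -- adjust as needed.
            string startUrl = "http://www.msn.com";
            int maxPages = 50;

            var queue = new Queue<string>();      // 1. define queue of URLs
            var visited = new HashSet<string>();  // pages already read
            queue.Enqueue(startUrl);              // 2. add main page URL

            while (queue.Count > 0 && visited.Count < maxPages) // 3.
            {
                string currentUrl = queue.Dequeue();            // 3.1
                if (!visited.Add(currentUrl))
                    continue;                                   // skip URLs seen before

                string html;
                try
                {
                    using (var client = new WebClient())
                        html = client.DownloadString(currentUrl); // 3.2 read current URL
                }
                catch (Exception)
                {
                    continue; // skip pages that fail to download or parse
                }

                // 3.3 extract absolute links with a (crude) regex
                foreach (Match m in Regex.Matches(html,
                    @"href\s*=\s*[""'](http[^""']+)[""']", RegexOptions.IgnoreCase))
                {
                    queue.Enqueue(m.Groups[1].Value);           // 3.4 add to queue
                }

                Console.WriteLine("Read: " + currentUrl);
            }
        }
    }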

Alex Reitbort
Hi, thank you for your quick reply. Can you give me a simple example of extracting URLs from the current page?
Nagu
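
A rough sketch of such a helper to drop into a class, assuming a simple href regex (a real HTML parser copes with markup that this pattern misses):

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    // Crude link extractor -- yields the href values found in raw HTML.
    static IEnumerable<string> ExtractUrls(string html)
    {
        foreach (Match m in Regex.Matches(html,
            @"href\s*=\s*[""']([^""'#]+)[""']", RegexOptions.IgnoreCase))
        {
            yield return m.Groups[1].Value;
        }
    }
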
I would go with "Beautiful Soup" for scraping the page.
jldupont
3.35 For each URL found, ignore it if it's one you've already visited. Otherwise you can find yourself in a loop.
Paul Mitchell
Oh, Beautiful Soup is developed entirely in Python, I think. I want to develop the same tool in C#. Anyhow, thank you for your suggestions.
Nagu