views: 52

answers: 1

Hi, I want to read the content of a website and store it in a file using C# and ASP.NET. I know we can read it using HttpWebRequest. But is it also possible to read the data behind all of the available links?

Ex: suppose I want to read http://www.msn.com. I can give the URL directly and read the home page data; that is no issue. But the msn.com home page contains many links, and I want to read the content of those pages as well. Is it possible?
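
A minimal sketch of that single-page read with HttpWebRequest (the URL and output path below are just placeholders):

    using System;
    using System.IO;
    using System.Net;

    class PageReader
    {
        static void Main()
        {
            // Placeholder URL and output path -- substitute your own.
            string url = "http://www.msn.com";
            string outputPath = "page.html";

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                // Read the whole response body and save it to disk.
                File.WriteAllText(outputPath, reader.ReadToEnd());
            }
            Console.WriteLine("Saved " + url + " to " + outputPath);
        }
    }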

Can somebody give me a starting point for this?

Thanks in advance

+1  A: 
  1. Define a queue of URLs.

  2. Add the main page URL to the queue.

  3. While the queue is not empty:

     3.1 currentUrl = Dequeue()

     3.2 Read the current URL.

     3.3 Extract all URLs from the current page using a regexp.

     3.4 Add all the extracted URLs to the queue.

You will have to limit the URLs in the queue to some depth or to some domain, otherwise you will try to download the entire internet :) A minimal sketch of this loop is below.
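
A C# sketch of the loop above, assuming WebClient for the fetch and a crude href regex for step 3.3; the start URL and page limit are placeholder choices, and a visited set is folded in so no page is read twice:

    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Text.RegularExpressions;

    class SimpleCrawler
    {
        static void Main()
        {
            // Placeholder start URL and crawl limit -- adjust as needed.
            string startUrl = "http://www.msn.com";
            int maxPages = 50;

            var queue = new Queue<string>();      // 1. define queue of URLs
            var visited = new HashSet<string>();  // pages already read
            queue.Enqueue(startUrl);              // 2. add main page URL

            while (queue.Count > 0 && visited.Count < maxPages) // 3.
            {
                string currentUrl = queue.Dequeue();            // 3.1
                if (!visited.Add(currentUrl))
                    continue;                                   // skip URLs seen before

                string html;
                try
                {
                    using (var client = new WebClient())
                        html = client.DownloadString(currentUrl); // 3.2 read current URL
                }
                catch (Exception)
                {
                    continue; // skip pages that fail to download or parse
                }

                // 3.3 extract absolute links with a (crude) regex
                foreach (Match m in Regex.Matches(html,
                    @"href\s*=\s*[""'](http[^""']+)[""']", RegexOptions.IgnoreCase))
                {
                    queue.Enqueue(m.Groups[1].Value);           // 3.4 add to queue
                }

                Console.WriteLine("Read: " + currentUrl);
            }
        }
    }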

Alex Reitbort
Hi, thank you for your quick reply. Can you give me a simple example of extracting URLs from the current page?
Nagu
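
A rough sketch of such a helper to drop into a class, assuming a simple href regex (a real HTML parser copes with markup that this pattern misses):

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    // Crude link extractor -- yields the href values found in raw HTML.
    static IEnumerable<string> ExtractUrls(string html)
    {
        foreach (Match m in Regex.Matches(html,
            @"href\s*=\s*[""']([^""'#]+)[""']", RegexOptions.IgnoreCase))
        {
            yield return m.Groups[1].Value;
        }
    }
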
I would go with "Beautiful Soup" for scraping the page.
jldupont
3.35 For each URL found, ignore it if it's one you've already visited. Otherwise you can find yourself in a loop.
Paul Mitchell
Oh, Beautiful Soup is developed entirely in Python, I think. I want to develop the same tool in C#. Anyhow, thank you for your suggestions.
Nagu