Forgive my ignorance on the subject

I am using

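 // Build the page URL and read the local file name from the form.
 string p = "http://" + Textbox2.Text;
 string r = textBox3.Text;
 // Download the single resource at p and save it to the file r.
 System.Net.WebClient webclient = new System.Net.WebClient();
 webclient.DownloadFile(p, r);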

to download a single web page. Can you please help me enhance the code so that it downloads the entire website? I tried HTML screen scraping, but it only returns the href links found in the index.html file. How do I proceed?

Thanks

+4  A: 

Scraping a website is actually a lot of work, with a lot of corner cases.

Invoke wget instead. The manual explains how to use the "recursive retrieval" options.
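As a rough sketch of what calling wget from the C# application might look like: the class name `WgetMirror`, the method names, and the choice of options are assumptions for illustration, and the code assumes `wget` is installed and on the PATH. The flags shown (`--recursive`, `--level`, `--page-requisites`, `--convert-links`, `--no-parent`, `--directory-prefix`) are standard wget options; check the manual for the set that fits your site.

```csharp
using System;
using System.Diagnostics;

public static class WgetMirror
{
    // Builds the wget argument string for a recursive download.
    // The selection of options is an assumption; adjust it to your needs.
    public static string BuildArguments(string url, string targetDir, int depth)
    {
        return "--recursive --level=" + depth +
               " --page-requisites --convert-links --no-parent" +
               " --directory-prefix=\"" + targetDir + "\" \"" + url + "\"";
    }

    // Starts wget, waits for it to finish and returns its exit code (0 means success).
    public static int Run(string url, string targetDir, int depth)
    {
        ProcessStartInfo info = new ProcessStartInfo("wget", BuildArguments(url, targetDir, depth));
        info.UseShellExecute = false;
        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

A call such as `WgetMirror.Run("http://" + Textbox2.Text, textBox3.Text, 2)` would then fetch the start page and everything it links to, two levels deep, into the chosen folder.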

Will
+2  A: 
 // Requires: using System.IO; using System.Net;
 protected string GetWebString(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // The using blocks close the response and the reader even if reading throws.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            // Read the whole response body into one string.
            return reader.ReadToEnd();
        }
    }

This reads the web page at the supplied URL into a string.

You can then use a regular expression to parse the string.

This little piece pulls specific links out of Craigslist and adds them to an ArrayList. Modify it to suit your purpose.

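 // Requires: using System.Collections; using System.Text.RegularExpressions;
 // Note: the pages parameter is not used yet; only the first listing page is fetched.
 protected ArrayList GetListings(int pages)
    {
        ArrayList list = new ArrayList();
        string page = GetWebString("http://albany.craigslist.org/bik/");

        // LINK captures the relative .html path, TITLE the link text.
        MatchCollection listingMatches = Regex.Matches(page, "(<p><a href=\")(?<LINK>/.+/.+[.]html)(\">)(?<TITLE>.*)(-</a>)");
        foreach (Match m in listingMatches)
        {
            // Prefix the site root to turn the relative link into an absolute URL.
            list.Add("http://albany.craigslist.org" + m.Groups["LINK"].Value);
        }
        return list;
    }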
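Once the ArrayList is filled, each entry can be handed to `WebClient.DownloadFile` in a loop. The following is only a sketch under a few assumptions: the class `ListingDownloader`, the helper `ToFileName`, and the file-naming scheme are made up for illustration, and failed downloads are simply logged and skipped.

```csharp
using System;
using System.Collections;
using System.IO;
using System.Net;

public static class ListingDownloader
{
    // Turns a URL into a flat file name (hypothetical naming scheme):
    // the URL path with characters that are invalid in file names replaced by '_'.
    public static string ToFileName(string url)
    {
        string name = new Uri(url).AbsolutePath.Trim('/');
        if (name.Length == 0) name = "index.html";
        foreach (char c in Path.GetInvalidFileNameChars())
            name = name.Replace(c, '_');
        return name;
    }

    // Downloads every URL in the list into targetDir, one file per URL.
    public static void DownloadAll(ArrayList urls, string targetDir)
    {
        Directory.CreateDirectory(targetDir);
        using (WebClient client = new WebClient())
        {
            foreach (string url in urls)
            {
                try
                {
                    client.DownloadFile(url, Path.Combine(targetDir, ToFileName(url)));
                }
                catch (WebException ex)
                {
                    // Keep going if one page fails; report it instead.
                    Console.WriteLine("Skipped " + url + ": " + ex.Message);
                }
            }
        }
    }
}
```

With the method above this would be used as `ListingDownloader.DownloadAll(GetListings(1), "pages")`. To cover a whole site you would repeat the fetch-and-extract step on every downloaded page while keeping track of URLs already visited.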
Blankasaurus
+1. Also remember to parse all text files (HTML, CSS), as they can contain links to other resources.
Rubens Farias
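To illustrate the point about CSS files, here is a minimal sketch that pulls `url(...)` references out of a stylesheet with a regular expression. The class name `CssLinks` and the pattern are assumptions for illustration; the pattern handles the common quoted and unquoted forms, not every case the CSS grammar allows (for example `@import "file.css"` without `url()`).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CssLinks
{
    // Returns the targets of all url(...) references found in the CSS text.
    public static List<string> Extract(string css)
    {
        List<string> urls = new List<string>();
        // Matches url(x), url('x') and url("x"); URL captures x.
        MatchCollection matches = Regex.Matches(
            css,
            @"url\(\s*['""]?(?<URL>[^'"")]+)['""]?\s*\)",
            RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            urls.Add(m.Groups["URL"].Value.Trim());
        }
        return urls;
    }
}
```

The returned values are usually relative, so they need to be resolved against the URL of the stylesheet before they are downloaded.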
A: 

@Casonina Box: Sorry for the late reply. Thanks for the help :)

I am sure this will work for the application, but I have a question.

If I am not mistaken, the ArrayList now contains the href links to the other pages. Will applying the code snippet, i.e.

 string p = "http://" + Textbox2.Text;
 string r = textBox3.Text;
 System.Net.WebClient webclient = new System.Net.WebClient();
 webclient.DownloadFile(p, r);

download the entire pages? If yes, how do I access the URLs in the ArrayList, and how do I proceed from there?

Thanks in advance

Karthik