I'm trying to have my program "rip" news off of a website and place it on the WinForm, but my method is so dumb and redundant, I'm sure there must be a better way to do it.

public void LoadLatestNews()
{
    WebClient TheWebClient = new WebClient();
    string SourceCode = TheWebClient.DownloadString("http://www.chronic-domination.com/");
    int NewsPosition = SourceCode.IndexOf("news_post-title");

    string Y = SourceCode.Substring(NewsPosition,5000);
    int TitlePosition = Y.IndexOf("</div");

    string NewsPostTitle = SourceCode.Substring((NewsPosition + 17), (TitlePosition - 17));

    int BodyPosition = Y.IndexOf("news_post-body");

    string X = Y.Substring(BodyPosition, 1000);
    int EndBodyPosition = X.IndexOf("<br><br>");

    string NewsPostBody = X.Substring((BodyPosition + 16)+ EndBodyPosition);

    MessageBox.Show(NewsPostTitle);

}

Not only is this code horrible, it doesn't even work as intended. So I beg you, teach me the proper way to do things like this?

+4  A: 

Use the Html Agility Pack to parse the page. You can load the entire text of the page and then treat it much like XML: write XPath expressions or walk the DOM tree to get what you need.

This lets you avoid string-level "scraping" altogether and approach the task as you would any other XML store. You could write something like myDoc.DocumentNode.SelectSingleNode("//div[@class='header']/h2").InnerText, which means "select the H2 element that is an immediate child of a DIV whose class is 'header'" and then get the inner text of that element.
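Applied to the question's page, a minimal sketch might look like the following. It assumes the Html Agility Pack NuGet package is referenced, and it assumes that "news_post-title" and "news_post-body" are class attributes on div elements (the question only shows them as strings in the page source, so check the actual markup). NewsScraper and ExtractText are hypothetical names used for illustration.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class NewsScraper
{
    // Returns the trimmed text of the first node matching the XPath,
    // or null when nothing matches.
    public static string ExtractText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode lives on DocumentNode and returns null on no match.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null;
        }

        // Decode entities such as &amp; before showing the text to the user.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    public static void Main()
    {
        // Stand-in markup; in the real program this would come from
        // new WebClient().DownloadString("http://www.chronic-domination.com/").
        string html =
            "<html><body>" +
            "<div class='news_post-title'>Sample title</div>" +
            "<div class='news_post-body'>Sample body</div>" +
            "</body></html>";

        // Assumed XPath expressions: adjust to the page's real structure.
        string title = ExtractText(html, "//div[@class='news_post-title']");
        string body = ExtractText(html, "//div[@class='news_post-body']");

        Console.WriteLine(title);
        Console.WriteLine(body);
    }
}
```

In the WinForm, the returned strings could be assigned to a Label or TextBox instead of being written to the console. The null check matters because the site's markup may change without notice.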

Rex M
I'm very, VERY green to Web Scraping. How could I apply this to my particular problem? All I need for it to do is copy the string between "X" html tag. Thank you!
Sergio Tapia
@Papuccino see my revised answer.
Rex M
I'll try out what you suggested. :)
Sergio Tapia
Rex M - Curious how you'd initially retrieve the web page as an XML document so that an XmlDocument can be created?
Howiecamp
@Howiecamp we would not create an XmlDocument from the webpage - rather we would load the entire response stream into the Html Agility Pack which creates an "XML-like" structure that behaves like XML, and can be converted to an XmlDocument.
Rex M
Thanks Rex.....
Howiecamp
+1  A: 

Have a look at Wikipedia's entry on Web Scraping. I do a lot of web scraping, and in my experience regular expressions are sufficient about 80% of the time. Beyond that, you need to parse the (X)HTML and traverse the DOM tree.
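For the simple case, a regular expression version could be sketched as below. It uses only System.Text.RegularExpressions and again assumes the title sits in a div whose class attribute is exactly "news_post-title". RegexScraper and ExtractDivText are hypothetical names. The pattern stops at the first closing div tag, so it breaks on nested divs, which is one of the cases where a real parser is the better tool.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexScraper
{
    // Returns the raw inner markup of the first <div class="cssClass">,
    // or null when no such div is found. Not safe for nested divs.
    public static string ExtractDivText(string html, string cssClass)
    {
        string pattern =
            "<div[^>]*class=[\"']" + Regex.Escape(cssClass) + "[\"'][^>]*>(.*?)</div>";

        // Singleline lets '.' match newlines, since markup often spans lines.
        Match m = Regex.Match(
            html, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    public static void Main()
    {
        // Stand-in markup for the downloaded page source.
        string html = "<div class=\"news_post-title\">Sample title</div>";
        Console.WriteLine(ExtractDivText(html, "news_post-title"));
    }
}
```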

Nick