I'm currently self-studying C# in my free time and thought of a "little" project to get me going (and one that I or others will actually find useful). It ended up being more complicated than I thought. Or maybe I'm just thinking it is?

Anyway, this project would parse the homepages of the blogs I frequent (most of them are WordPress blogs), take the post headers and the links within those posts, and notify me via a balloon tip in the task bar. I can handle everything except getting C# to parse the HTML pages for the items I need. C# doesn't seem to have any built-in way to do this. Could anyone point me in the right direction? I just looked into the HTML Agility Pack but I'm still trying to figure it out. Some example code would help a lot too. Thanks in advance!

+1  A: 

You are doing the right thing if you are using the HTML Agility Pack.

Here is how to select all of the links on a page (from the HTML Agility Pack sample on Codeplex):

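// HtmlDocument here is HtmlAgilityPack.HtmlDocument (add "using HtmlAgilityPack;").
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute.
// Note: SelectNodes returns null when nothing matches.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
   HtmlAttribute att = link.Attributes["href"];
   // FixLink is a placeholder for your own link-rewriting method.
   att.Value = FixLink(att);
}
doc.Save("file.htm");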

You may want to brush up on your XPath, since that is how you query the HtmlDocument.
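
As a rough illustration of the kind of XPath query the asker's project might need, here is a minimal sketch that pulls post titles and their links out of an HTML string. The `entry-title` class, the sample markup, and the names `XPathSketch` and `ExtractPosts` are assumptions made up for this example; real WordPress themes differ, so the XPath expression would need adjusting for each blog.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class XPathSketch
{
    // Returns (title, url) pairs for every post heading link found in the HTML.
    public static List<KeyValuePair<string, string>> ExtractPosts(string html)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        var posts = new List<KeyValuePair<string, string>>();

        // Assumed theme markup: <h2 class="entry-title"><a href="...">Title</a></h2>
        HtmlNodeCollection links =
            doc.DocumentNode.SelectNodes("//h2[@class='entry-title']/a[@href]");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (links == null) return posts;

        foreach (HtmlNode link in links)
        {
            string title = link.InnerText.Trim();
            string url = link.GetAttributeValue("href", "");
            posts.Add(new KeyValuePair<string, string>(title, url));
        }
        return posts;
    }

    static void Main()
    {
        // Made-up sample markup, used only to exercise the query.
        const string html =
            "<h2 class='entry-title'><a href='http://example.com/first-post'>First post</a></h2>" +
            "<h2 class='entry-title'><a href='http://example.com/second-post'>Second post</a></h2>";

        foreach (var post in ExtractPosts(html))
            Console.WriteLine(post.Key + " -> " + post.Value);
    }
}
```

Checking the result of `SelectNodes` for null before iterating avoids a NullReferenceException on pages that do not contain the expected markup.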

Oded
Thanks for the quick response. I was trying to get the same sample from Codeplex to work but I'm having problems; I'm still such a noob with this, sorry. I'm having a problem with `new HtmlDocument();`. It says "Error: The type 'System.Windows.Forms.HtmlDocument' has no constructors defined". What should I do to eliminate this problem? Thanks again.
DeVilFisCh
Also, I use Visual C# 2010 Express if that matters.
DeVilFisCh
@DeVilFisCh - you need to add a reference to the agility pack to your project and a `using` statement for it in the class you are using it in.
Oded
@Oded - I added the HTML Agility Pack library to the project but didn't add it as a reference. My mistake. I've fixed that part now and inserted a using statement for the library. Unfortunately, I now get a new error: "'HtmlDocument' is an ambiguous reference between 'System.Windows.Forms.HtmlDocument' and 'HtmlAgilityPack.HtmlDocument'".
DeVilFisCh
@Oded - I fixed it by using HtmlAgilityPack.HtmlDocument instead. I get a debug error since it seems to search for a table on the site that I'm testing it on (if I understand the code correctly). I'll try something else and play with it for a bit.
DeVilFisCh
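
For reference, the ambiguity described above can also be resolved once per file with a `using` alias instead of writing out the full name each time. This is only a sketch that assumes a Windows Forms project referencing both assemblies; the alias `HapDocument` and the names `AliasSketch` and `Parse` are made up for illustration.

```csharp
using System.Windows.Forms;                        // also defines an HtmlDocument type
using HapDocument = HtmlAgilityPack.HtmlDocument;  // alias for the Agility Pack type

static class AliasSketch
{
    // Parses an HTML string with the Agility Pack document type, via the alias.
    public static HapDocument Parse(string html)
    {
        var doc = new HapDocument();
        doc.LoadHtml(html);
        return doc;
    }
}
```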
@Oded - I think I understand it a little now so I marked your answer. One last question though, how do I load directly from a URL?
DeVilFisCh
@DeVilFisCh - Just use a URL instead of a local file. `doc.Load("http://example.com/default.html");`
Oded
@Oded - I tried that but an error shows up: URI formats are not supported.
DeVilFisCh
@DeVilFisCh - you need to use the `HtmlWeb()` object to load the HTML document. See this link: http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=15645
Oded
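
A minimal sketch of the `HtmlWeb` approach mentioned above, assuming the placeholder URL from this thread; the names `LoadFromUrlSketch` and `CollectLinks` are hypothetical. `HtmlWeb.Load` fetches the page over HTTP and returns an `HtmlAgilityPack.HtmlDocument`, which can then be queried the same way as a document loaded from a file.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LoadFromUrlSketch
{
    // Collects the href value of every <a> element in an already-loaded document.
    public static List<string> CollectLinks(HtmlAgilityPack.HtmlDocument doc)
    {
        var hrefs = new List<string>();
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");

        // SelectNodes returns null when the page has no matching elements.
        if (links == null) return hrefs;

        foreach (HtmlNode link in links)
            hrefs.Add(link.GetAttributeValue("href", ""));
        return hrefs;
    }

    static void Main()
    {
        // HtmlDocument.Load expects a file path or stream; HtmlWeb handles URLs.
        var web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load("http://example.com/default.html");

        foreach (string href in CollectLinks(doc))
            Console.WriteLine(href);
    }
}
```

Keeping the link extraction separate from the download makes it possible to try the query on a saved copy of a page before pointing it at the live site.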