Hi, I'm trying to extract the text contained in a web page, using a third-party tool, the Html Agility Pack. Their documentation shows:

    HtmlWeb htmlWeb = new HtmlWeb();
    HtmlDocument doc = htmlWeb.Load("http://www.msn.com/");

    HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
    foreach (HtmlNode link in links)
    {
        Response.Write(link.Attributes["href"].Value + "<br/>");
    }

This works for grabbing all the links contained in a page, but I want to get all the text data contained in that page. Is that possible?

Has anybody worked with the Html Agility Pack before?

Thanks in advance

+1  A: 

Yep, it's possible. Download the source code for the HtmlAgilityPack and take a look at the Html2Txt sample project, particularly HtmlConvert.cs. You can pretty much copy/paste their method into whatever it is you're doing.

Or, for that matter, compile the sample project as-is and set a reference to the binaries. HtmlAgilityPack.Samples.HtmlToText.Convert() will do exactly what you need.
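
A minimal sketch of how that might be wired up. The class and method names come from the answer above, but whether Convert is an instance method and what parameters it takes are assumptions to verify against HtmlConvert.cs in the Html2Txt sample:

```csharp
// Sketch only: assumes a reference to the compiled HtmlAgilityPack.Samples
// assembly from the Html2Txt sample project. The Convert() signature used
// here (instance method taking a URL/path, returning plain text) is an
// assumption -- check HtmlConvert.cs before relying on it.
using System;
using HtmlAgilityPack.Samples;

class TextExtractor
{
    static void Main()
    {
        HtmlToText converter = new HtmlToText();
        string text = converter.Convert("http://www.msn.com/");
        Console.WriteLine(text);
    }
}
```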

Cam Soper
Yes, exactly, this is what I need. Thank you.
Nagu
A: 

You are using an XPath selector there. If you select all nodes ("*") and then perform the foreach, would it work?
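
(For what it's worth, the snippet in the question is C#.) A variant of this idea that selects only the text nodes rather than all element nodes could look like the following sketch; the `//text()` XPath expression and the whitespace trimming are my additions, not part of the original suggestion:

```csharp
// Sketch: select every text node in the document via XPath and print the
// non-empty ones. Assumes the same HtmlAgilityPack reference the question
// already uses. SelectNodes can return null when nothing matches, hence
// the guard.
using System;
using HtmlAgilityPack;

class AllText
{
    static void Main()
    {
        HtmlWeb htmlWeb = new HtmlWeb();
        HtmlDocument doc = htmlWeb.Load("http://www.msn.com/");

        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
            return;

        foreach (HtmlNode node in textNodes)
        {
            string text = node.InnerText.Trim();
            if (text.Length > 0)
                Console.WriteLine(text);
        }
    }
}
```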

PS: what programming language is this?

Quamis