ansaurus

Question

How can I get the first few words (100 or 200) from a long summary (plain string or HTML) using C#?

Answer 1

+2 A:

Two options:

1. Write code to do this properly: count words while skipping HTML tags, and push each opening tag onto a stack (popping it again when its closing tag appears). When you reach the word threshold, pop the unclosed tags from the stack and append their closing tags to the end of the string.

   pro: complete control, and the ability to get exactly N visible words.
   con: somewhat tricky to implement cleanly.

2. Cut the words, then feed the broken HTML into HtmlAgilityPack (a free download that can help with fixing broken HTML) and there you go.

   pro: almost no coding, proven solution, maintainable.
   con: you'd still need to figure out a way to not count tags when you do the .Substring() call.
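
The first option can be sketched roughly as follows. This is only a minimal illustration, not a full HTML parser: the class and method names (HtmlTruncator, TruncateWords) are made up for this example, it assumes that attribute values and comments never contain a '>' character, and it treats every tag as a word boundary, so "foo<b>bar</b>" counts as two words.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlTruncator
{
    // Void elements have no closing tag, so they are never pushed on the stack.
    private static readonly HashSet<string> VoidTags = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    // Hypothetical helper: keeps the first maxWords visible words of html
    // and closes whatever tags are still open at the cut point.
    public static string TruncateWords(string html, int maxWords)
    {
        var output = new StringBuilder();
        var openTags = new Stack<string>();
        int words = 0;
        bool inWord = false;
        int i = 0;

        while (i < html.Length)
        {
            char c = html[i];
            if (c == '<')
            {
                int close = html.IndexOf('>', i);
                if (close < 0) break;                 // unterminated tag: drop the tail
                string inner = html.Substring(i + 1, close - i - 1).Trim();
                bool closing = inner.Length > 0 && inner[0] == '/';
                int start = closing ? 1 : 0, end = start;
                while (end < inner.Length && char.IsLetterOrDigit(inner[end])) end++;
                string name = inner.Substring(start, end - start);

                if (name.Length > 0 && !closing)
                {
                    if (words >= maxWords) break;     // no new elements after the limit
                    bool selfClosing = inner[inner.Length - 1] == '/';
                    if (!selfClosing && !VoidTags.Contains(name)) openTags.Push(name);
                }
                else if (closing && openTags.Count > 0 &&
                         string.Equals(openTags.Peek(), name, StringComparison.OrdinalIgnoreCase))
                {
                    openTags.Pop();                   // matching close: element is complete
                }

                output.Append(html, i, close - i + 1);
                inWord = false;                       // assumption: a tag ends the current word
                i = close + 1;
                continue;
            }

            if (char.IsWhiteSpace(c))
            {
                inWord = false;
            }
            else if (!inWord)
            {
                if (words >= maxWords) break;         // the next word would exceed the limit
                words++;
                inWord = true;
            }
            output.Append(c);
            i++;
        }

        // Trim trailing whitespace, then close the tags still open, innermost first.
        while (output.Length > 0 && char.IsWhiteSpace(output[output.Length - 1])) output.Length--;
        while (openTags.Count > 0) output.Append("</").Append(openTags.Pop()).Append('>');
        return output.ToString();
    }
}
```

For example, HtmlTruncator.TruncateWords("<p>Hello <b>brave new</b> world</p>", 2) returns "<p>Hello <b>brave</b></p>". A plain string with no tags goes through the same code path and is simply cut after the last allowed word.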

Ken Egozi 2009-10-16 10:48:40

Answer 2

A:

You should separate out your content and markup. Can you give more info on what you're trying to do? (e.g. where this string is coming from, why you're trying to do it).

UpTheCreek 2009-10-16 10:48:50

Please see my edits

Ravi 2009-10-16 13:31:23

Answer 3

+1 A:

Have you thought about having the Html Agility Pack do your bidding?

While not perfect, here's one idea that will achieve (more or less) what you're after:

// requires: using System; using HtmlAgilityPack;

// Retrieve a summary of the html made of whole top-level elements.
// Elements are added until at least 'max' words are collected, so the
// result can overshoot 'max' by up to one element's worth of words.
string GetSummary(string html, int max)
{
 string summaryHtml = string.Empty;

 // load our html document
 HtmlDocument htmlDoc = new HtmlDocument();
 htmlDoc.LoadHtml(html);

 int wordCount = 0;

 foreach (var element in htmlDoc.DocumentNode.ChildNodes)
 {
  // stop once we have collected enough words
  if (wordCount >= max)
  {
   break;
  }

  // inner text will strip out all html, and give us plain text
  string elementText = element.InnerText;

  // split on whitespace, dropping empty entries so that runs of
  // spaces and line breaks are not counted as words
  string[] elementWords = elementText.Split(
   new char[] { ' ', '\t', '\r', '\n' },
   StringSplitOptions.RemoveEmptyEntries);

  // add the *outer* HTML (which will have proper
  // html formatting for this fragment) to the summary
  summaryHtml += element.OuterHtml;

  wordCount += elementWords.Length;
 }

 return summaryHtml;
}

Bauer 2009-10-16 12:06:34
