I want to get the first few words (100 or 200) from a long block of text (plain string or HTML) using C#.

My requirement is to display a short description taken from a long summary, and this content may include HTML elements. I can retrieve the short version when the input is a plain string, but when it is HTML, the elements are cut off partway through. For example, I get output like this:

<span style="FONT-FAMILY: Trebuchet MS">Heading</span>
</H3><span style="FONT-FAMILY: Trebuchet MS">
<font style="FONT-SIZE: 15px;

But it should return the string with complete HTML elements.

I have a Yahoo UI Editor to get the content from the user, and I pass that text to the method below to get the short summary:

public static string GetFirstFewWords(string input, int numberWords)
{
    if (input.Split(new char[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries).Length > numberWords)
    {
        // Number of words we still want to display.
        int words = numberWords;
        // Loop through entire summary.
        for (int i = 0; i < input.Length; i++)
        {
            // Decrement the remaining word count on a space.
            if (input[i] == ' ')
            {
                words--;
            }
            // If we have no more words to display, return the substring.
            if (words == 0)
            {
                return input.Substring(0, i);
            }
        }
        return string.Empty;
    }
    else
    {
        return input;
    }
}

I'm using this to take the article content from the user and display a short summary on the list page.

+2  A: 

Two options:

  1. build code to do this properly: count words while skipping html tags, pushing each opening tag onto a stack (and popping it again when its closing tag appears); when you reach the threshold, pop the unclosed tags from the stack and append their closing tags to the end of the string.

    pro: complete control, and ability to get exactly N visible words.
    con: somewhat tricky to implement cleanly.

  2. cut the words, then feed the broken HTML into HtmlAgilityPack (a free download that can help with fixing broken HTML) and there you go.

    pro: almost no coding, proven solution, maintainable
    con: you'd still need to figure out a way to not count tags when you do the .Substring() call
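
Option 1 could be sketched roughly as below. This is only a minimal illustration, not a full HTML parser: the class name `HtmlSummary`, the method `Truncate`, and the short `VoidTags` list are made up for this example, and it assumes that no `>` appears inside attribute values or comments and that a tag always counts as a word boundary.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlSummary
{
    // Assumed (incomplete) list of elements that never take a closing tag.
    private static readonly HashSet<string> VoidTags = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    // Copies html until maxWords visible words have been emitted,
    // then appends closing tags for every element still open.
    public static string Truncate(string html, int maxWords)
    {
        var output = new StringBuilder();
        var openTags = new Stack<string>();
        int words = 0;
        bool inWord = false;
        int i = 0;

        while (i < html.Length)
        {
            char c = html[i];
            if (c == '<')
            {
                int end = html.IndexOf('>', i);
                if (end < 0) break;              // unterminated tag: drop the tail
                string tag = html.Substring(i, end - i + 1);
                TrackTag(tag, openTags);
                output.Append(tag);              // tags are copied but never counted
                inWord = false;                  // simplification: a tag ends the current word
                i = end + 1;
                continue;
            }

            if (char.IsWhiteSpace(c))
            {
                inWord = false;
            }
            else if (!inWord)
            {
                if (words == maxWords) break;    // the next word would exceed the limit
                inWord = true;
                words++;
            }
            output.Append(c);
            i++;
        }

        // Close whatever is still open, innermost element first.
        while (openTags.Count > 0)
        {
            output.Append("</").Append(openTags.Pop()).Append('>');
        }
        return output.ToString();
    }

    // Pushes opening tags and pops matching closing tags.
    private static void TrackTag(string tag, Stack<string> openTags)
    {
        string inner = tag.Substring(1, tag.Length - 2).Trim();
        if (inner.Length == 0 || inner[0] == '!') return;   // comment or doctype

        bool closing = inner[0] == '/';
        if (closing) inner = inner.Substring(1).TrimStart();

        int nameEnd = 0;
        while (nameEnd < inner.Length && char.IsLetterOrDigit(inner[nameEnd])) nameEnd++;
        string name = inner.Substring(0, nameEnd);
        if (name.Length == 0) return;

        bool selfClosing = inner[inner.Length - 1] == '/';
        if (closing)
        {
            if (openTags.Count > 0 &&
                string.Equals(openTags.Peek(), name, StringComparison.OrdinalIgnoreCase))
            {
                openTags.Pop();
            }
        }
        else if (!selfClosing && !VoidTags.Contains(name))
        {
            openTags.Push(name);
        }
    }
}
```

With this sketch, `HtmlSummary.Truncate("<p>one <b>two three</b> four</p>", 2)` gives `<p>one <b>two </b></p>`: the cut happens before the third word, and the two tags still on the stack are closed in reverse order.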

Ken Egozi
A: 

You should separate out your content and markup. Can you give more info on what you're trying to do? (e.g. where this string is coming from, why you're trying to do it).

UpTheCreek
Please see my edits
Ravi
+1  A: 

Have you thought about having the Html Agility Pack do your bidding?

While not perfect, here's one idea that will achieve (more or less) what you're after:

// requires: using System; using HtmlAgilityPack;
// retrieve a summary of html made of whole top-level elements,
// stopping once at least 'max' words have been collected
string GetSummary(string html, int max)
{
 string summaryHtml = string.Empty;

 // load our html document
 HtmlDocument htmlDoc = new HtmlDocument();
 htmlDoc.LoadHtml(html);

 int wordCount = 0;

 foreach (var element in htmlDoc.DocumentNode.ChildNodes)
 {
  // stop once we have collected enough words
  if (wordCount >= max)
  {
   break;
  }

  // inner text will strip out all html, and give us plain text
  string elementText = element.InnerText;

  // we split on whitespace to get all the words in this element,
  // dropping empty entries so runs of spaces are not counted as words
  string[] elementWords = elementText.Split(
   new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

  // add the *outer* HTML (which will have proper
  // html formatting for this fragment) to the summary
  summaryHtml += element.OuterHtml;

  wordCount += elementWords.Length;
 }

 return summaryHtml;
}
Bauer