This problem is a challenging one. Our application allows users to post news on the homepage. That news is input via a rich text editor which allows HTML. On the homepage we want to only display a truncated summary of the news item.

For example, here is the full text we are displaying, including HTML:


In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.

We want to trim the news item to 250 characters of visible text, without counting the HTML markup.

Our current trimming method counts the HTML as well, so posts that are heavy on markup end up with far less visible text than intended.

For instance, if the above example included tons of HTML, it could potentially look like this:

In an attempt to make a bit more space in the office, kitchen, I've pulled...

This is not what we want.

Does anyone have a way of tokenizing HTML tags in order to maintain position in the string, perform a length check and/or trim on the string, and restore the HTML inside the string at its old location?

A: 

Wouldn't the fastest way be to use jQuery's text() method?

For example:

<ul>
  <li>One</li>
  <li>Two</li>
  <li>Three</li>
</ul>

var text = $('ul').text();

Would give the value OneTwoThree in the text variable. This would allow you to get the actual length of the text without the HTML included.

Phil.Wheeler
That won't help with finding the actual cutoff position in the string though, you'd have to take position 250 in the result and somehow reverse that back to the original string.
Chad Birch
Yeah, but hang on - JavaScript has this cool method, "substring", which can return a portion of a string. So try: var text = $('#MyTextEditorContent').text().substring(0, 250); Hey presto, your 250 characters of actual content without the HTML markup. What am I missing?
Phil.Wheeler
He wants the markup... just not to count it... but you're on the right track. The above gives you the text at which to cut; you then match on that in the original (HTML marked-up) string and cut at that point.
Dr.Dredel
Ah! Right. Figured I had to be missing something.
Phil.Wheeler
A: 

If I understand the problem correctly, you want to keep the HTML formatting, but you want to not count it as part of the length of the string you are keeping.

You can accomplish this with code that implements a simple finite state machine.

2 states: InTag, OutOfTag
InTag:
- Goes to OutOfTag if a > character is encountered
- Stays in InTag if any other character is encountered
OutOfTag:
- Goes to InTag if a < character is encountered
- Stays in OutOfTag if any other character is encountered

Your starting state will be OutOfTag.

You implement a finite state machine by processing one character at a time. Processing each character brings you to a new state.

As you run your text through the finite state machine, you also want to keep an output buffer and a variable holding the length encountered so far (so you know when to stop).

  1. Increment your Length variable each time you are in the OutOfTag state and you process another character. You can optionally skip the increment for whitespace characters.
  2. End the algorithm when you have no more characters or you have reached the desired length mentioned in #1.
  3. In your output buffer, include the characters you encounter up until the length mentioned in #1.
  4. Keep a stack of unclosed tags. When you reach the length, add an end tag for each element in the stack. As you run through your algorithm, you can tell when you encounter a tag by keeping a current_tag variable. This variable is started when you enter the InTag state, and it is ended when you enter the OutOfTag state (or when a whitespace character is encountered while in the InTag state). If you have a start tag, push it onto the stack. If you have an end tag, pop it from the stack.
Brian R. Bondy
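The state machine and tag stack described in this answer could be sketched roughly as follows in JavaScript. This is only an illustration under some assumptions: the names truncateHtml, recordTag and VOID_TAGS are made up for the example, comments are assumed to be stripped beforehand, and any literal < or > in the text is assumed to be encoded already. Whitespace is counted here, although the answer notes that skipping it is optional.

```javascript
// Hypothetical names; none of this comes from the thread itself.
// Tags that never take a closing tag (partial list, for illustration only).
const VOID_TAGS = new Set(["br", "img", "hr", "input", "meta", "link"]);

// Update the stack of unclosed tags from the raw text between < and >.
function recordTag(raw, openTags) {
  const closing = raw.startsWith("/");
  const selfClosing = raw.endsWith("/");
  const name = raw.replace(/^\//, "").trim().split(/[\s/]/)[0].toLowerCase();
  if (closing) {
    // Pop only when the end tag matches the most recent open tag.
    if (openTags[openTags.length - 1] === name) openTags.pop();
  } else if (name && !selfClosing && !VOID_TAGS.has(name)) {
    openTags.push(name);
  }
}

function truncateHtml(html, maxChars) {
  let inTag = false;    // state: InTag (true) or OutOfTag (false)
  let count = 0;        // visible characters copied so far
  let currentTag = "";  // text of the tag currently being read
  const openTags = [];  // stack of unclosed tag names
  let out = "";

  for (const ch of html) {
    if (inTag) {
      out += ch;
      if (ch === ">") {
        inTag = false;
        recordTag(currentTag, openTags);
      } else {
        currentTag += ch;
      }
    } else if (ch === "<") {
      inTag = true;
      currentTag = "";
      out += ch;
    } else {
      // Stop at the first visible character beyond the limit.
      if (count === maxChars) break;
      out += ch;
      count++;
    }
  }

  // Add an end tag for every tag still open at the cut.
  while (openTags.length > 0) out += "</" + openTags.pop() + ">";
  return out;
}
```

For example, truncateHtml("<div><i>abcdef</i></div>", 3) keeps three visible characters and then closes the two open tags. Because the loop only stops at the next visible character, tags that immediately follow the last kept character are still copied to the output.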
+5  A: 

Start at the first character of the post, stepping over each character. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter gets to 250 is where you actually want to cut off.

Note that you will also have to deal with the case where an HTML tag is opened but not closed before the cutoff.

Chad Birch
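A minimal JavaScript sketch of the counting idea in this answer, for illustration only. The function name findCutIndex is hypothetical, and the sketch assumes that < and > only ever appear as tag delimiters (that is, the text content is already encoded). It does nothing about tags left open at the cut.

```javascript
// Returns the index in `html` at which to cut so that the prefix
// contains at most `maxChars` characters outside of tags.
function findCutIndex(html, maxChars) {
  let visible = 0;   // characters counted outside of tags
  let inTag = false; // true between '<' and '>'

  for (let i = 0; i < html.length; i++) {
    const ch = html[i];
    if (ch === "<") { inTag = true; continue; }
    if (ch === ">") { inTag = false; continue; }
    if (inTag) continue;

    visible++;
    if (visible === maxChars) return i + 1; // cut just after this character
  }
  return html.length; // fewer visible characters than the limit
}
```

Usage would be something like html.slice(0, findCutIndex(html, 250)).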
It's amazing how you can be too close to a problem to find the most simple solution. This worked like a charm.
steve_c
You're going to run into trouble the first time you run across a literal '<' or '>' in the text, unless you can be 100% sure that your text messages will never contain those characters.
Dr.Dredel
Yep, we're encoding the content before this process.
steve_c
You will need to add a stack of open tags: push a tag name when you find a < (the name runs until the next space character or >, whichever comes first), and pop it off when that tag is closed. After you finish, pop all remaining items from the stack, adding the closing tag for each.
Osama ALASSIRY
@Osama ALASSIRY: No, that would be foolish, since an HTML tag cannot contain another HTML tag (i.e., <b><i> is legal, but <b<i>> is not.)
titaniumdecoy
You probably want to count `&lt;` and similar entities as one character, which means you stop counting _after_
MSalters
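One way to count character references such as &lt; or &nbsp; as a single character is sketched below in JavaScript. The names ENTITY and findCutIndexWithEntities are invented for this example, and the pattern only checks the general shape of a reference (named, decimal or hexadecimal); it does not validate the name against the real entity list.

```javascript
// Sticky regex: matches a character reference starting exactly at lastIndex.
const ENTITY = /&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);/y;

// Returns the cut index such that the prefix holds at most `maxChars`
// visible characters, treating each entity as one character.
function findCutIndexWithEntities(html, maxChars) {
  let visible = 0;
  let i = 0;

  while (i < html.length) {
    if (html[i] === "<") {
      const close = html.indexOf(">", i);
      if (close === -1) break; // unterminated tag: give up and keep everything
      i = close + 1;           // skip the whole tag without counting it
      continue;
    }

    if (html[i] === "&") {
      ENTITY.lastIndex = i;
      const m = ENTITY.exec(html);
      i += m ? m[0].length : 1; // whole entity, or a bare '&'
    } else {
      i++;
    }

    visible++;
    if (visible === maxChars) return i;
  }
  return html.length;
}
```

Cutting at the returned index never splits an entity in half, which a plain per-character count could do.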
A: 

What language are you using? If it supports regular expressions and you can be sure that the text will not contain any less-than (<) or greater-than (>) characters outside the HTML, you can write a simple regex to remove all the markup, count the characters in what's left, and truncate at the point where it makes sense.

Dr.Dredel
This solution wouldn't preserve the HTML in the string. Stripping the HTML from the string via regex is a trivial task; it's the preserving of the HTML that's more of the challenge.
steve_c
No, no... I get that. I was saying you strip the HTML from a copy in order to count the characters. Then, once you've found the place where you wish to cut, you cut the ORIGINAL string.
Dr.Dredel
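As a rough illustration of using a stripped copy only for counting, the JavaScript sketch below removes the tags from a copy and picks how many visible characters to keep so that the cut lands on a word boundary. The names stripTags and visibleLimitAtWordBoundary are hypothetical, and the regex relies on the same assumption as above: no < or > outside of tags.

```javascript
// Remove anything that looks like a tag. Only safe when '<' and '>'
// never occur in the text content itself.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "");
}

// Decide how many visible characters to keep (at most maxChars),
// preferring to stop at the last space at or before the limit.
function visibleLimitAtWordBoundary(html, maxChars) {
  const text = stripTags(html);
  if (text.length <= maxChars) return text.length; // nothing to trim

  const lastSpace = text.lastIndexOf(" ", maxChars);
  return lastSpace > 0 ? lastSpace : maxChars; // no space found: hard cut
}
```

The returned count is a number of visible characters, not an index into the original string, so it still has to be mapped back by walking the original markup and skipping tags.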
A: 

Here's the implementation that I came up with, in C#:

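// Trims HTML to the first `length` visible, non-whitespace characters.
// Characters inside tags and whitespace are kept in the output but not counted.
// Note: ConvertToXhtml is a helper that is not shown in this post.
public static string TrimToLength(string input, int length)
{
  if (string.IsNullOrEmpty(input))
    return string.Empty;

  // Shortcut: if the raw string (markup included) fits, so does its visible text.
  if (input.Length <= length)
    return input;

  bool inTag = false;
  int targetLength = 0; // visible, non-whitespace characters counted so far

  for (int i = 0; i < input.Length; i++)
  {
    char c = input[i];

    // A '>' ends the current tag; the character itself is not counted.
    if (c == '>')
    {
      inTag = false;
      continue;
    }

    // A '<' starts a tag; the character itself is not counted.
    if (c == '<')
    {
      inTag = true;
      continue;
    }

    // Skip everything inside a tag, and do not count whitespace.
    if (inTag || char.IsWhiteSpace(c))
    {
      continue;
    }

    targetLength++;

    // Limit reached: cut just after the current character.
    if (targetLength == length)
    {
      return ConvertToXhtml(input.Substring(0, i + 1));
    }
  }

  // Fewer visible characters than the limit: return the input unchanged.
  return input;
}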

And a few unit tests I used via TDD:

[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
  Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
  Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                  "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                  "<br/>" +
                  "In an attempt to make a bit more space in the office, kitchen, I";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}

[Test]
public void Html_TrimWellFormedHtml()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
             "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
             "<br/>" +
             "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
             "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
             "</div>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                    "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                    "<br/>" +
                    "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}

[Test]
public void Html_TrimMalformedHtml()
{
  string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                         "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                         "<br/>" +
                         "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
                         "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
              "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
              "<br/>" +
              "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}
steve_c
What happens if you have a table as part of your html? Your code wouldn't trim the string in the middle of a <td> tag but it might trim the string before the <td> tag is closed.
Alison
How would it do that, since it doesn't trim inside of open tags?
steve_c
A: 

Which namespaces are required to use this code?

Are you referring to the TrimToLength method below? If so, it should just require System, if I'm not mistaken.
steve_c
Yes. I got an error saying the ConvertToXhtml method is missing (ASP.NET). What I want is to remove all HTML tags and read 250 characters of content from an HTML page.
A: 

I'm aware this is quite a bit after the posted date, but I had a similar issue and this is how I ended up solving it. My concern would be the speed of regex versus iterating through an array.

Also, if there is a space both before and after an HTML tag, this doesn't fix that.

private string HtmlTrimmer(string input, int len)
{
 if (string.IsNullOrEmpty(input))
  return string.Empty;
 if (input.Length <= len)
  return input;

 // this is necessary because regex "^" applies to the start of the string, not where you tell it to start from
 string inputCopy;
 string tag;

 string result = "";
 int strLen = 0;
 int strMarker = 0;
 int inputLength = input.Length;  

 Stack stack = new Stack(10);
 Regex text = new Regex("^[^<&]+");                
 Regex singleUseTag = new Regex("^<[^>]*?/>");            
 Regex specChar = new Regex("^&[^;]*?;");
 Regex htmlTag = new Regex("^<.*?>");

 while (strLen < len)
 {
  inputCopy = input.Substring(strMarker);
  //If the marker is at the end of the string OR
  //the sum of the remaining characters and those analyzed is less than the maxlength
  if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
   break;

  //Match regular text
  result += text.Match(inputCopy,0,len-strLen);
  strLen += result.Length - strMarker;
  strMarker = result.Length;

  inputCopy = input.Substring(strMarker);
  if (singleUseTag.IsMatch(inputCopy))
   result += singleUseTag.Match(inputCopy);
  else if (specChar.IsMatch(inputCopy))
  {
   //think of &nbsp; as 1 character instead of 5
   result += specChar.Match(inputCopy);
   ++strLen;
  }
  else if (htmlTag.IsMatch(inputCopy))
  {
   tag = htmlTag.Match(inputCopy).ToString();
   //This only works if this is valid Markup...
   if(tag[1]=='/')   //Closing tag
    stack.Pop();
   else     //not a closing tag
    stack.Push(tag);
   result += tag;
  }
  else    //Bad syntax
   result += input[strMarker];

  strMarker = result.Length;
 }

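 while (stack.Count > 0)
 {
  tag = stack.Pop().ToString();
  //Close with the tag name only, so attributes are not copied into the closing tag
  result += "</" + Regex.Match(tag, @"^<\s*([^\s/>]+)").Groups[1].Value + ">";
 }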
 if (strLen == len)
  result += "...";
 return result;
}
Highstead
A: 

You need to be careful with comments. You can get all sorts of stuff inside comments. I'm not sure why someone would put a comment in a rich text box (unless they are copying and pasting from somewhere else).

pawan jain
We strip comments first before we run it through this.
steve_c
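Stripping comments before any tag-aware trimming could look roughly like the JavaScript sketch below. The function name stripComments is made up for this example, and the pattern is a simplification: it removes everything from <!-- up to the next -->, which matters because a comment body may contain > characters that would confuse a naive tag scanner. An unterminated comment is left untouched.

```javascript
// Remove HTML comments, including multi-line ones.
// [\s\S] matches any character, newlines included; *? keeps the match lazy
// so each comment ends at its own closing "-->".
function stripComments(html) {
  return html.replace(/<!--[\s\S]*?-->/g, "");
}
```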