There are lots of questions on how to strip html tags, but not many on functions/methods to close them.

Here's the situation. I have a 500-character message summary (which includes HTML tags), but I only want the first 100 characters. The problem is that if I truncate the message, the cut could land in the middle of an HTML tag, which breaks the markup.

Assuming the html is something like this:

<div class="bd">"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <br/>
 <br/>Some Dates: April 30 - May 2, 2010 <br/>
 <p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. <em>Duis aute irure dolor in reprehenderit</em> in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. <br/>
 </p>
 For more information about Lorem Ipsum doemdloe, visit: <br/>
 <a href="http://www.somesite.com" title="Some Conference">Some text link</a><br/> 
</div>

How would I take the first ~100 characters or so? (Ideally, that would be approximately the first 100 characters of content, i.e. the text in between the HTML tags.)

I'm assuming the best way to do this would be a recursive algorithm that keeps track of the html tags and appends any tags that would be truncated, but that may not be the best approach.

My first thoughts are using recursion to count nested tags, and when we reach 100 characters, look for the next "<" and then use recursion to write the closing html tags needed from there.

The reason for doing this is to make a short summary of existing articles without requiring the user to go back and provide summaries for all the articles. I want to keep the html formatting, if possible.

NOTE: Please ignore that the html isn't totally semantic. This is what I have to deal with from my WYSIWYG.

EDIT:

I added a potential solution (that seems to work). I figure others will run into this problem as well. I'm not sure it's the best, and it's probably not totally robust (in fact, I know it isn't), but I'd appreciate any feedback.

+3  A: 

What if you parse the HTML into a DOM structure and then traverse it, breadth-first or depth-first, whichever you like, collecting the text of nodes until you reach 100 characters?
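A minimal sketch of that idea, assuming the fragment happens to be well-formed XML (as the sample in the question is). It uses System.Xml.Linq as a stand-in for a real HTML parser, and the names DomSummary and FirstText are made up for illustration. Malformed markup or named entities such as &nbsp; would make XElement.Parse throw.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class DomSummary
{
    // Collects text in document (depth-first) order until `limit` characters are gathered.
    public static string FirstText(string xhtml, int limit)
    {
        // Assumption: the input is a single well-formed XML element.
        XElement root = XElement.Parse(xhtml, LoadOptions.PreserveWhitespace);
        var sb = new StringBuilder();

        foreach (XText text in root.DescendantNodes().OfType<XText>())
        {
            int room = limit - sb.Length;
            if (room <= 0)
                break;

            // Take the whole node if it fits, otherwise only what is left of the budget.
            string value = text.Value;
            sb.Append(value.Length <= room ? value : value.Substring(0, room));
        }

        return sb.ToString();
    }
}
```

This only returns the text. Keeping the markup means trimming the tree itself rather than copying text out of it.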

Developer Art
Yeah, that's along the lines of what I'm thinking... just trying to visualize what my code will look like. I'm testing out some ideas.
Atømix
+1  A: 

In the past I've done this with regex. Grab the content, strip out the tags via regex, then trim it down to your desired length.

Granted, that removes all HTML, which is what I had wanted. If you're looking to keep the HTML, I'd consider not closing open tags but rather removing the open tags.
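For illustration, a rough sketch of the strip-then-trim approach. PlainSummary and StripAndTrim are hypothetical names, and the tag pattern is deliberately naive: it will misfire on a '>' inside an attribute value or on comments and script blocks.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainSummary
{
    public static string StripAndTrim(string html, int limit, string ellipsis)
    {
        // Replace every tag with a space so "<br/>Some" does not glue two words together.
        string text = Regex.Replace(html, "<[^>]*>", " ");

        // Turn entities such as &amp; back into plain characters.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace left behind by the removed tags.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        if (text.Length <= limit)
            return text;

        return text.Substring(0, limit).TrimEnd() + ellipsis;
    }
}
```

Replacing tags with a space is a trade-off: it keeps block-level content apart, but a tag in the middle of a word (wo<b>rd</b>) ends up as two words.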

DA
I never thought of removing the open tags. Hmmm... not sure it would work as expected, though. My first instinct (read: lazy) was just to display the text... stripping out tags with regex... but I really want a good solution to keep the content if possible.
Atømix
"not sure it would work as expected" Well, that's the catch. No matter the solution, it's not going to meet the expectations of some folks. I would advocate the no-tags option, as a list of snippets will often want its own styling independent of the source.
DA
+1  A: 

My suggestion would be to find an HTML-friendly traverser (one that lets you traverse HTML like XML) and then, starting from the beginning, ignore the tags themselves and only count the data inside them. Count that towards your limit, and once it is reached just close out each open tag (I can't think of any tags whose closing form is not just /whatever).

This should work reasonably well and be fairly close to what you are looking for.

It's totally off the top of the ol' noggin, so I am assuming that there will be some tricky parts, like attribute values that display (such as link tag values).
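Here is one way that could look, as a sketch only. It assumes the input is well-formed XML and uses System.Xml.Linq in place of an HTML-tolerant traverser. MarkupSummary and Truncate are hypothetical names. Only text nodes count towards the limit, and since the cut is made on the tree, the serializer writes the closing tags.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class MarkupSummary
{
    public static string Truncate(string xhtml, int limit, string ellipsis)
    {
        // Assumption: the input is a single well-formed XML element.
        XElement root = XElement.Parse(xhtml, LoadOptions.PreserveWhitespace);
        int remaining = limit;

        // Snapshot the text nodes so the tree can be modified while looping.
        List<XText> texts = root.DescendantNodes().OfType<XText>().ToList();

        foreach (XText text in texts)
        {
            if (text.Value.Length <= remaining)
            {
                remaining -= text.Value.Length; // the whole node fits
                continue;
            }

            // The limit falls inside this node: keep only what fits.
            text.Value = text.Value.Substring(0, remaining) + ellipsis;

            // Drop everything that follows the cut, at each level up to the root.
            for (XNode node = text; node != null && node != root; node = node.Parent)
            {
                foreach (XNode later in node.NodesAfterSelf().ToList())
                    later.Remove();
            }
            break;
        }

        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

With the input <div>Hello <em>big</em> world<p>more</p></div> and a limit of 8, this returns <div>Hello <em>bi...</em></div>. Attribute values are not counted, so a long href or title does not eat into the budget.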

GrayWizardx
Agreed, and you may want to look at HTML Agility Pack for this purpose.
JacobM
@JacobM, yeah HTMLAgility would be a good one for this.
GrayWizardx
I like this solution the best. Although I ended up moving on to another project, HTMLAgilityPack is on my list, for sure. An example of using it would be great, mind you.... ^_-
Atømix
+1  A: 

I decided to roll my own solution... just for the challenge of doing it.

If anyone can see any logic errors or inefficiencies let me know.

I don't know if it's the best approach... but it seems to work. There are probably cases where it doesn't work... and it likely will fail if the html isn't correct.

using System.Collections.Generic;
using System.Text;

/// <summary>
/// Get the first n characters of content from some html text,
/// closing any tags that are still open at the cut point
/// </summary>
private string truncateTo(string s, int howMany, string ellipsis) {

    // return entire string if it's no longer than n characters
    if (s.Length <= howMany)
        return s;

    Stack<string> elements = new Stack<string>(); // elements currently open
    int trueCount = 0; // characters of content seen so far (tags excluded)

    for (int i = 0; i < s.Length; i++) {
        if (s[i] == '<') {

            // find the end of this tag; give up on an unterminated tag
            int close = s.IndexOf('>', i);
            if (close < 0)
                break;

            if (i + 1 < s.Length && s[i + 1] == '/') {
                // closing tag: take the previous element off the stack
                if (elements.Count > 0)
                    elements.Pop();
            }
            else { // not a closing tag so get the element name
                StringBuilder elem = new StringBuilder();
                int j = i + 1;
                while (j < close && char.IsLetterOrDigit(s[j])) {
                    elem.Append(s[j]);
                    j++;
                }

                // self-closing tags (<br/>) and nameless tags are not stored
                bool selfclosing = s[close - 1] == '/';
                if (!selfclosing && elem.Length > 0)
                    elements.Push(elem.ToString());
            }

            i = close; // the for loop's i++ then steps past '>'
        }
        else {
            trueCount++;
            if (trueCount > howMany) {
                // the first i characters hold exactly howMany characters of content
                StringBuilder sb = new StringBuilder(s.Substring(0, i));
                sb.Append(ellipsis);
                while (elements.Count > 0) {
                    sb.AppendFormat("</{0}>", elements.Pop());
                }
                return sb.ToString(); // stop here, otherwise the summary is appended again
            }
        }
    }

    // fewer than n characters of content (or a malformed tag): nothing to cut
    return s;
}
Atømix
Noooooooo, don't do it, turn back now before it's too late. Get yourself an HTML parser library unless there is some mandate that you cannot use one.
Tj Kellie
Sounds like you're speaking from experience... not to worry, I haven't totally committed to this solution... but it was a relatively fun method to write.
Atømix