I'm trying to figure out a way to count the number of characters in a string, truncate the string, and then return it. However, I need this function NOT to count HTML tags. If it does count them and the truncation point falls in the middle of a tag, the page will appear broken.

This is what I have so far...

public string Truncate(string input, int characterLimit, string currID) {
    string output = input;

    // Check if the string is longer than the allowed amount
    // otherwise do nothing
    if (output.Length > characterLimit && characterLimit > 0) {

        // cut the string down to the maximum number of characters
        output = output.Substring(0, characterLimit);

        // Check if the character right after the truncate point was a space
        // if not, we are in the middle of a word and need to remove the rest of it
        if (input.Substring(output.Length, 1) != " ") {
            int LastSpace = output.LastIndexOf(" ");

            // if we found a space then, cut back to that space
            if (LastSpace != -1)
            {
                output = output.Substring(0, LastSpace);
            }
        }
        // end any anchors
        if (output.Contains("<a href")) {
            output += "</a>";
        }
        // Finally, add the "..." and end the paragraph
        output += "<br /><br />...<a href='Announcements.aspx?ID=" + currID + "'>see more</a></p>";
    }
    return output;
}

But I'm not happy with this. Is there a better way to do this? If you could provide a new solution to this, or perhaps suggestions on what to add to what I have so far, that would be great.

Disclaimer: I've never worked with C#, so I'm not familiar with the concepts related to the language... I'm doing this because I have to, not by choice.

Thanks, Hristo

+2  A: 

Use the right tool for the problem.

HTML is not a simple format to parse. I would advise that you use a proven, existing parser rather than rolling your own. If you know that you will only ever parse XHTML, you could use an XML parser instead.

These are the only reliable ways to perform operations on HTML that will preserve the semantic representation.

Don't try to use regular expressions. HTML is not a regular language and you can only cause yourself grief and misery going in that direction.
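To make the XML-parser route concrete, here is a minimal sketch, assuming the snippet is well-formed XHTML with no HTML-only entities such as `&nbsp;` (those would make `LoadXml` throw). The class name `XhtmlTruncator` and the dummy `<root>` wrapper are illustrative choices, not something from the question. It counts only text-node characters, so a tag is never counted or cut in half, and every tag that was opened is still closed on output.

```csharp
using System;
using System.Xml;

// Hypothetical helper for illustration only.
public static class XhtmlTruncator
{
    // Keeps at most `limit` characters of visible text; markup is not counted.
    public static string Truncate(string xhtml, int limit)
    {
        var doc = new XmlDocument();
        // Keep whitespace-only text between elements (e.g. "</b> <i>").
        doc.PreserveWhitespace = true;
        // Wrap in a dummy root so a fragment with several top-level nodes loads.
        doc.LoadXml("<root>" + xhtml + "</root>");

        int remaining = limit;
        Trim(doc.DocumentElement, ref remaining);
        return doc.DocumentElement.InnerXml;
    }

    private static void Trim(XmlNode node, ref int remaining)
    {
        // Copy the child list first, because nodes are removed while iterating.
        var children = new XmlNode[node.ChildNodes.Count];
        for (int i = 0; i < children.Length; i++)
        {
            children[i] = node.ChildNodes[i];
        }

        foreach (XmlNode child in children)
        {
            bool isText = child.NodeType == XmlNodeType.Text
                       || child.NodeType == XmlNodeType.Whitespace
                       || child.NodeType == XmlNodeType.SignificantWhitespace;

            if (remaining <= 0)
            {
                // Budget used up: drop everything that follows.
                node.RemoveChild(child);
            }
            else if (isText)
            {
                if (child.Value.Length > remaining)
                {
                    child.Value = child.Value.Substring(0, remaining);
                }
                remaining -= child.Value.Length;
            }
            else
            {
                // Element (or other container): recurse into its children.
                Trim(child, ref remaining);
            }
        }
    }
}
```

For example, `XhtmlTruncator.Truncate("<p>Hello <b>world</b> and more</p>", 8)` keeps "Hello " plus "wo" and returns `<p>Hello <b>wo</b></p>`. Backing up to the last space and appending the "see more" link would still need to be layered on top; real-world HTML that is not valid XML needs an actual HTML parser instead.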

LBushkin
Thanks for your advice. I looked at the parser and it doesn't look trivial to use. The only thing I have against the parser is that I don't want to parse a whole HTML document... just a snippet that will be added to the page dynamically.
Hristo
@Hristo: Take a look at the `DocumentElement.SelectNodes` method. You should be able to select all nodes of all types, and then use the `InnerText` property to count the number of non-HTML characters.
LBushkin
@LBushkin... before I can use any of the Html Agility Pack features such as `DocumentElement.SelectNodes`, I need to get it working with Microsoft Visual Web Developer. Do you have any suggestions on how to get it "installed"?
Hristo
A: 

You can use a regular expression to strip the HTML tags into a separate string and then count the characters in that string. Check out: http://stackoverflow.com/questions/787932/using-c-regular-expressions-to-remove-html-tags
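A rough sketch of that idea, using `Regex.Replace` from `System.Text.RegularExpressions`. The class name `TagStripper` and the pattern `<[^>]*>` are assumptions made for illustration; the pattern is deliberately naive and will misfire on a `>` inside an attribute value, on comments, and on script blocks. It also only gives you the count of visible characters, not the position in the original string where it is safe to cut.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class TagStripper
{
    // Removes anything that looks like a tag: '<', then non-'>' characters, then '>'.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }

    // Number of characters left once the tags are gone.
    public static int VisibleLength(string html)
    {
        return StripTags(html).Length;
    }
}
```

With this, `TagStripper.VisibleLength("<p>Hello <b>world</b></p>")` counts only the 11 characters of "Hello world".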

Gmoliv