ansaurus

Question

How can I take the first 100 characters of HTML content (without stripping the tags)?

Answer 1

+3 A:

What if you parse the HTML into a DOM structure and then traverse it depth-first (that is, in document order), collecting the text of the nodes until you reach 100 characters?

Developer Art 2010-03-29 20:13:53

Yeah, that's along the lines of what I'm thinking... just trying to visualize what my code will look like. I'm testing out some ideas.

Atømix 2010-03-29 20:16:33

Answer 2

+1 A:

In the past I've done this with regex. Grab the content, strip out the tags via regex, then trim it down to your desired length.

Granted, that removes all HTML, which is what I had wanted. If you're looking to keep the HTML, then rather than closing the tags left open by the cut, I'd consider removing those unclosed opening tags.
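For what it's worth, here is a minimal sketch of the strip-then-trim idea. The names Snippet and PlainTextPreview are made up for illustration, and the tag pattern is deliberately naive: it assumes no '>' appears inside attribute values, comments, or script blocks.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class Snippet
{
    // Hypothetical helper: removes anything that looks like a tag,
    // then cuts the remaining text to at most maxChars characters.
    public static string PlainTextPreview(string html, int maxChars)
    {
        // Naive tag pattern: everything from '<' to the next '>'.
        string text = Regex.Replace(html, "<[^>]*>", string.Empty);

        // Turn entities such as &amp; into the characters they stand for,
        // so each one counts as a single character.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace left behind by the removed tags.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        return text.Length <= maxChars ? text : text.Substring(0, maxChars);
    }
}
```

Since the result is decoded plain text, it would need to go through something like WebUtility.HtmlEncode again before being written back into a page.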

DA 2010-03-29 20:18:03

I never thought of removing the open tags. Hmmm... not sure it would work as expected, though. My first instinct (read: lazy) was just to display the text... stripping out tags with regex... but I really want a good solution to keep the content if possible.

Atømix 2010-03-29 20:31:02

"not sure it would work as expected": well, that's the catch. No matter the solution, it's not going to meet the expectations of some folks. I would advocate the no-tags option, as a list of snippets is often going to want its own styling, independent of the source.

DA 2010-03-30 14:21:44

Answer 3

+1 A:

My suggestion would be to find an HTML-friendly traverser (one that lets you walk HTML like XML). Starting from the first tag, ignore the tags themselves and count only the text inside them towards your limit. Once the limit is reached, close out each tag that is still open (I can't think of any tags whose closing form is not just /whatever).

This should work reasonably well and be fairly close to what you are looking for.

It's totally off the top of the ol' noggin, so I'm assuming there will be some tricky parts, like attribute values that get displayed (such as link tag values).
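A rough sketch of that walk, under a big assumption: to stay dependency-free it uses System.Xml.Linq, so it only works on well-formed XHTML. For real-world HTML you would do the same walk over the node tree of a tolerant HTML parser. HtmlTruncator and Truncate are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class HtmlTruncator
{
    // Keeps the first maxChars characters of text, leaves the markup around
    // them intact, and drops everything after the cut.
    public static string Truncate(string xhtml, int maxChars, string ellipsis)
    {
        // Assumption: the input is well-formed XML. The dummy root lets
        // fragments with several top-level nodes parse.
        XElement root = XElement.Parse("<root>" + xhtml + "</root>",
                                       LoadOptions.PreserveWhitespace);
        int remaining = maxChars;
        Trim(root, ref remaining, ellipsis);

        // Serialize the children only, which drops the dummy root again.
        // The serializer writes the closing tags for us.
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    // Depth-first walk in document order; only text nodes use up the budget.
    private static void Trim(XElement parent, ref int remaining, string ellipsis)
    {
        foreach (XNode node in parent.Nodes().ToList())
        {
            if (remaining <= 0)
            {
                node.Remove(); // past the cut: drop the node entirely
                continue;
            }

            if (node is XText text)
            {
                if (text.Value.Length > remaining)
                {
                    // The cut falls inside this text node.
                    text.Value = text.Value.Substring(0, remaining) + ellipsis;
                    remaining = 0;
                }
                else
                {
                    remaining -= text.Value.Length;
                }
            }
            else if (node is XElement child)
            {
                Trim(child, ref remaining, ellipsis);
            }
        }
    }
}
```

With this sketch, Truncate("<p>Hello <b>world</b> and more</p>", 8, "...") gives <p>Hello <b>wo...</b></p>. One known gap: if the limit lands exactly on the end of a text node, the rest is dropped without an ellipsis.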

GrayWizardx 2010-03-29 20:23:22

Agreed, and you may want to look at HTML Agility Pack for this purpose.

JacobM 2010-03-29 20:25:30

@JacobM, yeah, HTML Agility Pack would be a good one for this.

GrayWizardx 2010-03-30 16:54:06

I like this solution the best. Although I ended up moving on to another project, HTML Agility Pack is on my list, for sure. An example of using it would be great, mind you.... ^_-

Atømix 2010-04-07 13:40:20

Answer 4

+1 A:

I decided to roll my own solution... just for the challenge of doing it.

If anyone can see any logic errors or inefficiencies let me know.

I don't know if it's the best approach... but it seems to work. There are probably cases where it doesn't work... and it likely will fail if the html isn't correct.

using System.Collections.Generic; // Stack<T>
using System.Text;                // StringBuilder

/// <summary>
/// Get the first n text characters of some HTML, keeping the tags
/// and closing any elements that are still open at the cut.
/// </summary>
private string truncateTo(string s, int howMany, string ellipsis) {

    // return the entire string if it has no more than n characters
    if (s.Length <= howMany)
        return s;

    Stack<string> elements = new Stack<string>(); // names of the currently open elements
    int trueCount = 0;                            // text characters seen so far (tags excluded)

    for (int i = 0; i < s.Length; i++) {
        if (s[i] == '<') {

            int close = s.IndexOf('>', i);
            if (close < 0)
                break; // unterminated tag: give up and return the input unchanged

            if (s[i + 1] == '/') {
                // closing tag: take the previous element off the stack
                if (elements.Count > 0)
                    elements.Pop();
            }
            else if (s[i + 1] != '!' && s[i + 1] != '?') { // skip comments, doctype, processing instructions
                // opening tag: the name runs up to the first whitespace, '/' or '>'
                int nameEnd = i + 1;
                while (nameEnd < close && !char.IsWhiteSpace(s[nameEnd]) && s[nameEnd] != '/')
                    nameEnd++;

                // self-closing tags (<br />) need no closing tag, so don't store them.
                // Note: void elements written without the slash (<br>, <img ...>) are still stored.
                bool selfclosing = s[close - 1] == '/';
                if (!selfclosing)
                    elements.Push(s.Substring(i + 1, nameEnd - i - 1));
            }

            i = close; // jump to the end of the tag; the for loop steps past '>'
        }
        else {
            trueCount++;
            if (trueCount > howMany) {
                // s[i] is the first character over the limit: keep everything before it
                StringBuilder sb = new StringBuilder(s.Substring(0, i));
                sb.Append(ellipsis);
                while (elements.Count > 0) {
                    sb.AppendFormat("</{0}>", elements.Pop());
                }
                return sb.ToString();
            }
        }
    }

    // no more than n text characters in total: nothing to cut
    return s;
}

Atømix 2010-03-29 22:08:37

Noooooooo, don't do it, turn back now before it's too late. Get yourself an HTML parser library unless there is some mandate that you cannot use one.

Tj Kellie 2010-03-30 01:44:29

Sounds like you're speaking from experience... not to worry, I haven't totally committed to this solution... but it was a relatively fun method to write.

Atømix 2010-03-30 13:07:13
