ansaurus

Question

c# Truncate HTML safely for article summary

Answer 1

+4 A:

Abel 2009-11-11 12:32:01

HTML Agility Pack and SgmlReader both handle the "HTML to XHTML" need quite nicely. I personally like SgmlReader better, but both are good.

asbjornu 2009-11-11 12:45:10

This isn't what is asked in the original question. The formatting should be preserved, yet it shouldn't count towards the number of chars requested.

Jan Jongboom 2009-11-11 12:45:46

@Jan: where in the question does it say so? But I'd be happy to update the same method including the formatting / counting issue

Abel 2009-11-11 12:48:53

The part "What I would want is:" in the referenced question. So a 26-char summary should be 26 chars of text, PLUS the HTML, etc. See stian.net's answer.

Jan Jongboom 2009-11-11 12:55:45

Thanks Jan, I didn't read that far up. The question has meanwhile been edited with a full description, and my answer is edited as well (see bottom half).

Abel 2009-11-11 16:08:03

As soon as you add a non-breaking space (&nbsp;) to the HTML, your code will break.

Dan Diplo 2009-11-11 16:11:53

Thanks Abel for this, I will hook it up and see how it fares, much appreciated!

WickedW 2009-11-11 16:18:28

@Dan: you're right, though it shouldn't break; MS misbehaves against the XML recommendation here. You can resolve it in several ways, but the easiest is to use the Silverlight libs. See http://blogs.msdn.com/xmlteam/archive/2008/08/14/introducing-the-xmlpreloadedresolver.aspx

Abel 2009-11-11 17:35:48
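
For readers following along, the failure Dan describes can be reproduced with a few lines. This is only a sketch, not the code from the answer: the class and method names (NbspWorkaround, Parses, ReplaceNbsp) are made up for illustration, and swapping the entity for its numeric form is a blunt alternative to the resolver approach linked above, assuming &nbsp; is the only named entity you need to survive.

```csharp
using System;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class NbspWorkaround
{
    // True when the fragment parses as XML once wrapped in a single root element.
    public static bool Parses(string fragment)
    {
        try
        {
            new XmlDocument().LoadXml("<root>" + fragment + "</root>");
            return true;
        }
        catch (XmlException)
        {
            // Named HTML entities such as &nbsp; are not declared in plain XML.
            return false;
        }
    }

    // Assumption: replacing the named entity with its numeric character
    // reference is acceptable for the input at hand.
    public static string ReplaceNbsp(string fragment)
    {
        return fragment.Replace("&nbsp;", "&#160;");
    }
}
```

Parses("a&nbsp;b") fails, while Parses(ReplaceNbsp("a&nbsp;b")) succeeds, because numeric character references need no DTD.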

This doesn't handle arbitrary sections of HTML. To fix that, add dummy root tags either side of the input string, i.e. String.Format("<dummyroottag>{0}</dummyroottag>", inputString). Then, just before the final output (above the Debug tag), add navigator.MoveToFirstChild(); which will give you back the basic HTML block. Cracking stuff though, +1

Lazarus 2010-04-15 22:40:48

@Lazarus: in review, I agree that the implementation above is a bit limited, but that's perhaps the nature of such short snippets. Glad you like it :)

Abel 2010-04-18 19:39:51

Answer 2

A:

This is complicated and, as far as I can see, none of the PHP solutions is perfect. What if the text is:

substr("Hello, my <strong>name is <em>Sam</em>. I&acute;m a 
  web developer.  And this text is very long and all the text 
  is inside the sam html tag..</strong>",0,26)."..."

You will actually have to iterate through the whole text to find the end of the starting strong-tag.

My advice to you is to strip all HTML in the summary. Remember to use HTML sanitizing if you are showing users' own HTML code!
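
In C#, the strip-everything approach might look roughly like the sketch below. The names PlainTextSummary and Summarize are invented for this example, and the tag-stripping regex assumes reasonably tidy markup (no ">" inside attribute values, no script or style blocks).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class PlainTextSummary
{
    public static string Summarize(string html, int maxChars)
    {
        // Drop anything that looks like a tag.
        string withoutTags = Regex.Replace(html, "<[^>]*>", string.Empty);

        // Decode entities so that &acute; counts as one character, not seven.
        string text = WebUtility.HtmlDecode(withoutTags);

        if (text.Length <= maxChars) return text;
        return text.Substring(0, maxChars) + "...";
    }
}
```

The result is plain text, so it should be HTML-encoded again (for instance with WebUtility.HtmlEncode) before it is written into a page.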

Good luck :)

stian.net 2009-11-11 12:47:26

Stripping HTML is definitely easiest. But using XML + XPath (for XHTML, or sanitized HTML) to do the job makes this rather trivial. Though the bulk of the work has been getting "removing the rest" right, complex or hard is not the word I'd choose. But, doing the same with text parsing techniques is way harder (which is what PHP uses).

Abel 2009-11-11 16:14:43
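
To give an idea of what the XML + XPath route can look like, here is a minimal sketch. It is not the code from Abel's answer (which is not reproduced on this page); the names XhtmlTruncator and Truncate are hypothetical, and the input is assumed to be well-formed XHTML without undeclared named entities.

```csharp
using System;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class XhtmlTruncator
{
    public static string Truncate(string xhtmlFragment, int maxChars)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        // Wrap the fragment so it has exactly one root element.
        doc.LoadXml("<root>" + xhtmlFragment + "</root>");

        // Walk the text nodes in document order until the budget is used up.
        int remaining = maxChars;
        XmlNode cut = null;
        foreach (XmlNode text in doc.SelectNodes("//text()"))
        {
            if (text.Value.Length >= remaining)
            {
                cut = text;
                break;
            }
            remaining -= text.Value.Length;
        }

        // Everything fits: return the fragment unchanged.
        if (cut == null) return doc.DocumentElement.InnerXml;

        cut.Value = cut.Value.Substring(0, remaining);

        // "Removing the rest": drop every following sibling of the cut node
        // and of each of its ancestors below the wrapper element.
        for (XmlNode node = cut; node != doc.DocumentElement; node = node.ParentNode)
        {
            while (node.NextSibling != null)
                node.ParentNode.RemoveChild(node.NextSibling);
        }

        return doc.DocumentElement.InnerXml;
    }
}
```

Because the tree is edited rather than the string, the closing tags come for free when the document is serialized again.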

Answer 3

+1 A:

Ok. This should work (dirty code alert):

        // Requires: using System.Text.RegularExpressions;
        string blah = "hoi <strong>dit <em>is test bla meer tekst</em></strong>";
        int aantalChars = 10; // number of visible characters to keep

        bool inTag = false;
        int cntr = 0;        // characters consumed, markup included
        int cntrContent = 0; // visible (non-markup) characters consumed
        foreach (char c in blah)
        {
            if (cntrContent == aantalChars) break;

            cntr++;
            if (c == '<')
            {
                inTag = true;
                continue;
            }
            else if (c == '>')
            {
                inTag = false;
                continue;
            }

            if (!inTag) cntrContent++;
        }

        string substr = blah.Substring(0, cntr);

        // Search for tags that were opened but not closed.
        // Caveat: this assumes the unclosed tags are the last ones opened
        // and that they carry no attributes.
        MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
        MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

        // Append the missing closing tags, innermost first.
        for (int i = openedTags.Count - closedTags.Count; i >= 1; i--)
        {
            string closingTag = "</" + openedTags[closedTags.Count + i - 1].Value.Substring(1);
            substr += closingTag;
        }

Jan Jongboom 2009-11-11 13:10:15

Thanks Jan, currently testing this

WickedW 2009-11-11 13:51:01

Looks a bit "Dutch": aantalChars >>> amountChars ;). Looks like an excellent start, but... how does your code operate with `How many doweneedactually?.` when cutting in the middle of `we`?

Abel 2009-11-11 16:20:14

I don't know? Try it? I think `How many dow</b?`

Jan Jongboom 2009-11-12 09:40:05
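
For cuts that land after an inner tag has already been closed, counting opened versus closed tags is not enough, because the tags left open are not necessarily the last ones opened. A stack of open tag names handles that case. The following is a sketch only, with hypothetical names (HtmlTruncator, Truncate), and it assumes well-formed input: no void elements such as a bare br, no comments, and entities are counted character by character.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, for illustration only.
public static class HtmlTruncator
{
    public static string Truncate(string html, int maxChars)
    {
        var output = new StringBuilder();
        var openTags = new Stack<string>();
        int visible = 0; // visible (non-markup) characters copied so far
        int i = 0;

        while (i < html.Length && visible < maxChars)
        {
            if (html[i] == '<')
            {
                int end = html.IndexOf('>', i);
                if (end < 0) break; // malformed tail: stop copying

                string tag = html.Substring(i, end - i + 1);
                output.Append(tag);

                if (tag.StartsWith("</", StringComparison.Ordinal))
                {
                    if (openTags.Count > 0) openTags.Pop();
                }
                else if (!tag.EndsWith("/>", StringComparison.Ordinal))
                {
                    // Keep only the element name, without attributes.
                    string inner = tag.Substring(1, tag.Length - 2);
                    openTags.Push(inner.Split(' ', '\t', '\r', '\n')[0]);
                }
                i = end + 1;
            }
            else
            {
                output.Append(html[i]);
                visible++;
                i++;
            }
        }

        // Close whatever is still open, innermost first.
        while (openTags.Count > 0)
            output.Append("</").Append(openTags.Pop()).Append('>');

        return output.ToString();
    }
}
```

With the sample string from the answer and a limit of 10, this yields the same result as the original snippet, and it also copes with attributes on the opening tag.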
