How can I write a function that cuts a string containing HTML tags down to roughly N characters without breaking any of the tags?

The returned string doesn't need to be exactly N characters long. If the cut point falls inside a tag, the function may cut either before or after that tag.

Visit <a href="www.htz.hr">Croatia</a> this summer.

CutIt(9) should return

Visit

or

Visit <a href="www.htz.hr">Croatia</a>
A: 

When I ran into this problem (for an RSS feed), I just called strip_tags before cutting the string.
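
A minimal sketch of that strip-then-cut approach in JavaScript, assuming the markup can simply be thrown away. strip_tags is a PHP function; the stripTags helper here is a hypothetical, naive regex stand-in (not a real HTML parser, so it will misbehave on comments or on a ">" inside an attribute value), and cutPlainText is just an illustrative name.

```javascript
// Hypothetical helper: naive stand-in for PHP's strip_tags.
// Removes anything that looks like <...>; this is not a real HTML parser.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "");
}

// Strip the markup first, then cut the remaining plain text to `limit` characters.
function cutPlainText(html, limit) {
  return stripTags(html).slice(0, limit);
}

console.log(cutPlainText('Visit <a href="www.htz.hr">Croatia</a> this summer.', 9));
// prints "Visit Cro"
```

The trade-off is that the link itself is lost, so this only fits cases where plain text is acceptable.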

Thinker
+1  A: 

This might be overkill, but try looking up AWK. It can do this kind of thing pretty easily, since it's built around processing text.

You can also write a custom parsing script like:

string s = "Visit <a href=\"www.htz.hr\">Croatia</a> this summer."
result = ""
slice_limit = 9
i = 0            // visible (non-tag) characters copied so far
j = 0            // current position in s
in_tag = false
while i < slice_limit and j < s.size do
  if s[j] == "<" then in_tag = true
  if !in_tag then i++            // only characters outside tags count toward the limit
  result += s[j]
  if s[j] == ">" then in_tag = false
  j++
end

... or something like that (haven't tested, but it gives you the idea).

EDIT: You will also have to add something to detect whether each tag has been closed or not (add a flag like in_tag, combine it with a regular expression, and it should work). Hope that helps.
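
One possible way to do that closed-tag detection, sketched in JavaScript and not meant as a definitive implementation: keep a stack of open element names while copying, and append the missing closing tags once the limit is reached. The names cutHtml and VOID_TAGS are made up for this sketch, the void-tag list is deliberately partial, and the code assumes simple, well-formed markup (no comments, no ">" inside attribute values).

```javascript
// Cut `html` after `limit` visible characters, then close any tags left open.
// Assumes simple, well-formed markup: no comments, no '>' inside attribute values.
const VOID_TAGS = new Set(["br", "hr", "img", "input", "meta", "link"]); // partial list

function cutHtml(html, limit) {
  const openTags = [];   // stack of names of elements opened but not yet closed
  let result = "";
  let visible = 0;       // characters copied so far that are outside any tag
  let j = 0;

  while (j < html.length && visible < limit) {
    if (html[j] === "<") {
      const end = html.indexOf(">", j);
      if (end === -1) break;                  // unterminated tag: stop before it
      const tag = html.slice(j, end + 1);     // the whole tag is copied intact
      const name = (tag.match(/^<\/?\s*([a-zA-Z0-9]+)/) || [])[1];
      if (name) {
        const lower = name.toLowerCase();
        if (tag.startsWith("</")) {
          // Closing tag: pop only if it matches the innermost open element.
          if (openTags[openTags.length - 1] === lower) openTags.pop();
        } else if (!tag.endsWith("/>") && !VOID_TAGS.has(lower)) {
          openTags.push(lower);
        }
      }
      result += tag;
      j = end + 1;
    } else {
      result += html[j];
      visible++;
      j++;
    }
  }

  // Close whatever is still open, innermost first.
  while (openTags.length > 0) {
    result += "</" + openTags.pop() + ">";
  }
  return result;
}

console.log(cutHtml('Visit <a href="www.htz.hr">Croatia</a> this summer.', 9));
// prints: Visit <a href="www.htz.hr">Cro</a>
```

Because tags are copied whole and never counted toward the limit, the cut can only fall between visible characters, so no tag is ever split.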

EDIT2: It would help if you said which language you want to use. JavaScript?

marcgg
I prefer C#, but pseudocode is OK too.
Ante B.
cool, because I don't really know C# ^^
marcgg
A: 

In JavaScript, you can use the textContent property of a DOM element to get its text without the tags.

HTML

<p id='mytext'>Hey <a href="#">Visit Croatia</a> today</p>

Javascript

var el = document.getElementById("mytext");
console.log( el.textContent );
//alert( el.textContent ); // if you don't have firebug.
garrow
+1  A: 
static string CutIt(string s, int limit)
{
  // Clamp the length so a limit past the end of the string does not throw.
  s = s.Substring(0, Math.Min(limit, s.Length));
  int openMark = s.LastIndexOf('<');
  if (openMark != -1)
  {
    int closeMark = s.LastIndexOf('>');
    if (openMark > closeMark)
    {
      s = s.Substring(0, openMark);
    }
  }
  return s.Trim();
}

public static void Main()
{
  Console.WriteLine(
    CutIt("Visit <a href=\"www.htz.hr\">Croatia</a> this summer.", 9)
  ); // prints "Visit"
}
waqas
Note that if the cut falls on the </a> tag, this cuts just before "</" and the <a> tag before it stays open. That's the case we want to prevent!
Ante B.
A: 

I solved the problem; here is the code in C#:

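static string CutIt(string s, int limit)
{
    // Nothing to cut; using <= also keeps s[i + 1] below within bounds.
    if (s.Length <= limit) return s;

    int okLength = 0;           // length of the longest prefix with no open tags
    bool inClosingTag = false;  // true while inside a "</...>" tag
    int numOpenTags = 0;        // elements opened but not yet closed
    // Note: tags like <br> (no slash, no closing tag) are counted as open and never closed.

    for (int i = 0; i < limit; i++)
    {
        if (s[i] == '<')
        {
            if (s[i + 1] == '/')
            {
                inClosingTag = true;
            }
            else
            {
                numOpenTags++;
            }
        }
        if (s[i] == '>')
        {
            if (i > 0 && s[i - 1] == '/')
            {
                numOpenTags--;          // self-closing tag such as <br/>
            }
            if (inClosingTag)
            {
                numOpenTags--;          // closing tag ends its element
                inClosingTag = false;   // reset so later '>' are not miscounted
            }
        }

        if (numOpenTags == 0) okLength = i + 1;
    }
    return s.Substring(0, okLength);
}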
Ante B.