Is there an existing Java library which provides a method to strip all HTML tags from a String? I'm looking for something equivalent to the 'strip_tags' function in PHP.

I know that I can use a regex as described in this Stack Overflow question; however, I was curious whether there is already a 'stripTags()' method somewhere in the Apache Commons libraries that can be used.

+5  A: 

There may be some, but the most robust approach is to use an actual HTML parser. There's one here, and if the input is reasonably well formed, you can also use SAX or another XML parser.
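
As a rough illustration of the parser approach, here is a minimal sketch that uses the HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`) and keeps only the text nodes. The class and method names (`ParserTextExtractor`, `extractText`) are made up for this example, and the Swing parser is just one parser that happens to ship with Java, not necessarily the one linked above. Its whitespace handling may differ from what a browser would show.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text nodes reported by the JDK's HTML parser.
final class ParserTextExtractor {

    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The callback is invoked once per text run; tags are simply not recorded.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };

        try {
            // "true" tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader should not fail, but the API declares IOException.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Calling `ParserTextExtractor.extractText("<p>Hello <b>world</b></p>")` should return the text without the surrounding tags. A real parser also copes with malformed markup that a regex tends to get wrong.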

Charlie Martin
+3  A: 

I've used NekoHTML to do that. It can strip all tags, but it can just as easily keep or strip only a subset of tags.

Solomon Duskis
+4  A: 

Whatever you do, make sure you normalize the data before you start trying to strip tags. I recently attended a web app security workshop that covered XSS filter evasion. One would normally think that searching for <, &lt;, or its hex equivalent would be sufficient. I was blown away after seeing a slide with 70+ ways that < can be encoded to beat filters.
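
To make the normalization step concrete, here is a small sketch that decodes numeric character references (decimal and hex, with or without the trailing semicolon) plus `&lt;` and `&gt;` back into literal characters before any tag stripping happens. The class and method names (`EntityNormalizer`, `normalize`) are invented for this example, and it covers only a handful of the encodings mentioned above, so treat it as an illustration rather than a complete filter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes a few common encodings of '<' and '>' (and other
// numeric character references) so a later stripping step sees the real characters.
final class EntityNormalizer {

    // Matches &#60; &#060 &#x3c; &#X3C; etc. The semicolon is optional on purpose.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));?");

    static String normalize(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            // Leave the reference untouched if it is not a valid Unicode code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        // Named entities for the angle brackets; other named entities are not handled.
        return out.toString().replace("&lt;", "<").replace("&gt;", ">");
    }
}
```

For example, `EntityNormalizer.normalize("&#60;b&#x3E;hi&lt;/b&gt;")` yields `<b>hi</b>`, which a stripping step can then handle like any other tag.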

jkf
+2  A: 

After having this question open for almost a week, I can say with some certainty that there is no method available in the Java API or Apache libraries which strips HTML tags from a String. You would either have to use an HTML parser as described in the previous answers, or write a simple regular expression to strip out the tags.
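
A simple regular expression of the kind mentioned here might look like the sketch below. The class and method names (`RegexTagStripper`, `stripTags`) are hypothetical, and the pattern is deliberately naive: it removes anything between `<` and the next `>`, so it will misbehave on input such as a `>` inside an attribute value, and it does not decode entities.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a naive regex-based stand-in for PHP's strip_tags.
final class RegexTagStripper {

    // '<', then any run of characters that are not '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String html) {
        if (html == null) {
            return null;
        }
        return TAG.matcher(html).replaceAll("");
    }
}
```

With this sketch, `RegexTagStripper.stripTags("<p>Hello <b>world</b></p>")` returns `Hello world`.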

Todd
+1  A: 

Wicket uses the following method to escape HTML, located in org.apache.wicket.util.string.Strings. Note that it escapes markup characters rather than stripping tags:

public static CharSequence escapeMarkup(final String s, final boolean escapeSpaces,
 final boolean convertToHtmlUnicodeEscapes)
{
 if (s == null)
 {
  return null;
 }
 else
 {
  int len = s.length();
  final AppendingStringBuffer buffer = new AppendingStringBuffer((int)(len * 1.1));

  for (int i = 0; i < len; i++)
  {
   final char c = s.charAt(i);

   switch (c)
   {
    case '\t' :
     if (escapeSpaces)
     {
      // Assumption is four space tabs (sorry, but that's
      // just how it is!)
      buffer.append("&nbsp;&nbsp;&nbsp;&nbsp;");
     }
     else
     {
      buffer.append(c);
     }
     break;

    case ' ' :
     if (escapeSpaces)
     {
      buffer.append("&nbsp;");
     }
     else
     {
      buffer.append(c);
     }
     break;

    case '<' :
     buffer.append("&lt;");
     break;

    case '>' :
     buffer.append("&gt;");
     break;

    case '&' :

     buffer.append("&amp;");
     break;

    case '"' :
     buffer.append("&quot;");
     break;

    case '\'' :
     buffer.append("&#039;");
     break;

    default :

     if (convertToHtmlUnicodeEscapes)
     {
      int ci = 0xffff & c;
      if (ci < 160)
      {
       // nothing special only 7 Bit
       buffer.append(c);
      }
      else
      {
       // Not 7 Bit use the unicode system
       buffer.append("&#");
       buffer.append(new Integer(ci).toString());
       buffer.append(';');
      }
     }
     else
     {
      buffer.append(c);
     }

     break;
   }
  }

  return buffer;
 }
}
Arthur