ansaurus

Question

Answer 1

+5 A:

There may be some, but the most robust approach is to use an actual HTML parser. There's one here, and if the HTML is reasonably well formed, you can also use SAX or another XML parser.
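
As a rough illustration of the parser approach, here is a minimal sketch that uses the HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`) rather than any particular third-party library. The class name `HtmlTextExtractor` and the method `extractText` are made up for this example. How the parser treats whitespace and script or style content is not something this sketch controls, so check the output against your own input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects only the text nodes reported by the JDK's HTML parser.
final class HtmlTextExtractor {

    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The callback receives parse events; only text events are kept, tags are ignored.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };

        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader should not fail, but the API declares IOException.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Calling `HtmlTextExtractor.extractText("<p>Hello <b>world</b></p>")` should return the text content without any markup.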

Charlie Martin 2009-05-07 02:29:39

Answer 2

+3 A:

I've used nekoHtml to do that. It can strip all tags but it can just as easily keep or strip a subset of tags.

Solomon Duskis 2009-05-07 03:03:19

Answer 3

+4 A:

Whatever you do, make sure you normalize the data before you start trying to strip tags. I recently attended a web app security workshop that covered XSS filter evasion. One would normally think that searching for < or &lt; or its hex equivalent would be sufficient. I was blown away after seeing a slide with 70+ ways that < can be encoded to beat filters.
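
To make the normalization step concrete, here is a small sketch that decodes one family of encodings, numeric character references such as `&#60;`, `&#x3c` and `&#0000060`, before any tag matching happens. The class `EntityNormalizer` and its method are hypothetical names for this example, and the sketch covers only a fraction of the encodings mentioned above (no named entities, URL encoding, or nested encodings), so treat it as a starting point rather than a complete filter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes decimal and hex numeric character references.
final class EntityNormalizer {

    // Matches &#60; &#x3c; &#X3C; with optional leading zeros and an optional semicolon.
    // Digit counts are capped so that the parsed value always fits in an int.
    private static final Pattern NUMERIC_REF =
        Pattern.compile("&#(?:[xX]0*([0-9a-fA-F]{1,6})|0*([0-9]{1,7}));?");

    static String decodeNumericRefs(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                ? Integer.parseInt(m.group(1), 16)
                : Integer.parseInt(m.group(2), 10);

            // Leave the reference untouched if it does not name a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : m.group();

            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, `EntityNormalizer.decodeNumericRefs("&#60;script&#x3E;")` returns `<script>`, which a later tag-stripping step can then recognize.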

jkf 2009-05-07 03:29:48

Answer 4

+2 A:

After having this question open for almost a week, I can say with some certainty that there is no method available in the Java API or Apache libraries which strips HTML tags from a String. You would either have to use an HTML parser as described in the previous answers, or write a simple regular expression to strip out the tags.
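
For the regular expression route, a minimal sketch could look like the following. The class name `NaiveTagStripper` is made up for this example, and the pattern is deliberately naive: it removes anything between a `<` and the next `>`, so it will mangle text such as `1 < 2 and 3 > 2`, and it does not handle comments, script contents, or a `>` inside an attribute value.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes anything that looks like <...> from the input.
final class NaiveTagStripper {

    // A '<', then any run of characters other than '>', then a closing '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }
}
```

With this sketch, `NaiveTagStripper.stripTags("<p>Hello <b>world</b></p>")` returns `Hello world`.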

Todd 2009-05-13 17:53:59

Answer 5

+1 A:

Wicket uses the following method to escape HTML, located in org.apache.wicket.util.string.Strings:

public static CharSequence escapeMarkup(final String s, final boolean escapeSpaces,
 final boolean convertToHtmlUnicodeEscapes)
{
 if (s == null)
 {
  return null;
 }
 else
 {
  int len = s.length();
  final AppendingStringBuffer buffer = new AppendingStringBuffer((int)(len * 1.1));

  for (int i = 0; i < len; i++)
  {
   final char c = s.charAt(i);

   switch (c)
   {
    case '\t' :
     if (escapeSpaces)
     {
      // Assumption is four space tabs (sorry, but that's
      // just how it is!)
      buffer.append("&nbsp;&nbsp;&nbsp;&nbsp;");
     }
     else
     {
      buffer.append(c);
     }
     break;

    case ' ' :
     if (escapeSpaces)
     {
      buffer.append("&nbsp;");
     }
     else
     {
      buffer.append(c);
     }
     break;

    case '<' :
     buffer.append("&lt;");
     break;

    case '>' :
     buffer.append("&gt;");
     break;

    case '&' :

     buffer.append("&amp;");
     break;

    case '"' :
     buffer.append("&quot;");
     break;

    case '\'' :
     buffer.append("&#039;");
     break;

    default :

     if (convertToHtmlUnicodeEscapes)
     {
      int ci = 0xffff & c;
      if (ci < 160)
      {
       // Below U+00A0 (160): append the character unchanged
       buffer.append(c);
      }
      else
      {
       // U+00A0 and above: emit a decimal numeric character reference
       buffer.append("&#");
       buffer.append(Integer.toString(ci));
       buffer.append(';');
      }
     }
     else
     {
      buffer.append(c);
     }

     break;
   }
  }

  return buffer;
 }
}

Arthur 2009-09-17 01:02:38

Stripping HTML tags in Java
