views:

1370

answers:

5

Is there a regular expression that can completely remove an HTML tag? By the way, I'm using Java.

+11  A: 

You should use an HTML parser instead. I like HtmlCleaner, because it gives me a pretty-printed version of the HTML.

With htmlCleaner you can do:

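// htmlCleaner is an org.htmlcleaner.HtmlCleaner instance; stream holds the HTML input
TagNode root = htmlCleaner.clean( stream );
// id is an attribute, so the XPath needs @id
Object[] found = root.evaluateXPath( "//div[@id='something']" );
// check the first match, not the array itself
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}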
tangens
Thanks for pointing me to htmlCleaner :)
exhuma
+2  A: 

No. Regular expressions cannot, by definition, parse HTML: tags can nest arbitrarily deep, and a regular expression cannot keep track of that nesting.

You could use a regex like s/<[^>]*>//g or something similarly naive, but it's going to be insufficient, especially if you're interested in removing the contents of tags.

As another poster said, use an actual HTML parser.

Moishe
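To make the "insufficient" part concrete, here is a minimal sketch in Java of how the naive pattern behaves. The class and method names are made up for illustration. It shows two failure cases: the contents of a script element survive as plain text, and a '>' inside an attribute value ends the match too early.

```java
public class NaiveTagStripDemo {
    // The naive pattern from the answer above: everything from '<' up to the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The tags are removed, but the script body is left behind as text.
        System.out.println(stripTags("<p>Hi</p><script>alert(1)</script>"));
        // prints: Hialert(1)

        // The '>' inside the title attribute ends the match early,
        // so part of the tag leaks into the output.
        System.out.println(stripTags("<a title=\"a > b\">link</a>"));
        // prints:  b">link
    }
}
```

An HTML parser avoids both problems because it knows where a tag really ends and which text belongs to which element.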
A: 

If you just need to remove tags then you can use this regular expression:

content = content.replaceAll("<[^>]+>", "");

It removes only the tags themselves, not other HTML constructs such as comments, and it leaves the text between tags in place. For anything more complex you should use a parser.

EDIT: To avoid problems with HTML comments you can do the following:

// (?s) lets '.' match line breaks, so comments spanning several lines are removed too
content = content.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
Superfilin
Since you do not use any of the meta characters `.`, `^` and `$`, the `s` and `m` flags can be omitted.
Bart Kiers
This regex is liable to cause mangling if the HTML contains XML comments with embedded '<' or '>' characters.
Stephen C
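A small sketch of the mangling described in the comment above, assuming a comment that contains a '>' and spans two lines. The class and method names are hypothetical. The second method strips comments first and adds the (?s) flag so that '.' also matches line breaks.

```java
public class CommentStripDemo {
    // Tag-only pattern: stops at the first '>' it sees, even inside a comment.
    static String tagsOnly(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    // Comments first, then tags. (?s) makes '.' match line breaks as well.
    static String commentsThenTags(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String html = "<p>x</p><!-- if a > b\n then c --><p>y</p>";
        // Part of the comment leaks into the result.
        System.out.println(tagsOnly(html));
        // The whole comment is gone; only the text "xy" remains.
        System.out.println(commentsThenTags(html));
    }
}
```

Even the second variant is only a stopgap: a "-->" inside a script string, for example, would still confuse it.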
A: 

Alternatively, if your intent is to display user-controlled input back to the client, then you can also just replace all < with &lt; and all > with &gt;. This way the HTML won't be interpreted by the client's application (the web browser) but shown as plain text.

If you're using JSP as the view technology, then you can use JSTL's c:out for this. By default it escapes the XML special characters (<, >, &, ' and "). So for example

<c:out value="<script>alert('XSS');</script>" />

will NOT display the alert, but just show the actual string as is.

BalusC
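Outside of JSP, the same idea can be sketched in plain Java. The helper below is hypothetical and not part of any library; it replaces the five characters that matter in HTML text and attribute values. The exact entity spellings that c:out emits may differ from the ones chosen here.

```java
public class HtmlEscapeDemo {
    // Hypothetical helper: replaces HTML special characters with entities.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The browser would display the script source instead of running it.
        System.out.println(escapeHtml("<script>alert('XSS');</script>"));
    }
}
```

Escaping '&' as well matters: otherwise input that already contains something like &lt; would be shown as < instead of the literal text.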
A: 

You can use this simple code to remove all HTML tags:

htmlString = htmlString.replaceAll("<.*?>", "");
Kandhasamy