Is there a regular expression that can completely remove an HTML tag? By the way, I'm using Java.
You should use an HTML parser instead. I like HtmlCleaner, because it gives me a pretty-printed version of the HTML.
With HtmlCleaner you can do:
TagNode root = htmlCleaner.clean( stream );
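// attributes need the @ prefix in XPath; evaluateXPath throws XPatherException
Object[] found = root.evaluateXPath( "//div[@id='something']" );
// check the first element, not the array itself
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();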
}
No. Regular expressions cannot, by definition, parse arbitrary HTML: HTML allows arbitrarily nested structure, which a regular expression cannot describe.
You could use a naive substitution such as s/<[^>]*>//, but it is going to be insufficient, especially if you also want to remove the contents of tags.
As another poster said, use an actual HTML parser.
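To make the limitation concrete, here is a minimal sketch of the naive substitution in Java. The class and method names (NaiveRegexDemo, stripNaively) are made up for illustration and the two inputs are just sample strings; it shows one attribute value that confuses the pattern and one case where the tag contents are left behind.

```java
final class NaiveRegexDemo {
    // Hypothetical helper: the naive "remove anything between < and >" approach.
    static String stripNaively(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A '>' inside an attribute value ends the match too early,
        // so part of the tag leaks into the output.
        System.out.println(stripNaively("<a title=\"a > b\">link</a>"));
        // prints: b">link (with a leading space)

        // The tags are removed but the script body remains.
        System.out.println(stripNaively("<script>alert(1)</script>done"));
        // prints: alert(1)done
    }
}
```

A parser handles both cases because it tracks quoting and element boundaries, which the pattern cannot.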
If you just need to remove tags then you can use this regular expression:
content = content.replaceAll("<[^>]+>", "");
It removes only the tags themselves, not other HTML constructs. For anything more complex, you should use a parser.
EDIT: To avoid problems with HTML comments, remove them first (the (?s) flag lets . match line breaks, so comments spanning several lines are covered too):
content = content.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
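As a rough sketch, the two replacements can be wrapped in a small helper with precompiled patterns. The class and method names (TagStripper, stripTags) are assumptions for illustration, and the (?s) flag is added on the assumption that comments may span several lines. It is still only suitable for simple input.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // (?s) makes '.' match line terminators, so multi-line comments are removed.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Hypothetical helper: drop comments first, then any remaining tags.
    static String stripTags(String html) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        return TAG.matcher(withoutComments).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // prints: Hello world
        System.out.println(stripTags("x<!-- a > b -->y"));          // prints: xy
        System.out.println(stripTags("a<!-- line1\nline2 -->b"));   // prints: ab
    }
}
```

Removing comments first matters for the second input: the tag pattern alone would stop at the '>' inside the comment and leave " b -->" in the output.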
Alternatively, if your intent is to display user-controlled input back to the client, then you can also just replace every < with &lt; and every > with &gt;. This way the HTML won't be interpreted as markup by the client's application (the web browser).
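A minimal sketch of that replacement in plain Java follows. The class and method names (HtmlEscaper, escapeHtml) are hypothetical, and escaping &, " and ' in addition to < and > is an extra assumption beyond what is described above, made so the result also stays safe inside attribute values.

```java
final class HtmlEscaper {
    // Hypothetical helper: replace HTML-significant characters with entities.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break; // assumption: also escape ampersands
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break; // assumption: quotes, for attribute values
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<script>alert('XSS');</script>"));
        // prints: &lt;script&gt;alert(&#39;XSS&#39;);&lt;/script&gt;
    }
}
```

In a real project an existing, well-tested escaping utility is probably a better choice than a hand-written loop like this one.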
If you're using JSP as the view technology, then you can use JSTL's c:out for this. It escapes XML special characters (<, >, &, ' and ") by default. So, for example,
<c:out value="<script>alert('XSS');</script>" />
will NOT trigger the alert, but just show the actual string as plain text.
You can use this simple code to remove all HTML tags (strings are immutable, so assign the result):
htmlString = htmlString.replaceAll("\\<.*?\\>", "");