ansaurus

Question

Answer 1

+11 A:

You should use an HTML parser instead. I like htmlCleaner, because it gives me a pretty-printed version of the HTML.

With htmlCleaner you can do:

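TagNode root = htmlCleaner.clean( stream );
// "@id" selects the id attribute; a bare "id" would look for a child element named id
Object[] found = root.evaluateXPath( "//div[@id='something']" );
// test the first match, not the array itself
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}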

tangens 2009-11-09 06:05:36

Thanks for pointing me to htmlCleaner :)

exhuma 2009-11-09 12:16:57

Answer 2

+2 A:

No. Regular expressions cannot, by definition, parse HTML.

You could use a regex to s/<[^>]*>// or something naive like that but it's going to be insufficient, especially if you're interested in removing the contents of tags.

As another poster said, use an actual HTML parser.

Moishe 2009-11-09 06:13:38
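
To illustrate the point about tag contents, here is a minimal sketch (the class name, method name, and sample input are made up for illustration). It applies the naive pattern from the answer above and shows that the markup disappears while the script body stays behind as plain text:

```java
class NaiveStripDemo {
    // Naive tag stripping: deletes everything from a '<' up to the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<p>Hello</p><script>alert('x');</script>";
        // The tags are removed, but the text inside <script> is kept.
        System.out.println(stripTags(html)); // prints: Helloalert('x');
    }
}
```

If the goal is to drop an element together with its contents, a pattern like this is not enough on its own.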

Answer 3

A:

If you just need to remove tags then you can use this regular expression:

content = content.replaceAll("<[^>]+>", "");

It will remove only the tags, not other HTML constructs. For anything more complex you should use a parser.

EDIT: To avoid problems with HTML comments you can do the following:

content = content.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");

Superfilin 2009-11-09 07:29:27

Since you do not use any of the metacharacters `.`, `^` and `$`, the `s` and `m` flags can be omitted.

Bart Kiers 2009-11-09 09:50:33

This regex is liable to cause mangling if the HTML contains XML comments with embedded '<' or '>' characters.

Stephen C 2009-11-09 12:24:37
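
A small sketch of the mangling described in the comment above, assuming it refers to the single-pass tag pattern. The class and method names are hypothetical, and the `(?s)` flag in the second method is an addition here so that comments spanning several lines also match:

```java
class CommentMangleDemo {
    // Single pass: treats everything from '<' to the next '>' as a tag.
    static String stripTagsOnly(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    // Two passes: remove comments first, then strip the remaining tags.
    static String stripCommentsThenTags(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String html = "<!-- if a > b -->Hello";
        // The '>' inside the comment ends the match early, so the rest of the comment leaks out.
        System.out.println("[" + stripTagsOnly(html) + "]");         // prints: [ b -->Hello]
        System.out.println("[" + stripCommentsThenTags(html) + "]"); // prints: [Hello]
    }
}
```

Removing comments first helps with this particular input, but it is still pattern matching rather than parsing, so other constructs (for example a '>' inside an attribute value) can trip it up in the same way.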

Answer 4

A:

Alternatively, if your intent is to display user-controlled input back to the client, then you can also just replace all < by &lt; and all > by &gt;. This way the HTML won't be interpreted as markup by the client's application (the web browser).

If you're using JSP as view technology, then you can use JSTL's c:out for this. By default it escapes the characters that are special in HTML and XML (<, >, &, and both quote characters). So for example

<c:out value="<script>alert('XSS');</script>" />

will NOT display the alert, but just show the actual string as is.

BalusC 2009-11-09 14:09:23
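
Outside of JSP, the same replacement can be sketched in plain Java. This is only an illustration: `escapeHtml` is a hypothetical helper, and escaping `&` and the quote characters goes beyond the two replacements named in the answer, on the assumption that the text may also end up inside attribute values:

```java
class HtmlEscapeDemo {
    // Replaces characters that are special in HTML with character references.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The browser would show the text of the script instead of running it.
        System.out.println(escapeHtml("<script>alert('XSS');</script>"));
        // prints: &lt;script&gt;alert(&#39;XSS&#39;);&lt;/script&gt;
    }
}
```

Handling each character in a single pass avoids the double-escaping that can happen with chained replace calls if `&` is not replaced first.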

Answer 5

A:

You can use this simple code to remove all HTML tags:

htmlString = htmlString.replaceAll("\\<.*?\\>", "");

Kandhasamy 2010-09-03 10:13:08
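
One thing worth checking with the pattern above: without the DOTALL flag, `.` does not match line terminators, so a tag that spans two lines is skipped. A minimal sketch with made-up class and method names, comparing it to the negated character class used in the earlier answers:

```java
class LazyDotDemo {
    // Lazy dot: stops at the first '>', but cannot cross a line break.
    static String stripLazyDot(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // Negated class: matches any character except '>', including newlines.
    static String stripNegatedClass(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<a\nhref='x'>link</a>";
        // Only the closing tag is removed; the two-line opening tag is left in place.
        System.out.println(stripLazyDot(html));
        // Both tags are removed, leaving just the text "link".
        System.out.println(stripNegatedClass(html));
    }
}
```

Prefixing the lazy pattern with `(?s)` would make the two behave alike for this input. Also note that `replaceAll` returns a new string, so the result has to be assigned or used.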

How to remove HTML tag in Java
