I am using this pattern to remove all HTML tags (Java code):

String html="text <a href=#>link</a> <b>b</b> pic<img src=#>";
html=html.replaceAll("\\<.*?\\>", "");

System.out.println(html);

Now I want to keep the <a ...> tag (along with </a>) and the <img ...> tag.

I want the result to be:

text <a href=#>link</a> b pic<img src=#>

How can I do this?


I don't want to use an HTML parser for this, because I need the regex pattern to filter a large number of HTML fragments, so I am looking for a regex-based solution.

A: 

Check out http://sourceforge.net/projects/regexcreator/ . It is a very handy GUI regex editor.

Gadolin
Thank you. I can run this editor, but I don't know how to create the regex pattern for my problem; my regex skills are poor.
Zenofo
+1  A: 

You could do this using a negative lookahead:

"<(?!(?:a|/a|img)\\b).*?>"

Rubular

However, this has a number of problems, and I would recommend using an HTML parser instead if you want a robust solution.
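
As a rough sketch of how the lookahead pattern above behaves in Java, the snippet below applies it to the string from the question. The class name TagFilter, the constant UNWANTED_TAG and the helper stripExceptLinksAndImages are hypothetical names chosen for illustration, not part of any library.

```java
import java.util.regex.Pattern;

public class TagFilter {
    // Matches any tag unless "<" is immediately followed by a, /a or img
    // and then a word boundary (so <abbr> is still removed).
    // UNWANTED_TAG is a hypothetical name used only in this sketch.
    private static final Pattern UNWANTED_TAG =
            Pattern.compile("<(?!(?:a|/a|img)\\b).*?>");

    // Hypothetical helper: strips every tag except <a ...>, </a> and <img ...>.
    public static String stripExceptLinksAndImages(String html) {
        return UNWANTED_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "text <a href=#>link</a> <b>b</b> pic<img src=#>";
        System.out.println(stripExceptLinksAndImages(html));
        // prints: text <a href=#>link</a> b pic<img src=#>
    }
}
```

Two limitations of this sketch: the match is case-sensitive, so <A HREF=#> would be stripped, and `.` does not match line terminators by default, so a tag that spans several lines would be left in place.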

For more information see this question:

Mark Byers
Thanks. I tried the pattern `html=html.replaceAll("<(?!(?:a|/a|img)\b).*?>", "");` but nothing happens.
Zenofo
In Java you need to escape backslashes. I've corrected my post.
Mark Byers
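
To illustrate the escaping point in the comment above: inside a Java string literal, "\b" is the backspace character (U+0008), while "\\b" passes the two characters \b to the regex engine, where they mean a word boundary. The sketch below compares both spellings; the class name EscapeDemo and the helper strip are hypothetical.

```java
public class EscapeDemo {
    // Hypothetical helper: removes everything the given regex matches.
    public static String strip(String html, String regex) {
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String html = "text <a href=#>link</a> <b>b</b> pic<img src=#>";

        // Single backslash: the lookahead now requires a backspace character
        // after the tag name, which never occurs, so every tag is removed.
        System.out.println(strip(html, "<(?!(?:a|/a|img)\b).*?>"));
        // prints: text link b pic

        // Double backslash: the regex engine sees \b (word boundary).
        System.out.println(strip(html, "<(?!(?:a|/a|img)\\b).*?>"));
        // prints: text <a href=#>link</a> b pic<img src=#>
    }
}
```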
A: 

Hey! Here is your answer:

You can’t parse [X]HTML with regex.

krmby
Hmm. You can. I agree it's a bad idea though.
Spudley
A: 

Use a proper HTML parser, for example htmlparser, Jericho or the validator.nu HTML parser. Then use the parser’s API, SAX or DOM to pull out the stuff you’re interested in.

If you insist on using regular expressions, you’re almost certain to make some small mistake that will lead to breakage, and possibly to cross-site scripting attacks, depending on what you’re doing with the markup.

See also this answer.

Daniel Cassidy