I have some HTML text that contains all kinds of tags, such as <table>, <a>, <img>, and so on.
Now I want to use a regular expression to remove all the HTML tags except <img ...> and </img> (and the upper-case <IMG></IMG>).
How can I do this?
UPDATE:
My task is very simple: it just prints the text content (including images) of an HTML page as a summary on the front page, so I think a regular expression is good and simple enough.
UPDATE AGAIN
Maybe a sample will make my question easier to understand :)
Here is some HTML text:
<html>
<head></head>
<body>
Hello, everyone. Here is my photo: <img src="xxx.jpg" />.
And, <a href="xxx">know more</a> about me!
</body>
</html>
I want to keep <img>, and remove the other tags. This is what I want:
Hello, everyone. Here is my photo: <img src="xxx.jpg" />. And, know more about me!
Currently my code looks like this:
html.replaceAll("<.*?>", "")
But this removes everything between < and >, including the <img> tags. I want to keep <img xxx> and </img>, and remove only the other tags.
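One possible direction, as a minimal sketch rather than a definitive solution: a negative lookahead can make the pattern skip any tag whose name is img, and a case-insensitive flag covers <IMG> as well. The class name TagStripper and the method stripExceptImg are hypothetical names used only for illustration. The sketch assumes no attribute value contains a literal > and does not treat comments or <script> contents specially.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class TagStripper {
    // "<" that is NOT followed by an optional "/" and the whole word "img",
    // then everything up to the next ">". CASE_INSENSITIVE also spares <IMG>.
    private static final Pattern NON_IMG_TAG =
            Pattern.compile("<(?!/?img\\b)[^>]*>", Pattern.CASE_INSENSITIVE);

    // Removes every tag except <img ...> and </img>; text between tags is kept.
    static String stripExceptImg(String html) {
        return NON_IMG_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<body>Hello. <img src=\"xxx.jpg\" />. "
                + "And, <a href=\"xxx\">know more</a> about me!</body>";
        System.out.println(stripExceptImg(html));
    }
}
```

Line breaks in the input are left untouched, so collapsing the result onto one line would need a separate whitespace step such as replaceAll("\\s+", " ").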
Thanks, everyone!