I have some HTML text that contains all kinds of HTML tags, such as <table>, <a>, <img>, and so on.

Now I want to use a regular expression to remove all the HTML tags except <img ...> and </img> (and the upper-case <IMG> and </IMG>).

How can I do this?


UPDATE:

My task is very simple: it just prints the text content (including images) of an HTML page as a summary on the front page, so I think a regular expression is good and simple enough.


UPDATE AGAIN

Maybe a sample will make my question easier to understand :)

Here is some HTML text:

<html>
  <head></head>
  <body>
     Hello, everyone. Here is my photo: <img src="xxx.jpg" />. 
     And, <a href="xxx">know more</a> about me!
  </body>
</html>

I want to keep <img>, and remove the other tags. The following is what I want:

Hello, everyone. Here is my photo: <img src="xxx.jpg" />. And, know more about me!

My current code is:

html.replaceAll("<.*?>", "")

But this removes everything between < and >. I want to keep <img xxx> and </img>, and remove the other content between < and >.

Thanks, everyone!

A: 
<(img|IMG)*>*</(img|IMG)>
mathk
@mathk, thank you. Your code matches img tags (am I right?), but what I want to do is match non-img tags.
Freewind
You can just ignore case
abatishchev
@Freewind Then replace it by an empty string
mathk
@mathk, I want to keep img tags, and remove the other tags. (Maybe my poor English misled you)
Freewind
@Freewind Why would you want to match non-img tags? Simply match the img tags and keep them. Either you say what you want, or you say what you don't want. In your case, the thing you want is shorter to express than the thing you don't want. (Unless the set you are talking about is infinite, but that is another question that you would have to ask Gödel ;) )
mathk
@mathk, thank you! See my updated question
Freewind
+4  A: 

Do not use a RegEx to parse HTML. See here for a compelling demonstration of why.

Use an HTML parser for your language/platform.

  • Here is a Java one (HTML Parser)
  • For .NET, the HTML Agility Pack is recommended
  • For Ruby, there is Nokogiri, though I am not a Ruby dev, so I don't know how good it is
Oded
@Oded, thank you. I don't want to parse the HTML; that is too heavy for my simple task. I think regex is the best tool for this, although I don't know how to write it :)
Freewind
@Freewind - HTML is not a regular language, and _cannot_ be reliably parsed by RegEx, as the first link I posted demonstrates. You should use the right tool for the job. If you know _exactly_ what format your HTML will be coming in, string replace may even be enough...
Oded
I still want to use regex. I don't need exactly correct handling; if it works most of the time, that is OK.
Freewind
Why the downvote?
Oded
-1 for using an uneducated and inexperienced mantra. If you can't answer the question, don't answer it. YOU ARE COMPLETELY, 100%, WRONG. I would tell you that, yes, in general, there may be better ways of dealing with HTML, however your blind and unthinking reflex is clearly preventing you from contributing positively.
PP
@PP - what _are_ you on? Bad day at the office? Looking at the question, my answer stands. For the kind of cleanup of arbitrary HTML that he wants, RegEx is not a good solution.
Oded
PP : Some specifics on why you feel the answer is inappropriate would be welcome. As it stands, your characterisation of the answer as "blind and unthinking" doesn't add anything to the discussion.
Noufal Ibrahim
@Oded, I think your answer is helpful, and thank you for your work. Don't be sad
Freewind
@Freewind - that's kind of you. I am not sad, I believe anyone is entitled to their own opinion, but would like to see something more than "you are wrong", a reasoned why would be good :)
Oded
A: 

A simple answer to why you should not use a RegEx:

A regexp can't parse a recursive grammar such as:

S -> (S)
S -> Empty

because recognizing this kind of grammar requires an unbounded number of states, while a regular expression corresponds to a finite-state machine.

Since HTML has a recursive grammar, you can't simply use a regexp:

SPAN -> <span>SPAN</span>
SPAN -> text

But in your case, what you want can be expressed as a regular expression that is not recursive.
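To make the point about state concrete, here is a minimal Java sketch of a recognizer for the S -> (S) | Empty grammar above. The class and method names (`BalancedParens`, `matches`) are hypothetical and chosen only for illustration. The `depth` counter is the part that can grow without bound, which is what a finite-state machine, and so a classical regular expression, does not have.

```java
public class BalancedParens {
    // Recognizes the language of S -> (S) | Empty,
    // i.e. n opening parentheses followed by exactly n closing ones.
    static boolean matches(String s) {
        int i = 0;
        int depth = 0; // unbounded counter: the "memory" a finite automaton lacks

        // Consume the run of '(' and remember how many were seen.
        while (i < s.length() && s.charAt(i) == '(') {
            depth++;
            i++;
        }
        // Consume the run of ')' and pair each one with an earlier '('.
        while (i < s.length() && s.charAt(i) == ')') {
            depth--;
            if (depth < 0) {
                return false; // more ')' than '('
            }
            i++;
        }
        // Valid only if the whole input was consumed and every '(' was closed.
        return i == s.length() && depth == 0;
    }

    public static void main(String[] args) {
        System.out.println(matches("((()))"));
        System.out.println(matches("(()"));
    }
}
```

Removing individual tags, by contrast, only needs to recognize one `<...>` token at a time, with no nesting to track, which is why a non-recursive pattern can be enough for that narrower task.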

mathk
+1  A: 

I tried a lot, and this regular expression seems to work for me:

(?i)<(?!img|/img).*?>

My code is:

html.replaceAll('(?i)<(?!img|/img).*?>', '');
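For reference, here is a self-contained Java sketch of the same negative-lookahead idea, run against the sample from the question. The class and method names (`StripTagsExceptImg`, `stripExceptImg`) are hypothetical. The `(?s)` flag, the `\b` after `img`, and the whitespace collapsing are additions assumed here and are not part of the expression above: `(?s)` lets a tag span several lines, and `\b` stops a tag whose name merely starts with "img" from being kept.

```java
import java.util.regex.Pattern;

public class StripTagsExceptImg {
    // (?i): case-insensitive, so <IMG> is kept as well.
    // (?s): '.' also matches newlines, so a tag may span several lines.
    // (?!/?img\b): refuse to match when the tag is <img ...> or </img>.
    private static final Pattern NON_IMG_TAG =
            Pattern.compile("(?is)<(?!/?img\\b).*?>");

    static String stripExceptImg(String html) {
        String withoutTags = NON_IMG_TAG.matcher(html).replaceAll("");
        // Collapse the whitespace runs left behind by the removed tags.
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html>\n  <head></head>\n  <body>\n"
                + "     Hello, everyone. Here is my photo: <img src=\"xxx.jpg\" />. \n"
                + "     And, <a href=\"xxx\">know more</a> about me!\n"
                + "  </body>\n</html>";
        // prints: Hello, everyone. Here is my photo: <img src="xxx.jpg" />. And, know more about me!
        System.out.println(stripExceptImg(html));
    }
}
```

This is still only an approximation: a `>` inside an attribute value, an HTML comment, or the contents of `<script>` and `<style>` elements will not be handled correctly, so it fits the "works most of the time" requirement rather than arbitrary HTML.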
Freewind