ansaurus

Question

Answer 1

A:

<(img|IMG)[^>]*>

mathk 2010-07-21 08:07:11

@mathk, thank you. Your code matches img tags (am I right?), but what I want to do is match non-img tags.

Freewind 2010-07-21 08:17:55

You can just ignore case

abatishchev 2010-07-21 08:18:37

@Freewind Then replace it with an empty string.

mathk 2010-07-21 08:22:31

@mathk, I want to keep img tags and remove the other tags. (Maybe my poor English misled you.)

Freewind 2010-07-21 08:30:22

@Freewind Why would you want to match non-img tags? Simply match the img tags and keep them. Either you say what you want, or you say what you don't want. In your case the thing you want is shorter to express than the thing you don't want. (Unless the set you are talking about is infinite, but that is another question you would have to ask Gödel ;) )

mathk 2010-07-21 08:45:00

@mathk, thank you! See my updated question

Freewind 2010-07-21 09:01:44

Answer 2

+4 A:

Do not use a RegEx to parse HTML. See here for a compelling demonstration of why.

Use an HTML parser for your language/platform.

Here is a Java one (HTML Parser)
For .NET, the HTML Agility Pack is recommended
For Ruby, there is Nokogiri, though I am not a Ruby dev, so I don't know how good it is

Oded 2010-07-21 08:07:54

@Oded, thank you. I don't want to parse the HTML; it is too heavy for my simple task. I think regex is the best tool for this, although I don't know how to write it :)

Freewind 2010-07-21 08:16:34

@Freewind - HTML is not a regular language, and _cannot_ be reliably parsed by RegEx, as the first link I posted demonstrates. You should use the right tool for the job. If you know _exactly_ what format your HTML will be coming in, string replace may even be enough...

Oded 2010-07-21 08:18:46

I still want to use regex. I don't need exactly correct handling; if it works most of the time, that is OK.

Freewind 2010-07-21 08:31:54

Why the downvote?

Oded 2010-07-21 10:22:07

-1 for using an uneducated and inexperienced mantra. If you can't answer the question, don't answer it. YOU ARE COMPLETELY, 100%, WRONG. I would tell you that, yes, in general, there may be better ways of dealing with HTML, however your blind and unthinking reflex is clearly preventing you from contributing positively.

PP 2010-07-21 10:23:13

@PP - what _are_ you on? Bad day at the office? Looking at the question, my answer stands. For the kind of cleanup of arbitrary HTML that he wants, RegEx is not a good solution.

Oded 2010-07-21 10:26:32

PP : Some specifics on why you feel the answer is inappropriate would be welcome. As it stands, your characterisation of the answer as "blind and unthinking" doesn't add anything to the discussion.

Noufal Ibrahim 2010-07-21 10:59:29

@Oded, I think your answer is helpful, and thank you for your work. Don't be sad

Freewind 2010-07-22 02:36:15

@Freewind - that's kind of you. I am not sad. I believe anyone is entitled to their own opinion, but I would like to see something more than "you are wrong"; a reasoned why would be good :)

Oded 2010-07-22 07:24:25

Answer 3

A:

A simple answer to why you should not use a regex:

A regexp can't parse a recursive grammar such as:

S -> (S)
S -> Empty

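This is because recognizing this kind of grammar needs an unbounded number of states (the nesting depth has to be counted), while a regular expression only has a finite number.

Since HTML has a recursive grammar, you can't simply use a regexp on it in general: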

SPAN -> <span>SPAN</span>
SPAN -> text

But in your case the task (dropping tags one at a time, without pairing them) can be expressed as a regular expression that is not recursive.
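
To make the recursion point concrete, here is a minimal Java sketch. The class name NestedSpanDemo and the helper firstSpan are made up for illustration, and the pattern is just the obvious non-recursive attempt, not anything from the question. On nested spans it pairs the outer opening tag with the inner closing tag, because the pattern has no way to count depth.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedSpanDemo {
    // Hypothetical helper: returns the first substring matched by a
    // non-recursive "<span>...</span>" pattern, or null if nothing matches.
    static String firstSpan(String html) {
        Matcher m = Pattern.compile("<span>.*?</span>").matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String nested = "<span>a<span>b</span>c</span>";
        // The lazy quantifier stops at the FIRST closing tag, so the match
        // is not the balanced outer element.
        System.out.println(firstSpan(nested)); // prints <span>a<span>b</span>
    }
}
```

Switching to a greedy `.*` does not fix this; with two sibling spans it would then swallow both as one match.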

mathk 2010-07-21 08:39:26

Answer 4

+1 A:

I tried a lot, and this regular expression seems to work for me:

(?i)<(?!img|/img).*?>

My code is:

html.replaceAll('(?i)<(?!img|/img).*?>', '');
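
The call above uses single-quoted strings, which plain Java would not accept. A rough plain-Java sketch of the same idea follows, with two changes that are my own assumptions rather than part of the original: `\b` after `img`, so that a tag such as `<imgx>` is not kept by accident, and `[^>]*` instead of `.*?`, so that a tag broken across lines is still matched. The names StripTagsExceptImg and stripExceptImg are invented for the example.

```java
import java.util.regex.Pattern;

public class StripTagsExceptImg {
    // Matches any tag whose name is not exactly "img" (case-insensitive).
    // The negative lookahead rejects "<img ...>" and "</img>" only.
    private static final Pattern NON_IMG_TAG =
        Pattern.compile("(?i)<(?!/?img\\b)[^>]*>");

    // Hypothetical helper: removes every tag except img tags.
    static String stripExceptImg(String html) {
        return NON_IMG_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b><IMG src=\"a.png\"><br/></p>";
        System.out.println(stripExceptImg(html));
        // prints Hello world<IMG src="a.png">
    }
}
```

This is still only a "works most of the time" approach: a `>` inside an attribute value, or the contents of comments and script blocks, will not be handled correctly.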

Freewind 2010-07-21 09:13:19


How to remove all html tags except img?
