Let's say I have a string holding a mess of text and (X)HTML tags. I want to remove all instances of a given tag (and any attributes of that tag), leaving all other tags and text alone. What's the best regex to get this done?

Edited to add: Oh, I appreciate that using a Regex for this particular issue is not the best solution. However, for the sake of discussion can we assume that that particular technical decision was made a few levels over my pay grade? ;)

+15  A: 

Attempting to parse HTML with regular expressions is generally an extremely bad idea. Use a parser instead; there should be one available for your chosen language.

You might be able to get away with something like this:

</?tag[^>]*?>

But it depends on exactly what you're doing. For example, that won't remove the tag's content, and it may leave your HTML in an invalid state, depending on which tag you're trying to remove. It also copes badly with invalid HTML (and there's a lot of that about).
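To illustrate the caveats above, here is a minimal JavaScript sketch that applies the pattern with "tag" replaced by "b". The function name stripTagLoose and the sample string are made up for this example. The tag's content is kept, and <br /> gets stripped too, because nothing stops "b" from matching the start of a longer tag name.

```javascript
// Hypothetical helper: applies the pattern above with "tag" replaced by "b".
function stripTagLoose(markup) {
  // "<", an optional "/", the literal name, then anything up to the next ">"
  return markup.replace(/<\/?b[^>]*?>/gi, '');
}

const sample = 'one <b class="x">two</b> three<br />four';

// The content "two" survives, and <br /> is removed along with <b>.
console.log(stripTagLoose(sample)); // → "one two threefour"
```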

Use a parser instead :)

Dan
Dangit, don't ruin the fun for all the people crafting regexes with your obviously correct answer!
Will
You need to make that * non-greedy (*?) or you'll lose everything from the first matched tag to the last greater-than symbol in your string.
Prestaul
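To see when greediness matters, here is a small JavaScript sketch comparing three variants; the function names and the input string are invented for the illustration. With a dot-based pattern, the greedy form swallows everything up to the last ">" on the line. With a negated class such as [^>], the match cannot cross a ">" in the first place, so the greedy and non-greedy forms behave the same.

```javascript
// Hypothetical helpers comparing three ways to match an opening <b ...> tag.
function greedyDot(markup) {
  return markup.replace(/<b.*>/g, '');     // ".*" runs to the LAST ">" on the line
}
function lazyDot(markup) {
  return markup.replace(/<b.*?>/g, '');    // ".*?" stops at the FIRST ">"
}
function negatedClass(markup) {
  return markup.replace(/<b[^>]*>/g, '');  // "[^>]*" can never cross a ">"
}

const input = 'a <b>x</b> y <i>z</i> end';

console.log(greedyDot(input));    // → "a  end"
console.log(lazyDot(input));      // → "a x</b> y <i>z</i> end"
console.log(negatedClass(input)); // → "a x</b> y <i>z</i> end"
```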
A: 

I think it might be Raymond Chen (blogs.msdn.com/oldnewthing) that I'm paraphrasing (badly!) here... But, you want a Regular Expression? "Now you have two problems" ... :=)

If the string is well-formed (X)HTML, could you load it up into a parser (HTML/XML) and use this to remove any nodes of the offending variety? If it's not well-formed, then it becomes a bit more tricky, but, I suspect that a RegEx isn't the best way to go about this...

Rob
Raymond Chen did use that statement, but he was quoting Jamie Zawinski.
toast
A: 

There are just TOO many ways a single tag can appear, not to mention encodings, variants, etc.
I strongly suggest you rethink this approach... you really shouldn't have to be handling HTML directly, anyway.

AviD
A: 

Off the top of my head, I'd say this will get you started in the right direction.

s/<TAG[^>]*>([^<]*)<\/TAG[^>]*>/\1/

Basically find the starting tag, any text in between the tags, and then the ending tag. Replace the whole thing with whatever was in between the tags.
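As a rough JavaScript rendering of that substitution (with TAG replaced by "b"; the function name unwrapSimple is made up here), the sketch below shows both the case it handles and one it does not. Because the captured group is [^<]*, a pair is only unwrapped when there is no other tag between the opening and closing tag.

```javascript
// Hypothetical helper: replaces <b ...>text</b> with just the text in between.
function unwrapSimple(markup) {
  // $1 is JavaScript's spelling of the \1 backreference in the replacement
  return markup.replace(/<b[^>]*>([^<]*)<\/b[^>]*>/gi, '$1');
}

console.log(unwrapSimple('a <b>bold</b> z'));    // → "a bold z"

// Nested markup contains a "<" between the tags, so nothing matches
// and the string comes back unchanged.
console.log(unwrapSimple('<b>x <i>y</i></b>'));  // → "<b>x <i>y</i></b>"
```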

toast
+8  A: 

I think there is some serious anti-regex bigotry happening here. There are lots of times when you may want to strip a particular tag out of some markup when it doesn't make sense to use a full-blown parser.

Of course there are times when a parser might be the best option, but if you are looking for a regex then:

<script[^>]*?>[\s\S]*?<\/script>

That would remove script tags and their contents. Make sure that you use case-insensitive matching.

If you don't want to remove the contents of the tag then you can use:

<\/?script[^>]*?>

An example of usage in JavaScript would be:

function stripScripts(markup) {
  return markup.replace(/<script[^>]*?>[\s\S]*?<\/script>/gi, '');
}

var safeText = stripScripts(textarea.value);
Prestaul
Hey, nothing wrong with regular expressions, it's just that you can't write an HTML parser in one (actually, I think you can in Perl (Perl has some extra regex stuff), but bagsy not maintaining it!).
Dan
I agree with you. Sometimes you want to act only on a given page with a well-known structure, or on HTML generated by a tool with well-defined output. When the code is predictable, using a regex might make sense. Using them to parse any HTML typed by humans is more risky! ;-)
PhiLho
A: 

Corrected answer:

</?TAG\b[^>]*?>

Dan's answer would also remove <br /> when you only want to remove <b>. The \b word boundary after the tag name prevents that.
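A minimal JavaScript sketch of the word-boundary version follows. The function name stripTag and its tagName parameter are invented for this example, and it assumes the tag name is a plain name with no regex metacharacters in it.

```javascript
// Hypothetical helper: builds the word-boundary pattern for a given tag name.
function stripTag(markup, tagName) {
  // Assumption: tagName is a plain name such as "b" (no regex metacharacters).
  const pattern = new RegExp('</?' + tagName + '\\b[^>]*?>', 'gi');
  return markup.replace(pattern, '');
}

// <b> and </b> are removed; <br /> stays because there is no word
// boundary between "b" and "r".
console.log(stripTag('<b>bold</b><br />next', 'b')); // → "bold<br />next"
```

One thing to keep in mind: \b treats a hyphen as a boundary, so a tag named b-custom would still be stripped when removing b.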

A: 

Here's a regex I wrote for this purpose, it works in a few more situations:

</?(?(?=b|img|a|script)notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>
Loophole
A: 

If it's XHTML why not use XSLT...?

Thomas Hansen
A: 

While using regexes to parse HTML is generally frowned upon, you almost certainly don't want to write your own parser either.

You could however use some inbuilt or library functions to achieve what you need.

  • JavaScript has getElementsByTagName and getElementById, not to mention jQuery.
  • PHP has the DOM extension.
  • Python has the awesome Beautiful Soup.
  • ...and many more.
garrow