ansaurus

Question

I'm looking for a regular expression to remove a given (x)HTML tag from a string

Answer 1

+15 A:

Attempting to parse HTML with regular expressions is generally an extremely bad idea. Use a parser instead; there should be one available for your chosen language.

You might be able to get away with something like this:

</?tag[^>]*?>

But it depends on exactly what you're doing. For example, that won't remove the tag's content, and it may leave your HTML in an invalid state, depending on which tag you're trying to remove. It also copes badly with invalid HTML (and there's a lot of that about).
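As a rough illustration of those caveats, here is a small JavaScript sketch. The helper name stripTag and the sample strings are made up for this example, and it assumes the tag name contains no regex metacharacters.

```javascript
// Hypothetical helper: removes only the opening and closing tags,
// not whatever sits between them.
function stripTag(markup, tag) {
  // Same pattern as above, built for an arbitrary tag name.
  // 'g' replaces every occurrence, 'i' ignores case.
  var pattern = new RegExp('</?' + tag + '[^>]*?>', 'gi');
  return markup.replace(pattern, '');
}

// The tag's content survives the removal.
var a = stripTag('<p>Hello <span class="x">world</span></p>', 'span');

// Removing a structural tag can leave invalid markup behind
// (list items with no surrounding list).
var b = stripTag('<ul><li>one</li><li>two</li></ul>', 'ul');

console.log(a);
console.log(b);
```

Here a comes out as `<p>Hello world</p>`, while b is left as bare `<li>` elements with no parent list.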

Use a parser instead :)

Dan 2008-09-22 17:58:39

Dangit, don't ruin the fun for all the people crafting regexes with your obviously correct answer!

Will 2008-09-22 18:00:41

You need to make that * non-greedy (*?) or you'll lose everything from the first matched tag to the last greater-than symbol in your string.

Prestaul 2008-09-22 19:45:09

Answer 2

A:

I think it might be Raymond Chen (blogs.msdn.com/oldnewthing) that I'm paraphrasing (badly!) here... But, you want a Regular Expression? "Now you have two problems" ... :=)

If the string is well-formed (X)HTML, could you load it up into a parser (HTML/XML) and use this to remove any nodes of the offending variety? If it's not well-formed, then it becomes a bit more tricky, but, I suspect that a RegEx isn't the best way to go about this...

Rob 2008-09-22 18:00:57

Raymond Chen did use that statement, but he was quoting Jamie Zawinski.

toast 2008-09-22 18:05:38

Answer 3

A:

There are just TOO many ways a single tag can appear, not to mention encodings, variants, etc.
I strongly suggest you rethink this approach... you really shouldn't have to be handling HTML directly, anyway.

AviD 2008-09-22 18:01:36

Answer 4

A:

Off the top of my head, I'd say this will get you started in the right direction.

s/<TAG[^>]*>([^<]*)<\/TAG[^>]*>/\1/

Basically find the starting tag, any text in between the tags, and then the ending tag. Replace the whole thing with whatever was in between the tags.
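For comparison, a rough JavaScript equivalent of that substitution might look like the following. The function name unwrapTag is invented here, and like the pattern above it only handles elements whose content contains no other tags.

```javascript
// Hypothetical helper: replaces <tag ...>text</tag> with just the text.
function unwrapTag(markup, tag) {
  // ([^<]*) captures the text between the tags; it cannot span a nested tag.
  var pattern = new RegExp('<' + tag + '[^>]*>([^<]*)</' + tag + '[^>]*>', 'gi');
  // $1 is JavaScript's spelling of the \1 backreference in the replacement.
  return markup.replace(pattern, '$1');
}

// Simple case: the tags go, the text stays.
var unwrapped = unwrapTag('a <em>b</em> c', 'em');

// Nested markup defeats the pattern, so the string comes back unchanged.
var nested = unwrapTag('<em>x <i>y</i> z</em>', 'em');

console.log(unwrapped);
console.log(nested);
```

The second call shows the main limitation: as soon as the element contains another tag, nothing is replaced.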

toast 2008-09-22 18:04:26

Answer 5

+8 A:

I think there is some serious anti-regex bigotry happening here. There are lots of times when you may want to strip a particular tag out of some markup and it doesn't make sense to use a full-blown parser.

Of course there are times when a parser might be the best option, but if you are looking for a regex then:

<script[^>]*?>[\s\S]*?<\/script>

That would remove script tags and their contents. Make sure that you use case-insensitive matching.

If you don't want to remove the contents of the tag then you can use:

<\/?script[^>]*?>

An example of usage in javascript would be:

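// Remove <script> elements and their contents (all occurrences, case-insensitive).
function stripScripts(markup) {
  return markup.replace(/<script[^>]*?>[\s\S]*?<\/script>/gi, '');
}

// textarea is assumed to be a reference to a <textarea> element.
var safeText = stripScripts(textarea.value);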

Prestaul 2008-09-22 18:09:47

Hey, nothing wrong with regular expressions, it's just that you can't write an HTML parser in one (actually, I think you can in Perl (Perl has some extra regex stuff), but bagsy not maintaining it!).

Dan 2008-09-22 18:28:55

I agree with you. Sometimes you want to act only on a given page with a well-known structure, or on HTML generated by a tool with well-defined output. When the code is predictable, using a regex might make sense. Using them to parse any HTML typed by humans is more risky! ;-)

PhiLho 2008-10-17 16:01:30

Answer 6

A:

Corrected answer:

</?TAG\b[^>]*?>

Because Dan's answer would also remove <br /> when you only want to remove <b>.
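A quick JavaScript check of the difference, using a made-up sample string: without \b, the pattern for b also matches any tag whose name merely starts with b.

```javascript
var sample = '<b>bold</b><br />next';

// Without the word boundary, "b" also matches the start of "br".
var loose = sample.replace(/<\/?b[^>]*?>/g, '');

// With \b, the tag name must end right after "b".
var strict = sample.replace(/<\/?b\b[^>]*?>/g, '');

console.log(loose);
console.log(strict);
```

The loose version strips the line break along with the bold tags, while the strict version leaves `<br />` in place.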

2008-11-04 01:53:46

Answer 7

A:

Here's a regex I wrote for this purpose, it works in a few more situations:

</?(?(?=b|img|a|script)notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>

Loophole 2008-11-24 23:35:09

Answer 8

A:

If it's XHTML why not use XSLT...?

Thomas Hansen 2008-11-25 00:45:14

Answer 9

A:

2009-03-27 04:25:27

Answer 10

A:

While using regexes to parse HTML is generally frowned upon, you almost certainly don't want to write your own parser either.

You could however use some inbuilt or library functions to achieve what you need.

JavaScript has getElementsByTagName and getElementById, not to mention jQuery.
PHP has the DOM extension.
Python has the awesome Beautiful Soup
...and many more.

garrow 2009-05-18 15:17:09
