ansaurus

Question

How to remove all empty tags in X/HTML code at once?

Answer 1

+1 A:

If this is only about quickly editing a file, and your editor supports regular expression replacement, you can use a regex like this:

<[^>]+></[^>]+>

Search for this regex, and replace with an empty string.

Note: This isn't safe in any way. Don't rely on it, as it can match more than just valid, empty tags (it would also match <a></b>, for example). There is no safe way to do this with regexes, but if you check each replacement manually, you should be fine. If you need a truly safe replacement, you'll either have to find an editor that supports this (jEdit may be a good bet, but I haven't checked), or parse the file yourself, e.g. using XSLT.
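
For anyone who wants to try the pattern outside an editor, here is a minimal Python sketch. Python's `re` module is my choice here, not something the answer used, and the name `strip_empty_pairs` is made up for illustration. It shows both the intended effect and the mismatched-tag false positive mentioned above.

```python
import re

# The pattern from the answer: any tag immediately followed by any closing tag.
EMPTY_PAIR = re.compile(r"<[^>]+></[^>]+>")


def strip_empty_pairs(markup: str) -> str:
    """Delete every match of EMPTY_PAIR in a single pass (hypothetical helper)."""
    return EMPTY_PAIR.sub("", markup)


if __name__ == "__main__":
    # An empty <p> is removed, a non-empty one is kept.
    print(strip_empty_pairs("<p></p><p>text</p>"))
    # The mismatched pair <a></b> is removed too, which is the unsafe part.
    print(strip_empty_pairs("x<a></b>y"))
```

Since the pattern never compares the two tag names, each replacement still has to be checked by hand.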

Chris Lercher 2010-03-20 14:06:31

+1; editors are one of the places where a regex is a perfectly acceptable tool for manipulating HTML

Rob 2010-03-20 14:07:59

I'd put a \s* between the tags and you should escape the /. So ... <[^>]+>\s*<\/[^>]+>

Robusto 2010-03-20 14:13:27

... if you want to allow whitespace between the tags, you'd probably also want to enable multiline matching: `(?m)<[^>]+>\s*<\/[^>]+>` (I'm not sure whether the forward slash really has to be escaped; at least it works for me in Eclipse without escaping it)

Chris Lercher 2010-03-20 14:28:56

HTML attribute values can contain literal `>` characters.

Gumbo 2010-03-20 17:30:28

100% agreed (unless we're talking about valid X(HT)ML). That's just one example of why I stressed that every replacement must be checked manually if you want to use a regex approach for quick editing! But it can't be said often enough: HTML is not a regular language.

Chris Lercher 2010-03-20 17:49:24

Answer 2

A:

You could use a regular expression in any editor that supports them. For instance, I tested this one in Dreamweaver:

<(?!\!|input|br|img|meta|hr)[^/>]*?>[\s]*?</[^>]*?>

Just make a search and replace all (with the regex as search string and nothing as replacement). Note however that this may remove necessary whitespace. If you just want to remove empty tags without anything in between,

<(?!\!|input|br|img|meta|hr)[^/>]*?></[^>]*?>

would be the way to go.

Update: To remove &nbsp;s as well:

<(?!\!|input|br|img|meta|hr)[^/>]*?>(?:[\s]|&nbsp;)*?</[^>]*?>

I did not verify this one - it should be OK though, try it out :-)
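
Since the last pattern was posted unverified, here is a small Python sketch for trying it out. Using Python's `re` module is an assumption on my part (the answer only tested in Dreamweaver), and `strip_blank_elements` is a hypothetical name. The pattern string is copied unchanged from above.

```python
import re

# Pattern from the update above: skips comments/doctypes and a few void
# elements, and treats whitespace and &nbsp; between the tags as "empty".
BLANK_ELEMENT = re.compile(
    r"<(?!\!|input|br|img|meta|hr)[^/>]*?>(?:[\s]|&nbsp;)*?</[^>]*?>"
)


def strip_blank_elements(markup: str) -> str:
    """Remove elements whose content is only whitespace or &nbsp; (one pass)."""
    return BLANK_ELEMENT.sub("", markup)


if __name__ == "__main__":
    print(strip_blank_elements("<p>&nbsp; </p>"))             # removed
    print(strip_blank_elements("<br></br>"))                  # kept by the lookahead
    print(strip_blank_elements("<p>keep</p><span> </span>"))  # only the span goes
    print(strip_blank_elements('<a href="/x"></a>'))          # kept, see note
```

One thing this makes visible: because the opening tag is matched with `[^/>]`, any slash inside an attribute value stops the match, so an empty `<a href="/x"></a>` is left alone.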

Mef 2010-03-20 14:07:11

What does this part mean: `^input|^br|^img|^meta|^hr]`?

metal-gear-solid 2010-03-20 14:41:39

He's checking for tags that don't have an ending tag.

Rich Bradshaw 2010-03-20 14:47:04

OK, thanks for this. Please add one more thing: I also want to remove `<p> </p>` from the code.

metal-gear-solid 2010-03-20 15:18:42

Gumbo 2010-03-20 17:28:53

@Gumbo: corrected.

Mef 2010-03-20 20:25:43

It works, but there is one problem: it does not fully remove `<p><label></label></p>`. It removes the inner `label` but not the outer `<p>`.

metal-gear-solid 2010-03-21 02:56:40

Yeah... running through the document more than once (one pass per level of nesting) is the best you can do. There won't be a regex that matches every special case... HTML is not a regular language, so there will be no perfect solution.

Mef 2010-03-21 18:55:44

Answer 3

A:

What you're asking for sounds like a job for regular expressions. Many editors support regular expression find/replace. Personally, I'd probably do this from the command-line with Perl (sed would also work), but that's just me.

perl -pe 's|<([^\s>]+)[^>]*></\1>||g' < file.html > new_file.html

or if you're brave, edit the file in place:

perl -pe 's|<([^\s>]+)[^>]*></\1>||g' -i file.html

This will remove:

<p></p>
<p id="foo"></p>

but not:

<p>hello world</p>
<p></a>

Warning: things like <img src="pic.png"></img> and <br></br> will also be removed. It's not obvious from your question, but I'll assume this is undesirable. Maybe you're not worried because you know all your images are declared like this <img src="pic.png"/>. Otherwise the regular expression will need to be modified to account for this, but I decided to start simple for an easier explanation...

It works by matching the opening tag: a literal < followed by the tag name (one or more characters which are not whitespace or > = [^\s>]+), any attributes (zero or more characters which aren't > = [^>]*), and then a literal >; and a closing tag with the same name: this takes advantage of the fact that we captured the tag name, so we can use a backreference = </\1>. The matches are then replaced with the empty string.
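
If Perl isn't at hand, the same pattern carries over to other regex engines. Below is a minimal sketch in Python (my choice of language; `remove_empty` is a hypothetical name) that exercises the cases listed above.

```python
import re

# Group 1 captures the tag name; \1 forces the closing tag to repeat it.
SAME_NAME_EMPTY = re.compile(r"<([^\s>]+)[^>]*></\1>")


def remove_empty(markup: str) -> str:
    """Remove empty elements whose open and close tag names agree (one pass)."""
    return SAME_NAME_EMPTY.sub("", markup)


if __name__ == "__main__":
    for sample in ['<p></p>', '<p id="foo"></p>', '<p>hello world</p>',
                   '<p></a>', '<img src="pic.png"></img>']:
        print(repr(sample), "->", repr(remove_empty(sample)))
```

The first two samples and the `<img>` pair come back as empty strings; the other two are returned unchanged.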

If the syntax/terminology used here is unfamiliar to you, I'm a fan of the perlre documentation page. Regular expression syntax in other languages should be very similar if not identical to this, so hopefully this will be useful even if you don't Perl :)

Oh, one more thing. If you have things like <div><p></p></div>, these will not be picked up all at once. You'll have to do multiple passes: the first will remove the <p></p>, leaving a <div></div> to be removed by the second. In Perl, the substitution operator returns the number of replacements made, so you can loop until nothing is left to replace:

perl -pe '1 while s|<([^\s>]+)[^>]*></\1>||g' < file.html > new_file.html
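
The same "repeat until nothing changes" idea, sketched in Python for comparison (an assumption on my part, with the hypothetical name `remove_empty_nested`). `re.subn` returns the replacement count, which plays the role of Perl's return value. Unlike `perl -p`, which works line by line, this version operates on the whole string at once.

```python
import re

SAME_NAME_EMPTY = re.compile(r"<([^\s>]+)[^>]*></\1>")


def remove_empty_nested(markup: str) -> str:
    """Repeat the substitution until a pass makes no replacements."""
    while True:
        markup, count = SAME_NAME_EMPTY.subn("", markup)
        if count == 0:
            return markup


if __name__ == "__main__":
    # First pass removes <p></p>, second pass removes the now-empty <div></div>.
    print(repr(remove_empty_nested("<div><p></p></div>")))
    print(repr(remove_empty_nested("<div><p>x</p></div>")))
```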

miorel 2010-03-20 14:43:20

Quite an advanced regex, but still be careful: `<a b="></a>"></a>` (this isn't well-formed X(HT)ML in the first place, but it would be reduced to `"></a>`). If it has to be guaranteed to work, I'd always prefer a real parser.
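
To illustrate what the parser route could look like, here is a rough Python sketch using the standard library's `xml.etree.ElementTree`. Nothing in this thread prescribes that library; the function names, the `VOID` set (borrowed from the list in the second answer), and the assumption that the input is well-formed XML without a namespace declaration are all mine.

```python
import xml.etree.ElementTree as ET

# Elements that are legitimately empty (assumed list, taken from Answer 2).
VOID = {"br", "img", "hr", "input", "meta"}


def prune_empty(parent: ET.Element) -> None:
    """Remove childless, text-free descendants of parent, deepest first."""
    for child in list(parent):
        prune_empty(child)  # children first, so emptied parents are caught too
        is_empty = len(child) == 0 and not (child.text or "").strip()
        if is_empty and child.tag not in VOID:
            # Preserve the text that followed the element being removed.
            tail = child.tail or ""
            idx = list(parent).index(child)
            if idx > 0:
                prev = parent[idx - 1]
                prev.tail = (prev.tail or "") + tail
            else:
                parent.text = (parent.text or "") + tail
            parent.remove(child)


def remove_empty_elements(xml_text: str) -> str:
    """Parse well-formed XML, drop empty elements, and serialize it again."""
    root = ET.fromstring(xml_text)  # raises ParseError on malformed input
    prune_empty(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    doc = "<div><p><label></label></p>text<br/><p>hi</p></div>"
    print(remove_empty_elements(doc))
```

Nested cases are handled in one call because children are pruned before their parent is examined, and malformed input such as the example above is rejected by the parser instead of being silently mangled.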

Chris Lercher 2010-03-20 15:03:32

Of course :) Manipulating XML with regular expressions is usually a hack at best, but you can get away with it if there are no pathological cases!

miorel 2010-03-20 15:33:44
