I want to delete empty tags such as <label></label>, <font> </font> so that:

<label></label><form></form>
<p>This is <span style="color: red;">red</span> 
<i>italic</i>
</p>

will be cleaned as:

<p>This is <span style="color: red;">red</span> 
<i>italic</i>
</p>

I have this regex in JavaScript. It deletes the empty tags, but it also deletes this: "<i>italic</i></p>"

str=str.replace(/<[\S]+><\/[\S]+>/gim, "");

What am I missing?

+8  A: 

Regex is not for HTML. If you're in JavaScript anyway, I'd encourage you to use jQuery DOM processing.

Something like:

$('*:empty').remove();

Alternatively:

$("*").filter(function() 
{ 
     // Keep only elements whose trimmed HTML is empty, so that those get removed.
     return $.trim($(this).html()).length === 0; 
}).remove();
Graphain
+1. Use a DOM parser.
Mark Peters
I have other regex cleaning in the same function, so unfortunately I prefer this way. The content is within an IFrame where users paste from a Word doc. I am cleaning out all the MSFT junk.
bobby
I will look into the jQuery option.
bobby
I agree that regex *seems* easier, but if you're already in javascript jQuery is so much easier and makes it so much easier to extend capabilities as well (what if requirements ask you to start removing nested <p> tags, or tags that are nested more than 3 levels deep?)
Graphain
I am using jQuery, but I haven't done this type of cleaning with it before. Any simple example will help. Thanks
bobby
The example provided doesn't work? Let me know what kind of example you need and I'm happy to help.
Graphain
I know the formatting won't work out in a comment, but copy and paste this snippet into the head of your document: `<script type="text/javascript"> $(document).ready(function() { $('*:empty').remove(); }); </script>` What it does is wait for the document to be ready and then remove the empty tags (as per Graphain's example). Make sure to load the jQuery library first, e.g. `<script type="text/javascript" src="jquery.min.js"></script>`
Gert G
http://api.jquery.com/category/selectors/, http://api.jquery.com/category/traversing/ and http://api.jquery.com/category/manipulation/ are pretty helpful, but let me know what you need in particular.
Graphain
@Gert G - Nice one
Graphain
Thanks Graphain. The above HTML is stored in a string variable in JS, and I am running the regex on that variable. I already ran another set of regexes to do the other cleaning. Can I drop that HTML inside a div and process it using jQuery?
bobby
@Graphain Thanks. I hope it will help bobby getting started. Your code worked fine.
Gert G
@Bobby: Yep, pretty sure you can wrap it in a div, e.g. `var $div = $('<div>').html(data); $div.find(':empty').remove(); data = $div.html();`, where data is your string var. Let me know if that works.
Graphain
A: 

You need /<[\S]+?><\/[\S]+?>/ -- the difference is the ?s after the +s, to match "as few as possible" (AKA "non-greedy match") nonspace characters (though 1 or more), instead of the bare +s which match "as many as possible" (AKA "greedy match").

Avoiding regular expressions altogether, as the other answer recommends, is also an excellent idea, but I wanted to point out the important greedy vs non-greedy distinction, which will serve you well in a huge variety of situations where regexes are warranted.
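
To make the greedy versus non-greedy distinction concrete, here is a minimal sketch in plain JavaScript. The sample string and the variable names are assumptions chosen for illustration; they do not come from the question.

```javascript
// Hypothetical sample input, chosen only to contrast the two matching modes.
var sample = "<b>bold</b>";

// Greedy: .+ takes as much as it can, so the match runs to the last ">".
var greedy = sample.match(/<.+>/)[0];

// Non-greedy: .+? takes as little as it can, so the match stops at the first ">".
var lazy = sample.match(/<.+?>/)[0];

console.log(greedy); // <b>bold</b>
console.log(lazy);   // <b>

// Caveat: laziness only changes which match is preferred. [\S] may still
// step over ">" when no shorter match exists, as on the question's input.
var stillRemoved = "<i>italic</i></p>".replace(/<[\S]+?><\/[\S]+?>/g, "");
console.log(stillRemoved === ""); // true
```

So, as far as I can tell, the non-greedy quantifier on its own still removes `<i>italic</i></p>`; a character class that excludes `>` is needed as well.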

Alex Martelli
+1  A: 

You have "not spaces" as your character class, which means "<i>italic</i></p>" will match. The first half of your regex will match "<(i>italic</i)>" and the second half "</(p)>". (I've used brackets to show what each [\S]+ matches.)
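
The split described above can be checked with capturing groups. This is a small sketch; the parentheses added to the pattern and the variable names are assumptions made for illustration.

```javascript
// The question's pattern, with parentheses added (an assumption for
// illustration) so that each [\S]+ reports what it consumed.
var pattern = /<([\S]+)><\/([\S]+)>/;
var m = pattern.exec("<i>italic</i></p>");

console.log(m[0]); // <i>italic</i></p>
console.log(m[1]); // i>italic</i
console.log(m[2]); // p
```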

Change this:

/<[\S]+><\/[\S]+>/

To this:

/<[^\/>][^>]*><\/[^>]+>/

Overall you should really be using a proper HTML processor, but if you're munging HTML soup this should suffice :)
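
As a quick check, here is the corrected pattern applied to the sample from the question, in plain JavaScript. The variable names are assumptions, and the second pattern (with `\s*`) is a hypothetical extension for whitespace-only tags such as `<font> </font>`; it is not part of the answer above.

```javascript
// Sample markup from the question, held in a string variable.
var html = '<label></label><form></form>\n' +
           '<p>This is <span style="color: red;">red</span> \n' +
           '<i>italic</i>\n' +
           '</p>';

// [^\/>] stops the opening tag from starting with "/", and [^>]* stops it
// from running past its own closing ">".
var emptyTag = /<[^\/>][^>]*><\/[^>]+>/g;
var cleaned = html.replace(emptyTag, "");
console.log(cleaned);

// Hypothetical variant: also allow whitespace between the two tags.
var emptyOrBlankTag = /<[^\/>][^>]*>\s*<\/[^>]+>/g;
var blankCleaned = "<font> </font><b>kept</b>".replace(emptyOrBlankTag, "");
console.log(blankCleaned); // <b>kept</b>
```

Neither pattern checks that the closing tag name matches the opening one, so this remains a string-level cleanup rather than real HTML parsing.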

Porges
This is the closest expression. The others don't work. But this one deletes </i></p> as well. I am playing with the code. Thanks
bobby
Added a fix for that :)
Porges
/<[^\/>][^>]*><\/[^>]+>/ This works!
bobby
A: 

This is an issue of greedy regex. Try this:

str=str.replace(/<[^>]+><\/[\S]+>/gim, "");

or

str=str.replace(/<[\S]+?><\/[\S]+>/gim, "");

In your regex, <[\S]+> matches <i>italic</i> and the <\/[\S]+> matches the </p>.

Jamie Wong
I see what is missing. Thanks
bobby
+1  A: 

I like Graphain's jQuery solution but here is another option using native JavaScript.

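function CleanChildren(elem)
{
    var children = elem.childNodes;

    // Walk backwards: childNodes is live, so removing a node shifts later indexes.
    for (var i = children.length - 1; i >= 0; i--)
    {
        var child = children[i];

        // Skip text and comment nodes; only element nodes can be empty tags.
        if (child.nodeType !== 1)
            continue;

        if (child.hasChildNodes())
            CleanChildren(child);
        else
            elem.removeChild(child); // removeChild is the DOM method; removeChildNode does not exist

    }
}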
Rodrick Chapman
Pretty nice for classic js!
Graphain