ansaurus

Question

How do I strip all html tags in javascript with exceptions?

Answer 1

+1 A:

First, I'm not sure if regex is the right tool for this. A user might enter invalid HTML (forget a > or put a > inside attributes), and a regex would fail then. I don't know, though, if a parser would be much better/more bulletproof.

Second, you have a few unnecessary parentheses in your regex.

Third, you could use lookahead to exclude certain tags:

o.node.innerHTML=o.node.innerHTML.replace(/<(?!\s*\/?(br|p)\b)[^>]+>/ig,"");

Explanation:

< match opening angle bracket

(?!\s*\/?(br|p)\b) assert that it's not possible to match zero or more whitespace characters, zero or one /, any one of br or p, followed directly by a word boundary. The word boundary is important, otherwise you might trigger the lookahead on tags like <pre> or <param ...>.

[^>]+ match one or more characters that are not closing angle brackets

> match the closing angle bracket.

Note that you might run into trouble if a closing angle bracket occurs somewhere inside a tag.

So this will match (and strip)

<pre> <a href="dot.com"> </a> </pre>

and leave

<p> </p> <br> <br /> etc.

alone.
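
To make the behaviour above concrete, here is a minimal sketch that applies the same regex to a plain string instead of o.node.innerHTML. The helper name stripTagsExcept and the sample inputs are made up for illustration; they are not from the question. The second call shows the caveat about a closing angle bracket inside a tag.

```javascript
// Same pattern as in the answer: strip every tag except <br> and <p>
// (opening or closing). stripTagsExcept is a hypothetical helper name.
var stripPattern = /<(?!\s*\/?(br|p)\b)[^>]+>/ig;

function stripTagsExcept(html) {
    return html.replace(stripPattern, "");
}

// Well-formed input: <pre>, <a> and <param> are removed; <p> and <br /> survive.
var clean = stripTagsExcept(
    '<pre><a href="dot.com">link</a></pre><p>para<br />line</p><param name="x">'
);
// clean === 'link<p>para<br />line</p>'

// The caveat from the answer: a ">" inside an attribute value ends the match early,
// so the tail of the attribute leaks into the output.
var broken = stripTagsExcept('<a title="a > b">x</a>');
// broken === ' b">x'
```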

Tim Pietzcker 2010-03-06 15:23:13

Hmm, I just tried it and it still strips everything... If not regex, what would you suggest? I didn't want to find and replace every single type of tag.

Code Monkey 2010-03-06 15:27:05

Sorry, I had misread your post at first (`b` instead of `br`). Can you try again?

Tim Pietzcker 2010-03-06 15:32:46

Answer 2

+3 A:

The browser already has a perfectly good parsed HTML tree in o.node. Serialising the document content to HTML (using innerHTML), trying to hack it about with regex (which cannot parse HTML reliably), then re-parsing the results back into document content by setting innerHTML... is just a bit perverse really.

Instead, inspect the element and attribute nodes you already have inside o.node, removing the ones you don't want, e.g.:

filterNodes(o.node, {p: [], br: [], a: ['href']});

Defined as:

// Remove elements and attributes that do not meet a whitelist lookup of lowercase element
// name to list of lowercase attribute names.
//
function filterNodes(element, allow) {
    // Recurse into child elements
    //
    Array.fromList(element.childNodes).forEach(function(child) {
        if (child.nodeType===1) {
            filterNodes(child, allow);

            var tag= child.tagName.toLowerCase();
            if (tag in allow) {

                // Remove unwanted attributes
                //
                Array.fromList(child.attributes).forEach(function(attr) {
                    if (allow[tag].indexOf(attr.name.toLowerCase())===-1)
                        child.removeAttributeNode(attr);
                });

            } else {

                // Replace unwanted elements with their contents
                //
                while (child.firstChild)
                    element.insertBefore(child.firstChild, child);
                element.removeChild(child);
            }
        }
    });
}

// ECMAScript Fifth Edition (and JavaScript 1.6) array methods used by `filterNodes`.
// Because not all browsers have these natively yet, bodge in support if missing.
//
if (!('indexOf' in Array.prototype)) {
    Array.prototype.indexOf= function(find, ix /*opt*/) {
        for (var i= ix || 0, n= this.length; i<n; i++)
            if (i in this && this[i]===find)
                return i;
        return -1;
    };
}
if (!('forEach' in Array.prototype)) {
    Array.prototype.forEach= function(action, that /*opt*/) {
        for (var i= 0, n= this.length; i<n; i++)
            if (i in this)
                action.call(that, this[i], i, this);
    };
}

// Utility function used by filterNodes. This is really just `Array.prototype.slice()`
// except that the ECMAScript standard doesn't guarantee we're allowed to call that on
// a host object like a DOM NodeList, boo.
//
Array.fromList= function(list) {
    var array= new Array(list.length);
    for (var i= 0, n= list.length; i<n; i++)
        array[i]= list[i];
    return array;
};

bobince 2010-03-06 16:29:46

Great function, and a clever method! Works like a charm. The only thing that is (sometimes) left behind is HTML comments (<!-- ... -->). I guess that's because they aren't nodes. Any way to get rid of those? If not, it's still wonderful!

Code Monkey 2010-03-06 17:25:40

They're comment nodes. You can get rid of them if you want with `... else if (child.nodeType===8) { element.removeChild(child); }` (`8` is `COMMENT_NODE`, just as `1` is `ELEMENT_NODE`, though IE doesn't give you the constant names, so you have to use the numbers).

bobince 2010-03-06 20:14:49
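
If filterNodes itself should stay untouched, the same idea can be sketched as a separate pass. removeComments is a hypothetical helper name, not part of the answer above; it assumes only the standard childNodes, nodeType and removeChild members.

```javascript
// Hypothetical helper (not from the original answer): remove every comment
// node beneath `element`, recursing into child elements.
function removeComments(element) {
    // Walk backwards so that removing a node from the live childNodes list
    // does not shift the index of the siblings still to be visited.
    for (var i = element.childNodes.length - 1; i >= 0; i--) {
        var child = element.childNodes[i];
        if (child.nodeType === 8) {          // 8 is COMMENT_NODE
            element.removeChild(child);
        } else if (child.nodeType === 1) {   // 1 is ELEMENT_NODE
            removeComments(child);
        }
    }
}
```

Calling removeComments(o.node) before or after filterNodes(o.node, ...) should then leave no comment nodes behind.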
