I've been beating my head against this regex for the longest time now and am hoping someone can help. Basically, I have a WYSIWYG field where a user can type formatted text. But of course they will copy and paste from Word, the web, etc., so I have a JS function catching the input on paste. I have a function that strips ALL of the formatting from the text, which is nice, but I'd like it to leave tags like p and br so the result isn't just a big mess.

Any regex ninjas out there? Here is what I have so far, and it works. I just need it to allow certain tags.

o.node.innerHTML=o.node.innerHTML.replace(/(<([^>]+)>)/ig,"");
+1  A: 

First, I'm not sure if regex is the right tool for this. A user might enter invalid HTML (forget a > or put a > inside attributes), and a regex would fail then. I don't know, though, if a parser would be much better/more bulletproof.

Second, you have a few unnecessary parentheses in your regex.

Third, you could use lookahead to exclude certain tags:

o.node.innerHTML=o.node.innerHTML.replace(/<(?!\s*\/?(br|p)\b)[^>]+>/ig,"");

Explanation:

< match opening angle bracket

(?!\s*\/?(br|p)\b) assert that it's not possible to match zero or more whitespace characters, zero or one /, any one of br or p, followed directly by a word boundary. The word boundary is important, otherwise you might trigger the lookahead on tags like <pre> or <param ...>.

[^>]+ match one or more characters that are not closing angle brackets

> match the closing angle bracket.

Note that you might run into trouble if a closing angle bracket occurs somewhere inside a tag.

So this will match (and strip)

<pre> <a href="dot.com"> </a> </pre>

and leave

<p> < p > < /br > <br /> <br> etc.

alone.
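As a quick check of the behaviour described above, here is a minimal sketch that runs the same pattern over a few sample strings. The helper name `stripExceptPAndBr` and the sample inputs are made up for illustration. The last call shows the failure mode mentioned earlier, where a `>` inside an attribute value ends the match too early.

```javascript
// Hypothetical helper wrapping the regex from this answer.
function stripExceptPAndBr(html) {
    return html.replace(/<(?!\s*\/?(br|p)\b)[^>]+>/ig, "");
}

// Tags other than p and br are removed; their text content stays.
var cleaned = stripExceptPAndBr('<pre><a href="dot.com">link</a></pre><p>text<br />more</p>');
console.log(cleaned);   // link<p>text<br />more</p>

// Word boundary: <pre> and <param> are not mistaken for <p>.
var boundary = stripExceptPAndBr('<pre>x</pre><param name="a"><p>y</p>');
console.log(boundary);  // x<p>y</p>

// Failure mode: the first ">" inside the attribute value closes the match,
// so part of the attribute leaks into the output.
var broken = stripExceptPAndBr('<a title="a > b">x</a>');
console.log(broken);    //  b">x
```

Note that the kept `p` and `br` tags retain any attributes they carry, since the lookahead only inspects the tag name.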

Tim Pietzcker
Hmm, I just tried it and it still strips everything... If not regex, what would you suggest? I didn't want to find and replace every single type of tag.
Code Monkey
Sorry, I had first misread your post (`b` instead of `br`). Can you try again?
Tim Pietzcker
+3  A: 

The browser already has a perfectly good parsed HTML tree in o.node. Serialising the document content to HTML (using innerHTML), trying to hack it about with regex (which cannot parse HTML reliably), then re-parsing the results back into document content by setting innerHTML... is just a bit perverse really.

Instead, inspect the element and attribute nodes you already have inside o.node, removing the ones you don't want, eg.:

filterNodes(o.node, {p: [], br: [], a: ['href']});

Defined as:

// Remove elements and attributes that do not meet a whitelist lookup of lowercase element
// name to list of lowercase attribute names.
//
function filterNodes(element, allow) {
    // Recurse into child elements
    //
    Array.fromList(element.childNodes).forEach(function(child) {
        if (child.nodeType===1) {
            filterNodes(child, allow);

            var tag= child.tagName.toLowerCase();
            if (tag in allow) {

                // Remove unwanted attributes
                //
                Array.fromList(child.attributes).forEach(function(attr) {
                    if (allow[tag].indexOf(attr.name.toLowerCase())===-1)
                       child.removeAttributeNode(attr);
                });

            } else {

                // Replace unwanted elements with their contents
                //
                while (child.firstChild)
                    element.insertBefore(child.firstChild, child);
                element.removeChild(child);
            }
        }
    });
}

// ECMAScript Fifth Edition (and JavaScript 1.6) array methods used by `filterNodes`.
// Because not all browsers have these natively yet, bodge in support if missing.
//
if (!('indexOf' in Array.prototype)) {
    Array.prototype.indexOf= function(find, ix /*opt*/) {
        for (var i= ix || 0, n= this.length; i<n; i++)
            if (i in this && this[i]===find)
                return i;
        return -1;
    };
}
if (!('forEach' in Array.prototype)) {
    Array.prototype.forEach= function(action, that /*opt*/) {
        for (var i= 0, n= this.length; i<n; i++)
            if (i in this)
                action.call(that, this[i], i, this);
    };
}

// Utility function used by filterNodes. This is really just `Array.prototype.slice()`
// except that the ECMAScript standard doesn't guarantee we're allowed to call that on
// a host object like a DOM NodeList, boo.
//
Array.fromList= function(list) {
    var array= new Array(list.length);
    for (var i= 0, n= list.length; i<n; i++)
        array[i]= list[i];
    return array;
};
bobince
Great function, and a clever method! Works like a charm. The only thing that is (sometimes) left is the <!-- garbage -->, I guess because those aren't nodes. Any way to get rid of that? If not, still wonderful!
Code Monkey
They're comment nodes. You can get rid of them if you want with `... else if (child.nodeType===8) { element.removeChild(child); }` (`8` is `COMMENT_NODE`, like `1` is `ELEMENT_NODE`, though IE doesn't give you the constant names, so you have to use the numbers).
bobince
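For reference, here is one way the comment-node branch from the last comment might be folded into the whitelist filter. This is only a sketch: the name `filterNodesAndComments` is hypothetical, and it assumes plain `for` loops over a snapshot of `childNodes` in place of the `Array.fromList` and `forEach` helpers, so the block stands on its own.

```javascript
// Sketch: whitelist filter that also drops comment nodes.
// `allow` maps lowercase element names to lists of lowercase attribute names.
function filterNodesAndComments(element, allow) {
    // Snapshot the child list first, because the loop body modifies it.
    var children = [];
    for (var i = 0; i < element.childNodes.length; i++)
        children.push(element.childNodes[i]);

    for (var j = 0; j < children.length; j++) {
        var child = children[j];

        if (child.nodeType === 1) {                 // ELEMENT_NODE
            filterNodesAndComments(child, allow);

            var tag = child.tagName.toLowerCase();
            if (tag in allow) {
                // Remove unwanted attributes, walking backwards because
                // the attribute list shrinks as entries are removed.
                for (var k = child.attributes.length - 1; k >= 0; k--) {
                    var attr = child.attributes[k];
                    if (allow[tag].indexOf(attr.name.toLowerCase()) === -1)
                        child.removeAttributeNode(attr);
                }
            } else {
                // Replace unwanted elements with their (already filtered) contents.
                while (child.firstChild)
                    element.insertBefore(child.firstChild, child);
                element.removeChild(child);
            }
        } else if (child.nodeType === 8) {          // COMMENT_NODE
            element.removeChild(child);
        }
    }
}
```

In a browser it would be called the same way as the original, e.g. `filterNodesAndComments(o.node, {p: [], br: [], a: ['href']});`.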