I am using a 'contenteditable' <div/> and enabling PASTE.

The amount of markup that gets pasted in from a clipboard copy out of Microsoft Word is amazing. I am battling this, and have gotten about halfway there using Prototype's stripTags() function (which unfortunately does not seem to let me keep some tags).

However, even after that, I wind up with a mind-blowing amount of unneeded markup code.

So my question is: is there some JavaScript function or approach I can use to clean up the majority of this unneeded markup?

+2  A: 

How about having a "paste as plain text" button which displays a <textarea>, allowing the user to paste the text in there? That way, all tags will be stripped for you. That's what I do with my CMS; I gave up trying to clean up Word's mess.

Josh
This would be my worst-case scenario I suppose (and the way it's looking, it may be the only scenario - very depressing).
OneNerd
@OneNerd: I marked your question as a favorite because if anyone else has a better solution I think I'll use it too!
Josh
I came up with something I *think* may be usable. See my answer (and please improve it too) if you would like. Thanks.
OneNerd
A: 

Could you paste to a hidden textarea, copy from same textarea, and paste to your target?

souLTower
Hmm, well, do you know a way to send the pasted content to a textarea so it is indeed plain text instead of the markup code? Since the keypress is on the DIV, I can read the contents and pass them to the textarea, but it wouldn't be plain text.
OneNerd
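For what it's worth, browsers that expose clipboard data on the paste event can hand you the plain-text version directly, which sidesteps the hidden-textarea round trip asked about above. This is only a minimal sketch, not tested against every browser: the helper names getPlainTextFromPaste and plainTextToHtml and the element id "editor" are made up for illustration, and execCommand("insertHTML") is not available everywhere.

```javascript
// Hypothetical helper: pull the plain-text flavour out of a paste event.
function getPlainTextFromPaste(event) {
  // Standard browsers expose the clipboard on the event itself.
  if (event && event.clipboardData && event.clipboardData.getData) {
    return event.clipboardData.getData("text/plain");
  }
  // Older Internet Explorer exposes it on window instead.
  if (typeof window !== "undefined" && window.clipboardData) {
    return window.clipboardData.getData("Text");
  }
  return "";
}

// Hypothetical helper: escape the text, then turn line breaks into <br />.
function plainTextToHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/\r\n|\r|\n/g, "<br />");
}

// Wiring, assuming the contenteditable element has id "editor" (an assumption).
if (typeof document !== "undefined") {
  var editor = document.getElementById("editor");
  if (editor) {
    editor.addEventListener("paste", function (event) {
      // Stop the browser from inserting Word's markup itself.
      event.preventDefault();
      var html = plainTextToHtml(getPlainTextFromPaste(event));
      document.execCommand("insertHTML", false, html);
    });
  }
}
```

The trade-off is the same as with the textarea: all formatting is lost, but nothing from Word's markup ever reaches the DIV.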
A: 

Hate to say it, but I eventually gave up making TinyMCE handle Word crap the way I want. Now I just have an email sent to me every time a user's input contains certain HTML (look for <span lang="en-US"> for example) and I correct it manually.

Coronatus
Yikes - not really an option for me.
OneNerd
A: 

Here is the function I wound up writing that does the job fairly well (as far as I can tell anyway).

I am certainly open to improvement suggestions if anyone has any. Thanks.

function cleanWordPaste( in_word_text ) {
 var tmp = document.createElement("DIV");
 tmp.innerHTML = in_word_text;
 // fall back to an empty string so the replace() calls below cannot throw
 var newString = tmp.textContent || tmp.innerText || "";
 // this next piece converts line breaks into break tags
 // and removes the seemingly endless crap code
 newString = newString.replace(/\n\n/g, "<br />").replace(/.*<!--.*-->/g, "");
 // this next piece removes any break tags (up to 10) at beginning
 for ( var i = 0; i < 10; i++ ) {
  if ( newString.substr(0, 6) == "<br />" ) {
   newString = newString.replace("<br />", "");
  }
 }
 return newString;
}

Hope this is helpful to some of you.

OneNerd
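Since the original complaint was that stripTags() cannot keep selected tags, here is a rough sketch of a whitelist variant. It is regex-based, so it is an approximation rather than a real HTML parser (an attribute value containing ">" will confuse it), and stripTagsExcept is a hypothetical name, not part of Prototype.

```javascript
// Hypothetical helper: keep only whitelisted tags and drop all their attributes.
function stripTagsExcept(html, allowedTags) {
  // Build a lookup of lower-cased tag names that may survive.
  var allowed = {};
  for (var i = 0; i < allowedTags.length; i++) {
    allowed[allowedTags[i].toLowerCase()] = true;
  }
  return html
    // Drop HTML comments, including Word's conditional comments.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Drop <style>, <script> and <xml> blocks together with their contents.
    .replace(/<(style|script|xml)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // For every remaining tag, emit a bare version if whitelisted, else nothing.
    .replace(/<(\/?)([a-z][a-z0-9:]*)\b[^>]*>/gi, function (match, slash, name) {
      name = name.toLowerCase();
      return allowed[name] === true ? "<" + slash + name + ">" : "";
    });
}
```

Called as stripTagsExcept(pastedHtml, ["p", "b", "i"]), it would keep paragraphs and basic emphasis while discarding spans, namespaced tags such as o:p, and every class or style attribute.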
A: 

I did something like that long ago, where I totally cleaned up the stuff in a rich text editor and converted font tags to styles, brs to p's, etc., to keep it consistent between browsers and prevent certain ugly things from getting in via paste. I took my recursive function and ripped out most of it except for the core logic. This might be a good starting point if that is what you need ("result" is an object that accumulates the result, which probably takes a second pass to convert to a string):

var cleanDom = function(result, n) {
    var nn = n.nodeName;
    if(nn=="#text") {
        var text = n.nodeValue;

    }
    else {
        if(nn=="A" && n.href)
            ...;
        else if(nn=="IMG" && n.src) {
            ....
        }
        else if(nn=="DIV") {
            if(n.className=="indent")
                ...
        }
        else if(nn=="FONT") {
        }
        else if(nn=="BR") {
        }

        // recurse into children unless this element type is unsupported
        if(!UNSUPPORTED_ELEMENTS[nn]) {
            if(n.childNodes.length > 0)
                for(var i=0; i<n.childNodes.length; i++)
                    cleanDom(result, n.childNodes[i]);
        }
    }
}
rob
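To make the walker idea concrete, here is one hedged way it could be finished so that it emits a string in a single pass. This is not rob's original code: KEEP and DROP_WITH_CONTENT are made-up tables standing in for the UNSUPPORTED_ELEMENTS lookup that was not shown, and cleanNode and cleanToHtml are hypothetical names.

```javascript
// Hypothetical whitelist of elements that survive (attributes are dropped).
var KEEP = { P: true, B: true, I: true, BR: true };
// Hypothetical list of elements whose contents are discarded entirely.
var DROP_WITH_CONTENT = { STYLE: true, SCRIPT: true, XML: true };

// Escape characters that would otherwise be read as markup.
function escapeText(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Recursively append the cleaned form of one node to the "out" array.
function cleanNode(node, out) {
  if (node.nodeType === 3) {          // text node: keep the text, escaped
    out.push(escapeText(node.nodeValue));
    return;
  }
  if (node.nodeType !== 1) return;    // comments and the like are skipped
  var name = node.nodeName.toUpperCase();
  if (DROP_WITH_CONTENT[name] === true) return;
  var keep = KEEP[name] === true;
  if (keep) out.push("<" + name.toLowerCase() + ">");
  // Children are always visited, so text inside a dropped tag is preserved.
  for (var i = 0; i < node.childNodes.length; i++) {
    cleanNode(node.childNodes[i], out);
  }
  if (keep && name !== "BR") out.push("</" + name.toLowerCase() + ">");
}

// Clean everything below "root" and return it as an HTML string.
function cleanToHtml(root) {
  var out = [];
  for (var i = 0; i < root.childNodes.length; i++) {
    cleanNode(root.childNodes[i], out);
  }
  return out.join("");
}
```

With a real DOM you would set innerHTML on a detached DIV to the pasted markup and pass that DIV to cleanToHtml. Because the walker only reads nodeType, nodeName, nodeValue and childNodes, it can also be exercised with plain objects of the same shape.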