ansaurus

Question

Clean Microsoft Word Pasted Text using JavaScript

Answer 1

+2 A:

How about having a "paste as plain text" button which displays a <textarea>, allowing the user to paste the text in there? That way, all tags will be stripped for you. That's what I do with my CMS; I gave up trying to clean up Word's mess.

Josh 2010-05-20 15:17:47

This would be my worst-case scenario, I suppose (and the way it's looking, it may be the only scenario, which is very depressing).

OneNerd 2010-05-20 15:38:38

@OneNerd: I marked your question as a favorite because if anyone else has a better solution I think I'll use it too!

Josh 2010-05-20 18:12:34

I came up with something I *think* may be usable. See my answer (and please improve it too, if you would like). Thanks.

OneNerd 2010-05-20 18:17:35

Answer 2

A:

Could you paste to a hidden textarea, copy from same textarea, and paste to your target?

souLTower 2010-05-20 15:23:42

Hmm. Do you know a way to send the pasted content to a textarea so that it arrives as plain text instead of markup? Since the keypress is on the DIV, I can read the contents and pass them to the textarea, but they wouldn't be plain text.

OneNerd 2010-05-20 15:40:04
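
One possible answer to the question above, as a minimal sketch: browsers that expose `clipboardData` on the paste event can hand over the plain-text flavour of the clipboard directly, so no hidden textarea is needed. The helper name `plainTextFromPaste`, the `editorDiv` reference and the `insertAtCursor` routine are all hypothetical, and the sketch assumes a browser where `event.clipboardData.getData("text/plain")` is available (older browsers may not provide it).

```javascript
// Hypothetical helper: returns the plain-text flavour of a paste event,
// or null when the event carries no usable clipboard data.
function plainTextFromPaste(event) {
  var data = event && event.clipboardData;
  if (!data || typeof data.getData !== "function") {
    return null; // caller can fall back to the textarea approach
  }
  return data.getData("text/plain");
}

// Wiring it up in a browser (editorDiv and insertAtCursor are assumed names):
// editorDiv.addEventListener("paste", function (e) {
//   var text = plainTextFromPaste(e);
//   if (text !== null) {
//     e.preventDefault();    // stop the rich-text paste
//     insertAtCursor(text);  // hypothetical insertion routine
//   }
// });
```

When the helper returns null, the "paste as plain text" textarea from Answer 1 remains the fallback.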

Answer 3

A:

Hate to say it, but I eventually gave up making TinyMCE handle Word crap the way I want. Now I just have an email sent to me every time a user's input contains certain HTML (look for <span lang="en-US"> for example) and I correct it manually.

Coronatus 2010-05-20 15:25:44
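
The detection step described above could look roughly like the following sketch. Only the `<span lang="en-US">` marker comes from the answer; the other patterns (the `Mso` class prefix, `<o:p>` tags, `mso-` style properties) are assumptions about what Word typically emits, and `looksLikeWordHtml` is a hypothetical name.

```javascript
// Marker list is an assumption; only the first entry is taken from the answer.
var WORD_MARKERS = [
  /<span\s+lang="en-US"/i, // marker mentioned in the answer
  /class="?Mso/i,          // e.g. class=MsoNormal
  /<\/?o:p\b/i,            // Office namespace paragraph tags
  /mso-[a-z-]+\s*:/i       // mso-* style properties
];

// Returns true when the submitted HTML contains any of the markers.
function looksLikeWordHtml(html) {
  for (var i = 0; i < WORD_MARKERS.length; i++) {
    if (WORD_MARKERS[i].test(html)) {
      return true;
    }
  }
  return false;
}
```

A check like this only flags input for review or cleanup; it does not remove anything itself.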

Yikes - not really an option for me.

OneNerd 2010-05-20 15:40:53

Answer 4

A:

Here is the function I wound up writing that does the job fairly well (as far as I can tell anyway).

I am certainly open for improvement suggestions if anyone has any. Thanks.

function cleanWordPaste( in_word_text ) {
 // let the browser parse the markup, then read back only the text
 var tmp = document.createElement("DIV");
 tmp.innerHTML = in_word_text;
 var newString = tmp.textContent || tmp.innerText || "";
 // this next piece converts line breaks into break tags
 // and removes the seemingly endless crap code
 newString = newString.replace(/\n\n/g, "<br />").replace(/.*<!--.*-->/g, "");
 // this next piece removes any break tags (up to 10) at beginning
 for ( var i = 0; i < 10; i++ ) {
  if ( newString.substr(0, 6) == "<br />" ) {
   newString = newString.replace("<br />", "");
  }
 }
 return newString;
}

Hope this is helpful to some of you.

OneNerd 2010-05-20 18:16:06
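
As one possible improvement, the post-processing half of the function above can be written against a plain string, which makes it easy to try outside a browser. This is a variant and not the author's code: it removes comment blocks that span several lines, collapses any run of blank lines into one break tag, and strips every leading break tag instead of at most ten. The name `cleanWordText` is hypothetical, and the input is assumed to be text already extracted via `textContent`.

```javascript
// Variant of the string clean-up step; expects text, not markup.
function cleanWordText(text) {
  return text
    // drop comment blocks, including ones that span several lines
    .replace(/<!--[\s\S]*?-->/g, "")
    // normalise Windows and old Mac line endings
    .replace(/\r\n?/g, "\n")
    // one break tag per run of blank lines
    .replace(/\n{2,}/g, "<br />")
    // strip all leading break tags
    .replace(/^(?:<br \/>)+/, "");
}
```

Usage: `cleanWordText(tmp.textContent || tmp.innerText || "")` in place of the replace calls and the loop.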

Answer 5

A:

I did something like that long ago, where I completely cleaned up the content of a rich text editor and converted font tags to styles, brs to p's, etc., to keep it consistent between browsers and to prevent certain ugly things from getting in via paste. I took my recursive function and ripped out most of it except for the core logic. This might be a good starting point, if that is what you need ("result" is an object that accumulates the result, which probably takes a second pass to convert to a string):

var cleanDom = function(result, n) {
    var nn = n.nodeName;
    if(nn=="#text") {
        var text = n.nodeValue;
        // ...
    }
    else {
        if(nn=="A" && n.href) {
            // ...
        }
        else if(nn=="IMG" && n.src) {
            // ....
        }
        else if(nn=="DIV") {
            if(n.className=="indent") {
                // ...
            }
        }
        else if(nn=="FONT") {
        }
        else if(nn=="BR") {
        }

        // recurse into children unless the element is on the skip list
        if(!UNSUPPORTED_ELEMENTS[nn]) {
            for(var i=0; i<n.childNodes.length; i++)
                cleanDom(result, n.childNodes[i]);
        }
    }
};

rob 2010-05-20 18:16:23
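
To make the skeleton above concrete, here is a minimal sketch that fills in only the simplest behaviour: collect text nodes, turn BR into a newline, and skip unsupported elements. The post does not define `UNSUPPORTED_ELEMENTS`, so the map below is an assumption, as are the name `collectText` and the choice of an array for `result`. The function relies only on `nodeName`, `nodeValue` and `childNodes`, so it accepts real DOM nodes or plain objects of the same shape.

```javascript
// Assumed skip list; the original post references it without defining it.
var UNSUPPORTED_ELEMENTS = {
  SCRIPT: true, STYLE: true, META: true, LINK: true, "#comment": true
};

// Recursively appends the text found under node n to the result array.
function collectText(result, n) {
  var nn = n.nodeName;
  if (nn === "#text") {
    result.push(n.nodeValue);
    return;
  }
  if (nn === "BR") {
    result.push("\n"); // keep line breaks as newlines
  }
  if (!UNSUPPORTED_ELEMENTS[nn]) {
    var kids = n.childNodes || [];
    for (var i = 0; i < kids.length; i++) {
      collectText(result, kids[i]);
    }
  }
}
```

Usage: `var out = []; collectText(out, someDiv); var text = out.join("");` which is the second pass to a string that the answer mentions.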
