I believe this may be related to http://stackoverflow.com/questions/1038129/need-pure-jquery-javascript-solution-for-cleaning-word-html-from-text-area

In my case, though, I am using CKEditor. Before sending the data to the server (or after receiving it back), I'd like to strip out "junk" HTML tags and comments such as those that appear when pasting from recent (2007 or later) versions of Microsoft Office. Because the server side here is a third-party application, I'd prefer to do this on the client if I can. Yes, I am aware of the security risks of doing that; this is only meant to sanitize data in common use cases.

Are there any common techniques or existing libraries (especially jQuery-friendly) that can do this? Note, I am not looking to encode or strip all HTML, only the Office-related crud.

+2  A: 

Have you tried CKEditor's built-in Word cleanup functionality? It appears to run automatically when you use the "Paste From Word" dialog, but it can also be called from your own code. I'm not an expert on the CKEditor API, so there may be a more efficient or correct way to do this, but the following seems to work on the current release (3.3.1):

function cleanUp() {

    if (!CKEDITOR.cleanWord) {
        // The filter is lazily loaded by the pastefromword plugin, so we have to load it ourselves.
        // cleanUp is passed as the callback, so it runs again once the filter script has loaded.
        // Change the script path to the correct one for your installation.
        CKEDITOR.scriptLoader.load("../plugins/pastefromword/filter/default.js", cleanUp, null, false, true);
        alert('loading script for the first usage');
    } else { // CKEDITOR.cleanWord is available for use

        // Change this to the correct editor instance
        var editor = CKEDITOR.instances.editor1;
        // Perform the cleanup
        var cleanedUpData = CKEDITOR.cleanWord(editor.getData(), editor);

        // Do something with the cleaned-up data
        alert(cleanedUpData);
    }
}

cleanUp();

If you're not happy with the result, you can modify default.js to suit your cleanup needs. There are also some configuration options for the cleanup; see http://docs.cksource.com/ckeditor_api/symbols/CKEDITOR.config.html (search for the "pasteFromWord" options).
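As a rough illustration of those options, here is a sketch of a config object using three of the "pasteFromWord" settings listed in the 3.x API docs. The variable name pasteFromWordConfig and the chosen values are only for illustration, and I am assuming (not having verified it) that the filter reads them from the editor's config when CKEDITOR.cleanWord is called directly:

```javascript
// Illustrative config object; the name and the values are assumptions, not recommendations.
var pasteFromWordConfig = {
    // Ignore font-related formatting (size, family, foreground/background color)
    pasteFromWordRemoveFontStyles: true,
    // Remove element styles that cannot be managed with the editor
    pasteFromWordRemoveStyles: true,
    // Prompt the user about cleaning up content pasted from Word
    pasteFromWordPromptCleanup: true
};

// In the page, assuming a textarea with id "editor1":
// CKEDITOR.replace('editor1', pasteFromWordConfig);
```

The same properties can also be set in config.js if you want them to apply to every editor instance.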

If you need something more advanced and can accept a dependency on a server, I suggest you take a look at WordOff (http://wordoff.org/). You might be able to build a proxy and a JSONP wrapper around their service so that you can use it from the client without installing anything on your own server.
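If neither the built-in filter nor a remote service fits, a crude regex pass on the client is another possibility. The sketch below is not part of CKEditor; stripOfficeMarkup is a hypothetical helper, and the patterns only cover the most common Office leftovers (HTML comments, namespaced tags such as <o:p>, "Mso" classes and mso-* style declarations). It removes all HTML comments, not just Office ones, and since it works on the raw string it could also touch text content that happens to look like an mso-* declaration:

```javascript
// Hypothetical helper, not a CKEditor API: a rough regex-based cleanup of Word-pasted HTML.
function stripOfficeMarkup(html) {
    return html
        // Remove HTML comments, including conditional ones like <!--[if gte mso 9]>...<![endif]-->
        .replace(/<!--[\s\S]*?-->/g, "")
        // Remove namespaced Office tags such as <o:p>, </o:p>, <w:WordDocument>, <v:shape>
        .replace(/<\/?[a-z]+:[^>]*>/gi, "")
        // Remove class attributes whose value starts with "Mso" (e.g. class="MsoNormal")
        .replace(/\s+class=("Mso[^"]*"|'Mso[^']*'|Mso[^\s>]*)/gi, "")
        // Remove mso-* declarations inside style attributes
        .replace(/mso-[a-z-]+\s*:[^;"']*;?\s*/gi, "")
        // Remove style attributes that are now empty
        .replace(/\s+style=(""|'')/gi, "");
}

var dirty = '<p class="MsoNormal" style="mso-margin-top-alt:auto;color:red">' +
            '<!--[if gte mso 9]><xml><o:x/></xml><![endif]-->Hello<o:p></o:p></p>';
var clean = stripOfficeMarkup(dirty);
// clean is '<p style="color:red">Hello</p>'
```

You could run this on editor.getData() before submitting the form, but for anything beyond common cases a real parser-based filter (like the one in default.js) is the safer choice.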

Amitay Dobo