I have a site where users can post stuff (as in forums, comments, etc) using a customised implementation of TinyMCE. A lot of them like to copy & paste from Word, which means their input often comes with a plethora of associated MS inline formatting.

I can't just get rid of <span whatever> as TinyMCE relies on the span tag for some of its formatting, and I can't (and don't want to) force said users to use TinyMCE's "Paste From Word" feature (which doesn't seem to work that well anyway).

Anyone know of a library/class/function that would take care of this for me? It must be a common problem, though I can't find anything definitive. I've been thinking recently that a series of brute-force regexes looking for MS-specific patterns might do the trick, but I don't want to re-write something that may already be available unless I must.
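To make the brute-force idea concrete, here is a rough sketch of what such a series of regexes might look like. It is written in Python purely for illustration; the function name `strip_word_markup` and the pattern list are hypothetical, and the patterns are assumptions about what Word paste output typically contains (conditional comments, `o:`/`w:` namespaced tags, `Mso*` classes, `mso-*` style declarations). It leaves span tags in place, since TinyMCE needs them.

```python
import re

# Hypothetical pattern list: each entry is (compiled regex, replacement).
# These are guesses at common Word paste debris, not an exhaustive filter.
_WORD_PATTERNS = [
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    (re.compile(r"<!--\[if.*?<!\[endif\]-->", re.S | re.I), ""),
    # Any remaining HTML comments
    (re.compile(r"<!--.*?-->", re.S), ""),
    # Namespaced Office tags such as <o:p> or </w:WordDocument>
    (re.compile(r"</?[a-z]+:[^>]*>", re.I), ""),
    # class="MsoNormal" and similar (double-quoted, single-quoted, unquoted)
    (re.compile(r"""\s+class=("Mso[^"]*"|'Mso[^']*'|Mso\w+)""", re.I), ""),
    # mso-* declarations inside style attributes
    (re.compile(r"""mso-[a-z-]+\s*:[^;"']*;?\s*""", re.I), ""),
    # style attributes left empty by the previous step
    (re.compile(r"""\s+style=("\s*"|'\s*')""", re.I), ""),
]


def strip_word_markup(html):
    """Apply each pattern in order and return the cleaned markup."""
    for pattern, replacement in _WORD_PATTERNS:
        html = pattern.sub(replacement, html)
    return html
```

Note that the `mso-*` pattern is not tag-aware, so it would also eat literal text like "mso-foo: bar" in a post body. The same patterns should port to PHP's `preg_replace`, which accepts arrays of patterns and replacements.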

Also, fixing curly quotes, em-dashes, etc. would be good. I have my own stuff to do this now, but I'd really just like to find one MS-conversion filter to rule them all.
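For reference, the character-fixing part can be as small as a lookup table. This is a minimal sketch in Python, assuming "fixing" means replacing Word's typographic characters with plain ASCII (swapping in HTML entities instead would work the same way). The name `normalize_punctuation` and the choice of replacements are hypothetical.

```python
# Map of common Word "smart" characters to plain ASCII equivalents.
# The replacement choices here are assumptions; adjust to taste.
_SMART_CHARS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(_SMART_CHARS)


def normalize_punctuation(text):
    """Replace each mapped character; everything else passes through."""
    return text.translate(_TABLE)
```

This only works on text that has already been decoded correctly. Input mangled by a Windows-1252 versus UTF-8 mix-up would need to be re-decoded first.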

+2  A: 

HTML Purifier will create standards-compliant markup and filter out many possible attacks (such as XSS).

For faster cleanups that don't require XSS filtering, I use the PECL extension Tidy, which is a binding for the Tidy HTML utility.

If those don't help you, I suggest you switch to FCKeditor, which has this feature built in.

Eran Galperin
Thanks, but neither of those appears to cope with MS formatting, which is what I'm primarily interested in. HTML Purifier has it planned for version 3.5 but with "research necessary".
da5id
Then I suggest you switch to FCKeditor, which can deal with Word input. Updated my answer.
Eran Galperin
Mind you, (if I switch) I still need to clean all the crap that's *already* been posted...
da5id
Try the non-PHP suggestions in the following link: http://forums.devarticles.com/general-programming-help-4/removing-ms-word-html-from-a-file-4068.html
Eran Galperin
A: 

Not sure if this is helpful for you or not, but I asked a similar question a while back, and got some fairly useful answers.

Ben
Thanks Ben, good thought but I have the non-standard characters stuff covered already.
da5id