I'm coding a WYSIWYG editor with designMode="on" on an iframe. The editor works fine and I store the code as is in the database.

Before outputting the HTML I need to "clean" it with PHP on the server side to avoid cross-site scripting and other scary things. Is there some sort of best practice on how to do this? What tags can be dangerous?

UPDATE: Typo fixed, it's What You See Is What You Get. Nothing new :)

+5  A: 

The best practice is to allow only certain things you know aren't dangerous, and remove/escape all the rest. See the paper Automated Malicious Code Detection and Removal on the Web (OWASP AntiSamy) for a discussion on this (the library is for Java, but the principles apply for any language).
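As a rough illustration of the allowlist idea, here is a minimal sketch in Python using only the standard library parser. The tag list, attribute list, URL prefixes, and the names `AllowlistSanitizer` and `sanitize` are assumptions made up for this example, not AntiSamy's policy or API. A maintained library is a better choice for production use.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist for illustration; a real policy needs careful review.
ALLOWED_TAGS = {"b", "strong", "i", "em", "u", "p", "br", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")
# Elements whose text content is dropped entirely, not just the tags.
DROP_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped, their text is kept
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()) or value is None:
                continue  # drops onclick, style, and anything else not listed
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue  # drops javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Everything not explicitly listed is removed or escaped, so a tag or attribute nobody thought of fails safe instead of slipping through.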

Chris Lercher
I started out that way, but since all browsers implement this differently, I end up with many different tags for the same thing that I need to allow. For example, bold text is produced in at least three different ways, so it will be a huge set of regexes. It's also possible to paste whatever formatted HTML you want into the editor, for example from an HTML email. That looks good in the editor but won't work after escaping.
Martin
That's why AntiSamy already comes with some example sets. There's probably a PHP library too (or you could create one). You will *never* achieve it the other way around (by blacklisting): everyone who has tried this before has failed. It's simply not realistically possible, because there *will* be something you haven't covered (which is fatal for blacklisting, but doesn't matter too much when whitelisting). Ideally, if you can avoid HTML, use Markdown etc., as suggested by Hank!
Chris Lercher
@Martin you *REALLY* shouldn't be using regexes for this. There's a reason [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) got (net) 3000 upvotes.
Hank Gay
@chris_l: Okay, I'm convinced now that I should do whitelisting instead of blacklisting. @Hank Gay: But I'm not really going to parse HTML, I'm just going to replace < with &lt; and then replace &lt; back to < on a small set of known patterns. Is that still like going on a date with Satan?
Martin
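The escape-then-restore idea from the comment above could look roughly like this. It is a sketch in Python, not a vetted sanitizer: the tag list and the name `escape_then_restore` are made up for illustration, and in PHP the escaping step would be htmlspecialchars(). Only exact, attribute-free patterns are restored, so anything unexpected stays escaped.

```python
import re
from html import escape

# Hypothetical set of tags to restore; attributes are deliberately unsupported.
RESTORABLE = ("b", "i", "em", "strong", "u", "p", "ul", "ol", "li")
_TAG_RE = re.compile(r"&lt;(/?)(%s)&gt;" % "|".join(RESTORABLE), re.IGNORECASE)


def escape_then_restore(text):
    # Step 1: escape everything, so no markup survives by default.
    escaped = escape(text, quote=True)
    # Step 2: turn back only bare tags such as <b> or </b>.
    return _TAG_RE.sub(lambda m: "<%s%s>" % (m.group(1), m.group(2).lower()), escaped)
```

Note that this does not check whether tags are balanced, so an unclosed `<b>` can still affect the rest of the page, and links or images would need a stricter, separately validated pattern.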
A: 

If you are familiar with ASP.NET, just call Server.HtmlEncode() to convert special characters like < and > to "&lt;" and "&gt;".

In PHP, you can use the htmlspecialchars() function.

Once the special characters are encoded, the browser displays the input as plain text rather than interpreting it as markup, which prevents script injection in that output.
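To make the effect of encoding concrete, here is a small sketch in Python, where `html.escape` plays a role similar to PHP's htmlspecialchars(). The function name `render_comment` is hypothetical.

```python
from html import escape


def render_comment(user_text):
    # quote=True also encodes " and ', which matters inside attribute values.
    return "<p>" + escape(user_text, quote=True) + "</p>"


print(render_comment('<script>alert("x")</script>'))
# prints: <p>&lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;</p>
```

The script tag arrives in the page as visible text, not as an executable element. The trade-off is that all formatting from the editor is shown as literal text too.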

StartClass0830
But that disables HTML. I want to allow HTML but remove dangerous tags like iframe and script.
Martin
Then use a markup specifically designed for the purpose, like BBCode or wikicode, and a suitable editor.
symcbean
+3  A: 

If you're really bent on allowing this, you should use a white list approach.

The best approach is probably to disallow HTML and use a simplified markup format instead; you can pre-render to HTML and store that in the database if performance is a concern. Avoiding these sorts of problems is one of the big reasons for using Markdown, Textile, reStructuredText, etc.
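As a toy illustration of why a simplified markup sidesteps the problem, the sketch below escapes all input first and then generates the only HTML itself. It handles just `**bold**`, `*emphasis*`, and blank-line paragraphs. It is not Markdown, Textile, or reStructuredText, and the name `render_toy_markup` is invented for this example. A real project would use an existing renderer.

```python
import re
from html import escape

_BOLD = re.compile(r"\*\*(.+?)\*\*")
_EM = re.compile(r"\*(.+?)\*")


def render_toy_markup(source):
    # Escape first: any HTML the user typed becomes inert text.
    safe = escape(source, quote=True)
    # Then emit the only tags that can ever appear in the output.
    safe = _BOLD.sub(r"<strong>\1</strong>", safe)
    safe = _EM.sub(r"<em>\1</em>", safe)
    paragraphs = [p.strip() for p in safe.split("\n\n") if p.strip()]
    return "\n".join("<p>%s</p>" % p for p in paragraphs)
```

The rendered string is what could be stored in the database if rendering on every request is too slow, while the original markup is kept for later editing.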

NOTE: I linked to GitHub-Flavored Markdown (GFM), not Standard Markdown (SM). GFM addresses some common problems that end-users have with SM.

Hank Gay
+1  A: 

Hi

I looked into the same question recently with Perl as the server-side language.

While doing so I ran into HTML Purifier which may be what you want. But obviously as it's in PHP and not Perl, I didn't actually test it out.

Also, in my research I came to the conclusion that this is a very tricky business, so consider using a simplified markup language like Markdown if possible, as suggested by Hank Gay.

FalseVinylShrub