I have a rich text editor that passes HTML to the server. That HTML is then displayed to other users. I want to make sure there is no JavaScript in that HTML. Is there any way to do this?
Also, I'm using ASP.NET if that helps.
The simplest thing to do would be to strip out script tags with a regex. The trouble is that you can do plenty of nasty things without script tags (e.g. embed dodgy images, or link to other sites that host nasty JavaScript). Disabling HTML completely by converting the less-than/greater-than characters into their HTML entity forms (e.g. &lt;) could also be an option.
If you want a more powerful solution, in the past I have used AntiSamy to sanitize incoming text so that it's safe for viewing.
You may want to check how some browser-based WYSIWYG editors such as TinyMCE do it. They usually remove JS and seem to do a reasonable job of it.
If you want the HTML to be shown as text, so users can see the HTML code itself, do a string replace of '&', '<' and '>' with their HTML entities, replacing '&' first so that the entities you produce are not escaped a second time. For example, '<' becomes '&lt;'.
If you want the HTML to work, the easiest way is to remove all HTML and JavaScript and then put back only the HTML you want to allow. Unfortunately, there is almost no sure way of removing all JavaScript while allowing only HTML.
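As a minimal sketch of the "show the markup as text" option, here is the escaping step in Python using the standard library's `html.escape` (the question is about ASP.NET, so treat this as a language-neutral illustration; the function name `escape_for_display` is made up for the example):

```python
import html


def escape_for_display(user_html: str) -> str:
    """Turn markup into inert text so the browser shows it instead of running it."""
    # html.escape replaces '&' first, then '<' and '>'.
    # quote=True also escapes double and single quotes, which matters if the
    # result is ever placed inside an HTML attribute value.
    return html.escape(user_html, quote=True)


if __name__ == "__main__":
    print(escape_for_display("<script>alert(1)</script> & more"))
    # prints: &lt;script&gt;alert(1)&lt;/script&gt; &amp; more
```

Nothing is filtered here: every tag is neutralised, so this only fits the case where no formatting needs to survive.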
For example you may want to allow images. However you may not know that you can do
<img src='evilscript.js'>
and, depending on the browser, it may run that script. It becomes very unsafe very fast. This is why many websites, such as Wikipedia and this one, use a special lightweight markup language (wiki markup, Markdown) instead of raw HTML. This makes it much easier to allow formatting but not malicious JavaScript.
The only way to ensure that some HTML markup does not contain any JavaScript is to filter out all unsafe HTML tags and attributes, in order to prevent Cross-Site Scripting (XSS).
However, there is in general no reliable way of explicitly removing all unsafe elements and attributes by name, since certain browsers may interpret ones that you weren't even aware of at design time, and thus open up a security hole for malicious users. This is why you're much better off taking a whitelisting approach rather than a blacklisting one. That is to say, allow only the HTML tags that you are sure are safe, and strip all others by default. Indeed, a single accidentally permitted tag can make your website vulnerable to XSS.
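To make the whitelisting idea concrete, here is a rough sketch in Python built on the standard library's `html.parser`. The allowed tags, attributes and URL schemes are assumptions chosen for the example, not a vetted policy, and the class and function names are hypothetical. For production use, a maintained sanitizer library is the better choice.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical allow-list for illustration only; a real policy needs review.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = {"http", "https", "mailto"}
VOID_TAGS = {"br"}
DROP_CONTENT = {"script", "style"}  # discard the text inside these as well


def _is_safe_url(value: str) -> bool:
    """Accept only absolute URLs whose scheme is explicitly allowed."""
    try:
        return urlparse(value.strip()).scheme.lower() in ALLOWED_SCHEMES
    except ValueError:
        return False


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._dropping = None  # name of the script/style element being skipped

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._dropping = tag
        if tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag itself, keep any text inside
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # on* handlers, style, etc. never get through
            if name == "href" and not _is_safe_url(value):
                continue  # e.g. javascript: URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag == self._dropping:
            self._dropping = None
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._dropping is None:
            self.out.append(escape(data, quote=False))  # text is re-escaped


def sanitize(user_html: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(user_html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<p onclick="x()">Hi <script>alert(1)</script><b>there</b></p>'))
    # prints: <p>Hi <b>there</b></p>
```

The important property is the default: anything the code does not recognise (tags, attributes, comments, declarations) is dropped, so a browser-specific element nobody thought of is harmless rather than a hole. This sketch does not rebalance unclosed tags, which a DOM-based library would handle for you.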
See this article on HTML sanitisation, which offers some specific examples of why you should whitelist rather than blacklist. Quote from that page:
Here is an incomplete list of potentially dangerous HTML tags and attributes:
- script, which can contain malicious script
- applet, embed, and object, which can automatically download and execute malicious code
- meta, which can contain malicious redirects
- onload, onunload, and all other on* attributes, which can contain malicious script
- style, link, and the style attribute, which can contain malicious script
Here is another helpful page that suggests a set of HTML tags & attributes as well as CSS attributes that are typically safe to allow, as well as recommended practices.
Although many websites have used (and still use) the blacklisting approach, there is almost never any true need for it. (The security risks invariably outweigh the limitations that whitelisting places on the formatting capabilities granted to the user.) You need to be very aware of its flaws.
For example, this page gives a list of what are supposedly "all" the HTML tags you might want to strip out. Just from observing it briefly, you should notice that it contains a very limited number of element names; a browser could easily include a proprietary tag that unwittingly allowed scripts to run on your page, which is essentially the main problem with blacklisting.
Finally, I would strongly recommend that you utilise an HTML DOM library (such as the well-known HTML Agility Pack) for .NET, as opposed to RegEx to perform the cleaning/whitelisting, since it would be significantly more reliable. (It is quite possible to create some pretty crazy obfuscated HTML that can fool regexes! A proper HTML reader/writer makes the coding of the system much easier, anyway.)
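To illustrate how easily a regex can be fooled, here is a small Python sketch. The regex and the function name `naive_strip_script_tags` are made up for the example and stand in for a typical single-pass blacklist filter:

```python
import re


def naive_strip_script_tags(user_html: str) -> str:
    """A single-pass blacklist filter of the kind that is easy to bypass."""
    return re.sub(r"</?script[^>]*>", "", user_html, flags=re.IGNORECASE)


if __name__ == "__main__":
    # Removing the inner tags splices the outer fragments back into real tags.
    payload = "<scr<script>ipt>alert(1)</scr</script>ipt>"
    print(naive_strip_script_tags(payload))
    # prints: <script>alert(1)</script>

    # Script that never uses a script tag passes through untouched.
    print(naive_strip_script_tags("<img src=x onerror=alert(1)>"))
    # prints: <img src=x onerror=alert(1)>
```

Running the filter repeatedly until the output stops changing fixes the first case but not the second, which is the general pattern: each patch to a blacklist closes one trick and leaves the rest.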
Hopefully that gives you a decent overview of what you need to design in order to fully (or at least maximally) prevent XSS, and of why it's critical that HTML sanitisation is performed with the unknown factor in mind.
As pointed out by Lee Theobald, that's a very dangerous plan. You cannot, by definition, ever produce "safe" HTML by filtering/blacklisting, since the user might put things into the HTML that you didn't think about (or that don't even exist in your browser version, but do in others).
The only safe way is a whitelisting approach, i.e. strip everything but plain text and certain specific HTML constructs. This incidentally is what stackoverflow.com does :-).