views:

516

answers:

5

I have a rich text editor that passes HTML to the server. That HTML is then displayed to other users. I want to make sure there is no JavaScript in that HTML. Is there any way to do this?

Also, I'm using ASP.NET if that helps.

+1  A: 

The simplest thing to do would be to strip out tags with a regex. Trouble is that you could do plenty of nasty things without script tags (e.g. embed dodgy images, or link to other sites that host nasty JavaScript). Disabling HTML completely by converting the less-than/greater-than characters into their HTML entity forms (e.g. &lt;) could also be an option.

If you want a more powerful solution, in the past I have used AntiSamy to sanitize incoming text so that it's safe for viewing.

Lee Theobald
Actually, "strip out tags with a regex" is not the best of recommendations to give.
Tomalak
Will be using AntiSamy
Stephen lacy
I'm not familiar with AntiSamy, but I would recommend that you ensure it's well-designed before using it (i.e. that it takes a whitelisting approach, for a start). Also, regex is *definitely* not the way to go, even for a simple solution.
Noldorin
it is a whitelisting approach
Stephen lacy
A: 

You may want to check how some browser-based WYSIWYG editors such as TinyMCE do it. They usually remove JS and seem to do a reasonable job at it.

Darryl Hein
Yeah, they do that, but if you're a bit of a "hacker" you can put the TinyMCE editor in text mode, and then when you save your data there is still a chance that the user has modified the text to include JavaScript.
Nordes
Well, this is true for any JS. You can always disable JS and submit whatever you want. You should instead be looking at what you can do with ASP.NET, since you'll want to protect yourself on the server, where you have control, versus the browser, where you have very little.
Darryl Hein
+1  A: 

If you want the HTML to be displayed so users can see the HTML code itself, do a string replace of '&', '<' and '>' with their entity forms, replacing '&' first so it isn't double-encoded. For example, '<' becomes '&lt;'.
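As a hedged sketch, this entity replacement is exactly what standard-library escaping helpers do; the snippet below is illustrative Python (in ASP.NET the equivalent is Server.HtmlEncode):

```python
from html import escape

# escape() performs the replacements in a safe order: '&' is encoded
# first, so entities it produces are not double-encoded.
user_input = "<script>alert('xss')</script>"
safe = escape(user_input, quote=True)  # quote=True also encodes ' and "
print(safe)  # &lt;script&gt;alert(&#x27;xss&#x27;)&lt;/script&gt;
```

The result renders as literal text in the browser, so no markup survives at all.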

If you want the HTML to work, the easiest way is to remove all HTML and JavaScript and then add back only the HTML. Unfortunately there is almost no sure way of removing all JavaScript while allowing only HTML.

For example, you may want to allow images. However, you may not know that an image tag can carry script in an event-handler attribute:

<img src='x' onerror='evilScript()'>

and the browser will run that script when the image fails to load. It becomes very unsafe very fast. This is why most websites, like Wikipedia and this one, use a special markup language (Markdown). This makes it much easier to allow formatting but not malicious JavaScript.
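To illustrate how fragile tag-stripping is, here is a small sketch (illustrative Python; the regex and the evil() payloads are hypothetical examples): a blacklist regex that removes script tags is bypassed by event-handler attributes, and nested tags reassemble into a working tag after stripping.

```python
import re

# A naive blacklist: strip <script> opening and closing tags.
def strip_script(html_text):
    return re.sub(r'</?script[^>]*>', '', html_text, flags=re.IGNORECASE)

print(strip_script("<script>evil()</script>"))     # evil()  -- tag removed, fine
print(strip_script("<img src=x onerror=evil()>"))  # unchanged: no script tag to strip!
# Nested tags reassemble into a working <script> tag after stripping:
print(strip_script("<scr<script>ipt>evil()</scr<script>ipt>"))
# -> <script>evil()</script>
```

The last payload is a classic: removing the inner `<script>` substrings glues the surrounding fragments back together into exactly the tag the filter was meant to remove.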

+8  A: 

The only way to ensure that some HTML markup does not contain any JavaScript is to filter out all unsafe HTML tags and attributes, in order to prevent Cross-Site Scripting (XSS).

However, there is in general no reliable way of explicitly removing all unsafe elements and attributes by name, since certain browsers may interpret ones you weren't even aware of at the time of design, and thus open up a security hole for malicious users. This is why you're much better off taking a whitelisting approach rather than a blacklisting one. That is to say, only allow HTML tags that you are sure are safe, and strip all others by default. Indeed, a single accidentally permitted tag can make your website vulnerable to XSS.


Whitelisting (good approach)

See this article on HTML sanitisation, which offers some specific examples of why you should whitelist rather than blacklist. Quote from that page:

Here is an incomplete list of potentially dangerous HTML tags and attributes:

  • script, which can contain malicious script
  • applet, embed, and object, which can automatically download and execute malicious code
  • meta, which can contain malicious redirects
  • onload, onunload, and all other on* attributes, which can contain malicious script
  • style, link, and the style attribute, which can contain malicious script

Here is another helpful page that suggests a set of HTML tags & attributes as well as CSS attributes that are typically safe to allow, as well as recommended practices.
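As a hedged sketch of the whitelisting idea (illustrative Python on the standard library's HTML parser; in .NET you would build the same thing on top of the HTML Agility Pack, and the tag/attribute lists here are example choices, not a vetted policy):

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br", "ul", "ol", "li"}
ALLOWED_ATTRS = {}  # e.g. {"a": {"href"}} -- only once href values are validated too

class WhitelistSanitizer(HTMLParser):
    """Keep only whitelisted tags/attributes; escape everything else as text."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            kept = " ".join(f'{k}="{escape(v or "", quote=True)}"'
                            for k, v in attrs if k in ALLOWED_ATTRS.get(tag, ()))
            self.out.append(f"<{tag} {kept}>" if kept else f"<{tag}>")
        # Unknown tags are dropped entirely.

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Note: the contents of disallowed tags still pass through as escaped
        # plain text; a production sanitiser would also drop the bodies of
        # script/style elements.
        self.out.append(escape(data))

def sanitize(html_text):
    p = WhitelistSanitizer()
    p.feed(html_text)
    p.close()
    return "".join(p.out)
```

For example, `sanitize("<b onclick=evil()>hi</b>")` keeps the `<b>` but drops the non-whitelisted `onclick` attribute, and `<img src=x onerror=evil()>` disappears entirely because `img` is not on the list. The crucial property is the default: anything not explicitly allowed is removed or escaped.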

Blacklisting (generally bad approach)

Although many websites have in the past used (and currently use) the blacklisting approach, there is almost never any true need for it. (The security risks invariably outweigh the limitations that whitelisting places on the formatting capabilities granted to the user.) You need to be very aware of its flaws.

For example, this page gives a list of what are supposedly "all" the HTML tags you might want to strip out. Just from observing it briefly, you should notice that it contains a very limited number of element names; a browser could easily include a proprietary tag that unwittingly allowed scripts to run on your page, which is essentially the main problem with blacklisting.


Finally, I would strongly recommend that you utilise an HTML DOM library (such as the well-known HTML Agility Pack) for .NET, as opposed to RegEx to perform the cleaning/whitelisting, since it would be significantly more reliable. (It is quite possible to create some pretty crazy obfuscated HTML that can fool regexes! A proper HTML reader/writer makes the coding of the system much easier, anyway.)

Hopefully that should give you a decent overview of what you need to design in order to fully (or at least maximally) prevent XSS, and how critical it is that HTML sanitisation is performed with the unknown factor in mind.

Noldorin
While writing my answer I saw yours, and it looks good. I actually had to code something in C# to do what you may be trying to do: prevent any XSS attack. I made a config file specifying which HTML tags, with which attributes, are allowed. But you will need a lot of tests for your code (like what Noldorin was saying).
Nordes
Blacklisting can never work, as other browsers might interpret tags you didn't even know about. You need a whitelisting approach.
sleske
For my part, I'm more for whitelisting than blacklisting. For the style attribute you need to remove 'behavior' etc.
Nordes
@sleske: Blacklisting does work in practice, but I agree that it can be risky. Equally, if you whitelist certain tags, then there may be some harmless ones that the user might want to use that aren't allowed. Still, this is admittedly a lesser evil. I'll update the post to mention whitelisting, which is important. Fancy removing the down vote?
Noldorin
@Noldorin: Blacklisting does work in the sense that it makes attacks harder, but it will always leave holes; that's what I meant. Anyway, now I actually like your answer :-). +1
sleske
@sleske: Yeah, exactly. The point is that only one accidentally allowed tag can ruin security. I've put in a lot of clarifications now, which should all be correct. Thanks for pointing this out! (I was aware of it, but it slipped my mind when I first wrote the post.)
Noldorin
Unfortunately I can't give assisted answers on Stack Overflow. This is a really great answer, but AntiSamy is what I was looking for. Oddly enough, it uses the HTML Agility Pack.
Stephen lacy
+3  A: 

As pointed out by Lee Theobald, that's a very dangerous plan. You cannot, by definition, ever produce "safe" HTML by filtering/blacklisting, since the user might put stuff into the HTML that you didn't think about (or that doesn't even exist in your browser version, but does in others).

The only safe way is a whitelisting approach, i.e. strip everything but plain text and certain specific HTML constructs. This incidentally is what stackoverflow.com does :-).

sleske