Hi

I am trying to write a regex that will find and remove script tags (it's the only tag I want removed, since I think it is the only one that can cause damage).

Anyway, I know there are many ways to write a script tag that is still valid. Will this catch them all?

<\s*script\s*>.*?<\s*\/script\s*>

Edit

Or would it be better to change them all to safe tags, i.e. HTML-encode the tags? It could only be applied to script tags, though, since I still want to allow other HTML tags like <b>.

+1  A: 

That regexp will allow something like <script foo=bar><script> to get through (and a myriad of similar things that might cause havoc), but there are also things like this that people often forget about:

 <foo onload="document.write('<scri'+'pt>...<'+'/script>')"></foo>

which also makes life difficult :-(

olliej
+3  A: 

It's not the only tag that can cause damage. Consider the following:

<a href="javascript:window.close()">

Also, no, it won't. Again, consider the following:

<script language="javascript">window.close()</script>

Even if you expand it to handle attributes on the script tag, what about:

<script src="http://somesite.com/malicious.js" />
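A quick way to check this is to run the question's pattern against the examples in this thread. The sketch below uses Python rather than C#, and the `strip_scripts` helper and the sample list are illustrative assumptions, not code from the question:

```python
import re

# The pattern from the question, compiled case-insensitively and with DOTALL
# so that "." also matches newlines inside a script body.
SCRIPT_RE = re.compile(r"<\s*script\s*>.*?<\s*/script\s*>", re.IGNORECASE | re.DOTALL)

# Inputs taken from the answers in this thread, plus one plain script tag.
samples = [
    "<script>alert(1)</script>",
    '<script language="javascript">window.close()</script>',
    '<script src="http://somesite.com/malicious.js" />',
    "<foo onload=\"document.write('<scri'+'pt>...<'+'/script>')\"></foo>",
]

def strip_scripts(html: str) -> str:
    """Hypothetical helper: remove whatever the question's regex matches."""
    return SCRIPT_RE.sub("", html)

results = [strip_scripts(s) for s in samples]
for before, after in zip(samples, results):
    print("removed" if before != after else "MISSED ", before)
```

Only the first sample, a script tag with no attributes, is removed; the other three pass through unchanged.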

To be honest, in my estimation the best way is either to have a very explicit whitelist of allowed tags/attributes, or to introduce your own markup and disallow bare HTML altogether.

EDIT:

Some more information for you:

A whitelist is simply a list of things that are allowed; everything else is disallowed. This is the opposite of your original idea, a blacklist, where the script tag is disallowed but everything else is allowed.
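As a rough illustration of the whitelist idea, here is a minimal sketch in Python (not C#) built on the standard library's `html.parser`. The `ALLOWED` table, `SAFE_SCHEMES`, `WhitelistSanitizer` and `sanitize` are all hypothetical names chosen for this example. It does not balance tags or handle every parser quirk, so treat it as a demonstration of the approach rather than a vetted sanitizer:

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attribute names allowed on that tag.
ALLOWED = {"b": set(), "i": set(), "big": set(), "p": set(), "a": {"href"}}
# Assumed set of URL prefixes accepted in href; everything else is dropped.
SAFE_SCHEMES = ("http://", "https://", "mailto:")

class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # unknown tag: drop the tag itself, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # rejects javascript: and other unlisted schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))  # text is always re-encoded

def sanitize(html: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(sanitize("<b>bold</b><script>alert(1)</script>"))
print(sanitize('<a href="javascript:window.close()">x</a>'))
```

The design choice is that anything not explicitly listed is discarded: unknown tags lose their markup, unlisted attributes such as `onload` or `style` are dropped, and script bodies are skipped entirely.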

Matthew Scharley
Everyone says to use a "whitelist", but no one actually tells me how. They just say don't use regex, but don't show me how to actually build one in C#.
chobo2
+6  A: 

In almost all cases where you want to filter this sort of thing, it is better to check for what you specifically want to allow, rather than what you want to disallow. There are a zillion creative ways of hiding a <script> tag in HTML source, and you don't want to play catch-up with the new ones people might invent. On the other hand, you can quite easily create a list of acceptable tags and let people use those.

Greg Hewgill
How do I make this list of acceptable tags? Most of what my rich HTML editor sends is passed as a style (for font-weight, margin-left), but it also uses tags like <big>.
chobo2
A: 

You can use these samples, which demonstrate how to use MSHTML as a UI-less parser. With it you can remove the script tags, and you can also implement a custom service host that completely disables JavaScript in your application. Here is a discussion which helped me once.

There are two ways: (1) you can turn design mode on, which does not execute JavaScript, or (2) you can disable the URLACTION_SCRIPT_JAVA_USE option.

Akash Kava
A: 
<b style="left:expression(alert('IE just got pwned'));">Oops...</b>

Here's a good discussion of the issues: Sanitising HTML is an extremely hard problem.
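If style attributes have to be allowed at all, the same allow-only-what-you-know approach can be applied to individual declarations. The Python sketch below is one possible way to do it; `ALLOWED_PROPERTIES`, `SAFE_VALUE` and `filter_style` are hypothetical names, and the value pattern is deliberately strict, so it will also reject many harmless values:

```python
import re

# Hypothetical whitelist of CSS properties a rich-text editor might emit.
ALLOWED_PROPERTIES = {"font-weight", "font-style", "margin-left", "text-align"}
# Conservative value pattern: letters, digits, spaces and a few plain symbols.
# Parentheses, quotes, colons and backslashes never match, so values such as
# expression(...) or url(...) are rejected outright.
SAFE_VALUE = re.compile(r"[A-Za-z0-9 .,%#-]+")

def filter_style(style: str) -> str:
    """Keep only whitelisted 'property: value' declarations."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name, value = name.strip().lower(), value.strip()
        if sep and name in ALLOWED_PROPERTIES and SAFE_VALUE.fullmatch(value):
            kept.append(f"{name}: {value}")
    return "; ".join(kept)

print(filter_style("font-weight: bold; margin-left: 40px"))
print(filter_style("left:expression(alert('IE just got pwned'));"))
```

The first call keeps both declarations; the second returns an empty string, because `left` is not on the list and the value contains characters outside the safe pattern.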

NickFitz