I'm planning to build a web app that will let users post entire web pages on my website. I'm thinking of using HTML Purifier, but I'm not sure, because HTML Purifier edits the HTML and it's important that the HTML is preserved exactly as it was posted. So I was thinking of writing some regexes to strip all script tags and all the JavaScript attributes like onload, onclick, etc.

I saw a Google video a while ago that addressed this. Their solution was to host user-posted JavaScript on a separate website so that it cannot access the original site. But I don't want to purchase a new domain just for this.

+2  A: 

Make sure that user content doesn't contain anything that could cause JavaScript to be run on your page.

You can do this with an HTML-stripping function that removes all HTML tags (like PHP's strip_tags), or with another similar tool. There are actually many reasons besides XSS to do this: if you have user-submitted content, you want to make sure it doesn't break the site layout.
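
For instance, with PHP's strip_tags (a minimal sketch; $_POST['content'] is a placeholder for wherever the submission arrives):

    <?php
    // Remove every tag from the submitted content.
    $clean = strip_tags($_POST['content']);

    // Or keep a few harmless tags -- but note that strip_tags does NOT touch
    // attributes, so onclick="..." would survive on any allowed tag.
    $clean = strip_tags($_POST['content'], '<p><em><strong>');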

I believe you can simply use a subdomain of your current domain to host the JavaScript, and you will get the same-origin security benefits for AJAX. Cookies, however, are not automatically isolated.


In your specific case, filtering out the <script> tag and JavaScript event attributes is probably going to be your best bet.

Chacha102
Be sure to strip out <style> tags and style attributes too, because IE will execute CSS expressions in them.
scunliffe
Also strip out any inline event handlers (onclick, onmouseover, ondblclick, onmouseenter, ...): all the standard events, plus any proprietary IE ones.
scunliffe
Can you confirm that a subdomain would work for this? If it does, I would rather use that and allow JavaScript without access to cookies. Also, the style tag and attribute are necessary for what I'm doing.
this is a dead end
Trying to sanitize HTML is actually very hard to do -- there are plenty of places JS can slip in, and plenty of ways for an attacker to mangle their HTML so that your sanitizer passes it but the browser executes it. LiveJournal, for example, put a lot of effort into building an HTML/CSS sanitizer and still ended up with XSS bugs.
Stephen Veiss
+5  A: 

Be careful with homebrew regexes for this kind of thing.

A regex like

    s/(<.*?)onClick=['"].*?['"](.*?>)/$1 $2/

looks like it might get rid of onclick events, but you can circumvent it with

    <a onClick<a onClick="malicious()">="malicious()">

running the regex on that will get you something like

    <a onClick ="malicious()">

You can fix that by repeatedly running the regex on the string until it no longer matches, but that's just one example of how easy it is to get around simple regex sanitizers.
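
A rough PHP version of that repeated-application fix, for illustration only; as noted above, this still isn't a safe sanitizer ($html is a placeholder for the submitted markup):

    <?php
    // Re-run the replacement until nothing matches, so a handler revealed
    // by a previous pass gets stripped on the next one.
    do {
        $html = preg_replace('/(<[^>]*?)onclick\s*=\s*[\'"][^\'"]*[\'"]([^>]*?>)/i',
                             '$1 $2', $html, -1, $count);
    } while ($count > 0);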

Charles Ma
A: 

1) Use clean, simple, directory-based URIs to serve user-supplied data. When you dynamically create URIs that address a user's uploaded data, service account, or anything else on your domain, don't expose that information as query parameters in the URI. Parameters are an extremely easy point of manipulation that could be used to probe flaws in your server security and possibly even inject code onto your server.

2) Patch your server. Keep it up to date on the latest security patches for all the services running on it.

3) Take all possible server-side protections against SQL injection. If somebody can inject code into your SQL database that executes from services on your box, that person owns your box. At that point they can install malware onto your web server to feed back to your users, or simply record data from the server and send it out to a malicious party. (See the prepared-statement sketch after this list.)

4) Force all new uploads into a protected, sandboxed area and test them for script execution. No matter how you try to remove script tags from submitted code, there will be a way to circumvent your safeguards and execute script. Browsers are sloppy and do all kinds of stupid things they are not supposed to do. Test your submissions in a safe area before you publish them for public consumption.

5) Check for beacons in submitted code. This step requires the previous one and can be very complicated, because a beacon can live in script code that requires a browser plugin to execute, such as ActionScript, yet it is just as much a vulnerability as allowing JavaScript to execute from user-submitted code. If a user can submit code that can beacon out to a third party, then your users, and possibly your server, are completely exposed to data loss to a malicious third party.
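
For point 3, parameterised queries are the standard defence. A minimal PDO sketch (the connection details, table, and columns are hypothetical):

    <?php
    // Bind user input as parameters; never interpolate it into the SQL string.
    $pdo = new PDO('mysql:host=localhost;dbname=app', $dbUser, $dbPass);
    $stmt = $pdo->prepare('INSERT INTO pages (author_id, body) VALUES (?, ?)');
    $stmt->execute(array($authorId, $submittedHtml));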

I'm not quite following #1... Can you provide a (hypothetical) example of an attack which could be made against URI parameters which would be prevented by using pathinfo-style ("directory-based") URIs instead?
Dave Sherohman
Yes, but I'm still not seeing how that's any more vulnerable than domain.com/page/login/name/query/term/ordernum/1234/account/5678/dest/cart/status/vip would be. The issue is with the amount of information contained in the URL, not with whether that information is passed as parameters vs. being passed as pathinfo. SEO and ugliness aside, I'm not aware of any way that domain.com/?action=login is any worse than domain.com/login/ - unless I'm missing something, it's not the "?" that's the problem.
Dave Sherohman
+3  A: 

The most critical error people make when doing this is validating things on input.

Instead, you should validate on display.

The context matters when determining what is XSS and what isn't. Therefore, you can happily accept any input, as long as you pass it through appropriate cleaning functions when displaying it.

Consider that what constitutes 'XSS' will be different when the input is placed in <a href="HERE"> as opposed to <a>HERE</a>.

Thus, all you need to do is make sure that any time you write user data, you consider, very carefully, where you are displaying it, and make sure it can't escape the context you are writing it into.
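
A minimal PHP illustration of escaping per context ($comment and $userUrl are placeholder variables):

    <?php
    // Element content: escape HTML metacharacters.
    echo '<p>' . htmlspecialchars($comment, ENT_QUOTES, 'UTF-8') . '</p>';

    // Attribute value: the same escaping is not enough by itself here;
    // an href also needs its scheme checked, or javascript: URLs slip through.
    $scheme  = strtolower((string) parse_url($userUrl, PHP_URL_SCHEME));
    $safeUrl = in_array($scheme, array('http', 'https'), true) ? $userUrl : '#';
    echo '<a href="' . htmlspecialchars($safeUrl, ENT_QUOTES, 'UTF-8') . '">link</a>';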

Noon Silk
+2  A: 

If you can find any other way of letting users post content that does not involve HTML, do that. There are plenty of user-side lightweight markup systems you can use to generate HTML.

So I was thinking of writing some regexes to strip all script tags and all the JavaScript attributes like onload, onclick, etc.

Forget it. You cannot process HTML with regex in any useful way, let alone when security is involved and attackers may be deliberately throwing malformed markup at you.

If you can convince your users to input XHTML, that's much easier to parse. You still can't do it with regex, but you can feed it to a simple XML parser, walk the resulting node tree to check that every element and attribute is known-safe, delete any that aren't, and then re-serialise.
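
A sketch of that approach with PHP's DOMDocument. The whitelist here is deliberately tiny and purely illustrative; note that attribute VALUES (e.g. href="javascript:...") need checking too, which this sketch omits:

    <?php
    $allowedTags  = array('p', 'a', 'em', 'strong', 'ul', 'li');
    $allowedAttrs = array('href', 'title');

    libxml_use_internal_errors(true); // don't spew warnings on bad input
    $doc = new DOMDocument();
    if (!$doc->loadXML('<root>' . $xhtml . '</root>')) {
        exit('Not well-formed XHTML; reject the submission.');
    }

    $xpath = new DOMXPath($doc);
    // Copy the node list first, since removing nodes mutates the tree.
    foreach (iterator_to_array($xpath->query('/root//*')) as $node) {
        if (!in_array($node->nodeName, $allowedTags, true)) {
            $node->parentNode->removeChild($node);
            continue;
        }
        $badAttrs = array();
        foreach ($node->attributes as $attr) {
            if (!in_array($attr->name, $allowedAttrs, true)) {
                $badAttrs[] = $attr->name;
            }
        }
        foreach ($badAttrs as $name) {
            $node->removeAttribute($name);
        }
    }

    // Re-serialise the children of the wrapper element.
    $clean = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $clean .= $doc->saveXML($child);
    }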

HTML Purifier edits the HTML and it's important that the HTML is preserved exactly as it was posted.

Why?

If it's so they can edit it in its original form later, then the answer is simply to purify it on the way out, when it's displayed in the browser, not on the way in at submit time.

If you must let users input their own free-form HTML — and in general I'd advise against it — then HTML Purifier, with a whitelist approach (ban all elements/attributes that aren't known-safe) is about as good as it gets. It's very very complicated and you may have to keep it up to date when hacks are found, but it's streets ahead of anything you're going to hack up yourself with regexes.
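
For reference, basic HTML Purifier usage looks roughly like this; the allow-list shown is only an illustration:

    <?php
    require_once 'HTMLPurifier.auto.php';

    $config = HTMLPurifier_Config::createDefault();
    // Whitelist: anything not listed here gets removed.
    $config->set('HTML.Allowed', 'p,em,strong,ul,ol,li,a[href|title]');

    $purifier = new HTMLPurifier($config);
    $clean = $purifier->purify($dirtyHtml);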

But I don't want to purchase a new domain just for this.

You can use a subdomain, as long as any authentication tokens (in particular, cookies) can't cross between subdomains. (For cookies that's the default: the domain parameter is scoped to only the current hostname unless you say otherwise.)
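
In PHP terms, the difference is whether you pass a domain when setting the cookie (the hostnames are placeholders):

    <?php
    // Default: scoped to the exact current hostname only.
    setcookie('session', $token, 0, '/');

    // Explicit domain: readable from every subdomain, including the one
    // serving user content -- this is the case to avoid.
    setcookie('session', $token, 0, '/', '.example.com');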

Do you trust your users with scripting capability? If not, don't let them have it, or you'll get attack scripts and iframes to Russian exploit/malware sites all over the place...

bobince
"you may have to keep it up to date when hacks are found" This is another reason to purify it on the way out to the browser. If you do it on the way in, updating your filters to prevent new attacks won't automatically protect you from attacks of that type made before the new filter went into place.
Dave Sherohman
Agreed. In general it's usually a good idea to keep hold of the original input rather than a processed version.
bobince
I only care about cookies really. This posted content won't be stored on the site. It will work like a proxy, but it's not a proxy. If you guys say cookies can't be accessed from a subdomain then I'm going with a subdomain.
this is a dead end
They can't by default. They can if the cookie is set with a ‘domain’ parameter that allows cross-subdomain access, which some software might do, so check any cookie-setting stuff.
bobince
A: 

You should filter ALL HTML and whitelist only the tags and attributes that are safe and semantically useful. WordPress is good at this; you can find the regular expressions it uses by searching its source code.
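
WordPress exposes this filtering as the wp_kses function; a sketch of how it's used (the whitelist here is just an example):

    <?php
    // Map each allowed tag to its permitted attributes; everything else goes.
    $allowed = array(
        'a'      => array('href' => array(), 'title' => array()),
        'em'     => array(),
        'strong' => array(),
        'p'      => array(),
    );
    $clean = wp_kses($userHtml, $allowed);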

Eli Grey