I'll be inserting content from remote sources into a web app. The sources should be limited/trusted, but there are still a couple of problems:
The remote sources could
1) be hacked and inject bad things
2) overwrite objects in my global names space
3) I might eventually open it up for users to enter their own remote source. (It would be up to the user to not get in trouble, but I could still reduce the risk.)
So I want to neutralize any/all injected content just to be safe.
Here's my plan so far:
1) find and remove all inline event handlers
str.replace(/(<[^>]+\bon\w+\s*=\s*["']?)/gi,"$1return;"); // untested
Ex.
<a onclick="doSomethingBad()" ...
would become
<a onclick="return;doSomethingBad()" ...
2) remove all occurences of these tags: script, embed, object, form, iframe, or applet
3) find all occurences of the word script within a tag and replace the word script with html entities for it
str.replace(/(<[>+])(script)/gi,toHTMLEntitiesFunc);
would take care
<a href="javascript: ..."
4) lastly any src or href attribute that doesn't start with http, should have the domain name of the remote source prepended to it
My question: Am I missing anything else? Other things that I should definitely do or not do?
Edit: I have a feeling that responses are going to fall into a couple camps.
1) The "Don't do it!" response
Okay, if someone wants to be 100% safe, they need to disconnect the computer.
It's a balance between usability and safety.
There's nothing to stop a user from just going to a site directly and being exposed. If I open it up, it will be a user entering content at their own risk. They could just as easily enter a given URL into their address bar as in my form. So unless there's a particular risk to my server, I'm okay with those risks.
2) The "I'm aware of common exploits and you need to account for this ..." response ... or You can prevent another kind of attack by doing this ... or What about this attack ...?
I'm looking for the second type unless someone can provide specific reasons why my would be more dangerous than what the user can do on their own.