views:

125

answers:

5

I have text stored in SQL as HTML. I'm not guaranteed that this data is well-formed, as users can copy/paste from anywhere into the editor control I'm using, or manually edit the HTML that's generated.

The question is: what's the best way of going about removing or somehow ignoring <script/> and <form/> tags so that, when the user's text is displayed elsewhere in the Web Application, it doesn't disrupt the normal operation of the containing page.

I've toyed with the idea of simply doing a "Find and Replace" for <script>/<form>with <div> (obviously taking into account whitespace and closing tags, if they exist). I'm also open to any way to somehow "ignore" certain tags. For all I know, there could be some built-in way of saying (in HTML, CSS, or JavaScript) "for all elements in <div id="MyContent">, treat <form> and <script> as <div>.

Any help or advice would be greatly appreciated!

+1  A: 

In terms of sanitising user input, form and script tags are not the only ones that should be cleaned up.

The best way of doing this job depends a little on what tools you are using. Have a look at these questions:

Colin Pickard
A: 

It depends on which language you're using. In general, I'd recommend using an HTML parser, constructing a small DOM from the snippet, then nuking unwanted elements. There are many good HTML parser, especially designed to handle real-world, messy HTML. Examples include BeautifulSoup (Python), HTMLParser (Java)... And, since the answer came in while I was typing, what Colin said!

Jonathan Feinberg
A: 

Don't try and do it yourself - there are far too many tricks for getting bits of script and general nastiness into a page. Use the Microsoft AntiXSS library - version 3.1 has HTML sanitation built in. You probably want the GetSafeHTMLFragment method, which returns a sanitised chunk of HTML. See my previous answer.

PhilPursglove
A: 

Since you're using .Net I would recommend HtmlAgilityPack as it is easy to work with and works well with malformed HTML.

Pat
A: 

Though the answers suggested were acceptable, I ended up using a good old regular expression to replace begin and end <script> and <form> tags with <div>'s.

Pwninstein