views: 49

answers: 2

Hi Everyone,

I have a serious concern about deploying the TinyMCE editor on a website. Looking at the markup the editor produces, it does a great job, and I leave the HTML button off the toolbar configuration so users cannot inject their own source.

However, from what I read in the TinyMCE docs, it claims to degrade nicely to a regular textarea should JavaScript be disabled in a user's browser... and therein lies my concern. If it does revert to a normal textarea, the user can then easily inject their own HTML, and that leaves me with a security concern.

I just pass the data created with TinyMCE through, and it is used within another page generated by my script, so it poses no security risk to my server. The concern is over what malicious data may be passed to another user viewing the generated page.

I know many of you will tell me to just use regexes, or parse this data, but that itself could be a nightmare, as I would be trying to either...

a.) use regexes to try to clean up the HTML without breaking the generated page (and it would be better to parse the data for that anyway), or

b.) re-parse data that has already been parsed by the RTF editor, which would also probably end up breaking the generated page.

If anyone has previous experience with this type of scenario, I would really appreciate a heads-up about any other risks that using an RTF editor for user data could entail. I would really like to offer this as a user option, but not if the risk of giving an RTF user a chance to take a whack at another user viewing the page generated by the script outweighs the benefit.

My gut feeling is to give the RTF a wide berth at this point.

Thanks for any direction you can give me with your own experiences.

+2  A: 

Regex is generally not considered good for parsing HTML (see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), but I have noted the "perl" tag :)

My advice when taking markup from users is to always run it through something that can accept malformed HTML and return well-formed HTML. These parsers generally produce something that can be queried and updated with some form of XPath.

In Python there is a module called BeautifulSoup, Ruby has Nokogiri, and in .NET there is a project called HtmlAgilityPack; they all do this sort of thing. I'm not sure what library Perl has, but I'm sure there would be something.
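
For illustration, here is a minimal Perl sketch of that idea, assuming HTML::TreeBuilder (one CPAN module that tolerates tag soup); the input string and the choice to strip script elements are only examples, not a complete policy:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $dirty = q{<p>Hello <b>world<script >alert(1)</script>};   # deliberately mal-formed

my $tree = HTML::TreeBuilder->new;
$tree->parse_content($dirty);                         # accepts broken markup
$_->delete for $tree->look_down( _tag => 'script' );  # drop script elements, however they are spelled
my $clean = $tree->as_HTML;                           # re-serialise as well-formed HTML (wrapped in html/body)
$tree->delete;                                        # HTML::Element trees must be freed explicitly
print $clean, "\n";
```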

tarn
Theoretically, you could simply blacklist certain strings, e.g. `eval`, `<script>`, `<link>`, `<iframe>`, `javascript:`, `onclick`, etc. Then you wouldn't have to parse the HTML. So I'm not sure what advantages there are to parsing the HTML aside from checking to see if it's valid or not.
Lèse majesté
@Lèse majesté, one problem with a blacklist like that is that it would have prohibited your comment. :-) If it doesn't parse the HTML, then it doesn't know whether `onclick` is just plain text or an attribute name.
cjm
@Lese Firstly, it is the malformed HTML that will probably break your page, so you probably want to get on top of that. Secondly, simply removing the `<script>` string is very naive and dangerous; what about `<script >`? It's not `<script>`, but most browsers will still execute the script inside.
tarn
@cjm: good point; I hadn't considered such applications. @tarn: that's why you need to be very careful about blacklisting all possible dangerous strings; and regexp is fully capable of capturing all variations of valid script tags. So aside from validation, you don't need to parse the HTML.
Lèse majesté
@Lese Did you not read that link in my answer? There are 3865 votes for the answer that says you cannot correctly parse HTML with regex! Why would you even try when there are better ways?
tarn
When did I ever suggest you parse HTML using regex?
Lèse majesté
@Lese Regex is more sophisticated than a blacklist. One regex can find `<script [one space] >`, `<script [two spaces] >`, and `<script [n spaces] >`. A blacklist for the same thing would need infinitely many entries :) If you can't do it with regex, you certainly can't do it with a blacklist.
tarn
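
As a concrete illustration of tarn's regex-versus-blacklist point above (a sketch only, not a recommended defense on its own), one case-insensitive Perl pattern covers every spacing and attribute variation of an opening script tag, where a literal-string blacklist would need a separate entry for each:

```perl
# matches <script>, <script >, <script   >, <SCRIPT src="evil.js">, and so on
my $script_open = qr{<\s*script\b[^>]*>}i;

for my $input ( '<script>', '<script   >', '<SCRIPT src="evil.js">' ) {
    print "blocked: $input\n" if $input =~ $script_open;
}
```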
@Lese: I have tried a little of this, but I always lean toward the whitelist approach, as it remains safe into the future regardless of changes to browsers. Whitelists are considered the more secure approach because they work by inclusion, whereas blacklists rely on exclusion. I do the same with my regexes... always inclusion for pattern matches.
Epiphany
@tarn: I was referring to implementing a blacklist with regexps. You would basically write out all potentially malicious strings and create a single regex for all possible variations of that string. Of course, using a parser is less error-prone and easier, but I was merely playing devil's advocate (I voted your answer up btw).
Lèse majesté
@tarn: Even the regex could end up being the thing that breaks the page. For example, if you were trying to just match tags, then this would break it: `<img src="mypic.jpg" alt="<<wow" />`
Epiphany
@Epiphany: you can whitelist the tags that you want to allow the user access to, but you can't possibly whitelist all possible permutations of text. But then you'd still have to wrestle with the problem of invalid HTML. You can actually sandbox the damage from broken HTML by not allowing certain tags, like `<div>` and `<textarea>`, and then automatically appending any missing closing tags or right-angle-brackets using a simple count. This will tend to give you more closing brackets/tags than you need, but it will prevent the HTML from screwing up the rest of your layout.
Lèse majesté
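
A rough sketch of that count-and-append idea, assuming a small illustrative whitelist and a hypothetical helper name; as the comment says, it errs on the side of emitting more closing tags than strictly needed:

```perl
use strict;
use warnings;

# Naive balancer: scan for open/close tags from a small whitelist,
# track what is still open, and append any missing closing tags.
sub append_missing_closers {
    my ($fragment) = @_;
    my $allowed = qr/(?:p|b|i|em|strong|ul|ol|li|blockquote)/i;
    my @open;
    while ( $fragment =~ m{<(/?)($allowed)\b[^>]*>}g ) {
        my ( $slash, $tag ) = ( $1, lc $2 );
        if ($slash) {
            # drop the most recent matching opener, if there is one
            for my $i ( reverse 0 .. $#open ) {
                if ( $open[$i] eq $tag ) { splice @open, $i, 1; last }
            }
        }
        else {
            push @open, $tag;
        }
    }
    $fragment .= "</$_>" for reverse @open;   # close whatever is still dangling
    return $fragment;
}

print append_missing_closers('<ul><li><b>unclosed'), "\n";
# prints: <ul><li><b>unclosed</b></li></ul>
```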
Like I said, I was playing devil's advocate. You can certainly secure a page using regexp without parsing the HTML, but when it's all said and done, it's easier to just parse it using a DOM library.
Lèse majesté
@Lese: I'm afraid that won't work for me, as I am very fussy (or anal, as my wife puts it) about my end product. It has to validate as both XHTML and CSS, as well as be cross-browser without any IE hacks or multiple pages being deployed using sniffer redirects. I pride myself on being able to show exactly the same page on all browsers with just one document. Yeah... LOL, I'm that hard on myself!
Epiphany
@Epiphany: I am not advocating using regex to scrub HTML. I am advocating parsing it into a valid document tree, and I recommend using an established library for this. See my answer that we are commenting on :)
tarn
@Epiphany: That's a good attitude to have. The web would be a better place if everyone were that anal. =] I hope you also put the same zeal into making sure that your sites are accessible to the visually/hearing impaired or physically handicapped.
Lèse majesté
LOL... Sure do, Lese... I'm 52 years old, I can't see sh*t in front of me without my spectacles, and I tend to say either "What?" or "Sure, Honey" to my wife a lot these days. I will admit... I don't have a braille keyboard yet, but the way things are going, who knows!
Epiphany
+6  A: 

You cannot have client-side security on the web. You simply can't trust the browser, because it's easy for a malicious user to substitute a replacement browser that does whatever he wants.

If you accept HTML from users (using TinyMCE or through any other method) and display it to other users, you must sanitize or validate the HTML in some way on the server. If you're using Perl, the leading package seems to be HTML::Scrubber (along with various other modules that help you plug it in to various frameworks). I haven't had occasion to try it myself.
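
As a rough sketch of what a whitelist-based scrub with HTML::Scrubber can look like (the allowed tag list and the `href` rule here are illustrative and would need to match what your TinyMCE toolbar actually emits):

```perl
use strict;
use warnings;
use HTML::Scrubber;

# Whitelist only the formatting tags the editor is expected to produce.
my $scrubber = HTML::Scrubber->new(
    allow => [qw( p br b i u em strong ul ol li blockquote )],
);

# Allow <a>, but only with an http(s)/mailto href; strip every other attribute.
$scrubber->rules(
    a => {
        href => qr{^(?:https?:|mailto:)}i,
        '*'  => 0,
    },
);

my $untrusted = '<p onclick="evil()">hi</p><script>alert(1)</script>';
print $scrubber->scrub($untrusted), "\n";   # the onclick attribute and the disallowed script do not survive
```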

The TinyMCE Security page mentions some ways to make it harder for people to submit arbitrary HTML, but you still need server-side checks.

cjm
@cjm: I know that, and I always validate both client-side and server-side... and always very carefully. The problem with most 'scrubber' modules I've seen is that they rely on the HTML being well-formed in the first place. It is almost impossible to catch every instance of malformed HTML with either regexes or parsers, and therein lies the problem. Using malformed HTML to get around both of these is a standard tool in the hacker's toolbox.
Epiphany
I wanna +5 but I can only +1
xenoterracide
@Epiphany not knowing much about TinyMCE... can it submit anything other than HTML? If so, how about not allowing people to submit HTML at all? Let them use BBCode (or whatever else TinyMCE can handle).
xenoterracide
@Everyone: LOL... It appears my gut feeling is right, and I should just do away with the thought of using an RTF on a public site. After all, my script will still make a wonderful-looking page that validates with both its XHTML and CSS. Taking control of the formatting of the user input can be done, I guess, with a little more elbow grease, while allowing them only alphanumeric input.
Epiphany
@xenoterracide: In this case that wouldn't work, as the user input is being placed within another HTML document that is being generated by the script. It has to be HTML, or there is no point in using the RTF at all in this particular application. The reason I chose TinyMCE was that they worked so hard to make its parsed output XHTML-compliant.
Epiphany
@Everyone: Hey!! I've got an idea!! I'll write a regex that just gets rid of the user! LOL. The 'Teregexator'.
Epiphany
@cjm: Well, you get my vote on this one. First, because you gave me the technically correct answer in the very first sentence of your post (and I already knew that answer... but was hoping someone might have a brilliant workaround that would let me use the RTF securely :{ ). And second, because you backed the answer up with the best alternatives available in a Perl context, which means you bothered to look at how the post was tagged.
Epiphany