I am processing user input from the public with a JavaScript WYSIWYG editor, and I'm planning on using htmlpurifier to cleanse the text.

I thought it would be enough to run htmlpurifier on the input, store the cleaned input in the database, and then output it without further escaping or filtering. But I've heard other opinions that you should always escape the output.

Can someone explain why I should need to clean the output if I'm already cleaning the input?

A: 

The mantra "always escape your output" describes a text-to-HTML conversion, and it is a good, reasonable default to fall back on when working in the web space. With HTML Purifier you are deliberately departing from that advice, because you are performing an HTML-to-HTML conversion; treating the resulting HTML as text again doesn't really make sense.

Edward Z. Yang
Thanks for the answer, but I didn't quite follow. Are you saying that once the input has gone through htmlpurifier, it can be treated as safe?
Yehosef
+2  A: 

I assume your WYSIWYG editor generates HTML, which is then validated and put in the database. In that case, the validation already took place, so there is no need to validate twice.

As to "escaping output", that's a different matter. You cannot escape the resulting HTML, otherwise you won't have formatted text, and the tags will be visible. Escaping the output is used when you do not want said output to interfere with the markup of the page.
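
To make the difference concrete, here is a minimal sketch using PHP's built-in `htmlspecialchars`. The stored string is a hypothetical example of already-purified HTML, not output from any real editor.

```php
<?php
// Hypothetical example value: HTML that was already purified and stored.
$storedHtml = '<p>Hello <b>world</b></p>';

// Escaping treats the markup as plain text, so a browser would show the tags literally.
$escaped = htmlspecialchars($storedHtml, ENT_QUOTES, 'UTF-8');

// Emitting the purified HTML unchanged keeps the formatting.
$raw = $storedHtml;

echo $escaped, "\n"; // &lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;
echo $raw, "\n";     // <p>Hello <b>world</b></p>
```

Escaping remains the right choice for values that are meant to be plain text, such as a user name shown next to the post.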

I'd add that you have to be very careful about what you allow in your validation phase. You will probably want to allow only a few HTML tags and attributes.
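
A restrictive whitelist might look roughly like the sketch below. It assumes HTML Purifier was installed with Composer (package `ezyang/htmlpurifier`); the list of allowed tags and the helper name `cleanSubmittedHtml` are only illustrations, so adjust them to what your editor actually produces.

```php
<?php
// Assumption: HTML Purifier is installed via Composer and autoloadable.
require_once __DIR__ . '/vendor/autoload.php';

// Hypothetical whitelist: a few formatting tags plus links with an href.
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,b,i,a[href]');
$purifier = new HTMLPurifier($config);

// Hypothetical helper: run once on submission, then store the result.
function cleanSubmittedHtml(HTMLPurifier $purifier, string $rawHtml): string
{
    return $purifier->purify($rawHtml);
}

// Example submission containing a tag that is not on the whitelist.
$submitted = '<p>Hi <b>there</b><script>alert(1)</script></p>';
$toStore = cleanSubmittedHtml($purifier, $submitted);

// $toStore is what would go into the database.
echo $toStore, "\n";
```

Anything not on the whitelist is dropped, so the stored value contains only the markup you chose to allow.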

Artefacto
The problem with relying on the js editor is that a malicious user could submit a post bypassing whatever checks the js has.
Yehosef
@user of course. But isn't your purifier for that purpose?
Col. Shrapnel
@Col - yes - but Artefacto was saying that the js "validated" the html - so there is no need to validate twice (meaning to use htmlpurifier)
Yehosef
No, he's saying you don't need to validate again after reading the data back *out* of the database.
jvenema
@Yehosef I'm not. You have to validate it after submission (with HTML Purifier, if you wish) and before inserting it into the database, just not every time you fetch it from the database.
Artefacto
@artefacto - thanks for the clarification and answer - I didn't read it properly
Yehosef
A: 

To be 100% safe, use HTMLPurifier twice: before saving the HTML to the DB and before outputting it to the screen.
The huge drawback of such a solution is performance. HTMLPurifier is very slow when filtering HTML, and you might see longer processing times for your pages.

You should be OK if you perform only 1-2 filterings before outputting something to the screen. We were doing 10 filterings per request, so we decided not to use HTMLPurifier when outputting large amounts of text.

HTMLPurifier took 60% of the processing time per request, and we preferred low response times and a better UX.

It depends on your situation. If you can afford to run HTMLPurifier before outputting, go for it. It's better, and you always have control over which tags you allow (for new content and even for old content stored in your DB).

michal kralik
Thanks for your post, but can you explain a case in which I would need to do it twice? E.g. if I do: $id = (int)$_POST['id']; $db->query("select * from users where id = " . intval($id)); have I gained anything in security?
Yehosef
The second filtering (before output) is helpful in cases where someone has hacked into your DB server but did not manage to break into your web server. The attacker can easily change any content in your DB, and if you're not filtering HTML before outputting, you have a pretty serious security problem. However, I believe this is a very rare scenario.
michal kralik
@michal this is a ridiculous scenario, I'd say
Col. Shrapnel
I agree, it's rare, but filtering before output has a useful side effect: if you decide you no longer want to allow a certain tag (e.g. <img>), the change applies to all your content at once. Without filtering before output, you would have to go through each stored entry and remove the tags.
michal kralik