Simple question that keeps bugging me.

Should I HTML encode user input right away and store the encoded contents in the database, or should I store the raw values and HTML encode when displaying?

Storing encoded data greatly reduces the risk of a developer forgetting to encode the data when it's being displayed. However, storing the encoded data will make data mining somewhat more cumbersome, and it will take up a bit more space, even though that's usually a non-issue.

+14  A: 

I'd strongly suggest encoding information on the way out. Storing raw data in the database is useful if you later want to change the way it's displayed. The flow should be something similar to:

sanitize user input -> protect against sql injection -> db -> encode for display

Think about a situation where you might want to display the information as an RSS feed instead. Having to undo any HTML-specific encoding before you re-display seems a bit silly. Any development should always follow the "don't trust input" meme, whether that input is from a user or from the database.
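
A minimal sketch of that flow in Python, only to illustrate the idea: the `comments` table and the `save_comment` / `render_comment_html` helpers are made up for this example, and the standard library's `sqlite3` and `html.escape` stand in for whatever database driver and templating layer you actually use.

```python
import html
import sqlite3

# Hypothetical schema, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")


def save_comment(conn, body):
    # The parameterized query handles SQL injection; the text is stored raw.
    conn.execute("INSERT INTO comments (body) VALUES (?)", (body,))


def render_comment_html(conn, comment_id):
    # HTML encoding happens only here, at display time.
    row = conn.execute(
        "SELECT body FROM comments WHERE id = ?", (comment_id,)
    ).fetchone()
    return "<p>" + html.escape(row[0]) + "</p>"


raw = "<script>alert('x')</script> & more"
save_comment(conn, raw)

stored = conn.execute("SELECT body FROM comments WHERE id = 1").fetchone()[0]
rendered = render_comment_html(conn, 1)
```

Here `stored` is still the exact text the user typed, so an RSS or plain-text view can apply its own encoding, while `rendered` is safe to drop into an HTML page.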

Owen
How do subsequent queries work when you're doing a SELECT..WHERE and some of the values have HTML encoding and others don't?
DOK
Ugh, sounds kinda messy. It really depends on your specifics, but if I inherited a project where I needed to create new views and the info was half encoded, I'd probably re-store the information unencoded to make life easier in the long run.
Owen
To add onto this, if your encoding process for display is expensive (for example, you're allowing HTML and are running HTML Purifier on it), caching the filtered version can be an option. Disk space is cheap.
Edward Z. Yang
@Ambush Commander: if you accept HTML then it's a different problem: sanitization, not escaping. Your input is then already HTML, and you don't have the choice of (losslessly) storing it as plain text or as HTML.
porneL
The distinction is valid. However, I see far too many developers taking the lossy route and storing filtered text in their database.
Edward Z. Yang
+1  A: 

Keep in mind that you may need to access the database with something that doesn't understand HTML encoded text (e.g., a reporting tool). I agree that space is a non-issue, but IMHO, putting HTML encoding in the database moves knowledge of your view/front end into the lowest tier in the application, and that is a design mistake.

Craig Stuntz
+3  A: 

The encoding should only only only be done in the display. Without exception.

Andy Lester
+3  A: 

Output.

With HTML-encoded text you can't simply check the length of a string (&amp; is 1 character, but strlen() will tell you 5), and you can't easily crop it (cropping could break entities).
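
To make the length and cropping problem concrete, here is a small Python sketch; it uses `html.escape` from the standard library in place of PHP's functions, and the sample string is made up.

```python
import html

raw = "Fish & Chips"
encoded = html.escape(raw)  # "Fish &amp; Chips"

raw_length = len(raw)          # counts the characters the user typed
encoded_length = len(encoded)  # also counts the 4 extra characters of "&amp;"

# Cropping the encoded text can cut an entity in half.
cropped_encoded = encoded[:7]  # "Fish &a"

# Cropping the raw text first and encoding afterwards keeps entities whole.
cropped_safely = html.escape(raw[:6])  # "Fish &amp;"
```

If the database holds the raw text, length checks and cropping work on what the user actually wrote, and encoding is applied to the final result.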

You may need to mix strings from database with strings from another source, or read and write them back. Doing this application-wide without missing any escaping and avoiding double escaping is a nightmare.
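
A short Python sketch of the double-escaping trap, again with `html.escape` standing in for your own encoder; the `render` helper and the sample strings are hypothetical.

```python
import html


def render(*parts):
    # Output layer: every part is treated as raw text and encoded exactly once.
    return " ".join(html.escape(part) for part in parts)


from_db_raw = "Tom & Jerry"    # stored unencoded
from_form_raw = "<b>live</b>"  # fresh user input

page = render(from_db_raw, from_form_raw)

# If the database value had been encoded before storage, the same output
# layer would encode it a second time.
pre_encoded = html.escape(from_db_raw)         # "Tom &amp; Jerry"
double_escaped = render(pre_encoded, from_form_raw)
```

With raw storage the output layer has one simple rule: escape everything. With pre-encoded storage it has to know where each string came from, which is exactly the bookkeeping that goes wrong.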

PHP tried to do a similar thing with magic_quotes and it turned out to be a huge failure. Don't take the magic_entities route! :)

porneL
A: 

Doesn't this defeat the purpose of encoding? If a malicious SQL script is entered as input and then passed to the db, it could cause a huge problem.