Should HTML be encoded before being stored in, say, a database? Or is it normal practice to encode it on its way out to the browser?

Should all my text-based field lengths be quadrupled in the database to allow for the extra storage?

Looking for best practice rather than a solid yes or no :-)

A: 

For security reasons, yes, you should first convert the HTML to its entities and then insert it into the database. Attacks such as XSS are initiated when you allow users (or rather bad guys) to use HTML tags and then you process/insert them into the database. XSS is one of the root causes of most security holes. So you definitely need to encode your HTML before storing it.

Sarfraz
This may not always be good, because I lose the original data here!
Mahesh Velaga
@Mahesh what if the original data is an XSS attack?
mxmissile
Inserting malicious HTML into the database is not a security risk. Only presenting that malicious HTML to the browser is one. So it is *not* necessary to replace the HTML special characters with character references when inserting HTML into the database. Only the contextual special characters of SQL need to be replaced/escaped.
Gumbo
+5  A: 

The practice is to HTML encode before display.

If you are consistent about encoding before displaying, you have done a good bit of XSS prevention.

You should save the original form in your database. This preserves the original, and you may want to do other processing on that and not on the encoded version.
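As a rough illustration of "store the original, encode before display", here is a minimal Python sketch. The `render_comment` helper and the `stored_comment` value are hypothetical stand-ins for a view function and a database row; only `html.escape` comes from the standard library.

```python
import html

# Stand-in for a database row: the value is kept exactly as the user typed it.
stored_comment = '<script>alert("xss")</script> I love M&Ms'

def render_comment(raw: str) -> str:
    """Hypothetical view helper: escape at output time, not at storage time."""
    return "<p>" + html.escape(raw) + "</p>"

page_fragment = render_comment(stored_comment)
print(page_fragment)
```

The stored value stays untouched, so it can still be searched, exported, or rendered in a non-HTML view later.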

Oded
Hmmm, I was trying to avoid encoding for every single request, seeing as it's a repetitive task. Interesting take though :)
Sir Psycho
Some frameworks will do encoding automatically.
Oded
+1 It's good to encode when displaying and not when storing, as you will have the original data with you if you need to process it differently.
Mahesh Velaga
+1  A: 

Is the data in your database really HTML or is it application data like a name or a comment that you just happen to know will end up as part of an HTML page?

If it's real data, I think it's best to:

  • represent it in a form that is native to the environment and
  • make sure it's properly translated as it crosses representational boundaries.

If you're a fan of MVC, this also helps separate the view/controller from the model (and from the persistent storage format).

Representation

For example, assume someone leaves the comment "I love M&Ms". It's probably easiest to represent it in the code as the plain-text String "I love M&Ms", not as the HTML-encoded String "I love M&amp;Ms". Technically, the data as it exists in the code is not HTML yet, and life is easiest if the data is represented as simply and accurately as possible. This data may later be used in a different view, e.g. a desktop app. This data may be stored in a database, a flat file, or an XML file, and perhaps later be shared with another program. It's simplest for the other program to assume the string is in the "native" representation for the format: "I love M&Ms" in a database and flat file and "I love M&amp;Ms" in the XML file. I would cringe to see the HTML-encoded value encoded again in an XML file ("I love M&amp;amp;Ms").
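To make the double-encoding problem concrete, here is a small Python sketch using the standard library's `html.escape` and `xml.sax.saxutils.escape`. The variable names are only illustrative.

```python
import html
from xml.sax.saxutils import escape as xml_escape

comment = "I love M&Ms"  # native, plain-text representation

# Encoding the plain value once for each target format gives the expected result.
html_form = html.escape(comment)
xml_of_plain = xml_escape(comment)

# Encoding the already HTML-encoded value for XML encodes the ampersand twice.
xml_of_html = xml_escape(html_form)

print(html_form)     # I love M&amp;Ms
print(xml_of_plain)  # I love M&amp;Ms
print(xml_of_html)   # I love M&amp;amp;Ms
```

A program reading the last value back out of the XML file would get "I love M&amp;Ms" as its text, not the original comment.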

Translation

Later, when the data is about to cross a representation boundary (e.g. displayed in HTML, stored in a database, plain-text file, or XML file), it's important to make sure it is properly translated so it is represented accurately in a format native to that next environment. In short, when you go to display it on an HTML page, make sure it's translated to properly encoded HTML (manually or through a tool) so the value is accurately displayed on the page. When you go to store it in the database or use it in a query, use escaping and/or prepared statements and bound variables to ensure the same conceptual value is accurately represented to the database. When you go to store it in an XML file, ensure it's XML-encoded.
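Here is one way the three boundaries could look in Python, as a sketch only. It assumes an in-memory SQLite database and a made-up `comments` table; the calls themselves (`sqlite3` bound parameters, `html.escape`, `xml.etree.ElementTree`) are from the standard library.

```python
import html
import sqlite3
import xml.etree.ElementTree as ET

comment = "Robert'); DROP TABLE comments;-- says I love M&Ms"

# Boundary 1: code -> database. A bound parameter lets the driver handle
# SQL quoting, so the value is stored unchanged.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")  # hypothetical table
conn.execute("INSERT INTO comments (body) VALUES (?)", (comment,))
stored = conn.execute("SELECT body FROM comments").fetchone()[0]

# Boundary 2: code -> HTML page. Escape at output time.
html_out = "<li>" + html.escape(stored) + "</li>"

# Boundary 3: code -> XML file. ElementTree escapes text content on serialization.
node = ET.Element("comment")
node.text = stored
xml_out = ET.tostring(node, encoding="unicode")

print(stored == comment)
print(html_out)
print(xml_out)
```

In each case the plain value is handed to a tool that knows the target format, rather than being pre-encoded by hand.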

Failure to translate properly when crossing representation boundaries is the source of injection attacks such as SQL injection. Be conscious of that whenever you are working with multiple representations/languages (e.g. Java, SQL, HTML, JavaScript, XML, etc.).

If you are really trying to save HTML page fragments to the database, then I am unclear about what you mean by "encoded". If it is true and proper HTML, all the necessary values should already be encoded (e.g. &amp;, &lt;, etc.).

Bert F
+1  A: 

Database-vendor-specific escaping on the input, HTML escaping on the output.

WishCow