A: 

The purpose of escaping is to transmit data over a channel that does not allow certain characters. Since a UTF-8 database can handle UTF-8 characters just fine, you have no reason to escape anything for storage. In fact, since escaped text is harder to manipulate (string functions will not work properly, for instance), it is usually advised not to perform unnecessary escaping.
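To make the "string functions will not work properly" point concrete, here is a minimal Python sketch. It assumes numeric character references as the escaping scheme, which is only one of several ways text might be escaped; the variable names are made up for illustration.

```python
raw = "café"

# Assumption: "escaping" here means replacing every non-ASCII character
# with a numeric character reference, as an HTML-entity scheme would.
escaped = raw.encode("ascii", "xmlcharrefreplace").decode("ascii")

raw_length = len(raw)              # 4 characters, as a human would count them
escaped_length = len(escaped)      # 9, because "é" became the six characters "&#233;"

truncated = escaped[:4]            # "caf&": the cut lands inside the reference
reversed_escaped = escaped[::-1]   # ";332#&fac": no longer a valid reference

print(escaped)                     # caf&#233;
print(raw_length, escaped_length)  # 4 9
print(truncated)                   # caf&
print(reversed_escaped)            # ;332#&fac
```

The same operations on `raw` behave the way you would expect, which is why storing the unescaped text is less error-prone.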

Victor Nicollet
+2  A: 

Store the data as-is. Perform any conversions necessary for display at run-time.

If you store it as HTML (with entities), you create several issues:

  • You lock your data to the HTML format, not just "text content"
  • You distort data widths (e.g., what fits in a varchar(255)) and break SQL string functions like substring() or reverse()
  • Searching against those characters becomes impossible without also converting the search input
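The width and search issues in the list above can be sketched with Python's built-in `sqlite3` module. SQLite is only a stand-in for whatever database you actually use, and the `notes` table and its columns are hypothetical; the "escape before storing" step is an assumed scheme, not something prescribed anywhere.

```python
import html
import sqlite3

raw = "Crème brûlée <3"

# Assumed "escape before storing" scheme: HTML-escape the markup characters,
# then replace non-ASCII characters with numeric character references.
escaped = html.escape(raw).encode("ascii", "xmlcharrefreplace").decode("ascii")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (raw TEXT, escaped TEXT)")
conn.execute("INSERT INTO notes VALUES (?, ?)", (raw, escaped))

# Width: the escaped copy needs more than twice as many characters.
raw_len, escaped_len = conn.execute(
    "SELECT length(raw), length(escaped) FROM notes"
).fetchone()

# Search: the user's input only matches the unescaped column.
term = "%brûlée%"
raw_hits = conn.execute(
    "SELECT count(*) FROM notes WHERE raw LIKE ?", (term,)
).fetchone()[0]
escaped_hits = conn.execute(
    "SELECT count(*) FROM notes WHERE escaped LIKE ?", (term,)
).fetchone()[0]

# Display: convert at run-time, when the output format is actually known.
stored = conn.execute("SELECT raw FROM notes").fetchone()[0]
rendered = "<p>" + html.escape(stored) + "</p>"

print(raw_len, escaped_len)    # 15 33
print(raw_hits, escaped_hits)  # 1 0
```

Because the stored value stays plain text, the same row could just as easily be rendered as JSON or plain-text email without first undoing an HTML-specific encoding.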
Peter Bailey
All very good reasons. Locking data into HTML is a good reason not to do this...I hadn't taken this into account.
andrew
+4  A: 

If you are using the UTF-8 charset for your whole application (i.e. MySQL, but also the encoding of your HTML pages, your scripts, code, and all that), there is no need to transform "special characters" into entities: just send your text data as UTF-8 too ;-)
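As a rough illustration of why "UTF-8 everywhere" matters, this Python sketch encodes some text as UTF-8 and then decodes it two ways. The Latin-1 decode is an assumed misconfiguration, standing in for any layer (connection, page, script) that was left on a different charset.

```python
text = "naïve café"

# What travels between the application, the database and the browser.
wire = text.encode("utf-8")

# Every layer agrees on UTF-8: the text survives the round trip unchanged.
round_trip = wire.decode("utf-8")

# One layer is misconfigured (assumed here to be Latin-1): each accented
# character turns into two garbage characters.
mojibake = wire.decode("latin-1")

print(len(text), len(wire))  # 10 12
print(round_trip == text)    # True
print(mojibake == text)      # False
```

No entity conversion is needed in the first case; the garbling in the second case comes from the charset mismatch, not from the accented characters themselves.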

Pascal MARTIN
This definitely makes sense, thinking about it this way.
andrew
That's one of the great things about UTF-8: less trouble (well... at least once you've finished setting up all your applications and servers ^^)
Pascal MARTIN
A: 

Consider that the database can host data for multiple applications.

In that environment, the format of a string is defined by the database, not the application. Make your application conform to the data standards and make the conversions explicit in your data layer.

For example, if the database is a newer schema and the DBA has defined that strings will be stored in UTF-8, then all strings passed from your application should be UTF-8.

If, however, the database is a legacy system and the target for your data is an 8-bit character set, then do the conversion in your application to the appropriate code page and/or fail when you encounter a non-conforming value.
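A minimal Python sketch of that "convert or fail" step in a data layer might look like the following. The function name `to_legacy_bytes` and the choice of `cp1252` as the legacy code page are assumptions for illustration only; substitute whatever code page the legacy database actually uses.

```python
def to_legacy_bytes(text: str, codec: str = "cp1252") -> bytes:
    """Convert text to the legacy code page, failing loudly on bad input.

    Hypothetical helper: the default codec is an assumed legacy code page.
    """
    try:
        # str.encode is strict by default, so nothing is silently replaced.
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        bad = text[exc.start:exc.end]
        raise ValueError(f"cannot store {bad!r} in {codec}") from exc


ok = to_legacy_bytes("café")  # fits in cp1252: b'caf\xe9'

try:
    to_legacy_bytes("東京")    # no cp1252 representation
    rejected = False
except ValueError:
    rejected = True

print(ok)        # b'caf\xe9'
print(rejected)  # True
```

Failing explicitly keeps the conversion visible in the data layer, instead of letting question marks or entities creep into the stored data unnoticed.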

Most newer database schemas that interact with the web should standardise on UTF-8 or UTF-16. If you are building the database, settle its localisation requirements and internal string representation first, then force all the applications that write to it to conform to your standards.

James