views:

196

answers:

8

I can never decide if it's better to format data before inserting it into the DB, or when pulling it out.

I'm not talking about data sanitization; we all know to protect against SQL injection. I'm talking about if the user gives you a URL, and it doesn't have http:// in front of it, should you add that before inserting it into the DB or when pulling it out? What about more complex things, like formatting a big wad of text. Do I want to mark it up with HTML (or strip it down) before or after? What if I change my mind later and want to format it differently? I can't do this if I've already formatted it, but I can if I store it unformatted... but then I'm doing extra work every time I pull a piece of data out of the DB, which I could have done once and been done with it.

What are your thoughts?


From the answers, there seems to be a general consensus that things like URLs, phone numbers, and emails (anything with a well-defined format) should be normalized first to a consistent format. Things like text should generally be left raw or in a manipulable format for maximum flexibility. If speed is an issue, both formats may be stored.

+11  A: 

I think it's best to make sure data in the database is in the most consistent format possible. You might have multiple apps using this data, so if you can make sure it's all the same format, you won't have to worry about reformatting different formats in every application.

Andy White
+1: Database must be absolutely consistent.
S.Lott
+1: It's important for data you just plan to retrieve and reuse elsewhere, and CRITICAL for data that might make it into a WHERE clause.
ojrac
+5  A: 

Normalising URLs to a canonical form prior to insertion is probably okay; performing any kind of extensive formatting, e.g. HTML conversion/parsing etc. strikes me as a bad idea - always have the "rawest" data possible in your database, especially if you want to change the presentation format later.

In terms of avoiding unnecessary post-processing on every query, you might look into adopting object caching or similar techniques for the more expensive operations.

Rob
+1  A: 

In my opinion, it should be formatted first. If you choose to do it at the time of retrieval instead of insertion, this can cause problems down the road when other applications/scripts want to use data out of the same database. They will all need to know how to clean up the data when they pull it out.

Chad Birch
+1  A: 

depends

if you are doing well defined items, SSN, zip code, phone number, store it formatted (this does not necessarily mean to include dashes or dots, etc. it may mean removing them so everyhting is consistent.

KM
+1  A: 

You have to be very careful if you change data before you store it. You could always run into a situation where you need to echo back to the original user the exact text that they gave you.

JP Alioto
What situations would that be? And how often do you run into them? I can't think of any ATM, so I'm not really inclined to factor that into my designs....
Mark Brackett
The case I ran into it was the case of a product SKU. We have a EDI process that needed to feed back the SKU to the original requester and they had a system that was case sensitive.
JP Alioto
+3  A: 

You're asking two questions here.

Normalization should always be performed prior to the database insertion, e.g. if a column only has URLs then they should always be normalized first.

Regarding formating, that's a view problem and not a model (in this case DB) problem.

+1  A: 

My inclination is usually to store data in the most flexible form possible. For instance, numbers should be stored using integer or floating-point types, not strings, because you can do math with numeric types but not with strings (although it's easy enough to parse a number into a string that this is not a big deal). Perhaps a more practical example: dates/times should be stored using the database's actual date/time data type instead of strings. Also, maybe it's easier to convert HTML into plain text than vice versa, in which case you'd want to store your text as HTML. Or maybe even using a format like Markdown which can be easily converted into either HTML or plain text.

It's the same reason vector graphics formats (SVG, EPS, etc.) exist: an SVG file is essentially a sequence of instructions specifying how to draw the image. It's easy to convert that into a bitmap image of any size, whereas if you only had a bitmap image to start with, you'd have a hard time changing its size (e.g. to create a thumbnail) without losing quality.

David Zaslavsky
+1  A: 

It is possible you might want to store both the formatted and unformatted versions of the data. For instance, let's use American phone numbers as an example. If you store one column with just the numbers and one column with the most frequently needed format, such as (111) 111-1111, then you can easily format to client specifications for the special cases or pull the most common one out quickly without lots of casting. This takes very little extra time at the time of insert (and can be accomplished with a calculated column so it always happens no matter where the data came from).

Data should be scrubbed before being put in the database so that invalid dates or nonnumeric data etc aren't ever placed in the field. Email is one field that people often put junk into for some reason. If it doesn't have an @ sign, it shouldn't be stored. This is especially true if you actually send emails thorugh your application(s) using that field. It is a waste of time to try to send an email to 'contact his secretary' or 'aol.com' if you see what I mean.

If the format will be consistently needed, it is better to convert the data to that format once on insert or update and not have to convert it ever again. If the standard format changes, you will need to update the column for all existing records at that time, then use the new format going forth. If you have frequent changes of format and large tables or if differnt applications use different formats, it might be best to store unformatted.

HLGEM