views:

164

answers:

5

Hey,

I am currently debating an issue with a guy on my dev team. He believes that empty fields are bad news. For instance, if we have a customer details table that stores data for customers from different countries, and each country has a slightly different address configuration - plus 1-2 extra fields, e.g. French customer details may also store details for entry code, and floor/level plus title fields (madamme, etc.). South Africa would have a security number. And so on.

Given that we're talking about minor variances my idea is to put all of the fields into the table and use what is needed on each form.

My colleague believes we should have a separate table with extra data. E.g. customer_info_fr. But this seams to totally defeat the purpose of a combined table in the first place.

His argument is that empty fields / columns is bad - but I'm struggling to find justification in terms of database design principles for or against this argument and preferred solutions.

Another option is a separate mini EAV table that stores extra data with parent_id, key, val fields. Or to serialise extra data into an extra_data column in the main customer_data table.

I think I am confused because what I'm discussing is not covered by 3NF which is what I would typically use as a reference for how to structure data.

So my question specifically: -

if you have slight variances in data for each record (1-2 different fields for instance) what is the best way to proceed?

A: 

That's what nullable fields are for: "Data not available/applicable".

SQL has a different notion of null than most programming languages, so SQL's null is often a misunderstood concept.

Johannes Rudolph
Johannes - superb. Thanks for that clarification!
+4  A: 

I'd be interested in your colleague's justification as to why empty fields are bad. As far as I'm aware, empty or null fields aren't bad in and of themselves. If you have a lot of empty data values for a column that you are planning on putting an important index on, you may want to consider other options. This goes for any column where you have a lot of duplicate records actually and need an index, as duplicated records lower the cardinality of the column, making indexes less useful. In your case, I don't see it being an issue.

For this kind of data, you're likely using a VARCHAR or some kind of TEXT column anyway, which are variable length fields in the database. It doesn't matter if your field is full of data or empty, you're still going to incur the overhead of a variable-length column (which isn't worth worrying about in normal circumstances). So again, there's no difference to the RDBMS.

From the sounds of what you're designing, I think if you came up with a generic method of handling address variances in a single table, it would be the way to go. Your code and structure would be much simpler at the negligible (in my opinion) cost of some empty data fields.

zombat
Yep, this is a great answer. Thank you! These fields typically will not need to be indexed, or even searched on (though I can't say never). I have a sneaky feeling my colleague doesn't like empty columns because he has a perfectionistic tendency. But I needed some more solid ideas to counter his beliefs. Yes, these fields would always be varchar, I think.It makes the most sense. Thanks!
+3  A: 

There is definitely a school of thought which holds that NULL fields are bad, in and of themeselves. Relational theory demands that databases consist of facts, and NULLs are the absence of fact. So, a rigourously designed database would have no nullable columns.

What your colleague is proposing is something which is on the road to 6th Normal Form, where all the tables consist of a primary key and at most one other column. Only in such a schema we wouldn't have tables called customer_info_fr. That's not normalised. Many countries might include ENTRY_CODE in there addresses. So we would need address_entry_codes and address_floor_numbers. Not to mention address_building_number and address_building_name, as some places are identified by number and other by name.

It's completely accurate and truthful as a logical design. Alas from a physical perspective it is Teh Suck! The simplest query - select * from addresses - becomes a multi-table join, and outer joins at that. Nullable columns are a way of reconciling ugly design with the hard truth, "you cannae break the laws of physics". Nullable columns allow us to combine disjoint data sets into a single table, albeit at the cost of handling nulls (thety can affect data retrieval, index usage, maths, etc).

APC
I disagree with that school of thought, because NULL **is** (or can be) a fact, not the absence of a fact. e.g. It is a fact that the building in which I'm currently sitting _has no building number_The only area of contention is distinguishing a _missing fact_ - we don't know the building number currently.
Stephen P
@StephenP - but you will concede that such a scenario could be represented by the absence of a record in a notional `building_number` table, which is subsequently populated by a record when the building is assigned a number.
APC
@APC - yes, that's certainly also a valid representation of the situation.
Stephen P
A: 

Nulls invariably add complexity to a data model because the behaviour of null in SQL rarely matches the maths, logic or reality that you intended to model with it. In other words, some queries return incorrect results, which you then need to compensate for with additional logic.

All information can be represented accurately without nulls. Since nulls add complexity it is sound design practice to begin your data model without them and then only add a null where you find some special reason to do so or where some database feature or limitation forces a null upon you.

dportas
all ingformation cannot be represented accurately without null
HLGEM
I beg to differ. Science, maths and logic managed to describe the world without using nulls for thousands of years before SQL came along. Science, maths and logic continue to do so today. A relational database is nothing more or less than a set of propositions about the world. Adding nulls to that paradigm demonstrably doesn't add any extra expressive power - it just means you have a new set of rules to learn. Added to that, the use of nulls in standard SQL is so inconsistent that it pretty much defies any everyday semantics you care to apply it to.
dportas
A: 

Whatever you do, do not go down the EAV route. This is a prescription for a poorly performing database, far, far worse than a few empty fields.

If you must have a separate related tables for the different situations, a lot of that will depend on how different the entities are and how they will be queried. If you will be querying across categories, you will find that joins to a bunch of tables to get all the data you may or may not need is a nightmare (I don't know if Germany will be in my result set so I join to the Germany details tables, oops didn't need to). It can be far simpler to handle nulls than to try it figure out which of many tables you need to join to (and to always remember to left join to those tables).

However, if you will never be querying across the entitites and the fields make sense separate, then put them in a separate table.

HLGEM