ansaurus

Question

Best to have hundreds of columns or split into multiple tables?

Answer 1

A:

Normalization ensures that you don't repeat data in your schema.

There are limits to how far you should go, of course. JOINS for 7 tables or more are not performant.

But one monster table? I'd break it up.

duffymo 2009-01-09 14:49:48

Answer 2

+2 A:

When I see hundreds of columns in a table, i tend to suspect the data schema hasn't been properly normalised. Are the hundreds of columns really unique, or are they groups of similar things that can be normalised out into smaller tables?

If you can reduce the number of columns, you are liable to reduce the total amount of data transacted and hence improve performance at a number of levels. For example, if you have a record that contains 1000 bytes of data, and you want to change 1 byte for each record, you risk fetching and storing 999 bytes unnecessarily. This does impact performance.

Shane MacLaughlin 2009-01-09 14:51:09

Thankfully, it looks like once this data is entered, it will never be modified, although it will eventually be deleted to free up space.

Eli Courtwright 2009-01-09 15:11:52

Answer 3

A:

In this situation I would create a couple of tables. One would be the machine table. One would be a problem lookup table. Finally, a junction table between the two that also contains info related to the status. Maintenance will be easier and writing crazy reports will be easier. Also, adding new types of statuses will be easier as well.

machine
id
name
description

status_flag
id
caption

machine_history
machine_id
status_flag_id
information

Then you can do stuff like: select count(distinct machine_id) from machine_history where status_flag_id = 23 and information < 5;

The only thing is the information field in the machine_history table may need to contain numbers or characters. If that's the case, I would create two information fields so that you don't impede performance.

Also I'm assuming there is a programming component to this that will allow you to create some methods for easily working with this data.

Jeremy 2009-01-09 14:53:44

Just to add to this, you can create status_flag categories that group them by your different types, ie malfunctions, jams, etc.

Jeremy 2009-01-09 15:12:46

Answer 4

+2 A:

Do you mean 100s of types of statistics?

Some medical databases have tried a schema or idiom which is called "entity attribute value" or "EAV" (you can Google these terms): the reasoning is that there are innumerable different types of fact about a patient, which may or may not have been captured for any given patient, and that EAV is a better way to store this than to have innumerable different columns in a table.

Beware however that EAV is controversial: some say it's a "code smell" and typical newbie mistake; others say that it's useful occasionally (or rarely) but depends on (specifying and having) good support for metadata.

ChrisW 2009-01-09 15:00:24

I did mean hundreds of types of statistics. Thanks for the suggestion about EAV; I'll look into this.

Eli Courtwright 2009-01-09 15:03:38

Answer 5

A:

I tend to not like tables with too many columns. One option you might consider is to store the stats as rows in a stats table:

CREATE TABLE Statistics (id AS INTEGER PRIMARY KEY, statusType As VarChar,
statusValue As Float);

Then you just add a new row with for each status you're tracking. This is a lot cleaner from a DB perspective, but it does make getting at the data trickier for reports.

Paul Lefebvre 2009-01-09 15:00:49

Answer 6

+4 A:

I think we need to know more about your design to answer properly. For example, I'm curious that there could be lots of columns relating to malfuctions, lots (of different ones) relating to jams etc. (Isn't a jam just a kind of malfunction anyway?)

Is your design normalised? Presumaly you don't have columns like "jam1", "jam2", etc.?!

Assuming the design is good and normalised, the decision as to whether to have one wide table or many narrower ones is a trade-off between various factors:

Do all/most records have statistics of all types? Yes => one table, no => many
Do you often need to query statistics of all types together? Yes => one table, no => many
Do you maintain all the different stats together in the same screen? Yes => one table, no => many
Are you likely to hit any database limits e.g. max 1000 columns per table?

Whichever way you go, you can use views to present the alternative structure for the convenience of the developer:

One table: many views that select stats of particular types
Many tables: a view that joins all the tables together

Update

From your comments, I now know that you have counts of jams at 40 different locations on the machine, and other types of stats are counts of a similar nature. This suggests the following table design:

create table machines (machine_id ... primary key, ...);
create table machine_stats 
   ( machine_id references machines
   , stat_group -- 'jams', 'malfunctions' etc.
   , stat_name  -- 'under the hood', 'behind the door' etc.
   , stat_count 
   );

As someone commented below, these allows you to sum stats more easily - within or across stat types. It is also easily extended if a new stat needs to be added to a stat type.

Tony Andrews 2009-01-09 15:02:40

Each jam statistic is a count of the number of jams which have occurred at a specific place in the machine. Because there are over 40 different places where a jam can occur, we have over 40 jam counts. Most of our statistics are counts such as this.

Eli Courtwright 2009-01-09 15:14:47

At a glance this sounds like it could be normalised down to jam position, jam counts. This would be worth considering if most of the 40 jam counts were 0 for a given row.

Shane MacLaughlin 2009-01-09 15:28:11

Even if most of the jams weren't going to be zero, I think smacl's suggestion would be worth following. Wide tables give me headaches. Also, what if you want to calculate total jams? JamPos1 + JamPos2 ... gets old fast. sum(jams) is much nicer.

rmeador 2009-01-09 15:44:28

Thanks for the update! I hadn't thought about using a stat group field to make it easy to index, retrieve, and sum related statistics efficiently, but it's a great idea and I'll probably do just that.

Eli Courtwright 2009-01-12 17:26:04

ansaurus

tags:

views:

answers:

Best to have hundreds of columns or split into multiple tables?

Update

related questions