views:

59

answers:

2

The entities to be stored have 25+ properties (table columns). The entities are pretty diverse, meaning that, most of the columns are empty. On average, I'd say, less than 20% (<5) properties have a value in any particular item. So, I have a lot of redundant empty columns for most of the table rows. Almost all of the columns are decimal numbers.

Given this scenario, would you suggest serializing the columns instead, or perhaps, create another table named "Property", which would contain all the possible properties and then creating yet another table "EntityProperty" which would map an property to an entity using foreign keys? Or would you leave it as it is?

An example scenario where this kind of redundancy might occur could be the following:

We have an imaginary universe with lots of planets. We are creating a space mining game and each planet has 30 different mineral contents. Most of the planets have only 2-3 minerals.

The simplest solution would be to create a single table 'Planets' with 30 columns, one for each mineral. The problem here is that most of rows in the 'Planets' table have 25+ columns, in which each of one the value is null or zero. Thus we have lot of redundant data. Say, we would have 500k-1M records. I would guess it costs a byte at most to save a null or zero decimal value. Thus, we waste space 500,000-1,000,000 bytes, ie. one megabyte at most.

The other solution would be to create two additional tables. Instead of storing all the minerals in the 'Planets' table, we take them all out and create a table for the minerals called 'Minerals'. This would contain only 30 rows, one for each different mineral type. Then, we create a table called 'PlanetMineral' which contains a reference to a planet row and to a mineral row, and additionally this table would have a column telling the amount of the mineral the planet has. Apparently, in many database systems this complicates queries since you have to do possible several joins. I'm using SQL server with LINQ to SQL, which scaffolds the foreign key constraint into class object property, accessible through code. (ie. I can simply access the minerals a planet has with planet.Minerals) So, from this perspective it doesn't add complexity. The redundancy is a small portion (like 1/15) of the first solution. The reason there is still some overhead is because of the foreign keys we need to store.

As for the data query efficiency, I really don't know how the costs of the queries would compare between these two solutions.

A: 

For numbers, I would personally leave it as is, in 1 table. Numbers are compressed into a few bytes, and the overhead for having an EntityProperty table would far outweight that. Serializing is an option, but it means you cannot use SQL to search or compute the properties, you have to get the data, deserialise, and then compute.

oedo
A: 

It depends:

  • How many entities (rows) you are planning to have?
  • What kind of queries you run against that table?
  • Will there be a lot of new properties in future?
  • How are you planning to use the properties?

You seem to be concerned about wasting space with simple table? Try to calculate if space saving with other approaches are really significant and worthwhile. The disk is (usually) cheap.

If you have low number of rows, then the single table is probably better (it is easier to implement).

If you plan to create complex queries against the properties (eg. where property1 < 123) then the simple table is probably easier.

If you are planing to add lot of new properties in the future then the Property/EntityProperties approach could be useful.


I'd go with the simple one table approach because you have a rather small amount of rows (<1M), you are probably running your database with server machines and not some handheld/mobile thing (SQLServer) and your database schema is rather rigid.

Juha Syrjälä
Terve! 500k-1M rows. Most of the queries deal with retrieving all or part of the rows with all relevant information (for displaying a list of the entities on a website with paging). There shouldn't be a lot of new properties in the future since we hopefully have documented the needed columns. The data of the properties themselves would change very little, if at all, so most (>99%) of the queries deal with retrieval from a dataset. It would seem the single table approach is more suitable.
thenextwebguy