views:

612

answers:

4

Let's say you have a database with a single table like...

---------------------------------------------
| Name    |  FavoriteFood                   |
---------------------------------------------
| Alice   | Pizza                           |
| Mark    | Sushi                           |
| Jack    | Pizza                           |
---------------------------------------------

Would it be more space-efficient to have an additional table called "Strings" that stores strings, and change the FavoriteFood column to an index in the string table. In the above example, "Pizza" looks like it is stored twice, but with the additional table, it would appear to be stored only once. Of course, please assume there are 1,000,000 rows and 1,000 unique strings instead of just 3 rows and 2 unique strings.

Edit: We don't know what the FavoriteFoods are beforehand: they are user-supplied. The programmatic interface to the string table would be something like...

String GetString(int ID) { return String at with Row-ID == ID }

int GetID(String s) {
  if s exists, return row-id;
  else {
    Create new row;
    return new row id;
  }
}

So the string-table seems more efficient, but do modern databases already do that in the background, so I can just do the simple one table approach and be efficient?

+4  A: 

What are you measuring efficiency by? Assuming there is no other data associated with each FavoriteFood (in which case obviously you want two tables), a one-table approach is probably more time efficient, as the unnecessary join would incur an extra processing cost. On the other hand, a two-table approach may be more space-efficient, since it takes less space to store an index than a string, but that depends on how the particular database that you're using optimizes storage of repeated strings.

Tyler McHenry
Actually you are right, join process would consume more time and performance and will be worse than repeated string records.
Braveyard
+2  A: 

In case you have another table to store the strings, it will be easier when you want to update the descriptions, for example, if u need to update all Pizzas to Italian Pizza, then u can do with one row update if u use a separate table. Another advantage would be translations, u can use the other table to store translations of the string in different languages and select the one based on the current language.

But the problem with that approach would be for inserts. U need to insert in both tables and also need to maintain the foreign key constraints, so it adds a bit of complexity to a simple table.

Arvind
+3  A: 

You should be thinking in terms of what makes a good design in terms of your problem domain rather than efficiency (unless you expect to have tens of millions+ rows).

A well designed database should be in 3NF (third normal form). Only denormalise when you have identified a performance problem by measuring.

Mitch Wheat
Assuming there is no auxiliary data associated with the food items (e.g. nutrition info), then his design is already in 3NF. Not every bit of repeated data must be an integer in order to have a properly normalized database.
Tyler McHenry
@Tyler McHenry: I didn't say it did, did I? I was making the point that design is more important than optimising a performance problem you don't have.
Mitch Wheat
I would say that the best reason to go with non-normalized data in this case would be simply because it's user entered data. So getting them to find their already entered string in the 1000 strings that are already entered is going to be difficult. You're going to end up with 6 different variations of each thing anyway, because users will want to be original, and say they like pepperoni pizza, or pizza with just cheese, or pizza with 6 kinds of meat on it. Even though Pizza would just suffice, you're going to get 6 different kinds of pizza, so there isn't much of a point in normalizing it.
Kibbee
Ugh, "well-designed" != 3NF in a lot of situations.
jfar
@jfar: thx for the downvote. Would you mind explaining your 'Ugh!' Please give an example where 3NF is bad (apart from cubes).
Mitch Wheat
3NF is a bad choice if you read performance is critical or for some application development scenarios where your ORM restricted to 1:1 mapping. Sometimes the extra bit of writes/updates and data integrity checks are a small price to pay for faster development. Also, I tend to vote down any "this way is the only way answers".
jfar
@jfar: that's hasn't been my experience. 3NF is what you should design for, and then maybe (just maybe) in extreme circumstances, denormalise as little as possible. If you have performance problems with a 3NF schema, your indexes are not right.
Mitch Wheat
+1  A: 

Pros for having a separate "Strings" table:

  • Likely, less space, if strings repeat really frequently
  • Likely, faster typical queries - because of less I\O

Cons:

  • You'll write more complex queries to achieve the same result
  • If the repetition factor is rather small, you'll get higher query execution time. To resolve each ID to string (or back), database server will perform a single lookup (seek operation) per each ID. So you get additional log(Strings.Count()) factor ~ for each query doing this.

But actually this is really effecient. E.g. most of full-text search engines use nearly this approach to store document-word maps.

Alex Yakunin