I'd like to work on a project, but it's a little odd. I want to create a site that shows lyrics and their translations, but they are shown simultaneously side-by-side (so this isn't just a normal i18n of the site).

I have normalized the tables like this (formatted to show hierarchy).

artists
  artistNames

  albums
    albumNames

    tracks
      trackNames
      trackLyrics
      user

So, my questions:

First, that'll be a whopping seven joins. I must have written pretty small queries in the past because I've never come across something like this. Is joining so many tables a bad thing? I'm pretty sure I'll be using SQLite for this project, but does anyone think PostgreSQL or MySQL could perform better with a pretty big join like this?
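
For reference, the query I have in mind would look something like this (a rough sketch; the table and column names are just guesses since I haven't settled the schema details yet):

<?php
    // Sketch of the big join -- column names like artist_id are provisional.
    $pdo  = new PDO('sqlite:lyrics.db');
    $rows = $pdo->query(
        "SELECT ar.id, an.name AS artist, aln.name AS album,
                tn.name AS track, tl.lyrics
         FROM artists ar
         JOIN artistNames an  ON an.artist_id = ar.id
         JOIN albums al       ON al.artist_id = ar.id
         JOIN albumNames aln  ON aln.album_id = al.id
         JOIN tracks t        ON t.album_id   = al.id
         JOIN trackNames tn   ON tn.track_id  = t.id
         JOIN trackLyrics tl  ON tl.track_id  = t.id"
    )->fetchAll(PDO::FETCH_ASSOC);
?>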

Second, my current self-built framework uses a data mapper to create domain objects. This is the first time I'll be working with so many one-to-many relationships, so my mapper really only maps one row to one object. For example,

id      name
------  ----------
1       Jackie Chan
2       Stephen Chow

So it's super easy to map objects. But with those one-to-many relationships...

id      language    name
------  ----------  -------
1       en          Jackie Chan
1       zh          陳港生
2       en          Stephen Chow
2       zh          周星馳

...I'm not sure what to do. Is looping through the result set to create a massive array and feeding it to my domain object factory the only option when dealing with a data set like this?

<?php
    array(
        array(
            'id' => 1,
            'names' => array(
                'en' => 'Jackie Chan',
                'zh' => '陳港生'
            )
        ),
        array(
            'id' => 2,
            'names' => array(
                'en' => 'Stephen Chow',
                'zh' => '周星馳'
            )
        )
    );
?>
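
The loop I have in mind would be something like this rough sketch, grouping the rows by id before handing them to the factory:

<?php
    // Rough sketch: collapse the multi-row result into one entry per artist.
    // $rows is assumed to be the joined result set from above.
    $artists = array();
    foreach ($rows as $row) {
        $id = $row['id'];
        if (!isset($artists[$id])) {
            $artists[$id] = array('id' => $id, 'names' => array());
        }
        $artists[$id]['names'][$row['language']] = $row['name'];
    }
    $artists = array_values($artists); // matches the structure above
?>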

I have an itch to just denormalize these tables so I can get my one row per object application working, but I've always read this is not the way to go.

Third, does this schema sound right for the job?

+3  A: 

Just a note. I'm not really sure that 7 tables is that big a join. I seem to remember that Postgres has a special query optimiser (based on a genetic algorithm, no less) that only kicks in once you join 12 tables or more.

Stig Brautaset
+3  A: 

The general rule is to make the schema as normalized as possible, then perform stress tests with the expected amount of data. If you find performance bottlenecks, try to optimize in the following order:

  1. Profile and optimize queries (see the sketch below)
    • add indices to the schema
    • add hints for the query optimizer (I don't know if SQLite has any, but most databases do)
  2. If step 1 does not yield any performance benefit, consider denormalizing the database.

Denormalizing a database is usually needed only when you work with "large" amounts of data. I checked several lyrics databases on the internet, and the largest I found has lyrics for about 400,000 songs. Let's assume you can find 1,000,000 lyrics by 500,000 artists. That is an amount of data any of these databases can easily handle on an average modern computer.
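
For step 1, SQLite's EXPLAIN QUERY PLAN will tell you whether a query searches an index or scans the whole table. A minimal sketch using PDO (the index and column names here are invented, adjust to your schema):

<?php
    // Sketch: inspect SQLite's plan for a query, then add an index if the
    // plan reports a full table scan instead of an index search.
    $pdo  = new PDO('sqlite:lyrics.db');
    $plan = $pdo->query(
        "EXPLAIN QUERY PLAN
         SELECT name FROM trackNames WHERE track_id = 42")->fetchAll();
    print_r($plan);
    $pdo->exec("CREATE INDEX IF NOT EXISTS idx_trackNames_track
                ON trackNames (track_id)");
?>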

zendar
+3  A: 

Twelve-way joins are not unheard of in serious industrial work. You need sufficient hardware, a strong DBMS, and good database design. A seven-way join should be a breeze for any good environment.

You separate data out, as needed, to avoid difficulties like database update anomalies; these anomalies are what you get when you don't follow the normalization rules. You join data as needed to get everything you need in a single result.

Sometimes it's better to ignore some of the normalization rules when you build a database. In that case, you need an alternative set of design principles to avoid design by trial and error. The amount of joining you are doing has little to do with the disadvantages of looping through results or an unfortunate mapping between tuples and objects.

Most mappings between tuples (table rows) and objects are done in an incorrect fashion: a tuple is an object, but it isn't an application-oriented object. This can cause performance problems, difficult programming, or both.

As far as you can, avoid looping through results one row at a time. Deal with results as a set of data. If you can't do that in PHP, then you need to learn how, or get a better programming environment.
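
For example, PDO can hand you a result set already grouped by the first selected column, which gets you most of the way to the nested structure in one fetch. A sketch (the artist_id column name is assumed):

<?php
    // Sketch: fetch all name rows at once and let PDO group them by the
    // first selected column (the artist id).
    $pdo     = new PDO('sqlite:lyrics.db');
    $stmt    = $pdo->query("SELECT artist_id, language, name FROM artistNames");
    $grouped = $stmt->fetchAll(PDO::FETCH_GROUP | PDO::FETCH_ASSOC);
    // $grouped[1] => all language/name rows for artist 1, and so on.
?>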

Walter Mitty
+2  A: 

Doing this many joins shouldn't be an issue on any serious DB. I haven't worked with SQLite enough to know if it's in the "serious" category. The only way to find out is to create your schema, load up a lot of data, and start looking at query plans (visual explains are very useful here). When I'm doing these kinds of tests, I usually shoot for 10x the data I expect to have in production. If things work OK with that much data, I know I should be OK with the real data.
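
If you go that route with SQLite, one workable approach is to generate synthetic rows, wrapped in a single transaction so the inserts stay fast. A sketch (the table and row counts are made up, size them to 10x your expected data):

<?php
    // Sketch: bulk-load synthetic test data inside one transaction.
    $pdo = new PDO('sqlite:lyrics.db');
    $pdo->beginTransaction();
    $ins = $pdo->prepare(
        "INSERT INTO trackNames (track_id, language, name) VALUES (?, ?, ?)");
    for ($i = 1; $i <= 1000000; $i++) {
        $ins->execute(array($i, 'en', 'Test track ' . $i));
    }
    $pdo->commit();
?>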

Also, depending on how you need to retrieve the data, you may want to try subqueries instead of joins:

select a.*,
       (select r.name
        from artist r
        where r.id = a.artist and r.locale = 'en') as artist_name
from album a
where a.id = 1;

AngerClown
A: 

I've helped a friend optimize a web storefront, and your case is much the same.

First: what is your priority, webpage speed or update speed?

Normal forms were designed to make data maintenance simple. If Prince changes his name again, voila, just one row is updated. But if you want your web pages to render as fast as possible, then third normal form isn't your best plan. Yes, everyone is correct that the database will handle a seven-way join without a problem, but it will cost dozens of I/Os: an index lookup on every table, then table access by rowid, again and again. If you denormalize for page-load speed, you may do only 2 or 3 I/Os. That also allows for greater scaling: since every page hit needs fewer I/Os to complete, you'll be able to serve more simultaneous hits before maxing out your I/O.

But there's no reason not to do both. You can keep the base data, the official copy, in a normal form, then write a script that generates a denormalized table for web performance. If it's not that big, you can regenerate the whole thing in a few minutes of maintenance downtime. If it is very big, you may need to be smart about the update and only change what needs to change, keeping change vectors in an intermediate driving table.
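
The regeneration script can be as simple as rebuilding one wide table from the normalized base. A sketch (the track_pages table and all column names are invented):

<?php
    // Sketch: rebuild a wide, read-optimized table from the normalized data.
    $pdo = new PDO('sqlite:lyrics.db');
    $pdo->exec("DROP TABLE IF EXISTS track_pages");
    $pdo->exec("CREATE TABLE track_pages AS
                SELECT t.id     AS track_id,
                       an.name  AS artist,
                       aln.name AS album,
                       tn.name  AS track,
                       tl.lyrics
                FROM tracks t
                JOIN albums al       ON al.id = t.album_id
                JOIN artists ar      ON ar.id = al.artist_id
                JOIN artistNames an  ON an.artist_id = ar.id
                JOIN albumNames aln  ON aln.album_id = al.id
                JOIN trackNames tn   ON tn.track_id  = t.id
                JOIN trackLyrics tl  ON tl.track_id  = t.id");
?>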

But at the heart of your design I have a question.

Artist names change over time. John Cougar became John Cougar Melonhead (or something) and later he became John Mellencamp. Do you care which John did a song? Will you stamp the entries with valid-from and valid-to dates?

It looks like you have a one-to-many relationship from artists to albums, but that really should be many-to-many.
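
That is, a link table roughly like this (a sketch; the table and column names are invented):

<?php
    // Sketch: a link table so an album can credit several artists.
    $pdo = new PDO('sqlite:lyrics.db');
    $pdo->exec("CREATE TABLE artist_albums (
                    artist_id INTEGER NOT NULL REFERENCES artists(id),
                    album_id  INTEGER NOT NULL REFERENCES albums(id),
                    PRIMARY KEY (artist_id, album_id)
                )");
?>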

Sometimes the same album is released more than once, with different included tracks and sometimes with different names for a track. Think international releases. Or bonus tracks. How will you know that's all the same album?

If you don't care about those details, then why bother with normalization? If Jon and Vangelis is one artist, then there is simply no need to normalize. You're not interested in the answers normalization will provide.

Stephanie Page