I'd like to work on a project, but it's a little odd. I want to create a site that shows lyrics and their translations, but they are shown simultaneously side-by-side (so this isn't just a normal i18n of the site).

I have normalized the tables like this (formatted to show hierarchy).

artists
  artistNames

  albums
    albumNames

    tracks
      trackNames
      trackLyrics
      user

So, my questions:

First, that'll be a whopping seven joins. I must have written pretty small queries in the past because I've never come across something like this. Is joining so many tables a bad thing? I'm pretty sure I'll be using SQLite for this project, but does anyone think PostgreSQL or MySQL could perform better with a pretty big join like this?
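
For reference, the query I have in mind would look something like this (a rough sketch; the table and column names are just guesses since I haven't settled the schema details yet):

<?php
    // Sketch of the big join -- column names like artist_id are provisional.
    $pdo  = new PDO('sqlite:lyrics.db');
    $rows = $pdo->query(
        "SELECT ar.id, an.name AS artist, aln.name AS album,
                tn.name AS track, tl.lyrics
         FROM artists ar
         JOIN artistNames an  ON an.artist_id = ar.id
         JOIN albums al       ON al.artist_id = ar.id
         JOIN albumNames aln  ON aln.album_id = al.id
         JOIN tracks t        ON t.album_id   = al.id
         JOIN trackNames tn   ON tn.track_id  = t.id
         JOIN trackLyrics tl  ON tl.track_id  = t.id"
    )->fetchAll(PDO::FETCH_ASSOC);
?>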

Second, my current self-built framework uses a data mapper to create domain objects. This is the first time I'll be working with so many one-to-many relationships, so my mapper really only maps one row to one object. For example,

id      name
------  ----------
1       Jackie Chan
2       Stephen Chow

So it's super easy to map objects. But with those one-to-many relationships...

id      language    name
------  ----------  -------
1       en          Jackie Chan
1       zh          陳港生
2       en          Stephen Chow
2       zh          周星馳

...I'm not sure what to do. Is looping through the result set to create a massive array and feeding it to my domain object factory the only option when dealing with a data set like this?

<?php
    array(
        array(
            'id' => 1,
            'names' => array(
                'en' => 'Jackie Chan',
                'zh' => '陳港生'
            )
        ),
        array(
            'id' => 2,
            'names' => array(
                'en' => 'Stephen Chow',
                'zh' => '周星馳'
            )
        )
    );
?>
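
The loop I have in mind would be something like this rough sketch, grouping the rows by id before handing them to the factory:

<?php
    // Rough sketch: collapse the multi-row result into one entry per artist.
    // $rows is assumed to be the joined result set from above.
    $artists = array();
    foreach ($rows as $row) {
        $id = $row['id'];
        if (!isset($artists[$id])) {
            $artists[$id] = array('id' => $id, 'names' => array());
        }
        $artists[$id]['names'][$row['language']] = $row['name'];
    }
    $artists = array_values($artists); // matches the structure above
?>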

I have an itch to just denormalize these tables so I can get my one row per object application working, but I've always read this is not the way to go.

Third, does this schema sound right for the job?

+3  A: 

Just a note. I'm not really sure that 7 tables is that big a join. I seem to remember that Postgres has a special query optimiser (based on a genetic algorithm, no less) that only kicks in once you join 12 tables or more.

Stig Brautaset
+3  A: 

The general rule is to make the schema as normalized as possible, then perform stress tests with the expected amount of data. If you find performance bottlenecks, try to optimize in the following order:

  1. Profile and optimize queries (see the sketch below)
    • add indices to the schema
    • add hints for the query optimizer (I don't know if SQLite has any, but most databases do)
  2. If step 1 does not yield any performance benefit, consider denormalizing the database.

Denormalizing a database is usually needed only when you work with "large" amounts of data. I checked several lyrics databases on the internet, and the largest I found has lyrics for about 400,000 songs. Let's assume you can find 1,000,000 lyrics by 500,000 artists. That is an amount of data any of these databases can easily handle on an average modern computer.
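
For step 1, SQLite's EXPLAIN QUERY PLAN will tell you whether a query searches an index or scans the whole table. A minimal sketch using PDO (the index and column names here are invented, adjust to your schema):

<?php
    // Sketch: inspect SQLite's plan for a query, then add an index if the
    // plan reports a full table scan instead of an index search.
    $pdo  = new PDO('sqlite:lyrics.db');
    $plan = $pdo->query(
        "EXPLAIN QUERY PLAN
         SELECT name FROM trackNames WHERE track_id = 42")->fetchAll();
    print_r($plan);
    $pdo->exec("CREATE INDEX IF NOT EXISTS idx_trackNames_track
                ON trackNames (track_id)");
?>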

zendar
+3  A: 

Twelve-way joins are not unheard of in serious industrial work. You need sufficient hardware, a strong DBMS, and good database design. A seven-way join should be a breeze for any good environment.

You separate data out, as needed, to avoid difficulties like database update anomalies; these anomalies are what you get when you don't follow the normalization rules. You join data as needed to get everything you need in a single result.

Sometimes it's better to ignore some of the normalization rules when you build a database. In that case, you need an alternative set of design principles to avoid design by trial and error. The amount of joining you are doing has little to do with the disadvantages of looping through results or an unfortunate mapping between tuples and objects.

Most mappings between tuples (table rows) and objects are done in an incorrect fashion: a tuple is an object, but it isn't an application-oriented object. This can cause performance problems, difficult programming, or both.

As far as you can, avoid looping through results one row at a time. Deal with results as a set of data. If you can't do that in PHP, then you need to learn how, or get a better programming environment.
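
For example, PDO can hand you a result set already grouped by the first selected column, which gets you most of the way to the nested structure in one fetch. A sketch (the artist_id column name is assumed):

<?php
    // Sketch: fetch all name rows at once and let PDO group them by the
    // first selected column (the artist id).
    $pdo     = new PDO('sqlite:lyrics.db');
    $stmt    = $pdo->query("SELECT artist_id, language, name FROM artistNames");
    $grouped = $stmt->fetchAll(PDO::FETCH_GROUP | PDO::FETCH_ASSOC);
    // $grouped[1] => all language/name rows for artist 1, and so on.
?>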

Walter Mitty
+2  A: 

Doing this many joins shouldn't be an issue on any serious DB. I haven't worked with SQLite enough to know if it's in the "serious" category. The only way to find out is to create your schema, load up a lot of data, and start looking at query plans (visual explains are very useful here). When I'm doing these kinds of tests, I usually shoot for 10x the data I expect to have in production. If things work OK with that much data, I know I should be OK with the real data.
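
If you go that route with SQLite, one workable approach is to generate synthetic rows, wrapped in a single transaction so the inserts stay fast. A sketch (the table and row counts are made up, size them to 10x your expected data):

<?php
    // Sketch: bulk-load synthetic test data inside one transaction.
    $pdo = new PDO('sqlite:lyrics.db');
    $pdo->beginTransaction();
    $ins = $pdo->prepare(
        "INSERT INTO trackNames (track_id, language, name) VALUES (?, ?, ?)");
    for ($i = 1; $i <= 1000000; $i++) {
        $ins->execute(array($i, 'en', 'Test track ' . $i));
    }
    $pdo->commit();
?>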

Also, depending on how you need to retrieve the data, you may want to try subqueries instead of joins:

select a.*,
       (select r.name
        from artist r
        where r.id = a.artist and r.locale = 'en') as artist_name
from album a
where a.id = 1;

AngerClown
A: 

I've helped a friend optimize a web storefront, and your case is much the same.

First: what is your priority, webpage speed or update speed?

Normal forms were designed to make data maintenance simple. If Prince changes his name again, voila, just one row is updated. But if you want your web pages to render as fast as possible, then third normal form isn't your best plan. Yes, everyone is correct that the database will handle a seven-way join without a problem, but it will cost dozens of I/Os: an index lookup on every table, then table access by rowid, again and again. If you denormalize for page-load speed, you may do only 2 or 3 I/Os. That also allows for greater scaling: since every page hit needs fewer I/Os to complete, you'll be able to serve more simultaneous hits before maxing out your I/O.

But there's no reason not to do both. You can keep the base data, the official copy, in a normal form, then write a script that generates a denormalized table for web performance. If it's not that big, you can regenerate the whole thing in a few minutes of maintenance downtime. If it is very big, you may need to be smart about the update and only change what needs to change, keeping change vectors in an intermediate driving table.
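
The regeneration script can be as simple as rebuilding one wide table from the normalized base. A sketch (the track_pages table and all column names are invented):

<?php
    // Sketch: rebuild a wide, read-optimized table from the normalized data.
    $pdo = new PDO('sqlite:lyrics.db');
    $pdo->exec("DROP TABLE IF EXISTS track_pages");
    $pdo->exec("CREATE TABLE track_pages AS
                SELECT t.id     AS track_id,
                       an.name  AS artist,
                       aln.name AS album,
                       tn.name  AS track,
                       tl.lyrics
                FROM tracks t
                JOIN albums al       ON al.id = t.album_id
                JOIN artists ar      ON ar.id = al.artist_id
                JOIN artistNames an  ON an.artist_id = ar.id
                JOIN albumNames aln  ON aln.album_id = al.id
                JOIN trackNames tn   ON tn.track_id  = t.id
                JOIN trackLyrics tl  ON tl.track_id  = t.id");
?>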

But at the heart of your design I have a question.

Artist names change over time. John Cougar became John Cougar Melonhead (or something) and later he became John Mellencamp. Do you care which John did a song? Will you stamp the entries with valid-from and valid-to dates?

It looks like you have a one-to-many relationship from artists to albums, but that really should be many-to-many.
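
That is, a link table roughly like this (a sketch; the table and column names are invented):

<?php
    // Sketch: a link table so an album can credit several artists.
    $pdo = new PDO('sqlite:lyrics.db');
    $pdo->exec("CREATE TABLE artist_albums (
                    artist_id INTEGER NOT NULL REFERENCES artists(id),
                    album_id  INTEGER NOT NULL REFERENCES albums(id),
                    PRIMARY KEY (artist_id, album_id)
                )");
?>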

Sometimes the same album is released more than once, with different included tracks and sometimes with different names for a track. Think international releases. Or bonus tracks. How will you know that's all the same album?

If you don't care about those details, then why bother with normalization? If Jon and Vangelis is one artist, then there is simply no need to normalize. You're not interested in the answers normalization will provide.

Stephanie Page