I've got a problem that keeps coming up with normalized databases, and I'm looking for the best solution.

Suppose I've got an album information database. I want to set up the schema in a normalized fashion, so I set up two tables - albums, which has one listing for each album, and songs, which lists all songs contained by albums.

albums
------
aid
name

songs
-----
aid
sid
length

This setup is good for storing the data in a normalized fashion, as an album can contain any number of songs. However, accessing the data in an intuitive manner has now become a lot more difficult. A query which only grabs the information on a single album is simple, but how do you grab multiple albums at once in a single query?

Thus far, the best answer I have come up with is grouping by aid and converting the song information into arrays. For example, the result would look something like this:

aid, sids,      lengths
1,   [1, 2],    [1:04, 5:45]
2,   [3, 4, 5], [3:30, 4:30, 5:30]

When I want to work with the data, I have to then parse the sids and lengths, which seems a pointless exercise: I'm making the db concatenate a bunch of values just to separate them later.

My question: What is the best way to access a database with this sort of schema? Am I stuck with multiple arrays? Should I store the entirety of a song's information in an object and then put those songs into a single array, instead of having multiple arrays? Or is there a way of adding an arbitrary number of columns to the result set (sort of an infinite join) to accommodate N songs? I'm open to any ideas on how best to access the data.

I'm also concerned about efficiency, as these queries will be run often.

If it makes any difference, I'm using a PostgreSQL db along with a PHP front-end.

A: 

A join query asks the database to put the tables together, matching the IDs, and return a single table. That way the data can be dynamically shaped to the current task, something that non-normalized databases cannot do.

Phil
+2  A: 

I have difficulty seeing your point. What exactly do you mean by "how do you grab multiple albums at once in a single query"? What exactly do you have difficulties with?

Intuitively I would say:

SELECT
  a.aid    album_id,
  a.name   album_name,
  s.sid    song_id,
  s.name   song_name,
  s.length song_length
FROM
  albums a
  INNER JOIN songs s ON a.aid = s.aid
WHERE
  a.aid IN (1, 2, 3)

and

SELECT
  a.aid         album_id,
  a.name        album_name,
  COUNT(s.sid)  count_songs,
  SUM(s.length) sum_length   /* assuming you store an integer seconds value  */
FROM                         /* here, not a string containing '3:18' or such */
  albums a
  INNER JOIN songs s ON a.aid = s.aid
WHERE
  a.aid IN (1, 2, 3)
GROUP BY
  a.aid

Depending on what you want to know/display, either you query the database for aggregate information, or you calculate it yourself in your app from the result of query #1.

Depending on how much data is cached in your app and how long the queries take, one strategy can be faster than the other. I would recommend querying the DB, though. DBs are made for this kind of stuff.
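Just to make the difference concrete, here's a minimal sketch of both queries run from Python against an in-memory SQLite database standing in for PostgreSQL. The schema, album/song names, and data are all invented for illustration; lengths are stored as integer seconds, as the comment in the second query suggests:

```python
import sqlite3

# In-memory stand-in for the albums/songs schema (SQLite here purely for
# illustration; the queries above target PostgreSQL). Data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (aid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE songs  (aid INTEGER, sid INTEGER PRIMARY KEY,
                         name TEXT, length INTEGER);  -- length in seconds
    INSERT INTO albums VALUES (1, 'First'), (2, 'Second');
    INSERT INTO songs  VALUES (1, 1, 'Intro', 64), (1, 2, 'Outro', 345),
                              (2, 3, 'A', 210), (2, 4, 'B', 270),
                              (2, 5, 'C', 330);
""")

# Query #1: one row per song; the album columns repeat on every row.
rows = conn.execute("""
    SELECT a.aid, a.name, s.sid, s.name, s.length
    FROM albums a INNER JOIN songs s ON a.aid = s.aid
    WHERE a.aid IN (1, 2)
""").fetchall()

# Query #2: one row per album, with the aggregates computed by the DB.
agg = conn.execute("""
    SELECT a.aid, a.name, COUNT(s.sid), SUM(s.length)
    FROM albums a INNER JOIN songs s ON a.aid = s.aid
    WHERE a.aid IN (1, 2)
    GROUP BY a.aid, a.name
    ORDER BY a.aid
""").fetchall()

print(len(rows))  # 5 song rows
print(agg)        # [(1, 'First', 2, 409), (2, 'Second', 3, 810)]
```

The first result set is what your app would iterate over to show songs; the second is what you'd use for counts and totals, rather than recomputing them in PHP.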

Tomalak
I see your point, but I have issues with the first query, because you end up with a lot of repeated data - the album name is repeated many times. I'm trying to have my cake and eat it, too - I want the data to be as compact as possible, but that's not realistic without aggregates.
Daniel Lew
Leave off the album name from the first query. You have it in the second one (which probably comes first anyway), and your app can store some context as well. Other than that, I see your point as well. But I guess the repeated album name won't clog your performance too badly. ;-)
Tomalak
(Funnily enough I rephrased my second paragraph before posting to avoid the "you can't have your cake and eat it too" platitude :-D)
Tomalak
A: 
SELECT aid,GROUP_CONCAT(sid) FROM songs GROUP BY aid; 

+----+-------------------------+
|aid | GROUP_CONCAT(sid)       |
+----+-------------------------+
|  3 | 5,6,7                   |
+----+-------------------------+
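As it happens, SQLite ships the same GROUP_CONCAT function, so the query above can be sketched from Python's sqlite3 module (table and data invented for illustration; PostgreSQL itself would need the workaround mentioned in the comments):

```python
import sqlite3

# Illustration only: SQLite also provides GROUP_CONCAT, so it can stand in
# for the MySQL-flavoured query above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (aid INTEGER, sid INTEGER)")
conn.executemany("INSERT INTO songs VALUES (?, ?)", [(3, 5), (3, 6), (3, 7)])

result = conn.execute(
    "SELECT aid, GROUP_CONCAT(sid) FROM songs GROUP BY aid"
).fetchall()
print(result)  # one concatenated string per album, e.g. [(3, '5,6,7')]
```

Note the caveat the asker already raised: the app then has to split that string back apart.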
Lance Kidwell
My googling suggests that GROUP_CONCAT() is not supplied by PostgreSQL. However you can build it yourself using CREATE AGGREGATE.
j_random_hacker
Yes, that's true. I didn't notice the PostgreSQL part of the question.
Lance Kidwell
+2  A: 

I see your point, but I have issues with the first query, because you end up with a lot of repeated data - the album name is repeated many times. I'm trying to have my cake and eat it, too - I want the data to be as compact as possible, but that's not realistic without aggregates.

Ah, I understand your question now. You're asking how best to micro-optimize something that's actually not very expensive for most cases. And the solution you're toying with is actually going to be significantly less efficient than the "problem" it's trying to solve.

My advice would be to join the tables and return the columns you need. For anything less than 10,000 records returned, you won't notice any significant wire time penalty for handing back that AlbumName with each Song record.

If you notice it performing slowly in the field, then optimize it. But keep in mind that a lot of smart people have spent about 50 years of research making the "join the tables & return what you need" solution fast. I doubt you'll beat it with your home-rolled string concatenation/de-concatenation strategy.

Jason Kester
This is an example. The actual albums table will have approximately 10 columns that I'll want, and that's a lot of repeated data. I'm going with two queries instead. Also, no need to be condescending. I know that string concat/de-concat would be slow, which is why I posted the question. :P
Daniel Lew
You should also know that returning two recordsets will be slow. Certainly not worth it to avoid repeating 10 columns a few hundred times. Sorry if I sounded condescending. This seems to be new ground for you, and it's DB 101.
Jason Kester
+1  A: 

I agree with Jason Kester insofar as I think this is unlikely to really be a performance bottleneck in practice, even if you have 10 columns with repeated data. However, if you're bent on cutting out that repeated data then I'll suggest using 2 queries:

Query #1:

SELECT sid, length     -- And whatever other per-song fields you want
FROM songs
ORDER BY aid

Query #2:

SELECT aid, a.name, COUNT(*)
FROM albums a
JOIN songs s USING (aid)
GROUP BY aid, a.name
ORDER BY aid, a.name

The second query enables you to break up the output of the first query into segments appropriately. Note that this will only work reliably if you can assume that no changes will be made to the table between these two queries -- otherwise you'll need a transaction with SET TRANSACTION ISOLATION LEVEL SERIALIZABLE.

Again, the mere fact that you're using two separate queries is likely to make this slower overall as in most cases the doubled network latency + query parsing + query planning is likely to swamp the effective increase in network throughput. But at least you won't have that nasty horrible feeling of sending repeated data... :)
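A sketch of how the app-side stitching might look, with SQLite standing in for PostgreSQL and invented data. Query #2's per-album counts are used to slice query #1's row stream into per-album segments (I've added ", sid" to query #1's ORDER BY here only so the demo output is deterministic):

```python
import sqlite3
from itertools import islice

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (aid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE songs  (aid INTEGER, sid INTEGER PRIMARY KEY, length INTEGER);
    INSERT INTO albums VALUES (1, 'First'), (2, 'Second');
    INSERT INTO songs  VALUES (1, 1, 64), (1, 2, 345),
                              (2, 3, 210), (2, 4, 270), (2, 5, 330);
""")

# Query #1: all songs, ordered so each album's rows are contiguous.
songs_it = iter(conn.execute(
    "SELECT sid, length FROM songs ORDER BY aid, sid").fetchall())

# Query #2: per-album counts, used to segment query #1's output.
albums = conn.execute("""
    SELECT aid, a.name, COUNT(*)
    FROM albums a JOIN songs s USING (aid)
    GROUP BY aid, a.name
    ORDER BY aid, a.name
""").fetchall()

# Take COUNT(*) songs off the stream for each album in turn.
grouped = {name: list(islice(songs_it, n)) for aid, name, n in albums}
print(grouped)
# {'First': [(1, 64), (2, 345)], 'Second': [(3, 210), (4, 270), (5, 330)]}
```

The same slicing logic carries over to PHP; the point is simply that both queries must see a consistent snapshot, hence the transaction caveat above.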

j_random_hacker
A: 

I wouldn't break your normalisation for that. Leave the tables normalised and then use the approach described here to query them - http://stackoverflow.com/questions/43870/how-to-concatenate-strings-of-a-string-field-in-a-postgresql-group-by-query
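To give a feel for what such a hand-rolled concatenation aggregate looks like, here is a sketch using Python's sqlite3 create_aggregate hook as a stand-in for PostgreSQL's CREATE AGGREGATE (the aggregate name, table, and data are all invented for illustration):

```python
import sqlite3

# A minimal user-defined string-concat aggregate. PostgreSQL would define
# the equivalent in SQL with CREATE AGGREGATE; sqlite3's Python hook plays
# the same step/finalize role here.
class Concat:
    def __init__(self):
        self.parts = []

    def step(self, value):          # called once per row in the group
        self.parts.append(str(value))

    def finalize(self):             # called once per group
        return ",".join(self.parts)

conn = sqlite3.connect(":memory:")
conn.create_aggregate("concat_ids", 1, Concat)  # hypothetical name
conn.execute("CREATE TABLE songs (aid INTEGER, sid INTEGER)")
conn.executemany("INSERT INTO songs VALUES (?, ?)", [(3, 5), (3, 6), (3, 7)])

result = conn.execute(
    "SELECT aid, concat_ids(sid) FROM songs GROUP BY aid"
).fetchall()
print(result)  # e.g. [(3, '5,6,7')]
```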

Guy C