views:

137

answers:

5

I have to create 2 tables:

Magazine ( 10 millions of rows with these columns: id, title, genres, printing, price )

Author ( 180 millions of rows with these columns: id, name, magazine_id )

. Every author can write on ONLY ONE magazine and every magazine has more authors.

So if I want to know all authors of Motors Magazine, I have to use this query:

SELECT * FROM Author, Magazine WHERE ( Author.magazine_id = Magazine.id ) AND ( genres = 'Motors' )

The same applies to Printing and Price column.

To avoid these joins with tables of millions of rows, I thought to use this tables:

Magazine ( 10 millions of rows with this column: id, title, genres, printing, price )

Author ( 180 millions of rows with this column: id, name, magazine_id, genres, printing, price )

. and this query:

SELECT * FROM Author WHERE  genres = 'Motors' 

Is it a good approach ?

I want to make it run faster

I can use Postgresql or Mysql.

+1  A: 

You should do this:

SELECT * FROM Author
JOIN Magazine ON Author.id = Magazine.id
WHERE genres = 'Motors'

This should be fast. If it is too slow, make sure you have all the relevant indexes, including primary key indexes on the id fields for all tables and an index on genres.

You should also list the columns you want rather than returning all of them. Note that this query could potentially return millions of rows. Are you sure you want to fetch all of them? I would consider a solution using paging and fetching only the first 50 until the user requests to see the next page.

Mark Byers
+2  A: 
DVK
I am already using indexes... I wanted to know only if there are other tecnique for improve the performances. Memcached can help me in this ?Thanks ^_^
xRobot
@xRobot: There's a fair chance that your index will be useless for certain queries if you `SELECT *`. This varies by server and by query but you'll always get better performance if you select only the columns that are actually covered.
Aaronaught
+3  A: 

No, I don't think duplicating the information as you describe is a good design for a relational database.

If you change the genre or price of a given magazine, you would have to remember to change it in all the author rows where the information is duplicated. And if you forget sometimes, you end up with anomalies in your data. How can you know which one is correct?

This is one of the benefits of relational database normalization, to represent information with minimal redundancy, so you don't get anomalies.

To make it run faster, which is I think what you're trying to do, you should learn how to use indexes, especially covering indexes.

Bill Karwin
I am already using indexes... I wanted to know only if there are other tecnique for improve the performances. Memcached can help me in this ?Thanks ^_^
xRobot
Caching with memcached (or any other caching technology) can help only if your queries tend to be repeated. It can't speed up the first time you query for a given result set, but then you can store the result in cache to make it faster to recall the *same* query result subsequently.
Bill Karwin
Don't denormalize this schema if you have any need at all to maintain it. You'll make everything slower by wasting space, and updates could turn into a nightmare of bad performance.
Ollie Jones
+1  A: 

You don't need to do a JOIN, and even then your basic query is wrong. You meant to say:

SELECT name FROM author
WHERE magazine_id in 
    (SELECT id FROM magazine WHERE genres = 'motors')

There are many different ways to manage huge data stores like this. If you give an example of what you want to get out of this data, people can suggest efficient ways to do it.

egrunin
+1  A: 

If you only need to get the Authors of a Magazine (and no information about the Magazine) you can use EXISTS. Some say EXISTS are faster than JOIN because an EXISTS stops the search after the first hit. Then you should use:

SELECT *
FROM Author
WHERE EXISTS (SELECT 1 FROM Magazine WHERE genres = 'Motor' AND Author.id = Magazine.id)

Also, as mentioned before, specifying the columns would speed things up.

Lars Nyström