views:

180

answers:

9

This is for an upcoming project. I have two tables - first one keeps tracks of photos, and the second one keeps track of the photo's rank

Photos:
+-------+-----------+------------------+ 
| id    | photo     | current_rank     |
+-------+-----------+------------------+ 
| 1     | apple     | 5                |
| 2     | orange    | 9                |
+-------+-----------+------------------+

The photo rank keeps changing on a regular basis and this is the table that tracks it:

Ranks:
+-------+-----------+----------+-------------+ 
| id    | photo_id  | ranks    | timestamp   |
+-------+-----------+----------+-------------+
| 1     | 1         | 8        | *           |
| 2     | 2         | 2        | *           |
| 3     | 1         | 3        | *           |
| 4     | 1         | 7        | *           |
| 5     | 1         | 5        | *           |
| 6     | 2         | 9        | *           |
+-------+-----------+----------+-------------+ * = current timestamp

Every rank is tracked for reporting/analysis purpose. [Edit] Users will have access to the stats on demand.

I talked to someone who has experience in this field and he told me that storing ranks like above is the way to go. But I'm not so sure yet.

The problem here is data redundancy. There are going to be tens of thousands of photos. The photo rank changes on a hourly basis (many time within minutes) for recent photos but less frequently for older photos. At this rate the table will have millions of records within months. And since I do not have experience in working with large databases, this makes me a little nervous.

I thought of this:

Ranks:
+-------+-----------+--------------------+
| id    | photo_id  | ranks              |
+-------+-----------+--------------------+
| 1     | 1         | 8:*,3:*,7:*,5:*    |
| 2     | 2         | 2:*,9:*            |
+-------+-----------+--------------------+ * = current timestamp

That means some extra code in PHP to split the rank/time (and sorting) but that looks OK to me.

Is this a correct way to optimize the table for performance? What would you recommend?

Any suggestions would be great.

+2  A: 

I would stick with your first approach. In the second you will have a lot of data stored in the row, as time goes by it gets more ranks! Ie. if a photo gets thousands and thousands of rankings.

First approach is also more maintaineble, ie. i you wish to delete a rank.

Phliplip
Actually the data in the rank table is never updated because it is used for displaying statistics.
Yeti
+1  A: 

I'd think the database 'hit' of over normalistion (querying the ranks table over and over) is nicely avoided by 'caching' the last rank in current_rank. It does not really matter ranks is growing tremendously if it is seldom queried (analyis / reporting you said), never updated but just gets records inserted at the end: even a very light box would have no problem having millions of rows in that table.

You alternative would require lots of updates on different locations on the disk, possibly resulting in degraded performance.

Of course, if you need all the old data, and always by photo_id, you could plan a scheduled run to another table rankings_old, possibly with photo_id, year,month, rankings (including timestamps) when a month is over, so retrieving old data stays easily possible, but there are no updates needed in rankings_old or rankings, only inserts at the end of the table.

And take it from me: millions of records in a pure logging table should be absolutely no problem.

Wrikken
My bad, didn't mention that the log is available to the owner of the photo owner (they can view rank stats). So, there might as well be a lot of queries.
Yeti
Then it's less clear-cut, but I still think the first solution would be better (update performance is key there, and updates shouldn't lock the selects). It's possible that I would work with with a daily rankings_old rather then a month.
Wrikken
+6  A: 

The first one. Period.

Actually you'll lose much more. a timestamp stored in the int column will occupy only 4 bytes of space.
while the same timestamp stored in the string format will take 10 bytes.

Col. Shrapnel
I forgot to mention the timestamp is not unix timestamp, it's datetime. Is that the same?
Yeti
@Lost for your homework: http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html
Col. Shrapnel
+2  A: 

Your first design is correct for a relational database. The redundancy in the key columns is preferable because it gives you a lot more flexibility in how you validate and query the rankings. You can do sorts, counts, averages, etc. in SQL without having to write any PHP code to split your string six ways from Sunday.

It sounds like you would like to use a non-SQL database like CouchDB or MongoDB. These would allow you to store a semi-structured list of rankings right in the record for the photo, and subsequently query the rankings efficiently. With the caveat that you don't really know that the rankings are in the right format, as you do with SQL.

Bill Karwin
A: 

You stated the rank is only linked to the image, in which case all you need is table 1 and keep updating the rank in real time. Table 2 just stores unnecessary data. The disadvantage of this approach is that user cant change his vote.

Ravi Vyas
+1  A: 

Normalized data or not normalized data. You will find thousands of articles about that. :)

It really depends of your needs.

If you want to build your database only with performance (speed or RAM consumption or...) in mind you should only trust the numbers. To do that you have to profile your queries with the expected data "volume" (You can generate the data with some script you write). To profile your queries, learn how to read the results of the 2 following queries:

  • EXPLAIN extended...
  • SHOW STATUS

Then learn what to do to improve the figures (mysql settings, data structure, hardware, etc).

As a starter, I really advise these 2 great articles:

  1. http://www.xaprb.com/blog/2006/10/12/how-to-profile-a-query-in-mysql/
  2. http://ajohnstone.com/archives/mysql-php-performance-optimization-tips/

If you want to build for the academic beauty of the normalization: just follow the books and the general recommandations. :)

Toto
A: 

You said the second table is for analysing /statistics, so it actually isn't something that needs to be stored in db. My suggestion is to get rid of the second table and use a logging facility to record rank changes.

stereofrog
I forgot to mention that users can access the data from the 2nd table. Have edited the question.
Yeti
+1  A: 

Out of the two options - like everyone before me said - it has to be option 1.

What you should really be concerned about are the bottlenecks in the application itself. Are users going to refer to the historical data often, or does it only show up for a few select users? If the answer is that everyone gets to see historical data of the ranks, then option 1 is good enough. If you are not going to refer to the historical ranks that often, then you could create a third "archive" table, and before updating the ranks, you can copy the rows of the original rank table to the archive table. This ensures that the number of rows stays minimal on the main table that is being called.

Remember, if you're updating the rows, and there's 10s of thousands, it might be more fruitful to get the results in your code (PHP/Python/etc), truncate the table and insert the results in rather than updating it row by row, as that would be a potential bottleneck.

You may want to look up sharding as well (horizontal partitioning) - http://en.wikipedia.org/wiki/Shard_%28database_architecture%29

And never forget to index well.

Hope that helped.

Duniyadnd
A: 

Your second design is very dangerous in case you have 1m votes for a photo. Can PHP handle that?

With the first design you can do all math on the db level which will be returning you a small resultset.

cherouvim