views: 107
answers: 4

I have a site where people can add their favorite TV series. One feature lets users check off the episodes they have seen.

Each episode that is checked off creates one record in a DB table (with user_id, show_id and episode_id).
This table is now over 600,000 rows and is growing very fast!

I have indexes set up, but I feel like the performance when querying this table is getting worse and worse.

My thoughts for a new solution:

So instead of:

user_id | show_id | episode_id
      1 |     123 |       7675
      1 |     123 |       7676
      1 |     123 |       7677
      1 |     456 |       5678
      1 |     456 |       5679
      1 |     456 |       5680

I could do this:

user_id | show_id | episode_ids
      1 |     123 | 7675,7676,7677
      1 |     456 | 5678,5679,5680

Then I would have to split the string into an array and use array.include?(some-id).
This should take some load off the database, but Ruby would have to handle much heavier string/array work.
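
For reference, a rough SQL-side sketch of what a "has user 1 seen episode 7676?" check would look like against each layout (the table names are just made up for illustration; in my proposed version the splitting would actually happen in Ruby, not in SQL):

-- Current normalized table: with an index on (user_id, episode_id)
-- this is a single index lookup.
SELECT 1
FROM seen_episodes
WHERE user_id = 1 AND episode_id = 7676;

-- Proposed comma-separated column: no index on episode ids can be used;
-- FIND_IN_SET has to scan the string for every candidate row.
SELECT 1
FROM seen_shows
WHERE user_id = 1 AND FIND_IN_SET('7676', episode_ids);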

Am I on the right track? Or can anybody think of a better solution?

+13  A: 

No, no, no: that is absolutely NOT the way to structure such a database. Comma-separated lists stuffed into varchar columns are a classic anti-pattern and about the least desirable design you could choose.

It sounds to me like your assessment of the performance problem is based on guesswork. So instead:

  • Determine if there really is a problem.
  • Find the cause of it using appropriate instrumentation.
  • Test possible solutions in a non-production environment.

600k rows is NOTHING for a table with three ints. Really: three ints is about a dozen bytes of data per row, so even with row and index overhead the whole table is on the order of tens of megabytes and fits into RAM on even the tiniest of servers. Querying a table served entirely from RAM should be so fast that you don't need to worry about it.

If you get past step 1 (there really is a problem), ask further questions containing your entire relevant schema, exact queries, explain plans and timing data.
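
As a minimal sketch of the kind of timing data to gather (the table name is a guess based on the question; SET profiling / SHOW PROFILES need MySQL 5.0.37 or later):

-- How big is the table, really?
SHOW TABLE STATUS LIKE 'seen_episodes';

-- Wall-clock timing for the suspect query
SET profiling = 1;
SELECT episode_id FROM seen_episodes WHERE user_id = 1 AND show_id = 123;
SHOW PROFILES;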

MarkR
Thank you :) I guess I'll have to investigate further.
Frexuz
+1  A: 

Here's how I'd structure the tables:

USERS
userid INTEGER PRIMARY KEY 
username text/varchar/whatever

SHOWS
showid INTEGER PK
showname   varchar or nvarchar or text  [depending on what database I was using]
etc etc


EPISODES
episodeid INTEGER PK
showid    INTEGER  FK references SHOWS   [index this field]
ordinal   DECIMAL   [indicates which episode -- DECIMAL makes it easier to insert an overlooked episode later]
episodename text/varchar/nvarchar whatever   
etc etc

SEENIT
id  INTEGER AUTOINCREMENT  PK
userid  INTEGER    foreign key ref USERS
episodeid  INTEGER foreign key ref EPISODES

You could place an alternate unique composite index on (userid, episodeid) or use separate indexes, one on userid, one on episodeid. I'd probably go with the latter.
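
As a sketch, the two options in MySQL syntax (assuming the SEENIT table above; the index names are arbitrary):

-- Option 1: one unique composite index (also enforces "an episode is marked seen only once per user")
ALTER TABLE SEENIT ADD UNIQUE KEY uq_seenit_user_episode (userid, episodeid);

-- Option 2: two separate single-column indexes
ALTER TABLE SEENIT
  ADD KEY ix_seenit_userid (userid),
  ADD KEY ix_seenit_episodeid (episodeid);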

Tim
+1  A: 

Whether or not you denormalize your data is a matter of debate. Denormalization can have its merits in specific circumstances, but from a relational point of view it probably shouldn't be your first choice. The preferred first steps are to analyze the problem and to apply solutions that don't change the structure of the data, but instead deal with the database system and its environment. Therefore:

  • Is the source of your problem really the database? Or is it some other layer (network, web server, Rails, etc.)?
  • What is acceptable in terms of query response times? Find concrete numbers that the database should meet under all circumstances.
  • Which queries are getting slower? Maybe you have slow, inefficient queries that can be refactored. Build a query plan and see what the optimizer is doing (see the logging sketch after this list).
  • Are you using your indexes correctly?
  • Tune your MySQL instance. You can achieve a lot with configuration tuning.
  • See whether you can do something on the hardware side (more memory, faster disks, etc.).
  • Create views for the most heavily used queries, if there are any.
  • If all of the above is done, you can still shard. This adds complexity on top of your application, but it will let you scale the system a good deal further without too much effort.
  • Eventually you may conclude that you need a "truly scalable" distributed key/value store (NoSQL). But at 600,000 rows there is a long way to go before you reach that point.
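
As a sketch of the "which queries are getting slower" step above, MySQL's slow query log can be switched on at runtime (variable names as in the MySQL 5.1 manual; the threshold is just an example):

SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 1;                -- log anything slower than 1 second
SET GLOBAL log_queries_not_using_indexes = 1;  -- also log queries that use no index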

That being said - if you find that your proposed solution really is the best way to improve performance, go ahead and denormalize. The point is that you should be aware of all the options and choose between them with concrete, performance-related goals in mind.

bunting
A: 

I would stick with the normalized data. It sounds more like a query optimization problem. Keep in mind that MySQL (assuming that is what you are using) uses only one index per query, so you might get better performance by setting up a composite index. Also make use of the EXPLAIN statement, for example in MySQL Query Browser. More info here: http://dev.mysql.com/doc/refman/5.1/en/explain.html
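
For example (table, column and index names are illustrative only):

-- A composite index matching the common "episodes of show X seen by user Y" lookup
ALTER TABLE seen_episodes ADD KEY ix_user_show (user_id, show_id);

-- Then check that the optimizer actually uses it
EXPLAIN SELECT episode_id FROM seen_episodes WHERE user_id = 1 AND show_id = 123;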

Volker Pacher
As of MySQL 5.0, multiple indexes can be used per query, with the final result found using an index merge. See: http://dev.mysql.com/doc/refman/5.1/en/index-merge-optimization.html
Martin
my bad, totally forgot about that. Does it actually perform as well as a declared composite index?
Volker Pacher