views:

110

answers:

7

Hello Im in the midst of creating a social networking site and I would like to know how I would go about creating the relationships between users. Many sites say that I should create a relationship/friend table, but Im looking into the future and believe that this would be ineffective. This idea could be as popular as facebook and I want to be ready for that many users. Facebook has 400 million users so a friends table would be at least 150 times that. Doing a query for ones friends would be very slow I would think. So would the solution be a seperate table for each user containing their friends ID's. or an associated CSV file containing the ID's. Any help would be greatly appreciated to the design of my site. Thanks

+4  A: 

Both of the alternatives you've suggested would no doubt result in grief - imagine 400 million tables, or managing 400 million files.

Definitely best to maintain a properly indexed relationships table.

Will A
+4  A: 

While you think of eventually supporting millions of users, you're only ever seeing a particular persons friends list - that limits the actual amount of data substantially...

In order to maintain normalized friendship relationships in the database, you'd need two tables:

USERS

  • user_id (primary key)
  • username

FRIENDS

  • user_id (primary key, foreign key to USERS(user_id))
  • friend_id (primary key, foreign key to USERS(user_id))

This will stop duplicates (IE: 1, 2) from happening, but won't stop reversals because (2, 1) is valid. You'd need a trigger to enforce that there's only one instance of the relationship...

OMG Ponies
+1 The point about the double-check trigger is often overlooked
Xeoncross
You'll probabaly need to shard the tables too (if you're preparing for 500 million entries) you'd have an algorithm to figure out what shard to select the data from. For example, having 500 million rows of data and splitting it into 5 million row partitions will dramatically increase the speed of selects/inserts.
Kieran Allen
How would I use a trigger for controlling this, Ive never used them before?
Dom
@Dom: That deserves its own question... You'd want a BEFORE INSERT trigger, because you want to validate the data before it's inserted. If the relationship already exists, it should return an error so your application can inform the user that the relationship already exists.
OMG Ponies
Another approach to the problem of avoiding duplicates would simply be to always sort your inputs application-side, so that user_id < friend_id -- e.g., (2, 1) would always be inserted as (1, 2). A `CHECK` constraint could help you enforce that rule, although apparently that's only an option on non-MySQL databases.
Frank Farmer
+11  A: 

Build the schema you need today, not the one you think you'll need 5 years from now.

Do you think facebook designed their schema to support 400 million users on day one? Of course not. Building for that kind of scale is complicated, expensive, and honestly, if you try it now, you'll probably get it wrong and have to redo it later anyway.

And let's be honest: you have a better chance of winning the lottery than hitting 400 million users any time soon. Even if you do, your project will have hundreds of engineers by then -- plenty of bandwidth for redesigning your schema.

Now's the time to build simple.

Edit to add some solid examples:

Youtube:

They went through a common evolution: single server, went to a single master with multiple read slaves, then partitioned the database, and then settled on a sharding approach.

Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It's true that nobody really knows what simplicity is, but if you aren't afraid to make changes then that's a good sign simplicity is happening.

Livejournal also grew from a single database on a single server to multiple sharded replicated databases

I'm sure you could find a dozen more examples on the highscalability blog

Frank Farmer
+1!!!! Even if you wanted to prepare the best you could you couldn't Facebook has two entire dedicated servers and over 200 memcached instances...
Kieran Allen
@Frank well said. more developers should understand this.
Alec Smart
I think you can go a step further and say that you WILL need to redo the schema later on. Data WILL need to be sharded and you'll need to introduce database slaves etc (and develop the schema around those points) It's not simple and will require qualified DBA's to come up with a solution which works for the specific site and load.
Kieran Allen
@Kieran Allen: Table partitioning implementations vary, but IIRC most don't require major data model updates. I don't care for the idea of planning for a data model overhaul - that should be a *very* last ditch alternative.
OMG Ponies
Hey guys thanks alot for the responses, they were very helpful. I think I will do what most of you are saying which is to build for the amount of users that I will have now and redesign later when I reach that level. For each relationship someone mentioned duplicates. How do I go about controlling this, I never used triggers before ?
Dom
@OMG Ponies Good point. However, when partitioning you'll need to take into account the shortcomings (counting all friends, self-joins and larger queries - this all requires quite some planning) The point i'm trying to make is IF you get that big you'll have dedicated DBA's anyway.
Kieran Allen
@OMG Ponies -- added a couple of examples of big names who started simple and rewrote as they grew.
Frank Farmer
+2  A: 

If you expect the levels of success attained by Facebook (I like your confidence), you will soon realize what they realized. Relational databases begin to fall short and you'll want to look into NoSQL solutions.

That being said, why pre-optimize for 400 millions users? Build a system that will work now for, say, 500, 000 users. If you need to redesign after that, then you must be very successful and will have the resources to do so.

webbiedave
A: 

You could accomplish this using a table to represent the "Relationship" that one user has with another user. This is essentially a JOIN table between two different rows in the same table. An example join table might include the following columns:

  • USER_1_ID
  • USER_2_ID

To get a list of friends write a query that performs an INNER JOIN from the USER in question to the RELATIONSHIP table back to a second instance on the USER table.

CitizenForAnAwesomeTomorrow
+1  A: 

In your code when inserting relationships into table follow a convention.

issueSQLQuery("INSERT INTO relationships (friend1, friend2) VALUES (?, ?)", min(friend_1_ID, friend_2_ID), max(friend_1_ID, friend_2_ID))

And similarly for retrievals. Of course this could be done in a stored procedure.

Novikov
+1 for enforcing order on the relationships table. That's probably the simplest approach to the issue of duplication/reversal.
Frank Farmer
A: 

something like this should do you initially: http://pastie.org/1127206

drop table if exists user_friends;
drop table if exists users;

create table users
(
user_id int unsigned not null auto_increment primary key,
username varchar(32) unique not null,
created_date datetime not null
)
engine=innodb;

delimiter #

create trigger users_before_ins_trig before insert on users
for each row
begin
 set new.created_date = now();
end#

delimiter ;

create table user_friends
(
user_id int unsigned not null,
friend_user_id int unsigned not null,
created_date datetime not null,
primary key (user_id, friend_user_id), -- note clustered composite PK
foreign key (user_id) references users(user_id),
foreign key (friend_user_id) references users(user_id)
)
engine=innodb;

delimiter #

create trigger user_friends_before_ins_trig before insert on user_friends
for each row
begin
 set new.created_date = now();
end#

delimiter ;


drop procedure if exists insert_user;

delimiter #

create procedure insert_user
(
in p_username varchar(32)
)
proc_main:begin

  insert into users (username) values (p_username);

end proc_main #

delimiter ;

drop procedure if exists insert_user_friend;

delimiter #

create procedure insert_user_friend
(
in p_user_id int unsigned,
in p_friend_user_id int unsigned
)
proc_main:begin

  if p_user_id = p_friend_user_id then
    leave proc_main;
  end if;

  insert into user_friends (user_id, friend_user_id) values (p_user_id, p_friend_user_id);

end proc_main #

delimiter ;

drop procedure if exists list_user_friends;

delimiter #

create procedure list_user_friends
(
in p_user_id int unsigned
)
proc_main:begin

  select
    u.*
  from
    user_friends uf
  inner join users u on uf.friend_user_id = u.user_id
  where
    uf.user_id = p_user_id
  order by
   u.username;

end proc_main #

delimiter ;

call insert_user('f00');
call insert_user('bar');
call insert_user('bish');
call insert_user('bash');
call insert_user('bosh');

select * from users;

call insert_user_friend(1,2);
call insert_user_friend(1,3);
call insert_user_friend(1,4);
call insert_user_friend(1,1); -- oops

call insert_user_friend(2,1);
call insert_user_friend(2,5);

select * from user_friends;

call list_user_friends(1);
call list_user_friends(2);

-- call these stored procs from your php !!
f00