I am in the middle of building a new app which will have very similar features to Facebook. Although it obviously won't ever have to deal with anything like 400,000,000 users, it will still be used by a substantial user base, and most of them will demand that it run very quickly.

I have extensive experience with MySQL, but a social app offers complexities which MySQL is not well suited to. I know Facebook, Twitter, etc. have moved towards Cassandra for a lot of their data, but I am not sure how far to go with it.

For example, would you store such things as user data (usernames, passwords, addresses, etc.) in Cassandra? Would you store e-mails, comments, status updates, etc. in Cassandra? I have also read a lot that something like neo4j is much better for representing the friend relationships used by social apps, as it is a graph database. I am only just starting down the NoSQL route, so any guidance is greatly appreciated.

Would anyone be able to advise me on this? I hope I am not being too general!

A: 

Facebook didn't move to Cassandra, they created it. :) To my knowledge, noSQL DBMSes don't require or even mention running side by side with a relational database (thanks to mnemosyn for the correction: Facebook uses both Oracle and Cassandra). This is one opposite example (storing user information in a noSQL DB).

I would say that if Cassandra is good enough for Facebook, it's likely to be good enough for your project. It might not hurt to try to abstract the persistence logic so that you have the possibility to switch to something else, if it absolutely comes to that.
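A minimal sketch of what abstracting the persistence logic might look like, assuming a simple user-profile use case. The names `UserStore`, `InMemoryUserStore` and `rename_user` are hypothetical and only illustrate the idea; a MySQL- or Cassandra-backed class would implement the same interface. Note that an interface like this hides which database is used, not what consistency guarantees it offers.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class UserStore(ABC):
    """Hypothetical persistence interface; application code depends only on this."""

    @abstractmethod
    def save(self, user_id: str, profile: Dict[str, str]) -> None:
        ...

    @abstractmethod
    def load(self, user_id: str) -> Optional[Dict[str, str]]:
        ...


class InMemoryUserStore(UserStore):
    """Stand-in backend used here so the sketch runs without any database."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def save(self, user_id: str, profile: Dict[str, str]) -> None:
        # Store a copy so later changes by the caller do not leak into the store.
        self._rows[user_id] = dict(profile)

    def load(self, user_id: str) -> Optional[Dict[str, str]]:
        row = self._rows.get(user_id)
        return dict(row) if row is not None else None


def rename_user(store: UserStore, user_id: str, new_name: str) -> bool:
    """Application-level logic that never mentions a concrete database."""
    profile = store.load(user_id)
    if profile is None:
        return False
    profile["username"] = new_name
    store.save(user_id, profile)
    return True
```

Swapping the backend then means writing another `UserStore` subclass and passing it to the same application functions.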

Disclaimer: I have not (yet?) had any hands on experience with noSQL databases: what I know comes from reading about it.

Tomislav Nakic-Alfirevic
It seems you're mixing up concepts here: NoSQL is a very abstract term and covers both ACID databases that offer basically the same guarantees as a typical RDBMS (e.g. db4o) and databases that scale but don't offer the same set of guarantees when it comes to data consistency (e.g. Cassandra). These properties should be the guide for decisions. Abstracting this kind of logic is impossible, I believe: there's a significant difference between data you can trust and data you can't trust. Transactions might not make sense, etc.
mnemosyn
Abstracting what kind of logic? ACID transactions? The DB either supports or does not support them: what I was talking about is basically providing e.g. a thin DAO layer above the database so that the part of the application above the DAO layer can remain more or less intact if the DAO implementation changes (due to a move to a different DB). As for choosing which database, Christopher described the project as having "very similar features to Facebook" so it would be quite peculiar if it turned out that it would be better for Christopher to use a database different than the one Facebook uses.
Tomislav Nakic-Alfirevic
Facebook doesn't use one database. They use (at least) Oracle, Cassandra and Hadoop in parallel. Cassandra was developed for searching your inbox on Facebook, not for storing payment details. You cannot put the same abstraction on different things, i.e. use one DAO for a data store that is consistent and for one that is only eventually consistent.
mnemosyn
You're right, they do use Oracle. I will update my answer accordingly, thanks for the correction.
Tomislav Nakic-Alfirevic
They use MySQL as their primary data store. They write about it here: http://www.facebook.com/MySQLatFacebook
Morgan Tocker
+2  A: 

I would suggest doing some testing with MySQL and with Cassandra. When we had to make a choice between PostgreSQL and MongoDB in one of my jobs, we compared query time on millions of records in both and found out that with about 10M records Postgres would provide us with adequate response times.

We knew that we wouldn't get to that number of records for at least a couple of years, and we had experience with Postgres (while MongoDB wasn't very mature at the time), so we went with Postgres.

My point is that you can probably look at MySQL benchmarks, do some performance tests yourself, estimate the size of your dataset and how it's going to grow, and make an informed decision that way.
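As a rough sketch of the kind of performance test I mean, the script below times indexed lookups at a few table sizes. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in so that it runs anywhere; the `users` table, the `time_lookup` helper and the row counts are made up for illustration. For a real comparison you would point the same loop at MySQL and Cassandra with your own schema and queries.

```python
import sqlite3
import time


def time_lookup(row_count: int, lookups: int = 1000) -> float:
    """Return the average seconds per primary-key lookup on row_count rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.executemany(
        "INSERT INTO users (id, username) VALUES (?, ?)",
        ((i, f"user{i}") for i in range(row_count)),
    )
    conn.commit()

    start = time.perf_counter()
    for i in range(lookups):
        # Spread the keys around the table instead of reading it in order.
        key = (i * 7919) % row_count
        conn.execute("SELECT username FROM users WHERE id = ?", (key,)).fetchone()
    elapsed = time.perf_counter() - start

    conn.close()
    return elapsed / lookups


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} rows: {time_lookup(n) * 1e6:.1f} microseconds per lookup")
```

The absolute numbers from a stand-in like this mean little; what matters is running the same measurement against each candidate database at the dataset sizes you expect to reach.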

As for mixing relational and non-relational databases, it's something we considered as well, but decided that it would be too much of a hassle, as that would mean maintaining two kinds of software, and writing quite a bit of glue code to get the data from both. I think Cassandra would be perfectly capable of storing all your data.

Alex - Aotea Studios
+2  A: 

For example would you store such things as user data - username, passwords, addresses etc in Cassandra?

No, since it does not guarantee consistency. Cassandra is eventually consistent. Surely there shouldn't be concurrency on a certain user account's data, but I wouldn't want to bet on it. You might not need consistency on your fulltext search, your message inbox, etc. but you want consistency in anything that is security-related.

I have also read a lot that something like neo4j is much better for representing the friend relationships used by social apps as it is a graph database.

I'm a big fan of the right tool for the right job. I haven't used neo4j, but I've been using db4o (which is an object database) and find it very helpful. Using a tool that natively supports your needs makes development easier. Since you need graphs, and working with graphs in SQL is a pain, I'd recommend giving it a look and evaluating whether it fits your specific needs.
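To illustrate why a graph-shaped model is a natural fit for friend relationships, here is a small sketch of a "friends of friends" lookup over an adjacency map. This is plain Python, not neo4j's API, and `build_graph` and `friends_of_friends` are hypothetical names; in SQL the same question typically needs a self-join on a friendship table for every hop.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_graph(friendships: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from (user, user) friendship pairs."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b in friendships:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def friends_of_friends(graph: Dict[str, Set[str]], user: str) -> Set[str]:
    """Return users exactly two hops away: not the user, not a direct friend."""
    direct = graph.get(user, set())
    two_hops: Set[str] = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {user}
```

Each additional hop is just another pass over the neighbour sets, which is the kind of traversal a graph database is built to do natively.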

Mixing databases sounds like a good idea to me as long as the choice is natural (i.e. each database is helpful for its specific job: a graph database for graphs, a table for tables, ACID databases for anything that needs transaction safety, etc.).

mnemosyn
I don't see why you wouldn't store all data in Cassandra besides the fact that it's easier to query them in an RDBMS. Cassandra guarantees consistency if you want it (quorum reads/writes), see http://spyced.blogspot.com/2010/04/cassandra-fact-vs-fiction.html. If you are wondering about reliability see http://thread.gmane.org/gmane.comp.db.cassandra.user/3454
Mihai A
Thanks for the interesting links. I'm not entirely sure about this, but from what I understood you can guarantee consistency across nodes, but 'transactions', i.e. writes on the batch level, are not atomic, are they? Whether that really poses a problem is a separate question. I think that kind of data is just what RDBMSs were made for, but you've got a point there when it comes to availability / partition tolerance, so it might be better to use Cassandra for user data in certain scenarios, too.
mnemosyn
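On the quorum reads/writes mentioned in the comments above: the usual rule of thumb for replicated stores of this kind is that a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The sketch below only illustrates that arithmetic; the function names are hypothetical, it does not call Cassandra, and it says nothing about atomicity of batched writes.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int, write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    print(f"replication factor {n}: quorum is {q}")
    print("quorum write + quorum read overlap:", read_sees_latest_write(n, q, q))
    print("single-replica write + read overlap:", read_sees_latest_write(n, 1, 1))
```

With three replicas, writing and reading at quorum touches two replicas each, so the two sets always share one; writing and reading a single replica gives no such guarantee.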
A: 

Cassandra provides a nice distributed solution, and is probably better for a Facebook-like platform than MySQL (if it will need to scale). But Cassandra is not well suited to data with many-to-many relationships. A graph database tied to Cassandra would cover both the bulk volume needs and a very fast relationship query capability. We are working on something that combines the two technologies, and are always interested in the types of requirements your platform would present. If you have any questions on how to handle certain data-related issues I'd love to hear them; maybe we can help figure it out.

Warren