I am building a Reddit clone in Erlang. I am considering using one of the Erlang web frameworks, but that is not the problem.

I am having a problem selecting a database.

How it works:

I have multiple dedicated reddits, for example science, funny, corporate, and sport. You could consider them subreddits. Each subreddit has categories.

A user can post the following info:

Title, Category Tags, Description, Category, Future Date,

plus an attached picture, link, or video.

As with Reddit, users will be able to vote on the stories and comment. Comments will also have a vote system.

The problem:

I don't know which NoSQL database to use. The site will have scalability problems with MySQL (trust me, it will, so don't suggest SQL). There will be around 10,000-20,000 concurrent connections, if not more.

What I need:

1) A user will go to the sporting subreddit.

They will want to see all stories with a Future Date. For example, in the NFL category or the Soccer World Cup category, they might want to see all stories with future dates, which indicate upcoming games or events.

But since people might post crap, I need to sort by Future Date, then filter the results down to posts with more than 5 votes, and then show the closest upcoming event.

So if there is a game on the weekend and the next game is 3 weeks away, the closest game needs to come up first.

2) So the problem above, using one database:

1) Find all posts in subreddit: Sport. 2) Find all posts in the NFL category. 3) Find all posts with a future date. Sort these posts by most votes and display the stories with the closest date to today.

I think CouchDB looks like a good candidate, but I am not sure.

But what about Cassandra, HBase, Riak, or Neo4j?

I am going crazy trying to figure this out.

I need something that will scale and handle a large amount of users.

Please help, thanks

+1  A: 

Cassandra should work well for you; the "users vote on stuff which gets shown in different ways" sounds pretty similar to what Digg is doing (and they are moving completely to Cassandra).

The name of the game in Cassandra is denormalization. So for each category or subreddit you will have a row containing the posts. If you are querying relatively small numbers of stories at a time you can probably get away w/o denormalizing the post information (including vote count) itself and just multiget that. For larger batches you should duplicate that into each post column so you don't have to do those extra gets.
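To make the denormalization idea concrete, here is a minimal sketch in plain Python rather than against a real Cassandra client. The `rows` dictionary, the `add_post` helper, and the row-key naming are all hypothetical stand-ins for a column family; the point is only that each post summary (including its vote count) is copied into every row it should be listed under.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a column family:
# row key -> {column name (post id) -> column value (post summary)}.
rows = defaultdict(dict)

def add_post(post_id, subreddit, category, title, votes):
    """Write the same post summary into every row that should list it."""
    summary = {"title": title, "votes": votes}
    # Denormalize: one copy per listing, so a read touches a single row.
    rows["subreddit:" + subreddit][post_id] = dict(summary)
    rows["category:" + subreddit + "/" + category][post_id] = dict(summary)

add_post("p1", "sport", "nfl", "Season opener", 12)
add_post("p2", "sport", "soccer", "World Cup final", 40)

nfl_posts = rows["category:sport/nfl"]    # only p1
sport_posts = rows["subreddit:sport"]     # p1 and p2
```

The trade-off is on the write side: a vote on a post has to update every copy of its summary, which is the price paid for reads that need no extra gets.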

If you use something like TimeUUID to order your columns temporally, then "give me everything in category X that is after today's date" is trivial, and then you sort by vote count client side. (Why not sort server side? Because that doesn't scale.)
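A rough sketch of that read path, again in plain Python and not real Cassandra code: the row is modeled as a list of columns already sorted by event date (an assumption standing in for TimeUUID ordering), the "slice" is a binary search to today's date, and the vote filter and vote sort run in the application code that called the database. The `nfl_row` data, the `upcoming` helper, and its `min_votes` and `limit` parameters are made up for illustration.

```python
import bisect
from datetime import date

# Columns of one hypothetical category row, kept sorted by event date.
# Each column is (event_date, post_id, votes).
nfl_row = [
    (date(2030, 1, 5), "p1", 12),
    (date(2030, 1, 12), "p2", 3),
    (date(2030, 1, 26), "p3", 8),
    (date(2030, 2, 9), "p4", 40),
]

def upcoming(row, today, min_votes=5, limit=25):
    """Slice the row from today's date onward, then filter by votes."""
    # "Server side" part: jump to the first column dated today or later.
    start = bisect.bisect_left(row, (today,))
    page = row[start:start + limit]
    # "Client side" part: drop low-vote posts; date order is preserved,
    # so the closest upcoming event stays first.
    return [col for col in page if col[2] > min_votes]

closest_first = upcoming(nfl_row, date(2030, 1, 10))
# Optional re-sort by vote count, also done in application code.
by_votes = sorted(closest_first, key=lambda col: col[2], reverse=True)
```

Because the slice is bounded by `limit`, the client only ever sorts or filters a small page of columns, which is why doing that work outside the database stays cheap.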

jbellis
@jbellis - btw, sorting client side implies doing it in JS or somesuch?
viksit