views:

1185

answers:

6

Is MongoDB appropriate for sites like Stack Overflow?

+1  A: 

I would say no, it's not a great fit. The more complicated your objects get, the more an object/document database makes sense, but if you look at SO, most of it isn't complicated object relationships.

There's a questions table, with however many properties, then a collection of answers...but all these need to be accessed independently depending on which view you're coming from, e.g. your activity screen or the question/answer screens. Since you're accessing it from so many angles and each piece is comparatively simple, a relational model works better.

There are queries running in the background for badges and such, and you need to quickly check if you're hitting reputation caps for votes...a lot of relational queries that are simpler in an RDBMS given the complexity of the object model.

This is of course my opinion; maybe SO's structure is way more complicated than it appears to be.

Nick Craver
or simpler? questions and answers are in the same table.
lubos hasko
I don't think object relationships have to be a lot more complicated for ODBs to make sense. I find that the most important benefit tends to be that the schema is almost non-existent in the DB and can thus be modified more easily, which is a major plus for web development.
Alan
A: 

For me, MongoDB is really great for any website that doesn't need transactions.

shingara
+1  A: 

With an RDBMS for the OLTP side of your application and proper caching, it should work gracefully.


Actually - there's an open source stackoverflow clone that uses RoR & MongoDB. :)

Arnis L.
+18  A: 

Put simply: Yes, it could be.

Let's break down the various pages/features and see how they could be stored/reproduced in MongoDB.

All the information on this page could be stored in a single document in the questions collection. This could include "sub-documents" for each answer to keep the retrieval of this page fast.

Edit: as @beagleguy pointed out, you could hit the document size limit of 4MB quite quickly this way, so it would be better to store answers in separate documents and link them to the question by storing the ObjectIDs in an array.
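
To make the "array of ObjectIDs" idea concrete, here is a minimal sketch in Python. Plain dicts stand in for the two collections, and every ID and field name (`answer_ids`, `votes`, and so on) is made up for illustration; it is not how SO or any particular Mongo driver does it.

```python
# Plain dicts stand in for MongoDB collections; IDs and field names are hypothetical.
questions = {
    "q1": {
        "_id": "q1",
        "title": "Is MongoDB appropriate for sites like Stack Overflow?",
        "answer_ids": ["a1", "a2"],  # references instead of embedded answers
    }
}
answers = {
    "a1": {"_id": "a1", "question_id": "q1", "body": "I would say no.", "votes": 1},
    "a2": {"_id": "a2", "question_id": "q1", "body": "Yes, it could be.", "votes": 18},
}


def load_question_page(question_id):
    """Fetch the question, then its answers in a second lookup, highest-voted first."""
    question = questions[question_id]
    page_answers = [answers[answer_id] for answer_id in question["answer_ids"]]
    page_answers.sort(key=lambda answer: answer["votes"], reverse=True)
    return {"question": question, "answers": page_answers}


page = load_question_page("q1")
```

The trade-off is one extra lookup per page in exchange for question documents that stay small no matter how long the thread gets.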

The votes could be stored in a separate collection, with simple links to the question and to the user who voted. A db.eval() call could be executed to increment/decrement the vote count directly in the document when a vote is added (though it blocks, so it wouldn't be very performant), or a MapReduce call could be made regularly to offset that work. It could work the same way for favourites.

Things like the "viewed" numbers, logging users' access times, etc. would generally be handled using a modifier operation to increment a counter. Since v1.3 there is a new "Find and Modify" command which can issue an update command when retrieving the document, saving you an extra call.
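
A rough sketch of what "update while retrieving" buys you, again with a dict standing in for the collection. The helper name `find_and_increment` is hypothetical, and the lock only imitates the atomicity the server would provide. This version hands back the updated document; as far as I know the real command can return either the old or the new version depending on an option.

```python
import threading

# A dict stands in for the questions collection; the lock imitates server-side atomicity.
_lock = threading.Lock()
questions = {"q1": {"_id": "q1", "title": "Is MongoDB appropriate?", "views": 1184}}


def find_and_increment(question_id, field="views"):
    """Bump a counter and return the updated document in a single step."""
    with _lock:
        document = questions[question_id]
        document[field] = document.get(field, 0) + 1
        return dict(document)  # return a copy, as a driver would


page_document = find_and_increment("q1")
```

Without it, you would read the question and then issue a separate update for the view counter.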

Any sort of statistical data (such as reputation, badges, unique tags) could be collected using MapReduce and pushed to specific collections. Things like notifications could be pushed to another collection acting as a job queue, with a number of workers listening for new items in the queue (think badge notifications, new answers since user's last access time, etc).
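
As an illustration of the MapReduce idea, here is a small pure-Python sketch that rolls votes up into per-user reputation. The vote records and the point values (+10 for an upvote, -2 for a downvote) are assumptions for the example, and `map_reduce` is a toy stand-in for what the database would run for you.

```python
from collections import defaultdict

# Hypothetical vote records and assumed point values, for illustration only.
votes = [
    {"author": "digitala", "direction": "up"},
    {"author": "digitala", "direction": "up"},
    {"author": "TTT", "direction": "up"},
    {"author": "TTT", "direction": "down"},
]
POINTS = {"up": 10, "down": -2}


def map_vote(vote):
    """Map step: emit (author, points) for one vote."""
    yield vote["author"], POINTS[vote["direction"]]


def reduce_points(author, values):
    """Reduce step: sum all points emitted for one author."""
    return sum(values)


def map_reduce(records, mapper, reducer):
    """Toy map/reduce: group mapped values by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


reputation = map_reduce(votes, map_vote, reduce_points)
```

The result would then be written to its own collection, so the user pages never have to scan the votes.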

The Questions page and its filters could all be handled with capped collections rather than querying for that data immediately.

Ultimately, YMMV. As with all tools, there are advantages and costs. There are some SO features which would take a lot of work in an RDBMS but could be handled quite simply in Mongo, and vice-versa.

I think the main advantage of Mongo over RDBMSs is the schema-less approach and replication. Changing the schema regularly in a "live" RDBMS-based app can be painful, even impossible if it's heavily used with large amounts of data - those types of ops can lock the tables for far too long. In Mongo, adding new fields is trivial since you may not need to add them to every document. If you do, it's a relatively quick operation to run a map/reduce to update documents.

As for replication, Mongo has the advantage that the DB doesn't need to be paused to take a snapshot for slaves. Many RDBMSs can't set up replication without this approach, which on large DBs can take the master down for a long time (I'm looking at you, MySQL!). This can be a blessing for StackOverflow-type sites, where you need to scale over time - no taking the master down every time you need to add a node.

digitala
this is a good one. thanks.
mdm414
wouldn't you hit the 4MB limit on fairly big threads if you embedded the answers within it?
beagleguy
@beagleguy: probably, yes. Again, it all depends on what you're storing. It's probably better to store ObjectIDs of Answer docs in an array in the Question doc.
digitala
4MB and big threads? Just do some numbers, you could probably store multiple versions of, say, the Bible in 4MB. Do you really think _any_ thread will have this much content?
halfdan
@halfdan, your comment intrigued me, so I did a quick search and found a plain text version of the King James Bible, which happened to come out to 4.20 MB. Now, I don't doubt that you could probably store SO questions, answers, and comments together in a single document, but I think storing them separately is a much safer approach.
Ari Patrick
@Ari Patrick: Alright ;) Nice search. It's definitely a much safer approach, but it's still a lot of space for text.
halfdan
A: 

You can also use $inc for vote tracking (pass a negative value to decrement; there is no $dec operator), so there's no need to use db.eval.
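
A quick sketch of the increment semantics in Python, with a dict standing in for the answer document. `apply_inc` is a hypothetical helper that only mimics the modifier's behaviour; a downvote is simply an increment by a negative amount.

```python
def apply_inc(document, increments):
    """Mimic an $inc update: add each amount to its field, creating the field if absent."""
    for field, amount in increments.items():
        document[field] = document.get(field, 0) + amount
    return document


answer = {"_id": "a2", "votes": 18}
apply_inc(answer, {"votes": 1})   # upvote
apply_inc(answer, {"votes": -1})  # downvote: increment by a negative amount
apply_inc(answer, {"votes": 1})   # another upvote
```

Because the server applies the change to the stored value, two people voting at the same time don't overwrite each other's count.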

+3  A: 

I think it is.

You can store the question itself, the answers and the comments on the question + answers as one mongo-document. The max doc size is 4 MB, so no document on stackoverflow will be too big for mongo. I've downloaded the content of stackoverflow (data dump) with bittorrent and I've been able to import this content into mongo.

Importing this data into mongo is not trivial because the dump of stackoverflow consists of multiple xml files and each xml file matches with one relational table, so you have to recombine this data into document format.
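
Roughly, the recombining step looks like the sketch below. The inline XML is a made-up sample, and the attribute names (`Id`, `PostTypeId`, `ParentId`, `Body`, with `PostTypeId="1"` meaning a question) are my assumption about the dump layout, so check them against the real files before relying on this.

```python
import xml.etree.ElementTree as ET

# Made-up sample rows; attribute names are assumed, not taken from the real dump.
POSTS_XML = """<posts>
  <row Id="1" PostTypeId="1" Title="Is MongoDB appropriate?" Body="question text" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="I would say no." />
  <row Id="3" PostTypeId="2" ParentId="1" Body="Yes, it could be." />
</posts>"""


def build_documents(posts_xml):
    """Fold flat post rows into one document per question with embedded answers."""
    documents = {}
    answer_rows = []
    for row in ET.fromstring(posts_xml).iter("row"):
        if row.get("PostTypeId") == "1":  # assumed to mark a question
            documents[row.get("Id")] = {
                "_id": row.get("Id"),
                "title": row.get("Title"),
                "body": row.get("Body"),
                "answers": [],
            }
        else:
            answer_rows.append(row)
    # Attach answers in a second pass so row order in the file doesn't matter.
    for row in answer_rows:
        parent = documents.get(row.get("ParentId"))
        if parent is not None:
            parent["answers"].append({"_id": row.get("Id"), "body": row.get("Body")})
    return documents


documents = build_documents(POSTS_XML)
```

Comments and users would be folded in the same way from their own files.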

I've also added the display name + reputation of the OP + answerers + commenters to this document. This does mean that if a user changes his/her display name you have to update all the documents with his/her userid. There is a price to pay if you denormalize your data. Same if the reputation of a user changes.
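
To show what that price looks like, here is a small sketch of the rename fan-out. The document shape (an `author` sub-document with `user_id` and `display_name` on the question, its answers and its comments) and the helper `rename_user` are hypothetical.

```python
def rename_user(documents, user_id, new_name):
    """Rewrite the denormalized display name everywhere; return how many documents changed."""
    touched = 0
    for document in documents:
        changed = False
        posts = [document] + document.get("answers", []) + document.get("comments", [])
        for post in posts:
            author = post.get("author")
            if author and author["user_id"] == user_id and author["display_name"] != new_name:
                author["display_name"] = new_name
                changed = True
        touched += changed
    return touched


# Hypothetical question documents with embedded author info.
question_documents = [
    {"_id": "q1", "author": {"user_id": 1, "display_name": "mdm414"},
     "answers": [{"author": {"user_id": 7, "display_name": "TTT"}}]},
    {"_id": "q2", "author": {"user_id": 7, "display_name": "TTT"}, "answers": []},
    {"_id": "q3", "author": {"user_id": 2, "display_name": "halfdan"}, "answers": []},
]
updated_count = rename_user(question_documents, 7, "TTT2")
```

One rename touches every question the user ever posted in, which is fine as long as renames are rare compared to page views.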

The idea is that all the data that you see on a page like this is contained in one mongo-document. You have all the necessary information with one lookup and no joins.

Here you can download the data dump of stackoverflow: http://blog.stackoverflow.com/category/cc-wiki-dump/

TTT
+1 for info on the dump and how you "ported" it to mongo
Ari Patrick