On the High Scalability blog, Todd Hoff describes the wiki-style database architecture that SO initially adopted, the crunches that followed, and the painful refactoring needed to get back on track.

To quote:

Stack Overflow copied a key part of the Wikipedia database design. This turned out to be a mistake which will need massive and painful database refactoring to fix. The refactorings will be to avoid excessive joins in a lot of key queries. This is the key lesson from giant multi-terabyte table schemas (like Google’s BigTable) which are completely join-free. This is significant because Stack Overflow's database is almost completely in RAM and the joins still exact too high a cost.

Edit:

Since this question, addressed directly to the Stack Overflow team, did not attract any answers, I've redirected it at the MediaWiki database architecture.

Is it fine to copy the MediaWiki database architecture when designing your own wiki? Is there a better denormalized architecture out in the wild that I can take a look at, one that keeps the existing features intact but cuts down on the joins?

In general, I am looking to learn about wiki architectures and how they are maintained, case by case.

P.S. Not sure if this belongs on Meta.

+2  A: 

Once you start scaling you run into several issues, and how far you normalize is ultimately just a design decision.

Commercial database servers are getting better at helping users scale relational databases. On the other hand, if you take a denormalized approach you will certainly have some data duplication, so you move the design burden from constructing queries to locking writes and synchronizing data between different structures, which raises the "cost" of writes sharply as you scale.
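
To make that trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema is only loosely inspired by a page/revision/text layout; it is not the actual MediaWiki schema, and every table, column, and function name below is a hypothetical chosen for illustration. The normalized read needs two joins, while the denormalized read is a single-table lookup that is paid for by an extra write inside the same transaction.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical wiki-like schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE text (
            text_id INTEGER PRIMARY KEY,
            body    TEXT NOT NULL
        );
        CREATE TABLE page (
            page_id       INTEGER PRIMARY KEY,
            title         TEXT UNIQUE NOT NULL,
            latest_rev_id INTEGER,
            latest_body   TEXT  -- denormalized copy of the current text
        );
        CREATE TABLE revision (
            rev_id  INTEGER PRIMARY KEY,
            page_id INTEGER NOT NULL REFERENCES page(page_id),
            text_id INTEGER NOT NULL REFERENCES text(text_id)
        );
    """)
    return conn


def save_revision(conn, title, body):
    """Store a new revision and keep the denormalized copy in sync."""
    with conn:  # one transaction: all writes commit together or roll back
        text_id = conn.execute(
            "INSERT INTO text (body) VALUES (?)", (body,)
        ).lastrowid
        conn.execute("INSERT OR IGNORE INTO page (title) VALUES (?)", (title,))
        page_id = conn.execute(
            "SELECT page_id FROM page WHERE title = ?", (title,)
        ).fetchone()[0]
        rev_id = conn.execute(
            "INSERT INTO revision (page_id, text_id) VALUES (?, ?)",
            (page_id, text_id),
        ).lastrowid
        # The extra write that denormalization costs on every edit.
        conn.execute(
            "UPDATE page SET latest_rev_id = ?, latest_body = ? WHERE page_id = ?",
            (rev_id, body, page_id),
        )
    return rev_id


def read_normalized(conn, title):
    """Fetch the current text by joining page -> revision -> text."""
    row = conn.execute(
        """
        SELECT t.body
        FROM page p
        JOIN revision r ON r.rev_id = p.latest_rev_id
        JOIN text t ON t.text_id = r.text_id
        WHERE p.title = ?
        """,
        (title,),
    ).fetchone()
    return row[0] if row else None


def read_denormalized(conn, title):
    """Fetch the current text from the duplicated column, with no joins."""
    row = conn.execute(
        "SELECT latest_body FROM page WHERE title = ?", (title,)
    ).fetchone()
    return row[0] if row else None
```

Both read functions should return the same text as long as every write goes through save_revision; any code path that updates one copy but not the other is exactly the synchronization burden described above.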

At the end of the day, a hybrid approach works for most medium scale projects. I don't see a strong use case for expensive joins in a site like SO.

Also the "NoSQL" approach sounds a bit too radical IMHO. You can store hundreds of millions of rows of data in your denormalized MySQL table if you wish.

Sorin Mocanu
A: 

Before answering yes or no, I think the developer should first answer a few other questions about the future of the project (say, five years from now):

  • maximum number of users
  • page loads/day
  • non-cacheable page loads/day
  • traffic (GB per month or per day)

Answering these will tell you which kinds of solutions could work for you and which could not.
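
As a rough illustration of how those numbers turn into requirements, the following back-of-the-envelope sketch converts daily page loads into requests per second and monthly traffic. The function name, the average page size, and the peak factor are all assumptions made up for this example, not measured values, and a 30-day month with decimal units (1 GB = 1,000,000 KB) is assumed.

```python
SECONDS_PER_DAY = 86_400


def estimate_load(page_loads_per_day, non_cacheable_loads_per_day,
                  avg_page_kb, peak_factor=3.0):
    """Rough capacity estimate; peak_factor is an assumed peak-to-average ratio."""
    avg_rps = page_loads_per_day / SECONDS_PER_DAY
    # Only non-cacheable loads are assumed to reach the database.
    avg_db_rps = non_cacheable_loads_per_day / SECONDS_PER_DAY
    # Assumes a 30-day month and decimal units (1 GB = 1,000,000 KB).
    gb_per_month = page_loads_per_day * avg_page_kb * 30 / 1_000_000
    return {
        "avg_rps": avg_rps,
        "peak_rps": avg_rps * peak_factor,
        "avg_db_rps": avg_db_rps,
        "peak_db_rps": avg_db_rps * peak_factor,
        "gb_per_month": gb_per_month,
    }


if __name__ == "__main__":
    # Hypothetical inputs: 864,000 loads/day, 10% of them non-cacheable, 100 KB pages.
    for name, value in estimate_load(864_000, 86_400, 100).items():
        print(f"{name}: {value}")
```

If the peak database request rate comes out in the low single digits, a heavily joined schema on one server is probably fine; if it comes out in the thousands, caching and denormalization start to matter.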

Also, regarding security and access control: MediaWiki was designed to be open (no real private sub-wikis or groups), so it may not be the right fit for a corporate wiki. There are extensions that try to solve some of these problems, but they do not integrate very well.

Sorin Sbarnea