views: 108
answers: 3

Suppose I wanted to develop a Stack Overflow-like website. How do I estimate the amount of commodity hardware required to support it, assuming 1 million requests per day? Are there any case studies that explain the performance improvements possible in this situation?

I know that I/O is the major bottleneck in most systems. What are the possible options to improve I/O performance? A few that I know of are:

  1. caching
  2. replication
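
The first option above, caching, can be sketched in a few lines as a read-through cache (a plain in-memory dict stands in for something like memcached; the function and key names are illustrative):

```python
import time

cache = {}   # stands in for an external cache such as memcached
TTL = 30     # seconds a cached entry stays valid

def slow_db_query(key):
    """Pretend this hits the database (the expensive I/O we want to avoid)."""
    return f"row-for-{key}"

def get(key):
    entry = cache.get(key)
    if entry and time.time() - entry[1] < TTL:
        return entry[0]                  # cache hit: no disk/database I/O
    value = slow_db_query(key)           # cache miss: pay for the read once
    cache[key] = (value, time.time())    # store value with its fetch time
    return value

get("question:9")   # miss -> queries the "database"
get("question:9")   # hit  -> served from memory
```

The same shape applies whether the backing store is a SQL database or a file; the win is that repeated reads of hot keys never touch the slow path.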
+1  A: 

Check out this handy tool:

http://www.sizinglounge.com/

And another guide from Dell:

http://www.dell.com/content/topics/global.aspx/power/en/ps3q01_graham?c=us&l=en&cs=555

If you want your own Stack Overflow-like community, you can sign up with StackExchange.

You can read some case studies here:

High Scalability - How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

http://www.gear6.com/gear6-downloads?fid=56&dlt=case-study&ls=Veoh-Case-Study

jspcal
Actually, my question is general, and I don't want to buy costly servers. I am planning to write a Hadoop application to do data analytics work. I just want to know if there are any case studies on this.
Algorist
Try this study: http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
jspcal
+1  A: 

You can improve I/O performance in several ways depending upon what you use for your storage setup:

  1. Increase filesystem block size if your app displays good spatial locality in its I/Os or uses large files.
  2. Use RAID 10 (striping + mirroring) for performance + redundancy (disk failure protection).
  3. Use fast disks (performance-wise: SSD > FC > SATA).
  4. Segregate workloads by time of day, e.g. backups at night, normal app I/O during the day.
  5. Turn off atime updates in your filesystem.
  6. Cache NFS file handles, a.k.a. Haystack (Facebook), if storing data on an NFS server.
  7. Combine small files into larger chunks, a.k.a. BigTable, HBase.
  8. Avoid very large directories i.e. lots of files in the same directory (instead divide files between different directories in a hierarchy).
  9. Use a clustered storage system (yeah not exactly commodity hardware).
  10. Optimize/design your application for sequential disk accesses whenever possible.
  11. Use memcached. :)
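
Point 7 (combining small files into larger chunks) can be illustrated with a minimal sketch: many small records are appended to one large file, and an in-memory index maps each key to its byte range. File and key names here are illustrative; a real system like HBase adds persistence of the index, compaction, and replication.

```python
class ChunkStore:
    """Append many small records into one large file, keeping an
    in-memory index of (offset, length) per key. This avoids the
    per-file metadata and seek overhead of millions of tiny files."""

    def __init__(self, path):
        self.path = path
        self.index = {}           # key -> (offset, length)
        open(path, "ab").close()  # create the chunk file if missing

    def put(self, key, data: bytes):
        with open(self.path, "ab") as f:
            offset = f.tell()     # append mode: position is end of file
            f.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key) -> bytes:
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)        # one seek + one sequential read
            return f.read(length)

store = ChunkStore("chunk-0001.dat")
store.put("user:1", b"alice")
store.put("user:2", b"bob")
print(store.get("user:2").decode())  # bob
```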

You may want to look at "Lessons Learned" section of StackOverflow Architecture.

Sudhanshu
Regarding #10 (optimize/design your application for sequential disk accesses whenever possible): how do I achieve this, considering a website like Stack Overflow? Regarding #7 (combine small files into larger chunks, a.k.a. BigTable, HBase): I think these are key-value distributed databases. Is there a tutorial?
Algorist
HBase is based on Google's BigTable. The HBase link I provided has everything you need to get started (but HBase may not necessarily be the best solution for what you need - first come up with your exact data storage requirements, then decide if it's good for what you want). For something like Stack Overflow, you'd want to find out which operations are performed most often (this will need measurements, monitoring etc. once the site is up). Then you'd want to optimize those top operations. Remember, optimization comes after deployment (even if it needs some rewriting).
Sudhanshu
You can try these for HBase: http://stackoverflow.com/questions/tagged/hbase and http://stackoverflow.com/questions/1750556/looking-for-a-good-hbase-tutorial
Sudhanshu
+1  A: 

1 million requests per day is about 12 per second. Stack Overflow is small enough that you could (with interesting normalization and compression tricks) fit it entirely in the RAM of a 64 GB Dell PowerEdge 2970. I'm not sure where caching and replication would play a role.
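
The back-of-envelope arithmetic can be checked directly:

```python
# Average request rate for 1 million requests spread over a day.
requests_per_day = 1_000_000
seconds_per_day = 24 * 60 * 60   # 86,400

avg_rps = requests_per_day / seconds_per_day
print(f"{avg_rps:.1f} requests/second on average")  # ~11.6, i.e. about 12
```

Real traffic is bursty, so peak load is typically several times the average; that is why the answer below sizes for far more than 12/second.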

If normalizing the data down that far proves difficult, a PowerEdge R900 with 256 GB is available.

If you don't like a single point of failure, you can connect a few of those and just push updates over a socket (preferably on a separate network card). Even a peak load of 12K requests/second should not be a problem for a main-memory system.

The best way to avoid the I/O bottleneck is to not do I/O (as much as possible). That means a Prevayler-like architecture with batched writes (it's no problem to lose a few seconds of data): basically a log file, and for replication you also write the updates out to a socket.
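
A minimal sketch of the batched-write idea (the Prevayler-style approach of logging commands and flushing in batches; class and field names here are illustrative, and a real system would also replay the log on startup):

```python
import json

class BatchedLog:
    """Buffer state-changing commands in memory and flush them to an
    append-only log file in batches, trading a few seconds of potential
    data loss for far fewer disk writes."""

    def __init__(self, path, batch_size=100):
        self.path = path
        self.batch_size = batch_size
        self.buffer = []

    def append(self, command: dict):
        self.buffer.append(command)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        with open(self.path, "a") as f:
            for cmd in self.buffer:
                f.write(json.dumps(cmd) + "\n")
            f.flush()            # one flush per batch, not per command
        self.buffer.clear()

log = BatchedLog("commands.log", batch_size=2)
log.append({"op": "vote", "post": 42})
log.append({"op": "answer", "post": 42})  # batch is full -> flushed to disk
```

For replication, the same serialized commands could be written to a socket as well as to the file, as the answer suggests.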

Stephan Eggermont
But there is a problem with a single point of failure, right? How will you avoid this?
Algorist