views: 117

answers: 3

Hi everyone,

I have next to no experience dealing with high-volume transactional websites and recently came across this interesting question. I am interested in knowing where the bottlenecks in a Java web application would occur under high load (thousands of requests per second). If someone could give me a high-level approach to thinking about the following question, that would be great!

The only thing I've come up with is to use memcached to cache the database look-ups, but I don't know how to calculate the amount of time each request will take, and therefore how many requests per second the system might be able to handle.
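Roughly the shape of what I had in mind is a cache-aside look-up, something like the sketch below (the DefinitionCache and DefinitionStore interfaces are placeholders I made up, not a real memcached client or database driver):

    // Cache-aside look-up: check the cache first, fall back to the database,
    // then populate the cache so the next request for the same word is cheap.
    // DefinitionCache and DefinitionStore are placeholder interfaces, not a
    // specific memcached client or JDBC API.
    public class DictionaryService {

        public interface DefinitionCache {
            String get(String word);
            void put(String word, String definition);
        }

        public interface DefinitionStore {
            String lookup(String word); // e.g. a SELECT against the dictionary table
        }

        private final DefinitionCache cache;
        private final DefinitionStore store;

        public DictionaryService(DefinitionCache cache, DefinitionStore store) {
            this.cache = cache;
            this.store = store;
        }

        public String define(String word) {
            String definition = cache.get(word);
            if (definition == null) {
                definition = store.lookup(word); // slow path: database hit
                if (definition != null) {
                    cache.put(word, definition);
                }
            }
            return definition;
        }
    }

But I still don't know how to turn that into an estimate of how many requests per second it could sustain.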

Question: Internet-scale applications must be designed to process high volumes of transactions. Describe a design for a system that must process on average 30,000 HTTP requests per second. For each request, the system must perform a look-up into a dictionary of 50 million words, using a key word passed in via the URL query string. Each response will consist of a string containing the definition of the word (100 bytes or less).

Describe the major components of the system, and note which components should be custom-built and which could leverage third-party applications. Include hardware estimates for each component. Please note that the design should aim for maximum performance at minimum hardware and software-licensing cost.

Document the rationale behind the estimates.

Describe how the design would change if the definitions are 10 kilobytes each.

+1  A: 

As background you may look at benchmarks such as SPECmarks. Those workloads involve significantly more processing per request than your scenario, but you will see that your 30,000 req/sec is a comparatively high, though not insanely high, figure.

You may also find Joines et al. useful. (Disclaimer: they're colleagues.)

In your scenario I would expect in descending order of cost:

  1. Database retrieval
  2. Network activity: reading requests and returning responses
  3. Simple processing

You're not doing complex processing (e.g. graphics rendering or rocket-science maths). So, first guess: if your dictionary were a database, then the cost of running a query is going to dominate everything else. Traditionally, when we hit bottlenecks in the web/app server tier we scale by adding more instances, but if the database is the bottleneck that's more of a problem. So one direction to explore: what performance can you expect from a database engine, and does 30k queries per second seem feasible?

Your first observation: caching is a commonly used strategy. Here you have (presumably) random hits across the whole dictionary, so caching recent answers by itself is probably not going to help, unless ... can you cache the whole thing?

50,000,000 * (100 + overhead) == ??

On a 64-bit JVM on a 64-bit OS, maybe it fits?
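A quick back-of-envelope version, where the 150 bytes of per-entry overhead (object headers, the char[] behind the String, map entry, references) is an assumption rather than a measurement:

    // Rough heap estimate for holding the whole dictionary in one JVM.
    // The 150-byte per-entry overhead is a guess, not a measurement.
    public class HeapEstimate {
        public static void main(String[] args) {
            long entries        = 50000000L; // 50 million words
            long definitionSize = 100L;      // bytes, per the question
            long overhead       = 150L;      // assumed per-entry overhead
            long totalBytes     = entries * (definitionSize + overhead);
            System.out.printf("~%.1f GB%n", totalBytes / (1024.0 * 1024 * 1024));
            // prints roughly 11.6 GB -- large, but plausible on one 64-bit box
        }
    }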

If not (and as the data gets really big, probably not), then we need to scale, and a strategy of slicing the cache can be used. Have (for example) 4 servers, serving A-F, G-M, N-S and T-Z respectively (and, note, 4 separate caches or 4 separate databases), with a dispatcher directing each request to the right slice.
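A toy version of the dispatch step, assuming four slices keyed on the first letter of the word (the shard host names are invented):

    // Toy dispatcher: pick a shard (a separate cache/database) based on the
    // first letter of the requested word. The shard hosts are invented examples.
    import java.util.Locale;

    public class ShardDispatcher {

        private static final String[] SHARDS = {
            "http://shard-a-f.example.com", // words starting A-F
            "http://shard-g-m.example.com", // words starting G-M
            "http://shard-n-s.example.com", // words starting N-S
            "http://shard-t-z.example.com"  // words starting T-Z
        };

        public String shardFor(String word) {
            char first = word.toLowerCase(Locale.ROOT).charAt(0);
            if (first <= 'f') return SHARDS[0];
            if (first <= 'm') return SHARDS[1];
            if (first <= 's') return SHARDS[2];
            return SHARDS[3];
        }
    }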

djna
A: 

Sigh. Even Facebook only serves about 100,000 page views per second, using at least 10,000 web servers; that's only 10 page views per second per server.

This interviewer is simply insane. I bet he knows a lot of buzzwords about scalability and availability, but he has no fucking clue about real-world projects.

irreputable
You might find the C10K problem interesting. http://www.kegel.com/c10k.html
Thorbjørn Ravn Andersen
It can be done, but nobody finds it interesting these days. People are fine if their server can handle one request per second. That's how Twitter got started.
irreputable
A: 

The first thing I would do is question the numbers. English has about 170,000 words in common use. Add all the other common languages and you would still have no more than a couple of million. If that is not the case, you could cache the most common words in a fast cache and the less common words in a slower cache. Even at 30K requests per second, it would take about half an hour to see every unique word (50 million / 30,000 ≈ 1,700 seconds).
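A sketch of that two-tier idea, with a small LRU map for the hot words in front of any slower store (the capacity figure and the SlowStore interface are only illustrative):

    // Two-tier look-up: a bounded in-memory LRU cache for the most common
    // words, backed by a slower store (disk, database, memcached) for the rest.
    // HOT_CAPACITY and the SlowStore interface are illustrative only.
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class TieredDictionary {

        public interface SlowStore {
            String lookup(String word);
        }

        private static final int HOT_CAPACITY = 200000; // roughly "common English words"

        // An access-ordered LinkedHashMap gives a simple LRU eviction policy.
        private final Map<String, String> hot =
            new LinkedHashMap<String, String>(HOT_CAPACITY, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                    return size() > HOT_CAPACITY;
                }
            };

        private final SlowStore cold;

        public TieredDictionary(SlowStore cold) {
            this.cold = cold;
        }

        public synchronized String define(String word) {
            String definition = hot.get(word);
            if (definition == null) {
                definition = cold.lookup(word);
                if (definition != null) {
                    hot.put(word, definition);
                }
            }
            return definition;
        }
    }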

Basically, there is no point designing a large system if the numbers are not real.

On a 64-bit JVM this fits easily. 50 million * (100 + overhead) is about 10 GB (the overhead is likely to be high, as you need to store the key and index the data). A 12 GB server costs about $2,500.
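If it does fit in the heap, the serving side can be almost trivial; something along these lines (the JDK's built-in HttpServer is used purely for illustration, and loading the 50 million entries is omitted):

    // Whole dictionary held in one in-memory map and served straight from the
    // heap. Loading the 50M entries and all error handling are omitted;
    // com.sun.net.httpserver is used only because it ships with the JDK.
    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpHandler;
    import com.sun.net.httpserver.HttpServer;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.util.HashMap;
    import java.util.Map;

    public class DefinitionServer {

        // populated once at start-up from whatever holds the source data
        private static final Map<String, String> DICTIONARY = new HashMap<String, String>();

        public static void main(String[] args) throws IOException {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/define", new HttpHandler() {
                public void handle(HttpExchange exchange) throws IOException {
                    // expects a query string like ?word=example
                    String query = exchange.getRequestURI().getQuery();
                    String word = (query != null && query.startsWith("word="))
                            ? query.substring("word=".length()) : "";
                    String definition = DICTIONARY.get(word);
                    byte[] body = (definition != null ? definition : "not found").getBytes("UTF-8");
                    exchange.sendResponseHeaders(definition != null ? 200 : 404, body.length);
                    OutputStream out = exchange.getResponseBody();
                    out.write(body);
                    out.close();
                }
            });
            server.start();
        }
    }

That says nothing about handling 30,000 of these per second, of course; it only shows that the look-up itself is cheap once the data is in memory.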

The problem is likely to be the number of requests. You will need multiple machines, but as other posters have suggested, the numbers are highly unlikely to be real. I don't imagine this service would be as expensive to run as Facebook, but you are likely to need tens to hundreds of servers to support this many requests.

Peter Lawrey