Hypothetically, let's say someone tells you to expect X (say, 100,000) unique visitors per day as a result of a successful marketing campaign.

How does that translate to peak requests/second? Peak simultaneous requests?

Obviously this depends on many factors, such as the typical number of pages requested per user session or the load time of a typical page, among other things. Call these other variables Y, Z, V, etc.

I'm looking for some sort of function, or even just a ratio, to estimate these metrics, so that I can plan the scalability strategy for the production environment.

This might happen on a production site I'm working on really soon. Any kind of help estimating these is useful.

A: 

That will depend on the marketing campaign. For instance, a TV ad will bring a lot of traffic at once; for a newspaper ad, it will be spread out more over the day.

My experience with marketing types has been that they just pull a number from where the sun doesn't shine, typically higher than reality by at least an order of magnitude.

gnibbler
+1  A: 

I'd start by assuming that "per day" means "during the 8-hour business day", because that's a worst-case scenario without perhaps being unnecessarily worst-case.

So if you're getting an average of 100,000 in 8 hours, and if the time at which each one arrives is random (independent of the others) then in some seconds you're getting more and in some seconds you're getting less. The details are a branch of knowledge called "queueing theory".
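To put rough numbers on "more in some seconds, less in others", here is a minimal sketch. It assumes visitor arrivals form a Poisson process (the independence assumption above). The function names and the 99.9% threshold are illustrative choices, not part of the answer, and the figures count arriving visitors rather than page requests.

```python
import math

def average_rate(visitors, hours):
    """Average arrivals per second if `visitors` arrive over `hours` hours."""
    return visitors / (hours * 3600.0)

def poisson_quantile(lam, p):
    """Smallest k with P(N <= k) >= p for N ~ Poisson(lam).

    Simple summation of the probability mass function; fine for small
    rates, but math.exp(-lam) underflows for very large lam.
    """
    k = 0
    term = math.exp(-lam)  # P(N = 0)
    cdf = term
    while cdf < p:
        k += 1
        term *= lam / k    # P(N = k) from P(N = k - 1)
        cdf += term
    return k

lam = average_rate(100_000, 8)
print(round(lam, 2))                 # 3.47 arrivals per second on average
print(poisson_quantile(lam, 0.999))  # arrivals/second not exceeded in 99.9% of seconds
```

The gap between the two printed values gives a feel for how far a "busy" second sits above the average under this model.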

Assuming that the Pollaczek-Khinchine formula is applicable, and because your service time (i.e. CPU time per request) is quite small (probably less than a second), you can afford to have quite a high (i.e. greater than 50%) server utilization.

In summary, assuming that the time per request is small, you need a capacity that's higher (but here's the good news: not much higher) than whatever's required to service the average demand.

The bad news is that if your capacity is less than the average demand, then the average queueing delay is infinite (or more realistically, some requests will be abandoned before they're serviced).
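As a rough illustration of both points, the sketch below evaluates the Pollaczek-Khinchine mean waiting time for a single-server M/G/1 queue, W = λ·E[S²] / (2·(1 − ρ)) with ρ = λ·E[S]. The single-server model, the function name, and the sample service times are assumptions made for illustration; a real web tier with several workers will behave differently.

```python
def pk_mean_wait(arrival_rate, mean_service, service_cv=1.0):
    """Mean time a request waits in queue (seconds) for an M/G/1 queue.

    arrival_rate : average requests per second (lambda)
    mean_service : average service time per request in seconds (E[S])
    service_cv   : coefficient of variation of the service time
                   (1.0 corresponds to exponential service times)
    """
    rho = arrival_rate * mean_service  # server utilization
    if rho >= 1.0:
        # Capacity below average demand: the queue grows without bound.
        return float("inf")
    second_moment = (mean_service ** 2) * (1.0 + service_cv ** 2)  # E[S^2]
    return arrival_rate * second_moment / (2.0 * (1.0 - rho))

# Hypothetical numbers: 5 requests/second, 100 ms per request -> 50% utilization.
print(pk_mean_wait(5.0, 0.1))   # about 0.1 s of queueing delay
print(pk_mean_wait(9.0, 0.1))   # 90% utilization: delay grows to about 0.9 s
print(pk_mean_wait(20.0, 0.1))  # demand above capacity: inf
```

With a short service time, even 90% utilization costs under a second of average delay in this model, which is why the required capacity is "not much higher" than the average demand.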

The other bad news is that when your service time is small, you're sensitive to temporary fluctuations in the average demand, for example ...

  • If the demand peaks during the lunch hour (i.e. isn't the same average demand as during other hours), or even if for some reason it peaks during a 5-minute period (during a TV commercial break, for example)

  • And if you can't afford to have customers queueing for that period (e.g. queueing for the whole lunch hour, or e.g. the whole five-minute commercial break)

... then your capacity needs to be enough to meet those short-term peak demands. OTOH you might decide that you can afford to lose the surplus: that it's not worth engineering for the peak capacity (e.g. hiring extra call centre staff during the lunch hour) and that you can afford some percentage of abandoned calls.

ChrisW
+2  A: 

Edit: (following indication that we have virtually NO prior statistics on the traffic)
We can therefore forget about the bulk of the plan laid out below and get directly into the "run some estimates" part. The problem is that we'll need to fill in the model's parameters using educated guesses (or plain wild guesses). The following is a simple model whose parameters you can tweak based on your understanding of the situation.

Model

Assumptions:
a) The distribution of page requests follows the normal distribution curve.
b) Considering a short period during peak traffic, say 30 minutes, the number of requests can be considered to be evenly distributed.
This could be [somewhat] incorrect: for example we could have a double curve if the ad campaign targets multiple geographic regions, say the US and the Asian markets. Also the curve could follow a different distribution. These assumptions are however good ones for the following reasons:

  • it would err, if at all, on the "pessimistic side", i.e. over-estimating peak traffic values. This "pessimistic" outlook can be pushed further by using a slightly smaller std deviation value. (We suggest using 2 to 3 hours, which would put 68% and 95% of the traffic over a period of 4 and 8 hours (2h std dev) or 6 and 12 hours (3h std dev), respectively.)
  • it makes for easy calculations ;-)
  • it is expected to generally match reality.

Parameters:

  • V = expected number of distinct visitors per 24 hour period
  • Ppv = average number of page requests associated with a given visitor session. (you may consider using the formula twice, one for "static" type of responses, and the other for dynamic responses, i.e. when the application spends time crafting a response for a given user/context)
  • sig = std deviation in minutes
  • R = peak-time number of requests per minute.

Formula:

   R = (V * Ppv * 0.0796)/(2 * sig / 10)

That is because, with a normal distribution and as per the z-score table, roughly 3.98% of the samples fall within 1/10 of a std dev on one side or the other of the mean (i.e. of the very peak). We therefore get almost 8 percent of the samples within one tenth of a std dev on both sides, and with the assumption of relatively even distribution during this period, we just divide by the number of minutes in that window (2 * sig / 10).
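The percentages quoted here can be checked without a z-score table. This is a small sketch using the error function from the standard library; the helper name is made up for illustration.

```python
import math

def fraction_within(z):
    """Fraction of a normal distribution lying within +/- z std devs of the mean."""
    return math.erf(z / math.sqrt(2.0))

print(round(fraction_within(0.1), 4))  # 0.0797, the "almost 8 percent" around the peak
print(round(fraction_within(1.0), 4))  # 0.6827, the "68%" within one std dev
print(round(fraction_within(2.0), 4))  # 0.9545, the "95%" within two std devs
```

The 0.0796 constant in the formula comes from doubling the rounded table value 0.0398; the small difference from 0.0797 is rounding only.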

Example: V = 75,000, Ppv = 12 and sig = 150 minutes (i.e. 68% of traffic assumed to come over 5 hours, 95% over 10 hours, and 5% over the other 14 hours of the day). R = 2,388 requests per minute, i.e. about 40 requests per second. Rather heavy, but "doable" (unless the application takes 15 seconds per request...)
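For plugging in your own values, the formula can be written as a small function. This is only a sketch of the model above; the function and parameter names are illustrative. The second function is an added cross-check, not part of the model as stated: it evaluates the normal density exactly at its peak instead of averaging over the central window.

```python
import math

def peak_requests_per_minute(visitors, pages_per_visit, sigma_minutes,
                             central_fraction=0.0796):
    """R = (V * Ppv * 0.0796) / (2 * sig / 10), as in the formula above."""
    window_minutes = 2 * sigma_minutes / 10  # +/- one tenth of a std dev
    return visitors * pages_per_visit * central_fraction / window_minutes

def exact_peak_requests_per_minute(visitors, pages_per_visit, sigma_minutes):
    """Cross-check: total requests times the normal density at its mean."""
    return visitors * pages_per_visit / (sigma_minutes * math.sqrt(2.0 * math.pi))

r = peak_requests_per_minute(75_000, 12, 150)
print(round(r))       # 2388 requests per minute
print(round(r / 60))  # 40 requests per second
```

The two functions agree to within a fraction of a percent, so the 1/10 std dev window is a reasonable stand-in for the true peak of the curve.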

[original response] It appears that your immediate concern is how the server(s) may handle the extra load... A very worthy concern ;-). Without distracting you from this operational concern, consider that the process of estimating the scale of the upcoming surge also provides an opportunity to prepare yourself to gather more and better intelligence about the site's traffic, during and beyond the ad campaign. Such information will in time prove useful for making better estimates of surges etc., but also for guiding some of the site's design (for commercial efficiency as well as for improving scalability).

A tentative plan

Assume qualitative similarity with existing traffic.
The ad campaign will expose the site to a different population (type of users) from its current visitor/user population: different situations select different subjects. For example, the "ad campaign" visitors may be more impatient, focused on a particular feature, or concerned about price, compared to the "self-selected" visitors. Nevertheless, for lack of any other supporting model and measurement, and for the sake of estimating load, the general principle could be to assume that the surge users will on the whole behave similarly to the self-selected crowd. A common approach is to "run the numbers" on this basis and to use educated guesses to slightly bend the coefficients of the model to account for a few qualitative differences.

Gather statistics about existing traffic
Unless you readily have better information for this (e.g. Tealeaf, Google Analytics...), your source for such information may simply be the web server's logs. You can then build some simple tools to parse these logs and extract the following statistics. Note that these tools will be reusable for future analysis (e.g. of the campaign itself); also look for opportunities to log more or different data without significantly changing the application!

  • Average, Min, Max, Std Dev. for
    • number of pages visited per session
    • duration of a session
  • percentage of 24 hour traffic for each hour of a work day (exclude week-ends and such, unless of course this is a site which receives much traffic during these periods). These percentages should be calculated over several weeks at least to remove noise.
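As a starting point for the hourly percentages, here is a minimal sketch of such a log tool. It assumes the access log uses the Common Log Format timestamp (for example `[10/Oct/2000:13:55:36 -0700]`); adjust the pattern to your server's actual format. The function name and sample lines are made up for illustration, and session-level statistics would need additional logic (e.g. grouping by client or cookie).

```python
import re
from collections import Counter

# Captures the hour from a Common Log Format timestamp (assumed format).
TIMESTAMP = re.compile(r"\[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\d{2} [+-]\d{4}\]")

def hourly_share(lines):
    """Return {hour: percentage of all requests} for the given log lines."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:  # skip lines without a recognizable timestamp
            counts[int(match.group(1))] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {hour: 100.0 * n / total for hour, n in sorted(counts.items())}

sample = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /b HTTP/1.0" 200 512',
    '127.0.0.1 - - [10/Oct/2000:13:59:59 -0700] "GET /c HTTP/1.0" 200 128',
    '127.0.0.1 - - [10/Oct/2000:14:02:10 -0700] "GET /a HTTP/1.0" 200 2326',
    'malformed line without a timestamp',
]
print(hourly_share(sample))  # {13: 75.0, 14: 25.0}
```

On a real log you would pass an open file object instead of the sample list and aggregate over several weeks of files.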

"Run" some estimates:
For example, start with a peak-use estimate, using the peak hour(s) percentage, the average daily session count, the average number of page hits per session, etc. This estimate should take into account the stochastic nature of traffic. Note that in this phase you don't have to worry about the impact of the queuing effect; instead, assume that the service time is low enough relative to the request period. Therefore just use a realistic estimate (or rather a value informed by the log analysis for these very high usage periods) of the way the probability of a request is distributed over short periods (say, 15 minutes).
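A peak-hour estimate of this kind boils down to one line of arithmetic. The sketch below uses invented figures (100,000 sessions per day, 12% of traffic in the peak hour, 10 pages per session) purely as placeholders for values you would take from the log analysis. It assumes requests are spread evenly within the peak hour, so it will understate short bursts.

```python
def peak_requests_per_second(daily_sessions, peak_hour_share, pages_per_session):
    """Average requests per second during the busiest hour of the day.

    peak_hour_share is the fraction (0..1) of daily traffic in that hour.
    """
    requests_in_peak_hour = daily_sessions * peak_hour_share * pages_per_session
    return requests_in_peak_hour / 3600.0

# Placeholder inputs, not measurements.
print(round(peak_requests_per_second(100_000, 0.12, 10), 1))  # 33.3
```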

Finally, based on the numbers you obtained in this fashion, you can get a feel for the type of sustained load this would represent on the server, and plan to add resources or to refactor part of the application. Also -very important!- if the outlook is for sustained at-capacity load, start running the Pollaczek-Khinchine formula, as suggested by ChrisW, to get a better estimate of the effective load.

For extra credit ;-) Consider running some experiments during the campaign, for example by randomly providing a distinct look or behavior for some of the pages visited, and by measuring the impact this may have (if any) on particular metrics (registrations for more info, orders placed, number of pages visited ...). The effort associated with this type of experiment may be significant, but the return can be significant as well, and if nothing else it may keep your "usability expert/consultant" on his/her toes ;-) You'll obviously want to work on defining such experiments with the proper marketing/business authorities, and you may need to calculate ahead of time the minimum percentage of users to whom the alternate site would be shown, to keep the experiment statistically representative. It is indeed important to know that the experiment doesn't need to be applied to 50% of the visitors; one can start small, just not so small that possible variations observed may be due to random...
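One way to estimate that minimum percentage is the standard sample-size approximation for comparing two proportions. This is a sketch under stated assumptions, not a method prescribed above: the baseline rate (5%), the hoped-for rate (6%), the 5% significance level, and the 80% power are all hypothetical inputs, and the function names are made up.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH group to detect the difference
    between two conversion rates with a two-sided test."""
    effect = p_variant - p_control
    if effect == 0:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1.0 - p_control) + p_variant * (1.0 - p_variant)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

def min_variant_fraction(expected_visitors, n_per_group):
    """Fraction of all visitors that must see the alternate version."""
    return n_per_group / expected_visitors

n = sample_size_per_group(0.05, 0.06)  # hypothetical 5% -> 6% conversion
print(n)
print(min_variant_fraction(100_000, n))
```

With these placeholder inputs the alternate version needs well under 50% of a 100,000-visitor day, which is the point made above; smaller expected effects push the required share up quickly.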

mjv
Current traffic is very low, so I can't experiment. The traffic is promised to appear as a surge on one or two days. All that is given is the number of visitors per day. No opportunity for experimentation.
ulver
@ulver: see my edit at the top of the response. I added a plausible model to use as a basis. Verify its validity, plug in your values, and take the results with a pinch of salt... The lack of any prior data obliges us to make more assumptions and guesses, for example about the number of requests made in a given session. I hope this helps nonetheless!
mjv