views: 199
answers: 3

Hi,

I am tasked with building an application in which business users will define a number of rules for data manipulation and processing (e.g. taking one numerical value and splitting it equally amongst the records selected by the condition specified in the rule).
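
For example (purely as an illustration; the class and field names below are placeholders, not the real schema or the rule engine's API), one such rule could be sketched in C# roughly like this:

    // Hypothetical sketch only: Record's fields and SplitRule are invented
    // placeholders, not part of the actual application or the rule engine.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public class Record
    {
        public int Id { get; set; }
        public string Region { get; set; }
        public decimal Allocation { get; set; }
    }

    public static class SplitRule
    {
        // Split 'amount' equally across every record matching 'condition'.
        public static void Apply(IList<Record> records, decimal amount, Func<Record, bool> condition)
        {
            var targets = records.Where(condition).ToList();
            if (targets.Count == 0) return;

            decimal share = amount / targets.Count;
            foreach (var record in targets)
                record.Allocation += share;
        }
    }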

On a monthly basis, a batch application has to be run in order to process around half a million records as per the rules defined. Each record has around 100 fields. The environment is .NET, C# and SQL Server with a third-party rule engine.

Could you please suggest how to go about defining and/or ascertaining what kind of hardware will be best suited if the requirement is to process the records within a timeframe of, let's say, around 8 to 10 hours? How will the specs vary if the user wants to either increase or decrease the timeframe depending on the hardware costs?

Thanks in advance

Abby

+1  A: 

Create the application and profile it?

Kim Johansson
I need to let the IT team know what kind of hardware will be required so that they can procure it. Thus, creating the application and then profiling it will not be the solution.
Abby
Actually, profiling on existing hardware is the only way to guess at future hardware needs... You can't profile it without having it built...
Jason D
A: 

If this system is not the first of its kind, you can consider the following:

  • Re-use (after additional evaluation) the hardware requirements from previous projects
  • Evaluate hardware requirements based on the workload and hardware configuration of an existing application

If that is not the case and performance requirements are very important, then the best way would be to create a prototype with, say, 10 rules implemented. Process the dataset using the prototype and extrapolate to the full rule set. Based on this information you should be able to derive initial performance and hardware requirements. Then you can fine-tune these specifications, taking into account planned growth in processed data volume, scalability requirements and redundancy.
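
As a rough sketch of the kind of extrapolation I mean (the rule and record counts and the ProcessBatch placeholder are assumptions for illustration, not measurements or real code):

    // Time a prototype run and extrapolate linearly to the full workload.
    // All figures below are invented; ProcessBatch stands in for the real
    // prototype processing.
    using System;
    using System.Diagnostics;

    public static class PrototypeEstimate
    {
        public static void Main()
        {
            const int prototypeRules = 10;
            const int prototypeRecords = 50000;
            const int fullRules = 200;       // assumed size of the full rule set
            const int fullRecords = 500000;

            var timer = Stopwatch.StartNew();
            ProcessBatch(prototypeRecords, prototypeRules);   // run the prototype
            timer.Stop();

            // Naive linear extrapolation: assumes cost scales with records x rules.
            double secondsPerRecordRule =
                timer.Elapsed.TotalSeconds / ((double)prototypeRecords * prototypeRules);
            double projectedHours =
                secondsPerRecordRule * fullRecords * fullRules / 3600.0;

            Console.WriteLine($"Projected full run: {projectedHours:F1} hours");
        }

        static void ProcessBatch(int records, int rules)
        {
            // Placeholder for the actual prototype processing.
        }
    }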

Dima Malenko
Just a nitpick here... 10 rules seldom extrapolate out well for a projected half-million-rule system... 10,000 minimum is what I'd go for. Automate the rule generation and parameterize it so that they can measure and quantify what happens if they grossly underestimated the system.
Jason D
As I understood it, half a million in the question refers to the number of records processed by the rules, not the number of rules themselves. Anyway, good point: the basis for extrapolation should be sound and relevant to the task at hand.
Dima Malenko
+1  A: 

Step 0. Create the application. It is impossible to tell the real-world performance of a multi-computer system like you're describing from "paper" specifications... You need to try it and see where the biggest slowdowns are... This is traditionally physical IO, but not always...

Step 1. Profile with sample sets of data in an isolated environment. This is a gross metric. You're not trying to isolate what takes the time, just measuring the overall time it takes to run the rules.

What does isolated environment mean? You want to use the same sorts of network hardware between the machines, but do not allow any other traffic on that network segment; outside traffic introduces too many variables at this point.

What does profile mean? With the current hardware, measure how long it takes to complete under the following circumstances. Write a program to automate the rule and data generation (a sketch of such a generator follows the scenario list).

Scenario 1. 1,000 of the simplest rules possible.

Scenario 2. 1,000 of the most complex rules you can reasonably expect users to enter.

Scenarios 3 & 4. 10,000 simplest and most complex.

Scenarios 5 & 6. 25,000 simplest and most complex.

Scenarios 7 & 8. 50,000 simplest and most complex.

Scenarios 9 & 10. 100,000 simplest and most complex.
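
Here is a minimal sketch of such a generator, assuming a hypothetical Rule shape with a complexity knob; adapt it to whatever the actual rule engine consumes:

    // Parameterized test-rule generator for the scenarios above.
    // The Rule type and the placeholder condition are assumptions for the
    // sketch, not the rule engine's real format.
    using System;
    using System.Collections.Generic;

    public class Rule
    {
        public string Condition { get; set; }
        public int ComplexityLevel { get; set; }   // 1 = simplest, 10 = most complex
    }

    public static class RuleGenerator
    {
        public static List<Rule> Generate(int count, int complexityLevel)
        {
            var rules = new List<Rule>(count);
            for (int i = 0; i < count; i++)
            {
                rules.Add(new Rule
                {
                    // Placeholder condition; a real generator would vary the
                    // predicates, joins and arithmetic with the complexity level.
                    Condition = $"Field{i % 100} > {i}",
                    ComplexityLevel = complexityLevel
                });
            }
            return rules;
        }
    }

    // Usage for the scenarios above, e.g.:
    //   var simple  = RuleGenerator.Generate(1000, complexityLevel: 1);
    //   var complex = RuleGenerator.Generate(1000, complexityLevel: 10);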

Step 2. Analyze the data.

See if there are trends in completion time. Figure out whether they appear tied strictly to the volume of rules or whether the complexity also factors in... I assume it will.

Develop a trend line that shows how long you can expect it to take if there are 200,000 and 500,000 rules. Perform another run at 200,000. See if the trend line is correct; if not, revise your method of developing the trend line.
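
For instance, a simple least-squares line over the measured runs could be fit like the sketch below (the sample timings are invented; if complexity turns out to matter, fit per complexity level or switch to a non-linear model):

    // Fit minutes = a * rules + b over the measured scenario runs and
    // project out to 200,000 and 500,000 rules. Sample points are made up.
    using System;
    using System.Linq;

    public static class TrendLine
    {
        public static void Main()
        {
            // (rule count, measured minutes) from the scenario runs -- illustrative values only.
            var samples = new (double Rules, double Minutes)[]
            {
                (1000, 4), (10000, 35), (25000, 90), (50000, 185), (100000, 370)
            };

            double n = samples.Length;
            double sumX = samples.Sum(s => s.Rules);
            double sumY = samples.Sum(s => s.Minutes);
            double sumXY = samples.Sum(s => s.Rules * s.Minutes);
            double sumXX = samples.Sum(s => s.Rules * s.Rules);

            // Standard least-squares slope and intercept.
            double a = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
            double b = (sumY - a * sumX) / n;

            foreach (double rules in new[] { 200000.0, 500000.0 })
                Console.WriteLine($"{rules:N0} rules: ~{(a * rules + b) / 60:F1} hours");
        }
    }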

Step 3. Measure the database and network activity as the system processes the larger rule sets. See if there is more activity happening with more rules. If so, the more you speed up throughput to and from the SQL Server, the faster it will run.

If these are "relatively low," then CPU and RAM speed are likely where you'll want to beef up the requested machines specification...

Of course, if all this testing is going to cost your employer more than buying the beefiest server hardware possible, just compare the cost of the time spent testing against the cost of buying the best server, being done with it, and only tweaking your app and the SQL you control to improve performance...

Jason D
In my experience it is possible to come up with a more or less useful approximation of hardware requirements even before the system is built. In real-world situations you have to estimate the horsepower required to run the application; it makes little sense to build an app only to discover that you cannot afford to run it. Trying different scenarios and gathering statistics is indeed the way to go here.
Dima Malenko