I have some data analysis that I need to perform. On average, it will involve somewhere between 50K and 150K rows. From these rows I need to compute Sum(X) and Count(X) based on five different criteria. There are two ways of going about it:

  1. Write 10 different queries, each one designed to aggregate the data from column X using Sum() or Count(). Run each one and retrieve the result using SqlCommand.ExecuteScalar().
  2. Create a custom object to contain all of the different parameters needed to evaluate the different conditions. Run one query that returns the superset of data covering all of the conditional subsets, using SqlCommand.ExecuteReader(). Read each row from the DataReader into a new object, adding each one to a List collection. Once all data is retrieved, use LINQ-to-Objects to compute the various Sum() and Count() values based on the different conditions. (A rough sketch of both approaches follows the list.)
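
For concreteness, here is a minimal sketch of what the two approaches might look like. The connection string, table name, and column names (SalesData, Amount, Category) are placeholders made up for illustration, not part of the actual schema:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;

class AggregationSketch
{
    // Hypothetical shape of one source row (only needed for option 2).
    class Row
    {
        public decimal Amount { get; set; }
        public int Category { get; set; }
    }

    static void Main()
    {
        const string connStr = "Server=...;Database=...;Integrated Security=true;"; // placeholder

        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();

            // Option 1: one round trip per aggregate; SQL Server does all the work.
            using (var cmd = new SqlCommand(
                "SELECT SUM(Amount) FROM SalesData WHERE Category = 1", conn))
            {
                object result = cmd.ExecuteScalar();
                decimal sum1 = result is DBNull ? 0m : (decimal)result;
                // ...repeat for the remaining nine Sum()/Count() queries.
            }

            // Option 2: one round trip for the raw rows; .NET does the aggregation.
            var rows = new List<Row>();
            using (var cmd = new SqlCommand("SELECT Amount, Category FROM SalesData", conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    rows.Add(new Row { Amount = reader.GetDecimal(0), Category = reader.GetInt32(1) });
                }
            }

            decimal sum1Linq = rows.Where(r => r.Category == 1).Sum(r => r.Amount);
            int count1Linq = rows.Count(r => r.Category == 1);
        }
    }
}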

I know that I could try each one out to see which is fastest, but I am interested in the community's advice on which one is likely to be faster. Assume SQL Server and the web server are each running on their own machine, each with sufficient memory.

Right now I am leaning towards option 1. Even though there are many more queries to the DB, the DB itself will do all of the aggregation work and very little data will pass between SQL Server and the web server. With option 2, there is only one query, but it will pass a very large amount of data to .NET, and then .NET will have to do all of the heavy lifting with regard to the aggregate functions (and though I don't have anything to base it on, I suspect that SQL Server is more efficient at running these types of big aggregate functions).

Any thoughts on which way to go (or a third option that I am missing)?

+1  A: 

You know, I'd go with option 1. 50-150K rows is not really that many rows IMO, especially if you haven't got too many columns.

Once you start talking millions of rows, I'd start thinking about optimising.

Another thing: as always with databases, make sure your query hits your indexes correctly. That matters much more than the difference between your two ideas.

Chris
+1  A: 

The database is generally the best option for analysing data like this, and I'd go for #1.

However, I'd also consider a third option, in which you create a single query that you pass to SQL (instead of ten). This could involve putting CASE statements in the aggregate functions, so that you can do all the work in a single pass of the data.

Consider the old way of doing PIVOT:

SELECT 
   SUM(CASE WHEN ConditionX = 1 THEN SomeField END) AS SUM1,
   SUM(CASE WHEN ConditionX = 2 THEN SomeField END) AS SUM2,
   COUNT(CASE WHEN ConditionX = 1 THEN SomeField END) AS COUNT1
FROM SourceData
;
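
With that approach, all of the aggregates come back as a single row that the web tier reads in one pass. As a rough sketch of consuming it from .NET (the connStr and pivotSql variables are assumptions, as is SomeField being a decimal column):

// pivotSql would hold the CASE-based query above; connStr is a placeholder.
using (var conn = new SqlConnection(connStr))
using (var cmd = new SqlCommand(pivotSql, conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        if (reader.Read())
        {
            // Each aggregate is simply a column of the single result row.
            decimal sum1 = reader.IsDBNull(0) ? 0m : reader.GetDecimal(0);
            decimal sum2 = reader.IsDBNull(1) ? 0m : reader.GetDecimal(1);
        }
    }
}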
Rob Farley
I'd agree; the third option is probably the best way to go. If the OP could provide sample data, it would be easier to say that with certainty. Anytime I hear that I need to write x number of queries that do the same thing on different columns, I start wondering how to reduce that number.
Stuart Ainsworth