I need to get summary data from many, many rows. The summary fields are counts of how many entries have each value for different fields. For example, for a table with people's age, city, job, etc., the summary data would include fields like "countManager" and "countCodeMonkey" for each job, and then "countChicago", "countNewYork", etc. for each city.

I know that a simple way of getting everything is:

select age, city, job, count(*) from table
    group by age, city, job;

But this is vertical counting - a different row for each value I need. Instead of rows with the counts, I need fields, because I have other fields to group by, for example state. So I want my results to look like this:

| State | countManager | countMonkey |
|  IL   |      3       |     25      |
|  NY   |      5       |     40      |

I'm looking at two ways to do this. We already have one implemented, and it takes 20 minutes to execute. I'm wondering if a different way would be faster.

The current way looks like this:

create view managers as
    select state, count(*) as theCount from table
        where job = 'Manager'
        group by state;

create view monkeys as
    select state, count(*) as theCount from table
        where job = 'Monkey'
        group by state;

select managers.state, managers.theCount as managers, monkeys.theCount as monkeys
    from managers left join monkeys
        on managers.state = monkeys.state;

In the real case, there are about 20 more views, and hence 20 more joins.

I'm considering the following horizontal count method instead:

select state,
  sum(case when job='Manager' then 1 else 0 end) as managers,
  sum(case when job='Monkey' then 1 else 0 end) as monkeys
    from table
        group by state;

This eliminates the joins. But I have no clue about the performance of the "sum case" expression. Will it be any faster, about the same, or much slower? Does the engine have to go through the rows multiple times, once for each such counted field? Or is it smart enough to count all the fields in one pass, checking each value and incrementing the appropriate sum?

I may spend part of a day writing a script to generate a giant amount of garbage data to test this, but I'd still like to know how the db engine behaves, which I couldn't find elsewhere online.
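
For the test, a throwaway generator could even be plain SQL rather than a script. Just as a rough sketch, assuming a MySQL-style engine, a hypothetical target table people(age, city, job, state), and a hypothetical helper table digits holding the values 0 through 9:

    -- 'people' and 'digits' are hypothetical; six self-joins of a ten-row
    -- digits table produce 10^6 rows of random-ish data.
    insert into people (age, city, job, state)
    select
        20 + floor(rand() * 45),
        elt(1 + floor(rand() * 3), 'Chicago', 'New York', 'Springfield'),
        elt(1 + floor(rand() * 2), 'Manager', 'Monkey'),
        elt(1 + floor(rand() * 2), 'IL', 'NY')
    from digits d1, digits d2, digits d3, digits d4, digits d5, digits d6;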

+3  A: 

It depends entirely on the engine and how you want to see the data, but your second option should definitely come through faster.

Even 20 minutes for your first query is excessive, though, unless you have literally billions of rows. In that case you should look at archiving data on a monthly/weekly basis into a table of precompiled aggregated data that you can slice and dice to suit.
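
As a rough sketch of that idea, assuming a hypothetical source table people(state, job) and some periodic refresh job, the precompiled table might look like:

    -- Hypothetical summary table, refreshed nightly or weekly.
    create table job_counts_by_state (
        state    char(2) primary key,
        managers int,
        monkeys  int
    );

    insert into job_counts_by_state (state, managers, monkeys)
    select state,
           sum(case when job = 'Manager' then 1 else 0 end),
           sum(case when job = 'Monkey'  then 1 else 0 end)
      from people
     group by state;

Your reports then read from the small summary table instead of scanning the raw rows every time.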

md5sum
The views might be slower because of indexing issues, or rather, a lack of indexes.
Philip Kelley
+1  A: 

If there isn't concurrency between your transaction and others, the "sum case" approach is a good choice. Aggregate operations such as AVG, SUM, and GROUP BY degrade performance. Keep two things in mind: "divide and conquer", and numeric data is faster to process than text data.

Create a data warehouse (a single table or a separate database) to avoid concurrency and speed up processing.

CPUs are marvelous calculators: convert your categorical data ("NY", "LA", "Man", "Woman") to numeric codes (1, 2, 61, 62) and improve your mining.
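
As a minimal illustration of that idea (the table and column names here are hypothetical), the text job column can be replaced by a small integer code plus a lookup table, so the pivot compares integers instead of strings:

    -- Hypothetical lookup table mapping job names to small integer codes.
    create table job_codes (
        job_code smallint primary key,
        job_name varchar(30)
    );

    insert into job_codes values (1, 'Manager'), (2, 'Monkey');

    -- The fact table stores only job_code; the conditional sums then
    -- compare integers rather than text.
    select state,
           sum(case when job_code = 1 then 1 else 0 end) as managers,
           sum(case when job_code = 2 then 1 else 0 end) as monkeys
      from people
     group by state;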

Don't worry about database vendors or platform choices; think in terms of relational algebra.

lcardosobr