I've used an Excel PivotTable to analyze data from my database because it lets me "slice and dice" very quickly. Since we know what is in our database tables, we can all write SQL queries that do what a PivotTable does.

But I am wondering how a PivotTable can construct those queries so quickly when it knows nothing about the data, or about the meanings of and relationships between the fields we give it.

To put the question another way: how can we build ad-hoc SQL queries in such a fast and efficient way? ("Use a PivotTable, of course!" Yes, but what I want is a programmatic way.)

+1  A: 

Just manipulate your ORDER BY and GROUP BY clauses as necessary.

Excel is fast because all the data is in memory, where it can be sorted quickly and efficiently.
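
As an illustration (a minimal sketch; the sales table and its columns are made up), re-pivoting the same data often comes down to swapping which column appears in the GROUP BY:

-- hypothetical table: pivot totals by region...
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY total DESC;

-- ...then re-pivot by product simply by swapping the grouped column
SELECT product, SUM(amount) AS total
FROM sales
GROUP BY product
ORDER BY total DESC;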

Mark Ransom
Dude - loved the Amy Winehouse cover of Valerie you did
adolf garlic
I think you have me confused with the guy from markransom.com? That's not me.
Mark Ransom
A: 

My intuitive feeling is that the answer has something to do with the PivotTable layout, which has a fixed number of zones, namely:

- the Page Fields zone  
- the Column Fields zone  
- the Row Fields zone and
- the Data zone

My wild guess (sketched in the query below):

- The Page zone builds the WHERE part of the ad-hoc query.
- The Column zone puts whichever fields are dragged onto it into the GROUP BY clause.
- The Row zone builds a SELECT DISTINCT <field names>.
- The Data zone applies an AGGREGATE function to the field dragged onto it.
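
Taking that guess literally, with hypothetical fields SalesYear (Page zone), Region (Column zone), Product (Row zone), and Sales (Data zone), the generated query might look like:

-- Page zone becomes the filter, Row/Column zones the grouping,
-- and the Data zone the aggregate
SELECT Product, Region, SUM(Sales) AS TotalSales
FROM tblSales
WHERE SalesYear = 2009
GROUP BY Product, Region;

Note that grouping on the Row and Column fields already makes those combinations distinct, so an explicit SELECT DISTINCT wouldn't be needed.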

What do you think happens "behind the scenes" when we drag fields to those zones?

Martin
+1  A: 

@Mark Ransom is definitely onto something with the notion of Excel keeping the data in memory, making it computationally faster. It's also possible that Excel pre-indexes datasets in a way that makes it more responsive than your database.

There's one significant, non-algorithmic possibility for why it's faster: Excel, in Pivot Table usage, has no concept of a join. When you're fetching the data ad hoc from your database, any joins or correlations between tables will result in further lookups, scans, index loads, etc. Since Excel has all the data in a single location (RAM or no), it can perform lookups without having to pre-form datasets. If you were to load your database data into a temp table, it would be interesting to see how ad hoc queries against that table stacked up, performance-wise, against Excel.
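
As a sketch of that experiment (the table names are hypothetical, and SELECT ... INTO is SQL Server syntax; other engines use CREATE TEMPORARY TABLE ... AS instead):

-- join once and materialize the result, roughly what Excel holds in RAM
SELECT o.orderID, o.amount, c.region
INTO #flatOrders
FROM orders o
JOIN customers c ON c.customerID = o.customerID;

-- subsequent ad hoc queries hit the pre-joined copy, with no joins at all
SELECT region, SUM(amount) AS total
FROM #flatOrders
GROUP BY region;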

One thing's certain, though: although databases are excellent tools for producing accurate reports, a traditionally normalized database will be far from optimal for ad hoc queries. Because normalized data structures focus on integrity above all else (if I may take that liberty), they sacrifice ad hoc performance for the sake of keeping all the data sensible. Although this is a poor example, consider this normalized schema:

+--------+     +---------+
|tblUsers|     |luGenders|
+--------+     +---------+
|userID  |     |genderID |
|genderID|     |gender   |
+--------+     +---------+

SELECT * FROM luGenders;
> 1 Female
> 2 Male

If, in this example, we wished to know the number of female/male users in our system, the database would need to process the join and behave accordingly (again, a bad example given the single join and the low number of possible values, which should generally bring about some database-engine optimization). If you were to dump this data to Excel instead, you'd still incur the database penalty of pulling the data once, but actually pivoting it in Excel would be fairly speedy. It may be that this up-front, fixed-cost penalty is what the comparison of Excel against straight ad hoc queries overlooks, but I don't have the data to say.
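
Concretely, against the schema above, the count would force a join along these lines:

-- counting users per gender requires resolving the lookup table
SELECT g.gender, COUNT(*) AS userCount
FROM tblUsers u
JOIN luGenders g ON g.genderID = u.genderID
GROUP BY g.gender;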

The larger point, though, is that while general-purpose databases are good for accuracy, they often perform poorly on ad hoc reports. To produce ad hoc reports, it's often necessary to denormalize ("warehouse") the data into a more queryable structure. Looking up information on data warehousing will turn up a lot of good material on the subject.
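
As a minimal sketch of warehousing this example (rptUsers is a made-up name, and CREATE TABLE ... AS syntax varies by engine):

-- store the gender as text directly on the reporting table,
-- trading some space and redundancy for join-free queries
CREATE TABLE rptUsers AS
SELECT u.userID, g.gender
FROM tblUsers u
JOIN luGenders g ON g.genderID = u.genderID;

The earlier count then runs with no join at all: SELECT gender, COUNT(*) FROM rptUsers GROUP BY gender;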

Moral of the story: a fully algorithmic, fast ad hoc query system is an awesome ideal, but it is less than practical given space and time constraints (memory and people-hours). To build an effective ad hoc system, you really need to understand the use cases for your data, and then denormalize it accordingly.

I'd highly recommend The Data Warehouse Toolkit. For the record, I'm no DBA; I'm just a lowly analyst who spends 80 hours a week munging Excel and Oracle. I know your pain.

kyle