views:

79

answers:

4

I have a table named Projects that has the following relationships:

has many Contributions has many Payments

In my result set, I need the following aggregate values:

  • Number of unique contributors (DonorID on the Contribution table)
  • Total contributed (SUM of Amount on Contribution table)
  • Total paid (SUM of PaymentAmount on Payment table)

Because there are so many aggregate functions and multiple joins, it gets messy do use standard aggregate functions the the GROUP BY clause. I also need the ability to sort and filter these fields. So I've come up with two options:

Using subqueries:

SELECT Project.ID AS PROJECT_ID,
(SELECT SUM(PaymentAmount) FROM Payment WHERE ProjectID = PROJECT_ID) AS TotalPaidBack,
(SELECT COUNT(DISTINCT DonorID) FROM Contribution WHERE RecipientID = PROJECT_ID) AS ContributorCount,
(SELECT SUM(Amount) FROM Contribution WHERE RecipientID = PROJECT_ID) AS TotalReceived
FROM Project;

Using a temporary table:

DROP TABLE IF EXISTS Project_Temp;
CREATE TEMPORARY TABLE Project_Temp (project_id INT NOT NULL, total_payments INT, total_donors INT, total_received INT, PRIMARY KEY(project_id)) ENGINE=MEMORY;
INSERT INTO Project_Temp (project_id,total_payments)
 SELECT `Project`.ID, IFNULL(SUM(PaymentAmount),0) FROM `Project` LEFT JOIN `Payment` ON ProjectID = `Project`.ID GROUP BY 1;
INSERT INTO Project_Temp (project_id,total_donors,total_received)
 SELECT `Project`.ID, IFNULL(COUNT(DISTINCT DonorID),0), IFNULL(SUM(Amount),0) FROM `Project` LEFT JOIN `Contribution` ON RecipientID = `Project`.ID  GROUP BY 1
 ON DUPLICATE KEY UPDATE total_donors = VALUES(total_donors), total_received = VALUES(total_received);

SELECT * FROM Project_Temp;

Tests for both are pretty comparable, in the 0.7 - 0.8 seconds range with 1,000 rows. But I'm really concerned about scalability, and I don't want to have to re-engineer everything as my tables grow. What's the best approach?

+2  A: 

Knowing the timing for each 1K rows is good, but the real question is how they'll be used.

Are you planning to send all these back to a UI? Google doles out results 25 per page; maybe you should, too.

Are you planning to do calculations in the middle tier? Maybe you can do those calculations on the database and save yourself bringing all those bytes across the wire.

My point is that you may never need to work with 1,000 or one million rows if you think carefully about what you do with them.

You can EXPLAIN PLAN to see what the difference between the two queries is.

duffymo
Thanks for the response. Unfortunately, one of the requirements is that all of this data gets sent to a map that contains the world of all Projects, so pagination isn't good enough.Assuming that weren't a requirement, though, there are 16 projects per page. Do you think it's not worthwhile to do all of this and instead run 3 or 4 simple queries for each result? That's 64 queries per page, but if they're simple, maybe it's trivial?
Aaron Carlino
+1  A: 

I would go with the first approach. You are allowing the RDBMS to do it's job, rather than trying to do it's job for it.

By creating a temp table, you will always create the full table for each query. If you only want data for one project, you still end up creating the full table (unless you restrict each INSERT statement accordingly.) Sure, you can code it, but it's already becoming a fair amount code and complexity for a small performance gain.

With a SELECT, the db can fetch the appriate amount of data, optimizing the whole query based on context. If other users have queried the same data, it may even be cached (query, and possibly data, depending upon your db). If performance is truly a concern, you might consider using Indexed/Materialized Views, or generating a table on an INSERT/UPDATE/DELETE trigger. Scaling out, you can use server clusters and partioned views - something that I believe will be difficult if you are creating temporary tables.

EDIT: the above is written without any specific rdbms in mind, although the OP added that mysql is the target db.

mdma
I love the suggestion of using views. I've never created one before, but it seems like this is the perfect application of one.Interesting that temporary tables seem to be a redundant exercise, doing what mysql will do anyway.
Aaron Carlino
Ok, it's mysql - you might want to add that to your question, as well as the version you are using.
mdma
Mysql doesn't support indexed views - you can create a view, but it's more like a simple way of hiding details, so the text of the query lives in your database rather than in your code. Views also provide a way of breaking complex queries into smaller pieces. They can hide a lot of detail, which is usually a benefit, but also can lead to a simple select hiding complex queries if not managed carefully.
mdma
+1  A: 

There is a third option which is derived tables:

Select Project.ID AS PROJECT_ID
    , Payments.Total AS TotalPaidBack
    , Coalesce(ContributionStats.DonarCount, 0) As ContributorCount
    , ContributionStats.Total As TotalReceived
From Project
    Left Join   (
                Select C1.RecipientId, Sum(C1.Amount) As Total, Count(Distinct C1.DonarId) ContributorCount
                From Contribution As C1
                Group By C1.RecipientId
                ) As ContributionStats
        On ContributionStats.RecipientId = Project.Project_Id
    Left Join   (
                Select P1.ProjectID, Sum(P1.PaymentAmount) As Total
                From Payment As P1
                Group By P1.RecipientId
                ) As Payments
        On Payments.ProjectId = Project.Project_Id

I'm not sure if it will perform better, but you might give it shot.

Thomas
+1  A: 

A few thoughts:

  • The derived table idea would be good on other platforms, but MySQL has the same issue with derived tables that it does with views: they aren't indexed. That means that MySQL will execute the full content of the derived table before applying the WHERE clause, which doesn't scale at all.

  • Option 1 is good for being compact, but syntax might get tricky when you want to start putting the derived expressions in the WHERE clause.

  • The suggestion of materialized views is a good one, but MySQL unfortunately doesn't support them. I like the idea of using triggers. You could translate that temporary table into a real table that persists, and then use INSERT/UPDATE/DELETE triggers on the Payments and Contribution tables to update the Project Stats table.

  • Finally, if you don't want to mess with triggers, and if you aren't too concerned with freshness, you can always have the separate stats table and update it offline, having a cron job that runs every few minutes that does the work that you specified in Query #2 above, except on the real table. Depending on the nuances of your application, this slight delay in updating the stats may or may not be acceptable to your users.

Tyler Dumont