views:

94

answers:

4

I have a table foodbar, created with the following DDL. (I am using mySQL 5.1.x)

CREATE TABLE foodbar (
    id          INT NOT NULL AUTO_INCREMENT,
    user_id     INT NOT NULL,
    weight      double not null,
    created_at  date not null
);

I have four questions:

  1. How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago.
  2. How may I write a query that will return the top N users with the biggest weight gain (again say over a week).? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
  3. Since in question 2 (and indeed question 1), I am searching the records in the table using a calculated field, indexing would be preferable to optimise the query - however since it is a calculated field, it is not clear which field to index (I'm guessing the 'weight' field is the one that needs indexing). Am I right in that assumption?.
  4. Assuming I had another field in the foodbar table (say 'height') and I wanted to select records from the table based on (say) the product (i.e. multiplication) of 'height' and 'weight' - would I be right in assuming again that I need to index 'height' and 'weight'?. Do I also need to create a composite key (say (height,weight)). If this question is not clear, I would be happy to clarify
+3  A: 

I don't see why you should need the synthetic key, so I'll use this table instead:

CREATE TABLE foodbar (
  user_id     INT NOT NULL
, created_at  date not null
, weight      double not null
, PRIMARY KEY (user_id, created_at)
);

How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago.

SELECT curr.user_id, curr.weight - prev.weight
FROM foodbar curr, foodbar prev
WHERE curr.user_id = prev.user_id
  AND curr.created_at = CURRENT_DATE
  AND prev.created_at = CURRENT_DATE - INTERVAL '7 days'
;

the date arithmetic syntax is probably wrong but you get the idea

How may I write a query that will return the top N users with the biggest weight gain (again say over a week).? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.

see above, add ORDER BY curr.weight - prev.weight DESC and LIMIT N

for the last two questions: don't speculate, examine execution plans. (postgresql has EXPLAIN ANALYZE, dunno about mysql) you'll probably find you need to index columns that participate in WHERE and JOIN, not the ones that form the result set.

just somebody
OMG Ponies
+1 (especially because you dropped the arbitrary synthetic key). I must point out however that this assumes the data is filled very regularly and consistently on a daily basis. Gaps will mess horribly with your query.
Craig Young
I don't like your join syntax, but some may
Scoregraphic
This looks "most intuitive" to me. I'll run some tests/checks and see. Also thanks for pointing out about the synthetic index.
Stick it to THE MAN
+1  A: 

I think that "just somebody" covered most of what you're asking, but I'll just add that indexing columns that take part in a calculation is unlikely to help you at all unless it happens to be a covering index.

For example, it doesn't help to order the following rows by X, Y if I want to get them in the order of their product X * Y:

X     Y
1     8
2     2
4     4

The products would order them as:

X     Y     Product
2     2     4
1     8     8
4     4     16

If mySQL supports calculated columns in a table and allows indexing on those columns then that might help.

Tom H.
Correct, as you've illustrated: in this case it's mathematically impossible to determine how indexes on the individual columns can be used to efficiently query the calculation. There are a few cases where a query optimiser could figure out how to rearrange a calculation and leverage an appropriate index. However, this is often hard to do or even recognise, which is the reason one should always try to keep columns away from calculations on the 'other side' of the expression if possible.
Craig Young
+1  A: 

I agree with just somebody regarding the primary key, but for what you're asking regarding the weight calculation, you'd be better off storing the delta rather than the weight:

CREATE TABLE foodbar (
  user_id      INT NOT NULL, 
  created_at   date not null,
  weight_delta double not null, 
  PRIMARY KEY (user_id, created_at)
);

It means you'd store the users initial weight in say, the user table, and when you write records to the foodbar table, a user could supply the weight at that time, but the query would subtract the initial weight from the current weight. So you'd see values like:

user_id   weight_delta
------------------------
1         2
1         5
1         -3

Looking at that, you know that user 1 gained 4 pounds/kilos/stones/etc.

This way you could use SUM, because it's possible for someone to have weighings every day - using just somebody's equation of curr.weight - prev.weight wouldn't work, regardless of time span.

Getting the top x is easy in MySQL - use the LIMIT clause, but mind that you provide an ORDER BY to make sure the limit is applied correctly.

OMG Ponies
This is a very clever way of rearranging things to solve the specific requirements more efficiently. Note that the first time you capture a reading for a user, you could just enter the current weight (because the delta is simply a reflection of change from the previous 'reading' of "zero"). Although you may want to create a more obvious distinction between true deltas and the initial reading. Even a separate table is justified. Remember to holistically consider all requirements to ensure this design is what you need.
Craig Young
A: 

It's not obvious, but there's some important information missing in the problem you're trying to solve. It becomes more noticeable when you think about realistic data going into this table. The problem is that you're unlikely to to have a consistent regular daily record of users' weights. So you need to clarify a couple of rules around determining 'current-weight' and 'weight x days ago'. I'm going to assume the following simplistic rules:

  • The most recent weight reading is the 'current-weight'. (Even though that could be months ago.)
  • The most recent weight reading more than x days ago will be the weight assumed at x days ago. (Even though for example a reading from 6 days ago would be more reliable than a reading from 21 days ago when determining weight 7 days ago.)

Now to answer the questions:

1&2: Using the above extra rules provides an opportunity to produce two result sets: current weights, and previous weights:

Current weights:

select  rd.*,
        w.Weight
from    (
        select  User_id,
                max(Created_at) AS Read_date
        from    Foodbar
        group by User_id
        ) rd
        inner join Foodbar w on
            w.User_id = rd.User_id
        and w.Created_at = rd.Read_date

Similarly for the x days ago reading:

select  rd.*,
        w.Weight
from    (
        select  User_id,
                max(Created_at) AS Read_date
        from    Foodbar
        where   Created_at < DATEADD(dd, -7, GETDATE()) /*Or appropriate MySql equivalent*/
        group by User_id
        ) rd
        inner join Foodbar w on
            w.User_id = rd.User_id
        and w.Created_at = rd.Read_date

Now simply join these results as subqueries

select  cur.User_id,
        cur.Weight as Cur_weight,
        prev.Weight as Prev_weight
        cur.Weight - prev.Weight as Weight_change
from    (
        /*Insert query #1 here*/
        ) cur
        inner join (
        /*Insert query #2 here*/
        ) prev on
            prev.User_id = cur.User_id

If I remember correctly the MySql syntax to get the top N weight gains would be to simply add:

ORDER BY cur.Weight - prev.Weight DESC limit N

2&3: Choosing indexes requires a little understanding of how the query optimiser will process the query:

The important thing when it comes to index selection is what columns you are filtering by or joining on. The optimiser will use the index if it is determined to be selective enough (note that sometimes your filters have to be extremely selective returning < 1% of data to be considered useful). There's always a trade of between slow disk seek times of navigating indexes and simply processing all the data in memory.

3: Although weights feature significantly in what you display, the only relevance is in terms of filtering (or selection) is in #2 to get the top N weight gains. This is a complex calculation based on a number of queries and a lot of processing that has gone before; so Weight will provide zero benefit as an index.

Another note is that even for #2 you have to calculate the weight change of all users in order to determine the which have gained the most. Therefore unless you have a very large number of readings per user you will read most of the table. (I.e. a table scan will be used to obtain the bulk of the data)

Where indexes can benefit:

  • You are trying to identify specific Foodbar rows based on User_id and Created_at.
  • You are also joining back to the Foodbar table again using User_id and Created_at.

This implies an index on User_id, Created__at would be useful (more-so if this is the clustered index).

4: No, unfortunately it is mathematically impossible to determine how the individual values H and W would independently determine the ordering of the product. E.g. both H=3 & W=3 are less than 5, yet if H=5 and W=1 then the product 3*3 is greater than 5*1. You would have to actually store the calculation an index on that additional column. However, as indicated in my answer to #3 above, it is still unlikely to prove beneficial.

Craig Young