ansaurus

Question

MySQL query help: how to deal with data in most-recent-row-per-day from a big dataset

Answer 1

+1 A:

Don't know where Company and Division join...but here it is:

select
    c.name as company,
    d.name as division,
    s.name as salesperson,
    sum(h.callsinbound) as callsinboundsum,
    sum(h.callsoutbound) as callsoutboundsum,
    sum(h.issuedorders) as issuedorderssum,
    sum(h.revenue) as revenuesum
from
    sales_history_performance h
    inner join
        (select
            th.salespersonid,
            date(th.timestamp) as my_date,
            max(th.timestamp) as max_time
        from
            sales_history_performance th
            inner join salesperson ts on
                th.salespersonid = ts.id
        where
            th.timestamp >= '2009-05-01' and th.timestamp < '2009-05-04' -- May 1 through May 3, inclusive
        group by
            th.salespersonid,
            date(th.timestamp)
        ) t on
      h.salespersonid = t.salespersonid
      and h.timestamp = t.max_time
    inner join salesperson s on
        h.salespersonid = s.id
    inner join division d on
        s.divisionid = d.id
    inner join company c on
        d.companyid = c.id
group by
    c.name,
    d.name,
    s.name
order by 1,2,3

To filter on a salesperson, add a condition such as and s.name like '%' to the query (s is the salesperson alias), with whatever pattern you need.

Here is what it does: the subquery builds a table of the latest timestamp per salesperson per day. If ID in sales_history_performance is reliably larger for later entries, use max(ID) instead of max(timestamp), since two rows can share a timestamp but presumably not an ID, so you're less likely to get duplicates. The outer query then joins that result back to the table and sums all of the metric columns per salesperson. You can take the salesperson out of the outer query if you want a company-wide number. As it stands, it returns all salespeople.
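A minimal sketch of the logic described above, run from Python against an in-memory SQLite database rather than MySQL (an assumption, made only so the example is self-contained). The table perf, its columns, and the sample rows are hypothetical and stand in for the real schema. It compares picking the last row per day by max(timestamp) with picking it by max(id) when two rows share the same timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for the performance table.
    CREATE TABLE perf (
        id            INTEGER PRIMARY KEY,
        salespersonid INTEGER,
        ts            TEXT,   -- ISO 'YYYY-MM-DD HH:MM:SS'
        revenue       REAL    -- cumulative figure for the day so far
    );
    INSERT INTO perf (salespersonid, ts, revenue) VALUES
        (1, '2009-05-01 09:00:00', 100.0),
        (1, '2009-05-01 17:00:00', 250.0),
        (1, '2009-05-02 10:00:00',  40.0),
        (1, '2009-05-02 18:00:00',  90.0),
        (1, '2009-05-02 18:00:00',  95.0);  -- same timestamp as the row before
""")

# Last row per salesperson per day, identified by its timestamp.
BY_TIMESTAMP = """
SELECT h.salespersonid, SUM(h.revenue)
FROM perf h
JOIN (SELECT salespersonid, MAX(ts) AS max_ts
      FROM perf
      GROUP BY salespersonid, date(ts)) t
  ON h.salespersonid = t.salespersonid AND h.ts = t.max_ts
GROUP BY h.salespersonid
"""

# Same idea, identified by id (assumes later rows always get larger ids).
BY_ID = """
SELECT h.salespersonid, SUM(h.revenue)
FROM perf h
JOIN (SELECT MAX(id) AS max_id
      FROM perf
      GROUP BY salespersonid, date(ts)) t
  ON h.id = t.max_id
GROUP BY h.salespersonid
"""

by_timestamp = conn.execute(BY_TIMESTAMP).fetchall()
by_id = conn.execute(BY_ID).fetchall()

print(by_timestamp)  # both 18:00:00 rows are counted for May 2
print(by_id)         # exactly one row per day is counted
```

With this sample data the timestamp version counts both rows that share 18:00:00, while the id version counts only one row per day, which is the duplicate problem mentioned above.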

Update: I added in company and division. This is a pretty generic query. If you'd like to limit on division/company/salesperson, you can do so in the WHERE clause of the outer query, although you may be able to get some performance gains out of doing it in the inner query--it's just a bit harder to maintain.
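A rough sketch of the two places the filter can go, again using Python with an in-memory SQLite database in place of MySQL (an assumption for the sake of a runnable example). The tables perf, salesperson and division, the aliases sp and dv, and the sample data are all hypothetical. Both queries return the same rows; the second one restricts the subquery so it scans fewer rows before the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, cut-down schema.
    CREATE TABLE division (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE salesperson (id INTEGER PRIMARY KEY, name TEXT, divisionid INTEGER);
    CREATE TABLE perf (id INTEGER PRIMARY KEY, salespersonid INTEGER, ts TEXT, revenue REAL);
    INSERT INTO division VALUES (1, 'East'), (2, 'West');
    INSERT INTO salesperson VALUES (1, 'Ann', 1), (2, 'Bob', 2);
    INSERT INTO perf (salespersonid, ts, revenue) VALUES
        (1, '2009-05-01 09:00:00', 100.0),
        (1, '2009-05-01 17:00:00', 250.0),
        (2, '2009-05-01 12:00:00',  70.0),
        (2, '2009-05-01 16:00:00', 180.0);
""")

# Filter in the outer query: simplest to read and maintain.
OUTER_FILTER = """
SELECT s.name, SUM(h.revenue)
FROM perf h
JOIN (SELECT th.salespersonid, MAX(th.ts) AS max_ts
      FROM perf th
      GROUP BY th.salespersonid, date(th.ts)) t
  ON h.salespersonid = t.salespersonid AND h.ts = t.max_ts
JOIN salesperson s ON h.salespersonid = s.id
JOIN division d ON s.divisionid = d.id
WHERE d.name = ?
GROUP BY s.name
"""

# Filter in the inner query: the subquery only considers the wanted division.
INNER_FILTER = """
SELECT s.name, SUM(h.revenue)
FROM perf h
JOIN (SELECT th.salespersonid, MAX(th.ts) AS max_ts
      FROM perf th
      JOIN salesperson sp ON th.salespersonid = sp.id
      JOIN division dv ON sp.divisionid = dv.id
      WHERE dv.name = ?
      GROUP BY th.salespersonid, date(th.ts)) t
  ON h.salespersonid = t.salespersonid AND h.ts = t.max_ts
JOIN salesperson s ON h.salespersonid = s.id
GROUP BY s.name
"""

outer_rows = conn.execute(OUTER_FILTER, ("East",)).fetchall()
inner_rows = conn.execute(INNER_FILTER, ("East",)).fetchall()
print(outer_rows, inner_rows)
```

Whether the inner filter is actually faster on a large MySQL table depends on the indexes and the optimizer, so it is worth checking with EXPLAIN before making the query harder to maintain.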

Eric 2009-06-12 21:14:45

Thanks for the answer...I forgot to include the FKs for both the SalesPerson (FK into Division) and the Division (FK into Company). I'm trying to follow the logic and run this in MySQL, but it's complaining: "Unknown column 'c.salespersonid' in 'on clause'" Any thoughts? Thanks!

DarkSquid 2009-06-12 21:28:05

c.salespersonid doesn't exist. It should have been c.id, since c was the salesperson table alias (where salespersonid doesn't exist). Sorry! I re-aliased the tables to make a little more sense and fixed that bug.

Eric 2009-06-12 21:35:28

I really appreciate the help! Still not quite working here: now I'm getting "Unknown column 'a.salespersonid' in 'on clause'" from mysql :-\

DarkSquid 2009-06-12 21:52:58

Didn't fix all of my aliases. Fixed now :)

Eric 2009-06-12 21:56:35

My bacon: you've saved it!! Thanks very much Eric!!

DarkSquid 2009-06-12 22:01:09

Answer 2

A:

"(keeping in mind that for each day, the row to use for the sum is the last one by date for that day, for that salesperson)"

This information is hard to swallow. I was wondering whether you were saying that the sum for a day is stored in the salesperson_hourly_performance table, mixing day summaries and hourly summaries in the same table.

There's no relation in your example to the division and company. But to break down sales per person per day for a given date range:

select s.name,substring(timestamp,1,11) as day,sum(callsInBound),sum(callsOutBound),sum(issuedOrders),sum(salesRevenue) 
from salesperson_hourly_performance facts , salesperson s  
where facts.salesPersonId = s.id and  timestamp >= "2009-05-03 00:00:00" and timestamp < "2009-05-07 00:00:00" 
group by s.name,day 
order by day asc;
+-----------+-------------+-------------------+--------------------+-------------------+-------------------+
| name      | day         | sum(callsInBound) | sum(callsOutBound) | sum(issuedOrders) | sum(salesRevenue) |
+-----------+-------------+-------------------+--------------------+-------------------+-------------------+
| bob jones | 2009-05-03  |               101 |                125 |                93 |        72836.7372 |
| bob jones | 2009-05-04  |                19 |                 17 |                 6 |         4200.7100 |
| bob jones | 2009-05-06  |                 0 |                  2 |                 1 |          120.0000 |
+-----------+-------------+-------------------+--------------------+-------------------+-------------------+

Storing the timestamp as an actual timestamp/datetime type would make it much easier to work with dates and times. There are MySQL functions for converting strings to datetimes (e.g. STR_TO_DATE) that could probably help your queries if it really has to be a varchar column.

Edit: I would really not mix granularity in this table. Keep one table for day summaries, one table for hours.

If you only need the row with the latest timestamp per day, use e.g.:

SELECT   p.name,
         Substring(TIMESTAMP,1,11) AS DAY,
         Sum(callsinbound),
         Sum(callsoutbound),
         Sum(issuedorders),
         Sum(salesrevenue)
FROM     (SELECT   sh.salespersonid,
                   Substring(sh.TIMESTAMP,1,11) AS DAY,
                   Max(TIMESTAMP)               AS max_ts
          FROM     salesperson_hourly_performance sh
          GROUP BY sh.salespersonid,
                   DAY) t
         INNER JOIN salesperson_hourly_performance shp
           ON t.salespersonid = shp.salespersonid
              AND t.max_ts = shp.TIMESTAMP
         INNER JOIN salesperson p
           ON shp.salespersonid = p.id
GROUP BY p.name,
         DAY;

Add WHERE clauses where you need them, e.g. the date range from the first query.

nos 2009-06-12 21:31:30

-1: This sums up every row in that day. It only needs to sum up the rows with the latest timestamp in that day (presumably, the facts are a cumulative balance).

Eric 2009-06-12 21:37:28

Sorry for the lack of clarity. What I was trying/failing to make clear was that for any given day, there will be X rows of salesperson_hourly_performance data. The one which should be used is the "last" one on that day (e.g., the one closest to 23:59:59). One can ignore the other rows for the purpose of this query. Also, I've updated the table as indeed the timestamp is a DATETIME. Will try to digest this - thanks for the help!

DarkSquid 2009-06-12 21:38:56

The last query here is pretty much what Eric wrote, I guess. Didn't see that until now.

nos 2009-06-12 21:59:51
