views:

365

answers:

2

Hi everyone,

I have a somewhat complicated assortment of tables for which I need to do some SQL query construction/optimization. Currently a lot of the logic being used to obtain the results we need is being done at the app layer, which is resulting in terrible performance due to full table traversals, etc. SQL is not my strong suit, so I thought I'd reach out to the SO crowd to see if anybody could lend a hand.

Infrastructure Background:

  1. DB is MySQL5
  2. We're accessing this data via Hibernate using Java
  3. Most of these tables' contents are relatively static, with the exception of the "salesperson-hourly-performance" table which contains a row for each hour of each day a given salesperson is active (e.g., has made or received a call) with the running tally of that salesperson's performance for the entire day. Given the # of sales people across the companies in question this table can grow by 20K+ rows per day.

Data Objects

I've created a simplified version of the table setup which incorporates the relevant data. The "real" tables have about 20 companies, 300 divisions, 20K sales people, and millions of records of salesperson performance data.

CREATE TABLE  `so_test`.`company` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `name` varchar(45) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=latin1;

INSERT INTO company VALUES (7, 'CompanyXX');

CREATE TABLE  `so_test`.`division` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `name` varchar(45) NOT NULL,
  `campanyId` int(10) unsigned NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=18 DEFAULT CHARSET=latin1;

INSERT INTO division VALUES (17, 'APAC #1');

CREATE TABLE  `so_test`.`salesperson` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `divisionId` int(10) unsigned NOT NULL,
  `name` varchar(45) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=213860 DEFAULT CHARSET=latin1;

INSERT INTO salesperson VALUES (213859, 'bob jones');

CREATE TABLE  `so_test`.`salesperson_hourly_performance` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `timestamp` DATETIME NOT NULL,
  `salesPersonId` int(10) unsigned NOT NULL,
  `callsInBound` int(10) unsigned NOT NULL,
  `callsOutBound` int(10) unsigned NOT NULL,
  `issuedOrders` int(10) unsigned NOT NULL,
  `salesRevenue` decimal(10,4) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=552395 DEFAULT CHARSET=latin1;

INSERT INTO salesperson_hourly_performance VALUES (552394, '2009-05-03 22:00:00', 213859, 15, 17, 14, 10798.0478),
(551254, '2009-05-03 21:00:00', 213859, 14, 16, 13, 9802.3620),
(551115, '2009-05-03 20:00:00', 213859, 13, 14, 12, 9183.8250),
(550072, '2009-05-03 19:00:00', 213859, 11, 13, 11, 8490.8678),
(549613, '2009-05-03 18:00:00', 213859, 10, 11, 9, 7230.1125),
(549389, '2009-05-03 17:00:00', 213859, 9, 10, 8, 6486.2173),
(548861, '2009-05-03 16:00:00', 213859, 7, 9, 7, 5537.8553),
(548059, '2009-05-03 15:00:00', 213859, 6, 8, 6, 4663.8469),
(547466, '2009-05-03 14:00:00', 213859, 5, 7, 5, 4082.6388),
(546729, '2009-05-03 13:00:00', 213859, 4, 6, 4, 3057.7368),
(546611, '2009-05-03 12:00:00', 213859, 3, 5, 2, 1751.6135),
(545642, '2009-05-03 11:00:00', 213859, 2, 4, 2, 1751.6135),
(545558, '2009-05-03 10:00:00', 213859, 1, 3, 0, 0.0000),
(545072, '2009-05-03 09:00:00', 213859, 1, 2, 0, 0.0000),
(565071, '2009-05-04 13:00:00', 213859, 19, 17, 6, 4200.1710),
(575070, '2009-05-06 14:00:00', 213859, 0, 2, 1, 120.0000);

Business requirements:

  1. Populate a set of web-based sales performance "dashboard" UIs which provide a separate performance overview for the companies, the divisions, and the individual sales people.
  2. The UIs are largely similar to one-another, aside from the dataset: the "company" dashboard aggregates all the data of all the salespeople in each of the compay's divisions and outputs a row per company, whereas the divisions dashboard for a particular company aggregates the data of each of the salespeople in that division and a row per division.
  3. The UIs allow the user to pick a date range for the report dashboard and sort by any of the columns. The columns displayed include:

    (Company|Division|Sales Person) Name, Total issued orders, Total sales revenue, Total calls inbound, Total calls outbound.

My issue/plea to SO:

The "legacy" approach (which was shameful yet kinda-sorta-marginally-acceptable when the output was to a daily journal) was to programatically iterate through the performance data for each of the relevant objects (e.g., each sales person in a division in a company), find the "last" one on each of the given days in the specified date range, and sum the data. However, given the massive dataset and the need to present this data "live" in a UI, I need guidance/examples of how to construct efficient SQL queries against this dataset which will allow for pagination and sorting.

  1. Would some kind soul please show me a reasonable query which gets the sum of each of the sales person performance data columns for a given date range (keeping in mind that for each day, the row to use for the sum is the last one by date for that day, for that salesperson).

  2. A query which performs query #1 over a range of sales people (e.g., all the sales people in a given company) with support for pagination and ordering on a particular column?

I hope I've included sufficient details to make clear what I'm asking...please let me know if you need any additional information.

Many thanks SO SQL gods!

UPDATE:

Added missing keys from salesPerson -> division & from division -> company. Also, fixed datatype of "timestamp" to be DATETIME instead of VARCHAR.

+1  A: 

Don't know where Company and Division join...but here this is:

select
    c.name as company,
    d.name as division,
    s.name as salesperson,
    sum(h.callsinbound) as callsinboundsum,
    sum(h.callsoutbound) as callsoutboundsum,
    sum(h.issuedorders) as issuedorderssum,
    sum(h.revenue) as revenuesum
from
    sales_history_performance h
    inner join
        (select
            th.salespersonid,
            date(th.timestamp) as my_date,
            max(th.timestamp) as max_time
        from
            sales_history_performance th
            inner join salesperson ts on
                th.salespersonid = ts.id
        where
            th.timestamp between '5/1/2009' and '5/3/2009' --inclusive in MySQL
        group by
            th.salespersonid,
            date(th.timestamp)
        ) t on
      h.salespersonid = t.salespersonid
      and h.timestamp = t.max_time
    inner join salesperson s on
        h.salespersonid = s.id
    inner join division d on
        s.divisionid = d.id
    inner join company c on
        d.companyid = c.id
group by
    c.name,
    d.name,
    s.name
order by 1,2,3

You can edit the and sp.name like '%' line that's commented out to add whatever sales person filter you need to it.

What this does is thusly: It goes out and builds a table of the top timestamp in each day. If ID in sales_history_performance is reliably larger for later entries, use that, since you're less likely to get duplicates. Anyway, then it joins that to the table summing up all of the metric columns, per salesperson. You can take the sales person out of the outer query if you want to get a company wide number. As this were, it will return all sales people.

Update: I added in company and division. This is a pretty generic query. If you'd like to limit on division/company/salesperson, you can do so in the WHERE clause of the outer query, although you may be able to get some performance gains out of doing it in the inner query--it's just a bit harder to maintain.

Eric
Thanks for the answer...I forgot to include the FK's for both the SalesPerson (FK into Division) and the Division (FK into Company).I'm trying to follow the logic and run this in mysql but he's complaining: "Unknown column 'c.salespersonid' in 'on clause'" Any thoughts? Thanks!
DarkSquid
c.salespersonid doesn't exist. It should have been c.id, since c was the salesperson table alias (where salespersonid doesn't exist). Sorry! I re-aliased the tables to make a little more sense and fixed that bug.
Eric
I really appreciate the help! Still not quite working here: now I'm getting "Unknown column 'a.salespersonid' in 'on clause'" from mysql :-\
DarkSquid
Didn't fix all of my aliases. Fixed now :)
Eric
My bacon: you've saved it!! Thanks very much Eric!!
DarkSquid
A: 

keeping in mind that for each day, the row to use for the sum is the last one by date for >that day, for that salesperson)

This information is hard to swallow. I was wondering wether you were saying that the sum for a day is stored in the salesperson_hourly_performance table, mixing day summaries and hourly summaries in the same table.

There's no relation in your example to the division and company. But to break down sales per person per day for a given date range:

select s.name,substring(timestamp,1,11) as day,sum(callsInBound),sum(callsOutBound),sum(issuedOrders),sum(salesRevenue) 
from salesperson_hourly_performance facts , salesperson s  
where facts.salesPersonId = s.id and  timestamp >= "2009-05-03 00:00:00" and timestamp < "2009-05-07 00:00:00" 
group by s.name,day 
order by day asc;
+-----------+-------------+-------------------+--------------------+-------------------+-------------------+
| name      | day         | sum(callsInBound) | sum(callsOutBound) | sum(issuedOrders) | sum(salesRevenue) |
+-----------+-------------+-------------------+--------------------+-------------------+-------------------+
| bob jones | 2009-05-03  |               101 |                125 |                93 |        72836.7372 |
| bob jones | 2009-05-04  |                19 |                 17 |                 6 |         4200.7100 |
| bob jones | 2009-05-06  |                 0 |                  2 |                 1 |          120.0000 |
+-----------+-------------+-------------------+--------------------+-------------------+-------------------+

Storing the timestamp as an actual timestamp/datetime type would give you easier flexibility dealing with dates and times. There's mysql functions for converting strings to datetimes that probably could help your queries if it really has to be a varchar column

Edit, I would really not mix granularity in this table. Keep one table for day summaries, one table for hours.

if you'd only need the row with the largest date per day use e.g.

SELECT   p.name,
         Substring(TIMESTAMP,1,11) AS DAY,
         Sum(callsinbound),
         Sum(callsoutbound),
         Sum(issuedorders),
         Sum(salesrevenue)
FROM     (SELECT   sh.salespersonid,
                   Substring(sh.TIMESTAMP,1,11) AS DAY,
                   Max(TIMESTAMP)               AS max_ts
          FROM     salesperson_hourly_performance sh
          GROUP BY sh.salespersonid,
                   DAY) t
         INNER JOIN salesperson_hourly_performance shp
           ON t.salespersonid = shp.salespersonid
              AND t.max_ts = shp.TIMESTAMP
         INNER JOIN salesperson p
           ON shp.salespersonid = p.id
GROUP BY p.name,
         DAY;

Add where clauses where you need e.g. as per the first query

nos
-1: This sums up every row in that day. It only needs to sum up the rows with the latest timestamp in that day (assumedly, the facts are a cumulative balance).
Eric
Sorry for the lack of clarity. What I was trying/failing to make clear was that for any given day, there will be X rows of salesperson_hourly_performance data. The one which should be used is the "last" one on that day (e.g., the one closest to 23:59:59). One can ignore the other rows for the purpose of this query. Also, I've updated the table as indeed the timestamp is a DATETIME. Will try to digest this - thanks for the help!
DarkSquid
The last query here is pretty much what Eric wrote I quess. Didn't see that until now.
nos