+5  A: 

You should be able to get everything you need in one go:

 SELECT
    AVG(D.AMOUNT) as AMOUNT,
    Y.YEAR as YEAR,
    MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
    Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,             
    ((avg(AVG(D.AMOUNT) * Y.YEAR)) - avg(AVG(D.AMOUNT)) * avg(Y.YEAR)) /                  
    (stddev( AVG(D.AMOUNT) ) * stddev( Y.YEAR )) as CORRELATION,                     
    ((sum(Y.YEAR) * sum(AVG(D.AMOUNT))) - (count(1) * sum(Y.YEAR * AVG(D.AMOUNT)))) /
    (power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as SLOPE,   
    ((sum( Y.YEAR ) * sum( Y.YEAR * AVG(D.AMOUNT) )) -
    (sum( AVG(D.AMOUNT) ) * sum(power(Y.YEAR, 2)))) / 
    (power(sum(Y.YEAR), 2) - count(1) * sum(power(Y.YEAR, 2))) as INTERCEPT
 FROM
    CITY C,
    STATION S,
    YEAR_REF Y,
    MONTH_REF M,
    DAILY D
 WHERE
    $X{ IN, C.ID, CityCode } AND
    SQRT(
        POW( C.LATITUDE - S.LATITUDE, 2 ) +
        POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < $P{Radius} AND
    S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND
    Y.YEAR BETWEEN 1900 AND 2009 AND
    M.YEAR_REF_ID = Y.ID AND
    M.CATEGORY_ID = $P{CategoryCode} AND
    M.ID = D.MONTH_REF_ID AND
    D.DAILY_FLAG_ID <> 'M'
 GROUP BY
    Y.YEAR

Things will not work straight from the query above (it nonsensically combines aggregates and has other errors), so this can be a good time to double-check your formulas.
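Since the closed-form least-squares algebra is easy to get wrong, it helps to verify it independently of SQL. A minimal Python sketch (made-up sample data) using the same SLOPE and INTERCEPT expressions as the query; note that both numerator and denominator are negated relative to the textbook form, which is why the signs still work out:

```python
# Sanity check of the closed-form least-squares formulas from the query.
# x = years, y = yearly average amounts (made-up, perfectly linear data).
xs = [2000, 2001, 2002, 2003]
ys = [1.0, 3.0, 5.0, 7.0]  # y = 2x - 3999, so slope 2, intercept -3999

n = len(xs)
sx = sum(xs)
sy = sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

# Same algebra as the SQL: numerator and denominator are both negated
# relative to the usual (n*sxy - sx*sy) / (n*sxx - sx^2) form.
slope = (sx * sy - n * sxy) / (sx ** 2 - n * sxx)
intercept = (sx * sxy - sy * sxx) / (sx ** 2 - n * sxx)

print(slope, intercept)  # 2.0 -3999.0
```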

If you decide to use subqueries to simplify the formulas, then:

  • you can (and do) grab all the necessary data in the innermost query, so you don't have to repeat all the tables in the outer queries any more (just select the relevant columns from the subquery; they are already at your disposal)
  • you don't have to repeat the where condition
Unreason
@Unreason: Thank you, again, for all your help.
Dave Jarvis
I don't think I can mix the two queries like this. The problem is that the daily data needs to be grouped by `Y.YEAR` and averaged per year. I want to get all that data plus the slope and such. I have updated the question accordingly.
Dave Jarvis
The whole thing is grouped by year and averaging is directly possible on that level; time permitting I'll come back.
Unreason
@Unreason: I'm still taking a look at it, myself. Thanks for any help you can offer.
Dave Jarvis
@Unreason: I've migrated to PostgreSQL. Still looking at the same query, but so far the query itself does not return for over a minute. MySQL did not have this problem. Once I get the query to be performant in PostgreSQL, I will come back to this question.
Dave Jarvis
@Dave, sorry could not get back to you earlier. Could you post create table statements and some test data? Either for postgres or for mysql. Don't have to pretty-format it.
Unreason
@Unreason: I was able to simplify the query by using a stored procedure (which creates a temporary table).
Dave Jarvis
A: 

As the SQL in the question was substantially changed (it now shows only the relevant parts), here is my new answer.

Assumption: The condition is really the same and no tricky column aliasing occurs between subquery and outer query

Answer: You can remove the where in the outer query.

SELECT
  /* aggregate data */
  ymxb.*
FROM (
  SELECT
    /* similar aggregate data */
  FROM
    /* same tables */
  WHERE
    /* some condition */
  GROUP BY
    YEAR
) ymxb
GROUP BY
  YEAR

This should give you the same result.

(Also note that you could have removed the inner WHERE and kept the outer one - the results should be the same, though the performance might not be.)

Finally, repeating the WHERE clause probably does not have a big impact on performance - evaluating extra conditions (even expressions such as SQRT, etc.) is very cheap compared to any I/O, and these conditions do not operate on any new columns, so all the I/O has already been done.

Furthermore, your inner query and outer query use the same GROUP BY, and the outer query gets all its data from the subquery.

This makes any aggregate functions in the outer query redundant (the rows from the subquery, which are the source for the outer query, have already been grouped by year).

That makes the whole subselect redundant.

Unreason
When I remove the inner query the results take much longer to retrieve. Strange.
Dave Jarvis
When I remove the outer query I get, "Column 'YEAR' cannot be null". I understand that, in theory, the extra WHERE clause should not have a big impact. In practice it appears that it does -- possibly because of the size of the data set, or I've not tested the speed properly.
Dave Jarvis
@Dave, Re performance: if you mean that when you remove the WHERE from the inner query the results take much longer to retrieve this is actually expected with MySQL (what happens probably is that MySQL calculates all of the aggregates and expressions for ALL rows in the table and then drops what you don't need; keeping where inside helps MySQL filter out unwanted data first and then it does less work)
Unreason
@Dave, Re NULL: not sure where you are getting the NULL from (are you maybe forgetting to show more relevant parts of the query? are you selecting only from ymxb, or are you joining more tables as you were doing initially? that would change the reasoning/answer). How many rows are your recordsets? I've been testing the above claims on a very slow machine, but also on a relatively small set of records (~100k), and repeating the WHERE condition was negligible.
Unreason
@Unreason: Yes, when the inner WHERE clause is removed, the query was taking so long that I killed it before it finished. The query as pasted in the question is exactly what I edit in iReport. When I remove the outer WHERE, I have to add some variables. The DAILY table has 272 million records, MONTH_REF 9 million, YEAR_REF about 165k, and STATION almost 8k.
Dave Jarvis
@Unreason: The query is not so bad as it stands (with the duplication). I still have to profile the MySQL server and figure out how to optimize it. For example, Handler_read_rnd_next is high (indicates table scans), Created_tmp_disk_tables needs to be increased, key_buffer_size is too small, and Opened_tables indicates an improperly sized table cache. Then there are hardware upgrades I can do, as well.
Dave Jarvis
+1  A: 

The problem is a bit more difficult than your generalization suggests. I would state it as follows:

SELECT a.group, func(a.group, avg_avg)
FROM a,
    (SELECT AVG(field1_avg) as avg_avg
     FROM (SELECT a.group, AVG(field1) as field1_avg
           FROM a
           WHERE (YOUR_CONDITION)
           GROUP BY a.group) as several_lines -- potentially
    ) as one_line -- always
WHERE (YOUR_CONDITION)
GROUP BY a.group -- again, potentially several lines

You have a subset of data (limited by your condition), which is grouped and an aggregation is made for each group. Then, you merge down aggregations to a single value and you want to apply a function of the value to each group again. Obviously, you can not reuse the condition until the result of the grouped subquery can be referenced as an entity.
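As an illustration of that three-level structure (not the actual report query), the same shape can be sketched in plain Python, using a hypothetical deviation-from-the-mean as the outer function:

```python
# The pattern from the answer, in plain Python: per-group averages
# (inner query), a global average of those averages (middle query),
# then a function of both applied back to every group (outer query).
from collections import defaultdict

rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]  # (group, field1) sample data

by_group = defaultdict(list)
for g, v in rows:
    by_group[g].append(v)

group_avg = {g: sum(vs) / len(vs) for g, vs in by_group.items()}  # inner
avg_avg = sum(group_avg.values()) / len(group_avg)                # middle

# Outer level: func(group, avg_avg) per group; here, the deviation.
result = {g: group_avg[g] - avg_avg for g in group_avg}
print(result)  # {'a': -4.0, 'b': 4.0}
```

The difficulty in SQL is precisely that `avg_avg` collapses to one row while the outer level must again see one row per group, which is what forces the subquery (or temporary table) in the first place.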

In MSSQL and Oracle, you would use the WITH clause. In MySQL the only option is to use a temporary table. I assume that there is more than one year in your report (otherwise, the query would be much simpler).

UPD: I am sorry, I cannot post the finished code now (I can do it tomorrow), but I have an idea:

You can concatenate the data you need to output in the subquery with GROUP_CONCAT, and split it back in the outer query with the FIND_IN_SET and SUBSTRING_INDEX functions. The outer query will JOIN only YEAR_REF and the result of the aggregation.

The condition in the outer query then will be just WHERE FIND_IN_SET(year, concatenated_years).
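To illustrate the mechanics outside MySQL, here is a small Python sketch that mimics FIND_IN_SET's contract (1-based position in a comma-separated list, 0 if absent) and the nested SUBSTRING_INDEX calls, using made-up concatenated strings:

```python
# Mimic the MySQL round trip: GROUP_CONCAT packs per-year values into
# strings; FIND_IN_SET + SUBSTRING_INDEX unpack them per output row.
grouped_years = "1900,1901,1902"    # GROUP_CONCAT(Y.YEAR)
grouped_amounts = "1.5|2.25|3.75"   # GROUP_CONCAT(AMOUNT SEPARATOR '|')

def find_in_set(value, csv):
    """1-based position of value in a comma-separated list, 0 if
    absent -- the same contract as MySQL's FIND_IN_SET."""
    items = csv.split(",")
    return items.index(value) + 1 if value in items else 0

def amount_for_year(year):
    pos = find_in_set(str(year), grouped_years)
    if pos == 0:
        return None
    # SUBSTRING_INDEX(SUBSTRING_INDEX(s, '|', pos), '|', -1):
    # keep the first `pos` fields, then take the last one of those.
    return grouped_amounts.split("|")[pos - 1]

print(amount_for_year(1901))  # 2.25
```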

UPD:

Here is the version that uses GROUP_CONCAT to pass the required data to the outer JOIN.

My comments start with --newtover:. By the way, 1) I do not think STRAIGHT_JOIN adds any benefit, and 2) COUNT(*) has a special meaning in MySQL and should be used when you want to count rows.

SELECT STRAIGHT_JOIN
  -- newtover: extract the corresponding amount back
  SUBSTRING_INDEX(SUBSTRING_INDEX(GROUPED_AMOUNTS, '|', @pos),'|', -1) as AMOUNT,
  Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
  Y.YEAR as YEAR,
  MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
  ymxb.SLOPE,
  ymxb.INTERCEPT,
  ymxb.CORRELATION,
  ymxb.MEASUREMENTS
FROM
  -- newtover: list of tables now contains only the subquery, YEAR_REF for grouping and init_vars to define the variable
  YEAR_REF Y,
  (SELECT
    SUM(MEASUREMENTS) as MEASUREMENTS,
    ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
    ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
    (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
    (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
    ((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
    (stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION,
    -- newtover: grouped fields for matching years and the corresponding amounts
    GROUP_CONCAT(Y.YEAR) as GROUPED_YEARS,
    GROUP_CONCAT(AMOUNT SEPARATOR '|') as GROUPED_AMOUNTS
  FROM (
    SELECT STRAIGHT_JOIN
      COUNT(1) as MEASUREMENTS,
      AVG(D.AMOUNT) as AMOUNT,
      Y.YEAR as YEAR
    FROM
      CITY C,
      STATION S,
      STATION_DISTRICT SD,
      YEAR_REF Y,
      MONTH_REF M,
      DAILY D
    WHERE
      -- For a specific city ...
      $X{ IN, C.ID, CityCode } AND
      -- Find all the stations within a specific unit radius ...
      6371.009 *
      SQRT(
        POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
        (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
         POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
      SD.ID = S.STATION_DISTRICT_ID AND
      -- Gather all known years for that station ...
      Y.STATION_DISTRICT_ID = SD.ID AND
      -- The data before 1900 is shaky; insufficient after 2009.
      Y.YEAR BETWEEN 1900 AND 2009 AND
      -- Filtered by all known months ...
      M.YEAR_REF_ID = Y.ID AND
      -- Whittled down by category ...
      M.CATEGORY_ID = $P{CategoryCode} AND
      -- Into the valid daily climate data.
      M.ID = D.MONTH_REF_ID AND
      D.DAILY_FLAG_ID <> 'M'
    GROUP BY
      Y.YEAR
  ) t
) ymxb,
(SELECT @pos:=NULL) as init_vars
WHERE
    -- newtover: check if the year is in the list and store the index into the variable
    @pos:=CAST(FIND_IN_SET(Y.YEAR, GROUPED_YEARS) as UNSIGNED)
GROUP BY
  Y.YEAR
newtover
True enough; updated accordingly. Thank you!
Dave Jarvis
@Dave Jarvis: I updated my answer with a possible solution.
newtover
@newtover: The question has been re-tagged with PostgreSQL, which I believe supports the `WITH` keyword.
Dave Jarvis
@Dave Jarvis: I updated my answer with the MySQL query that uses `GROUP_CONCAT`. Check if that helps.
newtover
A: 

Are you able to use a temp table in your situation? Although it still requires you to use the WHERE clause twice, it should considerably boost your performance.

DROP TEMPORARY TABLE IF EXISTS TEMP_DATA;

CREATE TEMPORARY TABLE TEMP_DATA
    (SELECT AVG(field1_avg) as avg_avg
     FROM (SELECT a.group, AVG(field1) as field1_avg
           FROM a
           WHERE (SOME_CONDITION)
           GROUP BY a.group) as grouped
    );

SELECT t.group, func(t.group, t.avg_avg)
FROM TEMP_DATA AS t
WHERE (SOME_CONDITION)
GROUP BY t.group

Hope this helps! --Dubs

Dubs
Cannot use anything but a single SQL statement as the statement must be executed by JasperReports, which does not allow multiple statements as its query.
Dave Jarvis
There may be a way around this single-SQL-statement limitation. I found a couple of examples of people calling complex SQL queries with JasperReports. They weren't pretty, but it looked like they would work: http://www.google.com/search?q=jasperreports+queryexecuter+"temp+table"
Dubs