ansaurus

Question

Optimal two variable linear regression calculation

Answer 1

A:

This has now been verified as correct:

SELECT
  ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
  (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,

  ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
  (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
  (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,

  ((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
  (stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION
FROM (
  SELECT
    AVG(D.AMOUNT) as AMOUNT,
    Y.YEAR as YEAR
  FROM
    CITY C,
    STATION S,
    YEAR_REF Y,
    MONTH_REF M,
    DAILY D
  WHERE
    C.ID = 8590 AND

    SQRT(
      POW( C.LATITUDE - S.LATITUDE, 2 ) +
      POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < 15 AND

    S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND

    Y.YEAR BETWEEN 1900 AND 2009 AND

    M.YEAR_REF_ID = Y.ID AND

    M.CATEGORY_ID = '001' AND

    M.ID = D.MONTH_REF_ID AND
    D.DAILY_FLAG_ID <> 'M'
  GROUP BY
    Y.YEAR
) t

See the image for details on slope, intercept, and (Pearson's) correlation.

Dave Jarvis 2010-05-09 23:28:10

Congratulations. Maybe someone should talk to you about R http://www.r-project.org/ You could do it with a one-liner

gd047 2010-05-10 09:52:13

@gd047: I never even thought to look for an R integration for MySQL. http://www.econ.uiuc.edu/~roger/research/rq/RMySQL.html Will check it out! Thanks for the tip.

Dave Jarvis 2010-05-10 15:48:43

Keep in mind that response variables in regression are assumed to be independent of each other. This assumption is not appropriate sometimes, if the data are collected over time (which is true in your case) and the observations that are taken close together are related. Lack of independence causes no bias in least squares estimates of the coefficients, but standard errors are seriously affected.

gd047 2010-05-11 11:11:40

@gd047: Would you mind telling me a bit more about that? Or point me to a website (or Wikipedia page) that discusses what to do when there is a dependency?

Dave Jarvis 2010-05-11 16:07:10

google for "linear regression assumptions" independence "serial correlation" i.e. http://www.basic.northwestern.edu/statguidefiles/mulreg_ass_viol.html#Lack%20of%20independence

gd047 2010-05-11 16:29:17

If you do things in R it will be easy for you to try more complicated models. Autoregressive models would probably be useful here.

Jonathan Chang 2010-05-11 20:25:37

@Jonathan and gd047: Switching to R was *far* easier said than done. I finally have PostgreSQL running at decent speeds where the reports can be generated with the users falling asleep. Integration with R should be relatively easy now.

Dave Jarvis 2010-05-27 22:35:48

Answer 2

A:

Try to split up the function, you have miscalculated the parameters. Have a look here for reference.

I would do something like the following (please excuse the fact that I don't remember much about SQL syntax and temporary variables, so the code might actually be wrong):

SELECT

sum(t.YEAR) / count(1) AS avgX,

sum(t.AMOUNT) / count(1) AS avgY,

sum(t.AMOUNT*t.YEAR) / count(1) AS avgXY,

sum(power(t.YEAR, 2)) / count(1) AS avgXsq,

( avgXY - avgX * avgY ) / ( avgXsq - power(avgX, 2) )  as SLOPE,

avgY - SLOPE * avgX as INTERCEPT,

NeXuS 2010-05-12 05:21:52

The numbers are okay now; there was a conflict of data somewhere before.

Dave Jarvis 2010-05-12 05:59:24

ansaurus

tags:

views:

answers:

Optimal two variable linear regression calculation

related questions