views: 3735
answers: 4

I'm using Python and Numpy to calculate a best fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).

This much works, but I also want to calculate r (the coefficient of correlation) and r-squared (the coefficient of determination). I am comparing my results with Excel's best-fit trendline capability and the r-squared value it calculates. From this, I know I am calculating r-squared correctly for a linear best fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.

Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?

Here's my function:

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)
    # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    correlation = numpy.corrcoef(x, y)[0, 1]

    # r
    results['correlation'] = correlation
    # r-squared
    results['determination'] = correlation**2

    return results
+1  A: 

R-squared is a statistic that only applies to linear regression.

Essentially, it measures how much variation in your data can be explained by the linear regression.

So, you calculate the "Total Sum of Squares", which is the total squared deviation of each of your outcome variables from their mean:

\sum_i (y_i - \bar{y})^2

where \bar{y} is the mean of the y's.

Then, you calculate the "regression sum of squares", which is how much your FITTED values differ from the mean:

\sum_i (\hat{y}_i - \bar{y})^2

and find the ratio of those two.

Now, all you would have to do for a polynomial fit is plug in the \hat{y}_i's from that model, but it's not accurate to call that r-squared.
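As a minimal sketch of that calculation (the data here is made up purely for illustration):

import numpy

# hypothetical example data
x = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = numpy.array([1.1, 1.9, 3.2, 3.8, 5.1])

coeffs = numpy.polyfit(x, y, 2)       # quadratic fit
yhat = numpy.polyval(coeffs, x)       # fitted values from the model
ybar = y.mean()

sstot = numpy.sum((y - ybar)**2)      # total sum of squares
ssreg = numpy.sum((yhat - ybar)**2)   # regression sum of squares
print(ssreg / sstot)                  # the ratio described above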

Here is a link I found that speaks to it a little.

Baltimark
This seems to be the root of my problem. How does Excel get a different r-squared value for a polynomial fit vs. a linear regression then?
Travis Beale
Are you just giving Excel the fits from a linear regression and the fits from a polynomial model? It will calculate r-squared from two arrays of data and simply assume that you're giving it the fits from a linear model. What are you giving Excel? What is the 'best fit trendline' command in Excel?
Baltimark
It's part of the graphing functions of Excel. You can plot some data, right-click on it, then choose from several different types of trend lines. There is the option to see the equation of the line as well as an r-squared value for each type. The r-squared value is also different for each type.
Travis Beale
@Travis Beale -- you are going to get a different r-squared for each different mean function you try (unless two models are nested and the extra coefficients in the larger model all work out to be 0). So of course Excel gives different r-squared values. @Baltimark -- this is linear regression, so it is r-squared.
leif
+5  A: 

From the numpy.polyfit documentation, it is fitting a linear regression. Specifically, numpy.polyfit with degree 'd' fits a linear regression with the mean function

E(y|x) = p_d * x^d + p_{d-1} * x^(d-1) + ... + p_1 * x + p_0

So you just need to calculate the R^2 for that fit. The Wikipedia page on linear regression gives full details. You are interested in R^2, which you can calculate in a couple of ways, the easiest probably being

SST = Sum(i=1..n) (y_i - y_bar)^2
SSReg = Sum(i=1..n) (y_ihat - y_bar)^2
Rsquared = SSReg/SST

where I use 'y_bar' for the mean of the y's, and 'y_ihat' for the fitted value at each point.

I'm not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct:

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)

    # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    # r-squared
    p = numpy.poly1d(coeffs)
    # fit values, and mean
    yhat = [p(z) for z in x]
    ybar = sum(y)/len(y)
    ssreg = sum([ (yihat - ybar)**2 for yihat in yhat])
    sstot = sum([ (yi - ybar)**2 for yi in y])
    results['determination'] = ssreg / sstot

    return results
leif
Thank you, this explanation is very clear to me. I'm going to try this one out.
Travis Beale
No problem, glad to help.
leif
Exactly what I was looking for.
Travis Beale
I just want to point out that using the numpy array functions instead of list comprehensions will be much faster, e.g. numpy.sum((y - ybar)**2), and easier to read.
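For instance, leif's r-squared calculation could be rewritten with array operations, along these lines (a sketch; the function name is just illustrative):

import numpy

def polyfit_rsquared(x, y, degree):
    # same r-squared calculation as above, but with numpy array operations
    x, y = numpy.asarray(x), numpy.asarray(y)
    coeffs = numpy.polyfit(x, y, degree)
    yhat = numpy.polyval(coeffs, x)   # evaluates the fit on the whole array at once
    ybar = numpy.mean(y)
    ssreg = numpy.sum((yhat - ybar)**2)
    sstot = numpy.sum((y - ybar)**2)
    return ssreg / sstot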
A: 

The Wikipedia article on r-squared suggests that it may be used for general model fitting rather than just linear regression.

Glenton
+4  A: 

A very late reply, but just in case someone needs a ready-made function for this :)

scipy.stats.stats.linregress
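A sketch of how it could be used (note that linregress handles only the straight-line case, i.e. degree 1; the data below is made up):

from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(r_value**2)   # r-squared for the linear fit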

Gökhan Sever