views: 3735
answers: 4

I'm using Python and Numpy to calculate a best fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).

This much works, but I also want to calculate r (the coefficient of correlation) and r-squared (the coefficient of determination). I am comparing my results with Excel's best-fit trendline capability and the r-squared value it calculates. From this, I know I am calculating r-squared correctly for a linear best fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.

Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?

Here's my function:

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)
    # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    correlation = numpy.corrcoef(x, y)[0, 1]

    # r
    results['correlation'] = correlation
    # r-squared
    results['determination'] = correlation**2

    return results
+1  A: 

R-squared is a statistic that only applies to linear regression.

Essentially, it measures how much variation in your data can be explained by the linear regression.

So, you calculate the "Total Sum of Squares", which is the total squared deviation of each of your outcome variables from their mean:

\sum_i (y_i - \bar{y})^2

where \bar{y} is the mean of the y's.

Then, you calculate the "regression sum of squares", which is how much your FITTED values differ from the mean:

\sum_i (\hat{y}_i - \bar{y})^2

and find the ratio of those two.

Now, all you would have to do for a polynomial fit is plug in the \hat{y}_i's from that model, but it's not accurate to call that r-squared.
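As a minimal sketch of that calculation (the data here is made up purely for illustration):

import numpy

# hypothetical example data
x = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = numpy.array([1.1, 1.9, 3.2, 3.8, 5.1])

coeffs = numpy.polyfit(x, y, 2)       # quadratic fit
yhat = numpy.polyval(coeffs, x)       # fitted values from the model
ybar = y.mean()

sstot = numpy.sum((y - ybar)**2)      # total sum of squares
ssreg = numpy.sum((yhat - ybar)**2)   # regression sum of squares
print(ssreg / sstot)                  # the ratio described above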

Here is a link I found that speaks to it a little.

Baltimark
This seems to be the root of my problem. How does Excel get a different r-squared value for a polynomial fit vs. a linear regression then?
Travis Beale
Are you just giving Excel the fits from a linear regression and the fits from a polynomial model? It will calculate r-squared from two arrays of data and simply assume that you're giving it the fits from a linear model. What are you giving Excel? What is the 'best fit trendline' command in Excel?
Baltimark
It's part of the graphing functions of Excel. You can plot some data, right-click on it, then choose from several different types of trend lines. There is the option to see the equation of the line as well as an r-squared value for each type. The r-squared value is also different for each type.
Travis Beale
@Travis Beale -- you are going to get a different r-squared for each different mean function you try (unless two models are nested and the extra coefficients in the larger model all work out to be 0). So of course Excel gives different r-squared values. @Baltimark -- this is linear regression, so it is r-squared.
leif
+5  A: 

From the numpy.polyfit documentation, it is fitting a linear regression. Specifically, numpy.polyfit with degree 'd' fits a linear regression with the mean function

E(y|x) = p_d * x^d + p_{d-1} * x^(d-1) + ... + p_1 * x + p_0

So you just need to calculate the R^2 for that fit. The Wikipedia page on linear regression gives full details. You are interested in R^2, which you can calculate in a couple of ways, the easiest probably being

SST = Sum(i=1..n) (y_i - y_bar)^2
SSReg = Sum(i=1..n) (y_ihat - y_bar)^2
Rsquared = SSReg/SST

where I use 'y_bar' for the mean of the y's, and 'y_ihat' for the fitted value at each point.

I'm not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct:

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)

    # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    # r-squared
    p = numpy.poly1d(coeffs)
    # fit values, and mean
    yhat = [p(z) for z in x]
    ybar = sum(y)/len(y)
    ssreg = sum([ (yihat - ybar)**2 for yihat in yhat])
    sstot = sum([ (yi - ybar)**2 for yi in y])
    results['determination'] = ssreg / sstot

    return results
leif
Thank you, this explanation is very clear to me. I'm going to try this one out.
Travis Beale
No problem, glad to help.
leif
Exactly what I was looking for.
Travis Beale
I just want to point out that using the numpy array functions instead of list comprehensions will be much faster, e.g. numpy.sum((y - ybar)**2), and easier to read.
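For instance, leif's r-squared calculation could be rewritten with array operations, along these lines (a sketch; the function name is just illustrative):

import numpy

def polyfit_rsquared(x, y, degree):
    # same r-squared calculation as above, but with numpy array operations
    x, y = numpy.asarray(x), numpy.asarray(y)
    coeffs = numpy.polyfit(x, y, degree)
    yhat = numpy.polyval(coeffs, x)   # evaluates the fit on the whole array at once
    ybar = numpy.mean(y)
    ssreg = numpy.sum((yhat - ybar)**2)
    sstot = numpy.sum((y - ybar)**2)
    return ssreg / sstot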
A: 

The Wikipedia article on r-squared suggests that it may be used for general model fitting rather than just linear regression.

Glenton
+4  A: 

A very late reply, but just in case someone needs a ready-made function for this :)

scipy.stats.stats.linregress
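A sketch of how it could be used (note that linregress handles only the straight-line case, i.e. degree 1; the data below is made up):

from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(r_value**2)   # r-squared for the linear fit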

Gökhan Sever