It's been a while since I was in college and knew how to calculate a best fit line, but I find myself needing to. Suppose I have a set of points, and I want to find the line that best fits those points.

What is the equation to determine a best fit line? How would I do that with PHP?

A: 

An often used approach is to iteratively minimize the sum of squared y-differences between your points and the fit function.
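To illustrate the iterative idea, here is a minimal sketch using plain gradient descent on the mean squared y-difference. The function name `fit_line_iterative` and the defaults for `$rate` and `$steps` are assumptions made for this example, not something from the answer; the step size must be small relative to the scale of your x values or the loop will diverge.

```php
<?php
// Hypothetical helper: fits y = a + b*x by gradient descent on the
// mean squared error. $rate and $steps are illustrative, untuned defaults.
function fit_line_iterative(array $xs, array $ys, float $rate = 0.01, int $steps = 20000): array
{
    $n = count($xs);
    $a = 0.0; // intercept estimate
    $b = 0.0; // slope estimate
    for ($i = 0; $i < $steps; $i++) {
        $grad_a = 0.0;
        $grad_b = 0.0;
        foreach ($xs as $k => $x) {
            // Signed y-difference between the current line and the point.
            $err = ($a + $b * $x) - $ys[$k];
            $grad_a += 2 * $err / $n;
            $grad_b += 2 * $err * $x / $n;
        }
        // Step against the gradient to reduce the squared error.
        $a -= $rate * $grad_a;
        $b -= $rate * $grad_b;
    }
    return [$a, $b];
}

// Points taken from the line y = 1 + 2x.
list($a, $b) = fit_line_iterative([0, 1, 2, 3, 4], [1, 3, 5, 7, 9]);
echo "y = $a + $b x\n";
```

For a straight line this is more work than necessary, since the minimum can be computed directly, but the same loop carries over to fit functions that have no closed-form solution.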

Svante
+3  A: 

Although you can use an iterative approach, you can directly calculate the slope and intercept of a line given a set of observations using a least-squares approach. See the "Univariate Linear Case" section of the Wikipedia article on linear regression for how to calculate the coefficients a and b in y = a + bx given sets of (x,y) points.
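For reference, the standard closed-form least-squares result for n points, written with the same a (intercept) and b (slope) as above, is sketched below. The bar notation for the means is my own shorthand, not taken from the article.

```latex
b = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2},
\qquad
a = \bar{y} - b\,\bar{x},
\quad \text{where } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\;
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i .
```

The denominator is zero when all x values are identical, in which case no slope can be computed.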

Tim Whitcomb
+5  A: 

Use the method of least squares: http://en.wikipedia.org/wiki/Least_squares. The book Numerical Recipes 3rd Edition: The Art of Scientific Computing has the algorithms you need to implement least squares and other techniques.

+1  A: 

You may want to check out linear regression, or more generally, curve fitting.

Zach Scrivena
+2  A: 

Here's an article comparing two ways to fit a line to data. One thing to watch out for is that there is a direct solution that is correct in theory but can have numerical problems. The article shows why that method can fail and gives another method that is better.
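As a rough illustration of the numerical issue: the direct formula subtracts two large, nearly equal sums, which can lose precision when the x values are large compared with their spread. One common way around that, which may or may not be the method the article recommends, is to compute the means first and then sum deviations from them. The function name `fit_line_centered` is made up for this sketch.

```php
<?php
// Hypothetical helper: fits y = a + b*x in two passes, using sums of
// deviations from the means instead of raw sums of x*y and x*x.
function fit_line_centered(array $xs, array $ys): array
{
    $n = count($xs);
    if ($n < 2 || $n !== count($ys)) {
        throw new InvalidArgumentException('Need at least two (x, y) pairs.');
    }
    // Reindex so both arrays can be walked by position.
    $xs = array_values($xs);
    $ys = array_values($ys);

    // First pass: the means.
    $mean_x = array_sum($xs) / $n;
    $mean_y = array_sum($ys) / $n;

    // Second pass: sums of centered squares and products.
    $sxx = 0.0;
    $sxy = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx = $xs[$i] - $mean_x;
        $sxx += $dx * $dx;
        $sxy += $dx * ($ys[$i] - $mean_y);
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('All x values are identical; the slope is undefined.');
    }

    $b = $sxy / $sxx;             // slope
    $a = $mean_y - $b * $mean_x;  // intercept
    return [$a, $b];
}

// x values with a large offset and a small spread, where raw sums lose digits.
list($a, $b) = fit_line_centered([1e9, 1e9 + 1, 1e9 + 2], [1, 3, 5]);
echo "y = $a + $b x\n";
```

The cost is a second pass over the data; if the points arrive as a stream, a running-update scheme would be needed instead.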

John D. Cook
+2  A: 

Implemented from wiki page, untested.

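// $data maps each x value to its y value: array(x => y, ...).
// Caveat: PHP array keys are integers or strings, so float x values
// used as keys are truncated to integers.
$sx = 0;   // sum of x
$sy = 0;   // sum of y
$sxy = 0;  // sum of x*y
$sx2 = 0;  // sum of x*x
$n = count($data);
foreach ($data as $x => $y)
{
    $sx += $x;
    $sy += $y;
    $sxy += $x * $y;
    $sx2 += $x * $x;
}
// The denominator is zero when there are fewer than two distinct x values.
$denom = $n*$sx2 - $sx*$sx;
if ($denom == 0)
{
    die("Cannot fit a line: need at least two distinct x values.");
}
$beta = ($n*$sxy - $sx*$sy) / $denom;   // slope
$alpha = $sy/$n - $sx*$beta/$n;         // intercept

echo "y = $alpha + $beta x";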
FryGuy
+1  A: 

Of additional interest is probably how good a fit the line is. For that, use the Pearson correlation coefficient, here as a PHP function:

/**
 * returns the pearson correlation coefficient, a measure of how closely
 * the points follow a straight line
 *
 * @param array $x array of all x vals
 * @param array $y array of all y vals (matched to $x by key)
 * @return float coefficient between -1 and 1 (0 if it is undefined)
 */
function pearson(array $x, array $y)
{
    // only keys present in both arrays form usable (x, y) pairs
    $keys = array_keys(array_intersect_key($x, $y));

    // number of paired values (not count($x), which may include unpaired keys)
    $n = count($keys);
    if ($n == 0) {
        return 0;
    }

    // get all needed values as we step through the common keys
    $x_sum = 0;
    $y_sum = 0;
    $x_sum_sq = 0;
    $y_sum_sq = 0;
    $prod_sum = 0;
    foreach($keys as $k)
    {
        $x_sum += $x[$k];
        $y_sum += $y[$k];
        $x_sum_sq += pow($x[$k], 2);
        $y_sum_sq += pow($y[$k], 2);
        $prod_sum += $x[$k] * $y[$k];
    }

    $numerator = $prod_sum - ($x_sum * $y_sum / $n);
    $denominator = sqrt( ($x_sum_sq - pow($x_sum, 2) / $n) * ($y_sum_sq - pow($y_sum, 2) / $n) );

    // a zero denominator means all x or all y values are equal
    return $denominator == 0 ? 0 : $numerator / $denominator;
}
ruquay
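btw, the Pearson coefficient ranges from -1.0 to 1.0: 0 means no linear correlation, and 1.0 or -1.0 means the points lie exactly on a straight line with positive or negative slope. Its square ranges from 0 to 1.0 and is often quoted as the goodness of fit.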
ruquay