views: 3651 | answers: 14

Is there a way, given a set of values (x,f(x)), to find the polynomial of a given degree that best fits the data?

I know polynomial interpolation, which is for finding a polynomial of degree n given n+1 data points, but here there are a large number of values and we want to find a low-degree polynomial (find best linear fit, best quadratic, best cubic, etc.). It might be related to least squares...

More generally, I would like to know the answer when we have a multivariate function -- points like (x,y,f(x,y)), say -- and want to find the best polynomial (p(x,y)) of a given degree in the variables. (Specifically a polynomial, not splines or Fourier series.)

Both theory and code/libraries (preferably in Python, but any language is okay) would be useful.

+4  A: 

Yes, the way this is typically done is by using least squares. There are other ways of specifying how well a polynomial fits, but the theory is simplest for least squares. The general theory is called linear regression.

Your best bet is probably to start with Numerical Recipes.

R is free and will do everything you want and more, but it has a big learning curve.

If you have access to Mathematica, you can use the Fit function to do a least squares fit. I imagine Matlab and its open source counterpart Octave have a similar function.

John D. Cook
This is helpful, but do you know which of these do multivariate fitting?
ShreevatsaR
least squares can be multivariate. gil strang's "intro to applied mathematics" has a very nice, readable discussion.
duffymo
Yes, thanks... when I asked that comment about multivariate fitting I didn't know enough about least squares :-)
ShreevatsaR
+1  A: 

For the (x, f(x)) case:

import numpy

x = numpy.arange(10)
y = x**2

# least-squares fit of a degree-2 polynomial to the data
coeffs = numpy.polyfit(x, y, deg=2)
poly = numpy.poly1d(coeffs)
print(poly)

# evaluate the fitted polynomial and check the residuals
yp = numpy.polyval(poly, x)
print(yp - y)
J.F. Sebastian
+1  A: 

If you want to fit the (xi, f(xi)) to a polynomial of degree n, then you set up a linear least-squares problem with the data rows (1, xi, xi^2, ..., xi^n, f(xi)). This returns a set of coefficients (c0, c1, ..., cn) so that the best-fitting polynomial is y = c0 + c1*x + c2*x^2 + ... + cn*x^n.

You can generalize this to more than one independent variable by including powers of y and mixed terms in x and y in the problem.
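For example, here is a minimal numpy sketch of this setup for two variables and a full degree-2 polynomial (the data and variable names are made up for illustration):

import numpy as np

# hypothetical sample data: points (x_i, y_i, f(x_i, y_i))
x = np.random.rand(100)
y = np.random.rand(100)
f = 1.0 + 2.0*x - 3.0*y + 0.5*x*y + x**2

# design matrix for a full degree-2 polynomial in x and y:
# columns are 1, x, y, x^2, x*y, y^2
A = np.column_stack([np.ones_like(x), x, y, x**2, x*y, y**2])

# solve the linear least-squares problem A c ~= f
coeffs, residuals, rank, sv = np.linalg.lstsq(A, f, rcond=None)
print(coeffs)   # best-fit coefficients (c0, c1, ..., c5)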

David Norman
A: 

The Lagrange polynomial is in some sense the "simplest" interpolating polynomial that fits a given set of data points.

It is sometimes problematic because it can vary wildly between data points.
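If you want to try this in Python, scipy has a ready-made routine; a minimal sketch (the sample points are made up), keeping in mind this is interpolation rather than a best fit of a chosen lower degree:

import numpy as np
from scipy.interpolate import lagrange

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

poly = lagrange(x, y)            # degree-3 interpolating polynomial (a poly1d)
print(np.polyval(poly, x) - y)   # residuals ~0: it passes through every point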

Also, the Lagrange polynomial has degree n-1 with n points -- that's what I wrote in the question about polynomial interpolation -- it doesn't give the best-fit polynomial for a given degree.
ShreevatsaR
A: 

Not an answer--

I've always thought that for game development it would be really nice to have a program that let you draw some arbitrary (x, f(x)) style curve with the mouse, add numbers to the scales, and have it print out a formula to use for f.

Using a pure polynomial solution seems to me to have the problem that--well like I don't think it could come up with sin(x) if you input a sine wave. I could be wrong, I haven't done serious maths in decades.

Bill K
For visual applications, this is usually done with splines -- piecewise polynomials that fit together smoothly -- rather than with polynomials. Polynomials wiggle too much when they fit data.
John D. Cook
Well, technically you could. Transcendental functions can be represented to any degree of accuracy with a Taylor series. However, it takes a polynomial of *infinite* degree to *exactly* match the function, so I'd probably look at sinusoidal methods for this type of problem.
stalepretzel
+2  A: 

Lagrange polynomials (as @j w posted) give you an exact fit at the points you specify, but with polynomials of degree more than say 5 or 6 you can run into numerical instability.

Least squares gives you the "best fit" polynomial, with error defined as the sum of squares of the individual errors (take the distance along the y-axis between the points you have and the function that results, square them, and sum them up). The MATLAB polyfit function does this, and with multiple return arguments you can have it automatically take care of scaling/offset issues (e.g. if you have 100 points all between x=312.1 and 312.3 and you want a 6th-degree polynomial, you're going to want to calculate u = (x-312.2)/0.1 so the u-values are distributed between -1 and +1).
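A rough numpy sketch of the same rescaling idea (the data here is made up; the key point is fitting in u rather than x):

import numpy as np

# made-up data clustered in a narrow x-range
x = np.linspace(312.1, 312.3, 100)
y = np.sin(x)

# rescale x to u in [-1, 1] before fitting, to keep the problem well conditioned
mid = (x.max() + x.min()) / 2.0
half = (x.max() - x.min()) / 2.0
u = (x - mid) / half

coeffs = np.polyfit(u, y, deg=6)

# evaluate at a new point by applying the same transformation
x0 = 312.15
y0 = np.polyval(coeffs, (x0 - mid) / half)
print(y0, np.sin(x0))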

NOTE that the results of least-squares fits are strongly influenced by the distribution of x-axis values. If the x-values are equally spaced, then you'll get larger errors at the ends. If you have a case where you can choose the x-values and you care about the maximum deviation between your known function and an interpolating polynomial, then the use of Chebyshev polynomials will give you something that is close to the perfect minimax polynomial (which is very hard to calculate). This is discussed at some length in Numerical Recipes.

Edit: From what I gather, this all works well for functions of one variable. For multivariate functions it is likely to be much more difficult if the degree is more than, say, 2. I did find a reference on Google Books.

Jason S
Thanks for the reference, BTW. The relevant stuff is a few pages later; 4.10.4 on page 231. The same thing works for multivariate polynomials of higher degree too without great difficulty, although there is the concern of "overfitting".
ShreevatsaR
+12  A: 
ShreevatsaR
Chebyshev nodes work for least squares.
Jason S
@Jason: Are you sure that Chebyshev nodes are known to be the best points to choose even for least squares? [There also seems to be a different question of picking Chebyshev polynomials themselves as the approximating polynomials, for a different kind of fit from least squares -- the "minimax" polynomial.]
ShreevatsaR
df/dx = 0 does not necessarily mean that f is minimized; it might also be maximized.
quant_dev
True, but that's not what we're saying here. We're saying that (assuming the function has partial derivatives etc.), any minimum occurs (either on the boundary or) at a point where the gradient is 0.
ShreevatsaR
About your concerns on numerical stability: defining a polynomial as a "linear combination of monomials" is a dangerous thing to do, because (to say it in non-mathematical words) monomials above, say, degree 4 are terribly similar to each other in the area around 0 and then they just "grow crazy". A much better thing is to decide in which interval you are attempting to fit the data, redefine your independent variable so that you are actually fitting in (-1, 1), and look for a linear combination of good polynomials instead of monomials. I would use the Chebyshev set.
mariotomo
@mariotomo: Thanks, that makes sense after you say it :) Good point.
ShreevatsaR
+1  A: 

Bear in mind that a polynomial of higher degree always fits the data at least as well. However, higher-degree polynomials typically lead to highly improbable functions (overfitting; see Occam's Razor). You want to find a balance between simplicity (degree of the polynomial) and fit (e.g. least-squares error). Quantitatively, there are tests for this: the Akaike Information Criterion and the Bayesian Information Criterion. These tests give a score indicating which model is to be preferred.
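For illustration, a small sketch of using AIC to pick a degree, under the usual Gaussian-error assumption (so AIC = n*ln(RSS/n) + 2k, with k the number of coefficients); the data is made up:

import numpy as np

def aic_for_degree(x, y, deg):
    # least-squares fit, then AIC from the residual sum of squares
    coeffs = np.polyfit(x, y, deg)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    n, k = len(x), deg + 1
    return n * np.log(rss / n) + 2 * k

x = np.linspace(0, 1, 50)
y = 1 + 2*x - x**2 + 0.05*np.random.randn(50)
best = min(range(1, 8), key=lambda d: aic_for_degree(x, y, d))
print(best)   # degree with the lowest AIC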

Fredriku73
Right, I realised after a while that there was some trade-off between fit and simplicity, as you said. Thanks for the information about the criteria.
ShreevatsaR
A: 

It's called approximation theory, a vast and well-researched field of numerical analysis.

Sure it's probably a vast ocean, but it would be nice to know *something* from it. :-)
ShreevatsaR
A: 

Remember, there's a big difference between approximating the polynomial and finding an exact one.

For example, if I give you 4 points, you could

  1. Approximate a line with a method like least squares
  2. Approximate a parabola with a method like least squares
  3. Find an exact cubic function through these four points.

Be sure to select the method that's right for you!
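As a concrete numpy sketch of those three options on four made-up points:

import numpy as np

# four hypothetical data points
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 0.0, 5.0])

line  = np.polyfit(x, y, 1)   # 1. best-fit line (least squares)
parab = np.polyfit(x, y, 2)   # 2. best-fit parabola (least squares)
cubic = np.polyfit(x, y, 3)   # 3. exact cubic through all four points

print(np.polyval(cubic, x) - y)   # residuals ~0: the cubic interpolates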

stalepretzel
Yes, I know :-) That is why I mentioned "polynomial interpolation" in the question, which is for finding an exact cubic through four points, or an exact n-th degree polynomial through n+1 points.
ShreevatsaR
A: 

Some theory:

Vandermonde matrix: http://en.wikipedia.org/wiki/Vandermonde_matrix

Moore-Penrose pseudoinverse: http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse
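A tiny sketch of how the two connect in practice (least squares via the Vandermonde matrix and the pseudoinverse; the data is made up):

import numpy as np

x = np.linspace(0, 1, 20)
y = 3*x**2 - x + 2 + 0.01*np.random.randn(20)   # noisy sample data

# Vandermonde matrix for a degree-2 fit (columns x^2, x, 1)
V = np.vander(x, 3)

# least-squares coefficients via the Moore-Penrose pseudoinverse
coeffs = np.linalg.pinv(V) @ y
print(coeffs)   # roughly [3, -1, 2]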

Tony Arkles
A: 

Here's a VB6 code library that might be of use.

VB6 Polynomial Regression

A: 

It's rather easy to scare up a quick fit using Excel's matrix functions if you know how to represent the least squares problem as a linear algebra problem. (That depends on how reliable you think Excel is as a linear algebra solver.)

duffymo
+1  A: 

At college we had this book, which I still find extremely useful: Conte, de Boor; Elementary Numerical Analysis; McGraw-Hill. The relevant section is 6.2: Data Fitting.
The example code comes in FORTRAN, and the listings are not very readable either, but the explanations are deep and clear at the same time. You end up understanding what you are doing, not just doing it (which is my experience of Numerical Recipes).
I usually start with Numerical Recipes, but for things like this I quickly have to grab Conte-de Boor.

Maybe it's better to post some code... it's a bit stripped down, but the most relevant parts are there. It relies on numpy, obviously!

import numpy

def Tn(n, x):
  """Chebyshev polynomial of the first kind, T_n(x), by the recurrence relation."""
  if n == 0:
    return 1.0
  elif n == 1:
    return float(x)
  else:
    return (2.0 * x * Tn(n - 1, x)) - Tn(n - 2, x)

class Memoize:
  # (the Memoize helper was omitted from the original post; a minimal version)
  def __init__(self, f):
    self.f = f
    self.cache = {}
  def __call__(self, *args):
    if args not in self.cache:
      self.cache[args] = self.f(*args)
    return self.cache[args]

class ChebyshevFit:

  def __init__(self):
    self.Tn = Memoize(Tn)

  def fit(self, data, degree=None):
    """fit the data by a least-squares linear combination of Chebyshev polynomials.

    cf. Conte, de Boor; Elementary Numerical Analysis; McGraw-Hill (6.2: Data Fitting)
    """

    if degree is None:
      degree = 5

    data = sorted(data)
    self.range = start, end = (min(data)[0], max(data)[0])
    self.halfwidth = (end - start) / 2.0
    # map the x-values into (-1, 1), where the Chebyshev polynomials are well behaved
    vec_x = [(x - start - self.halfwidth)/self.halfwidth for (x, y) in data]
    vec_f = [y for (x, y) in data]

    # normal equations A c = b, with A_ij = <T_i, T_j> and b_i = <f, T_i>
    mat_phi = [numpy.array([self.Tn(i, x) for x in vec_x]) for i in range(degree+1)]
    mat_A = numpy.inner(mat_phi, mat_phi)
    vec_b = numpy.inner(vec_f, mat_phi)

    self.coefficients = numpy.linalg.solve(mat_A, vec_b)
    self.degree = degree

  def evaluate(self, x):
    """use the Clenshaw algorithm

    http://en.wikipedia.org/wiki/Clenshaw_algorithm
    """

    # map x into (-1, 1), then run the Clenshaw recurrence on the coefficients
    x = (x - self.range[0] - self.halfwidth) / self.halfwidth

    b_2 = float(self.coefficients[self.degree])
    b_1 = 2 * x * b_2 + float(self.coefficients[self.degree - 1])

    for i in range(2, self.degree):
      b_1, b_2 = 2.0 * x * b_1 + self.coefficients[self.degree - i] - b_2, b_1

    b_0 = x * b_1 + self.coefficients[0] - b_2

    return b_0
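A quick usage sketch of the class above (with made-up data), just to show the intended calling convention:

data = [(x, numpy.cos(x)) for x in numpy.linspace(0.0, 3.0, 50)]

cf = ChebyshevFit()
cf.fit(data, degree=5)
print(cf.evaluate(1.5), numpy.cos(1.5))   # the two values should be close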
mariotomo
Thanks again; that's quite clear. Why is it good to normalise the range to (-1,1), BTW?
ShreevatsaR
Because that's the range where the Chebyshev polynomials are well behaved. In fact, inside that range you can characterize them as T_n(x) = cos(n*acos(x)); this formula has no meaning for x outside (-1, 1).
mariotomo
I have tested my module against numpy.polyfit (just that single example from the page you point to) and I was a bit surprised to see that my fit matches numpy.polyfit (even for extrapolation) up to the 15th digit. I should try more badly conditioned cases... if they still match, then maybe numpy uses Chebyshev polynomials behind the scenes and returns the corresponding monomial coefficients...
mariotomo