I have a bunch of data, generally in the form a, b, c, ..., y

where y = f(a, b, c...)

Most of them have three or four variables and 10k - 10M records. My general assumption is that they are algebraic in nature, something like:

y = P1 a^E1 + P2 b^E2 + P3 c^E3

Unfortunately, my last statistical analysis class was 20 years ago. What is the easiest way to get a good approximation of f? Open source tools with a very minimal learning curve (i.e. something where I could get a decent approximation in an hour or so) would be ideal. Thanks!

+2  A: 

The basics of data fitting involve assuming a general form for the solution, guessing some initial values for the constants, and then iterating to minimize the error of the guessed solution, usually in the least-squares sense.

Look into R or Octave for open source tools. They are both capable of least-squares analysis, with several tutorials just a Google search away.

Edit: Octave code for estimating the coefficients for a 2nd order polynomial

x = 0:0.1:10;
y = 5.*x.^2 + 4.*x + 3;

% Add noise to y data
y = y + randn(size(y))*0.1;

% Estimate coefficients of polynomial
p = polyfit(x,y,2)

On my machine, I get:

p =

   5.0886   3.9050   2.9577
Scottie T
Thanks, I have...that's why I said "very minimal learning curve"! Those are both excellent general purpose statistical languages, but have a pretty hefty learning curve (IMHO).
I see. I'd think that, with simple functions, it shouldn't take too long to get up to speed with either tool, or even to do this in Python or Perl.
Scottie T
I'd think that they are relatively simple (I added detail to the question), and I've already spent an hour or so on Google, which is why I've turned here ;-)
Unfortunately, polyfit will only work for functions of a single variable, f(x). The OP specifically mentioned non-linear multi-dimensional fitting, which I doubt Octave supports out of the box.
Kena
I don't think you're going to get much simpler than that Octave code (or Numpy/Scipy in Python, which has nearly the same syntax - see http://www.scipy.org ).
David Zaslavsky
A: 

If you have a guess at the form of f,[*] you need a minimizer to find the optimal parameters. The tools Scottie T suggests would work, as would ROOT, and many others.

If you don't have a clue what form f might take, you are in deep trouble indeed.


[*] That is, you know that

f = f(x,y,z,w,...;p1,p2,p3...)

where the ps are parameters and the coordinates are x, y...
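
For instance (a sketch of mine, not code from this answer), SciPy's curve_fit wraps such a minimizer around a user-supplied model function; the data below are made up purely to show the call, and the initial guess matters for convergence:

import numpy as np
from scipy.optimize import curve_fit

def model(X, p1, e1, p2, e2, p3, e3):
    # X is a (3, n) array holding the coordinates a, b, c
    a, b, c = X
    return p1 * a**e1 + p2 * b**e2 + p3 * c**e3

# made-up data purely to show the call; substitute your real columns here
a = np.random.uniform(1, 10, 1000)
b = np.random.uniform(1, 10, 1000)
c = np.random.uniform(1, 10, 1000)
y = 2 * a**1.5 + 3 * b**0.5 + 0.5 * c**2

params, cov = curve_fit(model, np.vstack((a, b, c)), y, p0=[1, 1, 1, 1, 1, 1])
print(params)   # with luck, roughly [2, 1.5, 3, 0.5, 0.5, 2]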

dmckee
+1  A: 

There's a tool for fitting 1D and 2D curves at zunzun.com, but I don't think it goes beyond two variables. Likewise, Matlab doesn't support fitting in more than two dimensions (as far as I know), and it's certainly not free.

Otherwise, you might be able to find part of your solution in Numerical Recipes.

But as other posters indicated, you'll probably need at least a basic idea of your function model (which, hopefully, is linear or can be linearized, in which case you'll have a much larger array of solutions at your disposal).
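
For example (my illustration, not part of the original answer), a single power-law term y = P * a^E linearizes under logarithms, log(y) = log(P) + E*log(a), and can then be fitted with ordinary linear least squares; note that the sum-of-powers form in the question does not linearize this way.

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.5 * a ** 1.7                     # pretend data from y = P * a^E

# design matrix for log(y) = log(P) + E * log(a)
A = np.column_stack((np.ones(len(a)), np.log(a)))
(log_P, E), _, _, _ = np.linalg.lstsq(A, np.log(y), rcond=None)
print(np.exp(log_P), E)                # recovers P = 2.5, E = 1.7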

Kena
NR would be among the most powerful paths to follow, but not likely a minimal learning curve.
Scottie T
I agree. But I don't think this is the kind of problem where there's an easy way out.
Kena
A: 

Do you know to what power you want to limit your polynomial?

If there is no limit, then you can always get an exact match for N points by matching it to a polynomial that has N coefficients. To do this, you plug N different points into your equation, yielding N equations and N unknowns (the coefficients), which you can then use either simple high school algebra or a matrix to solve for the unknowns.
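
In NumPy terms (a sketch of mine, for the single-variable case), that N-equations-in-N-unknowns solve looks like this:

import numpy as np

# four points determine a cubic (four coefficients): set up and solve the
# 4x4 system exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([3.0, 12.0, 35.0, 78.0])

V = np.vander(x)                 # Vandermonde matrix; columns are x^3, x^2, x, 1
coeffs = np.linalg.solve(V, y)   # exact solution of the N equations in N unknowns
print(coeffs)                    # highest-order coefficient first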

mbeckish
+1  A: 

In case it's useful, here's a Numpy/Scipy (Python) template to do what you want:

from numpy import array
from scipy.optimize import leastsq

def __residual(params, y, a, b, c):
    # difference between the model p0*a^e0 + p1*b^e1 + p2*c^e2 and the observed y
    p0, e0, p1, e1, p2, e2 = params
    return p0 * a ** e0 + p1 * b ** e1 + p2 * c ** e2 - y

# load a, b, c and y as numpy arrays
# guess initial values for p0, e0, p1, e1, p2, e2
p_opt, ier = leastsq(__residual, array([p0, e0, p1, e1, p2, e2]), args=(y, a, b, c))
print 'y = %f a^%f + %f b^%f + %f c^%f' % tuple(p_opt)

If you really want to understand what's going on, though, you're going to have to invest the time to scale the learning curve for some tool or programming environment - I really don't think there's any way around that. People don't generally write specialized tools for doing things like 3-term power regressions exclusively.

David Zaslavsky
scipy.odr (orthogonal distance regression) could be useful if a, b, c don't have infinite precision (least squares assumes infinite precision for the coordinates).
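
A minimal scipy.odr sketch for the question's model (my illustration; a, b, c and y are assumed to be already-loaded NumPy arrays of equal length):

import numpy as np
from scipy.odr import ODR, Model, RealData

def f(beta, x):
    # beta holds the six parameters; x is a (3, n) array of the coordinates a, b, c
    p1, e1, p2, e2, p3, e3 = beta
    a, b, c = x
    return p1 * a**e1 + p2 * b**e2 + p3 * c**e3

# a, b, c, y are assumed to be already-loaded NumPy arrays
data = RealData(np.vstack((a, b, c)), y)
odr = ODR(data, Model(f), beta0=[1, 1, 1, 1, 1, 1])
print(odr.run().beta)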
J.F. Sebastian
Surely the function requires some sample output to minimise towards, i.e. some sample y values given a set of a, b, c values?
Brendan
A: 

Short Answer: it isn't so simple. Consider a non-parametric approach on data sub-sets.

There are two main issues you need to decide about: (1) do you actually care about the parameters of the function, i.e. your P1, E1, ..., or would you be okay with just estimating the mean function, and (2) do you really need to estimate the function on all of the data?

The first thing I'll mention is that your specified function is non-linear (in the parameters to be estimated), so ordinary least squares won't work. Let's pretend that you specified a linear function. You'd still have a problem with the 10M values. Linear regression can be performed efficiently using QR factorization, but you are still left with an O(n * p^2) algorithm, where p is the number of parameters you are trying to estimate and n the number of records. If you want to estimate the non-linear mean function it gets much worse.
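
To make the linear case concrete (a sketch of mine, not leif's code), a three-variable linear model reduces to a single least-squares solve over the design matrix:

import numpy as np

n = 100000                                    # pretend we have 100k records
a, b, c = np.random.rand(3, n)
y = 1.0 + 2.0*a + 3.0*b - 4.0*c + 0.01*np.random.randn(n)

X = np.column_stack((np.ones(n), a, b, c))    # design matrix, p = 4 columns
coeffs, res, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)                                 # approximately [1, 2, 3, -4]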

The only way you are going to be able to estimate anything in such a large data set is by using a subset to perform the estimation. Basically, you randomly select a subset and use that to estimate the function.
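
The subsetting itself is a one-liner in NumPy (a sketch; a, b, c and y are assumed to be equal-length arrays holding the full data set):

import numpy as np

# a, b, c, y are assumed to be equal-length NumPy arrays of the full data set
sample_size = 10000
idx = np.random.choice(len(y), size=sample_size, replace=False)
a_s, b_s, c_s, y_s = a[idx], b[idx], c[idx], y[idx]
# ...then run whatever fit you choose on the sampled columns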

If you don't care about your parameter values, and just want to estimate the mean function you will probably be better off using a non-parametric estimation technique.
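
As one example of such a technique (my suggestion, not something this answer prescribes), k-nearest-neighbour regression from scikit-learn estimates the mean function without assuming any parametric form:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# a, b, c, y are assumed to be already-loaded NumPy arrays of equal length
X = np.column_stack((a, b, c))
knn = KNeighborsRegressor(n_neighbors=25)
knn.fit(X, y)

# estimated mean function value at the (made-up) point (a, b, c) = (1.0, 2.0, 3.0)
print(knn.predict([[1.0, 2.0, 3.0]]))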

Hopefully this helps.

leif