I'm looking for a non-linear curve-fitting routine (most likely to be found in R or Python, but I'm open to other languages) which would take x,y data and fit a curve to it.

I should be able to specify as a string the type of expression I want to fit.

Examples:

"A+B*x+C*x*x"
"(A+B*x+C*x*x)/(D*x+E*x*x)"
"sin(A+B*x)*exp(C+D*x)+E+F*x"

What I would get out of this is, at least, the values for the constants (A, B, C, etc.) and, hopefully, stats about the goodness of fit.

There are commercial programs to do this, but I expected something as common as fitting to a desired expression to be available in a language library nowadays. I suspect SciPy's optimization routines might be able to do this, but I can't see that they let me define an equation. Likewise, I can't seem to find exactly what I want in R.

Is what I'm looking for out there, or do I need to roll my own? I hate to do it if it's there and I'm just having trouble finding it.


Edit: I want to do this to get a bit more control over the process than I get from LAB Fit. The LAB Fit UI is dreadful. I'd also like to be able to break the range into multiple pieces and have different curves represent the different pieces of the range. In the end, the result has to be able to beat (speed-wise) a LUT with linear interpolation, or I'm not interested.

In my current set of problems, I have trig functions or exp() and I need to execute them 352,800 times per second in real time (and use only a fraction of the CPU). So I plot the curve and use the data to drive the curve fitter to get less expensive approximations. In the old days, LUTs were almost always the solution, but nowadays skipping the memory lookups and doing an approximation is sometimes faster.

+1  A: 

Check out GNU Octave -- between its polyfit() and the nonlinear constraint solvers, it ought to be possible to construct something suitable for your problem.

moonshadow
I actually use Octave sometimes. I'll see what I can figure out.
Nosredna
+5  A: 

Your first model is actually linear in the three parameters and can be fit in R using

 fit <- lm(y ~ x + I(x^2), data=X)

which will get you your three parameters.

The second model can also be fit using nls() in R, with the usual caveats of having to provide starting values etc. The statistical issues in optimization are not necessarily the same as the numerical issues -- you cannot just optimize any functional form, no matter which language you choose.
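For instance, a minimal sketch for the rational model (the starting values and the x,y names are placeholders; nls() may fail to converge if the starting values are far off, and the denominator vanishes at x = 0):

 # Hedged sketch: fit the rational model with nls().
 # Starting values for A..E are guesses and will usually need tuning.
 fit2 <- nls(y ~ (A + B*x + C*x^2) / (D*x + E*x^2),
             data = data.frame(x, y),
             start = list(A = 1, B = 1, C = 1, D = 1, E = 1))
 summary(fit2)  # estimates, standard errors, residual standard error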

Dirk Eddelbuettel
Although you'd be better off with `y ~ poly(x, 2)` or `y ~ ns(x, 2)` (ns() is in the splines package).
hadley
+1  A: 

You're probably not going to find a single routine with the flexibility implied in your examples (polynomials and rational functions using the same routine), let alone one that will parse a string to figure out what sort of equation to fit.

A least-squares polynomial fitter would be appropriate for your first example. (It's up to you what degree polynomial to use -- quadratic, cubic, quartic, etc.) For a rational function like your second example, you might have to "roll your own" if you can't find a suitable library. Also, keep in mind that a sufficiently high-degree polynomial can approximate your "real" function, as long as you don't need to extrapolate beyond the limits of the data set you're fitting to.
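In R, for instance, one hypothetical way to sweep polynomial degrees and compare fits (assuming vectors x and y hold the data):

 # Hypothetical sketch: fit polynomials of increasing degree and
 # compare residual standard errors; smaller means a closer fit.
 fits <- lapply(1:6, function(d) lm(y ~ poly(x, d)))
 sapply(fits, function(f) summary(f)$sigma)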

As others have noted, there are other, more generalized parameter estimation algorithms which might also prove useful. But those algorithms aren't quite "plug and play": they usually require you to write some helper routines, and supply a list of initial values for the model parameters. It's possible for these sorts of algorithms to diverge, or get stuck in a local minimum or maximum for an unlucky choice of initial parameter estimates.

Jim Lewis
When I use the commercial products, I typically have _no idea_ what will work best. LAB Fit will try several hundred equations to see what fits the data best in the range I specify.
Nosredna
I hadn't considered that use case -- if you're in the early stages of trying to characterize a data set, it does make sense to try several families of functions (linear, polynomial, power law, periodic...) to see what a good fit might look like. I'll edit my answer accordingly.
Jim Lewis
"It's possible for these sorts of algorithms to diverge..." Yeah, I assume the commercial programs just bail out when this happens during checking all the choices. They do let you play with initial values when you choose one expression at a time.
Nosredna
+1  A: 

In R, this is quite easy.

The built-in function is optim(). It takes as arguments a starting vector of parameter values and the function to minimize. You have to build your own error function, but that's really simple.

Then you call it like

 out = optim(1, err_fn)

where err_fn is

# Sum-of-squares error for the one-parameter model y = A*x.
# Assumes your x values are in eckses and your y values in data.
err_fn = function(A) {
    diff = 0
    for (i in seq_along(data)) {
        x = eckses[i]
        y = data[i]
        model_y = A * x    # the model; edit this line to change it
        diff = diff + (y - model_y)^2
    }
    return(diff)
}

This just assumes you have your x values in a vector called eckses and your y values in data. Change the model_y line as you see fit; you can even add more parameters.

It works fine on nonlinear models; I use it for four-dimensional e^x curves and it is very fast. The output includes the error value at the end of the fitting, which is a measure of how well the curve fits, given as a sum of squared differences (in my err_fn).

EDIT: If you NEED to take the model in as a string, you can have your user interface construct this whole model-fitting process as an R script and load it in to run. R can take text from STDIN or from a file, so it shouldn't be too hard to craft the string equivalent of this function and have it run optim() automatically.
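A minimal sketch of that idea (the model string, the parameter unpacking, and the eckses/data vectors are assumptions; also note that eval(parse(...)) on untrusted input is risky):

 # Hypothetical sketch: build a sum-of-squares objective from a model
 # string and minimize it with optim(). The string may refer to x and
 # to the parameters A and B; eckses and data hold the samples.
 model_str = "A*x + B"
 make_err_fn = function(model_str) {
     expr = parse(text = model_str)
     function(p) {
         model_y = eval(expr, list(x = eckses, A = p[1], B = p[2]))
         sum((data - model_y)^2)
     }
 }
 out = optim(c(1, 1), make_err_fn(model_str))
 out$par    # fitted A and B
 out$value  # residual sum of squares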

Karl
But why not use nls() in R?
Dirk Eddelbuettel
I don't use nls() for two reasons: first, I like being able to hand-craft the error function to be optimized; second, I'm not actually all that experienced with R. So nls() does just what I wrote up there? Neat.
Karl
My ultimate goal is to hand it a list of strings and have the code try them all out to find the best fit.
Nosredna
+1  A: 

If you have constraints on your coefficients, and you know that there is a specific type of function you'd want to fit to your data, and that function is a messy one where standard regression methods or other curve-fitting methods won't work, have you considered genetic algorithms?

They're not my first choice, but if you are trying to find the coefficients of the second function you mentioned, then perhaps GAs would work -- especially if you are using non-standard metrics to evaluate the best fit. For example, if you wanted to find the coefficients of "(A+Bx+Cx^2)/(Dx+Ex^2)" such that the sum of squared differences between your function and your data is minimal, and such that there is some constraint on the arclength of the resulting function, then a stochastic algorithm might be a good way to approach this.

Some caveats: 1) stochastic algorithms won't guarantee the best solution, but they will often come very close; 2) you have to be careful about the stability of the algorithm.

On a broader note, if you are at the stage where you want to find a function from some space of functions that best fits your data (i.e., you are not going to impose, say, the second model on your data), then genetic programming techniques may also help.
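A full GA needs more machinery than fits in an answer, but as a taste of the stochastic approach, here is a hypothetical R sketch using simulated annealing (optim()'s "SANN" method -- a different stochastic method than a GA) to fit the rational model by sum of squared differences, assuming x and y hold the data:

 # Hypothetical sketch: stochastic fit of (A+B*x+C*x^2)/(D*x+E*x^2)
 # via simulated annealing; not a GA, but also a stochastic search.
 sse <- function(p) {
     yhat <- (p[1] + p[2]*x + p[3]*x^2) / (p[4]*x + p[5]*x^2)
     sum((y - yhat)^2)
 }
 out <- optim(c(1, 1, 1, 1, 1), sse, method = "SANN")
 out$par  # A, B, C, D, E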

B Rivera
That's an interesting idea. I'll think about that. Obviously, it would be slow. The commercial programs run through hundreds of equation forms in seconds.
Nosredna
Yes, another downside is that stochastic algorithms can be slow. On the upside, though, it's possible to obtain an equation form outside the set that commercial programs run through: by allowing a genetic program to search through *classes* of functions (with operations on those functions), such as power functions, exponentials, logarithms, trig functions, and pdfs/cdfs, it's possible to find a solution not given by a fixed set of equation forms. But, again on the downside, this requires a reasonable up-front coding effort which may not be worth the while.
B Rivera
I'm always up for a Quixotic adventure.
Nosredna
+1  A: 

To answer your question in a general sense (regarding parameter estimation in R) without considering the specifics of the equations you pointed out, I think you are looking for nls() or optim(). nls() is my first choice, as it provides error estimates for each estimated parameter; when it fails, I use optim(). If you have your x,y variables:

out <- tryCatch(
    nls(y ~ A + B*x + C*x*x,
        data = data.frame(x, y),
        start = c(A = 0, B = 1, C = 1)),
    error = function(e)
        optim(c(A = 0, B = 1, C = 1),
              function(p, x, y) sum((y - with(as.list(p), A + B*x + C*x^2))^2),
              x = x, y = y))

To get the coefficients, use something like

getcoef <- function(x) if (inherits(x, "nls")) coef(x) else x$par
getcoef(out)

If you want the standard errors in the case of 'nls',

summary(out)$parameters

The help files and r-help mailing list posts contain many discussions of the specific minimization algorithms implemented by each function (the examples above use the defaults) and their appropriateness for the particular form of the equation at hand. Certain algorithms can handle box constraints, and another function, constrOptim(), handles sets of linear constraints. This website may also help:

http://cran.r-project.org/web/views/Optimization.html
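As an illustration of the box-constraint point above, a hypothetical sketch using optim()'s "L-BFGS-B" method to keep B and C non-negative (the bounds here are made up for the example):

 # Hypothetical sketch: box-constrained least squares with L-BFGS-B.
 obj <- function(p, x, y) sum((y - with(as.list(p), A + B*x + C*x^2))^2)
 optim(c(A = 0, B = 1, C = 1), obj, x = x, y = y,
       method = "L-BFGS-B", lower = c(-Inf, 0, 0))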

Stephen
Can I feed the formula in as a string?
Nosredna
Yes -- something like as.formula(paste("y","A+B*x+C*x^2",sep="~")) should do it.
Stephen
That was for the nls case; for optim, something like eval(parse(text=sprintf("sum((y-%s)^2)","A+B*x+C*x^2"))) should work (the sprintf construction is shown so you can insert the formula that you desire).
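Putting those two comments together, a hypothetical end-to-end sketch (the names in start must match the parameters appearing in the model string, and nls() can fail on noise-free data):

 # Hypothetical sketch: fit a model supplied as a string via nls().
 model <- "A+B*x+C*x^2"
 f <- as.formula(paste("y", model, sep = "~"))
 fit <- nls(f, data = data.frame(x, y), start = c(A = 0, B = 1, C = 1))
 coef(fit)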
Stephen