views: 87
answers: 3

I've had an idea for a non-linear regression algorithm that I haven't seen before:

We fit a simple parametric function, such as a radial basis function, to the data using gradient descent. We then compute the residual of this fit and fit another function to the residual, repeating the process to reduce the error and build up a collection of superimposed functions. (I'm assuming the search can be persuaded to find the functions that fit the most points first.)

As presented, this algorithm would overfit. I think there are a few ways to overcome that, but probably the most obvious is to limit the number of functions that are fitted.
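A minimal sketch of what I have in mind (my assumptions: Gaussian RBFs in one dimension, scipy's least-squares fitter standing in for plain gradient descent, and a fixed cap on the number of functions as the overfitting guard):

    import numpy as np
    from scipy.optimize import curve_fit

    def rbf(x, a, c, w):
        # a single Gaussian radial basis function: amplitude a, centre c, width w
        return a * np.exp(-((x - c) ** 2) / (2.0 * w ** 2))

    def fit_stagewise(x, y, n_functions):
        """Fit n_functions RBFs one at a time, each to the residual of the previous fits."""
        residual = y.copy()
        params = []
        for _ in range(n_functions):
            # crude initial guess: put the centre where the residual is currently largest
            i = np.argmax(np.abs(residual))
            p0 = [residual[i], x[i], np.std(x) / 4]
            popt, _ = curve_fit(rbf, x, residual, p0=p0, maxfev=10000)
            params.append(popt)
            residual = residual - rbf(x, *popt)
        return params

    def predict(x, params):
        return sum(rbf(x, *p) for p in params)

    # toy data
    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 200)
    y = np.sin(2 * x) + 0.1 * rng.standard_normal(x.size)
    params = fit_stagewise(x, y, n_functions=8)  # the cap limits overfitting
    print("RMSE:", np.sqrt(np.mean((predict(x, params) - y) ** 2)))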

I think it should be faster than a neural or RBF network because it doesn't have to tune so many parameters at once, and there is no network architecture to choose. It should be more accurate than a decision-tree algorithm such as M5 because it can follow a continuous curve more closely and doesn't have to choose attributes on which to split.

Has it been tried before? If so, why wasn't it successful?

+1  A: 

By fixing the parameters of the functions fitted in steps 1..n while fitting the parameters of function n+1, you will most likely find a WORSE fit (as measured, for example, by the mean squared error) than you would by fitting all n+1 functions at the same time. So you will probably need more functions to achieve the same mean squared error with your proposal than by leaving all n+1 functions 'floating' at the same time.
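As a toy numerical illustration (with two fixed basis functions, x and x^2, and least-squares coefficients standing in for the full nonlinear fit), freezing the first coefficient leaves an error that a joint fit does not have:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 2, 300)
    y = x ** 2 + 0.05 * rng.standard_normal(300)  # toy target

    F = np.column_stack([x, x ** 2])  # two basis functions: f1(x) = x, f2(x) = x^2

    # stagewise: fit x first, freeze its coefficient, then fit x^2 to the residual
    a1 = np.dot(F[:, 0], y) / np.dot(F[:, 0], F[:, 0])
    residual = y - a1 * F[:, 0]
    a2 = np.dot(F[:, 1], residual) / np.dot(F[:, 1], F[:, 1])
    stagewise_mse = np.mean((y - a1 * F[:, 0] - a2 * F[:, 1]) ** 2)

    # joint: ordinary least squares over both functions at once
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    joint_mse = np.mean((y - F @ coef) ** 2)

    print("stagewise MSE:", stagewise_mse, " joint MSE:", joint_mse)  # stagewise is worse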

In order to prevent overfitting, you should use something like cross-validation, e.g. stop adding functions as soon as the mean squared error on a held-out sample not used in the fit stops decreasing.
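For instance (a sketch using scikit-learn's gradient-boosted trees purely to illustrate the stopping rule; the same idea applies to a stagewise RBF fit):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(400, 1))
    y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(400)

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

    model = GradientBoostingRegressor(n_estimators=500).fit(X_train, y_train)

    # staged_predict yields the prediction after each added function (a tree here),
    # so we can pick the stage where the validation error bottoms out
    val_mse = [np.mean((y_val - pred) ** 2) for pred in model.staged_predict(X_val)]
    best_n = int(np.argmin(val_mse)) + 1
    print("best number of functions:", best_n, "validation MSE:", val_mse[best_n - 1])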

Andre Holzner
Thanks for the answer. Why do you think the fit will be worse when doing it incrementally? An advantage of incremental fitting is that the number of functions doesn't have to be decided in advance.
Michael Burrows
Consider data generated by x^2 + y, and the fitting order x, y, x^2. By the time we get to x^2, there's no way it can "fix" the fact that the coefficient of x is nonzero.
Eric Towers
+1  A: 

This is definitely a question for http://metaoptimize.com/qa - a Stack Overflow clone for the machine learning community.

That aside, the procedure you describe is basically a way to do regularization. For example, consider how your procedure compares against using millions of RBFs with an L1 or L2 regularization term. It differs only in how the regularization term works: what you are basically doing is regularizing the number of non-zero weights instead of the sum of their absolute values or the sum of their squares.

This method of regularization is not new; people have been trying to do it effectively for a long time, but the problem is that it is not convex or even continuous, which means the optimization is NP-hard in general. Your technique is basically an approximation algorithm for this problem and, while it might work on some examples, you will have no guarantees of optimality (finding the "best" N RBFs to fit) and it will be highly sensitive to the starting point you choose. It is for these reasons that you usually do not see research papers in top conferences and journals doing this, but rather trying to apply L1 regularization more effectively, since that has been proven to be as close as we can get while maintaining convexity.
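As a rough contrast, the L1-regularized alternative might look like this sketch (a few hundred fixed-centre RBFs rather than millions, with Lasso driving most of the weights to exactly zero; the centres, width and alpha are arbitrary choices for the toy data):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, 400)
    y = np.sin(2 * x) + 0.1 * rng.standard_normal(400)

    # a dense dictionary of fixed-width Gaussian RBFs centred on a grid
    centres = np.linspace(-3, 3, 200)
    width = 0.3
    Phi = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * width ** 2))

    # the L1 penalty drives most weights to exactly zero, selecting a small subset of RBFs
    model = Lasso(alpha=0.01, max_iter=50000).fit(Phi, y)
    print("non-zero weights:", np.count_nonzero(model.coef_), "of", centres.size)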

anonymous_21321
Thanks for the link: I didn't know about that. I wish I could understand your answer: I should have said how little I know about the subject in the question.
Michael Burrows
+1  A: 

The idea of iteratively fitting a function to the residuals is not new. For example, see AdaBoost.
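For regression the closest relative is gradient boosting, which (for squared error) fits each new learner to the residuals of the current ensemble, much like the procedure described in the question, with trees in place of RBFs. A minimal hand-rolled sketch:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(400, 1))
    y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(400)

    learning_rate = 0.1
    prediction = np.full_like(y, y.mean())       # start from the mean
    trees = []
    for _ in range(100):
        residual = y - prediction                # fit the next learner to what is left over
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)

    print("training MSE:", np.mean((y - prediction) ** 2))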

Amit Prakash