I would like to do an algebraic curve fit of 2D data points, but for various reasons it isn't really possible to keep much of the sample data in memory at once, and iterating through all of it is an expensive process.

(The reason for this is that actually I need to fit thousands of curves simultaneously based on gigabytes of data which I'm reading off disk, and which is therefore sloooooow).

Note that the number of polynomial coefficients will be limited (perhaps 5-10), so an exact fit will be extremely unlikely, but this is ok as I'm trying to find an underlying pattern in data with a lot of random noise. I understand how one can use a genetic algorithm to fit a curve to a dataset, but this requires many passes through the sample data, and thus isn't practical for my application.

Is there a way to fit a curve with a single pass of the data, where the state that must be maintained from sample to sample is minimal?

I should add that the nature of the data is that the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.

So, in Java, I'm looking for a class with the following interface:

public interface CurveFit {
   public void addData(double x, double y);
   public List<Double> getBestFit(); // Returns the polynomial coefficients
}

The class that implements this must not need to keep much data in its instance fields, no more than a kilobyte even for millions of data points. This means that you can't just store the data as you get it to do multiple passes through it later.

edit: Some have suggested that finding an optimal curve in a single pass may be impossible; however, an optimal fit is not required, just as close as we can get in a single pass.

The bare bones of an approach might be this: start with some curve, then have a way to nudge it slightly closer to each new data point as it comes in - effectively a form of gradient descent. The hope is that with sufficient data (and the data will be plentiful), we end up with a pretty good curve. Perhaps this inspires someone to a solution.
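For concreteness, here is a minimal sketch of that gradient-descent idea against the CurveFit interface above, assuming a fixed polynomial degree and a hand-tuned learning rate (the class name and both constructor parameters are my assumptions, not part of the question):

import java.util.ArrayList;
import java.util.List;

/* Single-pass, constant-memory fit via stochastic gradient descent on squared error. */
public class GradientDescentCurveFit implements CurveFit {
    private final double[] coeffs;      // coefficients of a0 + a1*x + a2*x^2 + ...
    private final double learningRate;  // step size; would need tuning on real data

    public GradientDescentCurveFit(int degree, double learningRate) {
        this.coeffs = new double[degree + 1];
        this.learningRate = learningRate;
    }

    public void addData(double x, double y) {
        // Evaluate the current polynomial at x (Horner's rule).
        double predicted = 0.0;
        for (int i = coeffs.length - 1; i >= 0; i--) {
            predicted = predicted * x + coeffs[i];
        }
        // The gradient of 0.5 * (predicted - y)^2 w.r.t. coefficient i is
        // (predicted - y) * x^i; take one small step against it.
        double error = predicted - y;
        double xPow = 1.0;
        for (int i = 0; i < coeffs.length; i++) {
            coeffs[i] -= learningRate * error * xPow;
            xPow *= x;
        }
    }

    public List<Double> getBestFit() {
        List<Double> result = new ArrayList<Double>();
        for (double c : coeffs) {
            result.add(c);
        }
        return result;
    }
}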

+2  A: 

Yes, it is a projection. For

y = X beta + error

where lowercased terms are vectors, and X is a matrix, you have the solution vector

\hat{beta} = inverse(X'X) X' y

as per the OLS page. You almost never want to compute this directly but rather use LR, QR or SVD decompositions. References are plentiful in the statistics literature.

If your problem has only one parameter (and x is hence a vector as well) then this reduces to just summation of cross-products between y and x.
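For that single-parameter case (fitting y = beta * x with no intercept), a minimal one-pass sketch in Java keeps only two running sums (class and method names here are illustrative, not from the thread):

/* One-pass fit of the single-parameter model y = beta * x (no intercept). */
public class SingleParameterFit {
    private double sumXY = 0.0;  // running sum of x * y
    private double sumXX = 0.0;  // running sum of x * x

    public void addData(double x, double y) {
        sumXY += x * y;
        sumXX += x * x;
    }

    // The one-parameter least-squares solution: beta = sum(x*y) / sum(x*x).
    public double getBeta() {
        return sumXY / sumXX;
    }
}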

Dirk Eddelbuettel
Can you provide a pointer as to how to do this with more than 1 independent variable? Specifically open source Java code would be useful, but other languages are fine too.
sanity
I can't help with Java but here is e.g. the GNU GSL on this: http://www.gnu.org/software/gsl/manual/html_node/Multi_002dparameter-fitting.html
Dirk Eddelbuettel
And you can confirm that this requires only a single pass through the data?
sanity
Please specify your problem more clearly. What is your dimension? For vector x and vector y, computing the cross-products x*y and x*x works in one pass. For larger problems you need the inverse of the matrix product X' * X, which is not a one-pass computation.
Dirk Eddelbuettel
And as per your comment below, no you do not get non-linear fits in single passes either.
Dirk Eddelbuettel
Dirk, there must be some way to do it, even if it's a sub-optimal fit.
sanity
Dirk, fyi it turned out that this is possible - I answered my own question with a link to the source - see below
sanity
Sure but that is a special case and not the more general ordinary least squares fitting I had talked about. But good to know you have your problem solved.
Dirk Eddelbuettel
+1  A: 

If you don't mind that you'll get a straight line "curve", then you only need six variables for any amount of data. Here's the source code that's going into my upcoming book; I'm sure that you can figure out how the DataPoint class works:

Interpolation.h:

#ifndef __INTERPOLATION_H
#define __INTERPOLATION_H

#include "DataPoint.h"

class Interpolation
{
private:
  int m_count;
  double m_sumX;
  double m_sumXX;  /* sum of X*X */
  double m_sumXY;  /* sum of X*Y */
  double m_sumY;
  double m_sumYY;  /* sum of Y*Y */

public:
  Interpolation();

  void addData(const DataPoint& dp);

  double slope() const;
  double intercept() const;

  double interpolate(double x) const;
  double correlate() const;
};

#endif // __INTERPOLATION_H

Interpolation.cpp:

#include <cmath>

#include "Interpolation.h"

Interpolation::Interpolation()
{
  m_count = 0;
  m_sumX = 0.0;
  m_sumXX = 0.0;
  m_sumXY = 0.0;
  m_sumY = 0.0;
  m_sumYY = 0.0;
}

void Interpolation::addData(const DataPoint& dp)
{
  m_count++;
  m_sumX += dp.getX();
  m_sumXX += dp.getX() * dp.getX();
  m_sumXY += dp.getX() * dp.getY();
  m_sumY += dp.getY();
  m_sumYY += dp.getY() * dp.getY();
}

double Interpolation::slope() const
{
  return (m_sumXY - (m_sumX * m_sumY / m_count)) /
    (m_sumXX - (m_sumX * m_sumX / m_count));
}

double Interpolation::intercept() const
{
  return (m_sumY / m_count) - slope() * (m_sumX / m_count);
}


double Interpolation::interpolate(double X) const
{
  return intercept() + slope() * X;
}


double Interpolation::correlate() const
{
  /* Pearson correlation coefficient, computed from the same running sums. */
  return (m_count * m_sumXY - m_sumX * m_sumY) /
    sqrt((m_count * m_sumXX - m_sumX * m_sumX) *
         (m_count * m_sumYY - m_sumY * m_sumY));
}
Chip Uni
Unfortunately I need a better fit than a simple straight line, but if you could generalize this to more flexible curves that would be exactly what I need.
sanity
I won't have time to generalize this code. I hope someone else has what you need!
Chip Uni
Make a post to SO when your book's done. I'm interested.
Stefan Kendall
A: 

Are you limiting the number of polynomial coefficients (i.e. fitting to a max power of x in your polynomial)?

If not, then you don't need a "best fit" algorithm - you can always fit N data points EXACTLY to a polynomial of N coefficients.

Just use matrices to solve N simultaneous equations for N unknowns (the N coefficients of the polynomial).

If you are limiting to a max number of coefficients, what is your max?

Following your comments and edit:

What you want is a low-pass filter to filter out noise, not fit a polynomial to the noise.
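One way to realise the low-pass idea under the question's single-pass, small-memory constraint (my reading, not something specified in this answer) is to bin x into a fixed grid and keep a running average of y per bin; with y restricted to 0.0/1.0 this estimates the fraction of 1s at each x. A sketch, with the class name and bin count chosen for illustration:

/* One-pass smoother: average y over a fixed grid of x-bins (x assumed in [0, 1]). */
public class BinnedSmoother {
    private final double[] sumY;
    private final long[] counts;

    public BinnedSmoother(int bins) {
        sumY = new double[bins];
        counts = new long[bins];
    }

    private int binFor(double x) {
        return Math.min(sumY.length - 1, Math.max(0, (int) (x * sumY.length)));
    }

    public void addData(double x, double y) {
        int bin = binFor(x);
        sumY[bin] += y;
        counts[bin]++;
    }

    /* Mean y in the bin containing x; with y in {0.0, 1.0} this is the fraction of 1s. */
    public double smoothedValue(double x) {
        int bin = binFor(x);
        return counts[bin] == 0 ? Double.NaN : sumY[bin] / counts[bin];
    }
}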

mbeckish
Yes, I am limiting the number of polynomial coefficients, because I'm trying to find an underlying pattern in data that contains a **lot** of noise. I've clarified the question accordingly. Also note that using matrices may require more than one pass through the data, and if so, it isn't applicable to my problem.
sanity
Why the downvote?
mbeckish
If you have a lot of random noise, then the max number of coefficients will ALWAYS give you the best fit. You should just say "How do I best fit a 10th degree polynomial to my data?"
mbeckish
Plus, finding the best-fit polynomial does not filter out the noise. For example, data that pretty much follows a straight line, but where each point randomly fluctuates up and down, will fit a 10th degree polynomial better than a first degree one.
mbeckish
A: 

Why not use a ring buffer of some fixed size (say, the last 1000 points) and do a standard QR decomposition-based least squares fit to the buffered data? Once the buffer fills, each time you get a new point you replace the oldest and re-fit. That way you have a bounded working set that still has some data locality, without all the challenges of live stream (memoryless) processing.
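A minimal sketch of the ring-buffer bookkeeping in Java (the refit itself is left to whatever in-memory least-squares routine you prefer; the QR-based fit this answer suggests is not shown, and the class name is illustrative):

/* Fixed-size ring buffer of the most recent points; the oldest point is
   overwritten once the buffer is full, so memory stays bounded. */
public class PointRingBuffer {
    private final double[] xs;
    private final double[] ys;
    private int next = 0;  // index of the slot to overwrite next
    private int size = 0;  // number of valid points stored so far

    public PointRingBuffer(int capacity) {
        xs = new double[capacity];
        ys = new double[capacity];
    }

    public void add(double x, double y) {
        xs[next] = x;
        ys[next] = y;
        next = (next + 1) % xs.length;
        if (size < xs.length) {
            size++;
        }
    }

    /* Copies of the buffered points, ready to hand to any in-memory
       least-squares routine after each new point arrives. */
    public double[][] window() {
        double[][] w = new double[2][size];
        System.arraycopy(xs, 0, w[0], 0, size);
        System.arraycopy(ys, 0, w[1], 0, size);
        return w;
    }
}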

Drew Hall
Several issues: given the degree of noise in my data, 1000 points is unlikely to be sufficient to find the underlying curve; really I need to use all available data, which may be tens of millions of points. Further, even if 1000 points were sufficient, this is too much state to maintain per curve in this particular application (since I need to fit potentially hundreds of thousands of curves simultaneously).
sanity
A: 

Given the nature of your data:

the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.

Then you don't need even a single pass, as these two lines will pass exactly through every point:

  • X = [0.0 ... 1.0], Y = 0.0
  • X = [0.0 ... 1.0], Y = 1.0

Two short line segments, unit length, and every point falls on one line or the other.

Admittedly, an algorithm to find a good curve fit for arbitrary points in a single pass is interesting, but (based on your question), that's not what you need.

Bevan
Two lines are not a curve.
sanity
To quote the original question: "I need to fit thousands of curves simultaneously"
Bevan
Ok, I can see how that could have been misinterpreted. I meant fit thousands of curves, each to a separate dataset.
sanity
A: 

You need the solution to an overdetermined linear system. The popular methods are Normal Equations (not usually recommended), QR factorization, and singular value decomposition (SVD). Wikipedia has decent explanations, Trefethen and Bau is very good. Your options:

  1. Out-of-core implementation via the normal equations. This requires the product A'A, where A has many more rows than columns (so the result is very small). The matrix A is completely defined by the sample locations, so you don't have to store it; computing A'A is therefore reasonably cheap (very cheap if you don't need to hit memory for the node locations). Once A'A is computed, you get the solution in one pass through your input data, but the method can be unstable. See the sketch after this list.

  2. Implement an out-of-core QR factorization. Classical Gram-Schmidt will be fastest, but you have to be careful about stability.

  3. Do it in-core with distributed memory (if you have the hardware available). Libraries like PLAPACK and SCALAPACK can do this, the performance should be much better than 1. The parallel scalability is not fantastic, but will be fine if it's a problem size that you would even think about doing in serial.

  4. Use iterative methods to compute an SVD. Depending on the spectral properties of your system (maybe after preconditioning) this could converge very fast, and it does not require storage for the matrix (which in your case has 5-10 columns, each the size of your input data). A good library for this is SLEPc; you only have to form the product of the Vandermonde matrix with a vector (so you only need to store the sample locations). This is very scalable in parallel.
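A minimal Java sketch of option 1 in the streaming setting of the question: accumulate A'A and A'y in a single pass, then solve the small (degree+1)-by-(degree+1) system. For brevity it uses plain Gaussian elimination with partial pivoting on the normal equations, so the instability caveat above still applies; the class name and constructor are illustrative, and it implements the CurveFit interface from the question:

import java.util.ArrayList;
import java.util.List;

/* Single-pass polynomial least squares via the normal equations (A'A) c = A'y.
   Only an (n+1) x (n+1) matrix and an (n+1)-vector are kept, no matter how many
   points are streamed in. */
public class NormalEquationsCurveFit implements CurveFit {
    private final int n;          // polynomial degree
    private final double[][] ata; // accumulated A'A
    private final double[] aty;   // accumulated A'y

    public NormalEquationsCurveFit(int degree) {
        n = degree;
        ata = new double[n + 1][n + 1];
        aty = new double[n + 1];
    }

    public void addData(double x, double y) {
        // The row of A for this sample is [1, x, x^2, ..., x^n].
        double[] row = new double[n + 1];
        row[0] = 1.0;
        for (int i = 1; i <= n; i++) {
            row[i] = row[i - 1] * x;
        }
        for (int i = 0; i <= n; i++) {
            aty[i] += row[i] * y;
            for (int j = 0; j <= n; j++) {
                ata[i][j] += row[i] * row[j];
            }
        }
    }

    public List<Double> getBestFit() {
        // Solve (A'A) c = A'y by Gaussian elimination with partial pivoting.
        double[][] m = new double[n + 1][n + 2];
        for (int i = 0; i <= n; i++) {
            System.arraycopy(ata[i], 0, m[i], 0, n + 1);
            m[i][n + 1] = aty[i];
        }
        for (int col = 0; col <= n; col++) {
            int pivot = col;
            for (int r = col + 1; r <= n; r++) {
                if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = m[col]; m[col] = m[pivot]; m[pivot] = tmp;
            for (int r = col + 1; r <= n; r++) {
                double factor = m[r][col] / m[col][col];
                for (int c = col; c <= n + 1; c++) {
                    m[r][c] -= factor * m[col][c];
                }
            }
        }
        double[] c = new double[n + 1];
        for (int row = n; row >= 0; row--) {
            double s = m[row][n + 1];
            for (int j = row + 1; j <= n; j++) {
                s -= m[row][j] * c[j];
            }
            c[row] = s / m[row][row];
        }
        List<Double> coeffs = new ArrayList<Double>();
        for (double v : c) {
            coeffs.add(v);
        }
        return coeffs;
    }
}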

Jed
A: 

Assuming that you don't know which point should belong to which curve, something like a Hough Transform might provide what you need.

The Hough Transform is a technique that allows you to identify structure within a data set. One use is for computer vision, where it allows easy identification of lines and borders within the field of sight.

Advantages for this situation:

  • Each point need be considered only once
  • You don't need to keep a data structure for each candidate line, just one (complex, multi-dimensional) structure
  • Processing of each line is simple
  • You can stop at any point and output a set of good matches
  • You never discard any data, so it's not reliant on any accidental locality of references
  • You can trade off between accuracy and memory requirements
  • It isn't limited to exact matches, but will highlight partial matches too.

An approach

To find cubic fits, you'd construct a 4-dimensional Hough space, into which you'd project each of your data-points. Hotspots within Hough space would give you the parameters for the cubic through those points.
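A rough Java sketch of that idea, with coarse, hand-picked coefficient ranges and bin counts (all of these numbers are hypothetical, and the accumulator here is far larger than the kilobyte budget in the question; that is the accuracy/memory trade-off listed above):

/* Coarse 4-D Hough accumulator for cubics y = a + b*x + c*x^2 + d*x^3.
   For each point we sweep the (b, c, d) grid, solve for the a that would make
   the cubic pass through the point, and vote for that cell. */
public class CubicHough {
    private static final int BINS = 16;      // per-parameter resolution
    private static final double MIN = -2.0;  // assumed coefficient range
    private static final double MAX = 2.0;
    private final int[][][][] votes = new int[BINS][BINS][BINS][BINS];

    private static double value(int bin) {    // bin index -> coefficient value
        return MIN + (MAX - MIN) * (bin + 0.5) / BINS;
    }

    private static int bin(double v) {        // coefficient value -> bin index
        int b = (int) ((v - MIN) / (MAX - MIN) * BINS);
        return Math.max(0, Math.min(BINS - 1, b));
    }

    public void addData(double x, double y) {
        for (int bi = 0; bi < BINS; bi++) {
            for (int ci = 0; ci < BINS; ci++) {
                for (int di = 0; di < BINS; di++) {
                    double a = y - value(bi) * x - value(ci) * x * x - value(di) * x * x * x;
                    if (a >= MIN && a <= MAX) {
                        votes[bin(a)][bi][ci][di]++;
                    }
                }
            }
        }
    }

    /* Coefficients {a, b, c, d} of the most-voted cell so far. */
    public double[] bestFit() {
        int bestVotes = -1;
        double[] best = new double[4];
        for (int ai = 0; ai < BINS; ai++) {
            for (int bi = 0; bi < BINS; bi++) {
                for (int ci = 0; ci < BINS; ci++) {
                    for (int di = 0; di < BINS; di++) {
                        if (votes[ai][bi][ci][di] > bestVotes) {
                            bestVotes = votes[ai][bi][ci][di];
                            best = new double[] { value(ai), value(bi), value(ci), value(di) };
                        }
                    }
                }
            }
        }
        return best;
    }
}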

Bevan
A: 

I believe I found the answer to my own question based on a modified version of this code. For those interested, my Java code is here.

sanity