I used to work with MATLAB, and for the question I raised I can use p = polyfit(x,y,1) to estimate the best fit line for the scatter data in a plate. I was wondering which resources I can rely on to implement the line fitting algorithm with C++. I understand there are a lot of algorithms for this subject, and for me I expect the algorithm should be fast and meantime it can obtain the comparable accuracy of polyfit function in MATLAB.
This page describes the algorithm easier than Wikipedia, without extra steps to calculate the means etc. : http://faculty.cs.niu.edu/~hutchins/csci230/best-fit.htm . Almost quoted from there, in C++ it's:
#include <vector>
#include <cmath>
struct Point {
double _x, _y;
};
struct Line {
double _slope, _yInt;
double getYforX(double x) {
return _slope*x + _yInt;
}
// Construct line from points
bool fitPoints(const std::vector<Point> &pts) {
int nPoints = pts.size();
if( nPoints < 2 ) {
// Fail: infinitely many lines passing through this single point
return false;
}
double sumX=0, sumY=0, sumXY=0, sumX2=0;
for(int i=0; i<nPoints; i++) {
sumX += pts[i]._x;
sumY += pts[i]._y;
sumXY += pts[i]._x * pts[i]._y;
sumX2 += pts[i]._x * pts[i]._x;
}
double xMean = sumX / nPoints;
double yMean = sumY / nPoints;
double denominator = sumX2 - sumX * xMean;
// You can tune the eps (1e-7) below for your specific task
if( std::fabs(denominator) < 1e-7 ) {
// Fail: it seems a vertical line
return false;
}
_slope = (sumXY - sumX * yMean) / denominator;
_yInt = yMean - _slope * xMean;
return true;
}
};
Please, be aware that both this algorithm and the algorithm from Wikipedia ( http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line ) fail in case the "best" description of points is a vertical line. They fail because they use
y = k*x + b
line equation which intrinsically is not capable to describe vertical lines. If you want to cover also the cases when data points are "best" described by vertical lines, you need a line fitting algorithm which uses
A*x + B*y + C = 0
line equation. You can still modify the current algorithm to produce that equation:
y = k*x + b <=>
y - k*x - b = 0 <=>
B=1, A=-k, C=-b
In terms of the above code:
B=1, A=-_slope, C=-_yInt
And in "then" block of the if
checking for denominator equal to 0, instead of // Fail: it seems a vertical line
, produce the following line equation:
x = xMean <=>
x - xMean = 0 <=>
A=1, B=0, C=-xMean
I've just noticed that the original article I was referring to has been deleted. And this web page proposes a little different formula for line fitting: http://hotmath.com/hotmath_help/topics/line-of-best-fit.html
double denominator = sumX2 - 2 * sumX * xMean + nPoints * xMean * xMean;
...
_slope = (sumXY - sumY*xMean - sumX * yMean + nPoints * xMean * yMean) / denominator;
The formulas are identical because nPoints*xMean == sumX
and nPoints*xMean*yMean == sumX * yMean == sumY * xMean
.