How to fit the 2D scatter data with a line with C++

feelfree picture feelfree · Jul 12, 2012 · Viewed 20.2k times · Source

I used to work with MATLAB, and for the question I raised I can use p = polyfit(x,y,1) to estimate the best fit line for the scatter data in a plate. I was wondering which resources I can rely on to implement the line fitting algorithm with C++. I understand there are a lot of algorithms for this subject, and for me I expect the algorithm should be fast and meantime it can obtain the comparable accuracy of polyfit function in MATLAB.

Answer

Serge Rogatch picture Serge Rogatch · Jan 25, 2015

This page describes the algorithm easier than Wikipedia, without extra steps to calculate the means etc. : http://faculty.cs.niu.edu/~hutchins/csci230/best-fit.htm . Almost quoted from there, in C++ it's:

#include <vector>
#include <cmath>

struct Point {
  double _x, _y;
};
struct Line {
  double _slope, _yInt;
  double getYforX(double x) {
    return _slope*x + _yInt;
  }
  // Construct line from points
  bool fitPoints(const std::vector<Point> &pts) {
    int nPoints = pts.size();
    if( nPoints < 2 ) {
      // Fail: infinitely many lines passing through this single point
      return false;
    }
    double sumX=0, sumY=0, sumXY=0, sumX2=0;
    for(int i=0; i<nPoints; i++) {
      sumX += pts[i]._x;
      sumY += pts[i]._y;
      sumXY += pts[i]._x * pts[i]._y;
      sumX2 += pts[i]._x * pts[i]._x;
    }
    double xMean = sumX / nPoints;
    double yMean = sumY / nPoints;
    double denominator = sumX2 - sumX * xMean;
    // You can tune the eps (1e-7) below for your specific task
    if( std::fabs(denominator) < 1e-7 ) {
      // Fail: it seems a vertical line
      return false;
    }
    _slope = (sumXY - sumX * yMean) / denominator;
    _yInt = yMean - _slope * xMean;
    return true;
  }
};

Please, be aware that both this algorithm and the algorithm from Wikipedia ( http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line ) fail in case the "best" description of points is a vertical line. They fail because they use

y = k*x + b 

line equation which intrinsically is not capable to describe vertical lines. If you want to cover also the cases when data points are "best" described by vertical lines, you need a line fitting algorithm which uses

A*x + B*y + C = 0

line equation. You can still modify the current algorithm to produce that equation:

y = k*x + b <=>
y - k*x - b = 0 <=>
B=1, A=-k, C=-b

In terms of the above code:

B=1, A=-_slope, C=-_yInt

And in "then" block of the if checking for denominator equal to 0, instead of // Fail: it seems a vertical line, produce the following line equation:

x = xMean <=>
x - xMean = 0 <=>
A=1, B=0, C=-xMean

I've just noticed that the original article I was referring to has been deleted. And this web page proposes a little different formula for line fitting: http://hotmath.com/hotmath_help/topics/line-of-best-fit.html

double denominator = sumX2 - 2 * sumX * xMean + nPoints * xMean * xMean;
...
_slope = (sumXY - sumY*xMean - sumX * yMean + nPoints * xMean * yMean) / denominator;

The formulas are identical because nPoints*xMean == sumX and nPoints*xMean*yMean == sumX * yMean == sumY * xMean.