Python - calculating trendlines with errors

Aaron Powell picture Aaron Powell · Aug 24, 2011 · Viewed 18.1k times · Source

So I've got some data stored as two lists, and plotted them using

plot(datasetx, datasety)

Then I set a trendline

trend = polyfit(datasetx, datasety)
trendx = []
trendy = []

for a in range(datasetx[0], (datasetx[-1]+1)):
    trendx.append(a)
    trendy.append(trend[0]*a**2 + trend[1]*a + trend[2])

plot(trendx, trendy)

But I have a third list of data, which is the error in the original datasety. I'm fine with plotting the errorbars, but what I don't know is using this, how to find the error in the coefficients of the polynomial trendline.

So say my trendline came out to be 5x^2 + 3x + 4 = y, there needs to be some sort of error on the 5, 3 and 4 values.

Is there a tool using NumPy that will calculate this for me?

Answer

joris picture joris · Aug 25, 2011

I think you can use the function curve_fit of scipy.optimize (documentation). A basic example of the usage:

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a*x**2 + b*x + c

x = np.linspace(0,4,50)
y = func(x, 5, 3, 4)
yn = y + 0.2*np.random.normal(size=len(x))

popt, pcov = curve_fit(func, x, yn)

Following the documentation, pcov gives:

The estimated covariance of popt. The diagonals provide the variance of the parameter estimate.

So in this way you can calculate an error estimate on the coefficients. To have the standard deviation you can take the square root of the variance.

Now you have an error on the coefficients, but it is only based on the deviation between the ydata and the fit. In case you also want to account for an error on the ydata itself, the curve_fit function provides the sigma argument:

sigma : None or N-length sequence

If not None, it represents the standard-deviation of ydata. This vector, if given, will be used as weights in the least-squares problem.

A complete example:

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a*x**2 + b*x + c

x = np.linspace(0,4,20)
y = func(x, 5, 3, 4)
# generate noisy ydata
yn = y + 0.2 * y * np.random.normal(size=len(x))
# generate error on ydata
y_sigma = 0.2 * y * np.random.normal(size=len(x))

popt, pcov = curve_fit(func, x, yn, sigma = y_sigma)

# plot
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(x, yn, yerr = y_sigma, fmt = 'o')
ax.plot(x, np.polyval(popt, x), '-')
ax.text(0.5, 100, r"a = {0:.3f} +/- {1:.3f}".format(popt[0], pcov[0,0]**0.5))
ax.text(0.5, 90, r"b = {0:.3f} +/- {1:.3f}".format(popt[1], pcov[1,1]**0.5))
ax.text(0.5, 80, r"c = {0:.3f} +/- {1:.3f}".format(popt[2], pcov[2,2]**0.5))
ax.grid()
plt.show()

result


Then something else, about using numpy arrays. One of the main advantages of using numpy is that you can avoid for loops because operations on arrays apply elementwise. So the for-loop in your example can also be done as following:

trendx = arange(datasetx[0], (datasetx[-1]+1))
trendy = trend[0]*trendx**2 + trend[1]*trendx + trend[2]

Where I use arange instead of range as it returns a numpy array instead of a list. In this case you can also use the numpy function polyval:

trendy = polyval(trend, trendx)