Local linear regression in R -- locfit() vs locpoly()

user1870614 · Feb 2, 2015 · Viewed 7.8k times

I am trying to understand the different behaviors of these two smoothing functions when given apparently equivalent inputs. My understanding was that locpoly just takes a fixed bandwidth argument, while locfit can also include a varying part in its smoothing parameter (a nearest-neighbors fraction, "nn"). I thought setting this varying part to zero in locfit should make the "h" component act like the fixed bandwidth used in locpoly, but this is evidently not the case.

A working example:

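library(KernSmooth)  # provides locpoly()
library(locfit)      # provides locfit() and lp()
set.seed(314)

# Simulate a noisy sine curve on [0, 1]
n <- 100
x <- runif(n, 0, 1)
eps <- rnorm(n, 0, 1)
y <- sin(2 * pi * x) + eps

plot(x, y)
# Local linear fit with KernSmooth, bandwidth 0.05 (green)
lines(locpoly(x, y, bandwidth = 0.05, degree = 1), col = 3)
# Local linear fit with locfit, nearest-neighbor part set to zero, h = 0.05 (blue)
lines(locfit(y ~ lp(x, nn = 0, h = 0.05, deg = 1)), col = 4)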

Produces this plot:

[Image: plot of smoothers]

locpoly gives the smooth green line, and locfit gives the wiggly blue line. Clearly, locfit has a smaller "effective" bandwidth here, even though the supposed bandwidth parameter has the same value for each.

What are these functions doing differently?

Answer

znr · Oct 21, 2018

The two parameters both represent smoothing, but they do so in two different ways.

locpoly's bandwidth parameter is expressed in the units of the x-axis. For example, if you change the line x <- runif(n, 0, 1) to x <- runif(n, 0, 10), you will see that the green locpoly line becomes much more squiggly, even though you still have the same number of points (100).
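To try that experiment without editing the original script, here is a minimal sketch. It assumes the same seed and simulated data as in the question; the names u, fit_unit, fit_wide and fit_scaled are mine, not from the original post. It fits locpoly to the same responses on two x scales, and then once more with the bandwidth multiplied by the same factor as x.

```r
library(KernSmooth)
set.seed(314)

n <- 100
u <- runif(n, 0, 1)               # positions on the unit interval
y <- sin(2 * pi * u) + rnorm(n)   # noisy sine curve, as in the question

# Same responses, two x scales, same bandwidth value
fit_unit <- locpoly(u,      y, bandwidth = 0.05, degree = 1)
fit_wide <- locpoly(10 * u, y, bandwidth = 0.05, degree = 1)

# x stretched by 10 and the bandwidth stretched by 10 as well
fit_scaled <- locpoly(10 * u, y, bandwidth = 0.5, degree = 1)

# Left panel: original scale. Right panel: stretched scale, both bandwidths.
op <- par(mfrow = c(1, 2))
plot(u, y, main = "x in [0, 1], bandwidth 0.05")
lines(fit_unit, col = 3)
plot(10 * u, y, main = "x in [0, 10]")
lines(fit_wide, col = 3)     # bandwidth 0.05 on the stretched scale
lines(fit_scaled, col = 2)   # bandwidth 0.5 on the stretched scale
par(op)
```

With so few points per window on the stretched scale, the bandwidth-0.05 fit may be unstable or contain missing values in sparse regions, which is part of what the comparison is meant to show.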

locfit's smoothing parameter, h, is independent of the scale, and instead is based on a proportion of the data. The value 0.05 means 5% of the data that is closest to that position is used to fit the curve. So changing the scale would not alter the line.

This also explains the observation made in the comment that changing the value of h to 0.1 makes the two look nearly identical. This makes sense: a bandwidth of 0.05 covers a window of width 0.1 (0.05 on each side of the fitting point), and with 100 points distributed uniformly from 0 to 1 we can expect that window to contain about 10% of the data.
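The 10% figure is easy to check numerically. The sketch below is only an illustration of that arithmetic, using the question's seed and base R; the names centers and frac are mine. It counts the share of points within 0.05 of each position on an interior grid, so that the windows are not clipped by the ends of the interval.

```r
set.seed(314)
x <- runif(100, 0, 1)

# Interior positions only, so every window [x0 - 0.05, x0 + 0.05] lies inside [0, 1]
centers <- seq(0.05, 0.95, by = 0.01)

# Fraction of the 100 points that fall inside the window around each position
frac <- sapply(centers, function(x0) mean(abs(x - x0) <= 0.05))

mean(frac)    # average share of the data per window
range(frac)   # how much that share varies from position to position
```

The average should come out close to 0.1, while individual windows vary because the points are random rather than evenly spaced.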

My sources include the documentation for the locfit package and the documentation for the locpoly function.