I have several data points which seem suitable for fitting a spline through them. When I do this, I get a rather bumpy fit, like overfitting, which is not what I understand as smoothing.
Is there a special option / parameter for getting back the function of a really smooth spline like here.
The usage of the penalty
parameter for smooth.spline
didn't have any visible effect. Maybe I did it wrong?
Here are data and code:
results <- structure(
list(
beta = c(
0.983790622281964, 0.645152464354322,
0.924104713597375, 0.657703886566088, 0.788138034115623, 0.801080207252363,
1, 0.858337365965949, 0.999687052533693, 0.666552625121279, 0.717453633245958,
0.621570152961453, 0.964658181346544, 0.65071758770312, 0.788971505000918,
0.980476054183113, 0.670263506919246, 0.600387040967624, 0.759173403408052,
1, 0.986409675965, 0.982996471134736, 1, 0.995340781899163, 0.999855895958986,
1, 0.846179233381267, 0.879226324448832, 0.795820998892035, 0.997586607285667,
0.848036806290156, 0.905320944437968, 0.947709125535428, 0.592172373022407,
0.826847031044922, 0.996916006944244, 0.785967729206612, 0.650346929853076,
0.84206351833549, 0.999043126652724, 0.936879214753098, 0.76674066557003,
0.591431233516217, 1, 0.999833445117791, 0.999606223666537, 0.6224971799303,
1, 0.974537160571494, 0.966717133936379
), inventoryCost = c(
1750702.95138889,
442784.114583333, 1114717.44791667, 472669.357638889, 716895.920138889,
735396.180555556, 3837320.74652778, 872873.4375, 2872414.93055556,
481095.138888889, 538125.520833333, 392199.045138889, 1469500.95486111,
459873.784722222, 656220.486111111, 1654143.83680556, 437511.458333333,
393295.659722222, 630952.170138889, 4920958.85416667, 1723517.10069444,
1633579.86111111, 4639909.89583333, 2167748.35069444, 3062420.65972222,
5132702.34375, 838441.145833333, 937659.288194444, 697767.1875,
2523016.31944444, 800903.819444444, 1054991.49305556, 1266970.92013889,
369537.673611111, 764995.399305556, 2322879.6875, 656021.701388889,
458403.038194444, 844133.420138889, 2430700, 1232256.68402778,
695574.479166667, 351348.524305556, 3827440.71180556, 3687610.41666667,
2950652.51736111, 404550.78125, 4749901.64930556, 1510481.59722222,
1422708.07291667
)
), .Names = c("beta", "inventoryCost"), class = c("data.frame")
)
plot(results$beta,results$inventoryCost)
mySpline <- smooth.spline(results$beta,results$inventoryCost, penalty=999999)
lines(mySpline$x, mySpline$y, col="red", lwd = 2)
Transform your data sensibly before modelling
Based on the scale of your results$inventoryCost
, log transform is appropriate. For simplicity, in the following I am using x
, y
. I am also reordering your data so that x
is ascending:
x <- results$beta; y <- log(results$inventoryCost)
reorder <- order(x); x <- x[reorder]; y <- y[reorder]
par(mfrow = c(1,2))
plot(x, y, main = "take log transform")
hist(x, main = "x is skewed")
The left figure looks better? Also, it is highly recommended to further take transform for x
, because it is skewed! (see right figure).
The following transform is appropriate:
x1 <- -(1-x)^(1/3)
The cubic root of (1-x)
will make data more spread out around x = 1
. I put an additional -1
so that there is a positively monotonic relation rather than a negative one between x
and x1
. Now let's check the relationship:
par(mfrow = c(1,2))
plot(x1, y, main = expression(y %~% ~ x1))
hist(x1, main = "x1 is well spread out")
Fitting a spline
Now we are ready for statistical modelling. Try the following call:
fit <- smooth.spline(x1, y, nknots = 10)
pred <- stats:::predict.smooth.spline(fit, x1)$y ## predict at all x1
## or you can simply call: pred <- predict(fit, x1)$y
plot(x1, y) ## scatter plot
lines(x1, pred, lwd = 2, col = 2) ## fitted spline
Does it look nice? Note, that I have used nknots = 10
tells smooth.spline
to place 10 interior knots (by quantile); Therefore, we are to fit a penalized regression spline rather than a smoothing spline. In fact, the smooth.spline()
function almost never fit a smoothing spline, unless you put all.knots = TRUE
(see later example).
I also dropped penalty = 999999
, as that has nothing to do with smoothness control. If you really want to control smoothness, rather than letting smooth.spline
figure out the optimal one by GCV, you should use argument df
or spar
. I will give example later.
To transform fit back to original scale, do:
plot(x, exp(y), main = expression(Inventory %~%~ beta))
lines(x, exp(pred), lwd = 2, col = 2)
As you can see, the fitted spline is as smooth as you had expected.
Explanation on fitted spline
Let's see the summary of your fitted spline:
> fit
Smoothing Parameter spar= 0.4549062 lambda= 0.0008657722 (11 iterations)
Equivalent Degrees of Freedom (Df): 6.022959
Penalized Criterion: 0.08517417
GCV: 0.004288539
We used 10 knots, ending up with 6 degree of freedom, so penalization suppresses about 4 parameters. The smoothing parameter GCV has chosen, after 11 iterations, is lambda= 0.0008657722
.
Why do we have to transform x
to x1
Spline is penalized by 2nd derivatives, yet such penalization is on the averaged/integrated 2nd derivatives at all data points. Now, look at your data (x, y)
. For x
before 0.98, the relationship is relatively steady; as x
approaches 1, the relationship quickly goes steeper. The "change point", 0.98, has very high second derivative, much much higher than the second derivatives at other locations.
y0 <- as.numeric(tapply(y, x, mean)) ## remove tied values
x0 <- unique(x) ## remove tied values
dy0 <- diff(y0)/diff(x0) ## 1st order difference
ddy0 <- diff(dy0)/diff(x0[-1]) ## 2nd order difference
plot(x0[1:43], abs(ddy0), pch = 19)
Look at that huge spike in 2nd order difference/derivative! Now, if we fit a spline directly, the spline curve around this change point will be heavily penalized.
bad <- smooth.spline(x, y, all.knots = TRUE)
bad.pred <- predict(bad, x)$y
plot(x, exp(y), main = expression(Inventory %~% ~ beta))
lines(x, exp(bad.pred), col = 2, lwd = 3)
abline(v = 0.98, lwd = 2, lty = 2)
You can see clearly that the spline is having some difficulty in approximating data after x = 0.98
.
There are of course some ways to achieve better approximation after this change point, for example, by manually setting smaller smoothing parameter, or higher degree of freedom. But we are going to another extreme. Remember, both penalization and degree of freedom are a global measure. Increasing model complexity will get better approximation after x = 0.98
, but will also make other parts more bumpy. Now let's try a model with 45 degree of freedom:
worse <- smooth.spline(x, y, all.knots = TRUE, df = 45)
worse.pred <- predict(worse, x)$y
plot(x, exp(y), main = expression(Inventory %~% ~ beta))
lines(x, exp(worse.pred), col = 2, lwd = 2)
As you can see, the curve is bumpy. Sure, we have overfitted our dataset of 50 data, with 45 degree of freedom.
In fact, your original misuse of smooth.spline()
is doing the same thing:
> mySpline
Call:
smooth.spline(x = results$beta, y = results$inventoryCost, penalty = 999999)
Smoothing Parameter spar= -0.8074624 lambda= 3.266077e-19 (17 iterations)
Equivalent Degrees of Freedom (Df): 45
Penalized Criterion: 5.598386
GCV: 0.03824885
Oops, 45 degree of freedom, overfitting!