Does anyone know how gbm in R handles missing values? I can't seem to find any explanation using Google.
To explain what gbm does with missing predictors, let's first visualize a single tree of a gbm object.
Suppose you have a gbm object mygbm. Using pretty.gbm.tree(mygbm, i.tree=1) you can visualize the first tree of mygbm, e.g.:
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
0       46  1.629728e+01        1         5           9      26.462908   1585 -4.396393e-06
1       45  1.850000e+01        2         3           4      11.363868    939 -4.370936e-04
2       -1  2.602236e-04       -1        -1          -1       0.000000    271  2.602236e-04
3       -1 -7.199873e-04       -1        -1          -1       0.000000    668 -7.199873e-04
4       -1 -4.370936e-04       -1        -1          -1       0.000000    939 -4.370936e-04
5       20  0.000000e+00        6         7           8       8.638042    646  6.245552e-04
6       -1  3.533436e-04       -1        -1          -1       0.000000    483  3.533436e-04
7       -1  1.428207e-03       -1        -1          -1       0.000000    163  1.428207e-03
8       -1  6.245552e-04       -1        -1          -1       0.000000    646  6.245552e-04
9       -1 -4.396393e-06       -1        -1          -1       0.000000   1585 -4.396393e-06
See the gbm documentation for details. Each row corresponds to a node, and the first (unnamed) column is the node number. Each node has a left and a right child node (both set to -1 if the node is a leaf). Each node also has an associated MissingNode.
To run an observation down the tree, we start at node 0. If an observation has a missing value on SplitVar = 46, it is sent down the tree to the node MissingNode = 9. The tree's prediction for such an observation is SplitCodePred = -4.396393e-06, which is the same prediction the tree had before any split was made at node zero (Prediction = -4.396393e-06 for node zero).
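As a quick sanity check of that last point, here is a small base R sketch that compares the value stored in each MissingNode with the Prediction of its parent. The data frame is typed in by hand from the output above (the node column and the variable names are my own additions for illustration), so it runs without gbm installed.

```r
# Values copied by hand from the pretty.gbm.tree() output shown above.
# The 'node' column is added here for readability; in the real output the
# node number is the row name.
tree <- data.frame(
  node          = 0:9,
  MissingNode   = c(9, 4, -1, -1, -1, 8, -1, -1, -1, -1),
  SplitCodePred = c(1.629728e+01, 1.850000e+01, 2.602236e-04, -7.199873e-04,
                    -4.370936e-04, 0.000000e+00, 3.533436e-04, 1.428207e-03,
                    6.245552e-04, -4.396393e-06),
  Prediction    = c(-4.396393e-06, -4.370936e-04, 2.602236e-04, -7.199873e-04,
                    -4.370936e-04, 6.245552e-04, 3.533436e-04, 1.428207e-03,
                    6.245552e-04, -4.396393e-06)
)

# Internal (non-leaf) nodes are the ones that have a MissingNode.
internal <- tree[tree$MissingNode != -1, ]

# Node numbers are 0-based, R rows are 1-based, hence the + 1.
missing_pred <- tree$SplitCodePred[internal$MissingNode + 1]

# For this tree, each missing-value leaf carries its parent's prediction.
all.equal(missing_pred, internal$Prediction)  # TRUE
```

For this particular tree the check passes for all three internal nodes (0, 1 and 5).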
The procedure is similar for other nodes and split variables.
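To make the routing concrete, here is a minimal base R sketch that walks one observation down the tree shown above. It is only an illustration, not gbm's own code: predict_one_tree is a hypothetical helper, the tree is typed in by hand, and I am assuming every split variable is continuous with values below the split point going left (for a categorical split, SplitCodePred means something different and this sketch would not apply).

```r
# Tree copied by hand from the pretty.gbm.tree() output above.
# Row i holds node i - 1, because node numbers are 0-based.
tree <- data.frame(
  SplitVar      = c(46, 45, -1, -1, -1, 20, -1, -1, -1, -1),
  SplitCodePred = c(1.629728e+01, 1.850000e+01, 2.602236e-04, -7.199873e-04,
                    -4.370936e-04, 0.000000e+00, 3.533436e-04, 1.428207e-03,
                    6.245552e-04, -4.396393e-06),
  LeftNode      = c(1, 2, -1, -1, -1, 6, -1, -1, -1, -1),
  RightNode     = c(5, 3, -1, -1, -1, 7, -1, -1, -1, -1),
  MissingNode   = c(9, 4, -1, -1, -1, 8, -1, -1, -1, -1)
)

# Hypothetical helper. 'x' is a numeric vector named by the 0-based predictor
# index used in SplitVar, e.g. c("46" = NA, "45" = 10, "20" = 1).
# Assumption: all splits are continuous and x < split point goes left.
predict_one_tree <- function(tree, x) {
  node <- 0
  repeat {
    row <- tree[node + 1, ]
    # A leaf has SplitVar == -1; its SplitCodePred is the prediction.
    if (row$SplitVar == -1) {
      return(row$SplitCodePred)
    }
    value <- x[[as.character(row$SplitVar)]]
    if (is.na(value)) {
      node <- row$MissingNode      # missing value: take the third branch
    } else if (value < row$SplitCodePred) {
      node <- row$LeftNode
    } else {
      node <- row$RightNode
    }
  }
}

# Missing on variable 46: node 0 -> node 9.
predict_one_tree(tree, c("46" = NA, "45" = 10, "20" = 1))   # -4.396393e-06

# Observed on 46 (goes left), missing on 45: node 0 -> 1 -> 4.
predict_one_tree(tree, c("46" = 10, "45" = NA, "20" = 1))   # -0.0004370936
```

The second call shows the "similar procedure" at a deeper node: the observation is routed normally at node 0, and only falls into a MissingNode once it reaches a split on a variable it is missing.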