In Section 3.4 of their article, the authors explain how they handle missing values when searching for the best candidate split during tree growing. Specifically, they create a default direction for each node whose splitting feature has missing values in the current instance set. At prediction time, if the prediction path goes through such a node and the feature value is missing, the default direction is followed.
However, the prediction phase would break down when a feature value is missing and the node has no default direction (and this can occur in many scenarios). In other words, how do they associate a default direction with every node, even those whose splitting feature had no missing values in the active instance set at training time?
xgboost always accounts for a missing-value split direction, even if no missing values are present in training. The default is the "yes" direction of the split criterion. If missing values are present in training, the direction is learned instead.
From the author link
This can be observed with the following code:
library(xgboost)
data(agaricus.train, package = 'xgboost')
## confirm that the training data contains no missing values
sum(is.na(agaricus.train$data))
##[1] 0
bst <- xgboost(data = agaricus.train$data,
               label = agaricus.train$label,
               max.depth = 4,
               eta = .01,
               nrounds = 100,
               nthread = 2,
               objective = "binary:logistic")
dt <- xgb.model.dt.tree(model = bst) ## records all the splits
> head(dt)
ID Feature Split Yes No Missing Quality Cover Tree Yes.Feature Yes.Cover Yes.Quality
1: 0-0 28 -1.00136e-05 0-1 0-2 0-1 4000.5300000 1628.25 0 55 924.50 1158.2100000
2: 0-1 55 -1.00136e-05 0-3 0-4 0-3 1158.2100000 924.50 0 7 679.75 13.9060000
3: 0-10 Leaf NA NA NA NA -0.0198104 104.50 0 NA NA NA
4: 0-11 7 -1.00136e-05 0-15 0-16 0-15 13.9060000 679.75 0 Leaf 763.00 0.0195026
5: 0-12 38 -1.00136e-05 0-17 0-18 0-17 28.7763000 10.75 0 Leaf 678.75 -0.0199117
6: 0-13 Leaf NA NA NA NA 0.0195026 763.00 0 NA NA NA
No.Feature No.Cover No.Quality
1: Leaf 104.50 -0.0198104
2: 38 10.75 28.7763000
3: NA NA NA
4: Leaf 9.50 -0.0180952
5: Leaf 1.00 0.0100000
6: NA NA NA
> all(dt$Missing == dt$Yes, na.rm = TRUE)
[1] TRUE
Source code: https://github.com/tqchen/xgboost/blob/8130778742cbdfa406b62de85b0c4e80b9788821/src/tree/model.h#L542
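To make the idea concrete, here is a minimal base-R sketch of how a default direction could be chosen for a single candidate split. It is not xgboost's actual implementation: the helper names `score` and `choose_default`, the `lambda` default, and the rule that ties fall back to "yes" are assumptions made for illustration, the last one chosen to match the behaviour described above. Missing values are sent to each branch in turn and the direction with the larger gain is kept. The parent term of the gain is omitted because it is the same for both directions.

```r
# Structure score of a child with gradient sum G and hessian sum H.
# lambda is an L2 regularisation term (its default here is an assumption).
score <- function(G, H, lambda = 1) G^2 / (H + lambda)

# Hypothetical helper: choose the default direction for one candidate split.
# x: feature values (NA = missing); g, h: per-instance gradients and hessians.
choose_default <- function(x, g, h, threshold, lambda = 1) {
  miss  <- is.na(x)
  left  <- !miss & x < threshold   # "yes" branch: x < threshold
  right <- !miss & !left           # "no" branch
  gain_if <- function(direction) {
    # Send all missing instances to the requested branch.
    l <- if (direction == "yes") left | miss else left
    r <- if (direction == "yes") right else right | miss
    score(sum(g[l]), sum(h[l]), lambda) + score(sum(g[r]), sum(h[r]), lambda)
  }
  gains <- c(yes = gain_if("yes"), no = gain_if("no"))
  # Assumption: ties fall back to "yes". With no missing values the two
  # gains are identical, so the node still receives a default direction.
  direction <- if (gains[["no"]] > gains[["yes"]]) "no" else "yes"
  list(direction = direction, gains = gains)
}

g <- c(-1, -1, 1, 1, 1)
h <- rep(1, 5)
x_full <- c(0, 0, 1, 1, 1)    # no missing values
x_na   <- c(0, 0, 1, 1, NA)   # last instance is missing
res_full <- choose_default(x_full, g, h, threshold = 0.5)
res_na   <- choose_default(x_na,   g, h, threshold = 0.5)
print(res_full$direction)
print(res_na$direction)
```

With `x_full` both gains coincide, so the fallback "yes" direction is used. With `x_na` the missing instance has a positive gradient like the rest of the "no" branch, so sending it there gives the larger gain and "no" is learned.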