I have a highly unbalanced dataset and am wondering where to account for the weights. I am trying to understand the difference between the scale_pos_weight argument of XGBClassifier and the sample_weight parameter of its fit method. I would appreciate an intuitive explanation of the difference between the two, whether they can be used simultaneously, and how to choose between them.
The documentation says this about scale_pos_weight:

control the balance of positive and negative weights ... a typical value to consider: sum(negative cases) / sum(positive cases)
Example using scale_pos_weight:

from xgboost import XGBClassifier

LR = 0.1
NumTrees = 1000

# One global weight for the whole positive class (here a 14:1 negative:positive ratio)
xgbmodel = XGBClassifier(booster='gbtree', seed=0, nthread=-1,
                         gamma=0, scale_pos_weight=14,
                         learning_rate=LR, n_estimators=NumTrees,
                         max_depth=5, objective='binary:logistic', subsample=1)
xgbmodel.fit(X_train, y_train)
Example using sample_weight in fit:

from xgboost import XGBClassifier

LR = 0.1
NumTrees = 1000

# Per-example weights passed at fit time instead of a class-level factor
xgbmodel = XGBClassifier(booster='gbtree', seed=0, nthread=-1,
                         gamma=0, learning_rate=LR, n_estimators=NumTrees,
                         max_depth=5, objective='binary:logistic', subsample=1)
xgbmodel.fit(X_train, y_train, sample_weight=weights_train)
The sample_weight parameter allows you to specify a different weight for each training example. The scale_pos_weight parameter lets you provide a single weight for an entire class of examples (the "positive" class).

These correspond to two different approaches to cost-sensitive learning. If you believe that the cost of misclassifying positive examples (missing a cancer patient) is the same for all positive examples, but higher than the cost of misclassifying negative ones (e.g. telling someone they have cancer when they actually don't), then you can specify one single weight for all positive examples via scale_pos_weight.
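A minimal sketch of how that class-level weight is usually chosen, assuming y_train is a 0/1 label array (the variable names here are only illustrative):

import numpy as np
from xgboost import XGBClassifier

# Heuristic from the docs: ratio of negative to positive examples.
neg, pos = np.bincount(y_train.astype(int))   # counts of class 0 and class 1
ratio = neg / pos                             # e.g. ~14 for a 14:1 imbalance

clf = XGBClassifier(objective='binary:logistic', scale_pos_weight=ratio)
clf.fit(X_train, y_train)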
XGBoost treats a label of 1 as the "positive" class. This is evident from the following line of its source code:
if (info.labels[i] == 1.0f) w *= param_.scale_pos_weight
See this question.
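As that line suggests, scale_pos_weight simply multiplies the weight of every positive example. So, assuming all base weights are 1, the two fits below should behave equivalently; this is a sketch of the relationship, not something the API documents as identical:

import numpy as np
from xgboost import XGBClassifier

# Class-level factor of 14 applied to every positive example...
m1 = XGBClassifier(objective='binary:logistic', scale_pos_weight=14, random_state=0)
m1.fit(X_train, y_train)

# ...versus per-example weights of 14 for positives and 1 for negatives.
m2 = XGBClassifier(objective='binary:logistic', random_state=0)
m2.fit(X_train, y_train, sample_weight=np.where(y_train == 1, 14.0, 1.0))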
The other scenario is where you have example-dependent costs. One example is detecting fraudulent transactions. Not only is a false negative (missing a fraudulent transaction) more costly than a false positive (blocking a legitimate transaction), but the cost of a false negative is also proportional to the amount of money being stolen. So you want to give larger weights to positive (fraudulent) examples involving larger amounts. In this case, you can use the sample_weight parameter to specify example-specific weights.
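A hedged sketch of that idea, assuming an amounts array aligned row-for-row with X_train (amounts and weights_train are illustrative names, not part of the XGBoost API):

import numpy as np
from xgboost import XGBClassifier

# Fraudulent (positive) rows get a weight proportional to the money at stake;
# legitimate rows keep weight 1.
weights_train = np.ones(len(y_train), dtype=float)
fraud = (y_train == 1)
weights_train[fraud] = amounts[fraud] / amounts[fraud].mean()

clf = XGBClassifier(objective='binary:logistic')
clf.fit(X_train, y_train, sample_weight=weights_train)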