I have a dataset of around 20,000 training examples on which I want to perform binary classification. The problem is that the dataset is heavily imbalanced, with only around 1,000 examples in the positive class. I am trying to use xgboost (in R) for my prediction.
I have tried oversampling and undersampling, and no matter what I do, the predictions always classify everything as the majority class.
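For reference, here is a stripped-down version of what I am doing (`df` and the column names below are placeholders for my actual data):

```r
library(xgboost)

# `df` is my data.frame; `target` is the 0/1 label
# (~1,000 positives out of ~20,000 rows).
X <- as.matrix(df[, setdiff(names(df), "target")])
y <- df$target

bst <- xgboost(data = X, label = y, nrounds = 100,
               objective = "binary:logistic", verbose = 0)

# Thresholding the predicted probabilities at 0.5 gives all zeros:
preds <- as.numeric(predict(bst, X) > 0.5)
table(preds)
```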
I tried reading this article on how to tune xgboost parameters: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
However, it only mentions which parameters help with imbalanced datasets, not how to tune them.
I would appreciate any advice on tuning xgboost's learning parameters to handle imbalanced datasets, and also on how to generate a validation set in such cases.
According to the XGBoost documentation, the scale_pos_weight parameter is the one that deals with imbalanced classes. From the documentation:
scale_pos_weight [default=1]
Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases). See Parameters Tuning for more discussion. Also see the Higgs Kaggle competition demo for examples: R, py1, py2, py3.
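As a rough sketch of how that might look in R for your numbers (variable names here are illustrative; assume `X` is a numeric feature matrix and `y` a 0/1 label vector with 1 as the rare positive class):

```r
library(xgboost)
set.seed(42)

# Stratified train/validation split: sample within each class separately
# so the ~5% positive rate is preserved in both sets.
pos_idx <- which(y == 1)
neg_idx <- which(y == 0)
val_idx <- c(sample(pos_idx, length(pos_idx) %/% 5),
             sample(neg_idx, length(neg_idx) %/% 5))

dtrain <- xgb.DMatrix(X[-val_idx, ], label = y[-val_idx])
dval   <- xgb.DMatrix(X[val_idx, ],  label = y[val_idx])

# scale_pos_weight = sum(negative cases) / sum(positive cases),
# as the documentation suggests (~19 for 19,000 negatives / 1,000 positives).
spw <- sum(y[-val_idx] == 0) / sum(y[-val_idx] == 1)

params <- list(
  objective        = "binary:logistic",
  eval_metric      = "auc",   # accuracy is misleading with imbalanced classes
  scale_pos_weight = spw,
  eta              = 0.1,
  max_depth        = 6
)

bst <- xgb.train(params, dtrain, nrounds = 500,
                 watchlist = list(val = dval),
                 early_stopping_rounds = 20)
```

Two things matter here: evaluating with AUC (or precision/recall) rather than accuracy, since predicting everything as negative already yields about 95% accuracy on your data, and keeping the validation split stratified so both sets retain the original class ratio.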