How can I print variable importance from the gbm function?

이순우 · Feb 15, 2017 · Viewed 9.6k times

I used the gbm function to implement gradient boosting, and I want to perform classification. Afterwards, I used the varImp() function to print the variable importance of the gradient boosting model. But only 4 of the 371 variables in my data have non-zero importance. Is that right? Here are my code and results.

>asd <- read.csv("bigdatafile.csv", header = TRUE)
>asd1 <- gbm(TARGET ~ ., n.trees = 50, distribution = "adaboost",
             verbose = TRUE, interaction.depth = 1, data = asd)

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
 1        0.5840             nan     0.0010    0.0011
 2        0.5829             nan     0.0010    0.0011
 3        0.5817             nan     0.0010    0.0011
 4        0.5806             nan     0.0010    0.0011
 5        0.5795             nan     0.0010    0.0011
 6        0.5783             nan     0.0010    0.0011
 7        0.5772             nan     0.0010    0.0011
 8        0.5761             nan     0.0010    0.0011
 9        0.5750             nan     0.0010    0.0011
10        0.5738             nan     0.0010    0.0011
20        0.5629             nan     0.0010    0.0011
40        0.5421             nan     0.0010    0.0010
50        0.5321             nan     0.0010    0.0010

>varImp(asd1, numTrees = 50)
                    Overall
CA0000801           0.00000
AS0000138           0.00000
AS0000140           0.00000
A1                  0.00000
PROFILE_CODE        0.00000
A2                  0.00000
CB_thinfile2        0.00000
SP_thinfile2        0.00000
thinfile1           0.00000
EW0001901           0.00000
EW0020901           0.00000
EH0001801           0.00000
BS_Seg1_Score       0.00000
BS_Seg2_Score       0.00000
LA0000106           0.00000
EW0001903           0.00000
EW0002801           0.00000
EW0002902           0.00000
EW0002903           0.00000
EW0002904           0.00000
EW0002906           0.00000
LA0300104_SP       56.19052
ASMGRD2          2486.12715
MIX_GRD          2211.03780
P71010401_1         0.00000
PS0000265           0.00000
P11021100           0.00000
PE0000123           0.00000

There are 371 variables, so I didn't list the others above; they all have zero importance.

TARGET is the target variable; it has two levels, so I used the adaboost distribution. I grew 50 trees.

Is there a mistake in my code? There are so few non-zero variables.

Thank you for your reply.

Answer

서영재 · Feb 16, 2017

In your code, n.trees is very low and shrinkage is also very low (the gbm default of 0.001, visible as StepSize in your output). Just adjust these two parameters.

  1. n.trees is the number of trees. Increasing it reduces the error on the training set, but setting it too high may lead to over-fitting.
  2. interaction.depth (maximum nodes per tree) is the number of splits performed on each tree (starting from a single node).
  3. shrinkage acts as a learning rate. The term comes from ridge regression, where shrinkage pulls regression coefficients toward zero and thus reduces the impact of potentially unstable coefficients. I recommend 0.1 for data sets with more than 10,000 records, and a smaller shrinkage when growing many trees.

If you set n.trees = 1000 and shrinkage = 0.1, you will get different values. And if you want to know the relative influence of each variable in the gbm, use summary.gbm() rather than varImp(). Of course, varImp() is a good function too, but I recommend summary.gbm().
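A minimal sketch of that refit, using a small synthetic data set in place of your bigdatafile.csv (the predictor names x1, x2, x3 are made up for illustration):

```r
library(gbm)

# Synthetic stand-in for the asker's data: a 0/1 TARGET driven
# mostly by x1, a little by x2, and not at all by x3.
set.seed(1)
n  <- 2000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$TARGET <- as.numeric(df$x1 + 0.5 * df$x2 + rnorm(n) > 0)

# More trees and a larger learning rate than in the question.
fit <- gbm(TARGET ~ ., data = df,
           distribution = "adaboost",
           n.trees = 1000,
           shrinkage = 0.1,
           interaction.depth = 1,
           verbose = FALSE)

# Relative influence of each predictor; the rel.inf column sums to 100.
ri <- summary(fit, n.trees = 1000, plotit = FALSE)
print(ri)
```

With enough trees and a sensible learning rate, summary() should spread non-zero influence across the predictors that actually matter, instead of concentrating it on a handful as in the 50-tree model.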

Good luck.