obese is a binary response variable, with 1 indicating obese and 0 not obese. weight is a continuous predictor.
Using a random forest (RF) to classify obese:
library(randomForest)
rf <- randomForest(factor(obese)~weight)
This gives a fitted object containing:
> summary(rf)
                Length Class  Mode
call                 2 -none- call
type                 1 -none- character
predicted          100 factor numeric
err.rate          1500 -none- numeric
confusion            6 -none- numeric
votes              200 matrix numeric
oob.times          100 -none- numeric
classes              2 -none- character
importance           1 -none- numeric
importanceSD         0 -none- NULL
localImportance      0 -none- NULL
proximity            0 -none- NULL
ntree                1 -none- numeric
mtry                 1 -none- numeric
forest              14 -none- list
y                  100 factor numeric
test                 0 -none- NULL
inbag                0 -none- NULL
terms                3 terms  call
I believe the votes matrix shows the fraction of votes, between 0 and 1, that the RF gives to each class for each case (not obese = 0, obese = 1):
> head(rf$votes, 20)
           0          1
1  0.9318182 0.06818182
2  0.9325843 0.06741573
3  0.2784091 0.72159091
4  0.9040404 0.09595960
5  0.3865979 0.61340206
6  0.9689119 0.03108808
7  0.8187135 0.18128655
8  0.7170732 0.28292683
9  0.6931217 0.30687831
10 0.9831461 0.01685393
11 0.3425414 0.65745856
12 1.0000000 0.00000000
13 0.9728261 0.02717391
14 0.9848485 0.01515152
15 0.8783069 0.12169312
16 0.8553459 0.14465409
17 1.0000000 0.00000000
18 0.3389831 0.66101695
19 0.9316770 0.06832298
20 0.9435897 0.05641026
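A quick way to check this reading of the votes matrix is sketched below on simulated stand-in data, not the original obese data. The names rf_sim and the simulated weight and obese vectors are assumptions made for illustration. If the reading is right, each row of vote fractions should sum to 1.

```r
library(randomForest)

# Simulated stand-in for the obese/weight data (an assumption, not the original data)
set.seed(1)
weight <- rnorm(100, mean = 75, sd = 15)
obese  <- rbinom(100, size = 1, prob = plogis((weight - 80) / 5))
rf_sim <- randomForest(factor(obese) ~ weight)

colnames(rf_sim$votes)          # one column per class: "0" and "1"
range(rowSums(rf_sim$votes))    # each row of vote fractions should sum to 1
head(rf_sim$oob.times)          # number of trees for which each case was out-of-bag
```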
Taking those:
votes_2 <- rf$votes[,2]
votes_1 <- rf$votes[,1]
My question is why
pROC::plot.roc(obese, votes_1)
and
pROC::plot.roc(obese, votes_2)
produce the same result.
The first thing to realize is that ROC analysis doesn't care about the exact values of your data. Instead, it looks only at the ranking of the data points and at how well the ranks separate the two classes.
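To make the rank-only point concrete, here is a minimal base-R sketch on simulated data. The helper auc_rank is a hypothetical name, not part of pROC. It computes the AUC from ranks via the Mann-Whitney statistic, so a strictly increasing transformation of the score should leave the result unchanged.

```r
# Simulated data (an assumption for illustration, not the obese data set)
set.seed(42)
label <- rep(c(0, 1), each = 50)
score <- rnorm(100, mean = label)   # cases tend to score higher

# AUC from ranks: Mann-Whitney U divided by the number of case-control pairs
auc_rank <- function(score, label) {
  r <- rank(score)                  # average ranks, so ties count as 1/2
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_raw   <- auc_rank(score, label)
auc_exp   <- auc_rank(exp(score), label)    # strictly increasing transform
auc_ranks <- auc_rank(rank(score), label)   # the ranks themselves
c(auc_raw, auc_exp, auc_ranks)              # all three should be equal
```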
Second, as has been mentioned in a comment above, the votes for classes 0 and 1 sum to 1 for each observation, so each column is 1 minus the other. This means that in terms of ranking, the two are equivalent (modulo the direction of sorting).
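The effect of complementary scores on the ranking can be sketched in base R as follows. The data are simulated vote fractions and pair_auc is a hypothetical helper, both assumptions for illustration. It counts the fraction of (case, control) pairs in which the case has the higher score, which equals the AUC when higher scores indicate cases.

```r
# Simulated vote fractions (an assumption): k of 500 trees vote for class 1
set.seed(1)
label <- rep(c(0, 1), each = 50)
x <- rnorm(100, mean = label)
k <- rbinom(100, size = 500, prob = plogis(2 * x - 1))
vote_1 <- k / 500          # fraction of votes for class 1
vote_0 <- 1 - vote_1       # fraction of votes for class 0; each pair sums to 1

# Fraction of (case, control) pairs where the case scores higher; ties count as 1/2
pair_auc <- function(score, label) {
  cases    <- score[label == 1]
  controls <- score[label == 0]
  mean(outer(cases, controls, ">")) + 0.5 * mean(outer(cases, controls, "=="))
}

auc_1 <- pair_auc(vote_1, label)
auc_0 <- pair_auc(vote_0, label)
c(auc_1, auc_0, auc_1 + auc_0)   # the last value should be 1
```

With the comparison direction held fixed, the two columns give mirror-image results. Flipping the direction for one of them makes the results identical.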
The last piece of the puzzle is that pROC doesn't assume that you are providing the predictor as the probability of belonging to the positive class. Instead, you can pass any kind of score, and the direction of the comparison is detected automatically. This is done silently by default, but you can see what happens by setting the quiet flag to FALSE:
> pROC::roc(obese, votes_2, quiet = FALSE)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
> pROC::roc(obese, votes_1, quiet = FALSE)
Setting levels: control = 0, case = 1
Setting direction: controls > cases
Notice how in the case of votes_1 (the votes for class 0) it detected that the negative class had higher values (based on the median) and set the direction of the comparison accordingly.
If this is not what you want, you can always set the levels and direction arguments explicitly:
> pROC::roc(obese, votes_1, levels = c(0, 1), direction = "<")
This will result in a "reversed" curve showing that votes_1, when higher values are taken to indicate the positive class, performs worse than random.
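The difference between automatic and forced direction can be sketched on simulated data as follows. The variables score_case and score_control are hypothetical complementary scores standing in for the two vote columns, not the original data. With automatic detection both scores should give the same AUC. With direction = "<" forced on the score that is higher for controls, the AUC should become 1 minus that value.

```r
library(pROC)

# Simulated labels and complementary vote fractions (assumptions for illustration)
set.seed(1)
label <- rep(c(0, 1), each = 100)
x <- rnorm(200, mean = label)
k <- rbinom(200, size = 500, prob = plogis(2 * x - 1))
score_case    <- k / 500             # higher for cases
score_control <- 1 - score_case      # higher for controls

# Automatic direction: pROC picks "<" or ">" from the group medians
auto_case    <- roc(label, score_case,    levels = c(0, 1), quiet = TRUE)
auto_control <- roc(label, score_control, levels = c(0, 1), quiet = TRUE)
c(auto_case$direction, auto_control$direction)
c(as.numeric(auc(auto_case)), as.numeric(auc(auto_control)))   # same AUC

# Forced direction "<": the complementary score is no longer flipped
forced_control <- roc(label, score_control, levels = c(0, 1),
                      direction = "<", quiet = TRUE)
as.numeric(auc(forced_control))      # 1 minus the AUC above
```

Setting levels and direction explicitly is the safer choice whenever AUCs are compared across models or resamples, since automatic detection can hide a score that ranks the classes the wrong way round.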