I am using the randomForest package in R, which can compute a proximity matrix (P). The package documentation describes the parameter as: "if proximity=TRUE when randomForest is called, a matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes)."
I obtain the proximity matrix of a random forest as follows:
P <- randomForest(x, y, ntree = 1000, proximity=TRUE)$proximity
When I inspect the P matrix, I see values like P(i,j)=0.971014493, where i and j are two instances in my training data set (x). Such a value does not make sense to me: if the proximity is a "frequency" over the trees, then multiplying it by 1000 (the number of trees in the forest) should give an integer, and it does not. Could someone please help me understand why I get such real numbers in the proximity matrix?
Because, just as with the default predictions, the default proximity is calculated using only the trees in which neither observation was included in the sample used to build that tree (both were "out-of-bag").
The number of trees in which this happens varies from pair to pair, and it certainly won't be a nice round number like 1000. Each entry is therefore a fraction whose denominator is that pair's count of shared out-of-bag trees, not ntree.
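To see how much the denominator can vary, here is a rough base-R sketch. It does not call randomForest; it only assumes that each tree is grown on a plain bootstrap sample of all n rows, and the names n, inbag and oob_pairs are made up for the illustration.

```r
set.seed(1)
n     <- 50     # hypothetical number of training rows
ntree <- 1000   # same number of trees as in the question

# inbag[i, t] is TRUE if row i was drawn into the bootstrap sample of tree t
# (assumption: simple bootstrap with replacement, n draws per tree)
inbag <- sapply(seq_len(ntree), function(t) {
  seq_len(n) %in% sample.int(n, n, replace = TRUE)
})
oob <- !inbag

# oob_pairs[i, j] = number of trees in which rows i and j are both out of bag
oob_pairs <- tcrossprod(1 * oob)

# The off-diagonal counts differ from pair to pair and are far below ntree
pair_counts <- oob_pairs[upper.tri(oob_pairs)]
range(pair_counts)

# A value like the one in the question is what a small denominator produces,
# for example 67 shared terminal nodes out of 69 shared out-of-bag trees
round(67 / 69, 9)
```

The 67/69 pair is only one fraction that matches the reported value; the actual counts for that pair are not recoverable from the proximity alone.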
You'll note that the very next parameter listed after proximity is called oob.prox, which indicates whether to use only out-of-bag pairs (the default) or each and every tree.
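If you want proximities that are plain frequencies over all trees, a minimal sketch is to turn oob.prox off. The built-in iris data stands in for your x and y here, which is purely an assumption for the example.

```r
library(randomForest)

set.seed(42)
ntree <- 1000
x <- iris[, 1:4]      # stand-in predictors, not the data from the question
y <- iris$Species     # stand-in response

# oob.prox = FALSE asks for proximities computed over every tree
P_all <- randomForest(x, y, ntree = ntree,
                      proximity = TRUE, oob.prox = FALSE)$proximity

# Multiplying by ntree should now recover whole-number counts
counts <- P_all * ntree
max(abs(counts - round(counts)))   # zero up to floating-point error
```

With the default oob.prox setting the same check would generally fail, because each pair is divided by its own out-of-bag count.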