Using k-NN in R with categorical values

Omri374 · Sep 11, 2012 · Viewed 10.6k times

I want to classify data whose features are mostly categorical. Euclidean distance, or any other metric that assumes numerical features, is not a good fit for that.

I'm looking for a kNN implementation for R that lets me choose the distance measure, for example Hamming distance. Is there a way to use a common kNN implementation, such as the one in {class}, with a different distance function?

I'm using R 2.15.

Answer

Backlin · Sep 11, 2012

As long as you can calculate a distance/dissimilarity matrix (in whatever way you like), you can perform kNN classification yourself without a dedicated kNN package.

# Generate dummy data
y <- rep(1:2, each=50)                          # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5  # Dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)

# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
# apply() returns one column per test sample, so transpose to get one row per test sample
NN <- t(apply(d[test.set, design.set], 1, order))

# Predict class membership of the test set
k <- 5
pred <- apply(NN[, 1:k, drop=FALSE], 1, function(nn){
    tab <- table(y[design.set][nn])
    as.integer(names(tab)[which.max(tab)])      # This is a pretty dirty line
})

# Inspect the results
table(pred, y[test.set])
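
The dummy data above is binary, but the same idea carries over to multi-level categorical features: any function that yields a pairwise dissimilarity matrix will do. Below is a minimal base-R sketch that counts mismatching columns between rows. The helper name `mismatch_distance` and the toy data frame are made up for illustration and are not part of e1071 or any other package.

```r
# Hypothetical helper: for every pair of rows, count the columns whose
# values differ (a Hamming-style distance for categorical data).
mismatch_distance <- function(df) {
  m <- as.matrix(df)                 # factors become a character matrix
  n <- nrow(m)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    # Repeat row i n times and compare it against all rows at once
    row_i <- matrix(m[i, ], n, ncol(m), byrow = TRUE)
    d[i, ] <- rowSums(m != row_i)
  }
  d
}

# Made-up toy data with multi-level factors
toy <- data.frame(
  colour = c("red", "red", "blue", "green"),
  shape  = c("round", "square", "round", "round"),
  size   = c("S", "S", "L", "S"),
  stringsAsFactors = TRUE
)
d_toy <- mismatch_distance(toy)
```

The resulting matrix can be indexed with test and design rows in the same way as `d` above. The loop is quadratic in the number of rows, so this is only meant for small data sets.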

If anybody knows a better way of finding the most common value in a vector than the line marked as dirty in the prediction step, I'd be happy to know.
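
One possible alternative, offered as a sketch rather than a definitive answer, is to count occurrences with `tabulate()` and `match()` instead of going through the names of a `table`. The helper name `most_common` is made up here. It keeps the original type of the values, and in case of a tie it returns the value that appears first in the vector.

```r
# Hypothetical helper: most frequent value of a vector, original type kept.
most_common <- function(v) {
  u <- unique(v)                          # distinct values, in order of appearance
  u[which.max(tabulate(match(v, u)))]     # count each value, pick the largest count
}

vote_int <- most_common(c(2L, 1L, 2L, 1L, 2L))   # integer input
vote_chr <- most_common(c("a", "b", "b"))        # character input
```

Inside the prediction step this could be used as `most_common(y[design.set][nn])`, which avoids the round trip through `names()` and `as.integer()`.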

The drop=FALSE argument is needed to keep the subset of NN a matrix when k=1. Without it, the single-column subset is simplified to a vector and apply throws an error.
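
The effect of drop=FALSE can be checked on a small made-up matrix (`NN_demo` is only an illustration, not the real neighbor matrix):

```r
NN_demo <- matrix(1:6, nrow = 3)                 # 3 x 2 stand-in for NN

collapsed <- NN_demo[, 1:1]                      # dimensions are dropped
kept      <- NN_demo[, 1:1, drop = FALSE]        # still a 3 x 1 matrix

is.matrix(collapsed)   # FALSE
is.matrix(kept)        # TRUE

# apply() needs an object with dimensions, so the collapsed version fails
apply_failed <- inherits(try(apply(collapsed, 1, identity), silent = TRUE),
                         "try-error")
```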