I need to cluster some data, and I tried kmeans, pam, and clara with R.
The problem is that my data are in a column of a data frame and contain NAs.
I used na.omit() to get my clusters. But then how can I associate them with the original data? The functions return a vector of integers without the NAs, and they don't retain any information about the original positions.
Is there a clever way to associate the clusters with the original observations in the data frame? (Or is there a way to perform the clustering sensibly when NAs are present?)
Thanks
The output of kmeans corresponds to the elements of the object passed as argument x. In your case, you omit the NA elements, and so $cluster indicates the cluster that each element of na.omit(x) belongs to.
Here's a simple example:
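# Example data: 100 random values, 10 of them replaced by NA
d <- data.frame(x=runif(100), cluster=NA)
d$x[sample(100, 10)] <- NA
# Cluster only the non-missing values
clus <- kmeans(na.omit(d$x), 5)
# Write the labels back into the rows where x is not NA
d$cluster[!is.na(d$x)] <- clus$cluster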
In the plot produced by the call below, colour indicates the cluster that each point belongs to (rows where x is NA are not drawn).
plot(d$x, bg=d$cluster, pch=21)
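If you need this for more than one column, the same indexing idea can be wrapped in a small function. The following is only a sketch under some assumptions: cluster_with_na is a hypothetical helper name (it is not part of base R or the cluster package), the columns are assumed to be numeric, and complete.cases() is used to find the rows without any NA.

```r
# Hypothetical helper, not part of base R or any package:
# cluster the complete rows of the chosen columns and return a label
# vector aligned with the original rows (NA where a row was incomplete).
cluster_with_na <- function(data, cols, centers) {
  # TRUE for rows that have no NA in the selected columns
  ok <- complete.cases(data[, cols, drop = FALSE])
  # Start with all labels missing, one per original row
  labels <- rep(NA_integer_, nrow(data))
  # Run kmeans on the complete rows only
  fit <- kmeans(data[ok, cols, drop = FALSE], centers)
  # Put each label back at the position of the row it came from
  labels[ok] <- fit$cluster
  labels
}

set.seed(1)  # only so that the example is reproducible
d <- data.frame(x = runif(100), y = runif(100))
d$x[sample(100, 10)] <- NA
d$cluster <- cluster_with_na(d, c("x", "y"), centers = 5)
```

The same trick should carry over to pam and clara from the cluster package; as far as I know they return the labels in $clustering rather than $cluster, so only the line that extracts the labels would need to change.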