I am a student of clustering and R. In order to obtain a better grip of both I would like to compute the distance between centroids and my xy-matrix for each iteration till it "converges". How can I solve for step 2 and 3 using R?
library(fields)
x <- c(3,6,8,1,2,2,6,6,7,7,8,8)
y <- c(5,2,3,5,4,6,1,8,3,6,1,7)
df <- data.frame(x,y) initial matrix
a <- c(3,6,8)
b <- c(5,2,3)
df1 <- data.frame(a,b) # initial centroids
Here is what I want to do:
I0 <- t(rdist(df, df1))
after zero iterationI tried the kmeans
function. But for some reasons it produces those centroids which have to come up at the end. That is I defined the start of:
start <- matrix(c(3,5,6,2,8,3), 3, byrow = TRUE)
cluster <- kmeans(df,centers = start, iter.max = 1) # one iteration
kmeans
doesn't allow me to track the movement of the centroids. Therefore I would like to do it "manually" by applying step 2 & 3 using R.
Your main question seems to be how to calculate distances between a data matrix and some set of points ("centers").
For this you can write a function that takes as input a data matrix and your set of points and returns distances for each row (point) in the data matrix to all the "centers".
Here is such a function:
myEuclid <- function(points1, points2) {
distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
for(i in 1:nrow(points2)) {
distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
}
distanceMatrix
}
points1
is the data matrix with points as rows and dimensions as columns. points2
is the matrix of centers (points as rows again). The first line of code just defines the answer matrix (which will have as many rows as there are rows in the data matrix and as many columns as there are centers). So the point i,j
in the result matrix will be the distance from the ith point to the jth center.
Then the for loop iterates over all centers. For each center it computes the euclidean distance from each point to the current center and returns the result. This line here: sqrt(rowSums(t(t(points1)-points2[i,])^2))
is euclidean distance. Inspect it closer and look up the formula if you have any troubles with that. (the transposes there are mainly done to make sure subtraction is being done row-wise).
Now you can also implement k-means algorithm:
myKmeans <- function(x, centers, distFun, nItter=10) {
clusterHistory <- vector(nItter, mode="list")
centerHistory <- vector(nItter, mode="list")
for(i in 1:nItter) {
distsToCenters <- distFun(x, centers)
clusters <- apply(distsToCenters, 1, which.min)
centers <- apply(x, 2, tapply, clusters, mean)
# Saving history
clusterHistory[[i]] <- clusters
centerHistory[[i]] <- centers
}
list(clusters=clusterHistory, centers=centerHistory)
}
As you can see it's also a very simple function - it takes data matrix, centers, your distance function (the one defined above) and number of wanted iterations.
The clusters are defined by assigning the closest center for each point. And centers are updated as a mean of the points assigned to that center. Which is a basic k-means algorithm).
Let's try it out. Define some random points (in 2d, so number of columns = 2)
mat <- matrix(rnorm(100), ncol=2)
Assign 5 random points from that matrix as initial centers:
centers <- mat[sample(nrow(mat), 5),]
Now run the algorithm:
theResult <- myKmeans(mat, centers, myEuclid, 10)
Here are the centers in the 10th iteration:
theResult$centers[[10]]
[,1] [,2]
1 -0.1343239 1.27925285
2 -0.8004432 -0.77838017
3 0.1956119 -0.19193849
4 0.3886721 -1.80298698
5 1.3640693 -0.04091114
Compare that with implemented kmeans
function:
theResult2 <- kmeans(mat, centers, 10, algorithm="Forgy")
theResult2$centers
[,1] [,2]
1 -0.1343239 1.27925285
2 -0.8004432 -0.77838017
3 0.1956119 -0.19193849
4 0.3886721 -1.80298698
5 1.3640693 -0.04091114
Works fine. Our function however tracks the iterations. We can plot the progress over the first 4 iterations like this:
par(mfrow=c(2,2))
for(i in 1:4) {
plot(mat, col=theResult$clusters[[i]], main=paste("itteration:", i), xlab="x", ylab="y")
points(theResult$centers[[i]], cex=3, pch=19, col=1:nrow(theResult$centers[[i]]))
}
Nice.
However this simple design allows for much more. For example if we want to use another kind of distance (not euclidean) we can just use any function that takes data and centers as inputs. Here is one for correlation distances:
myCor <- function(points1, points2) {
return(1 - ((cor(t(points1), t(points2))+1)/2))
}
And we then can do Kmeans based on those:
theResult <- myKmeans(mat, centers, myCor, 10)
The resulting picture for 4 iterations then looks like this:
Even thou we specified 5 clusters - there were 2 left at the end. That is because for 2 dimensions the correlation can have to values - either +1 or -1. Then when looking for the clusters each point get's assigned to one center, even if it has the same distance to multiple centers - the first one get's chosen.
Anyway this is now getting out of scope. The bottom line is that there are many possible distance metrics and one simple function allows you to use any distance you want and track the results over iterations.