I'm messing around with machine learning, and I've written a k-means implementation in Python. It takes two-dimensional data and organises the points into clusters. Each data point also has a class value of either 0 or 1.
What confuses me is how I can then use the algorithm to predict class values for another set of two-dimensional data whose labels are unknown. For each cluster, should I average the class values of the points within it to either 0 or 1, and then, if an unknown point is closest to that cluster, have it take on that averaged value? Or is there a smarter method?
Cheers!
To assign a new data point to one of a set of clusters created by k-means, you just find the centroid nearest to that point.
In other words, these are the same steps you used for the iterative assignment of each point in your original data set to one of k clusters. The only difference here is that the centroids you use for this computation are the final set, i.e., the centroid values from the last iteration.
Here's one implementation in Python (with NumPy):
>>> import numpy as NP
>>> # just made up values--based on your spec (2D data + 2 clusters)
>>> centroids
array([[54, 85],
       [99, 78]])
>>> # randomly generate a new data point within the problem domain:
>>> new_data = NP.array([67, 78])
>>> # to assign a new data point to a cluster ID,
>>> # find its closest centroid:
>>> diff = centroids - new_data  # NumPy broadcasting: (2, 2) - (2,)
>>> diff
array([[-13,   7],
       [ 32,   0]])
>>> dist = NP.sqrt(NP.sum(diff**2, axis=-1)) # Euclidean distance
>>> dist
array([ 14.76, 32. ])
>>> closest_centroid = centroids[NP.argmin(dist)]
>>> closest_centroid
array([54, 85])
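As for the 0/1 class values: your averaging idea is essentially a majority vote within each cluster, and that's a perfectly reasonable way to turn the clustering into a classifier. Here's a minimal sketch of that step, continuing the session above. It assumes two hypothetical arrays from your training run: labels, holding the known 0/1 class of each training point, and assignments, holding the cluster index k-means gave each training point at its last iteration.
>>> # labels       -- hypothetical: known 0/1 class of each training point
>>> # assignments  -- hypothetical: final cluster index of each training point
>>> cluster_id = NP.argmin(dist)  # index of the nearest centroid (0 here)
>>> # average the 0/1 labels of the training points in that cluster and round,
>>> # i.e., take the majority class, and use it as the prediction:
>>> predicted_class = int(NP.round(labels[assignments == cluster_id].mean()))
If you end up using scikit-learn, its KMeans estimator handles the centroid-assignment part for you: after fit(), calling predict(new_data) returns the index of the nearest centroid for each new point, and you can apply the same per-cluster majority vote on top of that.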