Removing outliers from a k-means cluster

carro · Dec 21, 2012 · Viewed 15.5k times

I have a number of small data sets, each containing 10 XY coordinates. I am using Matlab (R2012a) and k-means to obtain a centroid. In some of the clusters (see figure below) I can see some extreme points. Because my data sets are so small, a single outlier ruins the value of my centroid. Is there an easy way to exclude these points? Supposedly Matlab has an 'exclude outliers' function, but I can't see it anywhere in the tool menu. Thank you for your help! (And yes, I am new to this :-)

[Figure: image not preserved]

Answer

Erich Schubert · Jan 1, 2013

k-means can be quite sensitive to outliers in your data set. The reason is simply that k-means tries to minimize the sum of squared distances from each point to its cluster centroid, so a large deviation (such as that of an outlier) gets a lot of weight.
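To make the effect concrete, here is a small sketch in Python (standard library only, not Matlab). The ten points are made up for illustration and are not the data from the question. It computes the centroid with and without one extreme point, and the share of the sum of squares that the extreme point contributes.

```python
# Ten 2-D points: nine grouped tightly around (1, 1) plus one extreme point.
# These values are invented for illustration only.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0),
          (0.8, 1.0), (1.0, 0.8), (1.1, 1.1), (0.9, 0.9), (10.0, 10.0)]

def centroid(pts):
    """Arithmetic mean of the points (the minimizer of the sum of squares)."""
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

def squared_distance(p, center):
    return (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2

with_outlier = centroid(points)
without_outlier = centroid(points[:-1])

# Fraction of the total sum of squares that comes from the extreme point alone.
total = sum(squared_distance(p, with_outlier) for p in points)
outlier_share = squared_distance(points[-1], with_outlier) / total

print("centroid with outlier:   ", with_outlier)
print("centroid without outlier:", without_outlier)
print("outlier share of sum of squares:", round(outlier_share, 3))
```

With only ten points per data set, a single extreme point accounts for most of the objective and drags the centroid well away from the rest of the group.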

If you have a noisy data set with outliers, you might be better off using an algorithm with specialized noise handling, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Note the "N" in the acronym: Noise. Unlike k-means and many other clustering algorithms, DBSCAN can decide not to cluster objects that lie in regions of low density.
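As a rough illustration of how this works, below is a minimal DBSCAN sketch in Python (standard library only, not Matlab). It is a simplified, brute-force version for tiny data sets, not a reference implementation. The sample points and the parameter values `eps=0.5` and `min_pts=3` are assumptions chosen for this example; in practice they have to be tuned to the scale and density of your own data.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

# Invented sample data: nine points near (1, 1) and one extreme point.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0),
          (0.8, 1.0), (1.0, 0.8), (1.1, 1.1), (0.9, 0.9), (10.0, 10.0)]

labels = dbscan(points, eps=0.5, min_pts=3)   # parameter values are assumptions

# Keep only the points that were assigned to a cluster, then average them.
kept = [p for p, label in zip(points, labels) if label != -1]
clean_centroid = (sum(x for x, _ in kept) / len(kept),
                  sum(y for _, y in kept) / len(kept))
print(labels)
print(clean_centroid)
```

Points labeled -1 are treated as noise and left out of the centroid, which is one way to get the "exclude outliers" behavior asked about in the question.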