Is it right to normalize data and/or weight vectors in a SOM?

Spacey · Dec 3, 2012 · Viewed 8.2k times

So I am stumped by something that should be simple:

I have written a SOM for a simple 'play' two-dimensional data set. Here is the data:

(image: scatter plot of the two-dimensional data set)

You can make out the 3 clusters by eye.

Now, there are two things that confuse me. The first is that the tutorial I have normalizes the data before the SOM gets to work on it. That is, it normalizes each data vector to have length 1 (Euclidean norm). If I do that, then the data looks like this:

(image: the same data after normalization, projected onto the unit circle)

(This is because all the data has been projected onto the unit circle).

So, my question(s) are as follows:

1) Is this correct? Projecting the data onto the unit circle seems bad, because you can no longer make out the 3 clusters... Is this a fact of life for SOMs (i.e., that they only work on the unit circle)?

2) The second, related question is that not only are the data normalized to length 1, but so are the weight vectors of each output unit after every iteration. I understand this is done so that the weight vectors don't 'blow up', but it seems wrong to me, since the whole point of the weight vectors is to retain distance information. If you normalize them, you lose the ability to 'cluster' properly. For example, how can the SOM possibly distinguish the cluster in the lower left from the cluster in the upper right, since they project onto the unit circle in the same way?
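To make the concern concrete, here is a small Python sketch (not from the original post) showing that a lower-left point and an upper-right point along the same direction become identical after unit-length normalization:

```python
import math

def unit_normalize(v):
    """Scale a vector to Euclidean length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

lower_left = (1.0, 1.0)    # a point from the lower-left cluster
upper_right = (5.0, 5.0)   # a point from the upper-right cluster

# Both project to the same spot on the unit circle,
# so the distance between them becomes zero.
print(unit_normalize(lower_left))   # (0.7071..., 0.7071...)
print(unit_normalize(upper_right))  # (0.7071..., 0.7071...)
```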

I am very confused by this. Should data be normalized to unit length in SOMs? Should the weight vectors be normalized as well?

Thanks!

EDIT

Here is the data, saved as a .mat file for MATLAB. It is a simple two-dimensional data set.

Answer

pater · Dec 4, 2012

Whether to normalize the input data or not depends on what the data represent. Let's say you are doing clustering on two-dimensional (or three-dimensional) input data in which each data vector represents a spatial point: the first dimension is the x coordinate and the second is the y coordinate. In this case you don't normalize the input data, because the input features (the dimensions) are directly comparable to each other.

If instead you are doing clustering in a two-dimensional space where each input vector represents the age and the annual income of a person (the first feature is the age and the second is the annual income), then you must normalize the input features, because they represent different things (different measurement units) on completely different scales. Consider these input vectors: D1(25, 30000), D2(50, 30000) and D3(25, 60000). Both D2 and D3 double one of D1's features. Keep in mind that the SOM uses Euclidean distance measures: Distance(D1, D2) = 25 and Distance(D1, D3) = 30000. This is "unfair" to the first feature (age), because even though you double it, you get a much smaller distance than in the second case (D1, D3).
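A quick Python sketch of the distances above, just to make the scale problem concrete:

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

D1 = (25, 30000)   # (age, annual income)
D2 = (50, 30000)   # age doubled
D3 = (25, 60000)   # income doubled

print(euclidean(D1, D2))  # 25.0   -> doubling the age barely moves the point
print(euclidean(D1, D3))  # 30000.0 -> doubling the income dominates the distance
```

Without normalization, the income column completely drowns out the age column in every distance the SOM computes.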

Check this, which also has a similar example.

If you are going to normalize your input data, you normalize each feature/dimension separately (each column of your input data table). Quoting from the som_normalize manual:

"Normalizations are always one-variable operations"

Check also this for a brief explanation of normalization, and if you want to read more, try this (chapter 7 is what you want).

EDIT:

The most common normalization methods are scaling each dimension to [0,1], or transforming it to have zero mean and standard deviation 1. The first is done by subtracting from each input value the minimum of its dimension (column), and then dividing by the maximum value minus the minimum value (of that dimension).

Xi,norm = (Xi - Xmin)/(Xmax-Xmin)

Yi,norm = (Yi - Ymin)/(Ymax-Ymin)
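As an illustrative sketch (in Python, column-wise, matching the formulas above), min-max scaling of the age/income example might look like:

```python
def min_max_scale(column):
    """Scale one feature (column) to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

ages = [25, 50, 25]                  # ages of D1, D2, D3
incomes = [30000, 30000, 60000]      # incomes of D1, D2, D3

print(min_max_scale(ages))     # [0.0, 1.0, 0.0]
print(min_max_scale(incomes))  # [0.0, 0.0, 1.0]
```

After scaling, doubling the age and doubling the income move a point by the same amount, so neither feature dominates the Euclidean distance.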

In the second method you subtract the mean value of each dimension and then divide by the standard deviation.

Xi,norm = (Xi - Xmean)/(Xsd)
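The second method, sketched the same way (using the population standard deviation; this is one of several conventions):

```python
import math

def z_score(column):
    """Standardize one feature: (x - mean) / sd, using population sd."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / sd for x in column]

ages = [25, 50, 25]
standardized = z_score(ages)
print(standardized)  # the result has mean 0 and standard deviation 1
```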

Each method has pros and cons. For example, the first method is very sensitive to outliers in the data. You should choose after examining the statistical characteristics of your dataset.

Projecting onto the unit circle is not really a normalization method but more of a dimensionality reduction method, since after the projection you could replace each data point with a single number (e.g., its angle). You don't have to do this.
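This last point can be sketched directly: after unit-circle projection, a 2-D point carries no more information than its angle, so points along the same direction collapse together (hypothetical Python illustration):

```python
import math

def to_angle(v):
    """A point on the unit circle is fully described by its angle (radians)."""
    x, y = v
    return math.atan2(y, x)

# Two points at very different distances from the origin, but in the
# same direction, reduce to the same single number:
print(to_angle((1.0, 1.0)))  # 0.785... (pi/4)
print(to_angle((5.0, 5.0)))  # 0.785... (pi/4)
```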