Why use softmax as opposed to standard normalization?

Tom picture Tom · Jun 19, 2013 · Viewed 40.7k times · Source

In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution:

enter image description here

This is expensive to compute because of the exponents. Why not simply perform a Z transform so that all outputs are positive, and then normalise just by dividing all outputs by the sum of all outputs?

Answer

Piotr Czapla picture Piotr Czapla · Jul 19, 2017

There is one nice attribute of Softmax as compared with standard normalisation.

It react to low stimulation (think blurry image) of your neural net with rather uniform distribution and to high stimulation (ie. large numbers, think crisp image) with probabilities close to 0 and 1.

While standard normalisation does not care as long as the proportion are the same.

Have a look what happens when soft max has 10 times larger input, ie your neural net got a crisp image and a lot of neurones got activated

>>> softmax([1,2])              # blurry image of a ferret
[0.26894142,      0.73105858])  #     it is a cat perhaps !?
>>> softmax([10,20])            # crisp image of a cat
[0.0000453978687, 0.999954602]) #     it is definitely a CAT !

And then compare it with standard normalisation

>>> std_norm([1,2])                      # blurry image of a ferret
[0.3333333333333333, 0.6666666666666666] #     it is a cat perhaps !?
>>> std_norm([10,20])                    # crisp image of a cat
[0.3333333333333333, 0.6666666666666666] #     it is a cat perhaps !?