Neural Network with softmax activation

Cambium · Apr 21, 2010 · Viewed 12.5k times

edit:

A more pointed question: What is the derivative of softmax to be used in my gradient descent?


This is more or less a research project for a course, and my understanding of neural networks is fairly limited, so please be patient :)


I am currently building a neural network that examines an input dataset and outputs the probability/likelihood of each classification (there are 5 different classes). Naturally, the outputs of all output nodes should sum to 1.

Currently, I have two layers, and I set the hidden layer to contain 10 nodes.

I came up with two different implementations:

  1. Logistic sigmoid for the hidden-layer activation, softmax for the output activation (a sketch of this setup follows the list)
  2. Softmax for both the hidden-layer and the output activation
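
For concreteness, here is a minimal NumPy sketch of what implementation 1 might look like as a forward pass. This is not the asker's code; the weight initialization and the 13-feature input size (matching the attributes of the processed Cleveland data) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Illustrative dimensions: 13 input features, 10 hidden nodes, 5 output classes.
n_in, n_hidden, n_out = 13, 10, 5
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # hidden-layer weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_out, n_hidden))  # output-layer weights
b2 = np.zeros(n_out)

def forward(x):
    h = sigmoid(W1 @ x + b1)   # hidden activations (logistic sigmoid)
    y = softmax(W2 @ h + b2)   # output probabilities, sum to 1
    return h, y

h, y = forward(np.zeros(n_in))
print(y.sum())   # 1.0
```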

I am using gradient descent to find a local minimum of the error in order to adjust the hidden nodes' weights and the output nodes' weights. I am certain that I have this correct for the sigmoid; I am less certain about softmax (or whether I can use gradient descent at all). After a bit of research I couldn't find the answer, so I computed the derivative myself and obtained softmax'(x) = softmax(x) - softmax(x)^2 (this returns a column vector of size n). I have also looked at the MATLAB NN toolkit; the softmax derivative it provides returns a square matrix of size n x n, whose diagonal coincides with the softmax'(x) I calculated by hand, and I am not sure how to interpret that matrix.
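
To reconcile the two forms, here is a small NumPy sketch (not from the question or the MATLAB toolkit) of the full softmax Jacobian, diag(s) - s s^T: its diagonal is exactly the hand-derived softmax(x) - softmax(x)^2, while the off-diagonal entries -s_i * s_j are the cross terms that the element-wise formula leaves out.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    # Full n x n Jacobian: J[i, j] = d softmax(z)_i / d z_j
    #                             = softmax(z)_i * (delta_ij - softmax(z)_j)
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 0.5, -1.0, 0.0])
s = softmax(z)
J = softmax_jacobian(z)
# The diagonal matches the element-wise derivative from the question.
print(np.allclose(np.diag(J), s - s**2))   # True
```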

I ran each implementation with a learning rate of 0.001 and 1000 iterations of backpropagation. However, my NN returns 0.2 (a uniform distribution) for all five output nodes, for any subset of the input dataset.

My conclusions:

  • I am fairly certain that my gradient descent is implemented incorrectly, but I have no idea how to fix it.
  • Perhaps I am not using enough hidden nodes.
  • Perhaps I should increase the number of layers.

Any help would be greatly appreciated!

The dataset I am working with can be found here (processed Cleveland): http://archive.ics.uci.edu/ml/datasets/Heart+Disease

Answer

bayer · Apr 22, 2010

The gradient you use is actually the same as with squared error: output - target. This might seem surprising at first, but the trick is that a different error function is minimized:

Softmax error:

-\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \log(y_{kn})

where log is the natural logarithm, N denotes the number of training examples and K the number of classes (and thus units in the output layer). t_kn denotes the binary coding (0 or 1) of the k'th class in the n'th training example, and y_kn is the corresponding network output.
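
As an illustration (the variable names Y and T are mine, not from the answer), here is a minimal NumPy sketch of this error and of the resulting "output - target" gradient with respect to the output layer's pre-activations:

```python
import numpy as np

def cross_entropy(Y, T):
    # Y: (N, K) array of predicted class probabilities (softmax outputs),
    # T: (N, K) array of one-hot (0/1) targets.
    return -np.sum(T * np.log(Y))

def output_delta(Y, T):
    # Gradient of the cross-entropy error with respect to the output
    # layer's pre-activations: simply "output - target".
    return Y - T

Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3]])
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
print(cross_entropy(Y, T))   # -(log 0.7 + log 0.3)
print(output_delta(Y, T))    # the per-unit error signal for backpropagation
```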

Showing that the gradient is correct might be a good exercise; I haven't done it myself, though.

As for your problem: you can check whether your gradient is correct by numerical differentiation. Say you have a function f and implementations of f and f'. Then the following should hold:

Numerical gradient of the softmax:

f'(x) = \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon} + O(\epsilon^2)
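
A small sketch of such a check in NumPy, assuming f is any scalar-valued function of a vector (the helper name numerical_grad is illustrative, not from the answer):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    # Central-difference approximation of the gradient of a scalar function f.
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[i] += eps
        x_minus[i] -= eps
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

# Example: the analytic gradient of f(x) = sum(x**2) is 2*x.
f = lambda x: np.sum(x ** 2)
x = np.array([0.3, -1.2, 2.0])
print(np.allclose(numerical_grad(f, x), 2 * x))   # True
```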