Extremely small or NaN values appear when training a neural network

Charles Langlois · Jun 21, 2017 · Viewed 10.7k times

I'm trying to implement a neural network architecture in Haskell, and use it on MNIST.

I'm using the hmatrix package for linear algebra. My training framework is built using the pipes package.

My code compiles and doesn't crash. But the problem is that certain combinations of layer size (say, 1000), minibatch size, and learning rate give rise to NaN values in the computations. After some inspection, I see that extremely small values (on the order of 1e-100) eventually appear in the activations. But even when that doesn't happen, the training still doesn't work: there is no improvement in loss or accuracy.

I checked and rechecked my code, and I'm at a loss as to what the root of the problem could be.

Here's the backpropagation training, which computes the deltas for each layer:

backward lf n (out, tar) das = do
    let δout = tr (derivate lf (tar, out)) -- dE/dy at the output layer
        -- Walk backwards through the layers: each layer's delta is its
        -- activation derivative (elementwise) times the next layer's
        -- weights applied to the next layer's delta
        deltas = scanr (\(l, a') δ ->
                         let w = weights l
                         in (tr a') * (w <> δ)) δout (zip (tail $ toList n) das)
    return deltas

lf is the loss function, n is the network (weight matrix and bias vector for each layer), out and tar are the actual output of the network and the target (desired) output, and das are the activation derivatives of each layer.

In batch mode, out and tar are matrices (rows are output vectors), and das is a list of matrices, one per layer.
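For a sanity check, the textbook per-sample recurrence I'm trying to implement is the following (⊙ is the elementwise product; in batch form the transposes move around because rows are samples):

\delta^{L} = f'(z^{L}) \odot \nabla_{a} E, \qquad \delta^{l} = f'(z^{l}) \odot (W^{l+1})^{\top} \delta^{l+1}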

Here's the actual gradient computation:

grad lf (n, (i, t)) = do
    -- Forward propagation: compute each layer's output and activation derivative
    let (as, as') = unzip $ runLayers n i
        out = last as
    -- Compute deltas with backpropagation
    ds <- backward lf n (out, t) (init as')
    let r  = fromIntegral $ rows i -- Size of the minibatch
        gs = zipWith (\δ a -> tr (δ <> a)) ds (i : init as) -- Gradients for the weights
    return $ GradBatch ((recip r .*) <$> gs, (recip r .*) . squeeze <$> ds)

Here, lf and n are the same as above, i is the input, and t is the target output (both in batch form, as matrices).

squeeze transforms a matrix into a vector by summing over each row. That is, ds is a list of matrices of deltas, where each column corresponds to the deltas for one row of the minibatch. So the gradients for the biases are the deltas averaged over the whole minibatch, and the same goes for gs, the gradients for the weights.
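Concretely, squeeze amounts to the following (a minimal hmatrix sketch of the operation just described; the definition in the repo may be written differently):

import Numeric.LinearAlgebra

-- Collapse a matrix to a vector by summing the entries of each row
-- (multiplying by an all-ones vector does exactly that).
squeeze :: Matrix Double -> Vector Double
squeeze m = m #> konst 1 (cols m)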

Here's the actual update code:

move lr (n, (i, t)) (GradBatch (gs, ds)) = do
    -- Update each layer: take a step of size lr along the gradients
    let update (FC w b af) g δ = FC (w + lr .* g) (b + lr .* δ) af
        n' = Network.fromList $ zipWith3 update (Network.toList n) gs ds
    return (n', (i, t))

lr is the learning rate. FC is the layer constructor, and af is the activation function for that layer.

The gradient descent algorithm makes sure to pass in a negative value for the learning rate. The actual code for the gradient descent is simply a loop around a composition of grad and move, with a parameterized stop condition.
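Schematically, that loop looks like this (a simplified sketch with placeholder names, not the actual code):

-- Repeatedly compute the gradient and apply the update until the stop
-- condition holds; `step` is essentially `grad` followed by `move`.
descend :: Monad m => (s -> Bool) -> (s -> m s) -> s -> m s
descend stop step s
    | stop s    = return s
    | otherwise = step s >>= descend stop step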

Finally, here's the code for a mean square error loss function:

mse :: (Floating a) => LossFunction a a
mse = let f  (y, y') = let gamma = y' - y in gamma ** 2 / 2
          f' (y, y') = y' - y
      in  Evaluator f f'

Evaluator just bundles a loss function and its derivative (for calculating the delta of the output layer).
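In formulas, with y the target and y' the prediction:

E(y, y') = \frac{(y' - y)^2}{2}, \qquad \frac{\partial E}{\partial y'} = y' - y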

The rest of the code is up on GitHub: NeuralNetwork.

So, if anyone has an insight into the problem, or even just a sanity check that I'm correctly implementing the algorithm, I'd be grateful.

Answer

jcft2 · Aug 11, 2020

Do you know about "vanishing" and "exploding" gradients in backpropagation? I'm not too familiar with Haskell, so I can't easily tell what exactly your backprop is doing, but it does look like you are using a logistic curve as your activation function.

If you look at the plot of this function, you'll see that its gradient is nearly 0 at the ends (as input values get very large or very small, the slope of the curve is almost flat), so multiplying or dividing by it during backpropagation yields very small or very large numbers. Doing this repeatedly as you pass through multiple layers causes the activations to approach zero or infinity. Since backprop updates your weights this way during training, you end up with a lot of zeros or infinities in your network.
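To see the scale of the effect, here is the logistic function and its derivative as standalone Haskell (an illustration, not code from your network):

sigmoid, sigmoid' :: Double -> Double
sigmoid  x = 1 / (1 + exp (-x))
sigmoid' x = let s = sigmoid x in s * (1 - s)

-- The derivative peaks at 0.25 (at x = 0) and vanishes in the tails:
-- map sigmoid' [0, 5, 10, 20]  =>  [0.25, ~6.6e-3, ~4.5e-5, ~2.1e-9]

Multiplying factors like these together across many layers shrinks the signal exponentially with depth, which is consistent with the 1e-100-order values you are seeing.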

Solution: there are plenty of methods you can look up for dealing with the vanishing gradient problem, but one easy thing to try is to switch to a non-saturating activation function. ReLU is a popular choice, as it mitigates this particular problem (though it may introduce others).
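For example, a scalar ReLU and its derivative look something like this (a sketch, assuming your activation functions are applied elementwise):

relu, relu' :: Double -> Double
relu  x = max 0 x
relu' x = if x > 0 then 1 else 0
-- The gradient is exactly 1 for every positive input, so backprop no
-- longer multiplies by a near-zero factor at each layer.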