I'm trying to implement a neural network architecture in Haskell and use it on MNIST.
I'm using the hmatrix package for linear algebra. My training framework is built using the pipes package.
My code compiles and doesn't crash. But the problem is, certain combinations of layer size (say, 1000), minibatch size, and learning rate give rise to NaN values in the computations. After some inspection, I see that extremely small values (on the order of 1e-100) eventually appear in the activations. But even when that doesn't happen, the training still doesn't work: there is no improvement in loss or accuracy.
I checked and rechecked my code, and I'm at a loss as to what the root of the problem could be.
Here's the backpropagation training, which computes the deltas for each layer:
backward lf n (out, tar) das = do
    let δout = tr (derivate lf (tar, out)) -- dE/dy
        deltas = scanr (\(l, a') δ ->
            let w = weights l
            in (tr a') * (w <> δ)) δout (zip (tail $ toList n) das)
    return deltas
lf is the loss function, n is the network (weight matrix and bias vector for each layer), out and tar are the actual output of the network and the target (desired) output, and das are the activation derivatives of each layer.
In batch mode, out and tar are matrices (rows are output vectors), and das is a list of matrices.
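For reference, the recursion that scanr is meant to implement can be sketched on plain lists for a single sample (no hmatrix; all names here are illustrative, not from the actual code):

```haskell
import Data.List (transpose)

-- Single-sample sketch of the delta recursion:
--   delta_l = a'_l ⊙ (W_{l+1}ᵀ · delta_{l+1}),
-- seeded with the output-layer delta. Plain lists stand in for hmatrix types.
type Vec = [Double]
type Mat = [[Double]]  -- list of rows; row i holds the weights into neuron i

backDeltas :: Vec -> [(Mat, Vec)] -> [Vec]
backDeltas dOut layers = scanr step dOut layers
  where
    -- multiply the next layer's delta by Wᵀ, then gate by the activation derivative
    step (w, a') d = zipWith (*) a' (map (sum . zipWith (*) d) (transpose w))
```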
Here's the actual gradient computation:
grad lf (n, (i,t)) = do
    -- Forward propagation: compute layer outputs and activation derivatives
    let (as, as') = unzip $ runLayers n i
        out = last as
    ds <- backward lf n (out, t) (init as') -- Compute deltas with backpropagation
    let r = fromIntegral $ rows i -- Size of minibatch
        gs = zipWith (\δ a -> tr (δ <> a)) ds (i : init as) -- Gradients for weights
    return $ GradBatch ((recip r .*) <$> gs, (recip r .*) <$> squeeze <$> ds)
Here, lf and n are the same as above, i is the input, and t is the target output (both in batch form, as matrices). squeeze transforms a matrix into a vector by summing over each row. That is, ds is a list of matrices of deltas, where each column corresponds to the deltas for one row of the minibatch, so the gradients for the biases are the deltas averaged over the whole minibatch. The same goes for gs, which corresponds to the gradients for the weights.
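The averaging described above can be sketched for one layer on plain lists (illustrative names, not the actual code): each sample contributes an outer product δ·aᵀ for the weights and its delta vector for the bias, and the batch gradient is the element-wise mean.

```haskell
-- Minibatch gradient averaging for a single layer (illustrative sketch).
type Vec = [Double]
type Mat = [[Double]]

-- Per-sample weight gradient: outer product delta · activationᵀ
outer :: Vec -> Vec -> Mat
outer d a = [[di * aj | aj <- a] | di <- d]

-- Bias gradient: mean of the per-sample deltas
meanBiasGrad :: [Vec] -> Vec
meanBiasGrad ds = map (/ n) (foldr1 (zipWith (+)) ds)
  where n = fromIntegral (length ds)

-- Weight gradient: mean of the per-sample outer products
meanWeightGrad :: [(Vec, Vec)] -> Mat
meanWeightGrad samples =
    map (map (/ n)) (foldr1 (zipWith (zipWith (+))) [outer d a | (d, a) <- samples])
  where n = fromIntegral (length samples)
```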
Here's the actual update code:
move lr (n, (i,t)) (GradBatch (gs, ds)) = do
    -- Update function
    let update = \(FC w b af) g δ -> FC (w + lr .* g) (b + lr .* δ) af
        n' = Network.fromList $ zipWith3 update (Network.toList n) gs ds
    return (n', (i,t))
lr is the learning rate. FC is the layer constructor, and af is the activation function for that layer.
The gradient descent algorithm makes sure to pass in a negative value for the learning rate. The actual gradient descent code is simply a loop around a composition of grad and move, with a parameterized stop condition.
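That driver loop might look like the following sketch (descend, step, and stop are illustrative names, not the actual API; the real code also threads the minibatch and runs in a monad):

```haskell
-- Generic descent loop: apply one (grad-then-move) step repeatedly
-- until the stop condition holds. Purely illustrative.
descend :: (s -> s) -> (s -> Bool) -> s -> s
descend step stop = go
  where
    go s | stop s    = s            -- stop condition met: return current state
         | otherwise = go (step s)  -- otherwise take another step
```

For example, minimizing x² with step x → x − 0.1·(2x) drives x toward 0.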
Finally, here's the code for a mean square error loss function:
mse :: (Floating a) => LossFunction a a
mse = let f (y, y') = let gamma = y' - y in gamma ** 2 / 2
          f' (y, y') = y' - y
      in Evaluator f f'
Evaluator just bundles a loss function and its derivative (for calculating the delta of the output layer).
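One cheap sanity check on such a pair is to compare f' against a central finite difference of f. This is a standalone sketch with types specialized to Double (the real code is polymorphic; names are illustrative):

```haskell
-- The mse pair, specialized: f y y' = (y' - y)^2 / 2, f' y y' = y' - y.
mseF, mseF' :: Double -> Double -> Double
mseF  y y' = (y' - y) ** 2 / 2
mseF' y y' = y' - y

-- Central finite difference: (f(x+h) - f(x-h)) / 2h
numDeriv :: (Double -> Double) -> Double -> Double
numDeriv f x = (f (x + h) - f (x - h)) / (2 * h)
  where h = 1e-6

-- True when the analytic derivative matches the numerical one
checkMse :: Double -> Double -> Bool
checkMse y y' = abs (mseF' y y' - numDeriv (mseF y) y') < 1e-5
```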
The rest of the code is up on GitHub: NeuralNetwork.
So, if anyone has an insight into the problem, or even just a sanity check that I'm correctly implementing the algorithm, I'd be grateful.
Do you know about "vanishing" and "exploding" gradients in backpropagation? I'm not too familiar with Haskell so I can't easily see what exactly your backprop is doing, but it does look like you are using a logistic curve as your activation function.
If you look at the plot of this function you'll see that its gradient is nearly 0 at the ends (as input values get very large or very small, the curve is almost flat), so multiplying by it during backpropagation yields a very small number, and doing this repeatedly as you pass back through multiple layers drives the values toward zero (or, when the weights are large, toward infinity). Since backprop updates your weights by doing this during training, you end up with a lot of zeros or infinities in your network.
Solution: there are loads of methods out there that you can search for to solve the vanishing gradient problem, but one easy thing to try is to change the type of activation function you are using to a non-saturating one. ReLU is a popular choice as it mitigates this particular problem (but might introduce others).
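To see the saturation numerically (a small sketch, not from the post): the logistic derivative s·(1 − s) collapses for inputs of even moderate magnitude, while the ReLU derivative stays at 1 on the positive side.

```haskell
-- Logistic derivative vanishes for large |x|; ReLU's does not (for x > 0).
logistic :: Double -> Double
logistic x = 1 / (1 + exp (negate x))

logistic' :: Double -> Double
logistic' x = let s = logistic x in s * (1 - s)  -- peaks at 0.25 when x = 0

relu' :: Double -> Double
relu' x = if x > 0 then 1 else 0
```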