I am building an RNN for classification (there is a softmax layer after the RNN). There are so many options for what to regularize, and I am not sure whether to just try all of them. Would the effect be the same? Which components do I regularize for what situation?
The components being:
Regularizers that'll work best will depend on your specific architecture, data, and problem; as usual, there isn't a single cut to rule all, but there are do's and (especially) don't's, as well as systematic means of determining what'll work best - via careful introspection and evaluation.
How does RNN regularization work?
Perhaps the best approach to understanding it is information-based. First, see "How does 'learning' work?" and "RNN: Depth vs. Width". To understand RNN regularization, one must understand how RNN handles information and learns, which the referred sections describe (though not exhaustively). Now to answer the question:
RNN regularization's goal is any regularization's goal: maximizing information utility and traversal of the test loss function. The specific methods, however, tend to differ substantially for RNNs per their recurrent nature - and some work better than others; see below.
RNN regularization methods:
WEIGHT DECAY
General: shrinks the norm ('average') of the weight matrix
sigmoid, tanh, but less so relu
sigmoid, tanh grads flatten out for large activations - linearizing enables neurons to keep learning
Recurrent weights: default activation='sigmoid'
Kernel weights: for many-to-one (return_sequences=False), they work similarly to weight decay on a typical layer (e.g. Dense). For many-to-many (return_sequences=True), however, kernel weights operate on every timestep, so pros & cons similar to above will apply.
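To make the "grads flatten out for large activations" point concrete, here is a minimal pure-Python sketch. The input, starting weight, decay, and learning rate are illustrative assumptions (the decay is exaggerated so the effect shows within a few steps), and only the penalty term is stepped, with no data loss involved.

```python
import math

def tanh_grad(z):
    """Derivative of tanh at pre-activation z: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

def l2_decay_step(w, decay, lr):
    """One gradient-descent step on the penalty decay * w^2 alone."""
    return w - lr * 2.0 * decay * w

x = 2.0   # fixed input (illustrative assumption)
w = 3.0   # large weight: pre-activation w * x = 6.0, deep in tanh saturation

grad_before = tanh_grad(w * x)          # nearly zero: the neuron barely learns
for _ in range(100):
    w = l2_decay_step(w, decay=1e-1, lr=0.1)   # exaggerated decay, for illustration
grad_after = tanh_grad(w * x)           # weight shrank, tanh is back near its linear regime

print(f"grad before decay: {grad_before:.2e}")
print(f"grad after decay:  {grad_after:.2e}")
```

The same shrinkage would matter much less for relu, whose gradient does not flatten for large positive inputs.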
Dropout:
0.2 in practice. Problem: tends to introduce too much noise, and erase important context information, especially in problems w/ limited timesteps.
Recurrent dropout (recurrent_dropout): the recommended dropout
Batch Normalization:
Weight Constraints: set a hard upper bound on the weights' l2-norm; possible alternative to weight decay.
Activity Constraints: don't bother; for most purposes, if you have to manually constrain your outputs, the layer itself is probably learning poorly, and the solution is elsewhere.
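The weight-constraint idea (a hard upper bound on the l2-norm) can be sketched in plain Python as below. The function name and the bound of 2.0 are illustrative assumptions; in Keras the same idea is available as a constraint object such as `tf.keras.constraints.MaxNorm`, passed through a layer's `kernel_constraint` or `recurrent_constraint` argument.

```python
import math

def max_norm(weights, max_value=2.0):
    """Rescale a weight vector so its l2-norm never exceeds max_value.

    Unlike weight decay, vectors already inside the bound are left untouched.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)
    scale = max_value / norm
    return [w * scale for w in weights]

clipped = max_norm([3.0, 4.0])      # norm 5.0 is rescaled down to the bound
untouched = max_norm([0.3, 0.4])    # norm 0.5 is within the bound
print(clipped, untouched)
```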
What should I do? Lots of info - so here's some concrete advice:
Weight decay: try 1e-3, 1e-4, see which works better. Do not expect the same value of decay to work for kernel and recurrent_kernel, especially depending on architecture. Check weight shapes - if one is much smaller than the other, apply smaller decay to former
Dropout: try 0.1. If you see improvement, try 0.2 - else, scrap it
Recurrent Dropout: start with 0.2. Improvement --> 0.4. Improvement --> 0.5, else 0.3.
BatchNormalization, you use_bias=False as an "equivalent"; BN applies to outputs, not hidden-to-hidden transforms.
Introspection: bottom section on 'learning' isn't worth much without this; don't just look at validation performance and call it a day - inspect the effect that adjusting a regularizer has on weights and activations. Evaluate using info toward bottom & relevant theory.
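The recurrent dropout schedule above (0.2, then 0.4, then 0.5, else 0.3) could be automated roughly as follows. This is one possible reading, not a definitive procedure: `evaluate` is a hypothetical callback that trains and validates a model for a given rate and returns a score (higher is better), the 0.0 baseline is an added assumption, and "else 0.3" is interpreted as the fallback when 0.4 does not beat 0.2.

```python
def tune_recurrent_dropout(evaluate):
    """Walk the schedule 0.2 -> 0.4 -> 0.5, falling back to 0.3.

    `evaluate(rate)` is a hypothetical callback returning a validation
    score for a model trained with that recurrent dropout rate.
    """
    best_rate, best_score = 0.0, evaluate(0.0)   # assumed no-dropout baseline
    tried = [0.0]

    def try_rate(rate):
        nonlocal best_rate, best_score
        tried.append(rate)
        score = evaluate(rate)
        improved = score > best_score
        if improved:
            best_rate, best_score = rate, score
        return improved

    if try_rate(0.2):
        if try_rate(0.4):
            try_rate(0.5)
        else:
            try_rate(0.3)
    return best_rate, tried

# Made-up scores standing in for real training runs.
fake_scores = {0.0: 0.70, 0.2: 0.74, 0.4: 0.73, 0.3: 0.75}
best_rate, tried = tune_recurrent_dropout(fake_scores.__getitem__)
print(best_rate, tried)   # 0.3 [0.0, 0.2, 0.4, 0.3]
```

Validation score alone is the weakest signal here; per the introspection advice, each step is better judged together with the resulting weights and activations.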
BONUS: weight decay can be powerful - even more powerful when done right; turns out, adaptive optimizers like Adam can harm its effectiveness, as described in this paper. Solution: use AdamW. My Keras/TensorFlow implementation here.
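To illustrate why an adaptive optimizer can blunt weight decay, here is a scalar sketch of one Adam step with the penalty either folded into the gradient (L2) or applied separately (decoupled, as in AdamW). It is a toy illustration, not the implementation linked above; the function name and hyperparameter values are assumptions, and the data gradient is set to zero to isolate the decay term.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam update on a scalar weight.

    l2:           penalty added to the gradient, so it gets rescaled by Adam
    decoupled_wd: decay applied directly to the weight (AdamW-style)
    """
    g = grad + l2 * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)            # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)            # bias-corrected second moment
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    w_new -= lr * decoupled_wd * w       # decay bypasses the adaptive scaling
    return w_new, m, v

def first_step_size(**decay):
    """Size of the first update for w=1 with zero data gradient."""
    w_new, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, **decay)
    return 1.0 - w_new

for coeff in (1e-4, 1e-2):
    print(f"coeff={coeff:g}  "
          f"Adam+L2 step={first_step_size(l2=coeff):.3e}  "
          f"AdamW step={first_step_size(decoupled_wd=coeff):.3e}")
```

In this toy setting the Adam+L2 step stays near `lr` whatever the coefficient, because the penalty is normalized away along with the gradient, while the decoupled step scales with the coefficient as intended.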
This is too much! Agreed - welcome to Deep Learning. Two tips here:
Conv1D(strides > 1), for many timesteps (>1000); slashes dimensionality, shouldn't harm performance (may in fact improve it).
Introspection Code:
Gradients: see this answer
Weights: see this answer
Weight norm tracking: see this Q & A
Activations: see this answer
Weights: see_rnn.rnn_histogram or see_rnn.rnn_heatmap (examples in README)
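As a rough idea of what weight norm tracking involves, here is a minimal pure-Python sketch; it is not the code from the linked Q & A. The helper names are hypothetical, and the matrices are hard-coded stand-ins for what would, in Keras, come from something like `layer.get_weights()` after each epoch.

```python
import math

def l2_norm(matrix):
    """l2 (Frobenius) norm of a weight matrix given as a list of rows."""
    return math.sqrt(sum(w * w for row in matrix for w in row))

def track_norms(history, weights_by_name):
    """Append the current norm of each named weight matrix to `history`."""
    for name, matrix in weights_by_name.items():
        history.setdefault(name, []).append(l2_norm(matrix))
    return history

history = {}
# Stand-in weights for two "epochs"; real values would be read from the model.
track_norms(history, {"kernel": [[3.0, 4.0]], "recurrent_kernel": [[1.0, 0.0], [0.0, 1.0]]})
track_norms(history, {"kernel": [[0.6, 0.8]], "recurrent_kernel": [[1.0, 0.0], [0.0, 1.0]]})
print(history)
```

A norm that collapses for one matrix while the other stays flat is a hint that the same decay value is not suiting both kernel and recurrent_kernel.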
How does 'learning' work?
The 'ultimate truth' of machine learning that is seldom discussed or emphasized is, we don't have access to the function we're trying to optimize - the test loss function. All of our work is with what are approximations of the true loss surface - both the train set and the validation set. This has some critical implications:
Further, loss functions are way too complex to analyze directly; a better approach is to localize analysis to individual layers, their weight matrices, and roles relative to the entire NN. Two key considerations are:
Feature extraction capability. Ex: the driving mechanism of deep classifiers is, given input data, to increase class separability with each layer's transformation. Higher quality features will filter out irrelevant information, and deliver what's essential for the output layer (e.g. softmax) to learn a separating hyperplane.
Information utility. Dead neurons, and extreme activations are major culprits of poor information utility; no single neuron should dominate information transfer, and too many neurons shouldn't lie purposeless. Stable activations and weight distributions enable gradient propagation and continued learning.
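A quick way to check for the two culprits named above, dead neurons and extreme activations, is to count them over a batch of layer outputs. The sketch below assumes tanh-like outputs in [-1, 1]; the function name and both thresholds are illustrative assumptions, not established cutoffs.

```python
def activation_health(activations, dead_tol=1e-6, saturation=0.99):
    """Fraction of neurons that are dead or saturated across all samples.

    activations: list of samples, each a list of per-neuron outputs.
    A neuron counts as dead if it is ~0 for every sample, and as
    saturated if its magnitude is near 1 for every sample.
    """
    n_neurons = len(activations[0])
    dead = saturated = 0
    for j in range(n_neurons):
        column = [sample[j] for sample in activations]
        if all(abs(a) <= dead_tol for a in column):
            dead += 1
        elif all(abs(a) >= saturation for a in column):
            saturated += 1
    return {"dead_frac": dead / n_neurons,
            "saturated_frac": saturated / n_neurons}

# Two samples, four neurons: neuron 0 is dead, neuron 1 is saturated.
acts = [[0.0, 0.999, 0.3, -0.2],
        [0.0, -0.995, -0.5, 0.1]]
report = activation_health(acts)
print(report)   # {'dead_frac': 0.25, 'saturated_frac': 0.25}
```

Comparing this report before and after changing a regularizer gives a more direct read on information utility than validation loss alone.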
How does regularization work? read above first
In a nutshell, via maximizing NN's information utility, and improving estimates of the test loss function. Each regularization method is unique, and no two exactly alike - see "RNN regularizers".
RNN: Depth vs. Width: not as simple as "one is more nonlinear, other works in higher dimensions".
Update:
Here is an example of a near-ideal RNN gradient propagation for 170+ timesteps:
This is rare, and was achieved via careful regularization, normalization, and hyperparameter tuning. Usually we see a large gradient for the last few timesteps, which drops off sharply toward left - as here. Also, since the model is stateful and fits 7 equivalent windows, gradient effectively spans 1200 timesteps.
Update 2: see 9 w/ new info & correction
Update 3: add weight norms & weights introspection code