What does kernel_constraint=max_norm(3) do?

aztec242 · Aug 31, 2017 · Viewed 10.2k times

In one of the tutorials I am working on (link given below), the author outlines the baseline neural network structure as:

Convolutional input layer, 32 feature maps with a size of 3×3, a rectifier activation function and a weight constraint of max norm set to 3.

model.add(Conv2D(32, (3, 3), input_shape=(3, 32, 32), padding='same', activation='relu', kernel_constraint=maxnorm(3)))

What does weight constraint of max norm mean and do to the Conv layer? (We are using Keras.)

https://machinelearningmastery.com/object-recognition-convolutional-neural-networks-keras-deep-learning-library/

Thank you!

Answer

McLawrence · Aug 31, 2017

What does a weight constraint of max_norm do?

maxnorm(m) will, if the L2 norm of your weights exceeds m, scale the weight matrix by a factor that reduces the norm back to m. You can see this in the Keras code, in class MaxNorm(Constraint):

(The source code now lives in TensorFlow, as part of tf.keras.)

def __call__(self, w):
    # L2 norm of w along self.axis (keepdims so it broadcasts against w)
    norms = K.sqrt(K.sum(K.square(w), axis=self.axis, keepdims=True))
    # cap each norm at max_value; norms already below the cap are unchanged
    desired = K.clip(norms, 0, self.max_value)
    # rescale w toward the capped norm; K.epsilon() guards against division by zero
    w *= (desired / (K.epsilon() + norms))
    return w
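To see what this does numerically, here is a small plain-NumPy sketch of the same rescaling (the function max_norm_numpy and the example values are mine, not from Keras); for simplicity it reduces over the whole tensor:

import numpy as np

def max_norm_numpy(w, max_value=3.0, eps=1e-7):
    # L2 norm of the whole tensor (whole-matrix analogue of the Keras code above)
    norm = np.sqrt(np.sum(np.square(w)))
    # cap the norm at max_value, leaving smaller norms untouched
    desired = np.clip(norm, 0.0, max_value)
    # rescale: a no-op when norm <= max_value, a shrink otherwise
    return w * (desired / (eps + norm))

w = np.full((2, 2), 5.0)                   # L2 norm = sqrt(4 * 25) = 10
print(np.linalg.norm(max_norm_numpy(w)))   # ~3.0, i.e. w was scaled by 3/10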

Additionally, maxnorm has an axis argument along which the norm is calculated. In your example you don't specify an axis, so the Keras default (axis=0) is used. If, for example, you want to constrain the norm of every convolutional filter, and you are using tf dimension ordering, the weight matrix will have the shape (rows, cols, input_depth, output_depth); calculating the norm over axis=[0, 1, 2] will then constrain each filter to the given norm.
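As a minimal sketch of both variants, using the current tf.keras spelling max_norm (the layer sizes are illustrative and tf dimension ordering is assumed):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.constraints import max_norm

model = Sequential([
    # Tutorial-style usage: max_norm(3) with the default axis
    Conv2D(32, (3, 3), padding='same', activation='relu',
           input_shape=(32, 32, 3),
           kernel_constraint=max_norm(3)),
    # Per-filter constraint: the kernel has shape
    # (rows, cols, input_depth, output_depth), so reducing over
    # axis=[0, 1, 2] gives one norm per output filter
    Conv2D(32, (3, 3), padding='same', activation='relu',
           kernel_constraint=max_norm(3, axis=[0, 1, 2])),
])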

Why do it?

Constraining the weight matrix directly is another kind of regularization. If you use a simple L2 regularization term, you penalize high weights through your loss function. With this constraint, you regularize directly. As also linked in the Keras code, this seems to work especially well in combination with a dropout layer. For more information, see chapter 5.1 of the Dropout paper (Srivastava et al., 2014).
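A minimal sketch of that combination, assuming tf.keras (the layer sizes, dropout rate, and optimizer are placeholders, not from the tutorial):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import max_norm

# Dropout plus a max-norm cap: dropout injects noise during training,
# while the norm constraint keeps the weights from growing without bound.
model = Sequential([
    Dense(128, activation='relu', input_shape=(20,),
          kernel_constraint=max_norm(3)),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='sgd', loss='binary_crossentropy')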