Random cropping and flipping in convolutional neural networks

chronosynclastic · Sep 29, 2015 · Viewed 9.5k times

In a lot of research papers I read about Convolutional Neural Networks (CNNs), I see that people randomly crop a square region (e.g. 224x224) from the images and then randomly flip it horizontally. Why is this random cropping and flipping done? Also, why do people always crop a square region? Can CNNs not work on rectangular regions?

Answer

ypx · Sep 29, 2015

This is referred to as data augmentation. By applying transformations to the training data, you're adding synthetic data points. This exposes the model to additional variations without the cost of collecting and annotating more data. This can have the effect of reducing overfitting and improving the model's ability to generalize.

The intuition behind flipping an image is that an object should be equally recognizable as its mirror image. Note that horizontal flipping is the kind most often used; vertical flipping doesn't always make sense, but this depends on the data.

The idea behind cropping is to reduce the contribution of the background to the CNN's decision. That's useful if you have labels locating where your object is: it lets you use the surrounding regions as negative examples and build a better detector. Random cropping can also act as a regularizer, basing the classification on the presence of parts of the object instead of focusing everything on a very distinct feature that may not always be present.
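As a rough illustration, here is a minimal NumPy sketch of the crop-then-flip augmentation described above (the function name, the 224 default, and the flip probability of 0.5 are assumptions for the example, not taken from any particular paper):

```python
import numpy as np

def random_crop_and_flip(image, crop_size=224, rng=None):
    """Randomly crop a square region and flip it horizontally with
    probability 0.5. `image` is an H x W x C array assumed to be at
    least crop_size pixels in both spatial dimensions."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]  # reverse the width axis = horizontal flip
    return crop
```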

Why do people always crop a square region?

This is not a limitation of CNNs. It could be a limitation of a particular implementation, or a deliberate design choice, since assuming a square input allows the implementation to be optimized for speed. I wouldn't read too much into this.

CNNs with variable-sized input vs. fixed-sized input:

This is not specific to cropping to a square, but is more generally about why the input is sometimes resized/cropped/warped before being fed into a CNN:

Something to keep in mind is that designing a CNN involves deciding whether or not to support variable-sized input. Convolution operations, pooling, and non-linearities will work for any input dimensions. However, when using CNNs for image classification you usually end up with fully-connected layer(s), such as logistic regression or an MLP. The fully-connected layer is how the CNN produces a fixed-size output vector, and that fixed-size output can restrict the CNN to a fixed-size input.
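To make that constraint concrete, here is a toy NumPy sketch (a bare single-channel "valid" convolution, not a real CNN layer): the convolution runs on any input size, but a fully-connected layer bakes the flattened input dimension into its weight matrix.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Naive single-channel 'valid' convolution: output is
    (H - kh + 1) x (W - kw + 1), so any input size works."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.ones((3, 3)) / 9.0
print(conv2d_valid(np.zeros((224, 224)), kernel).shape)  # (222, 222)
print(conv2d_valid(np.zeros((180, 240)), kernel).shape)  # (178, 238)

# A fully-connected layer is a fixed weight matrix: its input length
# is baked in, so only one flattened feature-map size fits.
fc_weights = np.random.randn(10, 222 * 222)  # expects exactly 222*222 inputs
logits = fc_weights @ conv2d_valid(np.zeros((224, 224)), kernel).ravel()  # works
# ...the 180x240 input's 178*238 features would not fit fc_weights.
```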

There are definitely workarounds to allow for variable-sized input while still producing a fixed-size output. The simplest is to use a convolutional layer to perform classification over regular patches of the image. This idea has been around for a while; the intention behind it was to detect multiple occurrences of an object in the image and classify each occurrence. The earliest example I can think of is the work by Yann LeCun's group in the 1990s on simultaneously classifying and localizing digits in a string. This is referred to as turning a CNN with fully-connected layers into a fully convolutional network. More recent examples of fully convolutional networks are applied to semantic segmentation, classifying each pixel in an image; there, the output is required to match the dimensions of the input.

Another solution is to use global pooling at the end of the CNN to turn variable-sized feature maps into a fixed-size output: the size of the pooling window is set equal to that of the feature map computed by the last convolutional layer.
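As a sketch of the global-pooling idea (the channel count and spatial sizes are made up for illustration):

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse a C x H x W stack of feature maps to a length-C
    vector by averaging over the spatial axes, whatever H and W are."""
    return feature_maps.mean(axis=(1, 2))

# Feature maps produced from two differently sized inputs...
small = np.random.randn(512, 5, 5)    # e.g. from a small square image
large = np.random.randn(512, 9, 12)   # e.g. from a larger rectangular image

# ...both collapse to the same fixed-size vector for the classifier.
print(global_average_pool(small).shape)  # (512,)
print(global_average_pool(large).shape)  # (512,)
```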