I noticed in a number of places that people use something like this, usually in fully convolutional networks, autoencoders, and similar architectures:
model.add(UpSampling2D(size=(2,2)))
model.add(Conv2DTranspose(kernel_size=k, padding='same', strides=(1,1)))
I am wondering what the difference is between that and simply:
model.add(Conv2DTranspose(kernel_size=k, padding='same', strides=(2,2)))
Links to any papers that explain this difference are welcome.
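For concreteness, here is a minimal runnable sketch of the two variants (the filter count f, the kernel size value, and the input shape are just placeholders); both double the spatial resolution:

from tensorflow.keras import layers, models

k, f = 3, 16  # placeholder kernel size and filter count

# Variant 1: fixed nearest-neighbour upsampling, then a stride-1 (transposed) convolution
m1 = models.Sequential([
    layers.Input(shape=(8, 8, f)),
    layers.UpSampling2D(size=(2, 2)),
    layers.Conv2DTranspose(f, kernel_size=k, padding='same', strides=(1, 1)),
])

# Variant 2: a single learnable transposed convolution with stride 2
m2 = models.Sequential([
    layers.Input(shape=(8, 8, f)),
    layers.Conv2DTranspose(f, kernel_size=k, padding='same', strides=(2, 2)),
])

print(m1.output_shape)  # (None, 16, 16, 16)
print(m2.output_shape)  # (None, 16, 16, 16)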
Here and here you can find really nice explanations of how transposed convolutions work. To sum up both of these approaches:
In your first approach, you are first upsampling your feature map:
[[1, 2], [3, 4]] -> [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
and then you apply a classical convolution (Conv2DTranspose with stride=1 and padding='same' is equivalent to Conv2D).
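As a quick check (assuming TensorFlow 2.x), UpSampling2D with its default nearest-neighbour interpolation reproduces exactly this repetition:

import numpy as np
import tensorflow as tf

x = np.array([[1., 2.], [3., 4.]], dtype=np.float32).reshape(1, 2, 2, 1)  # NHWC
up = tf.keras.layers.UpSampling2D(size=(2, 2))  # interpolation='nearest' by default
print(up(x)[0, :, :, 0].numpy())
# [[1. 1. 2. 2.]
#  [1. 1. 2. 2.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]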
In your second approach you are first un(max)pooling your feature map:
[[1, 2], [3, 4]] -> [[1, 0, 2, 0], [0, 0, 0, 0], [3, 0, 4, 0], [0, 0, 0, 0]]
and then apply a classical convolution with the given kernel_size, filters, etc.
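As a small illustration, the zero-filled intermediate map can be built explicitly in numpy (the stride s below is the only assumption); this is the map a stride-2 transposed convolution effectively slides its kernel over:

import numpy as np

x = np.array([[1., 2.], [3., 4.]])
s = 2  # stride of the transposed convolution
z = np.zeros((x.shape[0] * s, x.shape[1] * s))
z[::s, ::s] = x  # insert (s - 1) zeros between neighbouring input values
print(z)
# [[1. 0. 2. 0.]
#  [0. 0. 0. 0.]
#  [3. 0. 4. 0.]
#  [0. 0. 0. 0.]]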
A fun fact: although these approaches are different, they share something in common. A transposed convolution is meant to approximate the gradient of a convolution, so the first approach approximates the gradient of sum pooling, whereas the second approximates the gradient of max pooling. This is why the first approach tends to produce slightly smoother results.
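To make the gradient analogy concrete, here is a small sketch (assuming TensorFlow 2.x) of the two gradient patterns: average pooling (a scaled sum pooling) spreads the incoming gradient uniformly over each window, much like nearest-neighbour upsampling, while max pooling routes it to a single position per window, much like the zero-filled map above:

import tensorflow as tf

x = tf.reshape(tf.range(16, dtype=tf.float32), (1, 4, 4, 1))  # NHWC

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.avg_pool2d(x, ksize=2, strides=2, padding='VALID')
print(tape.gradient(y, x)[0, :, :, 0])  # 0.25 everywhere: gradient spread over each 2x2 window

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.max_pool2d(x, ksize=2, strides=2, padding='VALID')
print(tape.gradient(y, x)[0, :, :, 0])  # a single 1 per 2x2 window, zeros elsewhere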
Other reasons why you might see the first approach are:
- Conv2DTranspose (and its equivalents) is relatively new in keras, so for a long time the only way to perform learnable upsampling was UpSampling2D,
- the author of keras, Francois Chollet, used this approach in one of his tutorials,
- in the past, transposed convolutions in keras suffered from some API inconsistencies.