TensorFlow Object Detection API: SSD model using 'keep_aspect_ratio_resizer'

Ultraviolet · Jan 8, 2018 · Viewed 7.7k times

I am trying to detect objects in images of different shapes (not square). I used the faster_rcnn_inception_v2 model, where I can use an image resizer that maintains the aspect ratio of the image, and the output is satisfactory.

image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 100
    max_dimension: 600
  }
}

Now, for faster performance, I want to train using the ssd_inception_v2 model. The sample configuration uses a fixed shape resizer, as below:

image_resizer {
  fixed_shape_resizer {
    height: 300
    width: 300
  }
}

But the problem is that I get very poor detection results because of that fixed resize. I tried changing it to keep_aspect_ratio_resizer, as shown earlier for faster_rcnn_inception_v2, and I get the following error:

InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,100,500,3] vs. shape[1] = [1,100,439,3]

How can I configure the SSD models to resize images while maintaining the aspect ratio?

Answer

gdelab · Jan 8, 2018

SSD and Faster R-CNN work quite differently from one another, so even though Faster R-CNN has no such constraint, SSD needs input images that always have the same size (strictly speaking, it needs the feature maps to always have the same size, but the surest way to guarantee that is to always use the same input size). This is because SSD ends with layers for which the feature-map size must be known in advance, whereas Faster R-CNN has only convolutions (which work on any input size) up to the ROI-pooling layer (which does not need a fixed image size).
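To see the constraint concretely: the feature-map size tracks the input size through the convolutions, as in this minimal tf.keras sketch (illustrative only, not Object Detection API code):

import tensorflow as tf

# One strided convolution: the output (feature-map) size follows the input size.
conv = tf.keras.layers.Conv2D(32, 3, strides=2, padding='same')

print(conv(tf.zeros([1, 300, 300, 3])).shape)  # (1, 150, 150, 32)
print(conv(tf.zeros([1, 300, 450, 3])).shape)  # (1, 150, 225, 32)

# Different input sizes yield different feature-map sizes, which SSD's
# fixed box-prediction setup cannot handle; Faster R-CNN's ROI pooling can.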

So you need to use a fixed shape resizer for SSD. In the best case, your data always has the same width/height ratio; then just use a fixed_shape_resizer with that same ratio (see the config sketch after the list below). Otherwise, you'll have to choose an image size (w, h) yourself, more or less arbitrarily (some kind of average of your data would do). From there you have several options:

  • Just let TF reshape the input to (w, h) with the resizer, without any preprocessing. The problem is that the images will be deformed, which may or may not be a problem depending on your data and the objects you're trying to detect.

  • Crop all the images into sub-images with the same aspect ratio as (w, h). Problem: you'll lose part of each image, or have to run more inferences per image.

  • Pad all images (with black pixels or random white noise) to the same aspect ratio as (w, h). You'll then have to translate the output bounding boxes back to the original image: the coordinates you get are relative to the padded image, so (assuming normalized coordinates and bottom/right padding) you rescale them by padded_size/original_size on each axis; see the sketch after this list. The problem is that some objects will be downsized (relative to the full image size) more than others, which may or may not be a problem depending on your data and what you're trying to detect.
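For the same-ratio case above, the config change is just a fixed_shape_resizer whose width/height match your data's ratio. For example, if all your images are 4:3 (the 400x300 size here is an arbitrary illustration, not a recommendation):

image_resizer {
  fixed_shape_resizer {
    height: 300
    width: 400
  }
}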
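And here is a minimal NumPy sketch of the padding option (the helper names are mine; it assumes bottom/right padding and the API's usual normalized [ymin, xmin, ymax, xmax] box format):

import numpy as np

def pad_to_aspect_ratio(image, target_w, target_h):
    # Pad an (h, w, 3) image with black pixels on the bottom/right
    # so that it matches the target w/h aspect ratio.
    h, w = image.shape[:2]
    target_ratio = target_w / target_h
    if w / h < target_ratio:
        new_h, new_w = h, int(round(h * target_ratio))
    else:
        new_h, new_w = int(round(w / target_ratio)), w
    padded = np.zeros((new_h, new_w, 3), dtype=image.dtype)
    padded[:h, :w] = image  # original content stays in the top-left corner
    return padded, (h, w), (new_h, new_w)

def boxes_to_original(boxes, orig_size, padded_size):
    # Map normalized [ymin, xmin, ymax, xmax] boxes detected on the padded
    # image back to normalized coordinates in the original image by
    # rescaling by padded_size / original_size on each axis, then clipping.
    (h, w), (H, W) = orig_size, padded_size
    scale = np.array([H / h, W / w, H / h, W / w])
    return np.clip(boxes * scale, 0.0, 1.0)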