This example uses the Single Shot MultiBox Detector (SSD)* object detection
method to train a model that can (1) identify one or more regions of interest
(boxes) around objects in an input image and (2) classify the object found
within each box.

The model expects the data in a format similar to the Pascal Visual Object
Classes (VOC) dataset. Each image in the dataset contains one or more ground
truth objects. Each object is represented by:
  1) A bounding box in absolute boundary coordinates
     (e.g., `x_min`, `y_min`, `x_max`, `y_max`)
  2) A label, one of:
     'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat',
     'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person',
     'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor'
  3) A perceived detection difficulty
     (0: not difficult,  1: difficult)
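For concreteness, the ground truth for one image can be sketched as below. The structure and field names (`boxes`, `labels`, `difficulties`) are illustrative only, not the dataset's actual parsing code; the coordinate values are made up.

```python
# The 20 Pascal VOC object classes, as listed above.
VOC_LABELS = (
    'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat',
    'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person',
    'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor')

# Hypothetical annotation for a single image: parallel lists, one entry
# per ground-truth object.
annotation = {
    'boxes': [[48, 240, 195, 371],    # [x_min, y_min, x_max, y_max], pixels
              [8, 12, 352, 498]],
    'labels': ['dog', 'person'],      # one VOC class per box
    'difficulties': [0, 0],           # 0: not difficult, 1: difficult
}
```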
Training images are resized to 300x300 pixels and are in RGB format.
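Because images are resized to 300x300, the absolute boundary coordinates of each box must be rescaled by the same factors. A minimal sketch of that rescaling (the function name and signature are assumptions for illustration):

```python
def resize_boxes(boxes, old_w, old_h, new_w=300, new_h=300):
    """Rescale absolute boundary coordinates [x_min, y_min, x_max, y_max]
    to match an image resized from (old_w, old_h) to (new_w, new_h)."""
    sx, sy = new_w / old_w, new_h / old_h
    return [[x_min * sx, y_min * sy, x_max * sx, y_max * sy]
            for x_min, y_min, x_max, y_max in boxes]
```

For example, a box spanning an entire 500x400 image maps to the full 300x300 canvas, and interior boxes scale proportionally along each axis.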

The model uses a modified VGG-16 architecture, matching the original
implementation of the SSD approach.
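In the original SSD300 design, detections are predicted from several feature maps of decreasing resolution, with a fixed number of prior (default) boxes per feature-map cell. The sketch below computes the resulting prior count, assuming the grid sizes and boxes-per-location of the original SSD300 configuration from the paper:

```python
# Feature maps used for prediction in SSD300 (per the original paper):
# name -> (square grid size, prior boxes per cell).
FEATURE_MAPS = {
    'conv4_3':  (38, 4),
    'conv7':    (19, 6),
    'conv8_2':  (10, 6),
    'conv9_2':  (5, 6),
    'conv10_2': (3, 4),
    'conv11_2': (1, 4),
}

# Total prior boxes across all feature maps.
total_priors = sum(size * size * n for size, n in FEATURE_MAPS.values())
# 38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4 = 8732
```

The total of 8732 priors per image is why SSD can be trained and evaluated with a fixed-size prediction tensor regardless of how many objects an image contains.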

At inference time, the model accepts an RGB image and produces a set of
predicted boxes with a class label for each; these detections are then
visualized on the input image.
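Because the model emits many overlapping candidate boxes, inference typically applies greedy non-maximum suppression (NMS) before visualization; the SSD paper uses an IoU threshold of 0.45. A minimal pure-Python sketch (the helper names are assumptions, not the example's actual API):

```python
def iou(a, b):
    """Intersection-over-union of two [x_min, y_min, x_max, y_max] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, threshold=0.45):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it
    above the IoU threshold, repeat. Returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < threshold for j in keep):
            keep.append(i)
    return keep
```

Only the surviving boxes and their labels would be drawn on the input image.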


  * The original SSD paper can be found at https://arxiv.org/abs/1512.02325
