Hazard label detector with YOLOv3 and automatic dataset generation

8 November 2019
A part of the RoboCup Rescue competition focuses on sensing; one of the tasks is detecting various objects in video, many of them rather obscure. That meant that to train a CNN, we needed to create our own dataset. This project is still a work in progress; updates will be added later.

Since the majority of the obscure objects are planar (hazard labels, signs), the approach I chose was to generate a dataset by transforming the objects using OpenCV and placing them on backgrounds from Open Images (where the door and person labels were kept).

Dataset generation

Examples of training data
Starting with planar source images (for some categories a list of them, since there is variability), a range of different transformations was applied. The idea is that realistic images must be covered by the generated distribution; if there are extra, unrealistic cases, that's not a problem.
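
As a rough illustration, the compositing step could look like the minimal sketch below. The function name, the `transforms` argument and the naive rectangular paste are simplifications of mine, not the project's actual code; it pastes one transformed label onto a background and derives a YOLO-format bounding box:

```python
import random

def compose_sample(label_img, background, transforms, cls):
    """Paste one transformed label onto a background and return the
    image plus a YOLO-style (class, cx, cy, w, h) annotation with
    coordinates normalised to the background size."""
    obj = label_img.copy()
    for t in transforms:           # e.g. blur, HSV jitter, perspective...
        obj = t(obj)
    bh, bw = background.shape[:2]
    oh, ow = obj.shape[:2]         # assumes the object fits the background
    x = random.randint(0, bw - ow)
    y = random.randint(0, bh - oh)
    out = background.copy()
    out[y:y + oh, x:x + ow] = obj  # naive paste; real code would blend edges
    box = (cls, (x + ow / 2) / bw, (y + oh / 2) / bh, ow / bw, oh / bh)
    return out, box
```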

List of transformations used:
  • Blur (motion, median, gaussian)
  • Added noise
  • Scaling in HSV colour space (see the sketch after this list)
  • Cutouts (lines, outside circles)
  • Random handwritten text overlayed onto the object
  • Rotation
  • Perspective
  • Camera lens distortion
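
For instance, the HSV scaling can be as simple as the sketch below. The ranges are illustrative guesses, not the project's values; hue jitter is kept small because, as noted below, similar hazard labels differ mostly in colour:

```python
import cv2
import numpy as np

def jitter_hsv(img, h_shift=6, s_scale=(0.8, 1.2), v_scale=(0.8, 1.2)):
    """Randomly shift hue and scale saturation/value of a BGR image.
    OpenCV stores 8-bit hue in [0, 180), hence the modulo."""
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] + np.random.uniform(-h_shift, h_shift)) % 180
    hsv[..., 1] *= np.random.uniform(*s_scale)
    hsv[..., 2] *= np.random.uniform(*v_scale)
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```
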
Some of these are obvious, some less so. With the noise and colour transformations, an important thing to watch is that hazard labels are often very similar and differ mostly in colour. Text and cutouts represent various obstructions, e.g. viewing the object through a hole. Camera lens distortion is the most unusual one. I noticed that in real images the labels aren't always flat: corners can be lifted, or the whole label can be stuck on a pipe or across an edge. In these cases the network performed very badly. Lens distortion creates an effect similar to the label being placed in a bowl or on a ball. Introducing it improved results significantly, but going forward there is a need for other similar transformations to represent edges and cylinders (pipes).
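
A minimal sketch of such a warp, using a simple radial distortion model built with NumPy and cv2.remap (the coefficients are illustrative, not calibrated camera values):

```python
import cv2
import numpy as np

def lens_distort(img, k1=0.4, k2=0.1):
    """Warp a flat image as if seen through a distorting lens.
    Positive k1 bulges the centre outwards (label on a ball);
    negative k1 curves it inwards (label in a bowl)."""
    h, w = img.shape[:2]
    cx, cy = w / 2.0, h / 2.0
    f = max(w, h)  # rough focal length so the radius stays around [0, 1]
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    x = (xs - cx) / f
    y = (ys - cy) / f
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    # Sample the source at radially displaced positions.
    map_x = (x * scale * f + cx).astype(np.float32)
    map_y = (y * scale * f + cy).astype(np.float32)
    return cv2.remap(img, map_x, map_y,
                     interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)
```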

Training

Results on validation data
Since the goal is detecting multiple object classes in real time, the models used are YOLOv3 and YOLOv3-tiny. The training strategy was to start with small images and weak transformations and progress to larger, more heavily transformed images. The small images should only be used to set up the last few layers: images under ~200 px have rather different textures from larger ones, so we want to keep the pretrained layers intact. On the larger images we include the full transformations and train all layers. At this stage it is much more efficient to use a higher learning rate for the later layers. Other tricks such as learning rate warmup and annealing (one-cycle training) were also included.
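
In PyTorch terms (a sketch only; the project may well use a different framework, and `model.backbone` / `model.head` are hypothetical names for the pretrained and new layers), discriminative learning rates combined with a one-cycle schedule could look like this:

```python
import torch

# Hypothetical split: `backbone` holds the pretrained Darknet layers,
# `head` the detection layers trained from scratch on our data.
optimizer = torch.optim.SGD([
    {"params": model.backbone.parameters()},
    {"params": model.head.parameters()},
], lr=1e-3, momentum=0.9)

# One-cycle schedule with a per-group peak: the new head gets a
# 10x higher learning rate than the pretrained backbone.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=[1e-3, 1e-2], total_steps=total_steps)

# ...inside the training loop, call scheduler.step() after each
# optimizer.step() so the warmup/annealing cycle advances per batch.
```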

Results

Results on real-life video frames
These approaches work well for planar objects, and reasonable results have already been reached. For non-planar objects many views were included among the source objects, but accuracy is still quite low.