Hazard label detector with YOLOv3 and automatic dataset generation
8 November 2019
Part of the RoboCup Rescue competition focuses on sensing: one of the tasks is detecting
various objects in video, many of them rather obscure. That meant that to train a CNN,
we needed to create our own dataset. This project is still a work in progress; updates
will be added later.
Since the majority of the obscure objects are planar (hazard labels, signs), the approach
I chose was to generate a dataset by transforming the objects with OpenCV and placing them
on backgrounds from Open Images (where door and person labels were kept).
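The compositing step can be sketched roughly as follows. This is a hypothetical helper, not the project's actual code: it alpha-blends a transformed RGBA object crop onto a background image and returns a YOLO-style normalised bounding box for the label file.

```python
import numpy as np

def paste_object(background, obj_rgba, x, y):
    """Alpha-blend an RGBA object crop onto a BGR background at (x, y).

    Illustrative sketch: the real pipeline would also write the returned
    box (class, centre x/y, width, height, all normalised) to a label file.
    """
    h, w = obj_rgba.shape[:2]
    roi = background[y:y + h, x:x + w].astype(np.float32)
    rgb = obj_rgba[..., :3].astype(np.float32)
    alpha = obj_rgba[..., 3:4].astype(np.float32) / 255.0
    blended = alpha * rgb + (1.0 - alpha) * roi
    background[y:y + h, x:x + w] = blended.astype(np.uint8)
    # YOLO-format box: centre and size normalised to image dimensions
    bh, bw = background.shape[:2]
    return ((x + w / 2) / bw, (y + h / 2) / bh, w / bw, h / bh)
```

The alpha channel lets partially transparent edges (e.g. after rotation or perspective warps) blend smoothly into the background instead of leaving hard black borders.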
Dataset generation
Examples of training data
Starting with planar images (for some categories a list of them, since there is variability),
a range of different transformations is applied. The idea is that realistic images must fall
within the generated distribution; if it also covers extra cases, that is not a problem.
List of transformations used:
Blur (motion, median, gaussian)
Added noise
Scaling in HSV colour space
Cutouts (lines, outside circles)
Random handwritten text overlaid onto the object
Rotation
Perspective
Camera lens distortion
Some of these are obvious, others less so. With the noise and colour transformations,
an important thing to watch is that hazard labels are often very similar and differ mostly
in colour. Text and cutouts represent various obstructions, e.g. viewing the object
through a hole. Camera lens distortion is the most unusual one. I noticed that in real images
the labels aren't always flat: corners can be lifted, or the whole label can be stuck on a pipe
or across an edge. In these cases the network performed very poorly. Lens distortion creates
an effect similar to the label being placed in a bowl or on a ball. Introducing it improved
results significantly, but going forward, other similar transformations are needed
to represent edges and cylinders (pipes).
Training
Results on validation data
Since the goal is detecting multiple object classes in real time, the models used are YOLOv3 and
YOLOv3 Tiny. The training strategy was to start with small images and weak transformations
and progress to larger, more heavily transformed images. The small images should only be used
to set up the last few layers: images under ~200 px have rather different textures
from larger ones, so we want to keep the pretrained layers intact. On larger images we include
the full transformations and train all layers; at this stage it is much more efficient to use
a higher learning rate for the later layers. Other tricks such as learning rate warmup and annealing
(one-cycle training) were also included.
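The warmup-plus-annealing schedule can be written as a single function. This is a generic one-cycle sketch with cosine ramps, not the exact schedule used here; the fractions and divisors are illustrative:

```python
import math

def one_cycle_lr(step, total_steps, lr_max,
                 warmup_frac=0.3, lr_start_div=10, lr_end_div=100):
    """One-cycle schedule: cosine warmup to lr_max, then cosine annealing
    down to a value well below the starting learning rate."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # warmup phase: ramp from lr_max / lr_start_div up to lr_max
        t = step / warmup_steps
        lo = lr_max / lr_start_div
        return lo + (lr_max - lo) * 0.5 * (1 - math.cos(math.pi * t))
    # annealing phase: decay from lr_max down to lr_max / lr_end_div
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    lo = lr_max / lr_end_div
    return lo + (lr_max - lo) * 0.5 * (1 + math.cos(math.pi * t))
```

Per-layer learning rates then amount to scaling this value differently for each parameter group, with larger multipliers on the later layers.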
Results
Results on real life video frames
These approaches are well applicable to planar objects, and reasonable results have already
been reached. For non-planar objects, many views were included among the source objects,
but accuracy is still quite low.