Written by Paballo Moeletsi
We built an Incline employee recognizer!
Image recognition and detection have changed many industries and have the potential to disrupt many more. Facial recognition in particular has made great strides recently. In this article we discuss a simple facial recognition algorithm that we implemented to classify various Incline employees.
Image Classification vs Object Detection
There is an important distinction between image classification and object detection. Image classification is primarily concerned with labelling the dominant object in an image, although multiple objects can also be recognised. This is achieved by collecting a large dataset of images that contain the relevant objects. These images are then used to train a neural network so that it learns what the relevant objects look like.
The current state of the art in computer vision is a set of techniques known as deep learning. For computer vision specifically, convolutional neural networks are the ubiquitous approach for state-of-the-art results. The network's weights are estimated from the training data using a method known as stochastic gradient descent (SGD) or one of its variants.
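To make the idea of estimating weights with SGD concrete, here is a minimal sketch on a toy problem. It is not the training code used for the employee recognizer: the one-variable linear model, the function name `sgd_linear_fit`, the learning rate and the data are all assumptions chosen for illustration. A real convnet applies the same update rule to millions of weights.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w * x + b with stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        # "Stochastic": visit the training examples in a random order
        # and update the weights after every single example.
        rng.shuffle(indices)
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b


# Hypothetical toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

The fitted values should end up close to the slope and intercept that generated the data.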
Object detection is a more sophisticated approach where multiple objects in a single image can be used as training data. Object detection is usually associated with bounding boxes, which locate where in the image the relevant objects are. Object detection algorithms work in much the same way as classification algorithms, except that instead of predicting only the label of an object, they also predict where in the image the object is.
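A common way to score how well a predicted bounding box matches a labelled one is intersection over union (IoU). The article does not state which metric was used for the Incline model, so the sketch below is only an illustration; the `(x_min, y_min, x_max, y_max)` box format and the function name `iou` are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical labelled box and predicted box.
ground_truth = (0, 0, 2, 2)
prediction = (1, 1, 3, 3)
print(iou(ground_truth, prediction))
```

An IoU of 1 means the two boxes coincide exactly, and 0 means they do not overlap at all.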
Convolutional neural nets for image classification
A convolutional net (convnet) consists of multiple layers of image processing. An image to be classified is fed into the first layer and passed through the subsequent layers, each of which applies a linear map followed by a non-linear activation function. The last layer outputs, for each training class, the probability that the class is in the image.
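As a rough illustration of one such layer, the sketch below applies a linear map (a small convolution) followed by a non-linear activation (ReLU) to a tiny image. The 4x4 image and the hand-picked edge kernel are assumptions for the example; in a trained convnet the kernel values are learned from data, and ReLU is only one common choice of activation.

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution (cross-correlation) on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Non-linear activation: negative responses are set to zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


# Hypothetical image: dark on the left half, bright on the right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 1], [-1, 1]]

features = relu(conv2d(image, vertical_edge))
for row in features:
    print(row)
```

The middle column of the output is the only non-zero one, because that is where the kernel sits on the edge between the dark and bright halves.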
Creating an Incline employee detector
We decided to develop an in-house algorithm to detect and classify Incline employees.
In order to train an image detector we need to collect data, referred to as the training data. First, videos of various Incline employees were recorded across different backgrounds, and images were then extracted from these videos. Lastly, each image was given a label describing who was in the image and four co-ordinates specifying where in the image each person was. Some example images from the process are shown below.
Once the training data has been collected, the next step is training the model. Model inference for neural networks involves a fundamental trade-off between latency and accuracy. In the context of image recognition, accuracy is the ability of a model to assign the ground truth class a probability (confidence score) high enough that a decision rule predicts that class, and to accurately predict where in the image the object or person is.
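One way to picture a single training record is as a small data structure holding the label and the four co-ordinates. The sketch below is an assumption about the layout, not the annotation format actually used: the corner-based `(x_min, y_min, x_max, y_max)` convention, the class name `Annotation`, the file path and the label are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One labelled person in one extracted video frame."""
    image_path: str
    label: str   # who is in the box
    x_min: int   # left edge, in pixels
    y_min: int   # top edge, in pixels
    x_max: int   # right edge, in pixels
    y_max: int   # bottom edge, in pixels

    def is_valid(self, image_width, image_height):
        """Check the box has positive area and lies inside the image."""
        return (
            0 <= self.x_min < self.x_max <= image_width
            and 0 <= self.y_min < self.y_max <= image_height
        )


# Hypothetical record for a 640x480 frame.
example = Annotation("frames/clip01_0000.jpg", "employee_a", 120, 40, 360, 400)
print(example.is_valid(640, 480))
```

A sanity check like `is_valid` is cheap to run over the whole dataset before training and catches boxes that were drawn outside the frame or with swapped corners.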
Latency refers to the time taken for a model to perform a single inference on an image. High-accuracy models generally require more parameters to be estimated, and hence more calculations, which tends to increase latency. For video processing, latency considerations are quite important, especially when inference is performed on devices with limited computational resources.
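Latency in this sense can be estimated by timing repeated single-image inferences. The sketch below shows one way to do that with the Python standard library; `measure_latency` and `fake_infer` are hypothetical names, and the stand-in function would be replaced by a real model's prediction call.

```python
import statistics
import time


def measure_latency(infer, inputs, warmup=2):
    """Return the median time in seconds for one call to infer."""
    # Warm-up calls are not timed, since first calls are often slower.
    for item in inputs[:warmup]:
        infer(item)

    timings = []
    for item in inputs:
        start = time.perf_counter()
        infer(item)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def fake_infer(image):
    """Stand-in for a model: does some arithmetic instead of real inference."""
    return sum(pixel * pixel for pixel in image)


# Hypothetical "images": ten flat lists of pixel values.
images = [list(range(10_000)) for _ in range(10)]
latency = measure_latency(fake_infer, images)
print(f"median latency: {latency * 1000:.3f} ms")
```

The median is used rather than the mean so that an occasional slow call does not dominate the estimate.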
To this end, an architecture referred to as MobileNet was used for feature extraction. MobileNet is designed for computationally constrained devices and offers a good trade-off between latency and accuracy. The extracted feature representations are then passed to a fully connected layer followed by the softmax function, which returns, for each label, a number between 0 and 1 describing the degree of confidence the convnet assigns to that label for the image. These numbers sum to 1 across the labels. This is the final prediction.
Lastly, once the model is trained to adequate accuracy, it should be tested on unseen test data.
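The final step, a fully connected layer followed by softmax, can be sketched in a few lines. The feature vector, weights, biases and label names below are made up for illustration; in the real model the features come from MobileNet and the weights are learned during training.

```python
import math


def dense(features, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output label."""
    return [
        sum(w * f for w, f in zip(row, features)) + bias
        for row, bias in zip(weights, biases)
    ]


def softmax(logits):
    """Turn raw scores into confidences between 0 and 1 that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values: a 3-number feature vector and two labels.
labels = ["employee_a", "employee_b"]
features = [0.5, 1.0, -0.5]
weights = [[1.0, 2.0, 0.0], [0.0, -1.0, 1.0]]
biases = [0.0, 0.5]

probs = softmax(dense(features, weights, biases))
predicted = labels[probs.index(max(probs))]
print(dict(zip(labels, probs)), predicted)
```

Taking the label with the highest confidence, as done here, is the simplest decision rule; a confidence threshold could be added so that low-confidence frames produce no prediction.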
Testing the algorithm
The classifier was run on several test videos. Here are some sample results.
The above are example frames of predictions from video files being played, where every 70th frame was sampled and processed by the MobileNet architecture for a prediction. This allowed for almost real-time prediction.
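Sampling every 70th frame can be expressed as a small generator. This is a sketch of the idea rather than the code used for the tests: the name `sample_every_nth` is hypothetical, it assumes sampling starts at the first frame, and a plain range of numbers stands in for frames decoded from a video file.

```python
def sample_every_nth(frames, n=70):
    """Yield (index, frame) for every n-th frame, starting with frame 0."""
    for index, frame in enumerate(frames):
        if index % n == 0:
            yield index, frame


# Stand-in for a decoded video: 200 "frames" represented by their numbers.
video = range(200)
sampled = [index for index, _ in sample_every_nth(video, n=70)]
print(sampled)
```

Only the sampled frames are passed to the model, so the larger the stride, the less inference work there is per second of video.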
We decided to implement a more sophisticated object detection model to detect where in the image a person was.
In order to achieve this, we implemented a ResNet50 backbone for feature extraction with a Single Shot Detector (SSD) head for localisation and classification in the image. The advantage of object detection is that multiple objects at different scales and sizes can be identified, making it more robust to changes in size, appearance and, most importantly, the background.
Above is an example of an object detection algorithm known as the R-CNN. The algorithm extracts various regions of the image and, for each region, classifies the dominant object in that region.
SSD works in a similar way but makes use of “anchor boxes” at different aspect ratios, sizes and locations in the image. Anchor boxes are analogous to the region proposals of the R-CNN; however, they differ significantly in the number and shape of the regions proposed.
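To illustrate what a set of anchor boxes looks like, the sketch below places boxes of several sizes and aspect ratios at the centre of every cell of a regular grid. It is a simplified illustration, not the configuration of the model described here: the single 4x4 grid, the scales and the aspect ratios are assumptions, and a real SSD uses several grids at different resolutions.

```python
import math


def generate_anchors(grid_size, scales, aspect_ratios):
    """Return anchors as (cx, cy, width, height) in normalised [0, 1] units."""
    anchors = []
    for row in range(grid_size):
        for col in range(grid_size):
            # Each anchor is centred on a grid cell.
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            for scale in scales:
                for ratio in aspect_ratios:
                    # ratio = width / height; the area stays scale**2.
                    width = scale * math.sqrt(ratio)
                    height = scale / math.sqrt(ratio)
                    anchors.append((cx, cy, width, height))
    return anchors


# Hypothetical settings: 4x4 grid, two sizes, three shapes per size.
anchors = generate_anchors(grid_size=4, scales=[0.2, 0.4], aspect_ratios=[1.0, 2.0, 0.5])
print(len(anchors), anchors[0])
```

Because the anchors are fixed in advance, the detection head only has to predict, for each anchor, a class and a small correction to the box, rather than proposing regions from scratch.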
Below are some results of the model.
As can be seen, the model is able to accurately classify who is in the image and to draw accurate bounding boxes around the relevant person. The model is also able to draw multiple bounding boxes around multiple people in one image.
What we have shown are some of the simpler implementations of facial detection. In future articles we will cover more modern approaches such as Siamese networks trained with triplet loss or the CosFace network.
Computer vision offers business a myriad of valuable applications. Some exciting practical applications include:
- Detection of defects in manufacturing processes and products
- Person verification
- Employee monitoring
- Automated physical document extraction and checking
- Estimating customer demand and customer waiting times in store
The essential power computer vision affords to business is the ability to combine unstructured image data with powerful algorithms and data analytics to provide insights where doing so was previously financially or operationally impractical.
Follow Incline’s LinkedIn and Facebook accounts to keep up with articles and industry-related news.
For more information contact firstname.lastname@example.org
Review of Deep Learning Algorithms for Object Detection by Arthur Ouaknine. https://medium.com/comet-app/review-of-deep-learning-algorithms-for-object-detection-c1f3d437b852
A Beginner’s Guide to Understanding Convolutional Neural Networks by Adit Deshpande. https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
R-CNN, Fast R-CNN, Faster R-CNN, YOLO-Object Detection Algorithms by Rohith Gandhi. https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e