Fine-Tuning a Detectron2 Model to Detect Faces Wearing a Covid Mask in Videos.
In this post, I will detail how I trained a masked face detector using the Detectron2 framework. Then, I will demonstrate how to use this trained model to perform live detections on videos.
This article is not a tutorial on how to use Detectron2 for object detection. Here, I want to share some feedback and lessons I learned while building this project with my team at our company.
You can find detailed tutorials on how to use Detectron2 here.
In the following paragraphs, we will walk through how I managed to:
- Build a custom learning base
- Fine-tune an object detection model with Detectron2
- Evaluate the resulting face detector on “real-world” data
Finally, the trained model is a component of an AI-based application that could be used to help prevent the spread of Covid-19. This solution is presented in detail in a previous article that you can find here.
Building the Masked Face Detector
In order to train a model that allows detecting masked faces from videos, I first decided to use the Face Mask Detection dataset from Kaggle.
This dataset provides 853 images belonging to 3 classes, which show the following types of faces: wearing a Covid mask, without a mask, or wearing a mask incorrectly (the mask does not totally cover the mouth and the nose). The corresponding annotation files are also given.
Initially, I decided to tackle this problem as a 3-class object detection one, where each object corresponds to a type of face: wearing a mask, without a mask, or wearing a mask incorrectly. However, the highly unbalanced nature of this dataset made it hard for me to obtain satisfying results right away. Indeed, this dataset is composed of 4068 instances appearing in the 853 images, distributed as follows:
- 3228 faces wearing a mask (79% of the instances), among which 2393 were used to train the model
- 717 faces not wearing a mask (18% of the instances), among which 534 were in the training set
- 123 faces wearing a mask incorrectly (3% of the instances), among which 94 were in the training set
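A distribution like this can be checked directly from the Pascal VOC annotation files. Here is a minimal sketch using only the standard library; the class names shown in the test (e.g. `with_mask`) are assumptions about the dataset's labels, so adjust them to whatever your annotation files actually contain:

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

def count_instances(annotation_dir: str) -> Counter:
    """Count object instances per class across Pascal VOC XML annotation files."""
    counts = Counter()
    for xml_file in Path(annotation_dir).glob("*.xml"):
        root = ET.parse(xml_file).getroot()
        for obj in root.iter("object"):
            counts[obj.findtext("name")] += 1
    return counts
```

Printing `counts` (or the per-class percentages) immediately reveals how skewed the dataset is before any training is launched.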
I chose to use the Detectron2 framework to train a masked face detection model, and I retained the Faster R-CNN X101-FPN model, which gave the best results on this dataset.
However, the overall performance of the model was rather mixed: only about 51 mAP. Moreover, the performance was particularly low on the minority class: only 40 mAP for the category corresponding to faces wearing masks incorrectly. The detailed results of the detections are shown below:
This post will help you understand further what the Mean Average Precision (mAP) evaluation metric is.
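For context, fine-tuning this model with Detectron2 boils down to a short configuration script. The sketch below is illustrative, not my exact settings: `mask_train` is a hypothetical registered dataset name, and the solver values are placeholder hyperparameters.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("mask_train",)  # hypothetical registered dataset name
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3   # with mask / without mask / incorrectly worn
cfg.SOLVER.IMS_PER_BATCH = 2          # illustrative values, tune for your setup
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 3000

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```

The dataset must be registered beforehand (e.g. with `DatasetCatalog.register`) so that the name in `cfg.DATASETS.TRAIN` resolves to the annotated images.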
Creating a Custom Learning Base
There are many ways to overcome the difficulty of dealing with an unbalanced dataset in object detection. We can cite the following ideas, for example:
- Use a custom loss function, e.g. the Focal Loss (see this paper for more details).
- Balance your dataset by adding new instances of the minority class(es).
This review deals with all the types of imbalance problems in object detection and provides some solutions available in the literature to tackle them.
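To make the first idea concrete, here is a minimal sketch of the binary focal loss from the paper referenced above, written for a single prediction. It down-weights easy, well-classified examples so that training focuses on hard (often minority-class) ones:

```python
import math

def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class; y: true label (0 or 1).
    With gamma=0 and alpha=1 this reduces to the standard cross-entropy loss.
    """
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

A well-classified example (p_t close to 1) contributes almost nothing, while a badly-classified one keeps nearly its full cross-entropy weight.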
Here, I decided to follow the last idea and enrich the dataset. To achieve this, I chose to:
- Collect a few videos on YouTube showing people wearing masks in the street;
- Extract some frames (images) from them;
- Create annotations files corresponding to these new images.
To create the annotation files, I used LabelImg, a graphical image annotation tool that allows you to draw boxes around desired objects in an image, label them, and then save the created annotations in PASCAL VOC (XML), YOLO, or CreateML format.
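The frame-extraction step can be scripted in a few lines with OpenCV. This is a sketch under assumed paths; sampling every n-th frame avoids flooding the dataset with near-duplicate images:

```python
import cv2  # OpenCV, assumed available for video decoding

def extract_frames(video_path: str, out_dir: str, every_n: int = 30) -> int:
    """Save every n-th frame of a video as a JPEG; return the number written."""
    cap = cv2.VideoCapture(video_path)
    idx = written = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream or unreadable file
            break
        if idx % every_n == 0:
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}.jpg", frame)
            written += 1
        idx += 1
    cap.release()
    return written
```

At 30 fps, `every_n=30` keeps roughly one frame per second of video, which proved a reasonable trade-off between diversity and annotation effort.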
In addition, I decided to reduce the problem to a 2-class object detection one. In fact, the requirement for this solution is to be able to detect faces wearing a mask from videos. There is no need to differentiate faces without masks from those wearing a mask incorrectly. Plus, from a healthcare point of view, both are problematic. So, I decided to merge these two classes into a unique class and labeled all the corresponding instances as not wearing a mask.
This strategy also helped me to further balance the resulting dataset, which is finally composed of 1584 images with:
- 3846 faces wearing a mask (52% of the instances), among which 2720 were used to train the model
- 3574 faces not wearing a mask (48% of the instances), among which 3574 were in the training set
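The class merge described above can also be automated over the annotation files. A minimal stdlib sketch follows; the label names in `MERGE_MAP` are assumptions about how the dataset names its classes, so check your own XML files before running anything like this:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical label names; adjust to match your annotation files.
MERGE_MAP = {"mask_weared_incorrect": "without_mask"}

def merge_classes(annotation_dir: str) -> int:
    """Relabel merged classes in Pascal VOC XML files in place; return #changes."""
    changed = 0
    for xml_file in Path(annotation_dir).glob("*.xml"):
        tree = ET.parse(xml_file)
        for name in tree.getroot().iter("name"):
            if name.text in MERGE_MAP:
                name.text = MERGE_MAP[name.text]
                changed += 1
        tree.write(xml_file)  # overwrite the annotation file in place
    return changed
```

Work on a copy of the annotation directory: the script rewrites the files in place, so the original 3-class labels are lost otherwise.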
Fine-Tuning and Evaluating the Faster R-CNN Model
I fine-tuned the Faster R-CNN X101-FPN model on the new enriched dataset; the performance on the new 2-class dataset was still 51 mAP.
This is higher than the 43 mAP obtained with the model pre-trained on the COCO benchmark, but still well below the results claimed in some similar posts available on the internet. Indeed, some articles present analyses with detection accuracies of 75 mAP or higher.
This displayed performance level could suggest a model that is not performing that great. But we will see that this is not exactly the case.
Validating the Model in Real-Life Conditions
In this part, I will explain how I analyzed more closely the predictions of the fine-tuned model in real-life conditions, and why I chose to retain it in the final solution.
As mentioned above, some posts that I found on the internet were claiming performance levels significantly higher than the one achieved by my masked face detector.
So, I was very worried about the success of the final application. I then decided to collect a few more YouTube videos showing faces wearing a mask to test the model, as this matches the conditions in which the solution will be used. Indeed, this solution aims at detecting masked faces in camera videos on the fly.
I then observed that the model performed remarkably well on these test videos. It was able to clearly identify faces wearing a mask as well as faces without a mask, including those partially masked (wearing a mask incorrectly).
The following two videos were taken randomly from YouTube and used to test the fine-tuned masked face detector.
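Running the detector over a video frame by frame follows the standard Detectron2 inference pattern. The sketch below makes several assumptions: `output/model_final.pth` stands in for the fine-tuned weights, `test_video.mp4` is a hypothetical test clip, and the score threshold is illustrative.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from detectron2.utils.visualizer import Visualizer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = "output/model_final.pth"  # assumed path to fine-tuned weights
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2           # with mask / without mask
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # illustrative confidence cutoff

predictor = DefaultPredictor(cfg)
cap = cv2.VideoCapture("test_video.mp4")      # hypothetical test video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    outputs = predictor(frame)                # per-frame detections
    vis = Visualizer(frame[:, :, ::-1])       # BGR -> RGB for visualization
    out = vis.draw_instance_predictions(outputs["instances"].to("cpu"))
    cv2.imshow("detections", out.get_image()[:, :, ::-1])
    if cv2.waitKey(1) == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```

Swapping `cv2.imshow` for a `cv2.VideoWriter` produces annotated output videos like the ones shown here.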
We can clearly see the good quality of the live detections in these videos, and conclude that the model performs very well in real-life conditions. Indeed, in the different environments above, the masked faces are perfectly identified, as are non-masked faces.
Finally, when I looked closely at the calculated evaluation metrics, I figured out that two of them reflected well the quality of the predictions I observed. In fact, the model shows good levels of performance on large and medium-size objects: 65 mAP and 57 mAP respectively. This clearly meets the expectations in terms of the ability to detect masked faces, based on what we observe in the two videos.
Conversely, the model performed lower on small objects, with only 48 mAP. This is illustrated by some missed detections in the two preceding videos (low recall). But in reality, we should not care about these missed detections, as they are not critical for our use case.
In fact, when these small objects get closer to the camera, at a reasonable distance, they become bigger and are well detected. So, these small objects do not need to be detected while they are still far from the camera, and they do not represent a pain point for the use case, even though they are responsible for the medium-level overall performance displayed by the model. They are just giving an under-estimated appreciation of the real accuracy of the model.
Furthermore, this also emphasizes a mismatch between the labeling of the available dataset and the requirement of the use case: small objects are labeled, but in reality, they do not need to be.
Finally, in real-life conditions, I do not expect the model to detect faces wearing a mask 20 meters or more from the camera. Imagine if you had to monitor, with cameras, the wearing of Covid masks in a critical area: you would not expect your detection system to identify faces with or without masks if they are too far from the camera. Only the faces within a given range of interest matter. Consequently, the current model is sufficient for our use case.
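One pragmatic way to align the evaluation with this range-of-interest idea is to simply ignore boxes below a minimum area. A minimal sketch, where the threshold mirrors COCO's 32×32-pixel "small object" cutoff but should really be tuned to your camera setup:

```python
def filter_small_boxes(boxes, min_area=32 * 32):
    """Keep only boxes (x1, y1, x2, y2) whose pixel area is at least min_area.

    32x32 matches COCO's "small object" cutoff; in practice the threshold
    should reflect the distance range that matters for the deployed camera.
    """
    return [b for b in boxes if (b[2] - b[0]) * (b[3] - b[1]) >= min_area]
```

Applying such a filter to both predictions and ground-truth labels before scoring would give a metric that better reflects the use case described above.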
Here, we have talked about:
- How to enrich an object detection dataset;
- Reducing the imbalance of the classes by merging some of them;
- Fine-tuning a Faster R-CNN object detection model using Detectron2;
- Testing and validating the trained model on real-life data.
These are the main takeaways from this analysis:
- Do not hesitate to collect more data to solve an object detection problem with unbalanced classes;
- Do not focus too much on obtaining results as good as some that you can find in the literature, including some posts dealing with problems similar to yours;
- Analyze in detail the displayed performance of a model; do not rely only on the evaluation metrics;
- Perform error analysis to determine if the missed detections are really critical for your use case;
- Depending on your use case, an overall performance of just 50 mAP could be sufficient to give you very good results;
- Visualize the results obtained in the real conditions of use of the model to assess its actual performance.
In our case, I clearly saw that my masked face detector was giving very satisfying results in real-life conditions even though the calculated evaluation metrics were only moderately good. This demonstrates that in some situations, a medium-level overall accuracy (mAP) can be sufficient to perfectly fulfill the requirements of your use case.
The code associated with this project is available on GitHub here.
Hope you enjoyed this post. Feel free to give me your feedback. 😉