The document presents a study on a real-time head pose estimation and facial keypoints prediction system, named 'DeepFaceMask', designed to detect masked and unmasked faces using deep learning techniques. It discusses the challenges of mask compliance during the COVID-19 pandemic and outlines the architecture of the proposed system, which utilizes a Deep Convolutional Neural Network (DCNN) and MobileNet for efficiency. The research emphasizes the importance of accurate mask detection and facial keypoint identification to aid public health efforts and improve safety in public spaces.

International Journal of Scientific & Engineering Research Volume 12, Issue 5, May-2021

ISSN 2229-5518
1027

Realtime Head Pose Estimation with Facial Keypoints Prediction on Masked/Unmasked Faces

Pranjal Dubey, Sahaj Kapoor, Saurav Arora
School of Computer Science and Engineering, Galgotias University, Greater Noida, Uttar Pradesh
[email protected], [email protected], [email protected]

Abstract: Numerous governing authorities and organizations allow people to use their services only if they wear masks that properly cover both the nose and mouth, in line with the guidelines of the World Health Organization (WHO). Manually screening and identifying individuals who follow or do not follow this policy is an enormous task in public places. With these challenges in mind, the ideal approach is to use advances in Artificial Intelligence and Deep Learning to make this task simple, easy to use, and automated. In this paper, we propose "DeepFaceMask", a high-precision and efficient face mask classifier. The presented DeepFaceMask is a one-stage detector built on a Deep Convolutional Neural Network (DCNN) that combines high-level semantic information with multiple feature maps. In addition, we explore the possibility of implementing DeepFaceMask with the lightweight neural network MobileNet for mobile phones. MTCNN exploits the inherent correlation between detection and alignment to boost the performance of both. Specifically, our framework uses a cascaded architecture with three stages of carefully designed DCNNs to predict the face and its keypoints (landmarks) in a coarse-to-fine manner. [1]

I. INTRODUCTION

To effectively stop the spread of the COVID-19 pandemic, everyone is required to wear a mask in public places. This makes conventional facial recognition techniques nearly ineffective in settings such as public access control, face access control, and facial security checks at train stations. The science around the use of masks by the general public to prevent COVID-19 transmission is advancing quickly. Policymakers need guidance on how masks should be used by everybody to combat the COVID-19 pandemic. Furthermore, masks should be worn properly so that they cover the nose and mouth completely, which is frequently not done. Consequently, it is urgent to improve the recognition capabilities of current face and mask recognition technology. Face mask detection refers to determining whether an individual is wearing a mask and how much of the face is covered, which we identify by including the facial keypoints too. [3] The problem is closely related to general object detection, which distinguishes classes of objects (here we deal primarily with three classes: wearing a mask correctly, wearing a mask incorrectly, and not wearing a mask), and to face recognition, which recognizes a specific class of objects, namely faces. Applications of object and face recognition can be found in numerous areas, for example self-driving vehicles, education, and surveillance. Traditional object detectors are based on handcrafted feature extractors. [4]

II. PROBLEM STATEMENT

The objective of this project is to train object detection models capable of identifying facial keypoints for face recognition and attention detection, along with the location of masked and unmasked faces in static images as well as moving (video) images. The detection technique should be robust to the occlusion present in the images for better predictive performance. Preferably, the models should also be fast enough to work well in real-world applications, which we would like to focus on in our future implementations.

III. VISION

This project was created with the vision of building a "Real-Time Mask Detection System" accessible for public use, to help public health officials and small to large institutions all over the world effectively fight the COVID-19 pandemic. We hope that the models created here by the small AI/ML research community enable developers around the world to use and deploy them to build systems capable of withstanding the demands of a real-time, real-world use case. Specifically, such a system would help factories ensure that mask compliance is followed and help ensure safety for visitors in control zones or public places where such measures are essential, and so on. The applications are endless and urgently needed at this crucial time. [2]

IV. DATASETS

The dataset we primarily use is the MaskPascalVOC zip file taken from the website: https://makeml.app/datasets/mask. The dataset contains 853 images of the following classes: With mask, Without mask, and Mask weared incorrect. It is labeled with bounding box annotations for object detection. However, the number of images belonging to the class of masks worn incorrectly is much smaller than for the other two classes, which created a class imbalance. We therefore collected data from additional sources, using the class name "none" for people not wearing masks correctly, which we combined separately and uploaded on the web. Our combined dataset thus has the following four labels: with_mask, without_mask, mask_weared_incorrect and none. These are finally grouped into three classes. The first has the label “with_mask”, which we later mark with a green bounding box on the face and the text label “Correctly Masked”. The second has the label “without_mask”, which we later mark with a red bounding box on the face and the text label “Unmasked”. The third has either of the two labels “mask_weared_incorrect” or “none”, which we later mark with a blue bounding box on the face and the text label “Incorrectly Masked”.

Additionally, we also predict the main facial keypoints inside the bounding box while detecting the face. The dataset used to train this model is taken from the website: https://ibug.doc.ic.ac.uk/download/annotations/xm2vts.zip/
The data is in the format of a CSV (Comma Separated Values) file containing sixty-eight keypoints per image, each given as x, y coordinates. This data is fed into a deep CNN (ConvNet) model whose final layer has a 68*2=136 dimensional output predicting the X and Y coordinates of those sixty-eight keypoints. The Smooth L1 and MSE (Mean Squared Error) loss metrics gave the best accuracy; we chose the Smooth L1 loss for our final model as it performed comparatively better in real time. [5]

V. RELATED WORK

A. OBJECT DETECTION

The face detection technique used here is MTCNN (Multi-task Cascaded Convolutional Networks). Human face detection and alignment in unconstrained environments are challenging. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we have used a deep cascaded multi-task framework which exploits the inherent correlation between detection and alignment to boost their performance. Specifically, this framework uses a cascaded architecture with three stages of carefully designed Deep Convolutional Neural Networks to predict face and landmark locations in a coarse-to-fine manner. In addition, it proposes a new online hard example mining strategy that further improves performance in practice.
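The online hard example mining idea mentioned above can be illustrated with a small sketch: within each mini-batch, only the samples with the largest loss contribute to the update. The function name and the 70% keep ratio are assumptions for illustration (the ratio follows what the MTCNN authors report, not anything stated in this paper), and the per-sample losses are made-up numbers.

```python
def hard_example_loss(losses, keep_ratio=0.7):
    """Average only the hardest (largest-loss) fraction of a mini-batch.

    keep_ratio is an assumed hyperparameter; easy samples below the cut
    are ignored, so they would contribute no gradient.
    """
    if not losses:
        return 0.0
    n_keep = max(1, round(len(losses) * keep_ratio))
    hardest = sorted(losses, reverse=True)[:n_keep]
    return sum(hardest) / n_keep


# Hypothetical per-sample losses for one mini-batch of ten faces.
batch_losses = [0.05, 0.90, 0.10, 1.50, 0.02, 0.40, 0.70, 0.01, 0.30, 0.20]
print(hard_example_loss(batch_losses))
```

In a real training loop the same selection would be applied to the per-sample loss tensor before averaging and backpropagation.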

Real-time analysis of MTCNN (its workflow, which is followed by all three sequential models, the P, R and O models respectively):

Face and keypoint detection: MTCNN is a technique comprising three stages, which can predict basic facial keypoints and perform basic face alignment. To avoid detection errors such as duplicate overlapping boxes, it uses a technique called non-maximum suppression (NMS). [6][7]
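Non-maximum suppression can be sketched in a few lines: candidate boxes are visited in order of decreasing score, and a box is kept only if it does not overlap too strongly with a box that was already kept. The box format (x1, y1, x2, y2), the function names, and the IoU threshold of 0.5 are assumptions for illustration, not values taken from this paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it overlaps no already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one face plus a separate face.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

The second box is dropped because it overlaps the higher-scoring first box with an IoU of about 0.82.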

The MTCNN framework/architecture uses three separate networks:

● “P” – Network
● “R” – Network
● “O” – Network

• Structure of P-Net:
P-Net predicts bounding boxes by sliding a 12*12 kernel/filter across the image.
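Conceptually, the 12*12 sliding kernel of P-Net can be pictured as enumerating every 12*12 window of the image. The sketch below does exactly that in plain Python; the stride of 2 and the function name are assumptions for illustration, and the real network evaluates all windows at once with convolutions rather than with an explicit loop.

```python
def sliding_windows(width, height, size=12, stride=2):
    """Yield (x1, y1, x2, y2) for every size x size window inside the image."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, x + size, y + size)


# A 16x16 image admits 3 horizontal and 3 vertical window positions.
windows = list(sliding_windows(16, 16))
print(len(windows), windows[0], windows[-1])
```

Each window would receive a face/non-face score and a bounding box refinement; windows from several rescaled copies of the image let the fixed 12*12 kernel find faces of different sizes.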


• Structure of R-Net:
R-Net has a similar structure but uses more layers, thus predicting more accurate bounding box coordinates.

• Structure of O-Net:
O-Net takes the output of R-Net and predicts three sets of data, namely the probability of a face being in the box, the bounding box, and the facial keypoints. [8][9]

B. IMAGE CLASSIFICATION

Image classification refers to extracting specific desired features from a static or real-time image and classifying it to solve the specific problem at hand. This objective was accomplished using a transfer learning approach. The pre-trained ResNet-50 model was used as a feature extractor, connected to a custom fully connected layer for robust and efficient image classification. The model was trained on a dataset consisting of three classes: masked, not masked, and not properly masked. The problem with the dataset was that it did not contain the same number of images for each class, i.e. the data was imbalanced, so the model was trained on two datasets combined. To achieve more robust results, custom image augmentation techniques were applied during the training process. Only the last convolutional layers of ResNet-50 were trained as the feature extractor; all the rest were frozen during training. Fine-tuning the model in this way gave much better results than traditional state-of-the-art architectures. It also helped tackle the vanishing gradient problem by leveraging skip connections, and the strong, robust feature extractor proved efficient enough to extract features from a relatively small dataset. The ResNet-50 layers were connected to linear layers before the end-to-end result prediction.
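The role of the custom linear head can be sketched as follows: a feature vector produced by the frozen extractor is mapped to three class scores, and a softmax turns the scores into probabilities. This is a toy illustration only; the 4-dimensional feature vector, the weights, and the helper names are hypothetical, and the actual ResNet-50 features are much larger.

```python
import math

CLASSES = ["masked", "not masked", "not properly masked"]


def linear(features, weights, biases):
    """One fully connected layer: a weighted sum per output class."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases):
    probs = softmax(linear(features, weights, biases))
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs


# Hypothetical extractor output and head parameters.
features = [0.2, 1.5, -0.3, 0.8]
weights = [[0.5, 1.0, 0.0, 0.2],
           [-0.4, -0.8, 0.3, 0.1],
           [0.1, 0.2, 0.6, -0.5]]
biases = [0.0, 0.1, -0.1]
label, probs = classify(features, weights, biases)
print(label)
```

During fine-tuning only the head and the unfrozen layers would receive gradient updates; the frozen layers act as a fixed feature extractor.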


C. HEAD POSE ESTIMATION

The pose of an object refers to its relative orientation and position with respect to a camera. We can change the pose by either moving the object with respect to the camera, or the camera with respect to the object. [10]

The pose estimation problem described in this paper is often referred to as the Perspective-n-Point problem, or PnP. In this problem the goal is to determine the orientation or pose of an object with respect to the camera, given the coordinates of n 3D points on the object and the corresponding 2D projections in the image. [11]
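PnP is the inverse of a simple forward model: given a pose (rotation and translation), each 3D point on the object projects to one 2D point in the image. The sketch below implements that forward model for an idealized pinhole camera. The focal length, image center, 3D point values, and function names are assumptions chosen for illustration, and lens distortion is ignored.

```python
import math


def rotation_y(angle_deg):
    """Rotation matrix for a turn of angle_deg about the vertical (y) axis."""
    a = math.radians(angle_deg)
    return [[math.cos(a), 0.0, math.sin(a)],
            [0.0, 1.0, 0.0],
            [-math.sin(a), 0.0, math.cos(a)]]


def project(point3d, rotation, translation, focal, center):
    """Project a 3D object point to 2D pixel coordinates (pinhole model)."""
    # Camera coordinates: X_cam = R * X_obj + t
    xc, yc, zc = (sum(r * p for r, p in zip(row, point3d)) + t
                  for row, t in zip(rotation, translation))
    # Perspective division followed by the shift to the image center.
    return (focal * xc / zc + center[0], focal * yc / zc + center[1])


focal, center = 800.0, (320.0, 240.0)   # assumed camera intrinsics
t = (0.0, 0.0, 1000.0)                  # object placed 1000 units in front
uv_front = project((0.0, 0.0, 0.0), rotation_y(0.0), t, focal, center)
uv_side = project((100.0, 0.0, 0.0), rotation_y(30.0), t, focal, center)
print(uv_front, uv_side)
```

A PnP solver searches for the rotation and translation under which the projections of the known 3D points best match the detected 2D keypoints.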

Motions performed by a three-dimensional rigid object:
1. Translation: the object moves relative to the camera along the x, y, or z axis, which shifts its position in the image.
2. Rotation: the object turns about a pivot point, i.e. about the x, y, or z axis.
So, estimating the pose of a 3D object means finding six numbers: three for translation and three for rotation. To calculate the 3D pose of an object in an image you need the following information: [12][13]
1. 2D coordinates of a few points: You need the 2D (x, y) locations of a few points in the image. In the case of a face, you could pick the corners of the eyes, the tip of the nose, the corners of the mouth, and so on.

2. 3D locations of the same points: We need the 3D coordinates of the 2D feature points. The primary 3D coordinates refer to: nose tip, chin, right corner of the mouth, left corner of the mouth, left eye, right eye.

OpenCV solvePnP

The functions solvePnP and solvePnPRansac can be used to estimate pose. [16] solvePnP implements several algorithms for pose estimation, which can be selected using the flags parameter. By default it uses the flag SOLVEPNP_ITERATIVE, which is essentially the DLT (Direct Linear Transform) solution followed by Levenberg-Marquardt (LM) optimization. The SOLVEPNP_P3P flag uses only three points to calculate the pose, and it should be used only together with solvePnPRansac.

Recently, the DNN community started experimenting with deeper networks because they were able to achieve high accuracy values. However, in very deep networks the initial layers do not learn effectively. As a result, deep network training does not converge, and accuracy either starts to degrade or saturates at a specific value. Although the vanishing gradient problem was addressed using normalized initialization of weights, the accuracy of deeper networks was still not increasing. A deep residual network is otherwise similar to networks that have convolution, pooling, activation and fully connected layers stacked one over the other. Skip connections are used by ResNet-50. [14][15]
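Why skip connections help against vanishing gradients in ResNet can be seen in a scalar toy model. If each layer computes y = f(x), the gradient through many layers is a product of the local derivatives f'(x), which shrinks quickly when they are small. With a skip connection, y = x + f(x), every factor becomes 1 + f'(x). The per-layer derivative of 0.1 and the depth of 20 below are arbitrary assumptions for illustration.

```python
def chain_gradient(local_grads, skip=False):
    """Gradient through a stack of layers, given each layer's derivative.

    With skip=True each layer is y = x + f(x), so its derivative is
    1 + f'(x) instead of f'(x).
    """
    g = 1.0
    for d in local_grads:
        g *= (1.0 + d) if skip else d
    return g


local = [0.1] * 20                      # hypothetical small derivatives
plain = chain_gradient(local)           # vanishes with depth
residual = chain_gradient(local, skip=True)
print(plain, residual)
```

The plain stack passes back an almost zero gradient to the first layers, while the residual stack keeps it at a usable magnitude.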
Key Features of ResNet:
• ResNet uses a layer called batch normalization, whose sole purpose is to adjust the input of the next layer and thereby increase performance. The problem of covariate shift is mitigated.
• ResNet uses skip connections to overcome the diminishing (vanishing) gradient problem. [17]

VI. TRAINING

After preprocessing the data, our combined dataset consists of a total of 4198 images. The number of images labelled 1, i.e. wearing a mask correctly, is 3232; the number labelled 2, i.e. not wearing a mask, is 717; and the number labelled 3, i.e. wearing a mask incorrectly, is 249. The dataset was then divided into training, validation and test data in an 8:1:1 ratio, i.e. the train set size is 3358, the validation set size is 419 and the test set size is 421. [18] The difference between the validation and test set sizes, despite the same ratio, arises because the test set size was calculated after the train and validation set sizes, by subtracting their sum from the total number of images. The images were also randomly shuffled, to avoid class imbalance across the splits and for robust model performance, and fed in batches of 64 for faster computation. We trained the model using cross-entropy loss and the Adam [19] optimizer (an upgrade to stochastic gradient descent with momentum capabilities). The learning rate was set to 10^-3, i.e. 0.001, and the number of epochs to 20; beyond this the model stopped learning, based on earlier observations during training.
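The 8:1:1 split sizes reported above can be reproduced with a few lines of arithmetic. The sketch assumes that the train and validation sizes are rounded down and that the test set takes the remainder, which is consistent with the explanation given for the 419 versus 421 difference; the function name is hypothetical.

```python
def split_sizes(total, train_frac=0.8, val_frac=0.1):
    """Train and validation sizes are truncated; test gets what is left."""
    train = int(total * train_frac)
    val = int(total * val_frac)
    test = total - train - val
    return train, val, test


# Class counts as reported in this section (labels 1, 2, 3).
class_counts = {1: 3232, 2: 717, 3: 249}
total_images = sum(class_counts.values())
print(total_images, split_sizes(total_images))  # 4198 (3358, 419, 421)
```

Because both truncations discard a fractional image, the two leftover images end up in the test set.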
A challenge faced during training was that a single data source was not enough to provide a sufficient number of images for each class. Therefore, several data sources were considered and a robust, balanced, sufficiently large dataset was created that provides enough data for the model to adapt to variance in the data. A GPU was used to train the model because of the large amount of data; training on the GPU proved to be about 3x faster than training on the CPU. GPU model used while training: NVIDIA GeForce GTX 1050 2GB GDDR5. Lighting and camera settings play a major role in model performance; we therefore used MTCNN, which handles such problems well.

Total params: 24,558,146
Trainable params: 16,014,850
Non-trainable params: 8,543,296
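As a quick consistency check on the parameter summary, the trainable and non-trainable counts should add up to the total, and their ratio indicates how much of the network was frozen. The variable names are chosen for illustration; the numbers are the ones listed above.

```python
total_params = 24_558_146
trainable_params = 16_014_850
non_trainable_params = 8_543_296

# The two groups must account for every parameter.
assert trainable_params + non_trainable_params == total_params

frozen_fraction = non_trainable_params / total_params
print(f"frozen share of parameters: {frozen_fraction:.1%}")
```

Roughly a third of the parameters are frozen, which matches the description of training only part of the ResNet-50 backbone together with the custom head.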

VII. RESULTS
The best model saved during training resulted in a
validation loss of 0.9591 and validation accuracy of
0.9689 which was

VIII. REAL-TIME APPLICATIONS:

• Mall security checks / Supermarkets
• Office spaces / Schools
• Hospitals
• Mobile applications for alerts


IX. FURTHER IMPLEMENTATIONS

It is evident that one of our biggest obstacles during the COVID-19 pandemic is to make sure people follow the safety regulations, especially in public places, for their own safety and the safety of others around them. Our DeepFaceMask model will thus detect whether people are wearing masks correctly when deployed to the CCTVs in public places, and can alert the admin as and when people are not wearing masks or are wearing masks incorrectly. Additionally, it can be used in head pose estimation, attention detection in classrooms and lectures on masked faces, drowsiness detection on masked faces using facial keypoints tracking the driver’s eyes, and so on. [20]

X. REFERENCES

[1] P. Viola and M. J. Jones, "Robust real-time face detection", Int. J. Comput. Vision, vol. 57, no. 2, pp. 137-154, May 2004.
[2] Z. A. Memish, A. I. Zumla, R. F. Al-Hakeem, A. A. Al-Rabeeah, and G. M. Stephens, “Family cluster of middle east respiratory syndrome coronavirus infections,” New England Journal of Medicine, vol. 368, no. 26, pp. 2487–2494, 2013.
[3] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” 2017.
[4] A. Shrivastava, A. Gupta, and R. Girshick, “Training region-based object detectors with online hard example mining,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 761–769.
[5] S. Ge, J. Li, Q. Ye, and Z. Luo, “Detecting masked faces in the wild with lle-cnns,” in Proceedings of the IEEE.
[6] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[9] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou, “Retinaface: Single-stage dense face localization in the wild,” arXiv preprint arXiv:1905.00641, 2019.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[11] Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25.
[12] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face: A face detection benchmark,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5525–5533.
[13] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
[15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision, 2016, pp. 779–788.
[16] A. Krizhevsky, I. Sutskever and G. E. Hinton, "Imagenet classification with deep convolutional neural networks", Advances in Neural Information Processing Systems 25, pp. 1097-1105, 2012.
[17] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition", CoRR, 2014.
[18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, et al., "Going deeper with convolutions", 2015.
[19] K. He, X. Zhang, S. Ren and J. Sun, "Deep residual learning for image recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
[20] P. Viola and M. J. Jones, "Robust real-time face detection", Int. J. Comput. Vision, vol. 57, no. 2, pp. 137-154, May 2004.

IJSER © 2021
http://www.ijser.org
