Abstract—In recent years, the proliferation of facial recognition systems and their integration into various domains has revolutionised how we interact with technology. However, this rapid adoption has also brought new security challenges, particularly facial spoofing attacks. Facial spoofing refers to the malicious attempt to deceive a facial recognition system by presenting counterfeit or manipulated facial biometric data. This paper proposes using Video Vision Transformers (ViViT) to tackle facial liveness and anti-spoofing identification on the Rose-Youtu facial liveness dataset. Video data loading of the video clips is also integrated and analysed as a video frame sampling method for facial liveness and anti-spoofing detection. The multiclass identification ViViT model yielded an accuracy of 86.78% on the isolated set of test subjects. An Equal Error Rate of 2.46% was achieved on the real facial videos when contrasted against the spoof attack videos.

Index Terms—Facial Liveness Detection, Facial Anti-Spoofing, Video Classification, Vision Transformers
I. INTRODUCTION

With the rapid advancement of facial recognition systems in various domains, ensuring the security and reliability of these systems has become a critical concern. Facial recognition is a rapidly growing market, with face-based biometric recognition expected to be worth USD 8.5 billion by 2025 [1]. Traditional facial recognition methods often struggle to differentiate between real faces and spoofing attempts using printed photographs, replayed videos, or 3D masks. A spoofing attack is a malicious act of impersonating or falsifying data to deceive individuals or systems into believing the attacker is someone they are not. This vulnerability allows malicious actors to bypass security measures and gain unauthorised access.

Researchers have focused on facial liveness and anti-spoofing detection techniques to address this issue. Facial liveness detection determines whether a face presented to a system is live or a spoof [2]. Anti-spoofing, on the other hand, involves detecting and preventing various forms of spoofing attacks, such as presentation attacks, texture attacks, and 3D mask attacks [3].

Facial anti-spoofing attacks vary between the following [4]:
• Presentation Attacks: Artificial representations such as printed photos or 3D masks deceive the system by mimicking a genuine face.
• Texture Attacks: High-frequency textures or patterns are used to exploit the system's reliance on specific visual cues.
• 3D Mask Attacks: Realistic three-dimensional masks or sculptures mimic facial features to bypass depth and structure detection.
• Video Replay Attacks: Recorded videos replayed in front of the system deceive it into recognising the replayed video as a live face.

Deep learning techniques, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have emerged as powerful tools for facial liveness and anti-spoofing detection. These techniques leverage large-scale datasets, complex network architectures, and advanced training algorithms to extract discriminative features and model intricate patterns in facial images or videos.

This paper notably makes the following contributions:
• The utilisation of video vision transformers for facial liveness detection.
• Video vision transformers for facial anti-spoofing classification.
• Video data loading for uniform sampling of frames for facial liveness and anti-spoofing detection.

Therefore, this paper investigates the application of video vision transformers to facial liveness detection and facial anti-spoofing on the ROSE-Youtu Face Liveness Detection Dataset [5]. For this study, no facial detection and cropping systems are utilised; the entire image, including the background, is fed into the model, as the assumption is that vital information pointing toward spoofed attacks may be visible in the background. Augmentation is applied to video frames during training to improve the proposed system's robustness and generalisation ability. The findings are subsequently compared to existing state-of-the-art systems for facial liveness and anti-spoofing detection.

The rest of this paper is structured as follows: Section I-A analyses related work, and Section II elaborates on the architecture of the vision transformer implemented in this study. Section III explains the methodology and experimental setup. The study results, along with the discussion, are presented in Section IV, and Section V concludes the paper and presents future work.

This work was undertaken in the Distributed Multimedia CoE at Rhodes University.
A. Related Studies

Facial liveness detection and anti-spoofing techniques have been extensively studied to mitigate the vulnerabilities of facial recognition systems against spoofing attacks. This section provides an overview of the existing literature on facial liveness detection and anti-spoofing, focusing on several influential papers in the field. Micro-texture analysis using multi-scale local binary patterns (LBP)¹ was the starting point towards countermeasures against spoofing attacks, on the premise that real faces present different texture patterns when compared to fake faces [7, 8].

LBP and LBP-TOP showed sufficient promise, but these hand-crafted descriptors were limited in the features they could capture; thus, introducing CNNs to learn discriminative features was the next significant landmark in facial anti-spoofing systems. CNNs combined with facial localisation and spatial and temporal augmentation outperformed traditional LBP approaches, with Half Total Error Rates (HTER) of 6.25% and 2.68% on the CASIA and Idiap Replay-Attack datasets [9, 10], respectively. Image distortion analysis has also been proposed for face anti-spoofing detection, using specular reflection, blurriness, chromatic moment and colour diversity features for an ensemble of SVM classifiers, each trained on different spoof attacks [11]. The system yielded HTERs of 13.3% and 7.41% on the CASIA and Replay-Attack datasets, respectively. A two-stage motion and deep learning-based approach was proposed on the Rose-Youtu dataset [12], whereby the motion stage detects blinking to ensure liveness against photo attacks and the second-stage DenseNet counters mask and video attacks. The system achieved an Equal Error Rate (EER) of 4.56% on the Rose-Youtu dataset. Two attention-based end-to-end models using EfficientNet B0 and MobileNet V2 backbones yielded validation F1-scores of 99.37% and 97.81%, respectively, on the Rose-Youtu dataset when trained as binary classification models. A dual-channel neural architecture employed a CNN to extract discriminative patterns and a hand-crafted-features wide network to detect domain-specific features in spoofed attacks [13]. The wide network extracts LBP, CoALBP and LBQ colour texture features. The two networks are then aggregated in a low-dimensional latent space for final classification. An HTER and EER of 6.12% and 4.27%, respectively, were achieved on the Rose-Youtu dataset.

A Liveness Detection network (LDnet) has been applied to the Rose-Youtu dataset using a Histogram of Oriented Gradients (HOG) face detector, which detects and crops out the face region [14]. The image is subsequently fed into a network consisting of a combination of 2D CNN and 3D CNN layers for final classification. The network treats the dataset as a binary classification problem to identify genuine vs fake liveness videos and does not classify the different types of attacks. The network yielded a 99.79% accuracy and an HTER of 0.08% on the Rose-Youtu dataset. 3D CNNs have also been explored on the Rose-Youtu dataset, achieving an EER of 7.00% [15]. Cai et al. [16] proposed a two-branch framework incorporating a CNN, an RNN and deep reinforcement learning to exploit and extract global and local features. A ResNet-18 backbone is utilised to extract global features, and a gated recurrent unit is applied to exploit local features using a specified image patch size across the temporal domain. The model yielded an EER of 1.79% on the Rose-Youtu dataset.

¹ LBP is a texture descriptor in computer vision that represents the relationship between a central pixel and its neighbouring pixels by encoding their intensity variations as binary patterns [6].
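For context on the texture descriptors used in these early baselines, the following minimal sketch shows the basic 8-neighbour LBP computation described in the footnote; the multi-scale variants used in [7, 8] extend this idea to several radii, and the implementation details here are illustrative rather than taken from those works.

```python
import numpy as np

def lbp_3x3(gray):
    """Basic 8-neighbour LBP: each pixel's neighbours are thresholded against
    the centre intensity, and the resulting bits form an 8-bit texture code."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    centre = gray[1:-1, 1:-1]
    # Offsets of the 8 neighbours; each contributes one bit of the code
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        out |= (neighbour >= centre).astype(np.uint8) << bit
    return out

patch = (np.random.rand(8, 8) * 255).astype(np.uint8)
print(lbp_3x3(patch))   # 6x6 map of 8-bit texture codes
```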
II. ARCHITECTURES

Deep learning, a division of Artificial Intelligence, refers to a form of machine learning where computational models include numerous layers to obtain data representations [17]. In contrast to conventional machine learning approaches, deep learning obviates the requirement of extracting features from the data before training the model.

Transformers have gained immense popularity in natural language processing tasks, including language translation, text generation, and sentiment analysis [18]. Unlike traditional neural network architectures such as RNNs or CNNs, transformers utilise a self-attention mechanism for processing input data. This mechanism enables the model to focus on different parts of the input sequence, learning which parts are crucial for the task at hand.
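As a minimal illustration of this mechanism, the following PyTorch sketch computes scaled dot-product self-attention for a single head over a sequence of token embeddings; the projection matrices and dimensions are placeholder assumptions, not part of the proposed system.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, seq_len, d_model); w_*: (d_model, d_k) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # token-to-token similarity
    weights = F.softmax(scores, dim=-1)                      # how much each token attends to the others
    return weights @ v                                       # weighted sum of value vectors

x = torch.randn(1, 8, 64)                  # 8 tokens, 64-dimensional embeddings
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)     # (1, 8, 64)
```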
The Video Vision Transformer (ViViT) is a deep learning model designed for video recognition, based on the Vision Transformer (ViT) architecture [19, 20]. ViViT comprises three primary components: a spatial encoder, a temporal encoder, and a classification head, as illustrated in Figure 1. The spatial encoder independently processes each input video frame, employing a ViT-based transformer network to encode it into a fixed-length feature vector. This encoding involves 2D position embeddings and a multi-head self-attention mechanism to capture spatial relationships among different regions within the frame.

The output of the spatial encoder is a tensor with dimensions [T, H, W, C], where T represents the number of frames, H and W denote the height and width of the frame, and C signifies the number of channels in the feature maps. The temporal encoder then takes the feature vectors from the spatial encoder and models their temporal relationships using a 1D temporal convolutional network. Dilated convolutions are employed to enable the modelling of long-range temporal dependencies. Consequently, the output of the temporal encoder is a tensor with dimensions [1, D], where D corresponds to the size of the feature vector.
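To make the factorised design concrete, the sketch below encodes each frame independently, aggregates the per-frame features with dilated 1D convolutions, and classifies the pooled clip feature. The layer sizes, depth and eight-way output are illustrative assumptions, not the configuration trained in this study.

```python
import torch
import torch.nn as nn

class ViViTSketch(nn.Module):
    """Factorised video classifier: per-frame spatial encoding followed by
    temporal aggregation, roughly mirroring the description above."""

    def __init__(self, d_model=256, num_classes=8):
        super().__init__()
        # Spatial encoder stand-in: patch embedding plus transformer layers,
        # applied to every frame independently (a full ViT would be used in practice)
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Temporal encoder: dilated 1D convolutions over the per-frame features
        self.temporal = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))          # one feature vector per clip
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, video):                 # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)          # (B*T, 3, H, W)
        tokens = self.patch_embed(frames).flatten(2).transpose(1, 2)
        feats = self.spatial(tokens).mean(dim=1)       # (B*T, d_model)
        feats = feats.view(b, t, -1).transpose(1, 2)   # (B, d_model, T)
        clip = self.temporal(feats).squeeze(-1)        # (B, d_model)
        return self.head(clip)                # class logits

logits = ViViTSketch()(torch.randn(2, 16, 3, 224, 224))   # (2, 8)
```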
Fig. 1. Video Vision Transformer architecture.
Tubelet embedding is an efficient technique that encodes the spatiotemporal information of video segments known as tubelets. Tokens are extracted from non-overlapping tubelets of dimensions t × h × w, with n_t = T/t, n_h = H/h and n_w = W/w, as illustrated in Figure 2. The tubelet embedding module comprises a spatial encoder and a temporal encoder. The spatial encoder employs a modified version of the ViT-based transformer network to encode the appearance features of each frame within the tubelet. Meanwhile, the temporal encoder aggregates the appearance features over time and captures motion information using a 1D dilated convolutional neural network. The output of the tubelet embedding module is a fixed-length feature vector that represents the spatiotemporal information of the tubelet. This output is then fed into the temporal encoder of the ViViT model, which further aggregates information across multiple tubelets to generate a final prediction.
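One common way to realise tubelet embedding is a 3D convolution whose kernel size equals its stride, so that each non-overlapping t × h × w block of the clip is projected to a single token. The sketch below follows that approach; the tubelet size and embedding dimension are assumed values, not those reported in Table II.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Extracts one token per non-overlapping t x h x w tubelet of a clip."""

    def __init__(self, d_model=256, tubelet=(2, 16, 16)):
        super().__init__()
        # Kernel size equal to stride -> non-overlapping tubelets
        self.proj = nn.Conv3d(3, d_model, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                   # (B, 3, T, H, W)
        x = self.proj(video)                    # (B, d_model, T/t, H/h, W/w)
        return x.flatten(2).transpose(1, 2)     # (B, n_t * n_h * n_w, d_model) tokens

tokens = TubeletEmbedding()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)                             # torch.Size([1, 1568, 256])
```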
III. METHODOLOGY AND EXPERIMENTAL SETUP

This paper uses vision transformers to evaluate facial liveness and anti-spoofing detection on the ROSE-Youtu Face Liveness Detection Dataset. Subsequently, this section provides an in-depth overview of the proposed system architecture, which focuses on model generalisation on unseen data. All model training and evaluation were performed on an Nvidia RTX 3090 GPU with 24 GB of memory.

A. ROSE-Youtu Face Liveness Detection Dataset

The ROSE-Youtu Face Liveness Detection Dataset is a face anti-spoofing and liveness database consisting of a range of illumination conditions, camera models and attack types [5, 23]. The publicly available version of the dataset covers 20 subjects, creating a total of 3350 videos, where some subjects wore glasses. The number of videos for each subject ranges between 150 and 200 video clips, with an average duration of around 10 seconds. The front-facing cameras of five different mobile devices were utilised to capture all videos. Three spoofing attack types are incorporated into the dataset: paper attack, video replay attack, and masking attack. Figure 3 visualises the genuine face video alongside the seven different varieties of the three attacks, and Table I describes the various types of attacks within the dataset.
The dataset is split into three subsets, namely training, validation and testing. Subjects 2-12 are reserved for the train split, and the remainder is split evenly between the validation and test splits. The ratio between the train:validation:test splits is 50:25:25. The proposed system identifies and classifies each of the types of attack present in the Rose-Youtu Facial Liveness dataset as a multiclass classification problem.
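A minimal sketch of this subject-based split, together with the uniform frame sampling mentioned in the contributions, is given below; the clip-indexing structure and the choice of 16 frames per clip are assumptions for illustration, not the exact data-loading code used in this study.

```python
import numpy as np

def split_by_subject(clips):
    """clips: dict mapping clip path -> subject ID (assumed structure).
    Subjects 2-12 go to training; the remaining subjects are divided
    evenly between validation and testing."""
    train = [c for c, s in clips.items() if 2 <= s <= 12]
    held_out = sorted({s for s in clips.values() if not 2 <= s <= 12})
    val_subjects = set(held_out[:len(held_out) // 2])
    val = [c for c, s in clips.items() if s in val_subjects]
    test = [c for c, s in clips.items() if s in set(held_out) - val_subjects]
    return train, val, test

def uniform_frame_indices(num_frames_in_clip, num_samples=16):
    """Uniformly spaced frame indices across a clip."""
    return np.linspace(0, num_frames_in_clip - 1, num_samples).round().astype(int)

# Illustrative usage with 20 hypothetical subjects and 3 clips each
demo = {f"subject_{s:02d}/clip_{i}.mp4": s for s in range(2, 22) for i in range(3)}
train, val, test = split_by_subject(demo)
print(len(train), len(val), len(test))       # 33 12 15
print(uniform_frame_indices(300, 16))        # 16 indices spread over a ~10 s clip
```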
TABLE I
ANTI-SPOOFING ATTACKS PRESENT IN THE ROSE-YOUTU FACE LIVENESS DETECTION DATASET.

Label   Description
G       Genuine person
Ps      Still printed paper
Pq      Quivering printed paper
Vl      Video recorded on Lenovo display
Vm      Video recorded on Mac display
Mc      Paper mask with two eyes and mouth cropped out
Mf      Paper mask without cropping
Mu      Paper mask with the upper part cut in the middle
The Equal Error Rate (EER) is the operating point at which the false acceptance rate (FAR) and false rejection rate (FRR) are equal, as given by Equation 2. The HTER metric is also utilised in biometrics, representing the average of the FAR and FRR errors [4], as seen in Equation 3. For the proposed system, accuracy, EER and HTER are utilised to evaluate and compare the model performance.

Accuracy = (number of correct classifications) / (total number of classifications attempted)    (1)
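Consistent with the descriptions above, Equations 2 and 3 can be written as follows; the notation is a reconstruction from those definitions, with τ* denoting the decision threshold at which the two error rates coincide.

```latex
% Equation (2): EER is the common error value at the threshold tau* where the
% false acceptance and false rejection rates cross
\mathrm{EER} = \mathrm{FAR}(\tau^{*}) = \mathrm{FRR}(\tau^{*})   \tag{2}

% Equation (3): HTER averages the two error rates
\mathrm{HTER} = \frac{\mathrm{FAR} + \mathrm{FRR}}{2}            \tag{3}
```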
TABLE II
VIVIT TUBELET EMBEDDING PARAMETERS.
TABLE III
EER AND HTER PERCENTAGES FOR EACH CLASS IN THE ROSE-YOUTU DATASET.

Table III provides an overview of the HTER and EER results for the different classes in the Rose-Youtu dataset. The class with the lowest HTER and EER, of 2.00% and 0.00% respectively, was the quivering printed paper. The video recorded on the Lenovo display recorded the highest EER of 4.39%.

C. Discussion

Overall, the ViViT-based proposed model performs well on the Rose-Youtu dataset, achieving an accuracy of 86.78%. Based on the confusion matrix visualised in Figure 5, the genuine class is confused with the video recorded on the Mac display spoof attack. The Vm and Vl classes are also confused with each other in Figure 5; this is most likely because both are laptop display devices, and distinguishing between display devices is not the model's objective.
TABLE IV
COMPARISON OF VISION TRANSFORMERS WITH STATE-OF-THE-ART METHODS ON THE ROSE-YOUTU DATASET.

Method                        Accuracy    HTER      EER
Motion-based approach [12]    95.44%      —         4.56%
Wide and deep features [13]   —           6.12%     4.27%
3D CNN [15]                   —           —         7.00%
ViViT                         86.78%      13.28%    2.46%
DRL-FAS [16]                  —           —         1.79%
LDnet [14]                    99.79%      0.08%     —

V. CONCLUSION

The ViViT-based proposed model performs exceptionally well on the non-trivial Rose-Youtu dataset, achieving an accuracy of 86.78% on facial liveness detection. The multiclass problem identified the genuine and various live spoof attacks, focusing on successfully identifying each class of attack in the dataset. Analysing the genuine live video clips and comparing these against the spoof attacks, the model achieved an HTER of 13.2% and an EER of 2.46%. The proposed model achieved favourable results compared with existing cutting-edge approaches on the same dataset and experimental conditions. The study and proposed model notably contributed towards utilising custom video data loading and implementing video vision transformers for face liveness and anti-spoofing detection, towards optimising generalisation ability in other environmental conditions.

Future expansions of this work would investigate the effect of different patch sizes for the tubelet embedding in ViViT and compare model performance on other benchmark datasets such as the CASIA, Idiap Replay-Attack and MSU mobile face spoofing databases.

REFERENCES

[1] "Facial recognition market by component (software tools (3d facial recognition) and services),