Liveness Detection in Computer Vision: Transformer-Based Self-Supervised Learning for Face Anti-Spoofing
ABSTRACT Face recognition systems are increasingly used in biometric security for convenience and
effectiveness. However, they remain vulnerable to spoofing attacks, where attackers use photos, videos,
or masks to impersonate legitimate users. This research addresses these vulnerabilities by exploring the
Vision Transformer (ViT) architecture, fine-tuned with the DINO framework on CelebA-Spoof,
CASIA-SURF, and a proprietary dataset. The DINO framework facilitates self-supervised learning, enabling
the model to learn distinguishing features from unlabeled data. We compared the performance of the
proposed fine-tuned ViT model using the DINO framework against traditional models, including the CNN
models EfficientNet b2 and EfficientNet b2 (Noisy Student), and MobileViT, on the face anti-spoofing task.
Numerous tests on standard datasets show that the ViT model performs better than other models in terms
of accuracy and resistance to different spoofing methods. Our model’s superior performance, particularly in
APCER (1.6%), the most critical metric in this domain, underscores its improved ability to detect spoofing
relative to other models. Additionally, we collected our own dataset from a biometric application to validate
our findings further. This study highlights the superior performance of transformer-based architecture in
identifying complex spoofing cues, leading to significant advancements in biometric security.
INDEX TERMS Biometric security, computer vision, DINO Framework, face anti-spoofing, liveness
detection, self-supervised learning, unsupervised learning, vision transformers.
limited. We hypothesize that a transformer-based model, trained on a large and diverse dataset, can effectively capture the nuanced features indicative of spoofing and, as a result, outperform traditional CNN models.

Face anti-spoofing poses unique challenges, especially due to the lack of labeled spoofing data and the constantly evolving techniques used to bypass security systems. Self-supervised learning, such as DINO, provides a significant advantage by allowing the model to learn from large amounts of unlabeled data, reducing the need for expensive and time-consuming labeled data. This is particularly valuable for face anti-spoofing, where collecting and labeling diverse spoofing samples is a challenge. By using self-supervised learning, our model can better generalize across a wider range of spoofing attacks and adapt to unseen threats.

In this study, we utilized multiple benchmark datasets to evaluate the performance of our proposed Vision Transformer (ViT) model, fine-tuned using the DINO framework. Besides these established datasets, we also gathered a unique dataset from a biometric application.

The contributions of this study are as follows:
• Introducing the Vision Transformer (ViT) architecture fine-tuned with the DINO (Emerging Properties in Self-Supervised Vision Transformers) framework for face anti-spoofing. While ViTs have been used in face anti-spoofing, integrating the DINO framework in this area has not been extensively investigated.
• A comparative analysis of the proposed model against traditional models, including the CNN models EfficientNet b2 and EfficientNet b2 (Noisy Student), as well as MobileViT.
• An improvement in anti-spoofing performance, reflected in the APCER. Our comparative analysis shows that our DINO-based ViT model significantly outperforms the other models, demonstrating a better ability to identify spoofing attacks.

One of our model's main distinctions is its integration with the DINO framework, which employs self-supervised learning. This allows our model to learn from unlabeled data and generalize better across various spoofing attacks.

Although Vision Transformers have previously been applied to face anti-spoofing, the integration of the DINO framework remains underexplored in this context. Our work addresses this gap by introducing a novel approach that utilizes DINO's self-supervised learning capabilities to enhance model robustness against spoofing attacks. To our knowledge, this is the first application of the DINO framework in the context of face anti-spoofing. It fills a critical gap in the existing literature and offers new insights into the potential of self-supervised ViTs in biometric security.

The paper is structured as follows: Section I is this introduction. Section II presents an overview of the works related to face anti-spoofing. Next, Section III describes the methods employed in this study, including data collection, vision transformers, and the DINO framework. Experimental results are presented in Section IV, followed by a Discussion in Section V and Future Works in Section VI. Finally, the concluding remarks are drawn in Section VII.

II. RELATED WORK

A. FACE ANTI-SPOOFING
This section reviews the existing methods for face anti-spoofing, including traditional and deep learning-based approaches. The vulnerability of face recognition systems to spoofing attacks has been extensively studied.

Initial methods for face anti-spoofing mainly used hand-crafted features and traditional machine learning techniques. For instance, some researchers [15] combined SURF (speeded-up robust features), a local feature detector and descriptor, with Fisher vector encoding, an image feature encoding and quantization technique, to enhance face spoof detection. Still, these methods struggled to generalize to new and unseen spoofing attacks. Similarly, researchers focused on smartphone-based face unlock systems, emphasizing the limitations of these traditional methods in dynamic and varied attack scenarios [16].

A range of other methods has been proposed for face anti-spoofing, including Haralick texture features [17], image quality assessment [18], patch and depth-based CNNs [19], and multi-feature videolet aggregation [20]. These methods have shown promising results in distinguishing between genuine and spoofed face appearances. Other approaches include general image quality assessment [21], color texture analysis [22], and pulse detection from face videos [23], all of which have demonstrated effectiveness in detecting various types of spoofing attacks. Combining face recognition systems with other security systems, such as RFID, has also been suggested to strengthen security [24].

Since the emergence of deep learning, Convolutional Neural Networks (CNNs) have become popular in face anti-spoofing research. Several studies have demonstrated the effectiveness of CNNs in learning features directly from data [25], [26], leading to improved liveness detection performance. However, these models require large, diverse datasets and often struggle to generalize to novel spoofing techniques due to their reliance on local feature extraction.

Several recent studies have explored the use of transformer architectures in face anti-spoofing, with promising results. Studies [11] and [27] both achieved competitive performance using ViT transformers, with the latter introducing a relation-aware mechanism. Performance was further improved by deepening the transformer network and by introducing adaptive transformers for robust cross-domain face anti-spoofing [28]. Other studies focused on generalizability, proposing a domain-invariant vision transformer [13] and demonstrating the effectiveness of vision transformers for zero-shot face anti-spoofing [29]. Another work presents UDG-FAS, the first Unsupervised Domain Generalization framework for Face Anti-Spoofing [30]. This framework uses large volumes of unlabeled data to learn generalizable features, thereby improving performance in low-data scenarios for face anti-spoofing. Another study introduces FM-ViT, a transformer-based framework that outperforms existing single-modal frameworks [12]. Adaptive vision transformers for robust few-shot cross-domain face anti-spoofing [28] and the Domain-invariant Vision Transformer (DiVT) [13], mentioned above, further improved generalizability. Next, the study [14] developed a convolutional vision transformer-based framework for robust performance against unseen domain data.

As we can see, recent advancements in Vision Transformers (ViTs) offer a promising alternative. Unlike CNNs, ViTs capture global dependencies via self-attention mechanisms, potentially enhancing their ability to identify subtle, global spoofing cues. Studies [29] have explored the application of ViTs to unseen face anti-spoofing, showcasing their potential in handling unseen attacks. Further research [27] emphasized the effectiveness of transformers in incorporating relation-aware mechanisms for improved spoof detection.

Recent studies illustrate the relevance of handling masked face detection in real-time scenarios, which can be extended to an anti-spoofing approach, enhancing the robustness of face detection systems. A Caffe-modified MobileNetV2 (CMNV2) model for masked face age and gender identification was proposed [31], achieving 96.54% accuracy by focusing on key facial areas such as the eyes, forehead, and ears. Similarly, the authors of [32] developed a Caffe-MobileNetV2 model for detecting masked and unmasked faces in both photos and real-time video, with an impressive accuracy of 99.64%. These studies highlight the importance of feature extraction from the periocular region and above, which aligns with challenges in face detection and anti-spoofing under occluded conditions.

Specific challenges frequently arise in face anti-spoofing research, including difficulties in generalizing across different domains and datasets, the constraints imposed by limited data, and technical obstacles related to methodologies such as anomaly detection and black-box discriminators. Cross-domain issues, such as the domain gap and limited data, can lead to poor generalization of models to new domains. Furthermore, the generalization capabilities of classifiers, particularly when applied to diverse databases, are often questioned, as they may not consistently perform well across different datasets.

B. DINO FRAMEWORK
Recent research has explored the DINO framework for vision transformers, demonstrating its effectiveness in various computer vision tasks. DINO-based models have shown remarkable performance in object detection and segmentation [33]. The framework has been extended to improve few-shot keypoint detection [34]. The original DINO paper [35] highlighted the method's ability to learn rich visual representations without labels, achieving state-of-the-art results on ImageNet.

Many studies demonstrate the effectiveness of DINO in the object detection and masked autoencoder domains. The work in [36] focuses on learning patch-level representations, which are crucial for accurate object detection: DINO's self-supervised vision transformers enable the model to learn detailed representations of image patches, improving performance in detecting and recognizing objects. Lastly, [37] and [38] both demonstrate how DINO's features can be effectively utilized in masked autoencoders, enabling these models to reconstruct masked image regions more efficiently. These studies demonstrate DINO's versatility and effectiveness across various computer vision applications.

The DINO framework has also been explored in the context of security, particularly in adversarial attack scenarios [39], [40]. For example, studies have analyzed the robustness of self-supervised Vision Transformers trained with DINO against adversarial attacks, showing that these models can be more resilient than those trained through supervised learning [40]. These works have focused on evaluating the robustness of DINO in adversarial contexts and exploring defense strategies to enhance model security. However, despite these advancements, no previous studies have applied DINO specifically to face anti-spoofing. Our research addresses this gap by employing the DINO framework to enhance the performance of Vision Transformers in detecting spoofing attacks, thereby contributing a novel application of DINO in the domain of biometric security. By doing so, we demonstrate the potential of self-supervised learning frameworks like DINO to significantly improve real-world security applications, particularly face anti-spoofing.

So, unlike traditional supervised approaches that rely heavily on labeled datasets, DINO excels in tasks like face anti-spoofing thanks to its ability to capture global dependencies and learn discriminative features from large amounts of unlabeled data. This leads to improved generalization to diverse spoofing attacks that may not be present in traditional training datasets. By leveraging the ViT architecture, DINO allows the model to detect subtle details indicative of spoofing, making it particularly well-suited for this task.

III. METHODS

A. DATA
In this research, we employed several benchmark datasets to assess how well our proposed Vision Transformer (ViT) model, fine-tuned with the DINO framework, performs. These datasets were selected for their diversity and coverage of various spoofing techniques, ensuring a thorough evaluation of the model's capabilities.

The CelebA-Spoof [41] dataset is an extensive dataset created especially for face anti-spoofing tasks. It contains over 625,000 images of 10,000 subjects, incorporating
FIGURE 1. Training data distribution by dataset and label.
FIGURE 2. Validation data distribution by dataset and label.
FIGURE 3. Sample images from the dataset illustrating genuine ("live") and fake ("fake") examples. The dataset includes a variety of facial images, covering spoofing techniques such as printed and screen images.

MobileViT [46] is an effective neural network architecture that merges the capabilities of Vision Transformers (ViTs) with Convolutional Neural Networks (CNNs). MobileViT's hybrid design enables it to capture both global and local image features, which is important for the face anti-spoofing domain.

C. DINO (DISTILLATION WITH NO LABELS)
DINO is a self-supervised learning approach that trains the model to generate similar embeddings for different views of the same image [35]. This is done using a student-teacher training setup, where the student network learns to imitate the output of the teacher network. The architecture is shown in Fig. 5.
• Teacher Network: A fixed pre-trained network that provides stable target representations.
• Student Network: A trainable network that learns to predict the teacher's representations.
The DINO framework helps the ViT model learn discriminative features from large amounts of unlabeled data. This is particularly useful for tasks like face anti-spoofing, where labeled data may be limited, and it allows the model to train on our data without labels.

D. EFFICIENTNET B2
EfficientNet b2 is a CNN model optimized for both efficiency and performance [47]. It uses a compound scaling method that proportionally increases the network's width, depth, and resolution, resulting in improved accuracy with fewer parameters.

E. PROPOSED APPROACH
1) EXPERIMENTAL SETTINGS
To tackle the issue of face anti-spoofing, we fine-tuned a Vision Transformer (ViT) model using the DINO framework. Our approach leverages ViTs' ability to capture global dependencies in the input data via self-attention mechanisms, which enhances their ability to detect subtle, global spoofing cues. Though Vision Transformers have been applied to face anti-spoofing in prior research, incorporating the DINO framework within this context has received limited attention.

We compared how well the ViT model performed against traditional models, including the CNN models EfficientNet b2 and EfficientNet b2 (Noisy Student), as well as MobileViT, to see how effective transformer-based methods are in this field. Our models were trained on two NVIDIA A100 40 GB GPUs. The detailed training procedure is outlined in Algorithm 1.

We selected the Adam optimizer [49] for its ability to adapt the learning rate of each parameter automatically; this is effective because, in deep learning problems, the loss function landscape can be extremely non-convex. It is particularly suitable for deep models such as Vision Transformers.

Focal Loss [50] was used to handle the issue of class imbalance, a frequent challenge in face anti-spoofing tasks. It reduces the impact of easy-to-classify examples, enabling the model to concentrate more effectively on complex cases, such as identifying spoofed faces.

Using fp16 half precision enables faster training and reduces memory usage, especially when dealing with large models or datasets. This approach also allows for larger batch sizes, speeding up the training process on GPUs with limited memory. It helped us accelerate training and work within the memory limits of our GPUs.

The OneCycleLR scheduler [51] modifies the learning rate throughout the training process by initially setting it low, gradually increasing it to a peak, and then reducing it. This helps the model converge faster and perform better by enabling it to explore a range of learning rates during training. It showed better convergence compared to other schedulers.
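To make these settings concrete, the following is a minimal training-loop sketch combining Adam, Focal Loss, fp16 mixed precision, and OneCycleLR in PyTorch. The stand-in model, batch size, learning rate, and focal-loss parameters are illustrative assumptions, not the exact values used in our experiments.

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import OneCycleLR
from torchvision.ops import sigmoid_focal_loss  # focal loss from [50]

# Illustrative values only, not the exact settings used in our experiments.
EPOCHS, STEPS_PER_EPOCH, PEAK_LR, BATCH = 2, 4, 3e-4, 8
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1)).to(device)  # ViT stand-in
optimizer = Adam(model.parameters(), lr=PEAK_LR)    # adaptive per-parameter step sizes
scheduler = OneCycleLR(optimizer, max_lr=PEAK_LR,   # low -> peak -> low schedule
                       epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # fp16 loss scaling

for _ in range(EPOCHS):
    for _ in range(STEPS_PER_EPOCH):
        images = torch.randn(BATCH, 3, 224, 224, device=device)         # dummy batch
        targets = torch.randint(0, 2, (BATCH,), device=device).float()  # 1 = spoof (assumed)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=use_amp):  # fp16 half-precision forward
            logits = model(images).squeeze(1)
            # Focal loss down-weights easy examples so training focuses on hard cases.
            loss = sigmoid_focal_loss(logits, targets, alpha=0.25,
                                      gamma=2.0, reduction="mean")
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()  # OneCycleLR is stepped once per batch
```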
FIGURE 4. The input face image is split into patches, which are then projected linearly and embedded with positional information. These embeddings go into the Transformer encoder, which processes the sequence of patches. Next, the encoder's output is passed through a multi-layer perceptron (MLP) head to classify the image as either "spoof" or "live."

FIGURE 5. This figure illustrates the DINO (Distillation with No Labels) model training process. It starts with image augmentations (1), where two augmented views of the same image are generated. The student model processes one view, while the teacher model processes the other (2). The teacher model's outputs are centered and passed through a softmax layer (3). The student's outputs are optimized using Stochastic Gradient Descent (SGD) to match the teacher's outputs, minimizing the cross-entropy loss between the student's and teacher's predictions, while the teacher is updated via an exponential moving average (EMA) of the student (4).
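To complement the caption above and the description in Section C, the following is a minimal sketch of the core DINO update under a simplified two-view setup; the multi-crop strategy and schedule details of [35] are omitted, and the network sizes, temperatures, and momentum value shown are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """Cross-entropy between sharpened teacher targets and student predictions.

    As in [35], the teacher output is centered (to avoid collapse) and
    sharpened with a low temperature before being used as a soft target.
    """
    teacher_probs = F.softmax((teacher_out - center) / t_t, dim=-1).detach()
    student_logp = F.log_softmax(student_out / t_s, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    """The teacher tracks the student via an exponential moving average."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Toy usage with two augmented views of one batch (stand-in nets, 384-d inputs).
student = torch.nn.Linear(384, 256)
teacher = torch.nn.Linear(384, 256)
teacher.load_state_dict(student.state_dict())
center = torch.zeros(256)

view1, view2 = torch.randn(8, 384), torch.randn(8, 384)  # two views of the same images
teacher_out = teacher(view1)
loss = dino_loss(student(view2), teacher_out, center)
loss.backward()               # then step the student optimizer (SGD/AdamW)
ema_update(student, teacher)  # EMA teacher update (step 4 in Fig. 5)
center = 0.9 * center + 0.1 * teacher_out.mean(dim=0).detach()  # running center
```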
2) DISTINGUISHED FEATURES
During the training process, various data augmentation techniques were used to enhance the robustness and generalizability of the face anti-spoofing models. The visualization of augmentations can be seen in Fig. 6. These augmentations were categorized into four main groups (a sketch of the combined pipeline follows the list):
1) Color Transformations. To provide color variations and simulate different lighting conditions, we used augmentations such as ChannelShuffle, ChannelDropout, and RandomBrightnessContrast.
2) Affine Transformations. We used augmentations such as Rotate and Flip to provide geometric variations and enhance the model's ability to generalize across different orientations and perspectives.
3) Quality Degradations. To simulate various image quality issues that might be encountered in real-world scenarios, we used augmentations such as ImageCompression and a combination of blurring techniques such as Blur with a blur limit of 3 to 7, MotionBlur with a blur limit of 7 to 21, and GaussNoise for variability in noise levels.
4) Cropping and Padding. To alter the spatial composition of the images, we used CropAndPad with a percentage range of -10% to +23%, which randomly crops and pads the images, ensuring the model can handle partial occlusions and varying framing conditions.
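These transform names match the Albumentations library, so the four groups can be composed as in the sketch below; the application probabilities and compression-quality bounds are illustrative assumptions, while the blur limits and crop-and-pad range follow the text above.

```python
import albumentations as A

# Probabilities and compression-quality bounds are illustrative; the blur
# limits and the crop-and-pad range are the ones stated in the list above.
train_augs = A.Compose([
    # 1) Color transformations
    A.ChannelShuffle(p=0.2),
    A.ChannelDropout(p=0.2),
    A.RandomBrightnessContrast(p=0.5),
    # 2) Affine transformations
    A.Rotate(limit=30, p=0.5),
    A.Flip(p=0.5),
    # 3) Quality degradations
    A.ImageCompression(quality_lower=40, quality_upper=90, p=0.3),
    A.OneOf([
        A.Blur(blur_limit=(3, 7)),
        A.MotionBlur(blur_limit=(7, 21)),
        A.GaussNoise(),
    ], p=0.3),
    # 4) Cropping and padding
    A.CropAndPad(percent=(-0.10, 0.23), p=0.3),
])

# Usage: augmented = train_augs(image=image)["image"]  # image: HxWxC uint8 array
```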
The steps of the training Algorithm are as follows:
1) Data Preparation. Split images into patches and create patch embeddings with positional encodings (a minimal sketch of this step follows).
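As a concrete illustration of this data-preparation step (and of the pipeline in the caption of Fig. 4), the following is a minimal patch-embedding sketch following the standard ViT recipe [45]; the image size, patch size, and embedding dimension are illustrative assumptions.

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Split the face image into patches and create patch embeddings
    with positional encodings (see Fig. 4). Dimensions are illustrative."""

    def __init__(self, img_size=224, patch=16, dim=384):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A strided convolution is equivalent to slicing non-overlapping
        # patches and projecting each one linearly.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                                 # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

emb = PatchEmbedding()
print(emb(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 384])
```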
FIGURE 7. The training pipeline involves collecting a comprehensive dataset for face anti-spoofing, processing and augmenting the images to enhance
quality and variability, feeding the preprocessed data into the DINOv2 (Vision Transformer) model with a binary classification layer, and evaluating the
model’s performance.
TABLE 3. Comparison of EfficientNet and ViT (DINO) Models (all datasets combined).
Evaluation standards such as ISO/IEC 30107-3 [52] focus on APCER and BPCER, as these metrics more accurately reflect the system's performance in real-world security scenarios.

Fig. 10 illustrates the trends for APCER, BPCER, ACER, and accuracy over 50 training epochs for all models. The plot demonstrates a significant decrease in APCER for all models, with the ViT (DINO) model consistently maintaining a lower APCER throughout the training process. The BPCER plot highlights the reduction in BPCER, where the ViT (DINO) model shows superior performance by achieving a lower BPCER than the other models. The ACER plot, which reflects the overall classification error rate, highlights the ViT (DINO) model's ability to balance APCER and BPCER. The accuracy plot illustrates the higher overall accuracy of the ViT (DINO) model, indicating better general performance in distinguishing genuine and spoofed faces.

Fig. 9 presents the confusion matrices for all models. The ViT (DINO) model demonstrates superior classification performance with the lowest APCER and BPCER values, resulting in fewer false positives and false negatives. The confusion matrix for ViT (DINO) highlights its ability to accurately distinguish between genuine and spoofed faces, leading to high accuracy. MobileViT also shows strong performance with low error rates, while both EfficientNet b2 models, though achieving high accuracy, exhibit higher APCER and BPCER, reflecting a relatively higher rate of misclassification when compared to MobileViT and ViT (DINO).
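For reference, the sketch below shows how APCER, BPCER, and ACER can be computed from binary decisions, using the common single-attack-type simplification of the ISO/IEC 30107-3 [52] definitions (the standard defines APCER per attack instrument species); the label encoding is an assumption.

```python
def pad_metrics(y_true, y_pred):
    """APCER, BPCER, and ACER in the sense of ISO/IEC 30107-3 [52].

    y_true / y_pred: iterables of labels, 1 = attack (spoof), 0 = bona fide.
    APCER: fraction of attack presentations misclassified as bona fide.
    BPCER: fraction of bona fide presentations misclassified as attacks.
    ACER:  the mean of the two error rates.
    """
    attacks = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    bona_fide = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    apcer = sum(1 for _, p in attacks if p == 0) / len(attacks)
    bpcer = sum(1 for _, p in bona_fide if p == 1) / len(bona_fide)
    return apcer, bpcer, (apcer + bpcer) / 2

# Toy example: two of four attacks slip through, no bona fide rejections.
print(pad_metrics([1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 0, 0]))  # (0.5, 0.0, 0.25)
```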
V. DISCUSSION

A. WHY IS APCER SIGNIFICANTLY DECREASED?
As our experimental observations demonstrated, APCER significantly decreased after we trained the ViT model, with even greater improvements when fine-tuned using the DINO framework. The decrease in APCER reflects the model's ability to more accurately distinguish between real and spoofed faces, reducing the risk of security breaches in face recognition systems. This improvement is critical because APCER directly measures the model's effectiveness in identifying spoof attacks, a key concern in biometric security applications.

The superior performance of ViT-based models can be attributed to their ability to capture global patterns and dependencies across the entire image, rather than focusing only on localized features, as is common with traditional CNN models. ViTs are particularly well-suited for face anti-spoofing tasks because they can detect subtle inconsistencies, such as unnatural lighting or distortions in spoofed faces. Moreover, the DINO framework's self-supervised pre-training further enhances the model's capability to learn discriminative features from large amounts of unlabeled data. By using this data, the DINO framework enables the ViT model to generalize better to diverse spoofing techniques that may not be present in traditional training datasets. This results in a model that is more robust against novel and complex spoofing attacks.

The attention visualizations for spoof and live class images, as shown in Fig. 11, reveal how the Vision Transformer (ViT) model, fine-tuned with DINO, selectively focuses on different regions of the images when making classifications. In the case of spoof class images (Fig. 11b), the attention maps demonstrate that the model concentrates on areas that often exhibit unnatural artifacts or inconsistencies, such as reflections, edges, or distortions typically found in spoofing attacks. In contrast, for the live class images (Fig. 11a), the attention maps show a more evenly distributed focus on natural, coherent facial features, such as skin texture, smoothness, and uniform lighting patterns. This distinction between how the model handles real and spoofed images illustrates the model's effectiveness in focusing on relevant features for classification.
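As an illustration of how such maps can be obtained, the sketch below extracts the CLS-token attention of the last encoder layer via the Hugging Face transformers API. The checkpoint name is an assumption for illustration; any ViT checkpoint that returns attentions works the same way, and dedicated methods such as attention rollout typically produce cleaner maps.

```python
import torch
from transformers import ViTModel

# Checkpoint name is an assumption for illustration purposes only.
model = ViTModel.from_pretrained("facebook/dino-vits16", add_pooling_layer=False)
model.eval()

pixel_values = torch.randn(1, 3, 224, 224)  # a preprocessed face crop
with torch.no_grad():
    out = model(pixel_values, output_attentions=True)

# Last-layer attention: (batch, heads, tokens, tokens); token 0 is the CLS token.
attn = out.attentions[-1]
cls_to_patches = attn[0, :, 0, 1:].mean(dim=0)  # average heads, drop CLS->CLS
side = int(cls_to_patches.numel() ** 0.5)       # 14 for 224 px / 16 px patches
heatmap = cls_to_patches.reshape(side, side)    # upsample and overlay on the face
```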
In contrast, the EfficientNet B2 model, although optimized for efficiency and performance, relies on local feature extraction through convolutional layers. This localized focus may limit its ability to generalize to novel and sophisticated spoofing attacks that require a detailed understanding of the face's overall structure. Additionally, the traditional supervised learning approach used for training EfficientNet B2 may not fully exploit the potential of the available data, leading to suboptimal generalization. This limitation led us to experiment with training EfficientNet B2 using the Noisy Student method, a semi-supervised approach that uses both labeled and unlabeled data. This approach improved performance metrics, including APCER, but the results were still not as good as those of the self-supervised ViT model fine-tuned with DINO.

The findings of this study suggest that adopting transformer-based architectures, such as ViT, fine-tuned with self-supervised learning frameworks like DINO, or even CNN-based models enhanced with semi-supervised learning frameworks like Noisy Student, can significantly improve face anti-spoofing systems. These advancements have practical implications for improving the security and reliability of biometric authentication systems, which are increasingly used in areas such as unlocking personal devices and controlling access in secure environments.

FIGURE 9. Confusion matrices for four models: EfficientNet b2, EfficientNet b2 (Noisy Student), MobileViT, and ViT (DINO).

FIGURE 10. Trends of APCER, BPCER, ACER, and accuracy over 50 training epochs for EfficientNet b2 and ViT (DINO) models, demonstrating the superior performance of the ViT (DINO) model in face anti-spoofing tasks.

FIGURE 11. Attention heatmaps and original spoof images for different datasets. The top row shows the attention heatmaps, highlighting the regions where the model focuses its attention during classification. The bottom row displays the original live or spoof images from datasets such as CelebA-Spoof, CASIA-SURF, and a proprietary dataset.

B. COMPARISON WITH RECENT STUDIES
Let us review how the current study's results compare to previous studies. Many studies have explored using vision transformers in face anti-spoofing, with promising results. These works demonstrate the effectiveness of such models in detecting anomalies and achieving robust performance across different domains [11], [13], [28], [53]. Studies [27] and [54] further enhance the capabilities of vision transformers by incorporating relation-aware mechanisms and adaptive-avg-pooling-based attention. Next, [29] and [55] extend the application of vision transformers to zero-shot anti-spoofing and data augmentation, respectively, achieving state-of-the-art performance.
Lastly, [56] reports significant improvements in accuracy and reduced equal error rates using transformer-based models. These studies collectively highlight the potential of vision transformers in enhancing the security of face recognition systems. Our findings support these prior research works.

As can be seen, the existing studies mainly focus on supervised or semi-supervised methods, leaving room for improvement in terms of generalization. In contrast, our approach utilizes the DINO framework, a self-supervised method, allowing our model to learn from large-scale unlabeled data. This significantly enhances the model's ability to generalize across diverse spoofing techniques and presents an advantage over traditional CNNs and even supervised ViT models, offering a more flexible and powerful approach to face anti-spoofing. Although similar research has previously been carried out, the literature has paid little attention to fine-tuning the ViT architecture with DINO.

VI. LIMITATIONS AND FUTURE WORKS
The study has certain limitations that need to be addressed in future work. Firstly, the reliance on a specific set of datasets may limit the generalizability of the results to other types of spoofing attacks or different demographic groups. Secondly, while the DINO framework provides significant improvements, it also introduces additional computational complexity that may be challenging to accommodate in real-time applications. Finally, the current study does not consider the potential impact of environmental variations, such as lighting conditions and camera quality, on the model's performance. Addressing these limitations in future research will be crucial for developing more universally applicable and efficient face anti-spoofing systems.

Future research should consider using extra data types, such as depth and infrared, to make face anti-spoofing models even more robust. Investigating the application of other self-supervised learning techniques and transformer architectures could also provide further enhancements. In addition, in future research, we aim to explore the integration of fuzzy logic with ViT, a recent trend [57]. Fuzzy logic is a powerful tool for handling imprecision and uncertainty [58], which could enhance the robustness and adaptability of face anti-spoofing models, particularly in scenarios with ambiguous or uncertain data. Finally, real-world testing and deployment of these models in diverse environments would be valuable in assessing their practical effectiveness and identifying areas for improvement.

VII. CONCLUSION
In this study, we presented a novel application of the DINO framework within Vision Transformers for face anti-spoofing. This approach addresses the limited exploration of DINO's self-supervised learning capabilities in this context. Several benchmark datasets were used to assess the effectiveness of the model.

Our comparative experiments demonstrated that the ViT (DINO) model consistently outperformed other state-of-the-art models, including EfficientNet B2, EfficientNet B2 with Noisy Student, and MobileViT, across all key metrics, particularly APCER, indicating its superior ability to distinguish between genuine and spoofed faces. This improvement is crucial as it addresses the growing threat of spoofing attacks in various applications, from personal device security to access control in high-security environments. The findings underscore the importance of adopting cutting-edge AI technologies to safeguard biometric systems against increasingly sophisticated spoofing techniques.

In general, the findings suggest that incorporating DINO into ViTs enhances their robustness against spoofing attacks and their performance in biometric security applications, offering valuable insights into the potential of self-supervised learning in biometric security. This contributes to a broader understanding of how self-supervised learning techniques can be effectively applied in this domain.

REFERENCES
[1] E. Vazquez-Fernandez and D. Gonzalez-Jimenez, "Face recognition for authentication on mobile devices," Image Vis. Comput., vol. 55, pp. 31–33, Nov. 2016, doi: 10.1016/j.imavis.2016.03.018.
[2] R. V. Petrescu, "Face recognition as a biometric application," SSRN Electron. J., vol. 3, pp. 237–257, Apr. 2019, doi: 10.2139/ssrn.3417325.
[3] M. P. Nagesh, "Face recognition systems," Int. J. Res. Appl. Sci. Eng. Technol., vol. 11, no. 3, pp. 962–964, Mar. 2023, doi: 10.22214/ijraset.2023.49567.
[4] T. I. Dhamecha, S. Ghosh, M. Vatsa, and R. Singh, "Kernelized heterogeneity-aware cross-view face recognition," Frontiers Artif. Intell., vol. 4, Jul. 2021, Art. no. 670538, doi: 10.3389/frai.2021.670538.
[5] D. A. Chowdhry, A. Hussain, M. Z. Ur Rehman, F. Ahmad, A. Ahmad, and M. Pervaiz, "Smart security system for sensitive area using face recognition," in Proc. IEEE Conf. Sustain. Utilization Develop. Eng. Technol. (CSUDET), May 2013, pp. 11–14, doi: 10.1109/CSUDET.2013.6670976.
[6] A. AbdElaziz, "A survey of smartphone-based face recognition systems for security purposes," Kafrelsheikh J. Inf. Sci., vol. 2, no. 1, pp. 1–7, Aug. 2021, doi: 10.21608/kjis.2021.5484.1006.
[7] N. Erdogmus and S. Marcel, "Spoofing face recognition with 3D masks," IEEE Trans. Inf. Forensics Security, vol. 9, no. 7, pp. 1084–1097, Jul. 2014.
[8] B. Hamdan and K. Mokhtar, "The detection of spoofing by 3D mask in a 2D identity recognition system," Egyptian Informat. J., vol. 19, no. 2, pp. 75–82, Jul. 2018.
[9] L. Omar and I. Ivrissimtzis, "Evaluating the resilience of face recognition systems against malicious attacks," in Proc. 7th U.K. Brit. Mach. Vis. Workshop, 2015, pp. 5.1–5.9.
[10] L. Omar and I. Ivrissimtzis, "Designing a facial spoofing database for processed image attacks," in Proc. 7th Int. Conf. Imag. Crime Detection Prevention (ICDP), 2016, pp. 1–6.
[11] L. Abduh, L. Omar, and I. Ivrissimtzis, "Anomaly detection with transformer in face anti-spoofing," J. WSCG, vol. 31, nos. 1–2, pp. 91–98, Jul. 2023.
[12] A. Liu, Z. Tan, Z. Yu, C. Zhao, J. Wan, Y. Liang, Z. Lei, D. Zhang, S. Z. Li, and G. Guo, "FM-ViT: Flexible modal vision transformers for face anti-spoofing," IEEE Trans. Inf. Forensics Security, vol. 18, pp. 4775–4786, 2023, doi: 10.1109/TIFS.2023.3296330.
[13] C.-H. Liao, W.-C. Chen, H.-T. Liu, Y.-R. Yeh, M.-C. Hu, and C.-S. Chen, "Domain invariant vision transformer learning for face anti-spoofing," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023, pp. 6087–6096.
[14] Y. Lee, Y. Kwak, and J. Shin, "Robust face anti-spoofing framework with convolutional vision transformer," in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2023, pp. 1015–1019.
[15] Z. Boulkenafet, J. Komulainen, and A. Hadid, "Face antispoofing using speeded-up robust features and Fisher vector encoding," IEEE Signal Process. Lett., vol. 24, no. 2, pp. 141–145, Feb. 2017.
[16] K. Patel, H. Han, and A. K. Jain, "Secure face unlock: Spoof detection on smartphones," IEEE Trans. Inf. Forensics Security, vol. 11, no. 10, pp. 2268–2283, Oct. 2016.
[17] A. Agarwal, R. Singh, and M. Vatsa, "Face anti-spoofing using Haralick features," in Proc. IEEE 8th Int. Conf. Biometrics Theory, Appl. Syst. (BTAS), Sep. 2016, pp. 1–6.
[18] E. Fourati, W. Elloumi, and A. Chetouani, "Face anti-spoofing with image quality assessment," in Proc. 2nd Int. Conf. Bio-eng. Smart Technol. (BioSMART), Aug. 2017, pp. 1–4.
[19] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu, "Face anti-spoofing using patch and depth-based CNNs," in Proc. IEEE Int. Joint Conf. Biometrics (IJCB), Oct. 2017, pp. 319–328.
[20] T. A. Siddiqui, S. Bharadwaj, T. I. Dhamecha, A. Agarwal, M. Vatsa, R. Singh, and N. Ratha, "Face anti-spoofing with multifeature videolet aggregation," in Proc. 23rd Int. Conf. Pattern Recognit. (ICPR), 2016, pp. 1035–1040.
[21] J. Galbally and S. Marcel, "Face anti-spoofing based on general image quality assessment," in Proc. 22nd Int. Conf. Pattern Recognit., Aug. 2014, pp. 1173–1178.
[22] Z. Boulkenafet, J. Komulainen, and A. Hadid, "Face anti-spoofing based on color texture analysis," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2015, pp. 2636–2640.
[23] X. Li, J. Komulainen, G. Zhao, P.-C. Yuen, and M. Pietikäinen, "Generalized face anti-spoofing by detecting pulse from face videos," in Proc. 23rd Int. Conf. Pattern Recognit. (ICPR), Dec. 2016, pp. 4244–4249, doi: 10.1109/ICPR.2016.7900300.
[24] A. Aff, M. Awedh, and M. H. A. Alghamdi, "RFID and face recognition based security and access control system," Int. J. Innov. Res. Sci., Eng. Technol., vol. 2, no. 11, pp. 5955–5964, Jan. 2013. [Online]. Available: https://api.semanticscholar.org/CorpusID:13542387
[25] S. Garg, S. Mittal, P. Kumar, and V. Anant Athavale, "DeBNet: Multilayer deep network for liveness detection in face recognition system," in Proc. 7th Int. Conf. Signal Process. Integr. Netw. (SPIN), Feb. 2020, pp. 1136–1141.
[26] S. Jafri, S. Chawan, and A. Khan, "Face recognition using deep neural network with 'LivenessNet'," in Proc. Int. Conf. Inventive Comput. Technol. (ICICT), 2020, pp. 145–148.
[27] Z. Wang, Q. Wang, W. Deng, and G. Guo, "Face anti-spoofing using transformers with relation-aware mechanism," IEEE Trans. Biometrics, Behav., Identity Sci., vol. 4, no. 3, pp. 439–450, Jul. 2022.
[28] H.-P. Huang, D. Sun, Y. Liu, W.-S. Chu, T. Xiao, J. Yuan, H. Adam, and M.-H. Yang, "Adaptive transformers for robust few-shot cross-domain face anti-spoofing," in Proc. Eur. Conf. Comput. Vis., Jan. 2022, pp. 37–54.
[29] A. George and S. Marcel, "On the effectiveness of vision transformers for zero-shot face anti-spoofing," in Proc. IEEE Int. Joint Conf. Biometrics (IJCB), Aug. 2021, pp. 1–8.
[30] Y. Liu, Y. Chen, M. Gou, C.-T. Huang, Y. Wang, W. Dai, and H. Xiong, "Towards unsupervised domain generalization for face anti-spoofing," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 20597–20607.
[31] B. A. Kumar and M. Bansal, "Face mask detection on photo and real-time video images using caffe-MobileNetV2 transfer learning," Appl. Sci., vol. 13, no. 2, p. 935, Jan. 2023, doi: 10.3390/app13020935.
[32] B. A. Kumar and N. K. Misra, "Masked face age and gender identification using caffe-modified MobileNetV2 on photo and real-time video images by transfer learning and deep learning techniques," Expert Syst. Appl., vol. 246, Jul. 2024, Art. no. 123179. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417424000447
[33] F. Li, H. Zhang, H. Xu, S. Liu, L. Zhang, L. M. Ni, and H.-Y. Shum, "Mask DINO: Towards a unified transformer-based framework for object detection and segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 3041–3050.
[34] C. Lu, H. Zhu, and P. Koniusz, "From saliency to DINO: Saliency-guided vision transformer for few-shot keypoint detection," 2023, arXiv:2304.03140.
[35] M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, and A. Joulin, "Emerging properties in self-supervised vision transformers," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9630–9640.
[36] S. Yun, H. Lee, J. Kim, and J. Shin, "Patch-level representation learning for self-supervised vision transformers," 2022, arXiv:2206.07990.
[37] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," 2021, arXiv:2111.06377.
[38] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. So Kweon, and S. Xie, "ConvNeXt v2: Co-designing and scaling ConvNets with masked autoencoders," 2023, arXiv:2301.00808.
[39] N. Inkawhich, G. McDonald, and R. Luley, "Adversarial attacks on foundational vision models," 2023, arXiv:2308.14597.
[40] J. Rando, N. Naimi, T. Baumann, and M. Mathys, "Exploring adversarial attacks and defenses in vision transformers trained with DINO," 2022, arXiv:2206.06761.
[41] Y. Zhang, Z. Yin, Y. Li, G. Yin, J. Yan, J. Shao, and Z. Liu, "CelebA-Spoof: Large-scale face anti-spoofing dataset with rich annotations," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 70–85.
[42] S. Zhang, A. Liu, J. Wan, Y. Liang, G. Guo, S. Escalera, H. J. Escalante, and S. Z. Li, "CASIA-SURF: A large-scale multi-modal benchmark for face anti-spoofing," IEEE Trans. Biometrics, Behav., Identity Sci., vol. 2, no. 2, pp. 182–193, Apr. 2020.
[43] S. Zhang, X. Wang, A. Liu, C. Zhao, J. Wan, S. Escalera, H. Shi, Z. Wang, and S. Z. Li, "A dataset and benchmark for large-scale multi-modal face anti-spoofing," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 919–928.
[44] N. Ilinykh and S. Dobnik, "What does a language-and-vision transformer see: The impact of semantic information on visual representations," Frontiers Artif. Intell., vol. 4, Dec. 2021, Art. no. 767971, doi: 10.3389/frai.2021.767971.
[45] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[46] S. Mehta and M. Rastegari, "Separable self-attention for mobile vision transformers," 2022, arXiv:2206.02680.
[47] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. 36th Int. Conf. Mach. Learn., vol. 97, 2019, pp. 6105–6114.
[48] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, "Self-training with noisy student improves ImageNet classification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 10687–10698.
[49] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[50] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," 2017, arXiv:1708.02002.
[51] L. N. Smith and N. Topin, "Super-convergence: Very fast training of neural networks using large learning rates," 2017, arXiv:1708.07120.
[52] Information Technology—Biometric Presentation Attack Detection—Part 3: Testing and Reporting, Standard ISO/IEC 30107-3:2023, Int. Org. for Standardization, 2023. [Online]. Available: https://www.iso.org/standard/79520.html
[53] M. Marais, D. Brown, J. Connan, and A. Boby, "Facial liveness and anti-spoofing detection using vision transformers," in Proc. Southern Afr. Telecommun. Netw. Appl. Conf. (SATNAC), Aug. 2023, pp. 1–6.
[54] J. Yang, F. Chen, R. K. Das, Z. Zhu, and S. Zhang, "Adaptive-avg-pooling based attention vision transformer for face anti-spoofing," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2024, pp. 3875–3879, doi: 10.1109/ICASSP48485.2024.10446940.
[55] J. Orfao and D. van der Haar, "Keyframe and GAN-based data augmentation for face anti-spoofing," in Proc. 12th Int. Conf. Pattern Recognit. Appl. Methods, 2023, pp. 629–640, doi: 10.5220/0011648400003411.
[56] K. Watanabe, K. Ito, and T. Aoki, "Spoofing attack detection in face recognition system using vision transformer with patch-wise data augmentation," in Proc. Asia–Pacific Signal Inf. Process. Assoc. Annu. Summit Conf. (APSIPA ASC), Nov. 2022, pp. 1561–1565, doi: 10.23919/APSIPAASC55919.2022.9979996.
[57] Q. Fan, Q. You, X. Han, Y. Liu, Y. Tao, H. Huang, R. He, and H. Yang, "ViTAR: Vision transformer with any resolution," 2024, arXiv:2403.18361.
[58] P. Kozlov, A. Akram, and P. Shamoi, "Fuzzy approach for audio-video emotion recognition in computer games for children," Proc. Comput. Sci., vol. 231, pp. 771–778, Jan. 2024, doi: 10.1016/j.procs.2023.12.139.

ARMAN KERESH received the B.S. degree in information systems from the Al-Farabi Kazakh National University, Almaty, Kazakhstan, in 2023. He is currently pursuing the M.S. degree in data science with Kazakh-British Technical University. He is also a computer vision engineer in a leading telecommunication company in Kazakhstan. His research interests include artificial intelligence and machine learning, image processing, liveness detection, image generation, and self-supervised learning.

PAKIZAR SHAMOI (Member, IEEE) received the B.S. and M.S. degrees in information systems from Kazakh-British Technical University, Almaty, Kazakhstan, in 2011 and 2013, respectively, and the Ph.D. degree in engineering from Mie University, Tsu, Japan, in 2019. In her academic journey, she has held various teaching and research positions at Kazakh-British Technical University, where she has been a Professor with the School of Information Technology and Engineering, since August 2020. She is the author of one book, one monograph, and more than 33 scientific publications. Her research interests include artificial intelligence and machine learning in general, with a focus on fuzzy sets and logic, soft computing, representing and processing colors in computer systems, natural language processing, computational aesthetics, and human-friendly computing and systems. She received awards for the best paper at conferences five times. She took part in the organization and worked in the organization committee (as the Head of the Session and responsible for special sessions) of several international conferences, such as IFSA-SCIS 2017, Otsu, Japan; SCIS-ISIS 2022, Mie, Japan; and EUSPN 2023, Almaty. She served as a Reviewer for several international conferences, including IEEE: SIST 2023, SMC 2022, SCIS-ISIS 2022, SMC 2020, ICIEV-IVPR 2019, and ICIEV-IVPR 2018.