Liveness Detection in Computer Vision: Transformer-Based Self-Supervised Learning for Face Anti-Spoofing
ABSTRACT Face recognition systems are increasingly used in biometric security for convenience and
effectiveness. However, they remain vulnerable to spoofing attacks, where attackers use photos, videos,
or masks to impersonate legitimate users. This research addresses these vulnerabilities by exploring the
Vision Transformer (ViT) architecture, fine-tuned with the DINO framework on CelebA-Spoof,
CASIA-SURF, and a proprietary dataset. The DINO framework facilitates self-supervised learning, enabling
the model to learn distinguishing features from unlabeled data. We compared the performance of the
proposed fine-tuned ViT model using the DINO framework against traditional models, including the CNN
models EfficientNet b2 and EfficientNet b2 (Noisy Student), and MobileViT, on the face anti-spoofing task.
Numerous tests on standard datasets show that the ViT model performs better than other models in terms
of accuracy and resistance to different spoofing methods. Our model’s superior performance, particularly in
APCER (1.6%), the most critical metric in this domain, underscores its improved ability to detect spoofing
relative to other models. Additionally, we collected our own dataset from a biometric application to validate
our findings further. This study highlights the superior performance of transformer-based architecture in
identifying complex spoofing cues, leading to significant advancements in biometric security.
INDEX TERMS Biometric security, computer vision, DINO Framework, face anti-spoofing, liveness
detection, self-supervised learning, unsupervised learning, vision transformers.
limited. We hypothesize that a transformer-based model, trained on a large and diverse dataset, can effectively capture the nuanced features indicative of spoofing and, as a result, outperform traditional CNN models.

Face anti-spoofing poses unique challenges, especially due to the lack of labeled spoofing data and the constantly evolving techniques used to bypass security systems. Self-supervised learning, such as DINO, provides a significant advantage by allowing the model to learn from large amounts of unlabeled data, reducing the need for expensive and time-consuming labeled data. This is particularly valuable for face anti-spoofing, where collecting and labeling diverse spoofing samples is a challenge. By using self-supervised learning, our model can better generalize across a wider range of spoofing attacks and adapt to unseen threats.

In this study, we utilized multiple benchmark datasets to evaluate the performance of our proposed Vision Transformer (ViT) model, fine-tuned using the DINO framework. Besides these established datasets, we also gathered a unique dataset from a biometric application.

The contributions of this study are as follows:
• Introducing the Vision Transformer (ViT) architecture fine-tuned with the DINO (Emerging Properties in Self-Supervised Vision Transformers) framework for face anti-spoofing. While ViTs have been used in face anti-spoofing, integrating the DINO framework in this area has not been extensively investigated.
• A comparative analysis of the proposed model against traditional models, including the CNN models EfficientNet b2 and EfficientNet b2 (Noisy Student), as well as MobileViT.
• An improvement in anti-spoofing performance, reflected in the APCER. Our comparative analysis shows that our DINO-based ViT model significantly outperforms the other models, demonstrating a better ability to identify spoofing attacks.

One of our model's main distinctions is its integration with the DINO framework, which employs self-supervised learning. This allows our model to learn from unlabeled data and generalize better across various spoofing attacks.

Although Vision Transformers have previously been applied to face anti-spoofing, the integration of the DINO framework remains underexplored in this context. Our work addresses this gap by introducing a novel approach that utilizes DINO's self-supervised learning capabilities to enhance model robustness against spoofing attacks. To our knowledge, this is the first application of the DINO framework in the context of face anti-spoofing. It fills a critical gap in the existing literature and offers new insights into the potential of self-supervised ViTs in biometric security.

The paper is structured as follows: Section I is this introduction. Section II presents an overview of the works related to face anti-spoofing. Next, Section III describes the methods employed in this study, including data collection, vision transformers, and the DINO framework. Experimental results are presented in Section IV, followed by a Discussion in Section V and Future Works in Section VI. Finally, the concluding remarks are drawn in Section VII.

II. RELATED WORK

A. FACE ANTI-SPOOFING
This section reviews the existing methods for face anti-spoofing, including traditional and deep learning-based approaches. The vulnerability of face recognition systems to spoofing attacks has been extensively studied.

Initial methods for face anti-spoofing mainly used hand-crafted features and traditional machine learning techniques. For instance, some researchers [15] combined SURF (speeded-up robust features), a local feature detector and descriptor, with Fisher vector encoding, an image feature encoding and quantization technique, to enhance face spoof detection. Still, these methods struggled to generalize to new and unseen spoofing attacks. Similarly, researchers focused on smartphone-based face unlock systems, emphasizing the limitations of these traditional methods in dynamic and varied attack scenarios [16].

A range of other methods has been proposed for face anti-spoofing, including Haralick texture features [17], image quality assessment [18], patch and depth-based CNNs [19], and multi-feature videolet aggregation [20]. These methods have shown promising results in distinguishing between genuine and spoofed face appearances. Other approaches include general image quality assessment [21], color texture analysis [22], and pulse detection from face videos [23], all of which have demonstrated effectiveness in detecting various types of spoofing attacks. Combining face recognition systems with other security systems, such as RFID, has also been suggested to strengthen security [24].

Since the emergence of deep learning, Convolutional Neural Networks (CNNs) have become popular in face anti-spoofing research. Several studies have demonstrated the effectiveness of CNNs in learning features directly from data [25], [26], leading to improved liveness detection performance. However, these models require large, diverse datasets and often struggle to generalize to novel spoofing techniques due to their reliance on local feature extraction.

Several recent studies have explored the use of transformer architectures in face anti-spoofing, with promising results. Studies [11] and [27] both achieved competitive performance using ViT transformers, with the latter introducing a relation-aware mechanism. Performance was further improved by deepening the transformer network and by introducing adaptive transformers for robust cross-domain face anti-spoofing [28]. Other studies focused on generalizability, proposing a domain-invariant vision transformer [13] and demonstrating the effectiveness of vision transformers for zero-shot face anti-spoofing [29]. Another work presents UDG-FAS, the first Unsupervised Domain Generalization framework for Face Anti-Spoofing [30]. This framework uses large volumes of unlabeled data to learn generalizable features, thereby improving performance in low-data scenarios for face anti-spoofing. Another study introduces FM-ViT, a transformer-based framework that outperforms existing single-modal frameworks [12]. Adaptive vision transformers for robust few-shot cross-domain face anti-spoofing [28] and the Domain-invariant Vision Transformer (DiVT) [13], mentioned above, further improved generalizability. Next, the study [14] developed a convolutional vision transformer-based framework for robust performance against unseen domain data.

As we can see, recent advancements in Vision Transformers (ViTs) offer a promising alternative. Unlike CNNs, ViTs capture global dependencies via self-attention mechanisms, potentially enhancing their ability to identify subtle, global spoofing cues. Studies [29] have explored the application of ViTs to unseen face anti-spoofing, showcasing their potential in handling unseen attacks. Further research [27] emphasized the effectiveness of transformers in incorporating relation-aware mechanisms for improved spoof detection.

Recent studies illustrate the relevance of handling masked face detection in real-time scenarios, which can be extended to an anti-spoofing approach, enhancing the robustness of face detection systems. A Caffe-modified MobileNetV2 (CMNV2) model for masked face age and gender identification was proposed [31], achieving 96.54% accuracy by focusing on key facial areas such as the eyes, forehead, and ears. Similarly, the authors of [32] developed a Caffe-MobileNetV2 model for detecting masked and unmasked faces in both photos and real-time video, with an impressive accuracy of 99.64%. These studies highlight the importance of feature extraction from the periocular region and above, which aligns with challenges in face detection and anti-spoofing under occluded conditions.

Specific challenges frequently arise in face anti-spoofing research, including difficulties in generalizing across different domains and datasets, the constraints imposed by limited data, and technical obstacles related to methodologies such as anomaly detection and black-box discriminators. Cross-domain issues, such as the domain gap and limited data, can lead to poor generalization of models to new domains. Furthermore, the generalization capabilities of classifiers, particularly when applied to diverse databases, are often questioned, as they may not consistently perform well across different datasets.

B. DINO FRAMEWORK
Recent research has explored the DINO framework for vision transformers, demonstrating its effectiveness in various computer vision tasks. DINO-based models have shown remarkable performance in object detection and segmentation [33]. The framework has been extended to improve few-shot keypoint detection [34]. The original DINO paper [35] highlighted the method's ability to learn rich visual representations without labels, achieving state-of-the-art results on ImageNet.

Many studies demonstrate the effectiveness of DINO in the object detection and masked autoencoder domains. The work in [36] focuses on learning patch-level representations, which are crucial for accurate object detection: DINO's self-supervised vision transformers enable the model to learn detailed representations of image patches, improving performance in detecting and recognizing objects. Lastly, [37] and [38] both demonstrate how DINO's features can be effectively utilized in masked autoencoders, enabling these models to reconstruct masked image regions more efficiently. These studies demonstrate DINO's versatility and effectiveness across various computer vision applications.

The DINO framework has also been explored in the context of security, particularly in adversarial attack scenarios [39], [40]. For example, studies have analyzed the robustness of self-supervised Vision Transformers trained with DINO against adversarial attacks, showing that these models can be more resilient than those trained through supervised learning [40]. These works have focused on evaluating the robustness of DINO in adversarial contexts and exploring defense strategies to enhance model security. However, despite these advancements, no previous studies have applied DINO specifically to face anti-spoofing. Our research addresses this gap by employing the DINO framework to enhance the performance of Vision Transformers in detecting spoofing attacks, thereby contributing a novel application of DINO in the domain of biometric security. By doing so, we demonstrate the potential of self-supervised learning frameworks like DINO to significantly improve real-world security applications, particularly face anti-spoofing.

So, unlike traditional supervised approaches that rely heavily on labeled datasets, DINO excels in tasks like face anti-spoofing thanks to its ability to capture global dependencies and learn discriminative features from large amounts of unlabeled data. This leads to improved generalization to diverse spoofing attacks that may not be present in traditional training datasets. By leveraging the ViT architecture, DINO allows the model to detect subtle details indicative of spoofing, making it particularly well-suited for this task.

III. METHODS

A. DATA
In this research, we employed several benchmark datasets to assess how well our proposed Vision Transformer (ViT) model, fine-tuned with the DINO framework, performs. These datasets were selected for their diversity and coverage of various spoofing techniques, ensuring a thorough evaluation of the model's capabilities.

The CelebA-Spoof [41] dataset is an extensive dataset created especially for face anti-spoofing tasks. It contains over 625,000 images of 10,000 subjects, incorporating
FIGURE 1. Training data distribution by dataset and label.
FIGURE 2. Validation data distribution by dataset and label.
FIGURE 3. Sample images from the dataset illustrating genuine ("live") and fake ("fake") examples. The dataset includes a variety of facial images, covering spoofing techniques such as printed and screen images.

MobileViT [46] is an effective neural network architecture that merges the capabilities of Vision Transformers (ViTs) with Convolutional Neural Networks (CNNs). MobileViT's hybrid design enables it to capture both global and local image features, which is important for the face anti-spoofing domain.

C. DINO (DISTILLATION WITH NO LABELS)
DINO is a self-supervised learning approach that trains the model to generate similar embeddings for different views of the same image [35]. This is done using a student-teacher training setup, where the student network learns to imitate the output of the teacher network. The architecture is shown in Fig. 5.
• Teacher Network: A fixed pre-trained network that provides stable target representations.
• Student Network: A trainable network that learns to predict the teacher's representations.
The DINO framework helps the ViT model learn discriminative features from large amounts of unlabeled data. This is particularly useful for tasks like face anti-spoofing, where labeled data may be limited, and it allows the model to train on our data without labels.

D. EFFICIENTNET B2
EfficientNet b2 is a CNN model optimized for both efficiency and performance [47]. It uses a compound scaling method that proportionally increases the network's width, depth, and resolution, resulting in improved accuracy with fewer parameters.

E. PROPOSED APPROACH
1) EXPERIMENTAL SETTINGS
To tackle the issue of face anti-spoofing, we fine-tuned a Vision Transformer (ViT) model using the DINO framework. Our approach leverages ViTs' ability to capture global dependencies in the input data via self-attention mechanisms, which enhances their ability to detect subtle, global spoofing cues. Though Vision Transformers have been applied to face anti-spoofing in prior research, incorporating the DINO framework within this context has received limited attention.

We compared how well the ViT model performed against traditional models, including the CNN models EfficientNet b2 and EfficientNet b2 (Noisy Student), as well as MobileViT, to see how effective transformer-based methods are in this field. Our models were trained on two NVIDIA A100 40 GB GPUs. The detailed training procedure is outlined in Algorithm 1.

We selected the Adam optimizer [49] for its ability to adapt the learning rate of each parameter automatically; this is effective because, in deep learning problems, the loss function landscape can be extremely non-convex. It is particularly suitable for deep models such as Vision Transformers.

Focal Loss [50] was used to handle the issue of class imbalance, a frequent challenge in face anti-spoofing tasks. It reduces the impact of easy-to-classify examples, enabling the model to concentrate more effectively on complex cases, such as identifying spoofed faces.

Using fp16 half precision enables faster training and reduces memory usage, especially when dealing with large models or datasets. This approach also allows for larger batch sizes, speeding up the training process on GPUs with limited memory. It helped us accelerate training and work within the memory limits of our GPUs.

The OneCycleLR scheduler [51] modifies the learning rate throughout the training process by initially setting it low, gradually increasing it to a peak, and then reducing it. This helps the model converge faster and perform better by enabling it to explore a range of learning rates during training. It showed better convergence compared to other schedulers.
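To make these settings concrete, the following is a minimal training-loop sketch combining Adam, Focal Loss, fp16 mixed precision, and OneCycleLR in PyTorch. The stand-in model, batch size, learning rate, and focal-loss parameters are illustrative assumptions, not the exact values used in our experiments.

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import OneCycleLR
from torchvision.ops import sigmoid_focal_loss  # focal loss from [50]

# Illustrative values only, not the exact settings used in our experiments.
EPOCHS, STEPS_PER_EPOCH, PEAK_LR, BATCH = 2, 4, 3e-4, 8
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1)).to(device)  # ViT stand-in
optimizer = Adam(model.parameters(), lr=PEAK_LR)    # adaptive per-parameter step sizes
scheduler = OneCycleLR(optimizer, max_lr=PEAK_LR,   # low -> peak -> low schedule
                       epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # fp16 loss scaling

for _ in range(EPOCHS):
    for _ in range(STEPS_PER_EPOCH):
        images = torch.randn(BATCH, 3, 224, 224, device=device)         # dummy batch
        targets = torch.randint(0, 2, (BATCH,), device=device).float()  # 1 = spoof (assumed)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=use_amp):  # fp16 half-precision forward
            logits = model(images).squeeze(1)
            # Focal loss down-weights easy examples so training focuses on hard cases.
            loss = sigmoid_focal_loss(logits, targets, alpha=0.25,
                                      gamma=2.0, reduction="mean")
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()  # OneCycleLR is stepped once per batch
```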
FIGURE 4. The input face image is split into patches, which are then projected linearly and embedded with positional information. These embeddings go into the Transformer encoder, which processes the sequence of patches. Next, the encoder's output is passed through a multi-layer perceptron (MLP) head to classify the image as either "spoof" or "live."

FIGURE 5. This figure illustrates the DINO (Distillation with No Labels) model training process. It starts with image augmentations (1), where two augmented views of the same image are generated. The student model processes one view, while the teacher model processes the other (2). The teacher model's outputs are centered and passed through a softmax layer (3). The student's outputs are optimized using Stochastic Gradient Descent (SGD) to match the teacher's outputs, minimizing the cross-entropy loss between the student's and teacher's predictions, while the teacher is updated via an exponential moving average (EMA) of the student (4).
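To complement the caption above and the description in Section C, the following is a minimal sketch of the core DINO update under a simplified two-view setup; the multi-crop strategy and schedule details of [35] are omitted, and the network sizes, temperatures, and momentum value shown are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """Cross-entropy between sharpened teacher targets and student predictions.

    As in [35], the teacher output is centered (to avoid collapse) and
    sharpened with a low temperature before being used as a soft target.
    """
    teacher_probs = F.softmax((teacher_out - center) / t_t, dim=-1).detach()
    student_logp = F.log_softmax(student_out / t_s, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    """The teacher tracks the student via an exponential moving average."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Toy usage with two augmented views of one batch (stand-in nets, 384-d inputs).
student = torch.nn.Linear(384, 256)
teacher = torch.nn.Linear(384, 256)
teacher.load_state_dict(student.state_dict())
center = torch.zeros(256)

view1, view2 = torch.randn(8, 384), torch.randn(8, 384)  # two views of the same images
teacher_out = teacher(view1)
loss = dino_loss(student(view2), teacher_out, center)
loss.backward()               # then step the student optimizer (SGD/AdamW)
ema_update(student, teacher)  # EMA teacher update (step 4 in Fig. 5)
center = 0.9 * center + 0.1 * teacher_out.mean(dim=0).detach()  # running center
```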
2) DISTINGUISHED FEATURES
During the training process, various data augmentation techniques were used to enhance the robustness and generalizability of the face anti-spoofing models. The visualization of augmentations can be seen in Fig. 6. These augmentations were categorized into four main groups (a sketch of the combined pipeline follows the list):
1) Color Transformations. To provide color variations and simulate different lighting conditions, we used augmentations such as ChannelShuffle, ChannelDropout, and RandomBrightnessContrast.
2) Affine Transformations. We used augmentations such as Rotate and Flip to provide geometric variations and enhance the model's ability to generalize across different orientations and perspectives.
3) Quality Degradations. To simulate various image quality issues that might be encountered in real-world scenarios, we used augmentations such as ImageCompression and a combination of blurring techniques such as Blur with a blur limit of 3 to 7, MotionBlur with a blur limit of 7 to 21, and GaussNoise for variability in noise levels.
4) Cropping and Padding. To alter the spatial composition of the images, we used CropAndPad with a percentage range of -10% to +23%, which randomly crops and pads the images, ensuring the model can handle partial occlusions and varying framing conditions.
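These transform names match the Albumentations library, so the four groups can be composed as in the sketch below; the application probabilities and compression-quality bounds are illustrative assumptions, while the blur limits and crop-and-pad range follow the text above.

```python
import albumentations as A

# Probabilities and compression-quality bounds are illustrative; the blur
# limits and the crop-and-pad range are the ones stated in the list above.
train_augs = A.Compose([
    # 1) Color transformations
    A.ChannelShuffle(p=0.2),
    A.ChannelDropout(p=0.2),
    A.RandomBrightnessContrast(p=0.5),
    # 2) Affine transformations
    A.Rotate(limit=30, p=0.5),
    A.Flip(p=0.5),
    # 3) Quality degradations
    A.ImageCompression(quality_lower=40, quality_upper=90, p=0.3),
    A.OneOf([
        A.Blur(blur_limit=(3, 7)),
        A.MotionBlur(blur_limit=(7, 21)),
        A.GaussNoise(),
    ], p=0.3),
    # 4) Cropping and padding
    A.CropAndPad(percent=(-0.10, 0.23), p=0.3),
])

# Usage: augmented = train_augs(image=image)["image"]  # image: HxWxC uint8 array
```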
The steps of the training Algorithm are as follows:
1) Data Preparation. Split images into patches and create patch embeddings with positional encodings (a minimal sketch of this step follows).
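As a concrete illustration of this data-preparation step (and of the pipeline in the caption of Fig. 4), the following is a minimal patch-embedding sketch following the standard ViT recipe [45]; the image size, patch size, and embedding dimension are illustrative assumptions.

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Split the face image into patches and create patch embeddings
    with positional encodings (see Fig. 4). Dimensions are illustrative."""

    def __init__(self, img_size=224, patch=16, dim=384):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A strided convolution is equivalent to slicing non-overlapping
        # patches and projecting each one linearly.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                                 # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

emb = PatchEmbedding()
print(emb(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 384])
```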
FIGURE 7. The training pipeline involves collecting a comprehensive dataset for face anti-spoofing, processing and augmenting the images to enhance
quality and variability, feeding the preprocessed data into the DINOv2 (Vision Transformer) model with a binary classification layer, and evaluating the
model’s performance.
TABLE 3. Comparison of EfficientNet and ViT (DINO) Models (all datasets combined).
Evaluation standards such as ISO/IEC 30107-3 [52] focus on APCER and BPCER, as these metrics more accurately reflect the system's performance in real-world security scenarios.

Fig. 10 illustrates the trends for APCER, BPCER, ACER, and accuracy over 50 training epochs for all models. The plot demonstrates a significant decrease in APCER for all models, with the ViT (DINO) model consistently maintaining a lower APCER throughout the training process. The BPCER plot highlights the reduction in BPCER, where the ViT (DINO) model shows superior performance by achieving a lower BPCER than the other models. The ACER plot, which reflects the overall classification error rate, highlights the ViT (DINO) model's ability to balance APCER and BPCER. The accuracy plot illustrates the higher overall accuracy of the ViT (DINO) model, indicating better general performance in distinguishing genuine and spoofed faces.

Fig. 9 presents the confusion matrices for all models. The ViT (DINO) model demonstrates superior classification performance with the lowest APCER and BPCER values, resulting in fewer false positives and false negatives. The confusion matrix for ViT (DINO) highlights its ability to accurately distinguish between genuine and spoofed faces, leading to high accuracy. MobileViT also shows strong performance with low error rates, while both EfficientNet b2 models, though achieving high accuracy, exhibit higher APCER and BPCER, reflecting a relatively higher rate of misclassification when compared to MobileViT and ViT (DINO).
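For reference, the sketch below shows how APCER, BPCER, and ACER can be computed from binary decisions, using the common single-attack-type simplification of the ISO/IEC 30107-3 [52] definitions (the standard defines APCER per attack instrument species); the label encoding is an assumption.

```python
def pad_metrics(y_true, y_pred):
    """APCER, BPCER, and ACER in the sense of ISO/IEC 30107-3 [52].

    y_true / y_pred: iterables of labels, 1 = attack (spoof), 0 = bona fide.
    APCER: fraction of attack presentations misclassified as bona fide.
    BPCER: fraction of bona fide presentations misclassified as attacks.
    ACER:  the mean of the two error rates.
    """
    attacks = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    bona_fide = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    apcer = sum(1 for _, p in attacks if p == 0) / len(attacks)
    bpcer = sum(1 for _, p in bona_fide if p == 1) / len(bona_fide)
    return apcer, bpcer, (apcer + bpcer) / 2

# Toy example: two of four attacks slip through, no bona fide rejections.
print(pad_metrics([1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 0, 0]))  # (0.5, 0.0, 0.25)
```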
V. DISCUSSION

A. WHY IS APCER SIGNIFICANTLY DECREASED?
As our experimental observations demonstrated, APCER significantly decreased after we trained the ViT model, with even greater improvements when fine-tuned using the DINO framework. The decrease in APCER reflects the model's ability to more accurately distinguish between real and spoofed faces, reducing the risk of security breaches in face recognition systems. This improvement is critical because APCER directly measures the model's effectiveness in identifying spoof attacks, a key concern in biometric security applications.

The superior performance of ViT-based models can be attributed to their ability to capture global patterns and dependencies across the entire image, rather than focusing only on localized features, as is common with traditional CNN models. ViTs are particularly well-suited for face anti-spoofing tasks because they can detect subtle inconsistencies, such as unnatural lighting or distortions in spoofed faces. Moreover, the DINO framework's self-supervised pre-training further enhances the model's capability to learn discriminative features from large amounts of unlabeled data. By using this data, the DINO framework enables the ViT model to generalize better to diverse spoofing techniques that may not be present in traditional training datasets. This results in a model that is more robust against novel and complex spoofing attacks.

The attention visualizations for spoof and live class images, as shown in Fig. 11, reveal how the Vision Transformer (ViT) model, fine-tuned with DINO, selectively focuses on different regions of the images when making classifications. In the case of spoof class images (Fig. 11b), the attention maps demonstrate that the model concentrates on areas that often exhibit unnatural artifacts or inconsistencies, such as reflections, edges, or distortions typically found in spoofing attacks. In contrast, for the live class images (Fig. 11a), the attention maps show a more evenly distributed focus on natural, coherent facial features, such as skin texture, smoothness, and uniform lighting patterns. This distinction between how the model handles real and spoofed images illustrates the model's effectiveness in focusing on relevant features for classification.
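As an illustration of how such maps can be obtained, the sketch below extracts the CLS-token attention of the last encoder layer via the Hugging Face transformers API. The checkpoint name is an assumption for illustration; any ViT checkpoint that returns attentions works the same way, and dedicated methods such as attention rollout typically produce cleaner maps.

```python
import torch
from transformers import ViTModel

# Checkpoint name is an assumption for illustration purposes only.
model = ViTModel.from_pretrained("facebook/dino-vits16", add_pooling_layer=False)
model.eval()

pixel_values = torch.randn(1, 3, 224, 224)  # a preprocessed face crop
with torch.no_grad():
    out = model(pixel_values, output_attentions=True)

# Last-layer attention: (batch, heads, tokens, tokens); token 0 is the CLS token.
attn = out.attentions[-1]
cls_to_patches = attn[0, :, 0, 1:].mean(dim=0)  # average heads, drop CLS->CLS
side = int(cls_to_patches.numel() ** 0.5)       # 14 for 224 px / 16 px patches
heatmap = cls_to_patches.reshape(side, side)    # upsample and overlay on the face
```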
In contrast, the EfficientNet B2 model, although optimized for efficiency and performance, relies on local feature extraction through convolutional layers. This localized focus may limit its ability to generalize to novel and sophisticated spoofing attacks that require a detailed understanding of the face's overall structure. Additionally, the traditional supervised learning approach used for training EfficientNet B2 may not fully exploit the potential of the available data, leading to suboptimal generalization. This limitation led us to experiment with training EfficientNet B2 using the Noisy Student method, a semi-supervised approach that uses both labeled and unlabeled data. This approach improved performance metrics, including APCER, but the results were still not as good as those of the self-supervised ViT model fine-tuned with DINO.

The findings of this study suggest that adopting transformer-based architectures, such as ViT, fine-tuned with self-supervised learning frameworks like DINO, or even CNN-based models enhanced with semi-supervised learning frameworks like Noisy Student, can significantly improve face anti-spoofing systems. These advancements have practical implications for improving the security and reliability of biometric authentication systems, which are increasingly used in areas such as unlocking personal devices and controlling access in secure environments.

FIGURE 9. Confusion matrices for four models: EfficientNet b2, EfficientNet b2 (Noisy Student), MobileViT, and ViT (DINO).

FIGURE 10. Trends of APCER, BPCER, ACER, and accuracy over 50 training epochs for EfficientNet b2 and ViT (DINO) models, demonstrating the superior performance of the ViT (DINO) model in face anti-spoofing tasks.

FIGURE 11. Attention heatmaps and original spoof images for different datasets. The top row shows the attention heatmaps, highlighting the regions where the model focuses its attention during classification. The bottom row displays the original live or spoof images from datasets such as CelebA-Spoof, CASIA-SURF, and a proprietary dataset.

B. COMPARISON WITH RECENT STUDIES
Let us review how the current study's results compare to previous studies. Many studies have explored using vision transformers in face anti-spoofing, with promising results. These works demonstrate the effectiveness of such models in detecting anomalies and achieving robust performance across different domains [11], [13], [28], [53]. Studies [27] and [54] further enhance the capabilities of vision transformers by incorporating relation-aware mechanisms and adaptive-avg-pooling-based attention. Next, [29] and [55] extend the application of vision transformers to zero-shot anti-spoofing and data augmentation, respectively, achieving state-of-the-art performance.
Lastly, [56] reports significant improvements in accuracy and reduced equal error rates using transformer-based models. These studies collectively highlight the potential of vision transformers in enhancing the security of face recognition systems. Our findings support these prior research works.

As can be seen, the existing studies mainly focus on supervised or semi-supervised methods, leaving room for improvement in terms of generalization. In contrast, our approach utilizes the DINO framework, a self-supervised method, allowing our model to learn from large-scale unlabeled data. This significantly enhances the model's ability to generalize across diverse spoofing techniques and presents an advantage over traditional CNNs and even supervised ViT models, offering a more flexible and powerful approach to face anti-spoofing. Although similar research has previously been carried out, the literature has paid little attention to fine-tuning the ViT architecture with DINO.

VI. LIMITATIONS AND FUTURE WORKS
The study has certain limitations that need to be addressed in future work. Firstly, the reliance on a specific set of datasets may limit the generalizability of the results to other types of spoofing attacks or different demographic groups. Secondly, while the DINO framework provides significant improvements, it also introduces additional computational complexity that may be challenging to accommodate in real-time applications. Finally, the current study does not consider the potential impact of environmental variations, such as lighting conditions and camera quality, on the model's performance. Addressing these limitations in future research will be crucial for developing more universally applicable and efficient face anti-spoofing systems.

Future research should consider using extra data types, such as depth and infrared, to make face anti-spoofing models even more robust. Investigating the application of other self-supervised learning techniques and transformer architectures could also provide further enhancements. In addition, in future research, we aim to explore the integration of fuzzy logic with ViT, a recent trend [57]. Fuzzy logic is a powerful tool for handling imprecision and uncertainty [58], which could enhance the robustness and adaptability of face anti-spoofing models, particularly in scenarios with ambiguous or uncertain data. Finally, real-world testing and deployment of these models in diverse environments would be valuable in assessing their practical effectiveness and identifying areas for improvement.

VII. CONCLUSION
In this study, we presented a novel application of the DINO framework within Vision Transformers for face anti-spoofing. This approach addresses the limited exploration of DINO's self-supervised learning capabilities in this context. Several benchmark datasets were used to assess the effectiveness of the model.

Our comparative experiments demonstrated that the ViT (DINO) model consistently outperformed other state-of-the-art models, including EfficientNet B2, EfficientNet B2 with Noisy Student, and MobileViT, across all key metrics, particularly APCER, indicating its superior ability to distinguish between genuine and spoofed faces. This improvement is crucial as it addresses the growing threat of spoofing attacks in various applications, from personal device security to access control in high-security environments. The findings underscore the importance of adopting cutting-edge AI technologies to safeguard biometric systems against increasingly sophisticated spoofing techniques.

In general, the findings suggest that incorporating DINO into ViTs enhances their robustness against spoofing attacks and their performance in biometric security applications, offering valuable insights into the potential of self-supervised learning in biometric security. This contributes to a broader understanding of how self-supervised learning techniques can be effectively applied in this domain.

REFERENCES
[1] E. Vazquez-Fernandez and D. Gonzalez-Jimenez, "Face recognition for authentication on mobile devices," Image Vis. Comput., vol. 55, pp. 31–33, Nov. 2016, doi: 10.1016/j.imavis.2016.03.018.
[2] R. V. Petrescu, "Face recognition as a biometric application," SSRN Electron. J., vol. 3, pp. 237–257, Apr. 2019, doi: 10.2139/ssrn.3417325.
[3] M. P. Nagesh, "Face recognition systems," Int. J. Res. Appl. Sci. Eng. Technol., vol. 11, no. 3, pp. 962–964, Mar. 2023, doi: 10.22214/ijraset.2023.49567.
[4] T. I. Dhamecha, S. Ghosh, M. Vatsa, and R. Singh, "Kernelized heterogeneity-aware cross-view face recognition," Frontiers Artif. Intell., vol. 4, Jul. 2021, Art. no. 670538, doi: 10.3389/frai.2021.670538.
[5] D. A. Chowdhry, A. Hussain, M. Z. Ur Rehman, F. Ahmad, A. Ahmad, and M. Pervaiz, "Smart security system for sensitive area using face recognition," in Proc. IEEE Conf. Sustain. Utilization Develop. Eng. Technol. (CSUDET), May 2013, pp. 11–14, doi: 10.1109/CSUDET.2013.6670976.
[6] A. AbdElaziz, "A survey of smartphone-based face recognition systems for security purposes," Kafrelsheikh J. Inf. Sci., vol. 2, no. 1, pp. 1–7, Aug. 2021, doi: 10.21608/kjis.2021.5484.1006.
[7] N. Erdogmus and S. Marcel, "Spoofing face recognition with 3D masks," IEEE Trans. Inf. Forensics Security, vol. 9, no. 7, pp. 1084–1097, Jul. 2014.
[8] B. Hamdan and K. Mokhtar, "The detection of spoofing by 3D mask in a 2D identity recognition system," Egyptian Informat. J., vol. 19, no. 2, pp. 75–82, Jul. 2018.
[9] L. Omar and I. Ivrissimtzis, "Evaluating the resilience of face recognition systems against malicious attacks," in Proc. 7th U.K. Brit. Mach. Vis. Workshop, 2015, pp. 5.1–5.9.
[10] L. Omar and I. Ivrissimtzis, "Designing a facial spoofing database for processed image attacks," in Proc. 7th Int. Conf. Imag. Crime Detection Prevention (ICDP), 2016, pp. 1–6.
[11] L. Abduh, L. Omar, and I. Ivrissimtzis, "Anomaly detection with transformer in face anti-spoofing," J. WSCG, vol. 31, nos. 1–2, pp. 91–98, Jul. 2023.
[12] A. Liu, Z. Tan, Z. Yu, C. Zhao, J. Wan, Y. Liang, Z. Lei, D. Zhang, S. Z. Li, and G. Guo, "FM-ViT: Flexible modal vision transformers for face anti-spoofing," IEEE Trans. Inf. Forensics Security, vol. 18, pp. 4775–4786, 2023, doi: 10.1109/TIFS.2023.3296330.
[13] C.-H. Liao, W.-C. Chen, H.-T. Liu, Y.-R. Yeh, M.-C. Hu, and C.-S. Chen, "Domain invariant vision transformer learning for face anti-spoofing," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023, pp. 6087–6096.
[14] Y. Lee, Y. Kwak, and J. Shin, "Robust face anti-spoofing framework with convolutional vision transformer," in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2023, pp. 1015–1019.
[15] Z. Boulkenafet, J. Komulainen, and A. Hadid, "Face antispoofing using speeded-up robust features and Fisher vector encoding," IEEE Signal Process. Lett., vol. 24, no. 2, pp. 141–145, Feb. 2017.
[16] K. Patel, H. Han, and A. K. Jain, "Secure face unlock: Spoof detection on smartphones," IEEE Trans. Inf. Forensics Security, vol. 11, no. 10, pp. 2268–2283, Oct. 2016.
[17] A. Agarwal, R. Singh, and M. Vatsa, "Face anti-spoofing using Haralick features," in Proc. IEEE 8th Int. Conf. Biometrics Theory, Appl. Syst. (BTAS), Sep. 2016, pp. 1–6.
[18] E. Fourati, W. Elloumi, and A. Chetouani, "Face anti-spoofing with image quality assessment," in Proc. 2nd Int. Conf. Bio-eng. Smart Technol. (BioSMART), Aug. 2017, pp. 1–4.
[19] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu, "Face anti-spoofing using patch and depth-based CNNs," in Proc. IEEE Int. Joint Conf. Biometrics (IJCB), Oct. 2017, pp. 319–328.
[20] T. A. Siddiqui, S. Bharadwaj, T. I. Dhamecha, A. Agarwal, M. Vatsa, R. Singh, and N. Ratha, "Face anti-spoofing with multifeature videolet aggregation," in Proc. 23rd Int. Conf. Pattern Recognit. (ICPR), 2016, pp. 1035–1040.
[21] J. Galbally and S. Marcel, "Face anti-spoofing based on general image quality assessment," in Proc. 22nd Int. Conf. Pattern Recognit., Aug. 2014, pp. 1173–1178.
[22] Z. Boulkenafet, J. Komulainen, and A. Hadid, "Face anti-spoofing based on color texture analysis," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2015, pp. 2636–2640.
[23] X. Li, J. Komulainen, G. Zhao, P.-C. Yuen, and M. Pietikäinen, "Generalized face anti-spoofing by detecting pulse from face videos," in Proc. 23rd Int. Conf. Pattern Recognit. (ICPR), Dec. 2016, pp. 4244–4249, doi: 10.1109/ICPR.2016.7900300.
[24] A. Aff, M. Awedh, and M. H. A. Alghamdi, "RFID and face recognition based security and access control system," Int. J. Innov. Res. Sci., Eng. Technol., vol. 2, no. 11, pp. 5955–5964, Jan. 2013. [Online]. Available: https://api.semanticscholar.org/CorpusID:13542387
[25] S. Garg, S. Mittal, P. Kumar, and V. Anant Athavale, "DeBNet: Multilayer deep network for liveness detection in face recognition system," in Proc. 7th Int. Conf. Signal Process. Integr. Netw. (SPIN), Feb. 2020, pp. 1136–1141.
[26] S. Jafri, S. Chawan, and A. Khan, "Face recognition using deep neural network with 'LivenessNet'," in Proc. Int. Conf. Inventive Comput. Technol. (ICICT), 2020, pp. 145–148.
[27] Z. Wang, Q. Wang, W. Deng, and G. Guo, "Face anti-spoofing using transformers with relation-aware mechanism," IEEE Trans. Biometrics, Behav., Identity Sci., vol. 4, no. 3, pp. 439–450, Jul. 2022.
[28] H.-P. Huang, D. Sun, Y. Liu, W.-S. Chu, T. Xiao, J. Yuan, H. Adam, and M.-H. Yang, "Adaptive transformers for robust few-shot cross-domain face anti-spoofing," in Proc. Eur. Conf. Comput. Vis., Jan. 2022, pp. 37–54.
[29] A. George and S. Marcel, "On the effectiveness of vision transformers for zero-shot face anti-spoofing," in Proc. IEEE Int. Joint Conf. Biometrics (IJCB), Aug. 2021, pp. 1–8.
[30] Y. Liu, Y. Chen, M. Gou, C.-T. Huang, Y. Wang, W. Dai, and H. Xiong, "Towards unsupervised domain generalization for face anti-spoofing," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 20597–20607.
[31] B. A. Kumar and M. Bansal, "Face mask detection on photo and real-time video images using caffe-MobileNetV2 transfer learning," Appl. Sci., vol. 13, no. 2, p. 935, Jan. 2023, doi: 10.3390/app13020935.
[32] B. A. Kumar and N. K. Misra, "Masked face age and gender identification using caffe-modified MobileNetV2 on photo and real-time video images by transfer learning and deep learning techniques," Expert Syst. Appl., vol. 246, Jul. 2024, Art. no. 123179. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417424000447
[33] F. Li, H. Zhang, H. Xu, S. Liu, L. Zhang, L. M. Ni, and H.-Y. Shum, "Mask DINO: Towards a unified transformer-based framework for object detection and segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 3041–3050.
[34] C. Lu, H. Zhu, and P. Koniusz, "From saliency to DINO: Saliency-guided vision transformer for few-shot keypoint detection," 2023, arXiv:2304.03140.
[35] M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, and A. Joulin, "Emerging properties in self-supervised vision transformers," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9630–9640.
[36] S. Yun, H. Lee, J. Kim, and J. Shin, "Patch-level representation learning for self-supervised vision transformers," 2022, arXiv:2206.07990.
[37] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," 2021, arXiv:2111.06377.
[38] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. So Kweon, and S. Xie, "ConvNeXt v2: Co-designing and scaling ConvNets with masked autoencoders," 2023, arXiv:2301.00808.
[39] N. Inkawhich, G. McDonald, and R. Luley, "Adversarial attacks on foundational vision models," 2023, arXiv:2308.14597.
[40] J. Rando, N. Naimi, T. Baumann, and M. Mathys, "Exploring adversarial attacks and defenses in vision transformers trained with DINO," 2022, arXiv:2206.06761.
[41] Y. Zhang, Z. Yin, Y. Li, G. Yin, J. Yan, J. Shao, and Z. Liu, "CelebA-Spoof: Large-scale face anti-spoofing dataset with rich annotations," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 70–85.
[42] S. Zhang, A. Liu, J. Wan, Y. Liang, G. Guo, S. Escalera, H. J. Escalante, and S. Z. Li, "CASIA-SURF: A large-scale multi-modal benchmark for face anti-spoofing," IEEE Trans. Biometrics, Behav., Identity Sci., vol. 2, no. 2, pp. 182–193, Apr. 2020.
[43] S. Zhang, X. Wang, A. Liu, C. Zhao, J. Wan, S. Escalera, H. Shi, Z. Wang, and S. Z. Li, "A dataset and benchmark for large-scale multi-modal face anti-spoofing," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 919–928.
[44] N. Ilinykh and S. Dobnik, "What does a language-and-vision transformer see: The impact of semantic information on visual representations," Frontiers Artif. Intell., vol. 4, Dec. 2021, Art. no. 767971, doi: 10.3389/frai.2021.767971.
[45] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[46] S. Mehta and M. Rastegari, "Separable self-attention for mobile vision transformers," 2022, arXiv:2206.02680.
[47] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. 36th Int. Conf. Mach. Learn., vol. 97, 2019, pp. 6105–6114.
[48] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, "Self-training with noisy student improves ImageNet classification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 10687–10698.
[49] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[50] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," 2017, arXiv:1708.02002.
[51] L. N. Smith and N. Topin, "Super-convergence: Very fast training of neural networks using large learning rates," 2017, arXiv:1708.07120.
[52] Information Technology—Biometric Presentation Attack Detection—Part 3: Testing and Reporting, Standard ISO/IEC 30107-3:2023, Int. Org. for Standardization, 2023. [Online]. Available: https://www.iso.org/standard/79520.html
[53] M. Marais, D. Brown, J. Connan, and A. Boby, "Facial liveness and anti-spoofing detection using vision transformers," in Proc. Southern Afr. Telecommun. Netw. Appl. Conf. (SATNAC), Aug. 2023, pp. 1–6.
[54] J. Yang, F. Chen, R. K. Das, Z. Zhu, and S. Zhang, "Adaptive-avg-pooling based attention vision transformer for face anti-spoofing," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2024, pp. 3875–3879, doi: 10.1109/ICASSP48485.2024.10446940.
[55] J. Orfao and D. van der Haar, "Keyframe and GAN-based data augmentation for face anti-spoofing," in Proc. 12th Int. Conf. Pattern Recognit. Appl. Methods, 2023, pp. 629–640, doi: 10.5220/0011648400003411.
[56] K. Watanabe, K. Ito, and T. Aoki, "Spoofing attack detection in face recognition system using vision transformer with patch-wise data augmentation," in Proc. Asia–Pacific Signal Inf. Process. Assoc. Annu. Summit Conf. (APSIPA ASC), Nov. 2022, pp. 1561–1565, doi: 10.23919/APSIPAASC55919.2022.9979996.
[57] Q. Fan, Q. You, X. Han, Y. Liu, Y. Tao, H. Huang, R. He, and H. Yang, "ViTAR: Vision transformer with any resolution," 2024, arXiv:2403.18361.
[58] P. Kozlov, A. Akram, and P. Shamoi, "Fuzzy approach for audio-video emotion recognition in computer games for children," Proc. Comput. Sci., vol. 231, pp. 771–778, Jan. 2024, doi: 10.1016/j.procs.2023.12.139.

ARMAN KERESH received the B.S. degree in information systems from the Al-Farabi Kazakh National University, Almaty, Kazakhstan, in 2023. He is currently pursuing the M.S. degree in data science with Kazakh-British Technical University. He is also a computer vision engineer in a leading telecommunication company in Kazakhstan. His research interests include artificial intelligence and machine learning, image processing, liveness detection, image generation, and self-supervised learning.

PAKIZAR SHAMOI (Member, IEEE) received the B.S. and M.S. degrees in information systems from Kazakh-British Technical University, Almaty, Kazakhstan, in 2011 and 2013, respectively, and the Ph.D. degree in engineering from Mie University, Tsu, Japan, in 2019. In her academic journey, she has held various teaching and research positions at Kazakh-British Technical University, where she has been a Professor with the School of Information Technology and Engineering, since August 2020. She is the author of one book, one monograph, and more than 33 scientific publications. Her research interests include artificial intelligence and machine learning in general, with a focus on fuzzy sets and logic, soft computing, representing and processing colors in computer systems, natural language processing, computational aesthetics, and human-friendly computing and systems. She received awards for the best paper at conferences five times. She took part in the organization and worked in the organization committee (as the Head of the Session and responsible for special sessions) of several international conferences, such as IFSA-SCIS 2017, Otsu, Japan; SCIS-ISIS 2022, Mie, Japan; and EUSPN 2023, Almaty. She served as a Reviewer for several international conferences, including IEEE: SIST 2023, SMC 2022, SCIS-ISIS 2022, SMC 2020, ICIEV-IVPR 2019, and ICIEV-IVPR 2018.