

Facial Liveness and Anti-Spoofing Detection using
Vision Transformers
Marc Marais ∗ , Dane Brown† , James Connan‡ , Alden Boby§
∗ Department of Computer Science, Rhodes University, Grahamstown, South Africa
[email protected], † [email protected], ‡ [email protected],
§ [email protected]

Abstract—In recent years, the proliferation of facial recognition systems and their integration into various domains has revolutionised how we interact with technology. However, this rapid adoption has also brought new security challenges, particularly facial spoofing attacks. Facial spoofing refers to the malicious attempt to deceive a facial recognition system by presenting counterfeit or manipulated facial biometric data. This paper proposes using Video Vision Transformers (ViViT) to tackle facial liveness and anti-spoofing identification on the Rose-Youtu facial liveness dataset. Video data loading of the video clips is also integrated and analysed as a video frame sampling method for facial liveness and anti-spoofing detection. The multiclass identification ViViT model yielded an accuracy of 86.78% on the isolated set of test subjects. An Equal Error Rate of 2.46% was achieved on the real facial videos when contrasted against the spoof attack videos.

Index Terms—Facial Liveness Detection, Facial Anti-Spoofing, Video Classification, Vision Transformers

I. INTRODUCTION

With the rapid advancement of facial recognition systems in various domains, ensuring the security and reliability of these systems has become a critical concern. Facial recognition is a rapidly growing market, with face-based biometric recognition expected to be worth USD 8.5 billion by 2025 [1]. Traditional facial recognition methods often struggle to differentiate between real faces and spoofing attempts using printed photographs, replayed videos, or 3D masks. A spoofing attack is a malicious act of impersonating or falsifying data to deceive individuals or systems into believing the attacker is someone they are not. This vulnerability allows malicious actors to bypass security measures and gain unauthorised access.

Researchers have focused on facial liveness and anti-spoofing detection techniques to address this issue. Facial liveness detection determines whether a face presented to a system is live or a spoof [2]. Anti-spoofing, on the other hand, involves detecting and preventing various forms of spoofing attacks, such as presentation attacks, texture attacks, and 3D mask attacks [3].

Facial anti-spoofing attacks vary between the following [4]:

• Presentation Attacks: Artificial representations such as printed photos or 3D masks deceive the system by mimicking a genuine face.
• Texture Attacks: High-frequency textures or patterns are used to exploit the system's reliance on specific visual cues.
• 3D Mask Attacks: Realistic three-dimensional masks or sculptures mimic facial features to bypass depth and structure detection.
• Video Replay Attacks: Recorded videos replayed in front of the system deceive it into recognising the replayed video as a live face.

Deep learning techniques, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have emerged as powerful tools for facial liveness and anti-spoofing detection. These techniques leverage large-scale datasets, complex network architectures, and advanced training algorithms to extract discriminative features and model intricate patterns in facial images or videos.

This paper notably makes the following contributions:

• The utilisation of video vision transformers for facial liveness detection.
• Video vision transformers for facial anti-spoofing classification.
• Video data loading for uniform sampling of frames for facial liveness and anti-spoofing detection.

Therefore, this paper investigates the application of video vision transformers to facial liveness detection and facial anti-spoofing on the ROSE-Youtu Face Liveness Detection Dataset [5]. For this study, no facial detection and cropping systems are utilised; the entire image, including the background, is fed into the model, on the assumption that vital information pointing toward spoofed attacks may be visible in the background. Augmentation is applied to video frames during training to improve the proposed system's robustness and generalisation ability. The findings are subsequently compared to existing state-of-the-art systems for facial liveness and anti-spoofing detection.

The rest of this paper is structured as follows: Section I-A analyses related work, Section II elaborates on the architecture of the vision transformer implemented in this study, Section III explains the methodology and experimental setup, the results and discussion are presented in Section IV, and Section V concludes the paper and presents future work.

This work was undertaken in the Distributed Multimedia CoE at Rhodes University.

A. Related Studies

Facial liveness detection and anti-spoofing techniques have been extensively studied to mitigate the vulnerabilities of facial recognition systems against spoofing attacks. This section provides an overview of the existing literature on facial liveness detection and anti-spoofing, focusing on several influential papers in the field. Micro-texture analysis using multi-scale local binary patterns (LBP)¹ was the starting point towards countermeasures against spoofing attacks, on the premise that real faces present different texture patterns when compared to fake faces [7, 8].

LBP and LBP-TOP showed sufficient promise but remained limited to hand-crafted features; thus, introducing CNNs to learn discriminative features was the next significant landmark in facial anti-spoofing systems. CNNs combined with facial localisation and spatial and temporal augmentation outperformed traditional LBP approaches, with Half Total Error Rates (HTER) of 6.25% and 2.68% on the CASIA and Idiap Replay-Attack datasets [9, 10], respectively. Image distortion analysis has also been proposed for face anti-spoofing detection, using specular reflection, blurriness, chromatic moment and colour diversity features for an ensemble of SVM classifiers, each trained on different spoof attacks [11]. That system yielded HTERs of 13.3% and 7.41% on the CASIA and Replay-Attack datasets, respectively. A two-stage motion and deep learning-based approach was proposed on the Rose-Youtu dataset [12], whereby the motion stage detects blinking to ensure liveness against photo attacks and the second-stage DenseNet counters mask and video attacks. The system achieved an Equal Error Rate (EER) of 4.56% on the Rose-Youtu dataset. Two attention-based end-to-end models using EfficientNet B0 and MobileNet V2 backbones yielded validation F1-scores of 99.37% and 97.81%, respectively, on the Rose-Youtu dataset when trained as binary classification models. A dual-channel neural architecture employed a CNN to extract discriminative patterns and a hand-crafted-features wide network to detect domain-specific features in spoofed attacks [13]. The wide network extracts LBP, CoALBP and LBQ colour texture features, and the two networks are then aggregated in a low-dimensional latent space for final classification. An HTER and EER of 6.12% and 4.27%, respectively, were achieved on the Rose-Youtu dataset.

A Liveness Detection network (LDnet) has been applied to the Rose-Youtu dataset using a Histogram of Oriented Gradients (HOG) face detector, which detects and crops out the face region [14]. The image is subsequently fed into a network consisting of a combination of 2D CNN and 3D CNN layers for final classification. The network treats the dataset as a binary classification problem, identifying genuine versus fake liveness videos, and does not classify the different types of attacks. The network yielded a 99.79% accuracy and an HTER of 0.08% on the Rose-Youtu dataset. 3D CNNs have also been explored on the Rose-Youtu dataset, achieving an EER of 7.00% [15]. Cai et al. [16] proposed a two-branch framework incorporating a CNN, an RNN and deep reinforcement learning to exploit and extract global and local features. A ResNet-18 backbone is utilised to extract global features, and a gated recurrent unit is applied to exploit local features using a specified image patch size across the temporal domain. The model yielded an EER of 1.79% on the Rose-Youtu dataset.

¹ LBP is a texture descriptor in computer vision that represents the relationship between a central pixel and its neighbouring pixels by encoding their intensity variations as binary patterns [6].

II. ARCHITECTURES

Deep learning, a division of Artificial Intelligence, refers to a form of machine learning where computational models include numerous layers to obtain data representations [17]. In contrast to conventional machine learning approaches, deep learning obviates the requirement of extracting features from the data before training the model.

Transformers have gained immense popularity in natural language processing tasks, including language translation, text generation, and sentiment analysis [18]. Unlike traditional neural network architectures such as RNNs or CNNs, transformers utilise a self-attention mechanism for processing input data. This mechanism enables the model to focus on different parts of the input sequence, learning which parts are crucial for the task at hand.

The Video Vision Transformer (ViViT) is a deep learning model designed for video recognition based on the Vision Transformer (ViT) architecture [19, 20]. ViViT comprises three primary components: a spatial encoder, a temporal encoder, and a classification head, as illustrated in Figure 1. The spatial encoder independently processes each input video frame, employing a ViT-based transformer network to encode it into a fixed-length feature vector. This encoding involves 2D position embeddings and a multi-head self-attention mechanism to capture spatial relationships among different regions within the frame.

The output of the spatial encoder is a tensor with dimensions [T, H, W, C], where T represents the number of frames, H and W denote the height and width of the frame, and C signifies the number of channels in the feature maps. The temporal encoder then takes the feature vectors from the spatial encoder and models their temporal relationships using a 1D temporal convolutional network. Dilated convolutions are employed to enable the modelling of long-range temporal dependencies. Consequently, the output of the temporal encoder is a tensor with dimensions [1, D], where D corresponds to the size of the feature vector.

Tubelet embedding is an efficient technique that encodes the spatiotemporal information of video segments known as tubelets. Tokens are extracted from non-overlapping tubelets of dimensions t × h × w, with n_t = T/t, n_h = H/h and n_w = W/w, as illustrated in Figure 2. The tubelet embedding module comprises a spatial encoder and a temporal encoder. The spatial encoder employs a modified version of the ViT-based transformer network to encode the appearance features of each frame within the tubelet, while the temporal encoder aggregates the appearance features over time and captures motion information using a 1D dilated convolutional neural network. The output of the tubelet embedding module is a fixed-length feature vector that represents the spatiotemporal information of the tubelet. This output is then fed into the temporal encoder of the ViViT model, which further aggregates information across multiple tubelets to generate a final prediction.
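At its core, tubelet embedding can be pictured as a single 3D convolution whose kernel and stride both equal the tubelet dimensions, so that each output token summarises one non-overlapping spatio-temporal tube. The following is a minimal PyTorch sketch of that idea, not the authors' implementation: the 32 × 32 spatial patch and 8-frame tube mirror the parameters listed later in Table II, while the embedding dimension of 512 is an arbitrary illustrative choice.

    import torch
    import torch.nn as nn

    class TubeletEmbedding(nn.Module):
        """Sketch of ViViT tubelet embedding: a 3D convolution whose kernel
        and stride equal the tubelet size (t, h, w), so each output token
        summarises one non-overlapping spatio-temporal tube."""

        def __init__(self, in_channels=3, embed_dim=512, tube=(8, 32, 32)):
            super().__init__()
            self.proj = nn.Conv3d(in_channels, embed_dim,
                                  kernel_size=tube, stride=tube)

        def forward(self, video):
            # video: [B, C, T, H, W]
            x = self.proj(video)              # [B, D, T/t, H/h, W/w]
            x = x.flatten(2).transpose(1, 2)  # [B, N_tokens, D], N = nt*nh*nw
            return x

    # Example: an 80-frame clip of 480x640 frames gives
    # (80/8) * (480/32) * (640/32) = 10 * 15 * 20 = 3000 tokens per video.
    clip = torch.randn(1, 3, 80, 480, 640)
    tokens = TubeletEmbedding()(clip)
    print(tokens.shape)  # torch.Size([1, 3000, 512])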

Fig. 1. Video Vision Transformer architecture.

Fig. 2. An overview of the Tubelet Embedding module with dimensions t × h × w.

ViViT achieves state-of-the-art performance on various benchmark video recognition datasets while demonstrating significant computational efficiency compared to existing approaches [21]. This results from architectural optimisations, including depth-wise convolutions and attention masking, and from training strategies such as knowledge distillation. Knowledge distillation transfers knowledge from a larger model to a smaller one, improving efficiency and performance by using soft labels from the teacher model during training to help the student model learn and generalise better [22].

III. EXPERIMENTAL SETUP AND METHODOLOGY

This paper uses vision transformers to evaluate facial liveness and anti-spoofing detection on the ROSE-Youtu Face Liveness Detection Dataset. This section provides an in-depth overview of the proposed system architecture, which focuses on model generalisation to unseen data. All model training and evaluation were performed on an Nvidia RTX 3090 GPU with 24 GB of memory.

A. ROSE-Youtu Face Liveness Detection Dataset

The ROSE-Youtu Face Liveness Detection Dataset is a face anti-spoofing and liveness database covering a range of illumination conditions, camera models and attack types [5, 23]. The publicly available version of the dataset covers 20 subjects, some of whom wore glasses, for a total of 3350 videos. The number of videos per subject ranges between 150 and 200 clips, with an average duration of around 10 seconds. The front-facing cameras of five different mobile devices were used to capture all videos. Three spoofing attack types are incorporated into the dataset: paper attacks, video replay attacks, and masking attacks. Figure 3 visualises the genuine face video alongside the seven varieties of the three attacks, and Table I describes the various types of attack within the dataset.

Fig. 3. Examples of the various anti-spoofing attacks present in the ROSE-Youtu Face Liveness Detection Dataset: (a) G, (b) Ps, (c) Pq, (d) Vl, (e) Vm, (f) Mc, (g) Mf, (h) Mu.

The dataset is divided into three splits: training, validation and testing. Subjects 2-12 are reserved for the training split, and the remaining subjects are split evenly between the validation and test splits, giving a train:validation:test ratio of 50:25:25. The proposed system identifies and classifies each of the attack types present in the Rose-Youtu Facial Liveness dataset as a multiclass classification problem.
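A subject-disjoint split of this kind can be sketched as below. The paper only states that subjects 2-12 are reserved for training and that the remainder is divided evenly between validation and testing, so the half-and-half assignment of the remaining subjects here is an assumption made for illustration.

    def split_by_subject(video_records):
        """video_records: iterable of (subject_id, video_path) pairs.
        Returns subject-disjoint train/validation/test lists (~50:25:25)."""
        train_subjects = set(range(2, 13))                    # subjects 2-12
        remaining = sorted({s for s, _ in video_records} - train_subjects)
        val_subjects = set(remaining[: len(remaining) // 2])  # half of the rest
        test_subjects = set(remaining[len(remaining) // 2:])  # the other half

        train = [r for r in video_records if r[0] in train_subjects]
        val = [r for r in video_records if r[0] in val_subjects]
        test = [r for r in video_records if r[0] in test_subjects]
        return train, val, test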

TABLE I
ANTI-SPOOFING ATTACKS PRESENT IN THE ROSE-YOUTU FACE LIVENESS DETECTION DATASET.

Label  Description
G      Genuine person
Ps     Still printed paper
Pq     Quivering printed paper
Vl     Video recorded on Lenovo display
Vm     Video recorded on Mac display
Mc     Paper mask with two eyes and mouth cropped out
Mf     Paper mask without cropping
Mu     Paper mask with the upper part cut in the middle

TABLE II
VIVIT TUBELET EMBEDDING PARAMETERS.

Patch height = 32
Patch width = 32
Patch tube size = 8
B. Video Dataset Loading and Augmentation

To account for the varying length of the videos, an effective and efficient video frame dataset loader was utilised². A sampling strategy representing the entire video was implemented, as loading the entire sequence of frames is computationally intensive [24]. Each video was split into eight segments, and ten frames were selected at random indices from each of these segments, giving a total sample of 80 frames per video clip.

² https://github.com/RaivoKoot/Video-Dataset-Loading-Pytorch
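The segment-based sampling described above, in the spirit of temporal segment networks [24], can be sketched as follows. This is only an illustration of the idea, not the cited loader's actual API, and it assumes each clip has at least as many frames as segments.

    import numpy as np

    def sample_frame_indices(num_frames, num_segments=8, frames_per_segment=10,
                             rng=None):
        """Split a video of `num_frames` frames into equal segments and draw
        `frames_per_segment` random indices from each, returning a sorted list
        of num_segments * frames_per_segment indices (80 by default)."""
        rng = rng or np.random.default_rng()
        boundaries = np.linspace(0, num_frames, num_segments + 1, dtype=int)
        indices = []
        for start, end in zip(boundaries[:-1], boundaries[1:]):
            # sample with replacement if a segment is shorter than needed
            replace = (end - start) < frames_per_segment
            indices.extend(rng.choice(np.arange(start, end),
                                      size=frames_per_segment, replace=replace))
        return sorted(indices)

    # A ~10-second clip at 30 fps (~300 frames) yields 80 sampled indices.
    print(len(sample_frame_indices(300)))  # 80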
All video frames were resized to 480 × 640. Random augmentation transformations were applied to the training and validation sets to aid the model's generalisation ability on the test set: frames are randomly flipped horizontally, rotated within ±10°, scaled between 80% and 100%, and sheared within ±5° using PyTorch transforms.
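A minimal torchvision sketch of the augmentations listed above (horizontal flip, ±10° rotation, 80-100% scaling, ±5° shear) follows. The exact transform composition used by the authors is not given, so this is only one plausible arrangement.

    import torch
    from torchvision import transforms

    # Applied per frame; in practice the same random draw would normally be
    # reused across all frames of one clip so the augmentation stays
    # temporally consistent (not shown here for brevity).
    train_transforms = transforms.Compose([
        transforms.Resize((480, 640)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomAffine(degrees=10,        # rotate within +/-10 degrees
                                scale=(0.8, 1.0),  # scale between 80% and 100%
                                shear=5),          # shear within +/-5 degrees
        transforms.ToTensor(),
    ])

    frame = transforms.ToPILImage()(torch.rand(3, 720, 1280))  # dummy frame
    augmented = train_transforms(frame)
    print(augmented.shape)  # torch.Size([3, 480, 640])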
C. Model Parameters

The ViViT model implemented with factorised self-attention, explained in Section II, has a number of parameters to tune to achieve the best performance. The chosen parameters were based on manual evaluation and the available computational resources. The model utilised 8 multi-head attention blocks with a depth of 4 across each head. The frame patch size for the tubelet embedding was set in accordance with Table II. A dropout rate of 10% was applied to each encoder block, with a final 2048-dimensional multilayer perceptron output.

D. Model Evaluation Metrics

Machine learning employs a range of metrics to assess the performance of models; accuracy, precision, recall, and F-score are the four most frequently used. Accuracy is the ratio of the number of correct predictions to the total number of input samples, as given by Equation 1 [25]. The field of biometrics and facial recognition also utilises the EER, which represents the point at which the false acceptance rate (FAR) and false rejection rate (FRR) are equal, as given by Equation 2. The HTER metric is also utilised in biometrics, representing the average of the FAR and FRR errors [4], as seen in Equation 3. For the proposed system, accuracy, EER and HTER are used to evaluate and compare model performance.

Accuracy = number of correct classifications / total number of classifications attempted    (1)

EER = FAR = FRR    (2)

HTER = (FAR + FRR) / 2    (3)
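Equations 2 and 3 can be evaluated directly from genuine-versus-attack scores. The sketch below, which is illustrative rather than the authors' evaluation code, estimates the EER as the point where FAR and FRR cross using scikit-learn's ROC utilities, and computes the HTER at an arbitrary decision threshold of 0.5.

    import numpy as np
    from sklearn.metrics import roc_curve

    def eer_and_hter(labels, scores, threshold=0.5):
        """labels: 1 for genuine, 0 for attack; scores: higher = more genuine.
        Returns (EER, HTER at the given decision threshold)."""
        fpr, tpr, _ = roc_curve(labels, scores)
        far = fpr                # false acceptance rate (attacks accepted)
        frr = 1.0 - tpr          # false rejection rate (genuine rejected)
        eer_idx = np.nanargmin(np.abs(far - frr))
        eer = (far[eer_idx] + frr[eer_idx]) / 2.0   # FAR ~= FRR at this point

        accept = scores >= threshold
        far_t = np.mean(accept[labels == 0])        # attacks wrongly accepted
        frr_t = np.mean(~accept[labels == 1])       # genuine wrongly rejected
        hter = (far_t + frr_t) / 2.0                # Equation 3
        return eer, hter

    rng = np.random.default_rng(0)
    genuine = rng.normal(0.8, 0.1, 200)    # toy genuine scores
    attacks = rng.normal(0.3, 0.15, 200)   # toy attack scores
    labels = np.r_[np.ones(200), np.zeros(200)]
    scores = np.r_[genuine, attacks]
    print(eer_and_hter(labels, scores))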
IV. RESULTS AND DISCUSSION

A. Parameter Tuning

Manual evaluation was conducted to select the optimal batch size and learning rate based on best practices. A maximum of 500 epochs with early stopping to prevent overtraining was used to find the best epoch. Early stopping keeps the epoch with the lowest validation loss and stops model training if the validation loss has not decreased for ten epochs. A batch size of 8 was chosen alongside a learning rate of 2e-05, based on manual inspection and computational processing time as informed by the literature.
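The early-stopping rule described above (patience of ten epochs on the validation loss, keeping the best epoch) is a generic pattern; a minimal sketch follows, where train_one_epoch and validate are hypothetical callables supplied by the caller, not part of the authors' training loop.

    def train_with_early_stopping(train_one_epoch, validate, max_epochs=500,
                                  patience=10):
        """Generic early-stopping loop: track the lowest validation loss and
        stop once it has not decreased for `patience` consecutive epochs."""
        best_loss, best_epoch, epochs_without_improvement = float("inf"), 0, 0
        for epoch in range(1, max_epochs + 1):
            train_one_epoch()
            val_loss = validate()
            if val_loss < best_loss:
                best_loss, best_epoch = val_loss, epoch
                epochs_without_improvement = 0
                # a real implementation would checkpoint the model weights here
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break
        return best_epoch, best_loss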
B. Results

The proposed ViViT architecture, explained in Section II with the parameters given in Section III-C, was trained on the Rose-Youtu dataset and yielded an accuracy of 98.34% on the train set and 86.47% on the validation set of reserved subjects. On the isolated test split, the model achieved an accuracy of 86.78%. Analysing the multiclass identification of the spoof attacks, the model yielded an HTER of 13.28% and an EER of 2.46% when comparing the genuine faces against all the spoofed attacks, as depicted in Table III. Figure 4 illustrates a multiclass receiver operating characteristic (ROC) curve, a graphical representation of the performance of a classification model across multiple classes that shows the trade-off between true positive rate and false positive rate. The large area under each curve for the various types of spoof attack shows that the model performs well at identifying the different types of attacks.

TABLE III
EER AND HTER PERCENTAGES FOR EACH CLASS IN THE ROSE-YOUTU DATASET.

Class  HTER     EER
G      13.28%   2.46%
Ps      3.26%   0.52%
Pq      2.00%   0.00%
Vl      2.69%   4.39%
Vm     13.09%   2.19%
Mc      4.85%   2.71%
Mf      9.61%   1.21%
Mu      5.31%   1.55%

Fig. 4. ROC curve of the performance of the proposed model across the different classes using a one-vs-rest approach on the Rose-Youtu dataset.

Table III provides an overview of the HTER and EER results for the different classes in the Rose-Youtu dataset. The class with the lowest HTER and EER, at 2.00% and 0.00% respectively, was the quivering printed paper. The video recorded on the Lenovo display recorded the highest EER, at 4.39%.
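A one-vs-rest ROC analysis such as Figure 4 can be reproduced from per-class prediction scores. The sketch below is illustrative rather than the authors' plotting code: it assumes an (N, 8) array of softmax outputs and integer class labels, with the class names taken from Table I, and uses toy random data in place of real predictions.

    import numpy as np
    from sklearn.metrics import roc_curve, auc
    from sklearn.preprocessing import label_binarize

    def one_vs_rest_roc(y_true, y_score, class_names):
        """y_true: (N,) integer labels; y_score: (N, C) per-class scores.
        Returns {class_name: (fpr, tpr, auc)} for a one-vs-rest ROC plot."""
        y_bin = label_binarize(y_true, classes=range(len(class_names)))
        curves = {}
        for c, name in enumerate(class_names):
            fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
            curves[name] = (fpr, tpr, auc(fpr, tpr))
        return curves

    classes = ["G", "Ps", "Pq", "Vl", "Vm", "Mc", "Mf", "Mu"]
    rng = np.random.default_rng(0)
    scores = rng.dirichlet(np.ones(8), size=100)   # toy softmax outputs
    labels = rng.integers(0, 8, size=100)          # toy ground truth
    for name, (_, _, roc_auc) in one_vs_rest_roc(labels, scores, classes).items():
        print(f"{name}: AUC = {roc_auc:.2f}")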

C. Discussion

Overall, the ViViT-based proposed model performs well on the Rose-Youtu dataset, achieving an accuracy of 86.78%. Based on the confusion matrix visualised in Figure 5, the genuine class is most often confused with the video recorded on the Mac display spoof attack. The Vm and Vl classes are also confused with each other in Figure 5; this is most likely because both are laptop display devices, and distinguishing between the two display types is not the model's objective.

Fig. 5. Confusion matrix comparison of the classes.

Compared to existing studies on the Rose-Youtu dataset, the proposed model achieves near state-of-the-art performance, as detailed in Table IV. It outperforms the two-stage motion approach, dual-channel CNN and 3D CNN architectures, and only fails to match the DRL-FAS and LDnet systems, which report an EER of 1.79% and an HTER of 0.08%, respectively. However, it is significant to note that the LDnet approach utilised limited samples from the dataset and did not include temporal information. The DRL-FAS approach found that implementing a smaller patch size when iterating over each frame yielded better results, with an optimal patch size of 8. The proposed ViViT model could therefore potentially be improved by decreasing the tubelet embedding patch size, although this requires significantly more GPU memory.

TABLE IV
COMPARISON OF VISION TRANSFORMERS WITH STATE-OF-THE-ART METHODS ON THE ROSE-YOUTU DATASET.

Method                       Accuracy  HTER    EER
Motion-based approach [12]   95.44%    —       4.56%
Wide and deep features [13]  —         6.12%   4.27%
3D CNN [15]                  —         —       7.00%
ViViT                        86.78%    13.28%  2.46%
DRL-FAS [16]                 —         —       1.79%
LDnet [14]                   99.79%    0.08%   —

V. CONCLUSION

The proposed ViViT-based model performs exceptionally well on the non-trivial Rose-Youtu dataset, achieving an accuracy of 86.78% on facial liveness detection. The multiclass problem identified the genuine and various live spoof attacks, focusing on successfully identifying each class of attack in the dataset. Analysing the genuine live video clips and comparing these against the spoof attacks, the model achieved an HTER of 13.28% and an EER of 2.46%. The proposed model achieved favourable results compared with existing cutting-edge approaches on the same dataset and experimental conditions. The study and proposed model notably contribute towards utilising custom video data loading and implementing video vision transformers for face liveness and anti-spoofing detection, with a view to optimising generalisation ability in other environmental conditions.

Future expansions of this work would investigate the effect of different patch sizes for the tubelet embedding in ViViT and compare model performance on other benchmark datasets such as the CASIA, Idiap Replay-Attack and MSU mobile face spoofing databases.

REFERENCES


[1] "Facial recognition market by component (software tools (3d facial recognition) and services), application (law enforcement, access control, emotion recognition), vertical (bfsi, government and defense, automotive), and region - global forecast to 2025," https://www.marketsandmarkets.com/Market-Reports/facial-recognition-market-995.html, [Online]. Accessed on May 28, 2023.
[2] S. Khairnar, S. Gite, K. Kotecha, and S. D. Thepade, "Face liveness detection using artificial intelligence techniques: A systematic literature review and future directions," Big Data and Cognitive Computing, vol. 7, no. 1, 2023.
[3] Z. Yu, Y. Qin, X. Li, C. Zhao, Z. Lei, and G. Zhao, "Deep learning for face anti-spoofing: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 5, pp. 5609–5631, 2023.
[4] S. Z. Rufai, A. Selwal, and D. Sharma, "On analysis of face liveness detection mechanisms via deep learning models," in 2022 International Conference on Sustainable Computing and Data Communication Systems (ICSCDS), 2022, pp. 59–64.
[5] H. Li, W. Li, H. Cao, S. Wang, F. Huang, and A. C. Kot, "Unsupervised domain adaptation for face anti-spoofing," IEEE Transactions on Information Forensics and Security, vol. 13, no. 7, pp. 1794–1809, 2018.
[6] T. Ahonen, A. Hadid, and M. Pietikäinen, "Face recognition with local binary patterns," in Computer Vision - ECCV 2004, T. Pajdla and J. Matas, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 469–481.
[7] J. Määttä, A. Hadid, and M. Pietikäinen, "Face spoofing detection from single images using micro-texture analysis," in 2011 International Joint Conference on Biometrics (IJCB), 2011, pp. 1–7.
[8] T. de Freitas Pereira, J. Komulainen, A. Anjos, J. M. De Martino, A. Hadid, M. Pietikäinen, and S. Marcel, "Face liveness detection using dynamic texture," EURASIP Journal on Image and Video Processing, vol. 2014, no. 1, p. 2, 2014.
[9] Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li, "A face antispoofing database with diverse attacks," in 2012 5th IAPR International Conference on Biometrics (ICB), 2012, pp. 26–31.
[10] I. Chingovska, A. Anjos, and S. Marcel, "On the effectiveness of local binary patterns in face anti-spoofing," in 2012 BIOSIG - Proceedings of the International Conference of Biometrics Special Interest Group (BIOSIG), 2012, pp. 1–7.
[11] D. Wen, H. Han, and A. K. Jain, "Face spoof detection with image distortion analysis," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 746–761, 2015.
[12] M. M. Hasan, M. S. U. Yusuf, T. I. Rohan, and S. Roy, "Efficient two stage approach to detect face liveness: Motion based and deep learning based," in 2019 4th International Conference on Electrical Information and Communication Technology (EICT), 2019, pp. 1–6.
[13] S. Hashemifard and M. Akbari, "A compact deep learning model for face spoofing detection," CoRR, vol. abs/2101.04756, 2021.
[14] N. Nanthini, N. Puviarasan, and P. Aruna, "A novel deep cnn based ldnet model with the combination of 2d and 3d cnn for face liveness detection," in 2022 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), 2022, pp. 1–7.
[15] H. Li, P. He, S. Wang, A. Rocha, X. Jiang, and A. C. Kot, "Learning generalized deep feature representation for face anti-spoofing," IEEE Transactions on Information Forensics and Security, vol. 13, no. 10, pp. 2639–2652, 2018.
[16] R. Cai, H. Li, S. Wang, C. Chen, and A. C. Kot, "DRL-FAS: A novel framework based on deep reinforcement learning for face anti-spoofing," CoRR, vol. abs/2009.07529, 2020.
[17] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
[19] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, "ViViT: A video vision transformer," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 6816–6826.
[20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[21] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, "Transformers in vision: A survey," ACM Computing Surveys, vol. 54, no. 10s, 2022.
[22] G. Habib, T. J. Saleem, and B. Lall, "Knowledge distillation in vision transformers: A critical review," arXiv preprint arXiv:2302.02108, 2023.
[23] Z. Li, R. Cai, H. Li, K.-Y. Lam, Y. Hu, and A. C. Kot, "One-class knowledge distillation for face presentation attack detection," IEEE Transactions on Information Forensics and Security, vol. 17, pp. 2137–2150, 2022.
[24] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks: Towards good practices for deep action recognition," in The European Conference on Computer Vision (ECCV), 2016.
[25] M. Hossin and M. N. Sulaiman, "A review on evaluation metrics for data classification evaluations," International Journal of Data Mining & Knowledge Management Process, vol. 5, no. 2, p. 1, 2015.

Marc Marais is an MSc student in Computer Science at Rhodes University. Interests: action recognition, machine learning and computer vision.
Dane Brown is a Senior Lecturer at Rhodes University, where he obtained his PhD. Interests: computer vision, machine learning, security and GPGPU.
James Connan is a Senior Lecturer at Rhodes University. Interests: image processing and machine learning.
Alden Boby is an MSc student in Computer Science at Rhodes University. Interests: machine learning, computer vision and object detection.

