Robust Attentive Deep Neural Network for Detecting GAN-Generated Faces
ABSTRACT Generative Adversarial Network (GAN) based techniques can generate and synthesize real-
istic faces that cause profound social concerns and security problems. Existing methods for detecting
GAN-generated faces can perform well on limited public datasets. However, images from existing datasets
do not represent real-world scenarios well enough in terms of view variations and data distributions, where
real faces largely outnumber synthetic ones. State-of-the-art methods generalize poorly to such real-world
problems and lack interpretable detection results. The performance of existing GAN-face detection
models degrades further when facing imbalanced data. To address these shortcomings, we propose
a robust, attentive, end-to-end framework that spots GAN-generated faces by analyzing eye inconsistencies.
Our model automatically learns to identify inconsistent eye components by localizing and comparing artifacts
between eyes. After the iris regions are extracted by Mask R-CNN, we design a Residual Attention Network
(RAN) to examine the consistency between the corneal specular highlights of the two eyes. Our method can
effectively learn from imbalanced data using a joint loss function combining the traditional cross-entropy
loss with a relaxation of the ROC-AUC loss via Wilcoxon-Mann-Whitney (WMW) statistics. Comprehensive
evaluations on a newly created FFHQ-GAN dataset in both balanced and imbalanced scenarios demonstrate
the superiority of our method.
INDEX TERMS GAN-generated face, fake face detection, iris detection, corneal specular highlights,
residual attention network, data imbalance, AUC maximization, WMW statistics, FFHQ-GAN dataset.
FIGURE 2. The proposed architecture for GAN-generated face detection. We first use DLib [27] to detect faces and localize eyes, and use Mask R-CNN [19]
to segment out the iris regions. A Residual Attention Network (RAN) then performs binary classification on the extracted iris pair to determine if the face
is real or fake. The training is carried out using a joint loss combining the Binary Cross-Entropy (BCE) loss and the ROC-AUC loss with WMW relaxation to
better handle the learning from imbalanced data (see text).
A. GAN-GENERATED FACE DETECTION
GAN-generated face detection methods can be organized into two categories.

Data-driven methods [28]–[32] mostly train a deep neural network model to distinguish real and GAN-generated faces. These deep learning (DL) based methods work well in many scenarios, as they can better learn representations in a high-dimensional feature space instead of from raw image pixels.

Physical and physiological methods look for signal traces, artifacts, or inconsistencies left by the GAN synthesizers. These methods are explainable in nature. Simple cues such as color differences are used in [33], [34] to distinguish GAN images from real ones. However, those methods are no longer effective as the GAN methods advance. More sophisticated methods [11], [35] leverage fingerprints or abstract signal-level traces of the noise residuals to differentiate GAN-generated faces. Many works [36]–[38] identify GAN images by recognizing the specific artifacts produced by the GAN upsampling process. In [39], the distribution of facial landmarks is analyzed to distinguish GAN-generated faces. Inconsistent head poses are detected to expose fake videos in [10]. The work of [16] identifies GAN-generated faces as well as deepfake face manipulations by inspecting visual artifacts. Our prior work [18] detects inconsistencies of the corneal specular highlights between the left and right eyes to expose GAN-generated faces.

B. ATTENTION MECHANISM
Since the seminal work of [40] in machine translation, the attention mechanism has been widely used in many applications to improve the performance of deep learning models by focusing on the most relevant parts of the features in a flexible manner. Class Activation Mapping (CAM) [41] and Grad-CAM [42] are widely used in many computer vision tasks [43]. However, in these works, attention is only used to visualize model predictions by showing significant portions of the images. On the other hand, integrating the attention mechanism into the network design is shown to be effective in boosting performance, as the network can be guided by the attention to focus on relevant regions during training [44]. Channel attention [45] can automatically learn to focus on important channels by analyzing the relationships between channels. SENet [46] embeds the channel attention mechanism into residual blocks, and its effectiveness is shown on large-scale image classification. The attention mechanism is also used in [47] to distinguish important channels in the network to improve the representation capability. The ideas of channel attention and spatial attention are combined jointly in [48], [49] to improve network performance significantly. The Residual Attention Network in [20] combines the residual unit [50] with the attention mechanism by stacking residual attention blocks to improve performance and reduce model complexity.

C. IMBALANCED DATA LEARNING
Learning from imbalanced data has been widely studied in machine learning [51]–[54] and computer vision [55], [56]. Earlier solutions for imbalanced data learning are mainly based on sampling design, e.g., oversampling the minority classes, undersampling the majority classes, and weighted sampling [57]. These sampling-based methods come with their own drawbacks. For example, undersampling may ignore important samples, and oversampling may lead to overfitting.

Data augmentation provides an alternative solution to alleviate data imbalance issues. For image recognition, image mirroring, rotation, color adjustment, etc. are simple methods to augment data samples [58]. However, data augmentation can only partly address data imbalance, as the original dataset must be diverse enough that a sufficient amount of representative samples can be produced through augmentation.

III. METHOD
We next describe the proposed GAN-generated face detection framework. Given an input face image, facial landmarks are first localized using DLib [27], and Mask R-CNN [19] is used to segment out the left and right iris regions of the eyes (§ III-A). We adopt a residual attention-based network [20] to perform binary classification on the iris regions of interest to determine if the input image is real or fake (§ III-B).
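To make the pipeline concrete before detailing each component, the following is a minimal test-time sketch. The dlib calls (get_frontal_face_detector, shape_predictor) are standard dlib APIs, but iris_segmenter (standing in for the trained Mask R-CNN of [19]) and ran (the trained Residual Attention Network of § III-B) are placeholders, and the eye-crop margin is our assumption; this is a sketch, not the released implementation.

```python
import dlib
import numpy as np
import torch

# Standard DLib face detector and 68-point landmark predictor.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# 68-point landmark indices for the two eyes.
LEFT_EYE, RIGHT_EYE = range(36, 42), range(42, 48)

def crop_eye(image: np.ndarray, shape, idxs, margin: int = 10) -> np.ndarray:
    """Crop an eye region around the DLib landmarks of one eye."""
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in idxs])
    x0, y0 = pts.min(axis=0) - margin
    x1, y1 = pts.max(axis=0) + margin
    return image[max(y0, 0):y1, max(x0, 0):x1]

def detect_gan_face(image: np.ndarray, iris_segmenter, ran) -> float:
    """Return the RAN score: likelihood that the face is GAN-generated."""
    face = detector(image)[0]          # assume one face per image
    shape = predictor(image, face)     # 68 facial landmarks
    irises = []
    for idxs in (LEFT_EYE, RIGHT_EYE):
        eye = crop_eye(image, shape, idxs)
        iris = iris_segmenter(eye)     # Mask R-CNN iris crop (placeholder),
        irises.append(torch.as_tensor(iris))  # resized to 96x96 as in § IV-B
    pair = torch.stack(irises)         # the extracted iris pair
    return ran(pair.unsqueeze(0)).item()  # RAN ends in a Sigmoid layer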
FIGURE 3. Details of our Attention Module from the RAN in Figure 2.
Our design is inspired by the residual attention network of [20].
FIGURE 5. The proposed pipeline for training the Residual Attention Network (RAN) on possibly imbalanced data for GAN-generated face
classification. The extracted iris pairs are passed as input to the RAN. A robust loss function derived from maximizing the AUC of ROC is
optimized in the training of the RAN. See details in § III-D.
As suggested in [20], the mixed attention yields the best performance. Thus, we use the Sigmoid function 1/(1 + exp(−f_{s,c})) to learn the mixed attention for each channel and each spatial location, where s ranges over all spatial positions and c ranges over all channels of f. The proposed Residual Attention Network (RAN) is constructed by stacking multiple Attention Modules, as shown on the right side of Figure 2. Table 1 provides details of the architecture. Although the attention module plays an important role in classification, a simple stacking of attention modules may reduce performance. To this end, we adopt a simple solution by adding the attention map onto the original feature map. This combination allows attention modules to be stacked like a ResNet [50] and improves the performance [20]. Given an input image, the RAN outputs a prediction score from the last Sigmoid layer as an indication of the likelihood of the input image being a GAN-generated image.
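For concreteness, the following is a minimal PyTorch sketch of one such Attention Module. It is a simplified stand-in rather than the exact architecture of Table 1: the mask branch of [20] is a down/up-sampling hourglass, reduced here to two convolutions, and the layer widths are placeholders.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Simplified residual attention block in the spirit of [20]."""

    def __init__(self, channels: int):
        super().__init__()
        # Trunk branch: plain feature transformation (simplified here).
        self.trunk = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        # Mask branch producing f_{s,c}; the real design uses a
        # down/up-sampling hourglass, which we omit for brevity.
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = self.trunk(x)
        # Mixed attention: Sigmoid 1 / (1 + exp(-f_{s,c})) for every
        # channel c and spatial position s.
        m = torch.sigmoid(self.mask(x))
        # Residual combination (1 + M(x)) * T(x): the attended feature
        # map is added onto the trunk output, so stacked modules behave
        # like a ResNet.
        return (1.0 + m) * t
```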
C. AUC OF ROC FOR CLASSIFICATION EVALUATION
Most classification loss measures, including the popular cross-entropy loss, are ineffective in addressing the issue of data imbalance. The resulting models can produce accurate but rather biased predictions that do not work well in practice. It is desirable to address data imbalance directly by specifically designing a suitable loss function.

Since the area under the curve (AUC) of a receiver operating characteristic (ROC) curve [26], [66] is a robust evaluation metric for both balanced and imbalanced data, we would like to directly maximize the AUC to handle imbalanced situations. The AUC is widely used in binary classification problems. We next briefly review the definition of AUC, and then motivate how we incorporate a loss term that directly maximizes the AUC performance. Given a labeled dataset {(x_i, y_i)}_{i=1}^M, where each data sample x_i ∈ R^d and each corresponding label y_i ∈ {−1, +1}, we define the set of indices of positive instances as P = {i | y_i = +1}. Similarly, the set of indices of negative instances is N = {i | y_i = −1}. Let g_w : R^d → R be a parametric prediction function with parameter w ∈ R^m; g_w(x_i) represents the prediction score of the i-th sample, where i ∈ {1, · · · , M}. For simplicity, we assume g_w(x_i) ≠ g_w(x_j) for i ≠ j (ties can be broken in any consistent way).

Given a threshold λ, the number of negative examples with prediction scores larger than λ is the false positives (FP), and the number of positive examples with prediction scores greater than or equal to λ is the true positives (TP). From the FP and TP, we can calculate the false positive rate (FPR) and the true positive rate (TPR) as follows:
$$\mathrm{FPR} = \frac{\sum_{i \in N} \mathbb{I}[g_w(x_i) > \lambda]}{|N|}, \qquad \mathrm{TPR} = \frac{\sum_{i \in P} \mathbb{I}[g_w(x_i) \ge \lambda]}{|P|},$$
where I[a] is an indicator function with I[a] = 1 if a is true and 0 otherwise. The receiver operating characteristic (ROC) is a plot of FPR versus TPR obtained by setting different decision thresholds λ ∈ (−∞, ∞). Based on this definition, the ROC is a curve confined to [0, 1] × [0, 1] connecting the point (0, 0) to (1, 1). The value of AUC corresponds to the area enclosed by the ROC curve.
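The following NumPy sketch evaluates FPR and TPR at a threshold and computes AUC directly from these definitions, using the pairwise ranking identity that § III-D introduces below; it is illustrative only, with our own function names and label convention (y ∈ {−1, +1}).

```python
import numpy as np

def fpr_tpr(scores: np.ndarray, labels: np.ndarray, lam: float):
    """FPR and TPR at threshold lam, per the definitions above."""
    pos, neg = scores[labels == +1], scores[labels == -1]
    fpr = float(np.mean(neg > lam))   # negatives scored above lam
    tpr = float(np.mean(pos >= lam))  # positives scored at or above lam
    return fpr, tpr

def auc_pairwise(scores: np.ndarray, labels: np.ndarray) -> float:
    """AUC as the fraction of correctly ranked positive-negative pairs
    (the WMW identity used in § III-D below)."""
    pos, neg = scores[labels == +1], scores[labels == -1]
    return float(np.mean(pos[:, None] > neg[None, :]))

# Worked example with |P| = 2 and |N| = 3: one of the six pairs is
# mis-ranked (0.7 vs. 0.8), so AUC = 5/6, regardless of class imbalance.
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2])
labels = np.array([+1, -1, +1, -1, -1])
print(fpr_tpr(scores, labels, lam=0.5))  # (0.333..., 1.0)
print(auc_pairwise(scores, labels))      # 0.8333...
```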
TABLE 2. Details of the FFHQ-GAN dataset regarding its balanced (-b) and imbalanced (-imb) subsets.

TABLE 3. Results on the FFHQ-GAN dataset regarding its balanced (-b) and imbalanced (-imb) subsets.

D. WMW AUC RELAXATION FOR LOSS DESIGN
The computation of an AUC score as the area under a ROC curve cannot be directly used in a loss function due to its discrete nature. Following the Wilcoxon-Mann-Whitney (WMW) statistic [25], the AUC can be equivalently written as
$$\mathrm{AUC} = \frac{1}{|P||N|} \sum_{i \in P} \sum_{j \in N} \mathbb{I}[g_w(x_i) > g_w(x_j)].$$
Therefore, the corresponding AUC loss (risk) can be defined as
$$L_{\mathrm{AUC}} = 1 - \mathrm{AUC} = \frac{1}{|P||N|} \sum_{i \in P} \sum_{j \in N} \mathbb{I}[g_w(x_i) < g_w(x_j)]. \tag{2}$$
Obviously, L_AUC takes values in [0, 1]. It is the fraction of positive-negative pairs whose prediction scores are ranked incorrectly, i.e., the prediction score of a negative sample is larger than the prediction score of a positive sample. If all prediction scores from the positive samples are larger than any prediction score from the negative samples, then L_AUC = 0, which indicates a perfect classifier. Furthermore, L_AUC is independent of the threshold λ: it depends only on the prediction scores g_w(x). In other words, only the predictor g_w affects the value of L_AUC. Therefore, we aim to learn a classifier g_w that minimizes Eq. (2).

Although we can calculate L_AUC by comparing the prediction scores of the positive and negative samples in each pair, the formulation in Eq. (2) is non-differentiable due to the discrete indicator. It is therefore desirable to find a differentiable approximation of L_AUC. Inspired by the work in [25], we use an approximation of L_AUC that can be directly applied in our objective function to minimize the AUC loss along with our imbalanced training procedure. Specifically, the differentiable approximation of L_AUC is
$$L_{\mathrm{AUC}} = \frac{1}{|P||N|} \sum_{i \in P} \sum_{j \in N} R(g_w(x_i), g_w(x_j)), \tag{3}$$
with
$$R(g_w(x_i), g_w(x_j)) = \begin{cases} \left(-(g_w(x_i) - g_w(x_j) - \gamma)\right)^p, & g_w(x_i) - g_w(x_j) < \gamma, \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$
where γ ∈ (0, 1] and p > 1 are two hyperparameters.
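For concreteness, a minimal PyTorch sketch of the relaxation in Eqs. (3)-(4) follows; the function and variable names are ours and not from a released implementation. Note that (−(d − γ))^p for d < γ is exactly clamp(γ − d, min=0)^p.

```python
import torch

def wmw_auc_loss(scores_pos: torch.Tensor, scores_neg: torch.Tensor,
                 gamma: float = 0.4, p: float = 2.0) -> torch.Tensor:
    """Differentiable relaxation of 1 - AUC, following Eqs. (3)-(4).

    For every positive/negative score pair, a penalty
    (-(g(x_i) - g(x_j) - gamma))^p is charged whenever the positive
    score fails to beat the negative score by the margin gamma.
    gamma = 0.4 and p = 2 are the paper's balanced-data settings.
    """
    # All pairwise differences g_w(x_i) - g_w(x_j), shape |P| x |N|.
    diff = scores_pos[:, None] - scores_neg[None, :]
    penalty = torch.clamp(gamma - diff, min=0.0) ** p
    return penalty.mean()  # averages over the |P||N| pairs
```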
Loss for the Proposed Residual Attention Network: We use a joint loss function comprising the conventional binary cross-entropy (BCE) loss L_BCE and the AUC loss L_AUC in a weighted sum:
$$L = \alpha L_{\mathrm{BCE}} + (1 - \alpha) L_{\mathrm{AUC}}, \tag{5}$$
where α ∈ [0, 1] is a scaling factor that balances the weights of the BCE loss and the AUC loss.
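A corresponding sketch of the joint objective in Eq. (5), reusing wmw_auc_loss from above. The 0/1 label convention (fake = 1), the degenerate-batch fallback, and the commented training loop (with the Adam settings reported in § IV-B) are our assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def joint_loss(scores: torch.Tensor, labels: torch.Tensor,
               alpha: float = 0.2, gamma: float = 0.4,
               p: float = 2.0) -> torch.Tensor:
    """Joint objective of Eq. (5): L = alpha*L_BCE + (1 - alpha)*L_AUC.

    `scores` are the RAN's Sigmoid outputs in (0, 1); `labels` are 0/1.
    alpha = 0.2 and gamma = 0.4 are the paper's balanced-data settings.
    """
    bce = F.binary_cross_entropy(scores, labels.float())
    pos, neg = scores[labels == 1], scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return bce  # degenerate batch with one class: fall back to BCE
    auc = wmw_auc_loss(pos, neg, gamma, p)
    return alpha * bce + (1.0 - alpha) * auc

# Hypothetical training step with the paper's optimizer settings
# (Adam, lr = 0.001, batch size 128), assuming a model `ran` and a
# `loader` yielding (iris_pair, label) batches exist:
# optimizer = torch.optim.Adam(ran.parameters(), lr=1e-3)
# for pairs, labels in loader:
#     loss = joint_loss(ran(pairs).squeeze(1), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```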
IV. EXPERIMENT
For the experimental evaluation of the proposed method and comparison against the state-of-the-art methods, we first introduce the newly constructed FFHQ-GAN dataset (§ IV-A). Implementation details of the proposed method are provided in § IV-B. Performance evaluation on the FFHQ-GAN balanced and imbalanced subsets is presented in § IV-C. Ablation studies are provided in § IV-D. Finally, qualitative results are shown in § IV-E.

A. THE NEW FFHQ-GAN DATASET
We collect real human face images from the Flickr-Faces-HQ (FFHQ) dataset [3]. GAN-generated face images are created using StyleGAN2 [4] via http://thispersondoesnotexist.com, where the image resolution is 1024 × 1024 pixels. We randomly select 5,000 real face images from FFHQ and 5,000 GAN-generated face images. After iris detection, we discard those images where the iris of either eye is not detected. This results in 3,739 real faces (with iris pairs) and 3,748 fake faces (with iris pairs), which constitute our new FFHQ-GAN dataset. The split ratio of training to testing is 8:2.

To enable a thorough evaluation of the model in both balanced and imbalanced data scenarios, we sample the FFHQ-GAN dataset to form an imbalanced subset; the statistics of the subsets are provided in Table 2.
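As an illustration only, the following sketch builds balanced and imbalanced splits in the spirit described above. The actual subset sizes are those of Table 2 (not reproduced in this extraction), so the 10% fake fraction below is hypothetical; we assume the imbalanced subset keeps all real iris pairs and subsamples the fake ones, consistent with real faces outnumbering synthetic ones in the wild.

```python
import random

def make_splits(real_pairs: list, fake_pairs: list,
                fake_fraction: float = 1.0, train_ratio: float = 0.8,
                seed: int = 0):
    """Build an FFHQ-GAN-style split as (train, test) lists of (x, y).

    fake_fraction = 1.0 reproduces the balanced subset; a value < 1.0
    subsamples the fake class to mimic real-world imbalance.
    """
    rng = random.Random(seed)
    fakes = rng.sample(fake_pairs, int(len(fake_pairs) * fake_fraction))
    data = [(x, 0) for x in real_pairs] + [(x, 1) for x in fakes]
    rng.shuffle(data)
    cut = int(len(data) * train_ratio)  # the paper's 8:2 train/test split
    return data[:cut], data[cut:]

# Balanced: all 3,739 real and 3,748 fake iris pairs.
# train_b, test_b = make_splits(real_pairs, fake_pairs, fake_fraction=1.0)
# Imbalanced (hypothetical ratio): keep only 10% of the fakes.
# train_i, test_i = make_splits(real_pairs, fake_pairs, fake_fraction=0.1)
```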
B. IMPLEMENTATION DETAILS
We implemented our method in PyTorch [67]. Experiments are conducted on a workstation with two NVIDIA GeForce 1080Ti GPUs.

For iris detection, Mask R-CNN is trained using the datasets from [60], [61]. For each training eye image, the outer boundary mask of each iris is obtained using the method of [60] with default hyper-parameter settings. These masks are used to generate the iris bounding boxes and the corresponding masks for training, using the default settings in [19].

In the test stage, given an input face image, we first use the face detector and landmark extractor of DLib [27] to crop out the eye regions. Each cropped eye region is fed to Mask R-CNN for localizing the iris bounding box and segmentation mask. This process is repeated for both the left and right eyes to obtain the iris pairs as the input for our Residual Attention Network. We resize all iris pairs to a fixed size of 96 × 96 for training and testing to ensure that the whole pipeline works well.

Table 1 describes the details of our Residual Attention Network (RAN), where the Attention Module (AM) detailed in Figure 3 is repeatedly stacked three times. The network is trained using the Adam optimizer [68] with a learning rate of 0.001 and a batch size of 128. Training is terminated at 100 epochs for balanced data and 2,000 epochs for imbalanced data.

Hyper-Parameters: We set p = 2 in Eq. (4), with γ = 0.4 for the balanced dataset and γ = 0.6 for imbalanced data. For the experiments on the balanced dataset, α in Eq. (5) is set to 0.2. For the experiments on the imbalanced dataset, α is set to 0.4. These hyperparameters yield the best performance.

C. EVALUATION ON THE FFHQ-GAN DATASET
We report evaluation of GAN-generated face detection on the FFHQ-GAN dataset in terms of Accuracy (ACC), Precision (P), Recall (R), F1 score (F1), the area under the curve (AUC)
FIGURE 6. Performance comparison of the proposed method with ResNet with BCE loss, Xception with BCE loss, and RAN with BCE loss.
FIGURE 7. Confusion Matrix on the FFHQ-GAN (left) balanced and (right) imbalanced datasets.
FIGURE 10. Examples of detected GAN-generated faces and their corresponding iris regions and the attention maps produced from our method. These
examples show that our method can detect a wide range of face images, including those with tilted or side views where both irises are visible.
…on a balanced dataset and test on an imbalanced dataset. The obtained performance difference is similar to that of training/testing on the balanced dataset. This result suggests the importance of training the model on an imbalanced dataset if the model is expected to perform detection on imbalanced data.

…images and attends to the highlight parts for the fake images. Figure 10 shows additional examples of GAN-generated faces with the extracted iris pairs and corresponding attention maps. The visualization also provides an intuitive approach for human beings to identify GAN-generated faces by comparing their iris regions.
[2] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," 2017, arXiv:1710.10196.
[3] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4401–4410.
[4] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of StyleGAN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8110–8119.
[5] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila, "Alias-free generative adversarial networks," in Proc. NeurIPS, 2021, pp. 1–12.
[6] A Spy Reportedly Used an AI-Generated Profile Picture to Connect With Sources on LinkedIn. Accessed: Jun. 13, 2019. [Online]. Available: https://bit.ly/35BU215
[7] A High School Student Created a Fake 2020 US Candidate. Twitter Verified it. Accessed: Feb. 28, 2020. [Online]. Available: https://www.cnn.com/2020/02/28/tech/fake-twitter-candidate-2020/index.html
[8] How Fake Faces are Being Weaponized Online. Accessed: Feb. 20, 2020. [Online]. Available: https://www.cnn.com/2020/02/20/tech/fake-faces-deepfake/index.html
[9] These Faces are not Real. Accessed: Jul. 15, 2020. [Online]. Available: https://graphics.reuters.com/CYBER-DEEPFAKE/ACTIVIST/nmovajgnxpa/index.html
[10] X. Yang, Y. Li, and S. Lyu, "Exposing deep fakes using inconsistent head poses," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 8261–8265.
[11] F. Marra, D. Gragnaniello, L. Verdoliva, and G. Poggi, "Do GANs leave artificial fingerprints?" in Proc. IEEE Conf. Multimedia Inf. Process. Retr. (MIPR), Mar. 2019, pp. 506–511.
[12] H. Mo, B. Chen, and W. Luo, "Fake faces identification via convolutional neural network," in Proc. 6th ACM Workshop Inf. Hiding Multimedia Secur., Jun. 2018, pp. 43–47.
[13] N.-T. Do, I.-S. Na, and S.-H. Kim, "Forensics face detection from GANs using convolutional neural network," in Proc. ISITC, 2018, pp. 1–5.
[14] R. Wang, F. Juefei-Xu, L. Ma, X. Xie, Y. Huang, J. Wang, and Y. Liu, "FakeSpotter: A simple yet robust baseline for spotting AI-synthesized fake faces," 2019, arXiv:1909.06122.
[15] B. Chen, X. Ju, B. Xiao, W. Ding, Y. Zheng, and V. H. C. de Albuquerque, "Locally GAN-generated face detection based on an improved Xception," Inf. Sci., vol. 572, pp. 16–28, Sep. 2021.
[16] F. Matern, C. Riess, and M. Stamminger, "Exploiting visual artifacts to expose deepfakes and face manipulations," in Proc. IEEE Winter Appl. Comput. Vis. Workshops (WACVW), Jan. 2019, pp. 83–92.
[17] H. Guo, S. Hu, X. Wang, M.-C. Chang, and S. Lyu, "Eyes tell all: Irregular pupil shapes reveal GAN-generated faces," in Proc. ICASSP, 2022, pp. 1–6.
[18] S. Hu, Y. Li, and S. Lyu, "Exposing GAN-generated faces using inconsistent corneal specular highlights," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Jun. 2021, pp. 2500–2504.
[19] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proc. ICCV, 2017, pp. 2961–2969.
[20] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3156–3164.
[21] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA, USA: MIT Press, 2012.
[22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1–9.
[23] M. Fadaee, A. Bisazza, and C. Monz, "Data augmentation for low-resource neural machine translation," 2017, arXiv:1705.00440.
[24] F. Provost, T. Fawcett, and R. Kohavi, "The case against accuracy estimation for comparing induction algorithms," in Proc. ICML, vol. 98, 1998, pp. 445–453.
[25] L. Yan, R. H. Dodier, M. Mozer, and R. H. Wolniewicz, "Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic," in Proc. ICML, 2003, pp. 848–855.
[26] S. Lyu and Y. Ying, "A univariate bound of area under ROC," in Proc. UAI, 2018, pp. 1–10.
[27] D. E. King, "Dlib-ml: A machine learning toolkit," J. Mach. Learn. Res., vol. 10, pp. 1755–1758, Jan. 2009.
[28] F. Marra, C. Saltori, G. Boato, and L. Verdoliva, "Incremental learning for the detection and classification of GAN-generated images," in Proc. IEEE Int. Workshop Inf. Forensics Secur. (WIFS), Dec. 2019, pp. 1–6.
[29] M. Goebel, L. Nataraj, T. Nanjundaswamy, T. M. Mohammed, S. Chandrasekaran, and B. S. Manjunath, "Detection, attribution and localization of GAN generated images," 2020, arXiv:2007.10466.
[30] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, "CNN-generated images are surprisingly easy to spot... for now," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 7, Jun. 2020, pp. 1–10.
[31] Z. Liu, X. Qi, and P. H. S. Torr, "Global texture enhancement for fake face detection in the wild," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8060–8069.
[32] N. Hulzebosch, S. Ibrahimi, and M. Worring, "Detecting CNN-generated facial images in real-world scenarios," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 642–643.
[33] S. McCloskey and M. Albright, "Detecting GAN-generated imagery using color cues," 2018, arXiv:1812.08247.
[34] H. Li, B. Li, S. Tan, and J. Huang, "Identification of deep network generated images using disparities in color components," 2018, arXiv:1808.07276.
[35] N. Yu, L. Davis, and M. Fritz, "Attributing fake images to GANs: Learning and analyzing GAN fingerprints," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7556–7566.
[36] X. Zhang, S. Karaman, and S.-F. Chang, "Detecting and simulating artifacts in GAN fake images," in Proc. IEEE Int. Workshop Inf. Forensics Secur. (WIFS), Dec. 2019, pp. 1–6.
[37] J. Frank, T. Eisenhofer, L. Schönherr, A. Fischer, D. Kolossa, and T. Holz, "Leveraging frequency analysis for deep fake image recognition," 2020, arXiv:2003.08685.
[38] R. Durall, M. Keuper, and J. Keuper, "Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 7890–7899.
[39] X. Yang, Y. Li, H. Qi, and S. Lyu, "Exposing GAN-synthesized faces using landmark locations," in Proc. ACM Workshop Inf. Hiding Multimedia Secur., Jul. 2019, pp. 113–118.
[40] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, arXiv:1409.0473.
[41] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2921–2929.
[42] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 618–626.
[43] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, "Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2018, pp. 839–847.
[44] S. Kardakis, I. Perikos, F. Grivokostopoulou, and I. Hatzilygeroudis, "Examining attention mechanisms in deep learning models for sentiment analysis," Appl. Sci., vol. 11, no. 9, p. 3883, Apr. 2021.
[45] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. ECCV, 2018, pp. 3–19.
[46] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132–7141.
[47] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, "Image super-resolution using very deep residual channel attention networks," in Proc. ECCV, 2018, pp. 286–301.
[48] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, "SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5659–5667.
[49] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3–19.
[50] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[51] Y. Yang and Z. Xu, "Rethinking the value of labels for improving class-imbalanced learning," 2020, arXiv:2006.07529.
[52] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma, "Learning imbalanced datasets with label-distribution-aware margin loss," in Proc. NeurIPS, 2019, pp. 1567–1578.
[53] S. Hu, Y. Ying, and S. Lyu, "Learning by minimizing the sum of ranked range," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 21013–21023.
[54] S. Hu, Y. Ying, X. Wang, and S. Lyu, "Sum of ranked range loss for supervised learning," 2021, arXiv:2106.03300.
[55] Y. Wang, W. Gan, J. Yang, W. Wu, and J. Yan, "Dynamic curriculum learning for imbalanced data classification," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5017–5026.
[56] C. Huang, Y. Li, C. C. Loy, and X. Tang, "Learning deep representation for imbalanced classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5375–5384.
[57] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[58] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," J. Big Data, vol. 6, no. 1, pp. 1–48, Dec. 2019.
[59] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," 2015, arXiv:1506.01497.
[60] C. Wang, J. Muhammad, Y. Wang, Z. He, and Z. Sun, "Towards complete and accurate iris segmentation using deep multi-task attention network for non-cooperative iris recognition," IEEE Trans. Inf. Forensics Security, vol. 15, pp. 2944–2959, 2020.
[61] C. Wang et al., "NIR iris challenge evaluation in non-cooperative environments: Segmentation and localization," in Proc. IEEE Int. Joint Conf. Biometrics (IJCB), Aug. 2021, pp. 1–10.
[62] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, "Attention to scale: Scale-aware semantic image segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 3640–3649.
[63] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Proc. NIPS, vol. 28, 2015, pp. 2017–2025.
[64] Q. Jin, Z. Meng, C. Sun, H. Cui, and R. Su, "RA-UNet: A hybrid deep attention-aware network to extract liver and tumor in CT scans," Frontiers Bioeng. Biotechnol., vol. 8, p. 1471, Dec. 2020.
[65] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI). Springer, 2015, pp. 234–241.
[66] C. Cortes and M. Mohri, "AUC optimization vs. error rate minimization," in Proc. Adv. Neural Inf. Process. Syst., vol. 16, 2003, pp. 313–320.
[67] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., vol. 32, Dec. 2019, pp. 8026–8037.
[68] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015, pp. 1–15.
[69] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1251–1258.

HUI GUO is currently pursuing the Ph.D. degree with the University at Albany, State University of New York. Her research interests include digital media forensics, computer vision, and deep learning.

XIN WANG (Senior Member, IEEE) received the Ph.D. degree in computer science from the University at Albany, State University of New York, Albany, NY, USA, in 2015. He is currently a Senior Machine Learning Scientist at Keya Medical, Seattle, WA, USA. His research interests include artificial intelligence, machine learning, reinforcement learning, medical image computing, computer vision, and media forensics.

MING-CHING CHANG (Senior Member, IEEE) received the B.S. degree in civil engineering and the M.S. degree in computer science and information engineering (CSIE) from the National Taiwan University, in 1996 and 1998, respectively, and the Ph.D. degree from the Laboratory for Engineering Man/Machine Systems (LEMS), School of Engineering, Brown University, in 2008. He was an Assistant Researcher at Mechanical Industry Research Labs, Industrial Technology Research Institute (ITRI), Taiwan, from 1996 to 1998. From 2008 to 2016, he was a Computer Scientist at the GE Global Research Center. From 2016 to 2018, he was with the Department of Electrical and Computer Engineering. He is currently an Assistant Professor at the Department of Computer Science, College of Engineering and Applied Sciences (CEAS), University at Albany, State University of New York (SUNY). His research projects are funded by GE Global Research, IARPA, DARPA, NIJ, VA, and UAlbany. He has authored more than 100 peer-reviewed journal and conference publications, seven U.S. patents, and 15 disclosures. His research interests include video analytics, computer vision, image processing, and artificial intelligence. He is a member of ACM. He was a recipient of the IEEE Advanced Video and Signal-based Surveillance (AVSS) 2011 Best Paper Award - Runner-Up, the IEEE Workshop on the Applications of Computer Vision (WACV) 2012 Best Student Paper Award, the GE Belief - Stay Lean and Go Fast Management Award in 2015, and the IEEE Smart World NVIDIA AI City Challenge 2017 Honorary Mention Award. He serves as the Co-Chair for the Annual AI City Challenge CVPR 2018–2021 Workshop, the Co-Chair for the IEEE Low-Power Computer Vision (LPCV) Annual Contest and Workshop 2019–2021, the Program Chair for the IEEE Advanced Video and Signal-Based Surveillance (AVSS) 2019, the Co-Chair for the IWT4S 2017–2019, the Area Chair for IEEE ICIP (2017, 2019–2021) and ICME (2021), and the TPC Chair for the IEEE MIPR 2022.