0% found this document useful (0 votes)

3 views152 pages

Advanced Strategies for Improving the Robustness of Deep Learning

This dissertation by Keyu Chen explores advanced strategies to enhance the robustness of deep learning, focusing on areas such as computer vision and natural language processing. It includes methodologies like discriminative adversarial domain generalization and episodic curriculum learning for low-resource domain adaptation. The work is part of the requirements for a Doctor of Philosophy degree at the University of South Florida, under the supervision of Dr. J. Morris Chang and a committee of experts.

Uploaded by

aniksquare33

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

3 views152 pages

Advanced Strategies for Improving the Robustness of Deep Learning

Uploaded by

aniksquare33

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 152

University of South Florida

Digital Commons @ University of

South Florida

USF Tampa Graduate Theses and Dissertations USF Graduate Theses and Dissertations

November 2024

Advanced Strategies for Improving the Robustness of Deep

Learning
Keyu Chen
University of South Florida

Follow this and additional works at: https://digitalcommons.usf.edu/etd

Part of the Electrical and Computer Engineering Commons

Scholar Commons Citation

Chen, Keyu, "Advanced Strategies for Improving the Robustness of Deep Learning" (2024). USF Tampa
Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/10606

This Dissertation is brought to you for free and open access by the USF Graduate Theses and Dissertations at
Digital Commons @ University of South Florida. It has been accepted for inclusion in USF Tampa Graduate Theses
and Dissertations by an authorized administrator of Digital Commons @ University of South Florida. For more
information, please contact [email protected].
Advanced Strategies for Improving the Robustness of Deep Learning

Keyu Chen

A dissertation submitted in partial fulfillment

of the requirements for the degree of
Doctor of Philosophy
Department of Electrical Engineering
College of Engineering
University of South Florida

Major Professor: J. Morris Chang, Ph.D.

Ismail Uysal, Ph.D.
John Licato, Ph.D.
Lu Lu, Ph.D.
Zhuo Lu, Ph.D.

Date of Approval:
November 8, 2024

Keywords: Computer Vision, Natural Language Processing, Domain Generalization,

Domain Adaptation

Copyright © 2024, Keyu Chen

Dedication

This dissertation is dedicated to my parents Zhongwen Chen and Hongying Xiao for
their unconditional support in all my endeavors. I also dedicate this dissertation to my wife
Jingwei He for her encouragement and inspiration.
Acknowledgments

I want to take a moment to extend my deepest appreciation to everyone who played a

pivotal role in my journey as a Ph.D. student.
Foremost, I owe a tremendous debt of gratitude to my principal advisor, Dr. J. Morris
Chang. His willingness to welcome me into his lab at the University of South Florida as a
Ph.D. candidate has been a cornerstone of my academic and professional development. Under
his mentorship, I’ve honed skills crucial to my field, from drafting research proposals and
writing academic papers to delivering presentations. Dr. Chang’s guidance, patience, and
support have been instrumental in navigating the challenges of research and the complexities
of Ph.D. life.
My dissertation would not be possible without my committee members: Dr. Zhuo Lu,
Dr. Ismail Uysal, Dr. John Licato, and Dr. Lu Lu, along with Dr. Gamage Dumindu
Samaraweera, who served as the chairperson of my Ph.D. defense. The knowledge imparted
by them, whether through coursework or insightful discussions, has significantly contributed
to the depth of my dissertation.
In closing, I cannot overstate my gratitude towards my family and friends. Their unwa-
vering love, patience, and support have been the bedrock of my journey. This dissertation,
a milestone of my academic path, would not have been realized without them.
Table of Contents

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

Chapter 1: General Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Chapter 2: Discriminative Adversarial Domain Generalization with The Meta-

learning-based Cross-domain Validation . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Generative Adversarial Nets (GAN) . . . . . . . . . . . . . . . 9
2.2.2 Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Domain Generalization . . . . . . . . . . . . . . . . . . . . . . 11
2.2.4 Meta-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Discriminative Adversarial Learning . . . . . . . . . . . . . . 15
2.3.2 Meta-learning-based Cross Domain Validation . . . . . . . . . 17
2.3.3 Summary of DADG . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Baseline Approaches . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Experimental Datasets . . . . . . . . . . . . . . . . . . . . . . 21
2.4.3 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.4 Effectiveness Analysis . . . . . . . . . . . . . . . . . . . . . . 23
2.4.5 Impact of Different DADG Components . . . . . . . . . . . . 25
2.4.6 Impact of Linear Related Domains . . . . . . . . . . . . . . . 26
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

Chapter 3: Epi-Curriculum: Episodic Curriculum Learning for Low-Resource

Domain Adaptation in Neural Machine Translation . . . . . . . . . . . . . . . . . 30
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 The Encoder-Decoder Model and NMT . . . . . . . . . . . . . 34
3.2.2 Domain Adaptation for NMT . . . . . . . . . . . . . . . . . . 35
3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.2 Episodic Training Framework . . . . . . . . . . . . . . . . . . 38

i
3.3.3 Pre-defined Curriculum Learning . . . . . . . . . . . . . . . . 41
3.3.4 Overall Training Flow . . . . . . . . . . . . . . . . . . . . . . 42
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 45
3.4.3 Comparison Group . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.5 Results and Discussion (En-De) . . . . . . . . . . . . . . . . . 48
3.4.6 Results and Discussion (En-Ro) . . . . . . . . . . . . . . . . . 49
3.4.7 Results and Discussion (En-Fr) . . . . . . . . . . . . . . . . . 51
3.4.8 Robustness to Domain Shift . . . . . . . . . . . . . . . . . . . 52
3.4.9 Robustness to Parameter Perturbation . . . . . . . . . . . . . 54
3.5 Curriculum Learning Analysis . . . . . . . . . . . . . . . . . . . . . . . 55
3.5.1 Curriculum Validity . . . . . . . . . . . . . . . . . . . . . . . 55
3.5.2 Impact of Training Schedulers . . . . . . . . . . . . . . . . . . 57
3.5.3 Impact of Denoising . . . . . . . . . . . . . . . . . . . . . . . 58
3.5.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Chapter 4: SuperCon: Supervised Contrastive Learning for Imbalanced Skin

Lesion Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.1 Skin Image Analysis in Deep Learning . . . . . . . . . . . . . 65
4.2.2 Class Imbalance . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.3 Representation Learning . . . . . . . . . . . . . . . . . . . . . 68
4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.1 Representation Training . . . . . . . . . . . . . . . . . . . . . 69
4.3.2 Classifier Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.1 Experimental Datasets . . . . . . . . . . . . . . . . . . . . . . 73
4.4.2 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . 73
4.4.3 Effectiveness Analysis . . . . . . . . . . . . . . . . . . . . . . 74
4.4.4 Evaluation on Extremely Imbalanced Dataset . . . . . . . . . 77
4.4.5 Summary of Experimental Evaluation . . . . . . . . . . . . . 80
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

Chapter 5: CS-AF: A Cost-sensitive Multi-classifier Active Fusion Framework

for Skin Lesion Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.1 Multi-classifier Fusion . . . . . . . . . . . . . . . . . . . . . . 86
5.2.2 Fusion of CNNs for Skin Lesion Analysis . . . . . . . . . . . . 87
5.2.3 Cost-sensitive Machine Learning . . . . . . . . . . . . . . . . . 88
5.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

ii
5.3.1 Multi-classifier Fusion . . . . . . . . . . . . . . . . . . . . . . 90
5.3.2 Cost-sensitive Problem Formulation . . . . . . . . . . . . . . . 91
5.3.3 Computing the Objective Weights . . . . . . . . . . . . . . . . 95
5.3.4 Computing the Subjective Weights . . . . . . . . . . . . . . . 96
5.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4.1 Experiment Dataset . . . . . . . . . . . . . . . . . . . . . . . 98
5.4.2 Base Classifiers Preparation . . . . . . . . . . . . . . . . . . . 100
5.4.3 Experimental Procedure . . . . . . . . . . . . . . . . . . . . . 102
5.4.4 Evaluate the Effectiveness of CS-AF . . . . . . . . . . . . . . 102
5.4.5 Analyze the “Cost-sensitive” of CS-AF . . . . . . . . . . . . . 105
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

Chapter 6: General Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

Appendix A: Copyright Permissions . . . . . . . . . . . . . . . . . . . . . . . . . . 136

iii
List of Tables

Table 2.1 Cross domain classification accuracy (in %) on VLCS dataset. . . . 22

Table 2.2 Cross domain classification accuracy (in %) on PACS dataset. . . . 23

Table 2.3 Cross domain classification accuracy (in %) on Office-Home dataset. 25

Table 2.4 Cross domain classification accuracy (in %) on PACS dataset. . . . 26

Table 2.5 Cross domain classification accuracy (in %) on MNIST dataset. . . 28

Table 3.1 The notations and its description of our Epi-Curriculum. . . . . . . 37

Table 3.2 Data statistics for the En-De, En-Ro, and En-Fr tasks. . . . . . . . 44

Table 3.3 BLEU scores over Testing sets on En-De task. . . . . . . . . . . . 48

Table 3.4 The average improvement over baselines on the En-De task. . . . . 49

Table 3.5 BLEU scores over Testing sets on En-Ro task. . . . . . . . . . . . 50

Table 3.6 The average improvement over baselines on the En-Ro task. . . . . 51

Table 3.7 BLEU scores over Testing sets on En-Fr task. . . . . . . . . . . . . 52

Table 3.8 The average improvement over baselines on the En-Fr task. . . . . . 53

Table 3.9 BLEU scores over Testing sets on En-De task with different
training schedulers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

Table 3.10 BLEU scores over Testing sets on En-Ro task with different
training schedulers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Table 3.11 BLEU scores over Testing sets on En-Fr task with different
training schedulers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

Table 3.12 BLEU scores over Testing sets on En-De task with and without
noise data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

Table 3.13 BLEU scores over Testing sets on En-Ro task with and without
noise data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

iv
Table 3.14 BLEU scores over Testing sets on En-Fr task with and without
noise data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

Table 4.1 The sampled number of images for training and testing. . . . . . . 73

Table 4.2 Experimental results on the dataset of ISIC 2020 and ISIC 2019. . . 74

Table 4.3 Experimental results only on the ISIC 2020. . . . . . . . . . . . . . 76

Table 5.1 The statistics of different split training datasets. . . . . . . . . . . . 99

Table 5.2 The performance of base classifiers of the 12 CNN architectures. . . 101

v
List of Figures

Figure 2.1 The diagram of general multi-source domain generalization. . . . . 9

Figure 2.2 The diagram of DADG. . . . . . . . . . . . . . . . . . . . . . . . . 15

Figure 2.3 The training flow of DADG. . . . . . . . . . . . . . . . . . . . . . . 19

Figure 3.1 The diagram of our episodic training process. . . . . . . . . . . . . 38

Figure 3.2 Probabilistic view of the Default training scheduler. . . . . . . . . 46

Figure 3.3 Cross-domain improvement of domain-specific models on the

En-De task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Figure 3.4 Cross-domain improvement of domain-specific models on the

En-De task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

Figure 3.5 BLEU score degradation by adding Gaussian noise. . . . . . . . . . 54

Figure 3.6 Performance of En-De, En-Ro, and En-Fr on the seen domains. . . 55

Figure 3.7 Probabilistic view of the Advanced (left) and the Reversed (right)
scheduler. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

Figure 4.1 Diagram of our two-stage training. . . . . . . . . . . . . . . . . . . 69

Figure 4.2 Confusion matrices of (a) Vanilla, (b) Focal-Loss, (c) ROS, (d)
RUS, (e) SuerCon-CE and (f) SuperCon using backbone net-
work ResNet152 on dataset ISIC 2019 + 2020. . . . . . . . . . . . . 78

Figure 4.3 Confusion matrix of (a) Vanilla, (b) Focal-Loss, (c) ROS, (d)
SuperCon-CE and (e) SuperCon using backbone network ResNet152
on dataset ISIC 2020. . . . . . . . . . . . . . . . . . . . . . . . . . 79

Figure 4.4 t-SNE visualization of extracted features from fθ , using (a)

Vanilla; (b) Focal-Loss; (c) ROS; (d) SuperCon-CE; (e) SuperCon. 80

Figure 5.1 The overview of CS-AF framework. . . . . . . . . . . . . . . . . . . 89

Figure 5.2 The calculation of the objective weights. . . . . . . . . . . . . . . . 91

Figure 5.3 The calculation of the cost-sensitive confusion matrix. . . . . . . . 91

vi
Figure 5.4 Two examples of cost matrices. . . . . . . . . . . . . . . . . . . . . 94

Figure 5.5 The calculation of the subjective weights. . . . . . . . . . . . . . . 95

Figure 5.6 Evaluate the effectiveness of CS-AF. . . . . . . . . . . . . . . . . . 102

Figure 5.7 The sensitivity results of each single class of CS-AF with Cost
Matrix A and Cost Matrix B. . . . . . . . . . . . . . . . . . . . . . 104

Figure 5.8 The specificity results of each single class of CS-AF with Cost
Matrix A and Cost Matrix B. . . . . . . . . . . . . . . . . . . . . . 104

vii
Abstract

Machine Learning (ML) and Deep Learning (DL) have achieved great success across
diverse fields in the last decades, such as facial recognition, medical diagnosis, and lan-
guage translation. The success, however, largely hinges on the assumption that the training
and testing data share the same distribution or domain. In practice, real-world data often
exhibits domain shifts, leading to notable degradation in model performance. Hence, the
generalization capability of machine lea=rning models is of great importance, which refers
to the domain generalization problem. In this dissertation, we present DADG, an effective
algorithm for domain generalization, aiming to learn domain-invariant features from seen
domains and boost the model classifier using cross-domain validation. In the experimental
evaluation, a comprehensive comparison has been made among DADG and other 8 state-of-
the-art algorithms. Extensive experiments have been conducted on 4 benchmark datasets.
Our results show that DADG outperforms the other algorithms on 3 datasets.

As a universal problem across different areas in deep learning, domain shift is also ex-
hibited in Natural Language Processing. We propose Epi-Curriculum, an effective learning
algorithm for low-resource domain adaptation in Neural Machine Translation (NMT). It
effectively improves the model’s robustness to domain shift and adaptability when only hun-
dreds of parallel training data are available in the target domain. Extensive experiments
have been conducted on 3 different translation tasks. Our results show that Epi-Curriculum
outperforms the other 5 methods in terms of the BLEU score.

One key principle of our DADG and Epi-Curriculum is to learn domain-invariant features
across different seen domains, which refers to representation learning. Representation learn-
ing is foundational to the success of deep learning, varying widely in its forms depending on
the specific learning objectives. Such approaches not only overcome domain-specific chal-

viii
lenges but also provide new solutions for related issues inherent in machine learning tasks.
Among these, class imbalance stands out as a critical challenge, characterized by a dispropor-
tionate distribution of classes that adversely affects model accuracy. This imbalance often
results in models that are biased toward the majority class, overlooking minority classes
due to the uniform weight attributed to each sample during training. Traditional solutions
have predominantly focused on adjusting the sampling ratio or modifying loss functions, ap-
proaches that do not address the fundamental objective of deep learning: the extraction of
meaningful features. We propose SuperCon, an effective learning framework, aiming to learn
distinguishable features for each class using supervised contrastive loss. In the experimental
evaluation, a comprehensive comparison has been made between SuperCon and the other
5 conventional algorithms. Our results show that SuperCon significantly outperforms the
conventional methods and effectively extracts differentiated features.
As an effective method for integrating multiple single models, ensemble learning signif-
icantly enhances the performance beyond that achievable by any individual model. How-
ever, a primary challenge lies in effectively adjusting the strengths and weaknesses of each
model to optimize their contributions to the final prediction. We present CS-AF, an en-
semble learning algorithm that actively determines the influence of each single model with
an application-sensitive cost matrix. In the experimental evaluation, CS-AF outperforms 4
other state-of-the-art algorithms in terms of accuracy and pre-defined cost on a benchmark
skin lesion image dataset, showcasing the effectiveness of CA-AF.

ix
Chapter 1: General Introduction

Machine learning and deep learning applications have achieved great success across di-
verse fields in the last decade, such as recommendation systems [32], facial recognition [60],
and medical diagnosis [160]. For instance, YouTube used two deep-learning encoders to
learn billions of user-item interactions to capture the users’ behavior and for further video
recommendations [32].
However, the success heavily relies on the fact that the testing data are in the same
distribution (domain) as the data used to train the model. In practice, real-world situations
often exhibit out-of-distribution test cases (e.g., domain shifts), leading to notable degrada-
tion in model performance. For instance, the language translation model trained solely on
the news domain performed poorly on the COVID-19 news due to the highly frequent new
medical terms [117]. It refers to the generalization capability to domain shift, meaning the
ability to perform well on the domains not presented in the training (e.g., unseen domains).
Therefore, strategies for handling domain shifts across machine learning and deep learning
applications are important.
Two techniques address the model’s generalization capability to domain shift, domain
generalization (DG) and domain adaptation (DA), where both techniques aim to train a
model that can perform well on unseen domains. The main difference between DG and DA
is that DG assumes the unseen domains are unavailable in the training phase. While DA
can access some of the unseen domains during training. Designing an effective approach in
DG or DA is challenging. First, the approach should be model-agnostic. Domain shift is a
general problem across different machine learning and deep learning applications, such that
an approach designed for a specific network architecture has huge limitations. Second, the

1
approach should be data-independent. Domain shift could occur across any type and number
of domains, such as different art forms of images, different topics of languages, and different
formats of text. A data-dependent approach can only lead to promising performance in some
particular cases and fail in other cases.
In this report, the main focus is to design and build an effective approach for enhancing
the model’s generalization capability across different fields. We consider there exists a feature
space, which contains the common features for both the seen and unseen domains and also
keeps the variance between different objects. In other words, this feature space maintains
the common information among all domains and is distinguishable across different objects.
In Chapter 2, we present a novel approach for domain generalization, Discriminative Ad-
versarial Domain Generalization (DADG), which leverages discriminative adversarial train-
ing to learn the domain-invariant features among all the source domains and uses meta-
learning-based cross-domain validation to improve the robustness of the classifier. We test
our approach on 4 benchmark DG datasets and compare it with other 8 state-of-the-art algo-
rithms. The experimental results show that our approach outperforms the other algorithms
on 3 datasets and achieves the second on the rest.
In Chapter 3, we proposed Epi-Curriculum, an approach for low-resource domain adap-
tation in Neural Machine Translation (NMT) for the popular encoder-decoder-based trans-
former language model. It effectively improves the model’s robustness to domain shift and
adaptability when only a few hundred of parallel training data are available during fine-
tuning. We leverage an episodic training framework to enhance the model’s robustness to
domain shift, where multiple tasks are designed to force the encoder or decoder to perform
well with another inexperienced decoder/encoder that has never trained on such domains
before. As an effective method for adapting to new domains [144], curriculum learning is
also used to rank the training data from easy to difficult and gradually present more dif-
ficult data tasks to the model. We evaluate our Epi-Curriculum on three language pairs:
English-German (En-De), English-Romanian (En-Ro), and English-French (En-Fr). The

2
experimental results show that our approach improves both the model’s adaptability and
robustness. For instance, it outperforms the baselines on unseen and seen domains by 2.28
and 3.64 BLEU [103] scores on the En-De task, 3.32 and 2.23 on the En-Ro task, 1.67 and
1.55 on the En-Fr task.
The objective of our DADG and Epi-Curriculum is to enhance the model’s generalization
capability to domain shift. Generalization capability encompasses various dimensions, each
critical for ensuring that models perform reliably in real-world scenarios. There are also other
aspects of generalization, such as class-imbalanced generalization [17], adversarial robustness
[54], and unseen task generalization [44].
In Chapter 4, we address the image class imbalance problem, which is prevalent in ma-
chine learning and often causes models to disproportionately favor the majority class while
underrepresenting minority classes. This occurs due to the equal weighting of samples during
training, leading to biased models with compromised accuracy. Traditional methods have
focused on adjusting sampling ratios or modifying loss functions, but these have not concen-
trated on the fundamental goal of deep learning: to effectively extract meaningful features.
We proposed SuperCon, which learns distinguished features for each class and trains a classi-
fier to handle the classification task. Specifically, we first try to learn a feature representation
that is closely aligned among the intra-classes and distantly apart from the inter-classes using
supervised contrastive learning, and then train a classifier on top of the distinct features. A
comprehensive evaluation using four different backbone networks has been conducted on two
benchmark datasets. The experimental results show that our SuperCon consistently outper-
forms the other baseline approaches. Moreover, our visualization analysis demonstrates that
SuperCon can learn better feature representation for classification tasks.
In Chapter 5, we investigate the effectiveness of ensemble learning for the class imbal-
ance problem. However, a primary challenge lies in effectively adjusting the strengths and
weaknesses of each model to optimize their contributions to the final prediction. We present
CS-AF, a cost-sensitive multi-classifier active fusion framework for skin lesion classification.

3
Two types of weights are defined to address the challenge: the objective weights and the
subjective weights. The objective weight indicates the reliability of the model during train-
ing and the subjective weight determines the confidence when the single model meets a new
sample during inference. We trained 96 base classifiers as the input of our fusion frame-
work, utilizing 12 CNN architectures on a skin lesion benchmark dataset. Compared with
other state-of-the-art methods, our CS-AF consistently outperforms the static fusion baseline
approaches in terms of accuracy.

4
Chapter 2: Discriminative Adversarial Domain Generalization with
Meta-learning-based Cross-domain Validation1

The generalization capability of machine learning models, which refers to generalizing

the knowledge for an “unseen” domain via learning from one or multiple seen domain(s), is
of great importance to developing and deploying machine learning applications in real-world
conditions. Domain Generalization (DG) techniques aim to enhance such generalization
capability of machine learning models, where the learned feature representation and the
classifier are two crucial factors to improve generalization and make decisions. In this paper,
we propose. Discriminative Adversarial Domain Generalization (DADG) with meta-learning
based cross-domain validation. Our proposed framework tries to learn a domain-invariant
feature representation from source domains and generalize it to unseen domains. It contains
two main components that work synergistically to build a domain-generalized Deep Neural
Network (DNN) model: (i) discriminative adversarial learning, which proactively learns a
generalized feature representation on multiple “seen” domains, and (ii) meta-learning based
cross-domain validation, which simulates train/test domain shift via applying meta-learning
techniques in the training process. In the experimental evaluation, a comprehensive compar-
ison has been made between our proposed approach and other existing approaches on three
benchmark datasets. The results show that DADG consistently outperforms a strong base-
line DeepAll, and outperforms the other existing DG algorithms in most of the evaluation
cases.
1
This chapter was published in Neurocomputing Volume 467 (refer to [19]). Permission is included in
Appendix A.

5
2.1 Introduction

Machine Learning (ML) and Deep Learning (DL) have achieved great success in numer-
ous applications, such as skin lesion analysis [105, 36], human activity recognition [129, 169],
active authentication [152], facial recognition [37, 98, 173], botnet detection [90, 167, 168]
and community detection [127, 170]. Most of the ML/DL applications are underlying the
assumption that the training and testing data are drawn from the same distribution (do-
main). However, in practice, it is more common that the data are from various domains.
For instance, the image data for the medical diagnosis application might be collected from
different hospitals, by different types of devices, or using different data preprocessing proto-
cols. The domain shift issue results in a rapid performance degradation, where the machine
learning applications is trained on “seen” domains and tested on other “unseen” domains.
Even well-known strong learners such as deep neural networks are sensitive to domain shifts
[50]. It is crucial to enhance the generalization capability of machine learning models in
real-world applications. Because, on one hand, it is costly to re-collect/label the data and
re-train the model for such “unseen” domains. On the other hand, we can never enumerate
all the “unseen” domains in advance.
Domain Generalization (DG), as illustrated in Figure 2.1, which aims to learn a domain-
invariant feature representation from multiple given domains and expecting good perfor-
mance on the “unseen” domains. It is one of the techniques that aims to enhance the
generalization capability of machine learning models. However, designing an effective do-
main generalization approach is challenging. First, a well-designed DG approach should be
model-agnostic. Domain shift is a general problem in the designing of ML/DL models, such
that the approach should not be designed for a specific network architecture. Second, an
effective DG approach should not be data-dependent. There exists different types of domain
shift, such as different art forms or different centric-images. A data-dependent approach can
lead promising results on some datasets. However, the approach can be overfitting to the

6
particular domain shift and might not have comparable performance on the other datasets.
Hence, it is a challenging task to design an effective DG approach.
To date, a few algorithms have been proposed to enhance the generalization capability of
ML/DL models. For instance, D-SAM [40] designs a domain-specific aggregation module for
each “seen” domain, and plugs it on a particular network architecture to eliminate the do-
main specific information. However, it is a model-based approach, because the aggregation
module is designed for a particular model, and additional implementation of aggregation
module is required when the model is changed. Hex [143] is proposed to learn robust rep-
resentations cross various domains via reducing the model dependence on high-frequency
textural information. The original supervised model is trained with an explicit objective to
ignore the superficial statistics, which only presents in certain datasets. Its representation
learning is fully unsupervised, and performs good on certain image datasets. However, due
to the assumption of domain shift and the unsupervised natural, Hex might not have the
promising performance on the other image datasets. Approaches that leveraging the idea of
meta-learning for domain generalization have been also proposed [79, 85, 5]. For instance,
MLDG [79] was inspired by MAML [45] to simulate the domain-shift and optimize meta-
train and meta-test together during the training phase. However, it only focuses on the
classifier optimization, and lacks of effective guidance on the feature representation learning,
where the better feature representation can benefit the classifier to make decisions.
In this paper, we present a novel DG approach, Discriminative Adversarial Domain Gen-
eralization (DADG). Our DADG contains two main components, discriminative adversarial
learning (DAL) and meta-learning based cross domain validation (Meta-CDV). We adopt
the DAL to learn the set of features, which provides domain-invariant representation for the
following classification task, and apply the Meta-CDV to further enhance the robustness of
the classifier. Specifically, on one hand, we consider the DAL component as a discriminator
that trains a domain-invariant feature extractor by distinguishing the source domain of corre-
sponding training data. On the other hand, we employ meta-learning optimization strategy

7
to “boost” the objective task classifier by validating it on previously “unseen” domain data
in each iteration. The two components guide each other from both feature representation
and object recognition level via a model-agnostic process over iterations to build a domain-
generalization model. Note that our DADG makes no assumption on the datasets, and it is
a model-agnostic approach, which can be applied to any network architectures.
In the experimental evaluation, a comprehensive comparison has been made among our
DADG and other 8 existing DG algorithms, including DeepAll (i.e., the baseline that simply
used pre-trained network, without applying any DG techniques), TF [78], Hex [143], D-SAM
[40], MMD-AAE [81], MLDG [79], Feature-Critic (FC) [85], and JiGen [15]. We conduct
the comparison and the evaluation of our approach on three well-known DG benchmark
datasets: PACS [78], VLCS [133] and Office-Home [140], utilizing two deep neural network
architectures, AlexNet and ResNet-18. Our experimental result shows that our approach
performs well at cross domain recognition tasks. Specifically, we achieve the best performance
on 2 datasets (VLCS and Office-Home) and perform 2nd best on PACS. For instance, on
VLCS dataset, we improve on the strong baseline DeepAll by 2.6% (AlexNet) and 3.11%
(ResNet-18). Moreover, an ablation study also conducted to evaluate the influence of each
component in DADG.
To summarize, our work has the following contributions:

• We present a novel, effective and model-agnostic framework, Discriminative Adversarial

Domain Generalization (DADG) to tackle the DG problem. Our approach adopts
discriminative adversarial learning to learn the domain-invariant feature extractor and
utilizes meta-learning optimization strategy to enhance the robustness of the classifier.

• To the best of our knowledge, DADG is the first work that uses meta-learning op-
timization to regularize the feature learning of discriminative adversarial learning in
domain generalization.

8
Train

art-painting Test

sketch cartoon
ML/DL Model

photo

Figure 2.1 The diagram of general multi-source domain generalization.

• A comprehensive comparison among our algorithm and the state-of-the-art algorithms

has been conducted. For the sake of reproducibility and convenience of future stud-
ies about domain generalization, we have released our prototype implementation of
2
DADG.

The rest of this paper is organized as follows: Section 2.2 presents the related litera-
ture review. Section 2.3 presents the notations in common domain generalization problem,
and describes our proposed algorithm. Section 2.4 presents the experimental evaluation.
Section 2.5 presents the conclusion.

2.2 Related Work

2.2.1 Generative Adversarial Nets (GAN)

Generative Adversarial Nets (GAN) [53] aims to approximate the distribution Pd of a

dataset via a generative model. GAN simultaneously trains two components generator G
2
https://github.com/keyu07/DADG

9
and discriminator D. The two components, generator and discriminator can be built from
neural networks (e.g., convolutional layers and fully connected layers). The input of G is
sampled from a prior distribution Pz (z) through which G generates fake samples similar to
the real samples. Meanwhile, D is trained to differentiate between fake samples and real
samples, and sends feedback to G for improvement. GAN can be formed as a two-player
minimax game with value function V (G , D):

min max V (G , D) = Ex∼Pd [log (D(x))]+

G D
(2.1)
Ez∼Pz [log (1 − D(G (z)))]

GAN-based discriminative adversarial learning is able to learn a latent space from mul-
tiple different domains, where the latent space is similar to the given domains. It has been
used in some domain adaptation works, which we will discuss below.

2.2.2 Domain Adaptation

Domain adaptation (DA) is one of the closely related work to domain generalization. The
main difference between DA and DG is that DA assumes unlabeled target data is available
during the training phase, but DG has no access to the target data. Many domain adaptation
algorithms [136, 50, 49, 135] are designed via mapping the source and the target domain into
a domain-invariant feature space. GAN or GAN-based discriminative adversarial techniques
have been utilized in many such domain adaptation works. For instance, ADDA [136] maps
the data from the target domain to the source domain through training a domain discrimina-
tor. DANN [50] is proposed to train a “domain classifier” to learn the latent representations
of the source and the target domains. Tzeng et al. [135] proposes to use multiple adversarial
discriminators to apply on the data of different available source domains. Discriminative
adaversarial learning successfully learns the domain-invariant feature representation, which

10
considered as a latent space that similar to all source domains. This success motivates us to
optimize the feature learning of domain generalization.

2.2.3 Domain Generalization

In contrast to domain adaptation, domain generalization is a more challenging problem,

because it requires no prior knowledge about the target domain. Given a ML/DL applica-
tion that has multiple “seen” or/and “unseen” domains, we observe that each domain has
two elements: the private element and the global element. The private element contains
the specific representation/information of each domain, while the global element holds the
invariant features across different domains. Most of the recent domain generalization works
aim to improve the learnt feature by using one of the two strategies: (i) Eliminating the
influence of the private elements or (ii) Extracting the global elements. Other than the
two main strategies, there are other alternative studies, such as a data augmentation based
method [132] and a recent self-supervised learning method JiGen [15]. JiGen [15] uses a
jigsaw-puzzle classifier to guide the feature extractor to capture the most informative part
of the images, and it achieves current state-of-the-art results on three domain generalization
benchmark datasets. We include JiGen [15] in all our evaluations.
Many model-enhancement based studies are proposed under the first strategy. For in-
stance, Li et al. [78] develops a low-rank parameterized network to decrease the size of
parameters. D’Innocente et al. [40] proposes to build domain-specific aggregation modules
and stack on the backbone network to merge specific and generic information. However, it
is a model based approach. Because one set of aggregation modules can only apply on one
particular backbone network. Additional implementation is required when we change the
network architecture. Hex [143] is proposed to learn robust representations cross various
domains via reducing the model dependence on high-frequency textural information. The
original supervised model is trained with an explicit objective to ignore the so called superfi-
cial statistics, which is presented in the training set but may not be present in future testing

11
sets. Its representation learning is fully unsupervised, and performs good on certain image
datasets. However, because the assumption of domain shift and the unsupervised natural
of Hex, it might not have the comparable good performance on the other image datasets.
However, designing an approach to weaken certain types of domain-specific elements may
suffer from overfitting on such domain elements. Though some outstanding results have
been shown by this kind of approaches on certain datasets, while may not be able to be
generalized to many more “unseen” domains. For instance, the different domain types are
considered as different art forms or different centric-images.
For the second strategy, most of the previous works are focusing on learning domain-
invariant representation, which is able to capture the important similar information among
multiple different domains and have the capability of generalizing to more “unseen” do-
mains. As such, these works are more similar to the work of domain adaptation. For
instance, Ghifary et al. [52] proposes to learn domain-invariant features via a multi-domain
reconstruction auto-encoder. However, the effectiveness for reconstruction the auto-encoder
is limited while applying to more complex datasets [78]. Motiian et al.[96] employs max-
imum mean discrepancy (MMD) and proposes to learn a latent space that minimizes the
distance among images that have the same class label but different domains. Li et al. [81]
proposes to align source domains to learn a domain-agnostic representation using adversarial
autoencoders with MMD constraints, and uses adversarial learning to match the distribution
of generated data with a prior distribution.
Our approach also belongs to the second strategy. We use discriminative adversarial
learning to learn a latent distribution among the source domains. By doing so, we achieve a
domain-invariant feature representation that different domains are indistinguishable. Beyond
the domain-invariant feature representation, in order to improve the relevant classification
task, we also propose a more robust classifier, by using meta-learning based optimization,
which leads to more competitive classification results. To the best of our knowledge, this is

12
the first work that uses meta-learning optimization to regularize discriminative adversarial
learning in domain generalization.

2.2.4 Meta-Learning

Meta-learning introduces a concept “learning-to-learn” and recently receives great inter-

ests with applications including few-shot learning [45, 100, 113] and learning optimizations
[82, 3]. It learns from various tasks during training and such that the model can be quickly
generalized to new tasks. MAML [45] is typical in those works. It utilizes sampled episodes
during training, where each episode is designed to simulate the few-shot tasks in a train-test
split manner. Recently, a few works have applied this episodic meta-learning optimization
method in domain generalization [79, 5, 85]. For instance, MLDG [79] borrows the idea of
[45] to optimize the classifier, by simulating the train-test domain shift during training phase.
MetaReg [5] proposes to learn a regularization function for the network classifier. Li et al.
[85] proposes to simultaneously learn an auxiliary loss and measure whether the performance
of the validation set has been improved. However, MLDG [79] and MetaReg [5] only focus
on classifier optimization, and are lacking of details addressing the learning of a domain-
invariant feature space. The success of the meta-learning method on the enhancement of
classifier robustness motivates us to optimize the network classifier for domain generalization.
To summarize, in order to address the challenging domain generalization problem, we apply
discriminative adversarial learning and meta-learning, where the discriminative adversarial
learning extracts domain-invariant feature representation, and meta-learning enhances the
classifier robustness.

2.3 Methodology

The design of DADG is based on our assumption that there exists a domain-invariant
feature representation, which contains the common information for both the “seen” and
“unseen” domains. It should satisfy the following properties: (i) The feature representation

13
should be invariant in terms of data distributions (domains). Since ML/DL models are
designed to transfer the knowledge from seen domains to unseen domains, they could fail
if the distributions differ a lot. (ii) It should keep the variance between different objects
(classes). This helps the model to capture the unique information of different objects and
to make precise decisions. We use two key components in DADG to address the above
two properties: discriminative adversarial learning (DAL) and meta-learning based cross
domain validation (Meta-CDV). DAL aims to learn a domain-invariant feature representation
where different data distributions are indistinguishable. Therefore, the domain variance will
be minimized. Meta-CDV brings the learnt features to supervised learning by training a
classifier in a meta-learning manner. It evaluates the validation performance of previous
unseen domains within each training iteration.
We introduce Figure 2.2 to better illustrate our DADG in high level. The goal of DADG
is to find the optimized feature representation point, which satisfies the two properties. A,
B and C present the different domains. DAL and Meta-CDV address DG in two aspects: (i)
As shown by the orange lines, the dash lines are the gradient directions when tackling feature
learning on different domains ∇DA and ∇DB , respectively. While the solid line is the actual
gradient direction guided by DAL and finally reaches a representation point indistinguishable
from given domains. (ii) As shown by the blue lines, the dash lines indicate the gradient
directions when solving certain tasks ∇TA and ∇TB , respectively. While the solid line
denotes the actual gradient direction led by classification task on two domains and further
optimized by cross domain validation (∇TC ). The model finally learns a domain-invariant
feature representation point that satisfies the two properties.
In the rest of this section, we denote the input data space as x ∈ X , the class label space
as y ∈ Y and the domain label (i.e., belonging to which distribution) space as y d ∈ Y d . The
source domains are described as Di ∈ S, and the target domains as T . Also, please note
that in the rest of this section, the superscript of each parameter indicates different updating
stages within one iteration, denoted as m, while the subscript indicates different iterations,

14
∇𝑇𝐶

∇𝑇𝐵
∇𝐷𝐵

∇𝐷𝐴
∇𝑇𝐴

Figure 2.2 The diagram of DADG.

denoted as n. We introduce our two main components in the remaining sections: DAL in
2.3.1 and Meta-CDV in 2.3.2. Finally, we summarize the two components together in 2.3.3.

2.3.1 Discriminative Adversarial Learning

As described above, the goal of this component is to learn a domain classification model,
which aims to classify data from different domains. We consider our DAL containing two
parts: (i) a feature extractor fθ with parameter θ, and (ii) a discriminator dψ with parameter
ψ. Both θ and ψ are learnable parameters during training phase.
In our approach, we first randomly divide the source domains S into two mutually exclu-
sive sets: Sd for DAL and Sc for Meta-CDV. The discriminator acts as a domain classifier,
which takes the learnt sample features fθ (xj ) of each arbitrary input xj and tries to discrim-
inate its domain label y d . Thus, we need to learn the parameters (ψ) that minimize the

15
classification loss, which as follows:

Ldisc (dψnm (fθnm (xj )), yjd ) (2.2)

The loss function of DAL is presented as follows:

X X
F (·) = Ldisc (dψnm (fθnm (xj )), yjd ) (2.3)
Di ∈Sd xj ∈Di

The objective of the feature extractor is to maximize the discriminative loss, to achieve
indistinguishable of the learnt feature representation. Following the design of GAN [53], the
objective function of our discriminative adversarial learning can be written as the following
minimax optimization:
argmin max
m
F (·) (2.4)
ψnm θn

Such minimax parameter updating can be achieved by gradient reversal layer (GRL)
[49], which placed between the feature extractor and discriminator. During forward propa-
gation, GRL keeps the learnable parameters same. During back propagation, it multiplies
the gradient by −λ and passes it to the preceding layer.
To summarize, we update the parameters of feature extractor and discriminator as fol-
lows:
θnm+1 ← θnm − α · ∇(−λ · F (·)) (2.5)

m
ψn+1 ← ψnm − α · ∇F (·) (2.6)

where the α is the DAL learning rate. Thereafter, the θnm+1 will be shared in further training
within the same iteration (as we illustrated in Figure 2.3 step ○),
1 and φm
n+1 will be used in

the next iteration.

16
2.3.2 Meta-learning-based Cross Domain Validation

After the feature extractor has been trained to minimize the domain variance, we adopt
meta-learning based cross domain validation (Meta-CDV) to address the enhancement of
the classifier robustness. Robust classifier is able to help the feature extractor to keep the
discriminant power between various classes. This is accomplished by training the classifica-
tion model on 2 seen domains Sd in DAL and validating the performance on cross domains
Sc .
To train the model on seen domains Sd , the classification model is composed of the feature
extractor fθ from DAL and a classifier cφ with parameters φ. The training loss is defined as
follows:
Ltrain (cφmn (fθnm+1 (xj ), yj ) (2.7)

where xj is an arbitrary input and yj is the corresponding output label.

The loss function of classification training on seen domains is presented as follows (as
illustrated in Figure 2.3 step ○):
2

X X
G (·) = Ltrain (cφmn (fθnm+1 (xj )), yj ) (2.8)
Di ∈Sd xj ∈Di

Note that the training is performed over the updated feature extractor parameters θm+1
in DAL. As such, the parameters are updated as follows:

θnm+2 ← θnm+1 − β · ∇ G (·) (2.9)

φm+1
n ← φm
n − β · ∇ G (·) (2.10)

where the β is the classification learning rate. Here the updated parameter θnm+1 is involved
in the calculation of training loss. It also means that we need the second derivative with
respect to θ, while minimizing the loss function 2.8.

17
After finishing the classification task on seen domains, we evaluate the performance
on cross domains Sc to boost the classification model. This process simulates the virtual
train/test settings. The evaluation is performed on the updated parameters θnm+2 and φm+1
n

(as illustrated in Figure 2.3 step ○).

3 More concretely, this evaluation comes up with the
cross domain validation loss:
Lval (cφm+1
n
(fθnm+2 (xj )), yj ) (2.11)

The loss function of cross domain validation is as follows:

X X
H(·) = Lval (cφm+1
n
(fθnm+2 (xj )), yj ) (2.12)
Di ∈Sc xj ∈Di

Finally, as illustrated in Figure 2.3 step ○,

2 ○4 and ○,
5 we update our classification model

by adding the training loss Ltrain and cross domain validation loss Lval at the end of each
iteration:
m
θn+1 ← θnm+1 − γ · ∇H(·) (2.13)

φm m
n+1 ← φn − γ · ∇H(·) (2.14)

where γ presents the learning rate of cross domain validation.

Note that the parameter updating on seen domains classification is performed over the
parameter θnm+1 and φm
n , whereas the cross domain validation is evaluated over parameter

θnm+2 and φm+1

n . In other words, the optimization of our classification model is involved in
the third derivative with respect to θ and the second derivative with respect to φ.

2.3.3 Summary of DADG

As illustrated in Figure 2.3, the DAL and Meta-CDV optimize the model by addressing
different aspects of domain generalization, and work synergistically within one iteration.
In each iteration, we randomly split the train/validation (Sd /Sc ) domains. DAL learns a
domain-invariant feature extractor (fθ ) by maximizing the discriminative loss. Then, our

18
Feature
extractor GRL Discriminator (𝑑𝜓 𝑚 ) ℒdisc
( )

Feature
extractor Classifier (𝑐𝜑𝑚 ) ℒtrain
( ) 2
5

3 3 ∇(ℒ train + ℒ val)

Feature
Classifier (𝑐𝜑𝑚+1 ) 4
extractor ℒ val
( )

Figure 2.3 The training flow of DADG.

approach learns a robust classification model by adopting a simple classification training

and cross domain validation, which optimized in meta-learning based manner. For the whole
process, the objective function can be introduced as:

argmin max
m
F (·) + argmin (G (·) + H(·)) (2.15)
ψnm θn
θnm+1 , φm
n

Once Equation 2.15 is optimized to converge on the source domains, we evaluate the
classification model using unseen domains.

2.4 Experimental Evaluation

We conduct our experiments on 3 benchmark datasets (PACS [78], VLCS [133] and
Office-Home [140]) and 2 deep neural network architectures with pre-trained parameters

19
(AlexNet [72] and ResNet-18 [59]) to evaluate the generalization capability of our proposed
approach. A comprehensive comparison has been made among our approach and other
baseline approaches. The presented results are shown that our DADG performs consistently
comparable in all the evaluations, and achieves the state-of-the-art results in two datasets.
The effectiveness of each component in our approach also discussed. All the details are
described in following.

2.4.1 Baseline Approaches

We compare our proposed approach performance with following baseline DG approaches.

• DeepAll is the baseline that simply use deep learning network to train the aggregation
of all source domains and test the unseen domain. It is a strong baseline that surpasses
many previous DG works [78].

• TF [78] introduces a low-rank parameter network to decrease the size of parameters.

This work also shows that the DeepAll can surpass many previous studies and first
provides PACS dataset.

• Hex [143] attempts to reduce the sensitivity of a model on high frequency texture
information, and thus to increase model domain-robustness.

• MMD-AAE [81] is based on adversarial autoencoder. It aligns different domain dis-

tributions to an arbitrary prior via MMD regularization, to learn an invariant feature
representation.

• Feature-Critic(FC) [85] aims to train a robust feature extractor. It uses meta-learning

approach, along with an auxiliary loss to measure whether the updated parameter has
improved the performance on the validation set.

20
• MLDG [79] is the first work that addresses domain generalization using meta-learning.
MAML inspires it [45] and proposed visual cross domain classification task by splitting
source domains into meta-train and meta-test.

• D-SAM [40] plugs parallel domain-specific aggregation modules on a given network

architecture to neglect domain specific information.

• JiGen [15] is the first work that addresses DG by self-supervised learning. It divides
each image into small patches and shuffles the order. Then, trains an object classifier
and a jigsaw order classifier simultaneously. It achieves the state-of-the-art results on
the three datasets VLCS [133], PACS [78] and Office-Home [140].

2.4.2 Experimental Datasets

We utilize three well-known domain generalization benchmark datasets:

• VLCS [133] is composed of 10,729 images with resolution 227 × 227, taken from 4
different datasets (i.e., domains): PASCAL VOC2007 [41], LabelMe [116], Caltech101
[43] and Sun09 [25]. It depicts 5 categories (i.e., classes): bird, car, chair, dog and
person.

• PACS [78] contains more severe domain shifts than VLCS. PACS aggregates 9,991
images in 7 different classes: dog, elephant, giraffe, guitar, house, horse and person. It
shared by 4 different domains: Photo, Art, Cartoon and Sketch.

• Office-Home [140] was created to evaluate DA and DG algorithms for object recognition
in deep learning. There are 15,592 images from 4 different domains: Art, Clipart,
Product and real-world images, each domain includes 65 classes.

21
Table 2.1 Cross domain classification accuracy (in %) on VLCS dataset.

VLCS VOC LabelMe Caltech Sun Avg.

AlexNet
TF [78] 69.99 63.49 93.63 61.32 72.11
HEX∗ [143] 68.51 63.67 89.63 62.12 70.98
MMD-AAE [81] 67.70 62.60 94.40 64.40 72.28
FC∗ [85] 66.79 61.48 95.68 63.13 71.77
MLDG∗ [79] 70.01 61.06 95.68 65.08 72.96
D-SAM [40] 63.75 54.81 94.96 64.56 69.52
JiGen [15] 70.62 60.90 96.93 64.30 73.19
DeepAll 68.11 61.30 94.44 63.58 71.86
DADG 70.77 63.44 96.80 66.81 74.46
ResNet-18
MLDG∗ [79] 74.41 63.45 96.75 69.35 75.99
D-SAM∗ [40] 70.42 58.70 88.90 71.36 72.35
JiGen∗ [15] 74.91 63.00 98.39 69.37 76.42
DeepAll 73.84 62.17 97.10 67.28 75.10
DADG 76.17 67.22 98.50 70.95 78.21

2.4.3 Experimental Setting

All three benchmark datasets contain the data of four different domains. We first hold
one domain (i.e., the target domain) for testing and the rest three for training. Then, in the
training phase, we randomly select two domains to apply discriminative adversarial learning
(DAL), and select one domain to boost our classifier by meta-learning based cross domain
validation (Meta-CDV). Our discriminator consists of two fully connected layers with 1024
neurons each and one output layer with 1 neuron.
The neural network is updated by stochastic gradient descent(SGD) in 2000 iterations
during training. We use cross-entropy loss for both DAL (domain classification task) and
Meta-CDV (classification task). Negative log-likelihood loss also tested for classification
task, but it hardly effects the performance.
For most the hyperparameters, we followed MLDG: base classification learning rate β =
5 × 10−4 , cross domain validation learning rate γ = 5 × 10−4 , momentum = 0.9 and weight

22
Table 2.2 Cross domain classification accuracy (in %) on PACS dataset.

PACS Photo Art-paint Cartoon Sketch Avg.

AlexNet
TF [78] 89.50 62.86 66.97 57.51 69.21
HEX [143] 87.90 66.80 69.70 56.30 70.18
FC [85] 90.10 64.40 68.60 58.40 70.38
MLDG[79] 88.00 66.23 66.88 58.96 70.02
D-SAM [40] 85.55 63.87 70.70 64.66 71.20
JiGen [15] 89.00 67.63 71.71 65.18 73.38
DeepAll 88.65 63.12 66.16 60.27 69.55
DADG 89.76 66.21 70.28 62.18 72.11
ResNet-18
MLDG∗ [79] 94.03 76.42 73.03 68.15 77.91
D-SAM [40] 95.30 77.33 72.43 77.83 80.72
JiGen [15] 96.03 79.42 75.25 71.35 80.51
DeepAll 93.06 75.60 72.30 68.10 77.27
DADG 94.86 79.89 76.25 70.51 80.38

decay = 5 × 10−5 . The DAL learning rate is α = 5 × 10−5 . While a big α value will lead
an unstable training process and 5 ×10−5 is appropriate for PACS, VLCS and Office-Home.
The value of α should be picked carefully on the other datasets, 1/10 of the β and γ is
suggested. The model-agnostic can be achieved by simply changing the backbone network
architectures without additional implementation. All of our experiments are implemented
using PyTorch, on a server with GTX 1080Ti 11 GB GPU.

2.4.4 Effectiveness Analysis

In this section, we discuss the performance of our proposed approach and the baseline
approaches in terms of classification accuracy. Table 2.1 - Table 2.3 show the results of
datasets VLCS, PACS and Office-Home. To make a more comprehensive comparison, we
implement MLDG our own, because only demo code is provided by the author. Besides, we
implement Hex, Feature-Critic, D-SAM and JiGen by using the code that are provided by

23
the authors. All the implementations are evaluated on the datasets or network architectures
they did not report. Our results of these approaches are highlighted in the three tables with
*. The details of each dataset are presented below:
For VLCS dataset, we follow the standard protocol of MTAE [52] to randomly divide the
data of each source domain into training (70%) and testing (30%) sets. Finally, we test on
all the images in target domain. The upper and bottom parts of Table 2.1 show the results
when using different network architectures AlexNet and ResNet-18, respectively. From table
2.1, we can observe that (i) The baseline DeepAll performs competitively and surpasses
many previous DG works on overall performance, such as HEX, Feature-Critic and D-SAM.
But our approach outperforms DeepAll in all target domain cases and on different network
architectures. (ii) On AlexNet, our DADG performs better than DeepAll by 2.6% and better
than Jigen by 1.27%, such that we achieve the new state-of-the-art result on VLCS dataset.
More specifically, DADG provides the best results in two (i.e., VOC and SUN respectively)
out of four target cases. (iii) On ResNet-18, DADG surpasses the previous SOTA result
Jigen in average performance and performs the best in three out of four target domain cases.
For PACS dataset, we follow the training protocol of TF, considering three domains as
source domains and the remaining one as target. The evaluation results are shown in Table
2.2, we can see that: (i) On AlexNet, although we do not achieve the best performance on
any target domain cases, our DADG provides consistently comparable results, and performs
the 2nd best in average results. (ii) On ResNet-18, we have two best results on Art-paint
(79.89%) and Cartoon (76.25%), and only slight worse (0.34%) than the best JiGen in average
performance.
For Office-Home dataset, we follow the protocol of D-SAM, also considering three as
source domains and the rest one as target. The results are shown in Table 2.3, and we can
observe that: (i) The advantage of of D-SAM in average results originates from its results
on Art and Clipart, but the rest two were lower than DeepAll. (ii) Our DADG achieves the

24
Table 2.3 Cross domain classification accuracy (in %) on Office-Home dataset.

Office-Home Art Clipart Product Real-World Avg.

ResNet-18
MLDG∗ [79] 52.88 45.72 69.90 72.68 60.30
D-SAM [40] 58.03 44.37 69.22 71.45 60.77
JiGen [15] 53.04 47.51 71.47 72.79 61.20
DeepAll 54.31 41.41 70.31 73.03 59.77
DADG 55.57 48.71 70.90 73.70 62.22

best in two target cases and the best in average results, and improves the previous SOTA
result Jigen by 1.02%.
To summarize the experimental evaluation, we can conclude that: (i) DeepAll exceeds
many previous approaches in different datasets. In general, only MLDG, JiGen and our
DADG can outperform DeepAll in all three datasets. (ii) As we mentioned in Section 2.2.3.
The approaches that aim to neglect particular domain-specific information, may assist the
model in some datasets but fail in others. For instance, HEX and D-SAM are better than
DeepAll on PACS, but worse than DeepAll on VLCS. (iii) our DADG has consistently
comparable results in all the datasets and achieves the SOTA results on VLCS and Office-
Home, also the second best on PACS. On VLCS and Office-Home, DADG outperforms the
previous SOTA JiGen all over 1%.

2.4.5 Impact of Different DADG Components

In this section, we conduct an extended study using PACS dataset with network architec-
ture AlexNet to investigate the impact of the two key components (i.e., DAL and Meta-CDV)
in our proposed approach DADG. Specifically, we test the performance in terms of classifica-
tion accuracy by excluding each component in our approach respectively. DADG-DAL only
contained the discriminative adversarial learning (DAL) component and trained the classifi-
cation model conventionally instead of in meta-learning manner. While DADG-CDV meant

25
Table 2.4 Cross domain classification accuracy (in %) on PACS dataset.

PACS Photo Art-paint Cartoon Sketch Avg.

AlexNet
DeepAll 88.65 63.12 66.16 60.27 69.55
DADG-DAL 89.51 65.43 69.19 61.70 71.46
DADG-CDV 89.10 64.22 68.24 60.60 70.54
DADG 89.76 66.21 70.28 62.18 72.11

that we removed the DAL component and only updated the classification model parameters
in meta-learning manner.
From the results in Table 2.4, we can see that DADG-DAL and DADG-CDV consistently
perform better than DeepAll, and our full version DADG surpasses both baseline models in
average performance and in every target domain cases. In the comparison between DADG-
DAL and DADG-CDV, the DADG-DAL is consistently better than the DADG-CDV. The
results in Table 2.4 show that: (i) Employing discriminative adversarial learning is able
to effectively guide the feature extractor to learn the invariant features among multiple
source domains. (ii) Since the only difference between DeepAll and DADG-CDV is the
updating manner. Thus, applying meta-learning based cross domain validation can make
the classification model more robust. (iii) The full version DADG consistently performs the
best in every single case, which has shown that combining domain invariant representation
and robust classifier together helped the model to enhance generalization. (iv) The domain-
invariant representation plays a more crucial role rather than the robust classifier. Because
the invariant representation provides an easier task for the classifier to make decision.

2.4.6 Impact of Linear Related Domains

We assume that there exists a domain-invariant feature representation for both source
and target domains. However, it is also possible that some target domains are less relevant
or even irrelevant to the source domains.

26
The domain types were considered as different art forms (art, cartoon in PACS) or
different centric images (LabelMe and SUN in VLCS) in previous sections. It is very hard
to define whether the target domain is less relevant to the source domains. To explore this
situation, we conduct an experiment using digit images in six different angles as six different
domains. To be more specific, we adopt the MNIST [75] dataset and randomly chose 1,000
images in each class and trained with AlexNet [73]. We denote the digit images rotated with
0◦ by R0 and then rotate the digit images in a counter-clockwise direction by 15◦ , 30◦ , 45◦ ,
60◦ and 75◦ . Since the rotation angles are continues related, which means sometimes the
target domains are out of the scope of source domains (irrelevant). For example, when the R0
and R15 are as the target domains, we consider that the target domains are out of the scope
of source domains. During the training phase, 4 domains are selected as source domains
and the rest 2 are target domains. For each iteration, our DADG randomly adopts 2 source
domains for DAL and another 2 for Meta-CDV. The model performance will evaluated on
the rest 2 target domains. A comparison is made among DeepAll, MLDG [79] and DADG.
From the results in Table 2.5, we can see that: (i) For all 3 approaches, the performance
are better when the target domains close to 30◦ and 45◦ , and much worse when the target
domains are close to 0◦ and 75◦ . (ii) Our DADG outperforms the other two approaches
in 10 out of total 15 different cases and achieves the best overall average accuracy among
the 3 approaches. The results show the performance drop when the target domains are
irrelevant to the source domains. It happens to all the approaches and can be considered as
a common situation in domain generalization. Although our DADG outperforms other 2 in
average, only 10/15 better than the MLDG. Compare to the performance on VLCS, PACS
and Office-Home (Table 2.1, 2.2, and 2.3), our DADG does not show significant advantage
on this experiment. Because we select 2 source domains to do discriminative adversarial
learning (DAL), and the rest source domains will train with Meta-CDV. When the number
of source domains increased, DAL only contributes small portion in each iteration. As we
mentioned in the section 2.4.5 (iv), DAL plays a more critical role than Meta-CDV. Finally,

27
Table 2.5 Cross domain classification accuracy (in %) on MNIST dataset.

Source Target DeepAll MLDG∗ [79] DADG

AlexNet
R30 , R45 , R60 , R75 R0 , R15 69.35 69.51 69.57
R15 , R45 , R60 , R75 R0 , R30 89.30 88.89 89.53
R15 , R30 , R60 , R75 R0 , R45 89.16 89.18 89.29
R15 , R30 , R45 , R75 R0 , R60 88.72 89.10 89.18
R15 , R30 , R45 , R60 R0 , R75 84.55 84.64 84.59
R0 , R45 , R60 , R75 R15 , R30 92.62 92.56 92.72
R0 , R30 , R60 , R75 R15 , R45 94.17 94.43 94.33
R0 , R30 , R45 , R75 R15 , R60 94.58 94.65 94.61
R0 , R30 , R45 , R60 R15 , R75 90.16 90.16 90.19
R0 , R15 , R60 , R75 R30 , R45 92.09 92.41 92.62
R0 , R15 , R45 , R75 R30 , R60 94.35 94.44 94.40
R0 , R15 , R45 , R60 R30 , R75 89.93 90.13 90.24
R0 , R15 , R30 , R75 R45 , R60 92.00 92.28 92.22
R0 , R15 , R30 , R60 R45 , R75 89.47 89.54 89.65
R0 , R15 , R30 , R45 R60 , R75 74.14 74.80 74.83
Average 88.31 88.45 88.53

if we have a great number of source domains, the contribution of DAL can be even negligible.
Thus, our DADG is sensitive to the number of source domains.

2.5 Conclusion

In this paper, we proposed DADG, a novel domain generalization approach, that con-
tains two main components, discriminative adversarial learning and meta-learning based
cross domain validation. The discriminative adversarial learning component learns a domain-
invariant feature extractor, while the meta-learning based cross domain validation component
trains a robust classifier for the objective task (i.e., classification task). Extensive experi-
ments have been conducted to show that our feature extractor and classifier could achieve

28
good generalization performance on three domain generalization benchmark datasets. Exper-
imental results indicate that the feature extractor and classifier achieve good generalization
on three benchmark domain generalization datasets. The experimental results also show
that our approach consistently beat the strong baseline DeepAll. For instance, while using
PACS dataset, our approach performs better than DeepAll by 1.56% (AlexNet) and 3.11%
(ResNet-18). Notably, we also reach the state-of-the-art performance on VLCS and Office-
Home datasets, and improve the average accuracy by over 1% in each case. As we mentioned
in the 2.4.6, in current stage, our DADG is sensitive to the number of source domains be-
cause the DAL tends to be unimportant with the increasing number. In the future work, we
plan to address this limitation and design an approach that can handle various number of
source domains.

29
Chapter 3: Epi-Curriculum: Episodic Curriculum Learning for Low-Resource
Domain Adaptation in Neural Machine Translation3

Neural Machine Translation (NMT) models have achieved comparable results to human
translation with a large number of parallel corpora available. However, their performance
remains poor when translating on new domains with a limited number of data. Recent
studies either only show the model’s robustness to domain shift or the superiority in adapt-
ing to new domains with a limited number of data. A solution for addressing both the
model’s robustness and adaptability is underexplored. In this paper, we present a novel
approach Epi-Curriculum to address low-resource domain adaptation (DA), which contains
a new episodic training framework along with a denoised curriculum learning. Our episodic
training framework enhances the model’s robustness to domain shift by episodically expos-
ing the encoder/decoder to an inexperienced decoder/encoder. The denoised curriculum
learning filters the noised data and further improves the model’s adaptability by gradu-
ally guiding the learning process from easy to more difficult tasks. Extensive experiments
have been conducted on English-German (En-De), English-Romanian (En-Ro), and English-
French (En-Fr) translation tasks. Our results show that: (i) Epi-Curriculum outperforms the
baseline on unseen and seen domains by 2.28 and 3.64 BLEU score on En-De task, and 3.32
and 2.23 on En-Ro task; (ii) Our episodic training framework outperforms the recent popular
meta-learning framework in terms of robustness to domain shift and achieves comparable
adaptability to new domains.
3
This chapter was published in IEEE Transactions on Artificial Intelligence (refer to [21]). Permission is
included in Appendix A.

30
3.1 Introduction

Neural Machine Translation (NMT) [124] has yielded state-of-the-art translation perfor-
mance for many tasks [8, 77, 112]. However, these models tend to perform poorly on the
domains that present very different distributions from those of the training data [71, 27]. For
instance, a model trained solely on the news domain is likely to underperform in the medical
domain. It has been approved that models can be trained to perform well on the given
domains with large in-domain (target domain) corpora [8, 9], but the target domain(s) may
be unknown when a model is built, and the amount of training data is limited (low-resource)
[175, 56]. Thus, it is of great importance to have a model that is robust enough to perform
well on new domains with a limited amount of data available.
However, training a good model for low-resource domain adaptation is challenging. As
stated in [74], a good model should have two key properties: adaptability and robustness.
The adaptability indicates the model can quickly adapt to a new domain with only a small
amount of target domain data. The robustness means the model should perform well on
both seen domains (domains in the training data) and unseen domains (domains not in the
training data). Additionally, combined with the definition of robustness in [97], we consider
a robust model should also perform well before fine-tuning without any adaptation on target
domains.
Many existing works are trying to handle domain adaptability. A general method is to
fine-tune the model on the limited in-domain data [88, 34], which refers to the concept of
domain adaptation (DA). However, with only a few amount of in-domain data, the model
suffers from so-called catastrophic forgetting [130]. In terms of improving the adaptability to
new domains, MAML (Model-Agnostic Meta-Learning) [44] has shown its strength with only
a small amount of data. It refers to a concept of episodic learning, which involves training a
model on a series of distinct but related tasks, such that the model can generalize and quickly
adapt to new tasks. This idea has been investigated in a few NMT works [120, 83, 158, 123].
For example, Meta-MT [120] meta-trains the model to obtain good initialized parameters,

31
such that it can be later rapidly adapted with a few amount of in-domain data. Based
on Meta-MT, Meta-curriculum [158] leverages curriculum learning [10] to further enhance
the performance on unseen domains. However, it has been shown that the MAML-based
methods sacrifice the robustness before fine-tuning and on seen domains to achieve more
pronounced adaptability [120, 158, 74].
Existing works for addressing the robustness mainly focus on the model’s awareness of
domain shift. An intuitive solution is adding auxiliary networks to be aware of the domain
shift, where the additional networks could be a domain adaptor [6] or a sub-network [106].
Another way to enhance the model awareness is instance weighting [46, 66]. This method
usually adjusts the loss function to weight training samples according to the target domain
relevance. However, these works only lean on the adaptability of seen domains.
To handle both adaptability and robustness, we present Epi-Curriculum, which contains
two main components: an episodic training framework and curriculum learning. Inspired
by [80], we generalize its insight to the general transformer architecture (encoder-decoder)
for improving the model’s robustness to domain shifts in NMT. Specifically, we train the
model by synthesizing the real domain shift when an encoder/decoder combines with an-
other inexperienced decoder/encoder that has never trained on such domains before. By
optimizing the combination of an encoder/decoder and an inexperienced decoder/decoder,
both the encoder and decoder will be robust enough to overcome domain shift. The insight
is that, a neural network performs poorly on a new domain because the input statistics are
different from the network’s expectations (different distribution of what it was trained on).
In other words, the current layer accepts unexpected statistics from its previous layer, and
the current layer will also produce unexpected statistics for the next layer. Therefore, if the
neural network can be trained to perform well with unexpected inputs, its robustness to new
domains will be enhanced.
Curriculum learning ranks the training data from easy to difficult and gradually presents
more difficult tasks to the NMT model during training, which has been shown as an effective

32
method for adapting to new domains [10, 95, 149, 158]. Inspired by this, curriculum learning
is plugged into our episodic training framework to guide the model for better adaptation. We
follow the general curriculum learning framework of difficulty measure and training scheduler
[150], where difficulty measure determines the relative “difficulty” of each training sample
and training scheduler decides the sequence of data subsets throughout the training process.
Additionally, the NMT model’s performance can be degraded by noisy training data [69],
where the noise can be many types, such as wrong language content and misaligned sentences.
It has been approved that curriculum learning is also an effective method for data cleaning
[148, 149]. Thus, to let the model focus on the highly relevant in-domain corpus, curriculum
learning is also applied to filter the data.
We evaluate our Epi-Curriculum on English-German (En-De), English-Romanian (En-
Ro), and English-French (En-Fr) translation tasks with 10, 9, and 9 different domains. A
general experimental setting of domain generalization uses several domains as source (seen)
domains to train the model and fine-tune the model on each target domain with only a few
hundred in-domain data. In our experiment, we always have 5 source domains and the rest
are target domains. BLEU score [103, 108] is reported and the experimental results show
that Epi-Curriculum improves the model’s robustness and adaptability on both seen and
unseen domains. For instance, it outperforms the baselines on unseen and seen domains by
2.28 and 3.64 on the En-De task, 3.32 and 2.23 on the En-Ro task, 1.67 and 1.55 on the
En-Fr task. We also demonstrate the model’s robustness to domain shift, where the encoder
and decoder improve the baseline by 2.55 and 2.59 BLEU scores, respectively. Moreover,
comprehensive studies were conducted to evaluate the curriculum validity, the impact of
different training schedulers, and the impact of denoising.
Our contributions mainly lie in the following aspects:

• We introduce a novel and effective framework Epi-Curriculum, to handle both the

model’s adaptability and robustness for low-resource domain adaptation in NMT. Pre-
vious works have investigated episodic training for the model’s adaptability. While this

33
is the first attempt to address the robustness by using episodic training. Our curricu-
lum learning not only guides the model from easy to difficult tasks but also denoises
the training data.

• We evaluate Epi-Curriculum on three language pairs, empirically showing the strength

of our proposed approach. A comprehensive analysis was also conducted to demon-
strate the model’s robustness and the impact of our curriculum learning.

• For the sake of reproducibility and convenience of future studies for a fair comparison,
we have released our prototype implementation and our selected data4 .

The rest of this paper is organized as follows: Section 2 presents the related literature
review. Section 3 presents the notations and our algorithm design. Section 4 presents the
experimental evaluation. Section 5 concludes.

3.2 Related Work

3.2.1 The Encoder-Decoder Model and NMT

The encoder-decoder model has been widely used as the standard architecture for NMT
[4]. Given a source sentence S = (s1 , s2 , ..., sN ), the encoder-decoder model maps it into
a target sentence T = (t1 , t2 , ..., tM ), which can be formalized as the product of series of
conditional probabilities:
M
Y
P(T |S) = P(tm |t<m , S) (3.1)
m=1

Generally, the encoder takes a variable-length source sentence and generates a fixed-
length numerical representation, which then directly passes to the decoder and returns a
meaningful sentence in the target language.
By introducing the self-attention and multi-head attention to the prevailing encoder-
decoder architecture, the transformer model [139] obtains promising results in many different
4
https://tinyurl.com/4ztfea55

34
NMT datasets [8, 47]. Following the successes of the transformer model, Raffel et al. [112]
extensively propose the T5 model. Fortunately, T5 has released the model with pre-trained
parameters for generality and reproducibility, and we will use the pre-trained T5 model as
the test bench in this paper.

3.2.2 Domain Adaptation for NMT

Many works in the literature try to address the challenge of adaptability and robustness.

3.2.2.1 Adaptability

An effective method is curriculum learning [10]. It sorts the training data in complexity
or other meaningful order instead of random sampling to make the model gradually learn
from the training data within a specific order. A general curriculum learning includes two
steps: difficulty measure and training scheduler. Difficulty measure determines the relative
“difficulty” of each training sample, and the training scheduler decides the sequence of data
subsets throughout the training process [150]. In the context of domain adaptation, the
“difficulty” is usually evaluated by domain divergence [10, 149], where the higher divergence
score indicates that the sentence has more in-domain features and is more divergent from the
samples that were used for pre-trained the model. For instance, Zhang et al. [163] fine-tuned
a general initialed model by using in-domain data that was sorted by the divergence score,
showing the effectiveness for domain-specific translation. Moreover, existing works have
shown that curriculum learning is also effective for data selection [149, 55], where the noise
data would degrade the translation performance of domain adaptation [69, 146]. Grangier
et al. [55] investigated a data selection strategy for domain adaptation in NMT to denoise
the training data and achieve better generalization. Wang et al. [149] leveraged the idea
into curriculum learning for multi-domain data selection. The data selection method assigns
each training sentence a score, and training data with a positive score likely moves a model
toward the target domain.

35
Another effective method for fast adaptation is meta-learning [44, 99]. The core idea is
to use previous experiences to inform how new tasks are approached, such that the model
can be adapted to a new task in a small amount of data. Meta-learning often breaks the
training process into a variety of tasks (episodes) in one iteration, which refers to the concept
of episodic training. For instance, MAML [44] was proposed as an effective method for
handling low-resource scenarios. It divided one general training iteration into two episodes:
meta-train and meta-test. Specifically, the meta-test evaluated the model’s performance
obtained by meta-train and optimized the model using the meta-loss. Models trained by
this strategy are likely to have the ability to adapt quickly to new tasks/domains with only
a small amount of data. Inspired by MAML, Sharaf et al. [120] meta-trained the model and
obtained a set of initial parameters that can quickly adapt to domains with very limited data.
By adding curriculum learning and presenting the data from low to high divergence during
training, Zhan et al [158] further improved the adaptability to new domains. However, as
claimed in [74] and shown in the experimental results, these MAML-based methods sacrifice
the robustness before fine-tuning and on the seen domains to achieve good adaptability.
To improve the model’s adaptability, we leverage curriculum learning in our method for
both denoising and ranking the training data. Specifically, we follow [144] to denoise the
data, follow [95] to measure the sentence-level difficulty, and design a difficulty-based training
scheduler to guide the training.

3.2.2.2 Robustness

The strategy to enhance the model’s robustness for domain adaptation mainly focuses
on the ability to handle domain shifts. An intuitive method is adding a set of learnable
parameters to be aware of domain shift, where the additional parameters could be a domain
adaptor [109, 6] or a sub-network [106, 38]. For instance, Bapna et al. [6] introduce a
lightweight domain-specific adaptor for each domain of interest after both the encoder and
decoder. The domain-specific adapter is tuned on its in-domain data while the rest is frozen.

36
Table 3.1 The notations and its description of our Epi-Curriculum.

Notation Description
D the dataset contains all the source domains
Di the dataset of i th source domain
sij j th sentence of the i th domain from the source language
tij j th sentence of the i th domain from the target language
gθ the transformer’s encoder g with parameters θ
hϕ the transformer’s decoder h with parameters ϕ
fΘ the transformer f with parameters Θ, where fΘ (s) = hϕ (gθ (s))

Another way to enhance the awareness of domain shift is instance weighting [46]. It
adjusts the loss function to weight training examples according to their target domain rele-
vance, where the domain relevance can be determined by n-gram [66, 146, 18]. For instance,
Chen et al. [18] propose cost weighting that a domain classifier and a translation model
are trained simultaneously. The domain classifier is trained to discriminate between the in-
domain and generic domains, and the cost is determined by the probability of the sentence
being in-domain. However, these works usually require additional adjustments for different
model architectures and only focus on the adaptability of seen domains.
To address the robustness, we use an episodic training framework [80] to improve the
model’s awareness of domain shifts. It is noteworthy that episodic training is a general
concept and can be developed into different applications according to the different objectives.
The MAML-based methods design episodic training for the model’s adaptability. While we
develop it for robustness.

3.3 Methodology

In this section, we first formalize the problem setting of domain adaptation, then sub-
sequently explain the sub-tasks in our episodic framework, followed by our pre-defined cur-
riculum learning.

37
Domain Aggregation Training (AGG) Domain Specific Training
Di …
D1 … Encoder Decoder Lossagg D1 Encoder1 Decoder1 Loss1
Dn
…
Share
Di Encoderi Decoderi Lossi
Episodic Encoder Training
…
Di Encoder Decoderk Lossenc
Dn Encodern Decodern Lossn
Episodic Training
Episodic Decoder Training
Pick Domain k (i ≠ k)
Di Encoderk Decoder Lossdec

Figure 3.1 The diagram of our episodic training process.

3.3.1 Problem Setting

Given source domains D = {D1 , ..., Dn }, the goal is to train a teacher model on the
source domains and then fine-tune the model on a new domain to obtain a student model.
In order to build the encoder/decoder and inexperienced decoder/encoder combination in
our episodic framework, we split the encoder-decoder-based NMT model into two modules:
an encoder g with parameters θ and a decoder h with parameters ϕ. Thus, for an NMT
model f with all parameters Θ, it can be formulated as fΘ (s) = hϕ (gθ (s)). Table 3.1 gives a
clear definition of the notations.

3.3.2 Episodic Training Framework

Our episodic training framework generalizes the insight of [80] to the transformer’s
encoder-decoder-based architecture. Specifically, there are four sub-tasks within each episode
of our episodic training framework: (i) Domain Aggregation Training trains the backbone
model. (ii) Domain-Specific Training provides the inexperienced models. (iii) Episodic En-
coder Training and (iv) Episodic Decoder Training combine the encoder/decoder of the
backbone model and the decoder/encoder of an inexperienced model to simulate the domain
shift. The overall training diagram is shown in Fig 3.1.

38
3.3.2.1 Domain Aggregation Training

One traditional domain adaptation pipeline is to continuously train the pre-trained model
on the aggregation of all the source domains D [175]. As illustrated in the upper left of Fig
3.1, both the encoder and decoder will learn from all the domains. The optimization is as
follows:
X j j
argmin (sij ,tij )∈D Lagg (ti , hϕ (gθ (si ))) (3.2)
θ,ϕ

where the Lagg is cross-entropy loss.

Among most of the existing works, this simple approach always performs competitively,
and even better than the published methods in some cases [120, 158]. Hence, we use the do-
main aggregation model (AGG) as the backbone approach of our episodic training framework,
and apply the other sub-tasks to improve its robustness throughout the training process.

3.3.2.2 Domain-Specific Training

We improve the robustness by exposing each module (encoder and decoder) of the AGG
model to an inexperienced partner. An inexperienced partner can be an encoder or decoder
that has never trained on the corpus of such domains before. Thus, we employ domain-
specific training (right side of Fig 3.1) to provide the inexperienced partner. To formalize,
for each domain-specific model with its own encoder gθi and decoder hϕi , we optimize the
model using its associated domain corpus Di ∈ D:

X j j
argmin (sij ,tij )∈Di Li ((ti , hϕi (gθi (si ))) (3.3)
θi ,ϕi

where Li is the loss of i th domain. The domain-specific model is only trained on one single
domain i and easily performs badly on the other domains.

39
3.3.2.3 Episodic Encoder Training

To make the encoder robust enough, we consider it should perform well with a decoder
that has never trained on such domains before. As illustrated in the bottom left of Fig 3.1,
given the source domain i and the AGG model, its decoder is replaced by a random decoder
hϕk of domain-specific models. Then the encoder and the inexperienced decoder will together
try to translate the sentence in the domain i and compute episodic encoder loss Lenc . We
formalize the optimization as follows:

X j j
argmin (sij ,tij )∈Di̸=k Lenc ((ti , hϕ̄k (gθ (si ))) (3.4)
θ

where i ̸= k is to guarantee that the decoder hϕk has no experience with domain i. Note that
hϕ̄k means the parameters ϕk will not be updated during back-propagation because we want
the decoder to remain ignorant about the domains outside of k. Intuitively, this combination
can perform poorly due to the decoder’s ignorance. But by minimizing Lenc , the encoder
will be trained to encode the source sentences into features that can be decoded correctly
by a decoder with no experience.

3.3.2.4 Episodic Decoder Training

Similarly, to train a robust decoder, we assume that the decoder should be able to
correctly decode the features generated by an encoder that has never been exposed to such
domains before. As illustrated in the bottom left of Fig 3.1, the encoder of the aggregation
model is replaced by a random encoder gθk of the domain-specific models. We then ask them
to translate the source language in the domain i and compute episodic decoder loss Ldec . To
minimize the loss, we optimize:

X j j
argmin (sij ,tij )∈Di̸=k Ldec ((ti , hϕ (gθ̄k (si ))) (3.5)
ϕ

40
where i ̸= k is to ensure that i mismatches k. Similar to the optimization in episodic
encoder training, gθ̄k aims to keep the parameters θk unchanged and to remain ignorant
about domains other than k. Eventually, by minimizing the loss Ldec , the decoder of AGG
gradually learns to correctly decode the features generated by a random and inexperienced
encoder.

3.3.3 Pre-defined Curriculum Learning

Aiming to have better adaptability, we use curriculum learning [95, 144], making the
episodic training process learn curricula of different characteristics in different stages. More
generic data will be presented to the model in the early stages to avoid the model from
landing on a bad local minima. More in-domain information will be introduced to the model
in the final stages to improve the adaptability by learning unfamiliar domain data. Moreover,
our training data is denoised before training to ensure the model is trained on high-quality
in-domain data. As many works have shown its effectiveness on domain adaptation in NMT
[122, 163, 55]. Hence, there are two factors in our curriculum learning that will be described
separately: (i) Data denoising and (ii) Divergence Scoring.

3.3.3.1 Divergence Scoring

For a given sentence s, we follow [95] to evaluate its cross-entropy difference between two
neural language models (NLM) as its Z -domain divergence score dZ (s):

logP(s; Θ̄Z ) − logP(s; Θ̄base )

dZ (s) = (3.6)
|s|

P(s; Θ̄base ) is the base language model with parameters Θ̄base trained on general domains.
While P(s; Θ̄Z ) is the Z -domain language model obtained by fine-tuning the base model on
Z -domain monolingual data. The dZ (s) is normalized by the sentence length of s. The higher

41
divergence score indicates that the given sentence s is more divergent from the samples in
the general domain, so we will present it to the model in the later stage.

3.3.3.2 Data Denoising

Given a sentence pair (s, t) of domain Z , we follow [144] to evaluate its cross-entropy
difference between two NMT models:

logP(t|s; ΘZ ) − logP(t|s; Θbase )

qZ (s, t) = (3.7)
|t|

P(t|s; Θbase ) is the base model with parameters Θbase trained on general domains. P(t|s; ΘZ )
is a domain-specific model with parameters ΘZ by fine-tuning the base model on a small Z -
domain parallel corpus D
b Z with trusted quality. The cross-entropy difference qZ is normalized

by the target sentence length |t|. Additionally, [55] has shown that the qZ with positive
value has a positive influence in gradient for adapting a base model to domain Z . Thus,
for a multi-domain adaptation problem, the model benefits from a batch of samples with
all positive values. In this work, we filter the samples with negative qZ because we consider
those samples will have the opposite influence on domain adaptation.

3.3.4 Overall Training Flow

To summarize, our full Epi-Curriculum contains three main steps: (i) We first use Equa-
tion 3.7 to denoise the training corpus by filtering the samples with negative qZ values.
(ii) Secondly, the rest corpus is sorted according to Equation 3.6. (iii) Once the data are
properly processed, a divergence-score-based training scheduler is plugged into our episodic
training framework. The full pseudocode of our Epi-Curriculum training policy is given in
Algorithm 3.1, where α and β indicate the learning rate of aggregation optimization and
domain-specific optimization, respectively.

42
Algorithm 3.1 Epi-Curriculum Training Policy
Require: Aggregation Model: (gθ , hϕ ); Domain-Specific Models: {(gθ1 , hϕ1 ), ..., (gθn , hϕn )};
Pre-trained NMT Models: {Θ1 , ..., Θn }; Pre-trained NLM Models: {Θ̄1 , ... Θ̄n }; Base NMT
model: Θbase ; Base NLM model: Θ̄base
Hyperparameters: α, β
Output: (gθ , hϕ )
1: Use Θbase and ΘZ to score qZ for each sentence in D
2: Filter the sentences with negative qZ
3: Use Θ̄base and Θ̄Z to score the dZ and sort the rest sentences
4: while not done do
5: Sample the sentences based on the sorted dZ with pre-defined probabilities
6: for (gθi , hϕi )∈{(gθ1 , hϕ1 ), ..., (gθn , hϕn )} do
7: Update θi ← θi - β∇gθi Li
8: Update ϕi ← ϕi - β∇hϕi Li
9: end for
10: Update θ ← θ - α∇gθ (Lagg + Lenc )
11: Update ϕ ← ϕ - α∇hϕ (Lagg + Ldec )
12: end while

3.4 Experiments

In this section, the experiments are designed to investigate the following questions: (i)
How does our Epi-Curriculum empirically compare to the baselines and alternative ap-
proaches? (ii) Do our encoder and decoder have the strength to overcome domain shift?
(iii) What is the impact of other variants of the curriculum? To explore these, we conducted
experiments on English-German (En-De), English-Romanian (En-Ro), and English-French
(En-Fr) translation tasks.

3.4.1 Datasets

The data sources for our two tasks are the following: En-De: It consists of 10 parallel
corpora, where 9 of them are collected from the Open Parallel Corpus (OPUS) [131] and the
Covid-19 is from authentic public institution data sources5 ,6 . These 10 corpora cover a wide
5
https://www.bundesregierung.de/
6
https://www.euro.who.int/en/health-topics/health-emergencies/coronavirus-covid-19/news/

43
Table 3.2 Data statistics for the En-De, En-Ro, and En-Fr tasks.
Training Fine-Tuning Testing
Domain
(960k) (10k) (20k)
English-German (En-De)
Covid-19 / 338 714
Bible / 300 640
Unseen Books / 303 723
ECB / 298 704
TED2013 / 417 939
EMEA 33,067 610 1,315
Tanzil 43,779 476 1,033
Seen KDE4 75,610 813 1,794
OpenSub 86,499 1,118 2,624
JRC 29,071 312 675
English-Romanian (En-Ro)
KDE4 / 1,006 2,190
Bible / 301 625
Unseen
QED / 566 1,160
GlobalVoices / 417 790
EMEA 43,779 509 1,085
Tanzil 38,821 461 850
Seen TED2013 42,327 477 1,015
OpenSub 88,257 1,022 2,013
JRC 30,306 338 724
English-French (En-Fr)
ECB / 253 510
Bible / 255 543
Unseen
Books / 321 615
GlobalVoices / 365 721
EMEA 41,376 414 873
Tanzil 34,159 323 713
Seen KDE4 52,764 542 1,074
TED2013 39,950 398 805
JRC 30,995 324 637

range of topics that enable us to evaluate domain adaptation: Covid-19, Bible, Books, ECB,
TED2013, EMEA, Tanzil, KDE4, OpenSub and JRC. En-Ro: All 9 corpora are collected
from the Open Parallel Corpus (OPUS): KDE4, Bible, QED, GlobalVoices, EMEA, Tanzil,
TED2013, OpenSub, and JRC. En-Fr: All 9 corpora are collected from OPUS: ECB, Bible,
Books, GlobalVoices, EMEA, Tanzil, KDE4, TED2013, and JRC-Acquis.
For each translation task, only 5 domains are used for training, and all domains are
used for individual fine-tuning evaluation. Thus, we are able to investigate the model’s
adaptability on the domains that have never been seen during the training. The 5 domains in

44
training are named as seen domains and the rest are as unseen domains. Each domain is split
into Training (only for seen), Fine-tuning, and Testing. Table 3.2 presents data statistics for
the English-German (En-De) and English-Romanian (En-Ro) tasks. The number of tokens
is fixed due to the variety of each domain’s average sentence length, where the number of
Fine-Tuning is small to simulate the low-resource scenario. Following [120], the number of
tokens in each set is 960k tokens in the Training set, 10k tokens in the Fine-Tuning set,
and 20k tokens in the Testing set. All the corpora are processed by sentencepiece7 with the
vocabulary size of 32,128. We filter the length of sentences and keep the sentences no longer
than 175 and no shorter than 5. Because long sentences require larger computational space
and short sentences are too easy to translate.

3.4.2 Implementation Details

3.4.2.1 Model Initialization

T5-small model with pre-trained parameters on C4 (Colossal Clean Crawled Corpus)

dataset [112] is used as the backbone network for all the required models (Algorithm 3.1).
Upon this foundation, the pre-trained NMT models ({Θ1 , ..., Θn }) and pre-trained NLM
models ({Θ̄1 , ... Θ̄n }) are obtained by training the model on each source domain with 90k
tokens.

3.4.2.2 Hyperparameters for Episodic Training

To follow the settings in T5, all experiments are trained using Adafactor [121] optimizer
but with scale parameter = False and relative step = False to keep a consistent learning
rate. We found that this task is sensitive to the learning rate, where a smaller learning rate
(e.g., < 1e-5) would lead to slow convergence and a bigger one (e.g., >= 5e-5) would lead to
oscillated training loss. [1e-5, 3e-5] could be a proper range of α and β. In the experiment, we
use α = 1e-5 and β = 3e-5 to have faster convergence for the domain-specific domains. Our
7
https://github.com/google/sentencepiece

45
Stage 1 Stage 2 Stage 3
0.5 0.5 0.5

Probability of Sampling 0.4 0.4 0.4

0.3 0.3 0.3

0.2 0.2 0.2

0.1 0.1 0.1

0 0 0
Low High Low High Low High

Figure 3.2 Probabilistic view of the Default training scheduler.

approach can be converged within 9 epochs and all the experiments are run on an NVIDIA
RTX 3090 GPU with 24 GB of memory.

3.4.2.3 Curriculum Learning Settings

To have the training scheduler for our curriculum learning, we filter the noise data and
then follow [162] to sort and group the training samples evenly. Those in the same group
have similar divergence scores and different groups have the same number of samples. Each
training sample is first evaluated by Equation 3.7 and is filtered if the sample has negative qZ
(section 3.3.3.2). There are approximately 8% of the total training samples being filtered for
all 3 tasks. The rest samples are then scored by Equation 3.6 and sorted in ascending order.
Then the sorted data are evenly divided into 5 shards, such that their average scores are from
low to high. To implement the training scheduler that begins with low-divergence samples,
we start with higher probabilities for low-divergence samples to be sampled. Gradually, in
the later training stages, we increase the probabilities for high-divergence samples. At the
final stage, all samples have an equal probability of being sampled. Fig 3.2 illustrates this
default probabilistic view, which indicates the probability of sampling in the three stages.

46
3.4.3 Comparison Group

The following NMT approaches are included in our experiments: Vanilla: The T5 model
that is pre-trained on the generic domain (C4) without further training on the Training set.
It is the default baseline for domain adaptation in NMT [88]. AGG (Transfer Learning): The
domain aggregation model introduced in Equation 3.2 without any special training schemes,
which continues training on Vanilla with all training corpora. It is a strong baseline with
comparable performance in many existing works [158, 74, 83]. AGG-Curriculum: The AGG
model trains with our pre-defined curriculum learning strategy. We are able to observe
the sole performance of our curriculum learning. Meta-MT [120]: The standard meta-
training approach that directly applies the MAML algorithm [44]. Including Meta-MT in
our comparison is essential as this framework has gained significant popularity in recent works
[158, 74, 104]. We follow the algorithm and implement it on our own for a fair comparison.
Epi-NMT: The approach trains with our episodic framework without curriculum learning
for evaluating the sole performance of the episodic framework. Epi-Curriculum: The full
version of our approach.

3.4.4 Evaluation

Once the training is done, we follow the general domain adaptation setting, to individually
adapt the model on each of the Fine-Tuning sets and evaluate its performance on the relative
Testing set. The SacreBLEU [108] is reported based on the average results from over 5
times training, and the generated results utilize a beam size 5. There are three types of
results we want to observe and highlight: Before FT: The BLEU score before fine-tuning,
which demonstrates the model’s robustness. After FT: The BLEU score after individual
fine-tuning for evaluating the model’s final performance. ∆FT: The BLEU improvement
through individual fine-tuning to show the model’s adaptability.

47
Table 3.3 BLEU scores over Testing sets on En-De task.

Unseen
Seen
Covid-19 Bible Books ECB TED2013 Average
Before FT
Vanilla 25.65 10.94 10.80 31.79 26.45 21.33
AGG 25.92 12.74 10.48 32.70 26.52 28.62
AGG-Curriculum 26.05 12.06 11.36 32.66 26.91 28.93
Meta-MT 25.72 12.04 10.61 32.37 26.17 28.22
Epi-NMT 26.63 12.51 11.65 32.96 26.93 30.58
Epi-Curriculum 27.12 12.70 11.90 34.12 26.63 31.57
After FT
Vanilla 25.84 11.47 11.16 32.83 26.68 21.94
AGG 26.39 13.43 11.30 33.02 27.04 28.82
AGG-Curriculum 26.71 13.48 11.77 33.76 28.36 29.60
Meta-MT 26.64 13.40 11.23 33.56 26.91 29.20
Epi-NMT 27.44 14.48 12.37 34.07 28.19 31.25
Epi-Curriculum 27.73 15.11 12.53 34.89 29.11 32.46
∆FT
Vanilla 0.19 0.53 0.36 1.04 0.23 0.62
AGG 0.47 0.69 0.82 0.32 0.52 0.20
AGG-Curriculum 0.66 1.42 0.41 1.10 1.44 0.67
Meta-MT 0.92 1.36 0.62 1.19 0.74 0.98
Epi-NMT 0.81 1.97 0.72 1.11 1.26 0.68
Epi-Curriculum 0.61 2.41 0.63 0.77 2.48 0.90

3.4.5 Results and Discussion (En-De)

In this section, we discuss the results of the English-German (En-De) task (shown in
Table 3.3 and Table 3.4). Table 3.3 shows the performance for each unseen domain and the
average result of all the seen domains. While Table 3.4 shows the average improvement over
the baselines, where the baseline for unseen is Vanilla and for seen is AGG as the Vanilla
never trained on the seen domains.
Based on the results Before FT in Table 3.3, we can see that: (i) Compared to Vanilla,
AGG shows significant BLEU score gaps in seen domains due to training on the source data.
(ii) The meta-learning approach Meta-MT is even worse than the AGG in 4 unseen domains
and the seen average. (iii) Comparing our proposed episodic framework (Epi-NMT) and
curriculum learning (AGG-Curriculum) solely, Epi-NMT outperforms AGG-Curriculum in

48
Table 3.4 The average improvement over baselines on the En-De task.

Before FT After FT (∆FT)

Unseen Seen Unseen Seen
Vanilla 0 / 0 /
AGG 0.55 0 0.64 (0.09) 0
AGG-Curriculum 0.68 0.30 1.22 (0.54) 0.77 (0.47)
Meta-MT 0.26 -0.41 0.75 (0.50) 0.37 (0.78)
Epi-NMT 1.01 1.95 1.71 (0.70) 2.43 (0.48)
Epi-Curriculum 1.37 2.94 2.28 (0.91) 3.64 (0.70)

9 out of 10 domains. (iv) Our episodic-based approaches (Epi-NMT and Epi-Curriculum)

have the best performance in 5 out of 6 cases.
From the results of After FT and ∆FT in Table 3.3, we can observe that: (i) Unlike the
performance Before FT, Meta-MT demonstrates its superiority in ∆FT and surpasses AGG
after fine-tuning in 3 domains and the seen average. (ii) Epi-Curriculum shows its strength
in adaptability, where it performs the best in 6 out of 6 cases.
Based on the results in Table 3.4, we can observe that: (i) Although Meta-MT has
strength in adaptability (0.50 and 0.78), its weakness in robustness is also obvious, which has
the lowest improvement Before FT. (ii) Epi-NMT consistently outperforms AGG-Curriculum
in robustness and achieves comparable adaptability (0.70 and 0.48), demonstrating the ro-
bustness and adaptability of the episodic framework. (iii) Our proposed episodic framework
Epi-NMT is only worse than Meta-MT on ∆FT in seen domains, indicating that the episodic
framework is more effective than the MAML-based framework. (iv) Our full version approach
Epi-Curriculum consistently performs the best in Before FT (1.37 and 2.94), After FT (2.28
and 3.64), and ∆FT (0.91 and 0.70), only the adaptability is slightly worse than Meta-MT
in seen domains (0.70 vs 0.78).

3.4.6 Results and Discussion (En-Ro)

The results of English-Romanian (En-Ro) are shown in Table 3.5 and Table 3.6.

49
Table 3.5 BLEU scores over Testing sets on En-Ro task.

Unseen
Seen
KDE4 Bible QED GlobalVoices Average
Before FT
Vanilla 24.99 7.61 22.87 24.08 23.08
AGG 27.69 9.02 24.64 25.22 32.44
AGG-Curriculum 28.29 8.95 24.68 25.08 32.65
Meta-MT 27.82 8.59 24.70 25.16 32.10
Epi-NMT 27.53 8.60 25.69 25.52 35.04
Epi-Curriculum 27.27 8.63 25.31 25.25 34.78
After FT
Vanilla 26.88 8.53 23.02 25.04 23.83
AGG 29.26 10.08 24.92 26.23 32.95
AGG-Curriculum 30.27 11.02 25.02 27.19 33.25
Meta-MT 30.15 10.39 24.91 27.35 32.86
Epi-NMT 31.22 10.82 25.76 28.27 35.47
Epi-Curriculum 31.45 11.12 25.94 28.25 35.19
∆FT
Vanilla 1.89 0.92 0.15 0.96 0.75
AGG 1.57 1.06 0.28 1.01 0.51
AGG-Curriculum 1.98 2.07 0.34 2.11 0.60
Meta-MT 2.33 1.80 0.21 2.19 0.76
Epi-NMT 3.69 2.22 0.07 2.75 0.43
Epi-Curriculum 4.18 2.49 0.63 3.00 0.41

From the results Before FT in Table 3.5, we can observe that: (i) The meta-learning
approach Meta-MT is worse than the AGG in 2 new domains and the seen average. (ii) Epi-
NMT outperforms AGG-Curriculum in 4 out of 5 cases. (iii) Our episodic-based approaches
(Epi-NMT and Epi-Curriculum) have the best performance in 4 out of 5 domains, especially
the strength in seen average.
Based on the results of After FT and ∆FT in Table 3.5, we can observe that: (i) Meta-MT
also shows the same pattern that surpasses the AGG after fine-tuning in 6 domains. (ii) Our
episodic-based approaches (Epi-NMT and Epi-Curriculum) monopolize all the scores after
fine-tuning, where the Epi-Curriculum performs the best in 3 unseen domains. (iii) Epi-

50
Table 3.6 The average improvement over baselines on the En-Ro task.

Before FT After FT (∆FT)

Unseen Seen Unseen Seen
Vanilla 0 / 0 /
AGG 1.76 0 1.76 (0.09) 0
AGG-Curriculum 1.86 0.21 2.51 (0.54) 0.30 (0.09)
Meta-MT 1.68 -0.34 2.33 (0.50) -0.09 (0.25)
Epi-NMT 1.95 2.60 3.15 (0.70) 2.52 (-0.08)
Epi-Curriculum 1.73 2.34 3.32 (0.91) 2.23 (-0.10)

NMT and Epi-Curriculum also show their superiority in ∆FT, where the Epi-Curriculum
achieves the greatest improvement in all the unseen domains.
From the results in Table 3.6: (i) Meta-MT still has strength in adaptability (0.5 and 0.25)
and weakness in robustness (1.68 and -0.34). (ii) Epi-NMT and Epi-Curriculum outperform
the other approaches in most cases except the ∆FT in seen domains (-0.08 and -0.1), where
only AGG-Curriculum (0.09) and Meta-MT (0.25) have positive improvement compared to
the baseline AGG. (iii) Epi-NMT outperforms Epi-Curriculum in 3 out of 4 cases, only worse
than Epi-Curriculum in unseen domains After FT. But Epi-Curriculum performs better than
Epi-NMT in more domains in Table 3.5.

3.4.7 Results and Discussion (En-Fr)

The results of English-French (En-Fr) are shown in Table 3.7 and Table 3.8.
From the results Before FT in Table 3.7, we can see that: (i) Meta-MT keeps its poor
performance in the comparison. (ii) Epi-Curriculum has the best performance in 3 out of 5
cases, especially the superiority in seen average.
Based on the results of After FT and ∆FT in Table 3.7, we can observe that: (i) Meta-
MT surpasses AGG in all the unseen domains, indicating its strong adaptability. (ii) Epi-
Curriculum has the best performance in 4 cases.
From the results in Table 3.8: (i) Although Meta-MT has strength in adaptability, its
performance before fine-tuning (0.64 vs 0.73) is the worst and only better than AGG after

51
Table 3.7 BLEU scores over Testing sets on En-Fr task.

Unseen
Seen
ECB
Bible Books GlobalVoices Average
Before FT
Vanilla 37.55 13.02 15.83 28.98 33.17
AGG 38.27 13.68 16.63 29.74 42.03
AGG-Curriculum 38.45 14.05 16.94 30.14 42.32
Meta-MT 38.03 13.87 16.54 29.53 41.91
Epi-NMT 38.21 14.02 16.91 30.13 42.57
Epi-Curriculum 38.19 14.16 16.87 30.49 43.60
After FT
Vanilla 37.58 13.23 16.36 29.11 33.58
AGG 39.32 14.81 16.96 29.92 42.36
AGG-Curriculum 39.47 14.92 17.25 30.38 42.74
Meta-MT 39.36 15.15 17.35 30.05 42.30
Epi-NMT 39.41 15.02 17.20 30.51 42.85
Epi-Curriculum 39.87 15.09 17.46 30.53 43.91
∆FT
Vanilla 0.03 0.21 0.53 0.13 0.41
AGG 1.05 1.13 0.33 0.18 0.33
AGG-Curriculum 1.02 0.87 0.31 0.24 0.42
Meta-MT 1.33 1.28 0.81 0.52 0.39
Epi-NMT 1.20 1.00 0.29 0.38 0.28
Epi-Curriculum 1.68 0.93 0.59 0.04 0.31

fine-tuning (1.41 vs 1.18). (ii) Epi-Curriculum outperforms the others in both before and
after fine-tuning performance.

3.4.8 Robustness to Domain Shift

To understand how the episodic framework improves the model’s robustness to domain
shift, we compare its impact on cross-domain testing with other approaches on the En-
De task. The BLEU improvement is reported by evaluating the performance of an en-
coder/decoder of an inexperienced model, and combined with the decoder/encoder trained
by Epi-Curriculum, where the inexperienced model is one of the domain-specific models
(introduced in Equation 3.3). For instance, to compute the encoder improvement on the

52
Table 3.8 The average improvement over baselines on the En-Fr task.

Before FT After FT (∆FT)

Unseen Seen Unseen Seen
Vanilla 0 / 0 /
AGG 0.73 0 1.18 (0.44) 0
AGG-Curriculum 1.05 0.29 1.44 (0.38) 0.38 (0.09)
Meta-MT 0.64 -0.12 1.41 (0.76) -0.06 (0.06)
Epi-NMT 0.97 0.54 1.47 (0.49) 0.49 (-0.05)
Epi-Curriculum 1.08 1.57 1.67 (0.58) 1.55 (-0.02)
12
2.5 AGG
AGG-Curriculum 10
2.0 Meta-MT
BLEU Improvement

1.5 Epi-NMT 8
Epi-Curriculum
1.0 6
0.5 4
0.0 2
0.5
0
Covid-19 Bible Books ECB TED2013 EMEA Tanzil KDE4 OpenSub JRC
Unseen Domains Seen Domains
Figure 3.3 Cross-domain improvement of domain-specific models on the En-De task.

Covid-19 domain, we replace the encoder of domain-specific models (EMEA, Tanzil, KDE4,
OpenSub, and JRC) with the encoder trained by Epi-Curriculum, test them on the Testing
set of Covid-19 and report the average BLEU improvements. The model is excluded when
it matches the input domain to maintain the cross-domain testing.
The results are shown in Fig 3.3 and Fig 3.4 for encoder and decoder, respectively. We
can see that: (i) AGG and Meta-MT have negative impacts in the Books and TED2013
domain. (ii) AGG-Curriculum has no negative impact and is only slightly lower than Meta-
MT in the Bible domain. (iii) Epi-NMT and Epi-Curriculum perform very close and have
the best performance in 8 out of 10 domains, except the Covid-19 and Bible. To quantize,
Epi-Curriculum outperforms AGG by 2.55 and 2.59 BLEU scores in the case of encoder

53
AGG 10
2.0
AGG-Curriculum
Meta-MT 8
BLEU Improvement
1.5 Epi-NMT
Epi-Curriculum 6
1.0
0.5 4

0.0 2

0
Covid-19 Bible Books ECB TED2013 EMEA Tanzil KDE4 OpenSub JRC
Unseen Domains Seen Domains
Figure 3.4 Cross-domain improvement of domain-specific models on the En-De task.
Covid-19 Bible Books ECB TED2013
0 0 0 0.0 0
1 2.5
BLEU Degradation

2 1 2
2 5.0
4 3 2
7.5 4
AGG AGG AGG AGG AGG
6 AGG-Curriculum 4 AGG-Curriculum 3 AGG-Curriculum 10.0 AGG-Curriculum AGG-Curriculum
Meta-MT 5 Meta-MT Meta-MT Meta-MT 6 Meta-MT
8 Epi-NMT Epi-NMT 4 Epi-NMT 12.5 Epi-NMT Epi-NMT
Epi-Curriculum 6 Epi-Curriculum Epi-Curriculum Epi-Curriculum 8 Epi-Curriculum
10 5 15.0
0.00 0.01 0.02 0.03 0.00 0.01 0.02 0.03 0.00 0.01 0.02 0.03 0.00 0.01 0.02 0.03 0.00 0.01 0.02 0.03
EMEA Tanzil KDE4 OpenSub JRC
0 0 0 0 0
2 1 2 2
BLEU Degradation

1
4 2 4 4
6 3 6 2 6
AGG AGG AGG AGG AGG
8 AGG-Curriculum 4 AGG-Curriculum 8 AGG-Curriculum 3 AGG-Curriculum 8 AGG-Curriculum
10 Meta-MT 5 Meta-MT 10 Meta-MT Meta-MT 10 Meta-MT
Epi-NMT Epi-NMT Epi-NMT 4 Epi-NMT Epi-NMT
12 Epi-Curriculum 6 Epi-Curriculum 12 Epi-Curriculum Epi-Curriculum 12 Epi-Curriculum
5
0.00 0.01 0.02 0.03 0.00 0.01 0.02 0.03 0.00 0.01 0.02 0.03 0.00 0.01 0.02 0.03 0.00 0.01 0.02 0.03
Guassian noise std ( ) Guassian noise std ( ) Guassian noise std ( ) Guassian noise std ( ) Guassian noise std ( )

Figure 3.5 BLEU score degradation by adding Gaussian noise.

and decoder, respectively. This experiment demonstrates that our episodic training strategy
indeed enhances the model’s robustness to domain shift.

3.4.9 Robustness to Parameter Perturbation

In terms of a model’s robustness, recent studies have analyzed the quality of the minima
that the model falls into [67, 164, 111]. Rather than having parameters that only themselves
have low loss values, it suggests that a model with good generalization ability should seek
parameters that lie in neighborhoods having uniformly low loss values. In other words, a
robust model will be obtained by converging to a flat minimum, instead of a sharp one.

54
34 English-German 37 English-Romanian English-French
42
32 35
30 33
40
BLEU Score

BLEU Score

BLEU Score
28 31
26 29
AGG AGG 38 AGG
24 AGG-Curriculum 27 AGG-Curriculum AGG-Curriculum
Meta-MT Meta-MT Meta-MT
22 Epi-NMT 25 Epi-NMT Epi-NMT
Epi-Curriculum Epi-Curriculum Epi-Curriculum
20 23 36
Low Low-Med Medium Med-High High Low Low-Med Medium Med-High High Low Low-Med Medium Med-High High
Sentence Divergence Sentence Divergence Sentence Divergence

Figure 3.6 Performance of En-De, En-Ro, and En-Fr on the seen domains.

Therefore, if a model’s performance is not dependent on a precisely tuned solution, it would

less likely to suffer from parameter perturbations.
To this end, it is natural to compare the robustness of our approach and others by
simulating the parameter perturbations. In detail, we observe the translation performance
degradation by increasingly adding Gaussian noise to the model’s parameters of the En-De
task. Results are shown in Fig 3.5. We can observe that: (i) BLEU scores decrease as
the parameters are perturbed harder. (ii) Although the performance is similar when the
perturbation is not much (0.01, 0.02), it is obvious that the models trained by our episodic
framework (Epi-NMT and Epi-Curriculum) drop the slowest. (iii) When we have the biggest
perturbation (σ = 0.03), Epi-NMT and Epi-Curriculum have the best performance in 9 out
of 10 domains, only slightly worse than Meta-MT in Tanzil.
This experiment suggests that the minima found by our episodic framework have higher
quality. The minima drop into a wider area that the neighbor points are also with low loss
values, and further demonstrate the robustness of our approach.

3.5 Curriculum Learning Analysis

3.5.1 Curriculum Validity

To verify our curriculum validity, we categorize all the sentences (ignore different do-
mains) in the Testing set into 5 shards with the thresholds used in the Training set, where

55
Stage 1 Stage 2 Stage 3 Stage 1 Stage 2 Stage 3
0.5 0.5 0.5 0.5 0.5 0.5
Probability of Sampling

Probability of Sampling
0.4 0.4 0.4 0.4 0.4 0.4

0.3 0.3 0.3 0.3 0.3 0.3

0.2 0.2 0.2 0.2 0.2 0.2

0.1 0.1 0.1 0.1 0.1 0.1

0 0 0 0 0 0
Low High Low High Low High Low High Low High Low High

Figure 3.7 Probabilistic view of the Advanced (left) and the Reversed (right) scheduler.

Table 3.9 BLEU scores over Testing sets on En-De task with different training schedulers.

Unseen
Seen
Covid-19 Bible Books ECB TED2013 Average
AGG-Curriculum
Default 26.05 12.06 11.36 32.66 26.91 28.93
Before Advanced 26.31 12.55 11.24 32.81 26.78 29.49
Reversed 25.64 12.38 10.83 32.27 26.66 28.82
Default 26.71 13.48 11.77 33.76 28.36 29.60
After Advanced 26.57 13.56 11.97 33.53 28.20 30.17
Reversed 26.18 13.14 11.39 33.46 27.51 29.51
Epi-Curriculum
Default 27.12 12.70 11.90 34.12 26.63 31.57
Before Advanced 26.43 12.31 11.87 34.01 26.42 32.29
Reversed 26.44 12.19 11.65 34.28 26.17 31.80
Default 27.73 15.11 12.53 34.89 29.11 32.46
After Advanced 27.07 14.51 12.48 34.93 28.77 32.74
Reversed 27.13 14.40 12.34 34.73 28.82 32.25

each shard has a similar number of samples. The result of the En-De, En-Ro, and En-Fr
tasks are shown in Fig 3.6, we can observe that: (i) The BLEU scores gradually decrease
with increasing divergence levels, indicating that the metric introduced in Equation 3.6 is
able to evaluate adaptation difficulty. (ii) Epi-NMT and Epi-Curriculum have much better
performance in all levels of divergence groups, showing the robustness of our episodic frame-
work. (iii) Compared to Epi-NMT, Epi-Curriculum has strength in the shards with higher
divergence scores, demonstrating the effectiveness of the curricula.

56
Table 3.10 BLEU scores over Testing sets on En-Ro task with different training schedulers.

Unseen
Seen
KDE4 Bible QED GlobalVoices Average
AGG-Curriculum
Default 28.29 8.95 24.68 25.08 32.65
Before Advanced 28.42 8.71 24.63 25.29 33.16
Reversed 28.14 8.61 24.80 25.78 32.27
Default 30.27 10.72 25.02 27.19 33.25
After Advanced 30.38 10.75 24.86 26.47 33.82
Reversed 30.32 10.63 25.31 27.18 32.89
Epi-Curriculum
Default 27.27 8.63 25.31 25.25 34.78
Before Advanced 27.23 8.79 25.42 25.06 35.54
Reversed 27.41 8.77 26.06 26.06 34.23
Default 31.45 11.12 25.94 28.25 35.19
After Advanced 30.70 10.77 25.48 28.14 35.95
Reversed 31.03 10.81 26.19 28.43 34.80

3.5.2 Impact of Training Schedulers

The curriculum training scheduler defines the order in which samples of different diver-
gence shards are presented to the training process. It is natural to introduce the shards from
low to high divergence [10]. However, there does not exist a golden standard for the training
scheduler.
In this section, we introduce other two variants of training schedulers to investigate the
impact of different schedulers on curriculum-based approaches (AGG-Curriculum and Epi-
Curriculum). The two variants are described below:

• Advanced: The shards with low divergence scores have more probabilities to be sampled
in the first two stages, and equal probability in the last stage. Fig 3.7 (left) illustrates
this probability view.

• Reversed: The shards are sorted in descending order of divergence. Fig 3.7 (right)
shows this probability view.

57
Table 3.11 BLEU scores over Testing sets on En-Fr task with different training schedulers.

Unseen
Seen
ECB Bible Books GlobalVoices Average
AGG-Curriculum
Default 38.45 14.05 16.94 30.14 42.32
Before Advanced 38.08 13.79 16.68 29.91 42.67
Reversed 38.00 14.30 17.23 29.89 42.18
Default 39.47 14.92 17.25 30.38 42.74
After Advanced 39.92 15.22 16.95 30.22 43.02
Reversed 39.22 15.29 16.96 30.74 42.59
Epi-Curriculum
Default 38.19 14.16 16.87 30.49 43.60
Before Advanced 38.46 13.86 17.02 30.02 44.13
Reversed 37.87 13.79 16.49 30.79 43.43
Default 39.87 15.09 17.46 30.53 43.91
After Advanced 39.83 14.86 17.09 30.04 44.52
Reversed 39.01 15.52 17.01 30.17 43.68

Table 3.9, Table 3.10 and Table 3.11 show the results with different training schedulers
on En-De and En-Ro tasks, respectively. We can see that different training schedulers
do not lead to significant performance changes except in the seen domains. The average
of seen domains is sensitive to different training schedulers, where its performance can be
summarized as Reversed < Default < Advanced. This order is also related to the portion
that low divergence samples are sampled during the training process. Thus the performance
of seen may depend on the performance of the low-divergence corpus. Additionally, same as
[162], our results indicate that curriculum learning can lead to better results, but the default
low-to-high divergence curriculum is not the only helpful curriculum strategy for training.

3.5.3 Impact of Denoising

To understand the impact of data denoising in our approach, we compare the performance
of AGG-Curriculum and Epi-Curriculum trained with noise data. Specifically, only Equation
3.6 is applied to sort the data without using Equation 3.7 to filter them. Table 3.12, Table
3.13 and Table 3.14 show the results with and without noised data. However, the results

58
Table 3.12 BLEU scores over Testing sets on En-De task with and without noise data.

Unseen
Seen
Covid-19 Bible Books ECB TED2013 Average
AGG-Curriculum
Denoised 26.05 12.06 11.36 32.66 26.91 28.93
Before
Noised 26.27 12.61 10.89 32.30 27.73 28.86
Denoised 26.71 13.48 11.77 33.76 28.36 29.60
After
Noised 26.74 13.65 11.48 33.56 28.85 29.75
Epi-Curriculum
Denoised 27.12 12.70 11.90 34.12 26.63 31.57
Before
Noised 27.03 12.38 12.16 34.07 27.18 31.44
Denoised 27.73 15.11 12.53 34.89 29.11 32.46
After
Noised 27.69 14.42 12.51 34.26 28.91 32.28

Table 3.13 BLEU scores over Testing sets on En-Ro task with and without noise data.

Unseen
Seen
KDE4 Bible QED GlobalVoices Average
AGG-Curriculum
Denoised 28.29 8.95 24.68 25.08 32.65
Before
Noised 28.25 9.05 24.37 25.18 32.88
Denoised 30.27 10.72 25.02 27.19 33.25
After
Noised 30.21 11.03 25.07 26.08 33.62
Epi-Curriculum
Denoised 27.27 8.63 25.31 25.25 34.78
Before
Noised 27.06 8.65 25.56 24.99 34.31
Denoised 31.45 11.12 25.94 28.25 35.19
After
Noised 31.87 10.78 25.72 27.76 34.85

do not show a significant difference between the performance with noise and without noise
on both En-De and En-Ro tasks. For both En-De and En-Ro tasks, the results are very
close, and hard to conclude which one is better. This is because it is difficult to affect a
translation model with only approximately 8% less amount of corpus. But it also proves the
effectiveness of our approach, which can achieve the same performance with 8% less amount
of data.

59
Table 3.14 BLEU scores over Testing sets on En-Fr task with and without noise data.

Unseen
Seen
KDE4 Bible QED GlobalVoices Average
AGG-Curriculum
Denoised 38.45 14.05 16.94 30.14 42.32
Before
Noised 38.21 14.54 17.17 29.81 42.17
Denoised 39.47 14.92 17.25 30.38 42.74
After
Noised 39.93 14.59 17.50 30.78 42.60
Epi-Curriculum
Denoised 38.19 14.16 16.87 30.49 43.60
Before
Noised 37.94 14.42 16.58 30.15 43.86
Denoised 39.87 15.09 17.46 30.53 43.91
After
Noised 39.10 14.63 17.81 30.72 44.05

3.5.4 Limitations

Despite the robustness and adaptability, our Epi-Curriculum also has one essential limi-
tation. The episodic framework has a high computational cost in both time complexity and
space complexity.
To analyze the time complexity, we assume O(1) is the time complexity for updating
a model’s parameters once in each iteration, such as the conventional AGG. However, the
episodic framework begins with O(N) for training N source domains once, followed by O(2)
for applying episodic encoder training and episodic decoder training, and last comes with
O(1) for the final update combined with episodic encoder loss and episodic decoder loss. In
the experiment of both En-De and En-Ro tasks, there are 5 domains in training and the
training time of the episodic framework is eight times that of AGG. In practice, AGG needs
only 15 minutes to train an epoch on the En-De task but the episodic framework requires
approximately 2 hours. Apparently, the additional training time can not be ignored with an
increasing number of source domains.
To analyze the time complexity, we let O(1) be the space complexity for storing the
parameters of one model, such as one single AGG. Whereas our episodic framework requires

60
O(N) space to store N domain-specific models and O(1) space to store the AGG model. The
additional space requirement also can not be ignored with a larger translation model.

3.6 Conclusion

We present Epi-Curriculum for low-resource domain adaptation in NMT. A novel episodic

framework is proposed to handle the model’s robustness to domain shift, and a denoised
curriculum learning is applied to further boost the model’s adaptability. Experiments on
En-De and En-Ro empirically show the effectiveness of our approach, where Epi-Curriculum
outperforms the baseline on unseen and seen domains by 2.28 and 3.64 on En-De task, and
3.32 and 2.23 on En-Ro task. The results also demonstrate that the episodic framework is
more effective than the MAML-based framework.

61
Chapter 4: SuperCon: Supervised Contrastive Learning for Imbalanced Skin
Lesion Classification8

Convolutional neural networks (CNNs) have achieved great success in skin lesion clas-
sification. A balanced dataset is required to train a good model. However, due to the
appearance of different skin lesions in practice, severe or even deadliest skin lesion types
(e.g., melanoma) naturally have a small amount represented in a dataset. In that, classi-
fication performance degradation occurs widely, it is significantly important to have CNNs
that work well on class imbalanced skin lesion image datasets. In this paper, we propose
SuperCon, a two-stage training strategy to overcome the class imbalance problem of skin
lesion classification. We first try to learn a feature representation that is closely aligned
among the intra-classes and distantly apart from the inter-classes, and train a classifier that
aims to correctly predict the label based on the learned representations. In the experimental
evaluation, extensive comparisons have been made between our approach and other existing
approaches on skin lesion benchmark datasets. The results show that our two-stage training
strategy effectively addresses the class imbalance classification problem, and significantly
improves existing works in terms of F1-score and AUC score, resulting in state-of-the-art
performance.

4.1 Introduction

Deep learning has been extensively applied in the field of skin image analysis. For exam-
ple, Zhang et al. [160] have shown that convolutional neural networks (CNNs) have achieved
state-of-the-art performance in skin image classification. The performance of a deep neural
8
This chapter was pre-printed in arXiv:2202.05685 (refer to [20]). Permission is included in Appendix A.

62
network model on skin image analysis highly depends on the quality and quantity of the
dataset. However, in most of the real-world skin image datasets [57, 30, 29] the data of
certain classes (e.g., the benign lesions) is abundant which makes them an over-represented
majority, while the data of some other classes (e.g., the cancerous lesions) is deficient which
makes them an underrepresented minority. It is more crucial to precisely identify the sam-
ples from an underrepresented (i.e., in terms of the amount of data) but more important
minority class (e.g., cancerous skin lesions). For instance, a deadly cancerous skin lesion
(e.g., melanoma) that rarely appears during the examinations should be barely misclassified
as benign or other less severe lesions (e.g., dermatofibroma). Hence, it is of great impor-
tance to enable CNNs to work well on imbalanced skin image datasets, especially for those
minority but deadliest skin diseases, e.g., Melanoma.
Conventionally, methods for addressing class imbalanced data roughly drop into three
categories: data-level, loss-level and model-level. In data-level methods, techniques are
usually proposed to directly adjust the training data distributions by tuning the sampling
rate of the data in different classes [62, 76, 110]. For instance, RUS [76] trained a teacher
model by randomly reducing the number of sampled data in the majority classes, and then
re-trains the model on the original dataset. Rather than assuming every sample with equal
weight during the backpropagation, the loss-level methods adapt the loss weight of each
training sample [147, 68, 86, 142, 159, 165]. For example, Lin et al. [86] proposed focal
loss, which reduced the impact of those easily classified samples and majority classes on the
loss. Model-level methods always use a pool of classifiers that are trained with different
distributions and vote during the inference phase [114, 51, 33, 171]. For example, given a
pool of classifiers, Ren et al. [114] determined the classifier reliability by fuzzy set theory, and
combined the decision credibility of each test sample to make the final decisions. However,
conventional methods are not able to regularize the learned features of the training samples
for unstructured data types (e.g. image, video, audio, and text), which is the key property
of deep learning.

63
In this study, we expect a model that can easily discriminate different classes if they are
well represented in a non-overlapping distribution. To date, some existing works have shown
the superiority of this intuition. Zhou et al. [166] presented a two-stage training strategy to
simultaneously learn the universal features to model the minority class features, and adap-
tively predict the outputs afterward by aggregating the above two branches. Chu et al. [28]
decomposed the features of each class into a class-generic and a class-specific component, and
augmented the minority classes feature to enlarge the training samples. These approaches
have considered the class imbalanced problem in the feature space, but they only try to
enlarge the number of minority classes in the feature space, and lack of an effective method
to improve the feature quality. Hence, it calls for a solution where a good representation of
each class should be obtained regardless of data class disproportions. Specifically, to solve
the problem of enabling the good performance of CNNs on class imbalanced datasets, we
would like to focus on solving the challenge of learning high-quality representation that is
not affected by the ratio of the number of samples. There are a few existing works aiming
to enhance the ability of representation learning [93, 153, 61, 119, 70]. For instance, Misra
et al [93] tried to learn invariant feature representation between the original image and its
augmentation using a noise contrastive estimator. Khosla et al [70] leveraged the class label
to learn clustered representation for downstream tasks.
In this paper, we introduce SuperCon, a two-stage training strategy for skin lesion clas-
sification in class imbalanced scenarios by learning good feature representations. It utilizes
supervised contrastive loss [70] to learn clustered feature representations for different classes
in the first stage and apply focal loss [86] to further enhance the robustness of the classifier.
To be more specific, the supervised contrastive loss is able to pull the learned representa-
tions of the same class together and push apart the representations from different classes.
During the second stage, focal loss can reduce the impact of majority class samples have on
the cross-entropy loss. The two stages train sequentially, which means the classifier will be
fine-tuned on top of a frozen feature extractor after the first stage is finished. By optimizing

64
the two stages, the model is robust enough to extract discriminative features and perform
well on the imbalanced dataset.
A comprehensive evaluation using four different backbone networks has been conducted
on the current benchmark datasets ISIC challenge 2019 [134, 30, 31] and ISIC challenge
2020 [115]. The experimental results show that our SuperCon consistently outperforms the
other baseline approaches. Particularly, under a more class-imbalanced dataset, SuperCon
performs much better than using focal loss [86] during transfer learning by 0.3 in terms of
the averaged AUC score among all the backbone networks.
To summarize, our work makes the following contributions:

• We present a novel and effective training strategy for skin image analysis on class im-
balance dataset. SuperCon adopts supervised contrastive loss and focal loss to address
representation learning and imbalance classification tasks respectively.

• To the best of our knowledge, this is the first work that uses supervised contrastive
loss to address the class imbalance problem, especially for skin lesion classification.

• A comprehensive comparison and analysis have been conducted. For the sake of repro-
ducibility and convenience of future studies about skin image analysis on imbalance
dataset, we have released the prototype implementation of our proposed SuperCon 9 .

The rest of the paper is organized as follows: Section 4.2 presents the related literature
review. Section 4.3 describes our proposed method. Section 4.4 presents the experimental
evaluation. Section 4.5 presents the conclusion.

4.2 Related Work

4.2.1 Skin Image Analysis in Deep Learning

In recent years, extensive research works have applied deep learning in the field of
skin image analysis, such as lesion segmentation [84, 155, 2, 156] and disease diagnosis
9
https://github.com/keyu07/SuperCon ISIC

65
[160, 92, 171, 65, 7, 105]. For example, Yuan et al. [155] presented an end-to-end deep
learning framework to automatically segment dermoscopic images. Li et al. [84] introduced
a transformation-consistent strategy in the self-sensing model to enhance the regularization
effect for pixel-level predictions, such that further improves the skin image segmentation
performance. Zhang et al. [160] proposed an attention neural network model for skin image
lesion classification in dermoscopy images and achieves state-of-the-art performance. Perez
et al. [105] investigated the performance of transfer learning among different backbone neu-
ral network architectures. In this paper, we focus on improving the performance of disease
diagnosis (classification) under the class imbalance setting.

4.2.2 Class Imbalance

Class imbalance problem refers to that when certain classes (minority classes) contain
significantly fewer samples than other certain classes (majority classes). But in many cases,
the minority class is the class of interest. For example, the ISIC challenge 2020 [115] dataset
consists of 32,542 benign skin lesion samples and 584 malignant lesion samples, where the
malignant type is considered as cancerous skin lesion and the class of interest. Conventional
deep learning approaches easily over-classify the majority class due to the increased prior
probability, resulting in more frequent misclassifying of the minority class.
In terms of the focus of a machine learning pipeline, conventional works addressing this
problem can be roughly divided into three categories: data-level, loss-level, and model-level
methods. In data-level methods, techniques are usually proposed to adjust the training
distribution by tuning the sampling rate of data from different classes [62, 76, 110]. For
instance, random under-sampling (RUS) [76] pre-trained a model by randomly ignoring the
sample from the majority class, and re-trains the model on the original dataset. Pouyanfar
et al. [110] applied a dynamic sampling method that adjusts sampling rates according to the
F1-score of the previous iteration. Instead of modifying the training data distribution, loss-
level methods decrease the importance of certain majority classes during the decision-making

66
process or consider certain pre-defined weights of each class during the training process
[147, 68, 86, 142, 159, 165]. Lin et al. [86] proposed focal loss by reshaping the cross-entropy
loss in order to reduce the impact of those easily classified samples and majority classes on the
loss during the training process. Rather than training a single model, model-level methods
normally prepare a pool of models and fuse them to make the final decisions [114, 51, 33, 171].
The contributions of different models to one final decision are normally weighted by each
model’s reliability, the model confidence on each testing sample or both. For instance, Ren et
al. [114] determined the classifier reliability by fuzzy set theory, and combined the decision
credibility of each testing sample to make the decisions. As a common sense, deep learning
neural networks are good at learning high hierarchy features from unstructured data types
(image, video, text, etc.), which the conventional class imbalanced methods are not able to
handle.
To extract good features for the upcoming classification task, many recent works have
shown the superiority of addressing the imbalanced problem in feature space. Zhou et al.
[166] presented a two-stage training strategy to simultaneously learn the universal features
and the minority class features by uniformly and reversely sampling the training image and
aggregated the two features to predict the class labels afterward. Duggal et al. [39] presented
an early-exiting framework, which tries to adapt the model’s attention by ignoring easily
classified samples from the early stage and focusing on the hard ones at the late stages.
Chu et al. [28] decomposed the features of each class into class-generic and class-specific
components, and augmented the minority classes features to enlarge the training samples.
Those approaches significantly mitigate the imbalanced problem, but they only enlarge
the quantity and lack of considerations to improve the feature quality. In this paper, we as-
sume that a good prediction model should be obtained regardless of the class disproportion
if both majority and minority are well represented and come from non-overlapping distribu-
tions. Hence, we aim to leverage robust representation learning techniques to improve the
feature quality for the class imbalanced problem.

67
4.2.3 Representation Learning

Representation learning in deep learning refers to learning rich and representative pat-
terns from the abundance of data. Self-supervised learning and contrastive learning have
gained popularity recently because they are able to obtain robust representation via self-
defined tasks without using annotating data [93, 153, 61, 119, 70, 101]. For instance, Noroozi
et al [101] aimed to re-order the divided image patches, forcing the model to learn the rela-
tionship between different parts of an object. Wu et al. [93] tried to learn invariant feature
representation between the original image and its augmentation by using noise contrastive
estimation. Khosla et al. [70] leveraged the class label to learn clustered representation for
downstream tasks, outperforming the best top-1 accuracy on ResNet200 [59].

4.3 Methodology

To address the class imbalance problem, we assume that a good model should be obtained,
regardless of class disproportion, if different classes are well represented and come from
non-overlapping distribution. Based on this assumption, we suggest that a well-learned
feature representation should satisfy the following properties: (i) The representation should
be similar enough within the same class, resulting in a more centralized cluster in the feature
space and an easier classification task. (ii) The representation of different classes should be
away from each other. Otherwise, overlapping distributions could mislead the classifier to
make wrong decisions.
To handle the two properties above, we develop SuperCon, a two-stage training strategy
for imbalanced image classification: representation training and classifier fine-tuning. During
the first stage, supervised contrastive loss [70] is applied to encourage the feature extractor
to generate closely aligned feature representation in the same class and generate distantly
representation from different classes. In this way, the learned feature representation is well
embedded and is able to further guide the classifier to make the right decisions. During
the second stage, we leverage the focal loss [86] to reduce the impact that the majority

68
Representation Training

Feature Mapping
Extractor Module

Augmentation
Module

Share Parameters
Classifier Fine-tuning

Feature
Extractor Classifier

Figure 4.1 Diagram of our two-stage training.

class has on the loss. As illustrated in Figure 4.1, the upper and the bottom branches show
the training process of representation training and classifier fine-tuning, respectively. After
finishing the representation training, the parameters of the well-trained feature extractor for
classifier fine-tuning will be shared and frozen. In the rest of this section, the two-stage
training will be presented separately: representation training in section 4.3.1 and classifier
fine-tuning in section 4.3.2.

4.3.1 Representation Training

As described above, the goal of this stage is to learn a feature extractor that closely aligns
intra-class feature representation and pulls apart the inter-class feature representation. The
representation training includes 3 modules (shown in Figure 4.1 upper branch): (i) A data
augmentation module A, which provides different views of the input images. (ii) A feature
extractor Fθ with parameter θ, which takes the images as input and encodes them into
embedded feature space. (iii) A mapping module M, which maps the embedded feature into

69
a lower dimensional space. Note that both the feature extractor and the mapping module
contain the learnable parameters during training.
For each training sample x ∈ X , a random view x ′ ∈ X ′ is generated via the data
augmentation module:
x ′ = A(x) (4.1)

where x ′ describes the input image in a different view and contains some subset of the
information in the original image. The view could be different rotation angle, color distortion,
etc. We consider that the various views provide additional noise to the feature extractor and
enhance its generalization ability. The details of the augmentation methods used in our
experiments will be introduced in section 4.4.2.
Both the original image and its augmentation view will be processed by the feature
extractor to learn the feature vector:

rep, rep ′ = Fθ (x), Fθ (x ′ ) (4.2)

where the rep indicates the learned feature vector. Therefore, the original images and aug-
mented views are included in each mini-batch, resulting in a multiview-batch.
To save the expense of computing supervised contrastive loss, we map the feature vectors
to a lower dimension using the mapping module:

z, z ′ = M(rep), M(rep ′ ) (4.3)

where z and z ′ represent the lower dimension of the feature representation of the original
input and its view. After this, the supervised contrastive loss is applied on z and z ′ as
follows:

X −1 X exp(zi zp /τ )
LSuperCon = log P (4.4)
i∈I
|Pi | p∈P n∈N exp(zi zn /τ )
i

70
where the index i is called the anchor, |P| is the total number of positive samples in the
multiview-batch, the p is the index of positive samples P, the n is the index of negative
samples N and τ is a scalar temperature parameter. The positives and negatives are the
samples have the same class labels as the anchor and have different labels from the anchor,
respectively.
For all the samples in the batch, each sample will be an anchor once, and the sample
similarity will be computed via the inner product (cosine similarity) of their feature vectors,
for both the anchor-positive and anchor-negative pairs. Recall that our objective is to
closely align intra-class feature representation and distantly apart the inter-class feature
representation. It means in feature space, the cosine similarity of the intra-class should be
close to 1, and the inter-class should be close to 0. Hence, by minimizing the loss function
iteratively, the feature extractor is able to learn close feature representation within the same
class and learns very different representations from different classes.
To summarize, we update the parameters of the feature extractor as follows:

θm+1 ← θm − λ · ∇(LSuperCon ) (4.5)

where m is the number of learning iteration, and λ is the constant learning rate.

4.3.2 Classifier Fine-tuning

Since the feature extractor is well trained in the previous stage, the label prediction
task is then addressed by a classifier fine-tuning. A common issue of class-imbalanced data
training is that the easily classified majority class samples dominate the gradient, but the
minority class has not been presented enough. Finally, the model is over-confident in the
majority but fails in the minority. This issue is mitigated by using focal loss [86], which
reduces the impact that the majority class has on the loss.

71
To train the classifier, the classification model is composed of the feature extractor Fθ from
representation training and a trainable classifier Cφ with learnable parameters φ. On top of
the well-trained feature extractor, we freeze its parameters θ and only fine-tune the classifier
in this stage. By using the focal loss, which already shown a performance improvement on
class imbalance problem [86]. It reshapes the cross-entropy loss and assigns different weights
to majority and minority classes. The loss function is defined as follows:

LFocal (Prot ) = −αt (1 − Prot )γ log (Prot ) (4.6)

where αt is a class-wise weight that is used to increase the importance of the minority class.
While γ is a sample-wise weight that is used to reduce the propagation impact from well-
classified samples. t and Pro indicate the index of the sample and the related predicted
probability.
The classifier parameter is updated as follows:

φm+1 ← φm − β · ∇(LFocal ) (4.7)

where the β is the classification learning rate.

4.4 Experimental Evaluation

We conduct our experiments on two benchmark datasets, and use ResNet backbone
networks [59] to evaluate the performance of our proposed approach. Comprehensive com-
parisons have been made among SuperCon and other baseline approaches. The experimental
results show that SuperCon significantly improves other approaches and achieves state-of-
the-art performance on these two datasets.

72
Table 4.1 The sampled number of images for training and testing.

Training I Training II Testing

Maglignant 467 + 4,522 (ISIC 2019) 467 117

Benign 26,003 26,003 6,509

4.4.1 Experimental Datasets

Two benchmark datasets in skin lesion classification are used to evaluate our approach:
ISIC Challenge 2019 [134, 30, 31] and ISIC Challenge 2020 [115]. ISIC 2019 consists of 25,331
images in 8 classes, where 4,522 images share the same label (melanoma) as the minority
class in ISIC 2020. While ISIC 2020 dataset consists of 33,126 images in 2 classes, where 584
are confirmed malignant skin lesions and the rest 32,542 are benign lesions. The ISIC 2020
is randomly split into 80%/20% as training and testing. To explore the effectiveness under
different imbalanced ratios, we add 4,522 melanoma images from ISIC 2019 to our training
set of ISIC 2020. Accordingly, the detailed data split is shown in Table 4.1.

4.4.2 Experimental Settings

During the two-stage training process, we first employ representation training to train a
feature extractor that good at capturing the representations of the training images. Then,
the feature extractor is frozen and a classifier is fine-tuned using the Stochastic Gradient
Descent (SGD) optimizer for 10 epochs. The loss converges quickly in the second stage
due to the well-learned feature representation from the first stage. Both stages use batch
size 128 to ensure at least one anchor in the mini-batch. A popular image augmentation
process SimAugment [23] is used in both stages to provide the multiviews of the input image,
which applies random flips, rotations, color distortion, and Gaussian blur. The mapping
module m takes the output of the feature extractor and reduces its dimension to 128. The
hyperparameters of the first stage are below: τ = 0.1 and learning rate λ = 0.01. Because
smaller τ will lead to better performance but extremely small τ will lead to unstable training.

73
Table 4.2 Experimental results on the dataset of ISIC 2020 and ISIC 2019.

ISIC SuperCon
Vanilla Focal-Loss ROS RUS SuperCon
2019+2020 -CE
ResNet18
F1-score 0.51 0.57 0.35 0.56 0.56 0.58
AUC score 0.742 0.760 0.725 0.794 0.773 0.778
ResNet50
F1-score 0.51 0.58 0.36 0.58 0.67 0.67
AUC score 0.791 0.833 0.744 0.826 0.888 0.892
ResNet101
F1-score 0.56 0.64 0.38 0.62 0.75 0.68
AUC score 0.817 0.845 0.760 0.854 0.923 0.919
ResNet152
F1-score 0.53 0.63 0.40 0.60 0.74 0.84
AUC score 0.834 0.848 0.774 0.860 0.921 0.921
Average
F1-score 0.53 0.61 0.37 0.59 0.68 0.69
AUC score 0.796 0.822 0.751 0.834 0.876 0.878

During the second stage, since the classifier only needs to fine-tune, so we simply use a small
learning rate β = 5e-4, α = 0.25 and γ = 5 for focal loss. Note that the final results are not
sensitive to the change of α and γ. ResNets pre-trained on ImageNet [35] are employed as our
backbone networks to evaluate the proposed approach: ResNet18, ResNet50, ResNet101 and
ResNet152 [59]. Model-agnostic can be achieved by simply changing the backbone network
architectures without additional implementation. All of our experiments are implemented
using PyTorch, on a server with an RTX 3090 24 GB GPU.

4.4.3 Effectiveness Analysis

To demonstrate the effectiveness of SuperCon, a comparison between other baseline ap-

proaches and ours has been conducted.

74
• Vanilla is the baseline that trains the given dataset on the pre-trained CNN network
with cross-entropy loss.

• Focal-Loss [86] has the same setting as the Vanilla but uses the focal loss.

• ROS [62] Random Over-Sampling trains the given dataset by over-sampling the mi-
nority class, which is the malignant skin lesion.

• RUS [76] Random Under-Sampling randomly pre-trains a model by randomly reducing

the number of sampled data in the majority class, and then re-train the model on the
original dataset.

• SuperCon-CE is the approach that employs our two-stage training strategy but uses
cross-entropy loss during the second stage (classifier fine-tuning).

• SuperCon is our full two-stage training strategy with focal loss.

Given a baseline approach, we evaluate its effectiveness in terms of the Macro F1-score
and AUC score.
TP
Precision = (4.8)
TP + FP
TP
Recall = (4.9)
TP + FN
Precision × Recall
Micro F 1 = 2 × (4.10)
Precision + Recall
C
1 X
Macro F 1 = F 1(Ci ) (4.11)
|C | C
i

where in equation 4.8, TP, FP, and FN are True Positives, False Positives, and False Nega-
tives. In equation 4.11, |C | represents the total number of classes in the dataset, and F 1(Ci )
indicates the F1-score for each single class. When working with an imbalanced dataset, using
the Micro F1-score can be misleading. Because the Micro F1-score gives equal importance
to each sample, the majority classes will drive a huge portion of this score. The macro F1-
score gives equal importance to each class, and the majority and minority classes will drive

75
Table 4.3 Experimental results only on the ISIC 2020.

SuperCon
ISIC 2020 Vanilla Focal-Loss ROS SuperCon
-CE
ResNet18
F1-score 0.50 0.53 0.43 0.79 0.79
AUC score 0.500 0.552 0.804 0.709 0.726
ResNet50
F1-score 0.50 0.55 0.43 0.86 0.89
AUC score 0.500 0.542 0.809 0.865 0.876
ResNet101
F1-score 0.50 0.57 0.43 0.93 0.94
AUC score 0.500 0.558 0.810 0.897 0.901
ResNet152
F1-score 0.50 0.55 0.44 0.93 0.94
AUC score 0.500 0.554 0.821 0.903 0.906
Average
F1-score 0.50 0.55 0.43 0.88 0.89
AUC score 0.500 0.552 0.811 0.843 0.852

an equal portion of this score. Thus, in the experimental evaluation, we only use Macro
F1-score and present F1-score for short.
The results are shown in Table 4.2, we can observe that: (i) With deeper backbone
network architecture, both the F1-score and AUC score increase on all the approaches. (ii)
Vanilla always has reasonable performance in terms of AUCs. But the F1-score is very low.
(iii) Focal-Loss and RUS have slight improvement on the Vanilla in both scores. (iv) ROS
always has the worst performance, even worse than the baseline Vanilla. (v) Our SuperCon
consistently outperforms the others in terms of AUC score and is only slightly lower than that
without using focal loss on ResNet101. (vi) We can see that the performance improvement is
mainly coming from the representation training if compare the last two columns (SuperCon-
CE and SuperCon). (vii) On the overall average using the 4 different backbone networks,
SuperCon-CE and SuperCon outperform the other approaches by at least 0.1 on the F1-score
and over 0.04 on the AUC score, respectively.

76
To better demonstrate the effectiveness of SuperCon, confusion matrices of the six ap-
proaches using ResNet152 are plotted in Figure 4.2, the x-axis and y-axis indicate the pre-
dicted and true label. We can see that: (i) In Vanilla, the minority class is dominated
by the majority class, where 111 out of 117 samples are classified as the majority class.
(ii) Compared to Vanilla, Focal-Loss leads to a better classification performance on True
Negative, where 89 samples in melanoma are correctly classified. But it also has a higher
False Negative, where 417 samples in the majority class are misclassified. (iii) In Figure
4.2(c). ROS shows a very high False Negative, where over 40% majorities are misclassified.
Because, by design, ROS uniformly learns the minority and majority classes, resulting in
overperformed on the minority and underperformed on majority class. (iv) RUS also tries to
balance the distribution by randomly ignoring the majority of samples, but it fine-tunes the
original data distribution afterward. Thus it has a much lower False Negative than ROS. (v)
Both SuperCon-CE and SuperCon have significantly improvement in terms of the number
of correctly classified samples in both majority and minority classes.

4.4.4 Evaluation on Extremely Imbalanced Dataset

In the previous section, we conduct a comparison among our SuperCon and other baseline
approaches on the dataset ISIC 2019+2020. However, a more imbalanced dataset is more
likely to lead to failure in conventional training. In this section, Training II is used to evaluate
the effectiveness of our two-stage training strategy under a more imbalanced ratio. The detail
of sampled images is shown in the last two columns in Table 4.1. In this way, the imbalanced
ratio (i.e., the number of majority / the number of minority) is increased significantly from 5.2
to 55.7. Confusion matrix and t-SNE [138] extracted feature visualization of each approach
is shown in Figure 4.3 and Figure 4.4, respectively. Notably, RUS [76] is not evaluated
under this setting, because the amount of samples (only 467 malignant lesion samples) is
not enough for training if we balance the distribution by under-sampling the majority class.
One can be considered as a limitation of RUS.

77
Malignant Benign

Malignant Benign

Malignant Benign
6467 42 6092 417 3672 2837

111 6 28 89 2 115
Benign Malignant Benign Malignant Benign Malignant
(a) (b) (c)
Malignant Benign

Malignant Benign

Malignant Benign
5972 537 6320 189 6431 78

23 94 15 102 17 100
Benign Malignant Benign Malignant Benign Malignant
(d) (e) (f)

Figure 4.2 Confusion matrices of (a) Vanilla, (b) Focal-Loss, (c) ROS, (d) RUS, (e)
SuerCon-CE and (f) SuperCon using backbone network ResNet152 on dataset ISIC 2019 +
2020.

The experimental results are shown in Table 4.3, we can observe that: (i) Vanilla con-
sistently has 0.5 on both metrics and Focal-Loss is only slightly better than it. (ii) ROS
has the best AUC among all approaches with ResNet18 but improves limited when us-
ing more powerful backbone networks. Its F1-score is the worst and even all lower than
0.5. (iii) The performance of SuperCon-CE and SuperCon are much better than the other
approaches, where they improve the F1-score and AUC by 0.33 and 0.032 on average, re-
spectively. (iv) The performance of SuperCon-CE and SuperCon are very close, and the

78
Malignant Benign

Malignant Benign

Malignant Benign
6509 0 6450 59 4290 2219 6500 9 4290 2219

117 0 106 11 2 115 22 95 2 115

Benign Malignant Benign Malignant Benign Malignant Benign Malignant Benign Malignant
(a) (b) (c) (d) (e)

Figure 4.3 Confusion matrix of (a) Vanilla, (b) Focal-Loss, (c) ROS, (d) SuperCon-CE and
(e) SuperCon using backbone network ResNet152 on dataset ISIC 2020.

SuperCon is consistently better than that without using focal loss on both the F1-score and
AUC. (v) Comparing SuperCon-CE and SuperCon in Table 4.2 and 4.3, the performance
with a high imbalance ratio is better than that using additional data from ISIC 2019. We
consider that the ISIC 2019 and ISIC 2020 are coming from different distributions, although
the additional ISIC 2019 and the malignant type in 2020 are cancerous lesions. This could
caused by the images being generated by using different devices or different preprocessing
protocols. Hence the out-of-distribution samples mislead the representation training in the
first stage, resulting in a lower performance.
Figure 4.3 shows the confusion matrices of the four approaches using ResNet152, where
the x-axis and y-axis indicate the predicted and true label. We can see that: (i) Vanilla
simply classifies all samples as benign lesions and none of the malignant samples are classified
correctly, producing 0 False Positive and 0 True Negative. One is the reason. (ii) Focal-Loss
is able to correctly classify some of the minority class, but the False Positive is still very high:
106
117
= 90.6%. (iii) ROS shows the best True Negative, but nearly 1/3 majority of samples
are misclassified. (iii) Both SuperCon-CE and SuperCon have high True Positive and True
Negative, resulting in significant improvement.
We utilize t-SNE [138] to analyze the learned feature representation of different ap-
proaches on ResNet50. We can observe that: (i) For Vanilla and Focal-Loss, the extracted
feature representation from different classes are overlapped and not independent. It means

79
Benign
Malignant
(a) (b) (c)

(d) (e)

Figure 4.4 t-SNE visualization of extracted features from fθ , using (a) Vanilla; (b)
Focal-Loss; (c) ROS; (d) SuperCon-CE; (e) SuperCon.

that the model fails to distinguish one from another. (ii) ROS has a clustered malignant
feature representation, but it overlaps with the benign one. This can also explain the high
False Positive in Figure 4.3 (c). (iii) Since both SuperCon-CE and SuperCon have processed
the same first stage. They look close to each other and both yield the best separation of
different classes.

4.4.5 Summary of Experimental Evaluation

To summarize the experimental evaluation, we investigate the ability of SuperCon to

handle the class imbalance on the skin lesion dataset. The F1-score and AUC score show
that our SuperCon is outstanding among existing state-of-the-art approaches. Confusion
matrices demonstrate that our SuperCon not only yields high True Positives but also high

80
True Negatives, whereas other approaches hardly reach high True Negatives or reach with
a trade-off (ROS). Extracted feature visualization further illustrates the success of using
our SuperCon, where an independent and non-overlapping distribution is learned. However,
SuperCon requires two stages for training, and the performance is sensitive to the number
of training epochs in the first stage (Representation training). A proper number of train-
ing epochs will result in better performance. One can be considered as the limitation of
SuperCon.

4.5 Conclusion

In this paper, we proposed SuperCon, a two-stage training strategy for class imbalance
problems on skin lesions, representation training, and classifier fine-tuning. The representa-
tion training tries to obtain a feature representation that is closely aligned among intra-class
and distantly apart from inter-class. While the classifier fine-tuning learns a classifier to
address label prediction tasks on the basis of the obtained representation. Extensive exper-
iments have been conducted to show that SuperCon is able to discriminate different classes
under class imbalance settings. The experimental results also show that our SuperCon con-
sistently and significantly beat the existing approaches. For instance, while using a more
imbalanced dataset (ISIC 2020), SuperCon outperforms the state-of-the-art Focal-Loss by
averaging 0.34 and 0.3, in terms of F1-score and AUC score respectively. Currently, the
number of training epochs in the first stage requires a manual set based on experience to ob-
tain distinguishable representation. In future work, we plan to design an end-to-end solution
that handles the representation training and classifier fine-tuning simultaneously.

81
Chapter 5: CS-AF: A Cost-sensitive Multi-classifier Active Fusion Framework
for Skin Lesion Classification10

Convolutional neural networks (CNNs) have achieved the state-of-the-art performance in

skin lesion analysis. Compared with single CNN classifier, combining the results of multi-
ple classifiers via fusion approaches shows to be more effective and robust. Since the skin
lesion datasets are usually limited and statistically biased, while designing an effective fu-
sion approach, it is important to consider not only the performance of each classifier on the
training/validation dataset, but also the relative discriminative power (e.g., confidence) of
each classifier regarding an individual sample in the testing phase, which calls for an active
fusion approach. Furthermore, in skin lesion analysis, the data of certain classes (e.g., the
benign lesions) is usually abundant which makes them an over-represented majority, while
the data of some other classes (e.g., the cancerous lesions) is deficient which makes them
an underrepresented minority. It is more crucial to precisely identify the samples from an
underrepresented (i.e., in terms of the amount of data) but more important minority class
(e.g., cancerous skin lesions). In other words, misclassifying a more severe skin lesion to a
benign or less severe skin lesion should have relatively more cost (e.g., money, time and even
lives). To address such challenges, we present CS-AF, a cost-sensitive multi-classifier active
fusion framework for skin lesion classification. In the experimental evaluation, we prepared
96 base classifiers (of 12 CNN architectures) on the ISIC Challenge 2019 research dataset.
Our experimental results show that our framework consistently outperforms both the static
and the active fusion competitors in terms of the accuracy and total costs.
10
This chapter was published in Neurocomputing Volume 491 (refer to [172]). Permission is included in
Appendix A.

82
5.1 Introduction

Deep learning (DL) has achieved great success in many applications related to skin lesion
analysis. For instance, Zhang et al. [160] has shown that convolutional neural networks
(CNNs) have achieved the state-of-the-art performance in skin lesion classification. Also, as
the development of various deep learning techniques, numerous different designs of classifiers,
that might have different CNN architectures, use different sizes of the training data, use
different subsets or classes distributions of the training data or use different feature sets,
were proposed to tackle the skin lesion classification problem. For instance, as shown in
the ISIC Challenges [57, 30, 29], several CNN architectures have been used in skin lesion
analysis, including ResNet, Inception, DenseNet, PNASNet, etc. Because of such difference
(i.e., CNN architectures, subset of the training data, feature sets, etc.), those classifiers tend
to have distinct performance under different conditions (e.g., different subsets or classes
distributions of different datasets). There is no one-size-fits-all solution to design a single
classifier for skin lesion classification. It is necessary to investigate multi-classifier fusion
techniques to perform skin lesion classification under different conditions.
Designing an effective multi-classifier fusion approach for skin lesion classification needs to
address two challenges. First, since the datasets are usually limited and statistically biased
[57, 30, 29], while conducting multi-classifier fusion, it is necessary to consider not only
the performance of each classifier on the training/validation dataset, but also the relative
discriminative power (e.g., confidence) of each classifier regarding an individual sample in the
testing phase. This challenge requires the researchers to design an active fusion approach,
that is capable of tuning the weight assigned to each classifier dynamically and adaptively,
depending on the characteristics of given samples in the testing phase. Second, since in most
of the real-world skin lesion datasets [57, 30, 29] the data of certain classes (e.g., the benign
lesions) is abundant which makes them an over-represented majority, while the data of some
other classes (e.g., the cancerous lesions) is deficient which makes them an underrepresented
minority, it is more crucial to precisely identify the samples from an underrepresented (i.e.,

83
in terms of the amount of data) but more important minority class (e.g., cancerous skin
lesions). For instance, a deadly cancerous skin lesion (e.g., melanoma) that rarely appears
during the examinations should be barely misclassified as benign or other less severe lesions
(e.g., dermatofibroma). Specifically, misclassifying a more severe lesion to a benign or less
severe lesion should have relatively more cost (e.g., money, time and even lives). Hence, it
is also important to enable such “cost-sensitive” feature in the design of an effective multi-
classifier fusion approach for skin lesion classification.
In this work, we propose CS-AF, a cost-sensitive multi-classifier active fusion framework
for skin lesion classification, where we define two types of weights: the objective weights
and the subjective weights. The objective weights are designed according to the classifiers’
reliability to recognize the particular skin lesions, which is determined by the prior knowledge
obtained through the training phase. The subjective weights are designed according to the
relative confidence of the classifiers while recognizing a specific previously “unseen” sample
(i.e., individuality), which are calculated by the posterior knowledge obtained through the
testing phase. While designing the objective weights, we also utilize a customizable cost
matrix to enable the “cost-sensitive” feature in our fusion framework, where given a sample,
different outputs (i.e., correct predictions or all kinds of errors) of a classifier should result in
different costs. For instance, the cost of misclassifying melanoma as benign should be much
higher than misclassifying benign as melanoma. In the experimental evaluation, we trained
96 base classifiers as the input of our fusion framework, utilizing twelve CNN architectures on
the ISIC Challenge 2019 research dataset for skin image analysis [57, 30, 29]. We compared
our approach with two static fusion baseline approaches (i.e., max voting and average fusion)
and two state-of-the-art active fusion approaches (i.e., MCE-DW [114] and DES-MI [51]).
Our experimental results show that our CS-AF framework consistently outperforms the
static fusion baseline approaches and the state-of-the-art competitors in terms of accuracy,
and always achieves lower total cost.

84
To summarize, our work has the following contributions:

• We present a novel and effective multi-classifier active fusion framework, where the
proposed multi-classifier weight assignment not only leverages the “reliability” (i.e.,
the objective weights) extracted from the prior knowledge of the training/validation
dataset, but also take advantages of the “individuality” (i.e., the subjective weights)
computed from the posterior knowledge of the testing dataset.

• We propose an approach to enable the “cost-sensitive” feature of our multi-classifier

active fusion framework, where the proposed multi-classifier weight assignment can
easily actively adapt to different customized cost matrices.

• To the best of our knowledge, our work is the first one that attempts to apply active
fusion for skin lesion analysis, and demonstrates its advantages over the conventional
static fusion and existing active fusion approaches. Specifically, a comprehensive ex-
perimental evaluation using twelve popular and effective CNN architectures has been
conducted on the most popular skin lesion analysis benchmark dataset, ISIC Challenge
2019 research datasets [57, 30, 29]. For the sake of reproducibility and convenience of
future studies about fusion approaches in skin lesion analysis, we have released our
prototype implementation of CS-AF, information regarding the experiment datasets
and the code of our comparison experiments.11

The rest of this paper is organized as follows: Section 5.2 presents the related literature
review. Section 5.3 presents the notations of cost-sensitive active fusion, and describes our
proposed framework. Section 5.4 presents the experimental evaluation. Section 5.5 makes
the conclusion.
11
https://github.com/keyu07/CS-AF

85
5.2 Related Work

5.2.1 Multi-classifier Fusion

Fusion approaches have been widely applied in numerous applications, such as skin lesion
analysis [105, 36, 157], human activity recognition [129, 169], active authentication [152], fa-
cial recognition [37, 98, 173], botnet detection [90, 167, 168], domain generalization [19] and
community detection [127, 170]. In terms of whether the weights are dynamically/adaptively
assigned to each classifier, the multi-classifier fusion approaches are divided into two cate-
gories: (i) static fusion, where the weight assigned to each participating classifier will be fixed
after its initial assignment, and (ii) active fusion, where the weights are adaptively tuned
depending on the characteristics of given samples in the testing phase. Many conventional
approaches, such as the bagging [12], boosting [118, 48] and stacking [151], are static fusion
approaches.
To date, a few methods attempting to conduct active fusion were also proposed [114,
33, 51, 22, 42]. For instance, Chen et al. [22] propose to use an attention model to fuse
the weights of different CNNs trained on different scaled input images. Fang et al. [42]
propose to use a U-shape pyramid neural network structure to facilitate the need of training
multi-scale CNNs on multiple partially labeled datasets. Both approaches [22, 42] present
some interesting ideas in adaptively fusing multiple CNNs trained on different datasets.
However, both solutions mainly focus on multi-scale image datasets, rather than imbalanced
or cost-sensitive datasets. META-DES [33] defines five distinct sets of meta-features to
measure the level of competence of a classifier for the classification of input samples, and
proposes to train a meta-classifier to determine the rank or weight of a base classifier while
facing input samples. However, those meta-classifiers were trained on the same dataset
(i.e., the training dataset) as the base classifiers, which would make the meta-classifiers
be less effective or generalized to the “unseen” dataset (i.e., the testing dataset). Also,
META-DES has only been evaluated on several small sample size datasets, which didn’t

86
demonstrate its effectiveness, scalability and generalizability towards more complex datasets
or problems, e.g., skin lesion analysis. DES-MI [51] propose an active fusion approach where
the weights are determined via emphasizing more on the classifiers that are more capable
of classifying examples in the region of underrepresented area among the whole sample
distribution. However, DES-MI only focuses on identifying the most competent classifiers on
the training dataset, which cannot provide enough adaptivity towards the “unseen” testing
dataset. MCE-DW [114] proposes to use the decision credibility that is evaluated by fuzzy
set theory and cloud model, to determine the real-time weight of a base classifier regarding
the current sample in the testing phase. However, both DES-MI and MCE-DW are designed
to work on imbalanced dataset, rather than providing the adaptivity and flexibility to a
cost-sensitive dataset with customized cost matrices, e.g., skin lesion analysis.
In our work, we propose a novel multi-classifier active fusion framework, that leverages
the “reliability” (Section 5.3.3) and the “individuality ” (Section 5.3.4) of the base classifiers
to assign the weights dynamically and adaptively. Also, we propose an approach to enable
the “cost-sensitive” feature of our framework, where the proposed multi-classifier weight
assignment can easily actively adapt to different customized cost matrices.

5.2.2 Fusion of CNNs for Skin Lesion Analysis

Convolutional neural networks (CNNs) have achieved the state-of-the-art performance

[57, 30, 29] in skin lesion analysis since 2016 (i.e., ISIC 2016 Challenge [57]), where nearly all
the teams employed CNNs in either feature extraction or classification procedure. Recently,
several approaches attempting to apply fusion on CNNs to tackle the skin lesion classification
problems are proposed [91, 11, 105]. For instance, Marchetti et al. [91] presents a fusion
of CNNs framework for the classification of melanomas versus nevi or lentigines, where five
fusion approaches were implemented to fuse 25 different CNN classifiers trained on the same
dataset of the same problem to make a single decision. Bi et al. [11] proposes another CNNs
fusion framework to tackle the classification of melanomas versus seborrheic keratosis versus

87
nevi, where three ResNet classifiers were trained for three different classification problems
via fine-tuning pre-traine ImageNet CNNs: the original three-class problem and two binary
classifiers (i.e., melanoma versus both other lesion classes and seborrheic carcinoma versus
both other lesion classes). Perez et al. [105] conducts a comparison study between two
fusion strategies for melanoma classification: selecting the classifiers at random (i.e., among
125 models over 9 CNNs architectures), and selecting the classifiers depending on their
performance on a validation dataset.
To summarize, most of the existing approaches use static fusion approaches for skin
lesion analysis. However, as discussed in Section 5.1, since the skin lesion datasets are
usually limited and statistically biased [57, 30, 29], it is necessary to enable active fusion
in such problem. To the best of our knowledge, our work is the first to design, apply and
evaluate active fusion approaches in the skin lesion classification problems.

5.2.3 Cost-sensitive Machine Learning

A variety of cost-sensitive machine learning approaches have been proposed to tackle

the class imbalance issue in pattern classification and learning problems. Mollineda et al.
[94], a comprehensive study on the class imbalance issue, divides most of the cost-sensitive
machine learning approaches into two categories: the data-level and the algorithmic-level.
The data-level approaches usually manipulate the data distribution via over-sampling the
samples of the minority classes or under-sampling the samples of the majority classes. For
instance, SMOTE [17] is an over-sampling technique proposed to address the over-fitting
problem via synthesizing more of the samples of the minority classes. Several variants of the
SMOTE approach [145, 58, 13, 89] are also proposed to solve this issue.
The algorithmic-level approaches directly re-design the machine learning algorithms to
minimize a customizable loss function, that enables the “cost-sensitive” feature, of the clas-
sifier on certain dataset (e.g., improving the sensitivity of the classifier towards minority
classes). For instance, Importance-weighted risk minimization has been proposed in many

88
machine learning algorithms and implementations, such as LibSVM [16], weighted cross en-
tropy loss functions [102, 107]. However, as pointed out by [14], the weighted cross entropy
is only more effective in the early stage (e.g., a few epochs) of training CNNs, and its im-
pact diminishes quickly as the number of epochs successively increasing. Therefore, it calls
for alternative solutions, compared with directly applying weighted cross entropy, for the
fusion of CNNs for imbalanced or cost-sensitive datasets. Zhang et al. [161] proposes an
extreme learning machine (ELM) based evolutionary cost-sensitive classification approach,
where the cost matrix would be automatically identified given a specific task (i.e., which
error cost more). Iranmehr et al. [64] extends the standard loss function of support vector
machine (SVM) to consider both the class imbalance (i.e., the cost) and the classification
loss. Khan et al. [68] proposes a cost-sensitive deep neural network framework that could
automatically learn the “cost-sensitive” feature representations for both the majority and
minority classes, where during the training phase, the proposed framework would perform
a joint-optimization on the class-dependent costs and the deep neural network parameters.
In this work, we enable the “cost-sensitive” feature in the process of multi-classifier fusion,
and employ it in the skin lesion classification problem.

𝑤𝑤1
Classifier 1 (𝑀𝑀1 )
Cost-sensitive Active Fusion

𝑤𝑤2
Classifier 2 (𝑀𝑀2 )
Prediction of
Sample j
…
…
…

Sample j

𝑤𝑤𝑘𝑘
Classifier k (𝑀𝑀𝑘𝑘 )
1 2 𝑚𝑚
Evaluating/testing sample j 𝑝𝑝𝑘𝑘𝑘𝑘 𝑝𝑝𝑘𝑘𝑘𝑘 ⋯ 𝑝𝑝𝑘𝑘𝑘𝑘
𝑜𝑜𝑘𝑘 + 𝑠𝑠𝑘𝑘
𝑤𝑤𝑘𝑘 =
2

Figure 5.1 The overview of CS-AF framework.

89
5.3 Methodology

5.3.1 Multi-classifier Fusion

In multi-classifier fusion, we define a classification space, as shown in Figure 5.1, where

there are m classes and k classifiers. Let M = {M1 , M2 , ... , Mk } denote the set of base
classifiers and C = {C1 , C2 , ... , Cm } denote the set of classes. Let pkjm denote the posterior
probability of given sample j identified by classifier Mk as belonging to class Cm , where
Pkj = {pkj1 , pkj2 , ... , pkjm } and m l
P
l=1 pkj = 1. Hence, all the posterior probabilities form a k × m

decision matrix as follows:

 
1 2 m
p1j p1j · · · p1j 
 
p 1 p 2 · · · p m 
2j 2j 2j
Pj =  . (5.1)
 
 .. .. . . .. 
 . . .  
 
1 2 m
pkj pkj · · · pkj

Since the importance of different classifiers might be different, we assign a weight wi to the
decision vector (i.e., posterior probabilities vector) of each classifier Ci , where i ∈ {1, 2, ... , k}.
Let Pm (j) denote the sum of the posterior probabilities, that sample j belonging to class m,
of all the classifiers. Then, we have

k
X
Pm (j) = wi · pijm (5.2)
i=1

The final decision (i.e., class) D(j) of sample j is determined by the maximum posterior
probabilities sum:

D(j) = max Pi (j), i ∈ {1, 2, ... , m} (5.3)

Conventional multi-classifier fusion approaches either use the same weight for all the
classifiers (i.e., average fusion) or use static weights that will not be changed after its initial

90
assignment during the training phase. As illustrated in Figure 5.1, our weights (i.e., wk =
Ok +Sk
2
) contains two components: (i) the objective weight Ok that is static and determined
by the prior knowledge obtained through the training phase (Section 5.3.3), and (ii) the
subjective weight Sk that is dynamic and calculated by the posterior knowledge obtained
through the testing phase (Section 5.3.4). For different applications, we can assign different
weights toward Ok and Sk . To be simplified and for demonstration purposes, in this work,
we assign the same weight, i.e., 0.5, on both Ok and Sk , while combining them together.
validation data

Classifier 1 (𝑀𝑀1 ) Reliability 1 (𝑟𝑟1 ) Objective Weight 1 (𝑜𝑜1 )

validation

Classifier 2 (𝑀𝑀2 ) Reliability 2 (𝑟𝑟2 ) Objective Weight 2 (𝑜𝑜2 )

validation

…
…
…

Data

Classifier k (𝑀𝑀𝑘𝑘 ) Reliability k (𝑟𝑟𝑘𝑘 ) Objective Weight k (𝑜𝑜𝑘𝑘 )

validation

Cost Matrix

Classifier Build Reliability Validation Cost-sensitive Adjustment

Figure 5.2 The calculation of the objective weights.

Confusion Matrix Cost Matrix Cost-sensitive Confusion Matrix

Classifier i (𝑀𝑖 )

Objective Weight i (𝑜𝑖 )

Micro-average F1-Score

p
q
𝑝𝑞 𝑝𝑞
𝑟𝑖 × 𝑐𝑝𝑞 = 𝑜𝑖

Figure 5.3 The calculation of the cost-sensitive confusion matrix.

5.3.2 Cost-sensitive Problem Formulation

As discussed in Section 5.1, given a sample, different outputs (i.e., the correct prediction
or all kinds of errors) of a classifier should result in different costs. For instance, misclassifying
a more severe lesion to a benign or less severe lesion should have relatively higher cost. Let

91
cpq denote the cost of classifying an instance belonging to class p into class q. Then, we
obtain a cost matrix as follows:
 
 c11 c12 · · · c1m 
 
 c21 c22 · · · c2m 
CM =  . (5.4)
 
 .. .. .. .. 
 . . . 

 
cm1 cm2 · · · cmm

Let W = {w1 , w2 , ... , wk } be a fusion weight vector, and W be the fusion weight vector
space, where W ∈ W. The goal of cost-sensitive multi-classifier fusion is to find the W ∗ ∈ W,
that can minimize the average cost of the fusion approach’s outcomes over all the testing
samples.

5.3.2.1 Examples of the Design of Cost Matrix

In this section, we would like to show examples of the design of cost matrices. To demon-
strate the “cost-sensitive” feature in our CS-AF framework, here we design two different cost
matrices for the application of skin lesion analysis. There are eight classes, i.e., melanoma
(MEL), squamous cell carcinoma (SCC), basal cell carcinoma (BCC), melanocytic nevus
(NV), actinic keratosis (AK), dermatofibroma (DF), vascular lesion (VASC), benign kerato-
sis (BKL) in our skin lesion classification problem, where MEL, SCC and BCC are cancerous,
and the rest are benign. We would like to demonstrate our work by designing two cost ma-
trices: Cost Matrix A, which emphasizes on the identification of cancerous skin lesions (i.e.,
the cost of misclassifying a cancerous skin lesion is much more than a benign one); and Cost
Matrix B (the opposite of Cost Matrix A), which emphasizes on the identification of benign
skin lesions.
We propose to follow the principles below to design our experimental cost matrices:

92
• All the costs should be positive, since it will be item-wise multiplied with the confusion
matrix. As such, it will not result in negative values in the cost-sensitive confusion
matrix.

• The cost of the correct predictions should depend on the relative severeness of the
corresponding disease. For instance, it should be more valuable (i.e., less cost) to
classify a more severe disease (i.e., melanoma) correctly. To figure out the relative
severeness relationships among all eight skin lesion classes and design our cost matrix
(i.e., Cost Matrix A) in a better way, we referred to the American Academy of Der-
matology Association’s guidance [1]. To be simplified and enable the evaluation of our
work, based on the reference, we heuristically ordered the severeness of the 8 skin lesion
classes (from the most severe one to the least severe one) as follows: melanoma (MEL),
squamous cell carcinoma (SCC), basal cell carcinoma (BCC), melanocytic nevus (NV),
actinic keratosis (AK), dermatofibroma (DF), vascular lesion (VASC), benign keratosis
(BKL). It is worth noting that the absolute cost (i.e., quantitative evaluation) for each
disease is non-trivial to decide, but the relative severeness (i.e., qualitative evaluation)
is able to determine.

• The relative costs of different incorrect predictions should be based on their relative
severeness. For instance, misclassifying melanoma (i.e., a deadly cancerous skin lesion)
as benign keratosis should result in much more cost than the opposite scenario.

• The maximum cost of correct predictions should be no more than the minimum cost
of incorrect predictions.

Figure 5.4 illustrates the Cost Matrix A and Cost Matrix B that we utilized to evaluate
our framework in the experimental evaluation. Let us take the design of Cost Matrix A as
an example. Firstly, we assign the cost of correct prediction of each skin lesion class, i.e., cii ,
i = 1, 2, ... , m (as defined in Section 5.3.2), according to the relative disease severeness, where
predicting a more severe skin lesion class correctly should result in less cost. For instance,

93
MEL 1 60 40 86 200 119 156 25 MEL 8 42 29 57 125 77 99 21
NV 42 4 17 18 25 20 23 21 NV 60 5 19 16 21 18 19 25
BCC 29 19 3 22 34 25 30 18 BCC 40 17 6 19 26 21 23 20
AK 57 16 19 5 21 18 19 25 AK 86 18 22 4 18 16 17 32
True label

True label
BKL 125 21 26 18 8 17 16 42 BKL 200 25 34 21 1 19 18 60
DF 77 18 21 16 19 6 18 29 DF 119 20 25 18 17 3 16 40
VASC 99 19 23 17 18 16 7 35 VASC 156 23 30 19 16 18 2 49
SCC 21 25 20 32 60 40 49 2 SCC 25 21 18 25 42 29 35 7
L
NV

C
AK

L
DF

L
NV

C
AK

L
DF

C
ME

BK
BC

SC
VA

VA
Prediction Prediction
(a) (b)

Figure 5.4 Two examples of cost matrices.

we set the cost of correct prediction of MEL (i.e., the most severe one) as 1, and the cost of
correct prediction of BKL (i.e., the least severe one) as 8. Secondly, to calculate the relative
cost of each incorrect prediction, we follow the equation below:

cjj 2
cij = , i ̸= j (5.5)
cii

where as defined in Section 5.3.2, cij denote the cost of classifying an instance belonging to
class i into class j. For instance, if the cost of correct prediction of MEL is 1 and the cost
of correct prediction of BKL is 8, the cost of misclassifying an instance belonging to MEL
into BKL would be ( 18 )2 = 64. Last but not least, to ensure the cost of correct predictions
are always no more than the cost of incorrect predictions, without loss of generality, we
normalized the costs of misclassifications to integers between 16 and 200, using min-max
scaling. Figure 5.4(a) shows the final result of our designed Cost Matrix A.
To evaluate our framework under different cost matrices, we also designed a Cost Matrix B
(as shown in Figure 5.4(b)), that emphasizes on benign lesions (i.e., the cost of misclassifying
a benign lesion is much more than a cancerous lesion). Cost Matrix B follows the same design

94
steps as Cost Matrix A, other than considering an exactly reverse order of the severeness.
For instance, in the design of Cost Matrix B, melanoma became the “least severe” one while
benign keratosis became the “most severe” one.
Decision Vector (Soft Label)
Subjective
Classifier 1 (𝑀1 ) Individuality 1 (𝑖1 )
Weight 1 (𝑠1 )

Normalization to [0,1]
Subjective
Classifier 2 (𝑀2 ) Individuality 2 (𝑖2 )
Weight 2 (𝑠2 )

…
…
Sample j

Subjective
Classifier k (𝑀𝑘 ) Individuality k (𝑖𝑘 )
Weight k (𝑠𝑘 )
1 2 𝑚
Evaluating/testing sample j 𝑝𝑘𝑗 𝑝𝑘𝑗 ⋯ 𝑝𝑘𝑗

Figure 5.5 The calculation of the subjective weights.

5.3.3 Computing the Objective Weights

The objective weights are designed according to the classifiers’ reliability to recognize
the particular skin lesions, which is determined by the prior knowledge obtained through
the training phase. In the training phase, we separate all the labeled data into two parts:
training dataset and validation dataset. The training dataset will be used to train/build
the base classifiers, while the validation dataset will be used to evaluate the performance
of each base classifier. The reliability of each base classifier depends on its performance
on the validation dataset. Therefore, in order to get an effective and unbiased reliability
of each base classifier, the training dataset and validation dataset cannot have overlapped
data. Specifically, as shown in Figure 5.2, computing the objective weights in our framework
contains three steps:

• Classifier build: We prepare a set of base classifiers, where all the classifiers might
have different CNN architectures, use different size of the training data, or use different
subset or classes distributions of the training data. In this step, we trained 96 base
classifiers, more details are introduced in Section 5.4.2.

95
• Reliability validation: Let ri denote the reliability of a base classifier Mi , that is designed
to describe the average recognition performance of the classifier on the validation data.
Higher accuracy and less error on the validation data usually means higher reliability.
Hence, we use the confusion matrix result of each base classifier on the same validation
dataset as its reliability, where a confusion matrix [141] is a table that is often used
to describe the performance of a classifier on a set of validation data for which the
true values are known. It allows easy identification of confusion between classes, e.g.,
one class is commonly mislabeled as the other. Many performance measures could be
computed from the confusion matrix (e.g., F-scores). As such, we use ripq to denote
the probability of a base classifier Mi classifying an instance belonging to class p into
class q.

• Cost-sensitive adjustment: As described in Section 5.3.2, we would like to enable the

“cost-sensitive” feature in the design of our objective weights. As shown in Figure 5.3,
for each classifier Mi , we use an element-wise multiplication between its reliability ri
(confusion matrix) and the customized cost matrix (Section 5.3.2) to formulate a cost-
sensitive confusion matrix, where all the results/errors in the confusion matrix have
been adjusted based on the cost matrix. Then, we use the micro-average F1-score
[137] of the cost-sensitive confusion matrix to define the objective weight of each base
classifier, and all the object weights are automatically normalized to (0, 1].

5.3.4 Computing the Subjective Weights

The subjective weights are designed according to the relative confidence of the classifiers
while recognizing a specific previously “unseen” image (i.e., individuality), which are calcu-
lated by the posterior knowledge obtained through the testing phase. The individuality of
each base classifier is dynamically computed from its discriminant confidence towards each
previously “unseen” testing data, to capture the posterior knowledge that a base classifier

96
cannot obtain from the training and validation datasets. Specifically, as shown in Figure 5.5,
computing the subjective weights in our framework contains three steps:

• Sample evaluating/testing: Each testing data is evaluated/tested through all the clas-
sifiers to obtain the corresponding decision vectors (i.e., the soft labels).

• Individuality calculation: We consider the individuality of a classifier as its relative

class discriminative power regarding a given testing data. A classifier can easily iden-
tify the class of a given testing data in the testing phase, if its posterior probabilities of
the corresponding decision vector is highly concentrated in one class, and the misclas-
sification rate would also be low. On the contrary, if the distribution of the posterior
probabilities is close to uniform, the classifier shows its difficulty in discriminating the
class of the given testing data. Also, different classifiers would present different dis-
tribution of the posterior probabilities in the decision vectors while testing the same
testing data. Hence, we define the the individuality ik of a classifier Mk using the
posterior probabilities distribution as follows:

m
1 X ∗ l
m · pkj∗ − 1
ik = (p − pkj ) = (5.6)
m − 1 l=1 kj m−1

where pkj∗ is the largest posterior probability value in Pkj . Based on equation (5.6),
the individuality of each base classifier on a given testing data depends on its output
probability of the most probable class. For each testing data, the base classifier that
has the highest output probability of the most probable class achieves the highest
individuality.

• Normalization: Since the subjective weights are relative values among all the base
classifiers, we normalize each individuality ik to the subjective weight Sk ∈ [0, 1] as
follows:
ik − i min
Sk = (5.7)
i max − i min

97
where i min = min ij (i.e., the minimum individuality among all base classifiers), and
j∈1,2,...,k

i max = max ij (i.e., the maximum individuality among all base classifiers).
j∈1,2,...,k

5.4 Experimental Evaluation

We conducted our experiments on the ISIC Challenge 2019 dataset [31, 134, 30] and uti-
lized 12 CNN architectures to evaluate the performance of our proposed CS-AF framework.
Two examples of cost matrices, that emphasize on different skin lesion classes (i.e., cancer-
ous lesion classes vs. benign lesion classes), have been designed to evaluate the effectiveness
of the “cost-sensitive” feature of our proposed CS-AF framework. Furthermore, extensive
comparisons have been made among two static fusion approaches (i.e., Max Voting Fusion
and Average Fusion), two state-of-the-art active fusion approaches (i.e., DES-MI [51] and
MCE-DW [114]), AF (i.e., active fusion without the “cost-sensitive” feature) and our CS-AF
framework. The presented results show that our approach consistently outperforms both the
static and the active fusion approaches in terms of the overall accuracy and the total cost, is
more adaptive to the customized cost matrices than the other two active fusion competitors,
and consistently better than AF in terms of the total cost under different conditions.

5.4.1 Experiment Dataset

In our experiments, we utilized the well-known ISIC Challenge 2019 dataset [31, 134, 30].
Since the ground truth of the original testing data was not available, we only employed
the original training data without meta-data in our experimental evaluation. This dataset
(i.e., the original training data of the ISIC Challenge 2019) contains 25,331 images in total,
coming from 3 source datasets: BCN 20000 [31], HAM10000 [134] and MSK [30]. It depicts
8 skin lesion diseases (i.e., 8 classes): melanoma (MEL, 4,522 images), melanocytic nevus
(NV, 12,875 images), basal cell carcinoma (BCC, 3,323 images), actinic keratosis (AK, 867
images), benign keratosis (BKL, 2624 images), dermatofibroma (DF, 239 images), vascular

98
Table 5.1 The statistics of different split training datasets.

Skin Lesion Dist-1 Dist-2 Dist-3 Dist-4

MEL 3,662 (18.1%) 2,509 (12.4%) 5,052 (22.5%) 604 (2.7%)
SCC 502 (2.5%) 2,510 (12.4%) 4,331 (19.3%) 1,200 (5.4%)
BCC 2,670 (13.2%) 2,494 (12.4%) 3,781 (16.9%) 1,812 (8.1%)
NV 10,235 (50.5%) 2,512 (12.4%) 3,032 (13.5%) 2,529 (11.3%)
AK 705 (3.5%) 2,564 (12.5%) 2,463 (11.0%) 3,150 (14.1%)
DF 188 (1%) 2,444 (12.3%) 1,871 (8.3%) 3,702 (16.9%)
VASC 194 (1%) 2,522 (12.5%) 1,262 (5.6%) 4,334 (19.3%)
BKL 2,099 (10.4%) 2,612 (13.0%) 626 (2.8%) 5,056 (22.6%)

lesion (VASC, 253 images) and squamous cell carcinoma (SCC, 628 images). We split the
entire 25,331 images into training (80%), validation (5%) and testing (15%) datasets.
To evaluate the performance of our proposed CS-AF framework using the base classifiers
that are trained from the datasets with different classes distributions, we designed 4 training
datasets that have different classes distributions. For instance, one training dataset could
have balanced classes distribution, and the other training datasets could have unbalanced
classes distributions in different ways. The details (i.e., classes distributions) of each training
dataset are shown in Table 5.1 and described as below:

• Dist-1: This training dataset follows the classes distribution of the original training
dataset of the ISIC Challenge 2019 dataset.

• Dist-2: This training dataset contains evenly distributed number of samples for all
classes.

• Dist-3: This training dataset contains more samples for cancerous lesion classes, and
less samples for benign lesion classes.

99
• Dist-4: This training dataset contains less samples for cancerous lesion classes, and
more samples for benign lesion classes (i.e., the opposite order of classes distributions
as in Dist-3).

To generate different training datasets satisfying different classes distributions described

above, we utilized data augmentation techniques to generate more images for the skin lesion
classes lacking of images, and randomly sampled smaller portions from the classes with
superfluous images. The main data augmentation techniques utilized are rotation (for 45,
90, 135, 180, 225, 270 and 315 degrees, respectively), horizontal flip or the combination of
both. We also utilized the same strategy to generate the validation and testing datasets, to
ensure the numbers of samples of all classes are equal, where there are approximately 200
samples of each class in the validation dataset, and approximately 500 samples of each class
in the testing dataset.
In addition, to evaluate the performance of our proposed CS-AF framework using the
base classifiers that are trained from different subsets of the training dataset, for each of
those four training datasets that have different classes distributions, we randomly select 70%
of its data to produce another sub-dataset, namely, Sub-70. Therefore, in our experimental
evaluation, there are 8 different split training datasets in total (i.e., Dist-1, Dist-1 Sub-70,
Dist-2, Dist-2 Sub-70, Dist-3, Dist-3 Sub-70, Dist-4 and Dist-4 Sub-70).

5.4.2 Base Classifiers Preparation

We chose 12 different CNN architectures to evaluate the fusion approaches performance.

These 12 CNN architectures were popular and have been shown to have good transfer learning
performance on the skin lesion analysis [105]. All the 12 CNN architectures were trained on
the 8 different split training datasets as we mentioned in the previous section. Therefore,
we obtained a pool of 12 × 8 = 96 base classifiers, the corresponding accuracy of each base
classifier is shown in Table 5.2. Notably, different CNN architecture requires different size of
input images as below.

100
Table 5.2 The performance of base classifiers of the 12 CNN architectures.
CNN Architectures Dist-1 Dist-1 Sub-70 Dist-2 Dist-2 Sub-70 Dist-3 Dist-3 Sub-70 Dist-4 Dist-4 Sub-70
PNASNet-5-Large [87] 78.48 76.53 81.14 80.73 77.01 75.21 78.76 75.34
NASNet-A-Large [174] 78.35 76.71 79.80 78.31 76.00 75.36 76.12 74.32
ResNeXt101-32×16d [154] 79.47 76.96 83.09 80.08 80.18 77.14 79.47 77.47
SENet154 [63] 80.31 77.72 81.19 76.33 79.04 74.04 78.76 76.43
Dual Path Net-107 [24] 76.61 74.51 79.10 77.92 76.07 70.88 77.24 74.80
Xception [26] 74.63 74.07 78.53 75.19 76.46 72.93 75.82 74.30
Inception-V4 [125] 76.76 74.22 80.11 77.45 77.09 75.37 75.89 74.10
InceptionResNet-V2 [125] 77.58 76.64 70.81 77.39 77.77 76.12 76.48 74.01
SE-ResneXt101-32×4d [63] 77.45 77.21 79.87 78.41 75.38 74.68 75.60 74.33
ResNet152 [59] 75.69 73.23 79.27 75.01 76.00 74.96 75.77 74.45
Inception-V3 [126] 75.16 73.82 79.52 78.83 73.69 72.41 75.62 72.07
EfficientNet-b7 [128] 67.31 63.07 74.10 71.28 71.81 68.75 71.48 67.37

• 331×331: PNASNet-5-Large, NASNet-A-Large.

• 320×320: ResNeXt101-32×16d.

• 299×299: Xception, Inception-V4, Inception-V3, InceptionResNet-V2.

• 224×224: SENet154, Dual Path Net-107, SE-ResneXt101-32×4d, EfficientNet-b7, Res-

Net152.

It is also worth noting that in our experiments, we treat CNNs with different size of input
images equally during the weight assignment, since our objective weight assignment depends
on the overall performance of each base classifier evaluated on the given validation dataset,
which already considered the specification differences of different CNNs.
All the base classifiers were fine-tuned by stochastic gradient descent (SGD) with learning
rate 10−3 and momentum 0.9. The learning rate degraded in 20 epochs by 0.1. We stopped
the training process either after 40 epochs or while the validation accuracy was failed to
improve for 7 consecutive epochs. Our experiments were implemented using Pytorch, running
on a server with 4 GTX 1080Ti 11 GB GPUs. To keep the same batch size 32 in each
evaluation, and due to the memory constraint of single GPU, certain CNN architectures
were trained with more GPUs:

• 2 GPUs: SENet154, EfficientNet-b7, Dual Path Net-107.

101
92 94

Total Cost (Evaluate with Cost Matrix A)

Total Cost (Evaluate with Cost Matrix B)

Max Voting Max Voting
Average Average
0.85 MCE-DW 90 MCE-DW
88 DES-MI DES-MI
0.84 AF 86 AF
Accuracy

Max Voting CS-AF (Cost Matrix A) CS-AF (Cost Matrix A)

84 CS-AF (Cost Matrix B) CS-AF (Cost Matrix B)
Average
0.83 MCE-DW 82
DES-MI 80
0.82 AF 78
CS-AF (Cost Matrix A)
CS-AF (Cost Matrix B)
76 74
8 16 24 32 40 48 56 64 72 80 88 96 8 16 24 32 40 48 56 64 72 80 88 96 8 16 24 32 40 48 56 64 72 80 88 96
Number of Models Number of Models Number of Models
(a) (b) (c)

Figure 5.6 Evaluate the effectiveness of CS-AF.

• 4 GPUs: PNASNet-5-Large, ResNeXt101-32×16d, NASNet-A-Large.

5.4.3 Experimental Procedure

As described in Section 5.4.2, we have prepared 96 base classifiers. To evaluate the

effectiveness of our active fusion approach extensively, each time we perform the fusion
approaches on a randomly selected subset (i.e., N classifiers) of those 96 base classifiers, where
N = 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96. For each N, we repeat the random selection
experiments for 100 times, and use the averaged performance as the final results.

5.4.4 Evaluate the Effectiveness of CS-AF

To demonstrate the effectiveness of our approach, a comparison between two CS-AF

implementations (using two different cost matrices to compute the objective weights) and
five other fusion approaches has been conducted. The competitors are:

• Max Voting Fusion is a static approach, where the predictions are combined from
multiple base classifiers and only the predicted class with the highest votes will be
included in the final prediction.

102
• Average Fusion is another static approach, where it averages the decision vectors of
multiple base classifiers and uses the averaged decision vector to make the final pre-
diction.

• DES-MI [51] filters the base classifiers by assigning weight to each of them, where
the weight is based on the performance of k-nearest neighbors in the validation set
regarding the current test sample.

• MCE-DW [114] determines the classifier reliability by fuzzy set theory, and combines
decision credibility of each test sample to make the final decision.

• AF is a baseline active fusion approach by removing the cost-sensitive adjustment step

from CS-AF while calculating the objective weights.

• CS-AF (Cost Matrix A) is our approach while computing its objective weights using
Cost Matrix A (Section 5.3.2.1).

• CS-AF (Cost Matrix B) is our approach while computing its objective weights using
Cost Matrix B (Section 5.3.2.1).

Given a competitor fusion approach, we evaluate its effectiveness in terms of (i) its
averaged accuracy on our testing dataset (as shown in Figure 5.6(a)), (ii) its total cost on
our testing dataset specified by Cost Matrix A (as shown in Figure 5.6(b)), and Cost Matrix
B (as shown in Figure 5.6(c)). The total costs are calculated by the sum of the item-wise
product between the confusion matrix resulted from our testing dataset and the specified cost
matrix. Better fusion approach usually leads to higher accuracy on the testing dataset and
lower total cost specified by certain cost matrix. From the results illustrated in Figure 5.6,
we obtain the observations below:

• Compared with the best performed base classifier, ResNeXt101-32×16d, as shown in

Table 5.2, our two implementations of CS-AF and AF consistently achieve over 2%-5%
higher accuracy on the same testing dataset.

103
Max Voting Average MCE-DW DES-MI AF CS-AF (Cost Matrix A) CS-AF (Cost Matrix B)
0.82 Melanoma Squamous cell carcinoma Basal cell carcinoma Melanocytic nevus
0.751 0.903

0.968 0.898
0.80 0.736 0.893
Sensitivity

0.888
0.963
0.78 0.721 0.883
0.878
0.76 8 16 24 32 40 48 56 64 72 80 88 96 0.706 8 16 24 32 40 48 56 64 72 80 88 96 0.958 8 16 24 32 40 48 56 64 72 80 88 96 0.873 8 16 24 32 40 48 56 64 72 80 88 96

0.75 Actinic keratosis Dermatofibroma Vascular lesion Benign keratosis

0.909
0.83
0.972
0.894 0.82
0.74
Sensitivity

0.964 0.81
0.879
0.73 0.80
0.864 0.956
0.79

0.72 8 16 24 32 40 48 56 64 72 80 88 96 0.849 8 16 24 32 40 48 56 64 72 80 88 96 0.948 8 16 24 32 40 48 56 64 72 80 88 96 0.78 8 16 24 32 40 48 56 64 72 80 88 96

Number of Models Number of Models Number of Models Number of Models

Figure 5.7 The sensitivity results of each single class of CS-AF with Cost Matrix A and
Cost Matrix B.
Max Voting Average MCE-DW DES-MI AF CS-AF (Cost Matrix A) CS-AF (Cost Matrix B)
9.78 1e 1 Melanoma 1e 1Squamous cell carcinoma 1e 1 Basal cell carcinoma 9.62 1e 1 Melanocytic nevus
9.50
9.915 9.48 9.60
9.74
9.58
Specificity

9.46
9.905
9.70
9.44 9.56
9.66 9.895
9.42 9.54

9.62 8 16 24 32 40 48 56 64 72 80 88 96 9.885 8 16 24 32 40 48 56 64 72 80 88 96 9.40 8 16 24 32 40 48 56 64 72 80 88 96 9.52 8 16 24 32 40 48 56 64 72 80 88 96

9.87 1e 1 Actinic keratosis 1e 1 Dermatofibroma 1e 1 Vascular lesion 9.80 1e 1 Benign keratosis

9.999 9.994
9.85
9.78
9.83 9.994
Specificity

9.992 9.76
9.81
9.989
9.74
9.79

9.77 8 16 24 32 40 48 56 64 72 80 88 96 9.984 8 16 24 32 40 48 56 64 72 80 88 96 9.990 8 16 24 32 40 48 56 64 72 80 88 96 9.72 8 16 24 32 40 48 56 64 72 80 88 96

Number of Models Number of Models Number of Models Number of Models

Figure 5.8 The specificity results of each single class of CS-AF with Cost Matrix A and
Cost Matrix B.

104
• For all the fusion approaches, as more base classifiers involved, the accuracy tends to
increase and the total cost tends to decrease.

• As illustrated in Figure 5.6, in terms of the accuracy, our two implementations of CS-
AF and AF consistently outperform the static fusion approaches (i.e., Max Voting and
Average). Compared with the active fusion competitors (i.e., MCE-DW and DES-MI),
our CS-AF (Cost Matrix B) consistently obtains the highest accuracy.

• As illustrated in Figure 5.6 and Figure 5.6, in terms of the total cost, CS-AF consis-
tently outperforms the other fusion competitors (i.e., Max Voting, Average MCE-DW,
DES-MI and AF). For instance, while calculating the total cost using Cost Matrix A,
CS-AF (Cost Matrix A) always achieves the lowest total cost, and while calculating
the total cost using Cost Matrix B, CS-AF (Cost Matrix B) always obtains the lowest
total cost. Thus, it demonstrates that our proposed cost-sensitive active fusion ap-
proach could adapt to different customized cost matrices and is optimized to achieve
the lowest total cost.

• DES-MI is more sensitive to the number of base classifiers. For instance, it always has
the worst performance with few classifiers involved, and finally has close performance
with our CS-AF as shown in Figure 5.6. This is because in the implementation of
DES-MI, it only remains the most confident (i.e., top 40%) base classifiers among all
given base classifiers. So, there are only very few base classifiers to be considered to
make the final decision while the number of given base classifiers is low.

5.4.5 Analyze the “Cost-sensitive” of CS-AF

As discussed above, our proposed CS-AF could adapt to different cost matrices and
is optimized to achieve the lowest total cost under a specified cost matrix, namely “cost-
sensitive”. In this section, we would like to analyze how such “cost-sensitive”, while using
certain customized cost matrices, influences the performance of CS-AF on certain single skin

105
lesion classes, thus reducing the total cost. We evaluate single class performances using
sensitivity and specificity, defined as below:

TP
sensitivity = (5.8)
TP + FN

where TP denotes the number of true positives and FN denotes the number of false negatives.

TN
specificity = (5.9)
TN + FP

where TN denotes the number of true negatives and FP denotes the number of false positives.
Figure 5.7 and Figure 5.8 illustrate the sensitivity and specificity results of each single
class of CS-AF (Cost Matrix A), CS-AF (Cost Matrix B) and all the other competitors,
respectively. We can observe that:

• Compared with CS-AF (Cost Matrix B), CS-AF (Cost Matrix A) tends to achieve
higher sensitivity on more severe cancerous skin lesion classes (i.e., melanoma, squa-
mous cell carcinoma and basal cell carcinoma), and lower sensitivity on less severe
benign skin lesion classes (i.e., benign keratosis, vascular lesion, dermatofibroma, ac-
tinic keratosis and melanocytic nevus).

• Compared with CS-AF (Cost Matrix A), CS-AF (Cost Matrix B) tends to achieve
higher specificity on those more severe cancerous skin lesion classes, and lower speci-
ficity on those less severe benign skin lesion classes.

• Compared with the other competitors, our approaches, CS-AF (Cost Matrix A) obtains
the highest sensitivity on the most severe cancerous skin lesion (melanoma), and CS-
AF (Cost Matrix B) consistently obtains the highest sensitivity on the most benign
skin lesion class (benign keratosis).

106
• Compared with the other competitors, our approaches, CS-AF (Cost Matrix B) con-
sistently achieves the highest specificity on melanoma, and CS-AF (Cost Matrix A)
consistently achieves the highest specificity on benign keratosis.

As described in Section 5.3.2.1, Cost Matrix A emphasizes on the cancerous skin lesions
(i.e., the cost of misclassifying a cancerous skin lesion is much more than a benign skin
lesion), while Cost Matrix B emphasizes on the benign lesions (i.e., the cost of misclassifying
a benign skin lesion is much more than a cancerous skin lesion). While using Cost Matrix A to
compute the objective weights of our CS-AF implementation (i.e., CS-AF (Cost Matrix A)),
it tends to increase the TP and FP of cancerous skin lesion classes and decrease the FN and
TN of benign skin lesion classes, thus resulting in higher sensitivity and lower specificity for
cancerous skin lesion classes. CS-AF (Cost Matrix B) also works in such way accordingly.
Therefore, the Figure 5.7 and Figure 5.8 demonstrate that our proposed CS-AF is “cost-
sensitive”, where its performance on certain single skin lesion classes could be adapted to
certain customized cost matrices.

5.5 Conclusion

In this paper, we propose CS-AF, a cost-sensitive multi-classifier active fusion framework

for skin lesion classification, where we define two types of weights: the objective weights that
are designed according to the classifiers’ reliability to recognize the particular skin lesions, and
the subjective weights that are designed according to the relative confidence of the classifiers
while recognizing a specific previously “unseen” image (i.e., individuality). We also enable
the “cost-sensitive” feature in our framework, via incorporating a customizable cost matrix in
the design of the objective weights. In the experimental evaluation, we trained 96 classifiers
of 12 CNN architectures as the base classifiers, and compared our CS-AF framework with
two static fusion approaches (i.e., Max Voting Fusion and Average Fusion), two active fusion
competitors (i.e., DES-MI [51] and MCE-DW [114]), and a baseline active fusion approach,
AF. Our experimental results show that our CS-AF framework consistently outperforms the

107
static and active fusion competitors in terms of accuracy, and always achieves the lowest
total cost. We also demonstrated our “cost-sensitive” feature by using two examples of
cost matrices. In the future work, we plan to (i) investigate and incorporate other metrics
(i.e., other than F1-score) in the design of the objective weights; (ii) design a learning-based
approach to determine the subjective weights; and (iii) employ and evaluate our CS-AF
framework in other medicine-related applications.

108
Chapter 6: General Conclusion

In conclusion, the work presented across the chapters of this report underscores the imper-
ative need to enhance models’ generalization capabilities to address domain shifts and class
imbalances, which are prevalent challenges in machine learning and deep learning applica-
tions. Our innovative approaches, DADG, Epi-Curriculum, SuperCon, and CS-AF demon-
strate robust methodologies that not only address the fundamentals of deep learning but
also significantly improve performance across varied and challenging datasets, showing the
effectiveness in practical scenarios.
The DADG approach leverages discriminative adversarial training to learn domain-invar-
iant features, while Epi-Curriculum employs episodic training in low-resource domain adapta-
tion scenarios to enhance the model’s robustness against domain shifts. Both methodologies
validate their effectiveness through superior performance on benchmark datasets, indicating
their potential in real-world applications. SuperCon, addressing the critical issue of class
imbalance, utilizes supervised contrastive learning to ensure that features within classes are
closely aligned while maintaining clear distinctions between classes, resulting in enhanced
model accuracy and robustness. Lastly, our CS-AF framework introduces a cost-sensitive
multi-classifier fusion strategy that optimizes model contributions dynamically, showcasing
superior performance in skin lesion classification.
Together, these methods form a comprehensive suite of solutions that address funda-
mental challenges in machine learning, from domain generalization and adaptation to class
imbalance and robustness against adversarial scenarios. By focusing on these critical di-
mensions of generalization, our work not only contributes to theoretical advancements but

109
also enhances the practical deployment of machine learning models, ensuring they perform
reliably and effectively in diverse real-world settings.

110
References

[1] Types of skin cancer. https://www.aad.org/public/diseases/skin-cancer/types/

common, 2022. [Online; Accessed 02-04-2022].

[2] Mohammed A Al-Masni, Mugahed A Al-Antari, Mun-Taek Choi, Seung-Moo Han,

and Tae-Seong Kim. Skin lesion segmentation in dermoscopy images via deep full
resolution convolutional networks. Computer methods and programs in biomedicine,
162:221–231, 2018.

[3] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau,
Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient
descent by gradient descent. In Advances in neural information processing systems,
pages 3981–3989, 2016.

[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation
by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[5] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. Metareg: Towards
domain generalization using meta-regularization. In Advances in Neural Information
Processing Systems, pages 998–1008, 2018.

[6] Ankur Bapna and Orhan Firat. Simple, scalable adaptation for neural machine transla-
tion. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 1538–1548, Hong Kong, China, November 2019. Association
for Computational Linguistics.

111
[7] Catarina Barata, Jorge S Marques, and M Emre Celebi. Deep attention model for the
hierarchical diagnosis of skin lesions. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.

[8] Loı̈c Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel,
Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi,
Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. Find-
ings of the 2019 conference on machine translation (WMT19). In Ondřej Bojar, Ra-
jen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow,
Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Christof Monz,
Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Marco Turchi, and Karin
Verspoor, editors, Proceedings of the Fourth Conference on Machine Translation (Vol-
ume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy, August 2019. Asso-
ciation for Computational Linguistics.

[9] Rachel Bawden, Kevin Bretonnel Cohen, Cristian Grozea, Antonio Jimeno Yepes,
Madeleine Kittner, Martin Krallinger, Nancy Mah, Aurelie Neveol, Mariana Neves,
Felipe Soares, Amy Siu, Karin Verspoor, and Maika Vicente Navarro. Findings of the
WMT 2019 biomedical translation shared task: Evaluation for MEDLINE abstracts
and biomedical terminologies. In Proceedings of the Fourth Conference on Machine
Translation (Volume 3: Shared Task Papers, Day 2), pages 29–53, Florence, Italy,
August 2019. Association for Computational Linguistics.

[10] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curricu-
lum learning. In Proceedings of the 26th annual international conference on machine
learning, pages 41–48, 2009.

[11] Lei Bi, Jinman Kim, Euijoon Ahn, and Dagan Feng. Automatic skin lesion analy-
sis using large-scale dermoscopy images and deep residual networks. arXiv preprint
arXiv:1703.04197, 2017.

112
[12] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.

[13] Chumphol Bunkhumpornpat, Krung Sinapiromsaran, and Chidchanok Lursinsap.

Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling
the class imbalanced problem. In Pacific-Asia conference on knowledge discovery and
data mining, pages 475–482. Springer, 2009.

[14] Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep
learning? In International Conference on Machine Learning, pages 872–881. PMLR,
2019.

[15] Fabio M Carlucci, Antonio D’Innocente, Silvia Bucci, Barbara Caputo, and Tatiana
Tommasi. Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 2229–2238, 2019.

[16] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines.
ACM transactions on intelligent systems and technology (TIST), 2(3):1–27, 2011.

[17] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer.
Smote: synthetic minority over-sampling technique. Journal of artificial intelligence
research, 16:321–357, 2002.

[18] Boxing Chen, Colin Cherry, George Foster, and Samuel Larkin. Cost weighting for
neural machine translation domain adaptation. In Proceedings of the First Workshop
on Neural Machine Translation, pages 40–46, Vancouver, August 2017. Association for
Computational Linguistics.

[19] Keyu Chen, Di Zhuang, and J Morris Chang. Discriminative adversarial domain gener-
alization with meta-learning based cross-domain validation. Neurocomputing, 467:418–
426, 2022.

113
[20] Keyu Chen, Di Zhuang, and J Morris Chang. Supercon: Supervised contrastive learn-
ing for imbalanced skin lesion classification. arXiv preprint arXiv:2202.05685, 2022.

[21] Keyu Chen, Di Zhuang, Mingchen Li, and J Morris Chang. Epi-curriculum: Episodic
curriculum learning for low-resource domain adaptation in neural machine translation.
IEEE Transactions on Artificial Intelligence, 2024.

[22] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to
scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 3640–3649, 2016.

[23] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple
framework for contrastive learning of visual representations. In International confer-
ence on machine learning, pages 1597–1607. PMLR, 2020.

[24] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng.
Dual path networks. In Advances in neural information processing systems, pages
4467–4475, 2017.

[25] Myung Jin Choi, Joseph J Lim, Antonio Torralba, and Alan S Willsky. Exploiting hier-
archical context on a large database of object categories. In 2010 IEEE Computer So-
ciety Conference on Computer Vision and Pattern Recognition, pages 129–136. IEEE,
2010.

[26] François Chollet. Xception: Deep learning with depthwise separable convolutions. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages
1251–1258, 2017.

[27] Chenhui Chu and Rui Wang. A survey of domain adaptation for neural machine
translation. In Proceedings of the 27th International Conference on Computational
Linguistics, pages 1304–1319, Santa Fe, New Mexico, USA, August 2018. Association
for Computational Linguistics.

114
[28] Peng Chu, Xiao Bian, Shaopeng Liu, and Haibin Ling. Feature space augmentation for
long-tailed data. In Computer Vision–ECCV 2020: 16th European Conference, Glas-
gow, UK, August 23–28, 2020, Proceedings, Part XXIX 16, pages 694–710. Springer,
2020.

[29] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza,
David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti,
et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by
the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368,
2019.

[30] Noel CF Codella, David Gutman, M Emre Celebi, Brian Helba, Michael A Marchetti,
Stephen W Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler,
et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 in-
ternational symposium on biomedical imaging (isbi), hosted by the international skin
imaging collaboration (isic). In 2018 IEEE 15th International Symposium on Biomed-
ical Imaging (ISBI 2018), pages 168–172. IEEE, 2018.

[31] Marc Combalia, Noel CF Codella, Veronica Rotemberg, Brian Helba, Veronica Vila-
plana, Ofer Reiter, Allan C Halpern, Susana Puig, and Josep Malvehy. Bcn20000:
Dermoscopic lesions in the wild. arXiv preprint arXiv:1908.02288, 2019.

[32] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube rec-
ommendations. In Proceedings of the 10th ACM conference on recommender systems,
pages 191–198, 2016.

[33] Rafael MO Cruz, Robert Sabourin, George DC Cavalcanti, and Tsang Ing Ren. Meta-
des: A dynamic ensemble selection framework using meta-learning. Pattern recognition,
48(5):1925–1935, 2015.

115
[34] Praveen Dakwale and Christof Monz. Fine-tuning for neural machine translation with
limited degradation across in- and out-of-domain data. In Sadao Kurohashi and Pascale
Fung, editors, Proceedings of Machine Translation Summit XVI: Research Track, pages
156–169, Nagoya Japan, September 18 – September 22 2017.

[35] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A
large-scale hierarchical image database. In 2009 IEEE conference on computer vision
and pattern recognition, pages 248–255. Ieee, 2009.

[36] Nam Nguyen Di Zhuang, Keyu Chen, and J Morris Chang. Saia: Split artificial intel-
ligence architecture for mobile healthcare systems. arXiv preprint arXiv:2004.12059,
2020.

[37] Changxing Ding and Dacheng Tao. Trunk-branch ensemble convolutional neural net-
works for video-based face recognition. IEEE transactions on pattern analysis and
machine intelligence, 40(4):1002–1014, 2017.

[38] Zi-Yi Dou, Junjie Hu, Antonios Anastasopoulos, and Graham Neubig. Unsupervised
domain adaptation for neural machine translation with domain-aware feature embed-
dings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Lan-
guage Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 1417–1422, Hong Kong, China, November 2019.
Association for Computational Linguistics.

[39] Rahul Duggal, Scott Freitas, Sunny Dhamnani, Duen Horng Chau, and Jimeng
Sun. Elf: An early-exiting framework for long-tailed classification. arXiv preprint
arXiv:2006.11979, 2020.

[40] Antonio D’Innocente and Barbara Caputo. Domain generalization with domain-specific
aggregation modules. In German Conference on Pattern Recognition, pages 187–198.
Springer, 2018.

116
[41] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew
Zisserman. The pascal visual object classes (voc) challenge. International journal of
computer vision, 88(2):303–338, 2010.

[42] Xi Fang and Pingkun Yan. Multi-organ segmentation over partially labeled datasets
with multi-scale feature abstraction. IEEE Transactions on Medical Imaging,
39(11):3619–3629, 2020.

[43] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few
training examples: An incremental bayesian approach tested on 101 object categories.
In 2004 conference on computer vision and pattern recognition workshop, pages 178–
178. IEEE, 2004.

[44] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast
adaptation of deep networks. In International conference on machine learning, pages
1126–1135. PMLR, 2017.

[45] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast
adaptation of deep networks. In Proceedings of the 34th International Conference on
Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.

[46] George Foster, Cyril Goutte, and Roland Kuhn. Discriminative instance weighting
for domain adaptation in statistical machine translation. In Proceedings of the 2010
Conference on Empirical Methods in Natural Language Processing, pages 451–459,
Cambridge, MA, October 2010. Association for Computational Linguistics.

[47] Alexander Fraser. Findings of the WMT 2020 shared tasks in unsupervised MT and
very low resource supervised MT. In Proceedings of the Fifth Conference on Machine
Translation, pages 765–771, Online, November 2020. Association for Computational
Linguistics.

117
[48] Yoav Freund and Robert E Schapire. A desicion-theoretic generalization of on-line
learning and an application to boosting. In European conference on computational
learning theory, pages 23–37. Springer, 1995.

[49] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backprop-
agation. In International conference on machine learning, pages 1180–1189. PMLR,
2015.

[50] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle,
François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial
training of neural networks. In Domain Adaptation in Computer Vision Applications,
pages 189–209. Springer, 2017.

[51] Salvador Garcı́a, Zhong-Liang Zhang, Abdulrahman Altalhi, Saleh Alshomrani, and
Francisco Herrera. Dynamic ensemble selection for multi-class imbalanced datasets.
Information Sciences, 445:22–37, 2018.

[52] Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain
generalization for object recognition with multi-task autoencoders. In Proceedings of
the IEEE international conference on computer vision, pages 2551–2559, 2015.

[53] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
Advances in neural information processing systems, pages 2672–2680, 2014.

[54] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harness-
ing adversarial examples. In International Conference on Learning Representations
(ICLR), 2015.

118
[55] David Grangier and Dan Iter. The trade-offs of domain adaptation for neural language
models. In Proceedings of the 60th Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers), pages 3802–3813, Dublin, Ireland, May
2022. Association for Computational Linguistics.

[56] Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. Meta-
learning for low-resource neural machine translation. In Ellen Riloff, David Chiang,
Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing, pages 3622–3631, Brussels,
Belgium, October-November 2018. Association for Computational Linguistics.

[57] David Gutman, Noel CF Codella, Emre Celebi, Brian Helba, Michael Marchetti, Nabin
Mishra, and Allan Halpern. Skin lesion analysis toward melanoma detection: A chal-
lenge at the international symposium on biomedical imaging (isbi) 2016, hosted by the
international skin imaging collaboration (isic). arXiv preprint arXiv:1605.01397, 2016.

[58] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. Borderline-smote: a new over-
sampling method in imbalanced data sets learning. In International conference on
intelligent computing, pages 878–887. Springer, 2005.

[59] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 770–778, 2016.

[60] Lingxiao He, Xingyu Liao, Wu Liu, Xinchen Liu, Peng Cheng, and Tao Mei. Fastreid:
A pytorch toolbox for general instance re-identification. In Proceedings of the 31st
ACM International Conference on Multimedia, pages 9664–9667, 2023.

[61] Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In
International Conference on Machine Learning, pages 4182–4192. PMLR, 2020.

119
[62] Paulina Hensman and David Masko. The impact of imbalanced training data for con-
volutional neural networks. Degree Project in Computer Science, KTH Royal Institute
of Technology, 2015.

[63] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.

[64] Arya Iranmehr, Hamed Masnadi-Shirazi, and Nuno Vasconcelos. Cost-sensitive sup-
port vector machines. Neurocomputing, 343:50–64, 2019.

[65] Joanna Jaworek-Korjakowska, Pawel Kleczek, and Marek Gorgon. Melanoma thickness
prediction based on convolutional neural network with vgg-19 model transfer learning.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition Workshops, pages 0–0, 2019.

[66] Shafiq Joty, Hassan Sajjad, Nadir Durrani, Kamla Al-Mannai, Ahmed Abdelali, and
Stephan Vogel. How to avoid unwanted pregnancies: Domain adaptation using neu-
ral network models. In Proceedings of the 2015 Conference on Empirical Methods in
Natural Language Processing, pages 1259–1270, Lisbon, Portugal, September 2015.
Association for Computational Linguistics.

[67] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and
Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap
and sharp minima. In International Conference on Learning Representations, 2017.

[68] Salman H Khan, Munawar Hayat, Mohammed Bennamoun, Ferdous A Sohel, and
Roberto Togneri. Cost-sensitive learning of deep feature representations from imbal-
anced data. IEEE transactions on neural networks and learning systems, 29(8):3573–
3587, 2017.

120
[69] Huda Khayrallah and Philipp Koehn. On the impact of various types of noise on
neural machine translation. In Alexandra Birch, Andrew Finch, Thang Luong, Graham
Neubig, and Yusuke Oda, editors, Proceedings of the 2nd Workshop on Neural Machine
Translation and Generation, pages 74–83, Melbourne, Australia, July 2018. Association
for Computational Linguistics.

[70] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip
Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning.
arXiv preprint arXiv:2004.11362, 2020.

[71] Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation.
In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39,
Vancouver, August 2017. Association for Computational Linguistics.

[72] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. In Advances in neural information processing
systems, pages 1097–1105, 2012.

[73] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. In Advances in neural information processing
systems, pages 1097–1105, 2012.

[74] Wen Lai, Jindřich Libovický, and Alexander Fraser. Improving both domain robust-
ness and domain adaptability in machine translation. In Nicoletta Calzolari, Chu-Ren
Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu,
Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen
Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus,
Francis Bond, and Seung-Hoon Na, editors, Proceedings of the 29th International Con-
ference on Computational Linguistics, pages 5191–5204, Gyeongju, Republic of Korea,
October 2022. International Committee on Computational Linguistics.

121
[75] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.

[76] Hansang Lee, Minseok Park, and Junmo Kim. Plankton classification on imbalanced
large scale database via convolutional neural networks with transfer learning. In 2016
IEEE international conference on image processing (ICIP), pages 3713–3717. IEEE,
2016.

[77] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mo-
hamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising
sequence-to-sequence pre-training for natural language generation, translation, and
comprehension. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 7871–7880, Online, July 2020. Association for Com-
putational Linguistics.

[78] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and
artier domain generalization. In Proceedings of the IEEE International Conference on
Computer Vision, pages 5542–5550, 2017.

[79] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Learning to general-
ize: Meta-learning for domain generalization. In Thirty-Second AAAI Conference on
Artificial Intelligence, 2018.

[80] Da Li, Jianshu Zhang, Yongxin Yang, Cong Liu, Yi-Zhe Song, and Timothy M
Hospedales. Episodic training for domain generalization. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 1446–1455, 2019.

[81] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization
with adversarial feature learning. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 5400–5409, 2018.

[82] Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint
arXiv:1703.00441, 2017.

122
[83] Rumeng Li, Xun Wang, and Hong Yu. Metamt, a meta learning method leveraging
multiple domain data for low resource machine translation. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, pages 8245–8252, 2020.

[84] Xiaomeng Li, Lequan Yu, Hao Chen, Chi-Wing Fu, Lei Xing, and Pheng-Ann Heng.
Transformation-consistent self-ensembling model for semisupervised medical image seg-
mentation. IEEE Transactions on Neural Networks and Learning Systems, 32(2):523–
534, 2020.

[85] Yiying Li, Yongxin Yang, Wei Zhou, and Timothy Hospedales. Feature-critic networks
for heterogeneous domain generalisation. In The Thirty-sixth International Conference
on Machine Learning, 2019.

[86] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss
for dense object detection. In Proceedings of the IEEE international conference on
computer vision, pages 2980–2988, 2017.

[87] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li,
Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural ar-
chitecture search. In Proceedings of the European Conference on Computer Vision
(ECCV), pages 19–34, 2018.

[88] Minh-Thang Luong and Christopher Manning. Stanford neural machine translation
systems for spoken language domains. In Marcello Federico, Sebastian Stüker, and Jan
Niehues, editors, Proceedings of the 12th International Workshop on Spoken Language
Translation: Evaluation Campaign, pages 76–79, Da Nang, Vietnam, December 3-4
2015.

[89] Tomasz Maciejewski and Jerzy Stefanowski. Local neighbourhood extension of smote
for mining imbalanced data. In 2011 IEEE Symposium on Computational Intelligence
and Data Mining (CIDM), pages 104–111. IEEE, 2011.

123
[90] Long Mai and Dong Kun Noh. Cluster ensemble with link-based approach for botnet
detection. Journal of Network and Systems Management, 26(3):616–639, 2018.

[91] Michael A Marchetti, Noel CF Codella, Stephen W Dusza, David A Gutman, Brian
Helba, Aadi Kalloo, Nabin Mishra, Cristina Carrera, M Emre Celebi, Jennifer L De-
Fazio, et al. Results of the 2016 international skin imaging collaboration international
symposium on biomedical imaging challenge: Comparison of the accuracy of computer
algorithms to dermatologists for the diagnosis of melanoma from dermoscopic images.
Journal of the American Academy of Dermatology, 78(2):270–277, 2018.

[92] Sourav Mishra, Hideaki Imaizumi, and Toshihiko Yamasaki. Interpreting fine-grained
dermatological classification by deep learning. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.

[93] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant
representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 6707–6717, 2020.

[94] R Mollineda, R Alejo, and J Sotoca. The class imbalance problem in pattern classifica-
tion and learning. In II Congreso Espanol de Informática (CEDI 2007). ISBN, pages
978–84, 2007.

[95] Robert C. Moore and William Lewis. Intelligent selection of language model train-
ing data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224,
Uppsala, Sweden, July 2010. Association for Computational Linguistics.

[96] Saeid Motiian, Marco Piccirilli, Donald A. Adjeroh, and Gianfranco Doretto. Unified
deep supervised domain adaptation and generalization. In The IEEE International
Conference on Computer Vision (ICCV), Oct 2017.

124
[97] Mathias Müller, Annette Rios, and Rico Sennrich. Domain robustness in neural ma-
chine translation. In Michael Denkowski and Christian Federmann, editors, Proceed-
ings of the 14th Conference of the Association for Machine Translation in the Americas
(Volume 1: Research Track), pages 151–164, Virtual, October 2020. Association for
Machine Translation in the Americas.

[98] Hung Nguyen, Di Zhuang, Pei-Yuan Wu, and Morris Chang. Autogan-based dimension
reduction for privacy preservation. Neurocomputing, 2019.

[99] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algo-
rithms. arXiv preprint arXiv:1803.02999, 2018.

[100] Alex Nichol and John Schulman. Reptile: a scalable metalearning algorithm. arXiv
preprint arXiv:1803.02999, 2, 2018.

[101] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations
by solving jigsaw puzzles. In European conference on computer vision, pages 69–84.
Springer, 2016.

[102] Sankaran Panchapagesan, Ming Sun, Aparna Khare, Spyros Matsoukas, Arindam Man-
dal, Björn Hoffmeister, and Shiv Vitaladevuni. Multi-task learning and weighted cross-
entropy for dnn-based keyword spotting. In Interspeech, volume 9, pages 760–764,
2016.

[103] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for
automatic evaluation of machine translation. In Proceedings of the 40th annual meeting
of the Association for Computational Linguistics, pages 311–318, 2002.

125
[104] Cheonbok Park, Yunwon Tae, TaeHee Kim, Soyoung Yang, Mohammad Azam Khan,
Lucy Park, and Jaegul Choo. Unsupervised neural machine translation for low-resource
domains via meta-learning. In Proceedings of the 59th Annual Meeting of the Asso-
ciation for Computational Linguistics and the 11th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), pages 2888–2901, Online, Au-
gust 2021. Association for Computational Linguistics.

[105] Fábio Perez, Sandra Avila, and Eduardo Valle. Solo or ensemble? choosing a cnn
architecture for melanoma classification. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.

[106] MinhQuang Pham, Josep Crego, François Yvon, and Jean Senellart. Generic and
specialized word embeddings for multi-domain machine translation. In Jan Niehues,
Rolando Cattoni, Sebastian Stüker, Matteo Negri, Marco Turchi, Thanh-Le Ha, Eliz-
abeth Salesky, Ramon Sanabria, Loic Barrault, Lucia Specia, and Marcello Federico,
editors, Proceedings of the 16th International Conference on Spoken Language Trans-
lation, Hong Kong, November 2-3 2019. Association for Computational Linguistics.

[107] Trong Huy Phan and Kazuma Yamamoto. Resolving class imbalance in object detec-
tion with weighted cross entropy losses. arXiv preprint arXiv:2006.01413, 2020.

[108] Matt Post. A call for clarity in reporting bleu scores. In Proceedings of the Third
Conference on Machine Translation: Research Papers, pages 186–191, 2018.

[109] Matt Post and David Vilar. Fast lexically constrained decoding with dynamic beam
allocation for neural machine translation. In Proceedings of the 2018 Conference of
the North American Chapter of the Association for Computational Linguistics: Hu-
man Language Technologies, Volume 1 (Long Papers), pages 1314–1324, New Orleans,
Louisiana, June 2018. Association for Computational Linguistics.

126
[110] Samira Pouyanfar, Yudong Tao, Anup Mohan, Haiman Tian, Ahmed S Kaseb, Kent
Gauen, Ryan Dailey, Sarah Aghajanzadeh, Yung-Hsiang Lu, Shu-Ching Chen, et al.
Dynamic sampling in convolutional neural networks for imbalanced data classification.
In 2018 IEEE conference on multimedia information processing and retrieval (MIPR),
pages 112–117. IEEE, 2018.

[111] Zhe Qu, Xingyu Li, Rui Duan, Yao Liu, Bo Tang, and Zhuo Lu. Generalized federated
learning via sharpness aware minimization. In Kamalika Chaudhuri, Stefanie Jegelka,
Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the
39th International Conference on Machine Learning, volume 162 of Proceedings of
Machine Learning Research, pages 18250–18280. PMLR, 17–23 Jul 2022.

[112] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning
with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.

[113] Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-
learning with implicit gradients. In Advances in Neural Information Processing Sys-
tems, pages 113–124, 2019.

[114] Fuji Ren, Yanqiu Li, and Min Hu. Multi-classifier ensemble based on dynamic weights.
Multimedia Tools and Applications, 77(16):21083–21107, 2018.

[115] Veronica Rotemberg, Nicholas Kurtansky, Brigid Betz-Stablein, Liam Caffery, Em-
manouil Chousakos, Noel Codella, Marc Combalia, Stephen Dusza, Pascale Guitera,
David Gutman, et al. A patient-centric dataset of images and metadata for identifying
melanomas using clinical context. Scientific data, 8(1):1–8, 2021.

[116] Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman. La-
belme: a database and web-based tool for image annotation. International journal of
computer vision, 77(1-3):157–173, 2008.

127
[117] Danielle Saunders. Domain adaptation and multi-domain adaptation for neural ma-
chine translation: A survey. arXiv preprint arXiv:2104.06951, 2021.

[118] Robert E Schapire. The strength of weak learnability. Machine learning, 5(2):197–227,
1990.

[119] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan
Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised
learning from video. In 2018 IEEE international conference on robotics and automation
(ICRA), pages 1134–1141. IEEE, 2018.

[120] Amr Sharaf, Hany Hassan, and Hal Daumé III. Meta-learning for few-shot NMT adap-
tation. In Proceedings of the Fourth Workshop on Neural Generation and Translation,
pages 43–53, Online, July 2020. Association for Computational Linguistics.

[121] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear
memory cost. In International Conference on Machine Learning, pages 4596–4604.
PMLR, 2018.

[122] Yang Shu, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Transferable curriculum
for weakly-supervised domain adaptation. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 33, pages 4951–4958, 2019.

[123] Ziyue Song, Zhiyuan Ma, Kaiyue Qi, and Gongshen Liu. Dynamic tuning and weight-
ing of meta-learning for nmt domain adaptation. In Artificial Neural Networks and
Machine Learning–ICANN 2021: 30th International Conference on Artificial Neural
Networks, Bratislava, Slovakia, September 14–17, 2021, Proceedings, Part V, pages
576–587, 2021.

128
[124] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with
neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q.
Weinberger, editors, Advances in Neural Information Processing Systems, volume 27.
Curran Associates, Inc., 2014.

[125] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi.
Inception-v4, inception-resnet and the impact of residual connections on learning. In
Thirty-first AAAI conference on artificial intelligence, 2017.

[126] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna.
Rethinking the inception architecture for computer vision. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 2818–2826, 2016.

[127] Andrea Tagarelli, Alessia Amelio, and Francesco Gullo. Ensemble-based community
detection in multilayer networks. Data Mining and Knowledge Discovery, 31(5):1506–
1543, 2017.

[128] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional
neural networks. arXiv preprint arXiv:1905.11946, 2019.

[129] Dapeng Tao, Lianwen Jin, Yuan Yuan, and Yang Xue. Ensemble manifold rank pre-
serving for acceleration-based human activity recognition. IEEE transactions on neural
networks and learning systems, 27(6):1392–1404, 2014.

[130] Brian Thompson, Jeremy Gwinnup, Huda Khayrallah, Kevin Duh, and Philipp Koehn.
Overcoming catastrophic forgetting during domain adaptation of neural machine trans-
lation. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of
the 2019 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
pages 2062–2068, Minneapolis, Minnesota, June 2019. Association for Computational
Linguistics.

129
[131] Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In Proceedings of the
Eighth International Conference on Language Resources and Evaluation (LREC’12),
pages 2214–2218, Istanbul, Turkey, May 2012. European Language Resources Associ-
ation (ELRA).

[132] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter
Abbeel. Domain randomization for transferring deep neural networks from simulation
to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), pages 23–30. IEEE, 2017.

[133] Antonio Torralba, Alexei A Efros, et al. Unbiased look at dataset bias. In CVPR,
volume 1, page 7. Citeseer, 2011.

[134] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large
collection of multi-source dermatoscopic images of common pigmented skin lesions.
Scientific data, 5:180161, 2018.

[135] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep
transfer across domains and tasks. In Proceedings of the IEEE International Conference
on Computer Vision, pages 4068–4076, 2015.

[136] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discrimina-
tive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 7167–7176, 2017.

[137] Vincent Van Asch. Macro-and micro-averaged evaluation measures [[basic draft]]. Bel-
gium: CLiPS, 49, 2013.

[138] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal
of machine learning research, 9(11), 2008.

130
[139] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, L ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon,
U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Gar-
nett, editors, Advances in Neural Information Processing Systems, volume 30. Curran
Associates, Inc., 2017.

[140] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Pan-
chanathan. Deep hashing network for unsupervised domain adaptation. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
5018–5027, 2017.

[141] Sofia Visa, Brian Ramsay, Anca L Ralescu, and Esther Van Der Knaap. Confusion
matrix-based feature selection. MAICS, 710:120–127, 2011.

[142] Haishuai Wang, Zhicheng Cui, Yixin Chen, Michael Avidan, Arbi Ben Abdallah, and
Alexander Kronzer. Predicting hospital readmission via cost-sensitive deep learning.
IEEE/ACM transactions on computational biology and bioinformatics, 15(6):1968–
1978, 2018.

[143] Haohan Wang, Zexue He, Zachary L. Lipton, and Eric P. Xing. Learning robust
representations by projecting superficial statistics out. In International Conference on
Learning Representations, 2019.

[144] Haozhou Wang, James Henderson, and Paola Merlo. Multi-adversarial learning for
cross-lingual word embeddings. In Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, pages 463–472, Online, June 2021. Association for Computational
Linguistics.

131
[145] Kung-Jeng Wang, Bunjira Makond, Kun-Huang Chen, and Kung-Min Wang. A hybrid
classifier combining smote with pso to estimate 5-year survivability of breast cancer
patients. Applied Soft Computing, 20:15–24, 2014.

[146] Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, and Eiichiro Sumita. Instance
weighting for neural machine translation domain adaptation. In Proceedings of the 2017
Conference on Empirical Methods in Natural Language Processing, pages 1482–1488,
Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[147] Shoujin Wang, Wei Liu, Jia Wu, Longbing Cao, Qinxue Meng, and Paul J Kennedy.
Training deep neural networks on imbalanced data sets. In 2016 international joint
conference on neural networks (IJCNN), pages 4368–4374. IEEE, 2016.

[148] Wei Wang, Isaac Caswell, and Ciprian Chelba. Dynamically composing domain-data
selection with clean-data selection by “co-curricular learning” for neural machine trans-
lation. In Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics, pages 1282–1292, Florence, Italy, July 2019. Association for Computational
Linguistics.

[149] Wei Wang, Ye Tian, Jiquan Ngiam, Yinfei Yang, Isaac Caswell, and Zarana Parekh.
Learning a multi-domain curriculum for neural machine translation. In Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics, pages
7711–7723, Online, July 2020. Association for Computational Linguistics.

[150] Xin Wang, Yudong Chen, and Wenwu Zhu. A survey on curriculum learning. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2021.

[151] David H Wolpert. Stacked generalization. Neural networks, 5(2):241–259, 1992.

[152] Pei-Yuan Wu, Chi-Chen Fang, Jien Morris Chang, and Sun-Yuan Kung. Cost-effective
kernel ridge regression implementation for keystroke-based active authentication sys-
tem. IEEE transactions on cybernetics, 47(11):3916–3927, 2016.

132
[153] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learn-
ing via non-parametric instance discrimination. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 3733–3742, 2018.

[154] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated
residual transformations for deep neural networks. In Proceedings of the IEEE confer-
ence on computer vision and pattern recognition, pages 1492–1500, 2017.

[155] Yading Yuan, Ming Chao, and Yeh-Chi Lo. Automatic skin lesion segmentation using
deep fully convolutional networks with jaccard distance. IEEE transactions on medical
imaging, 36(9):1876–1886, 2017.

[156] Hadi Zanddizari, Nam Nguyen, Behnam Zeinali, and J Morris Chang. A new prepro-
cessing approach to improve the performance of cnn-based skin lesion classification.
Medical & Biological Engineering & Computing, 59(5):1123–1131, 2021.

[157] Behnam Zeinali, Di Zhuang, and J Morris Chang. Esai: Efficient split artifi-
cial intelligence via early exiting using neural architecture search. arXiv preprint
arXiv:2106.12549, 2021.

[158] Runzhe Zhan, Xuebo Liu, Derek F Wong, and Lidia S Chao. Meta-curriculum learning
for domain adaptation in neural machine translation. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 35, pages 14310–14318, 2021.

[159] Chong Zhang, Kay Chen Tan, and Ruoxu Ren. Training cost-sensitive deep belief
networks on imbalance data problems. In 2016 international joint conference on neural
networks (IJCNN), pages 4362–4367. IEEE, 2016.

[160] Jianpeng Zhang, Yutong Xie, Yong Xia, and Chunhua Shen. Attention residual learn-
ing for skin lesion classification. IEEE transactions on medical imaging, 38(9):2092–
2103, 2019.

133
[161] Lei Zhang and David Zhang. Evolutionary cost-sensitive extreme learning machine.
IEEE transactions on neural networks and learning systems, 28(12):3045–3060, 2016.

[162] Xuan Zhang, Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy Gwinnup,
Marianna J Martindale, Paul McNamee, Kevin Duh, and Marine Carpuat. An empir-
ical exploration of curriculum learning for neural machine translation. arXiv preprint
arXiv:1811.00739, 2018.

[163] Xuan Zhang, Pamela Shapiro, Gaurav Kumar, Paul McNamee, Marine Carpuat, and
Kevin Duh. Curriculum learning for domain adaptation in neural machine translation.
In Proceedings of the 2019 Conference of the North American Chapter of the Associ-
ation for Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers), pages 1903–1915, Minneapolis, Minnesota, June 2019. Association
for Computational Linguistics.

[164] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learn-
ing. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 4320–4328, 2018.

[165] Yulu Zhang, Liguo Shuai, Yali Ren, and Huilin Chen. Image classification with category
centers in class imbalance situation. In 2018 33rd Youth Academic annual conference
of Chinese Association of Automation (YAC), pages 359–363. IEEE, 2018.

[166] Boyan Zhou, Quan Cui, Xiu-Shen Wei, and Zhao-Min Chen. Bbn: Bilateral-branch
network with cumulative learning for long-tailed visual recognition. In Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition, pages 9719–
9728, 2020.

[167] Di Zhuang and J Morris Chang. Peerhunter: Detecting peer-to-peer botnets through
community behavior analysis. In 2017 IEEE Conference on Dependable and Secure
Computing, pages 493–500. IEEE, 2017.

134
[168] Di Zhuang and J Morris Chang. Enhanced peerhunter: Detecting peer-to-peer bot-
nets through network-flow level community behavior analysis. IEEE Transactions on
Information Forensics and Security, 14(6):1485–1500, 2018.

[169] Di Zhuang and J Morris Chang. Utility-aware privacy-preserving data releasing. arXiv
preprint arXiv:2005.04369, 2020.

[170] Di Zhuang, Morris J Chang, and Mingchen Li. Dynamo: Dynamic community detec-
tion by incrementally maximizing modularity. IEEE Transactions on Knowledge and
Data Engineering, 2019.

[171] Di Zhuang, Keyu Chen, and J Morris Chang. Cs-af: A cost-sensitive multi-classifier
active fusion framework for skin lesion classification. arXiv preprint arXiv:2004.12064,
2020.

[172] Di Zhuang, Keyu Chen, and J Morris Chang. Cs-af: A cost-sensitive multi-classifier
active fusion framework for skin lesion classification. Neurocomputing, 491:206–216,
2022.

[173] Di Zhuang, Sen Wang, and J Morris Chang. Fripal: Face recognition in privacy
abstraction layer. In 2017 IEEE Conference on Dependable and Secure Computing,
pages 441–448. IEEE, 2017.

[174] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning.
arXiv preprint arXiv:1611.01578, 2016.

[175] Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for
low-resource neural machine translation. In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Processing, pages 1568–1575, 2016.

135
Appendix A: Copyright Permissions

The permission below is for the reproduction of material in Chapter 2.

11/11/24, 3:31 PM Rightslink® by Copyright Clearance Center

Discriminative adversarial domain generalization with meta-learning

based cross-domain validation
Author: Keyu Chen,Di Zhuang,J. Morris Chang
Publication: Neurocomputing
Publisher: Elsevier
Date: 7 January 2022

Journal Author Rights

Please note that, as the author of this Elsevier article, you retain the right to include it in a thesis or dissertation,
provided it is not published commercially. Permission is not required, but please ensure that you reference the
journal as the original source. For more information on this and on your other retained rights, please
visit: https://www.elsevier.com/about/our-business/policies/copyright#Author-rights
BACK CLOSE WINDOW

Privacy - Terms

https://s100.copyright.com/AppDispatchServlet#formTop 1/1 136

The permission below is for the reproduction of material in Chapter 3.

11/11/24, 3:34 PM Rightslink® by Copyright Clearance Center

Epi-Curriculum: Episodic Curriculum Learning for Low-Resource

Domain Adaptation in Neural Machine Translation
Author: Keyu Chen
Publication: IEEE Transactions on Artificial Intelligence
Publisher: IEEE
Date: 03 May 2024

Thesis / Dissertation Reuse

The IEEE does not require individuals working on a thesis to obtain a formal reuse license, however, you may
print out this statement to be used as a permission grant:

Requirements to be followed when using any portion (e.g., figure, graph, table, or textual material) of an IEEE
copyrighted paper in a thesis:
1) In the case of textual material (e.g., using short quotes or referring to the work within these papers) users must
give full credit to the original source (author, paper, publication) followed by the IEEE copyright line © 2011 IEEE.
2) In the case of illustrations or tabular material, we require that the copyright line © [Year of original publication]
IEEE appear prominently with each reprinted figure and/or table.
3) If a substantial portion of the original paper is to be used, and if you are not the senior author, also obtain the
senior author's approval.

Requirements to be followed when using an entire IEEE copyrighted paper in a thesis:

1) The following IEEE copyright/ credit notice should be placed prominently in the references: © [year of original
publication] IEEE. Reprinted, with permission, from [author names, paper title, IEEE publication title, and month/year
of publication]
2) Only the accepted version of an IEEE copyrighted paper can be used when posting the paper or your thesis on-
line.
3) In placing the thesis on the author's university website, please display the following message in a prominent place
on the website: In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does
not endorse any of [university/educational entity's name goes here]'s products or services. Internal or personal use
of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or
promotional purposes or for creating new collective works for resale or redistribution, please go to
http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License
from RightsLink.

If applicable, University Microfilms and/or ProQuest Library, or the Archives of Canada may supply single copies of
the dissertation.
BACK CLOSE WINDOW

Privacy - Terms

https://s100.copyright.com/AppDispatchServlet#formTop 1/1

137
The permission below is for the reproduction of material in Chapter 4.

11/11/24, 4:48 PM Permissions and Reuse - arXiv info

We gratefully acknowledge support from the Simons Foundation,

member institutions, and all contributors.

Reuse Requests

This FAQ is an attempt to collect answers to your common questions surrounding reusing content from arXiv in your materials.

Can I reuse figures from an arXiv paper?

Do I need arXiv's permission to repost the full text?

How can I determine what license the version was assigned?

I want to include a paper of mine from arXiv in my thesis, do I need specific permission?

I want to include a paper of mine from arXiv in an institutional repository, do I need permission?

Can I harvest the full text of works?

Can I reuse figures from an arXiv paper?

The short answer is "it depends". More specifically: - If the license applied to the work allows for remixing or reuse with citation, then yes. - If not,
then the version is assigned one of the arXiv perpetual non-exclusive licenses, and you will need to contact the submitter or copyright holder (if
published) to determine applicable permissions.

Do I need arXiv's permission to repost the full text?

Note: All e-prints submitted to arXiv are subject to copyright protections. arXiv is not the copyright holder on any of the e-prints in our corpus.

In some cases, submitters have provided permission in advance by submitting their e-print under a permissive Creative Commons license. The
overwhelming majority of e-prints are submitted using the arXiv perpetual non-exclusive license, which does not grant further reuse permissions
directly. In these cases you will need to contact the author directly with your request.

How can I determine what license the version was assigned?

All arXiv abstract pages indicate an assigned license underneath the "Download:" options.

The link may appear as just the text (license) , such as at arXiv:2201.14176. Articles between 1991 and 2003 have an assumed license. These are
functionally equivalent to the arXiv non-exclusive license.

If the license applied by the submitter is one of the Creative Commons licenses, then a "CC" logo will appear, such as at arXiv:2201.04182.

I want to include a paper of mine from arXiv in my thesis, do I need specific permission?
If you are the copyright holder of the work, you do not need arXiv's permission to reuse the full text.

I want to include a paper of mine from arXiv in an institutional repository, do I need permission?
You do not need arXiv's permission to deposit arXiv's version of your work into an institutional repository. For all other institutional repository
cases, see our help page on institutional repositories.

Can I harvest the full text of works?

Plase see our bulk data help page, and the API Terms of Use for specific options. Note that the license for the full text is not a part of the current
search API schema. The license is, however, provided within arXiv's output from the OAI-PMH in either arXiv or arXivRaw formats.

https://info.arxiv.org/help/license/reuse.html#i-want-to-include-a-paper-of-mine-from-arxiv-in-my-thesis-do-i-need-specific-permission 1/2

138
The permission below is for the reproduction of material in Chapter 5.

11/11/24, 3:33 PM Rightslink® by Copyright Clearance Center

CS-AF: A cost-sensitive multi-classifier active fusion framework for skin

lesion classification
Author: Di Zhuang,Keyu Chen,J. Morris Chang
Publication: Neurocomputing
Publisher: Elsevier
Date: 28 June 2022

Journal Author Rights

Privacy - Terms

https://s100.copyright.com/AppDispatchServlet#formTop 1/1

139

Pioneer CDJ-2000NXS2 Service Manual
100% (1)
Pioneer CDJ-2000NXS2 Service Manual
134 pages
GANs for Data Augmentation in Healthcare (1)
No ratings yet
GANs for Data Augmentation in Healthcare (1)
24 pages
Transferability in Deep Learning: A Survey: Junguang Jiang
No ratings yet
Transferability in Deep Learning: A Survey: Junguang Jiang
64 pages
Tech com science
No ratings yet
Tech com science
68 pages
Thesis 2022-Bayesian Convolutional Neural Network With Prediction Smoothing A
No ratings yet
Thesis 2022-Bayesian Convolutional Neural Network With Prediction Smoothing A
65 pages
Tong 2020
No ratings yet
Tong 2020
14 pages
Computer Science 2
No ratings yet
Computer Science 2
66 pages
17.Master2017Liu
No ratings yet
17.Master2017Liu
105 pages
Yinpeng Wang, Qiang Ren - Deep Learning-Based Forward Modeling and Inversion Techniques For Computational Physics Problems (2024, CRC Press) - Libgen - Li
No ratings yet
Yinpeng Wang, Qiang Ren - Deep Learning-Based Forward Modeling and Inversion Techniques For Computational Physics Problems (2024, CRC Press) - Libgen - Li
199 pages
Deep learning in computational mechanics a review
No ratings yet
Deep learning in computational mechanics a review
51 pages
TFG Final
No ratings yet
TFG Final
75 pages
Harsha Thesis
No ratings yet
Harsha Thesis
62 pages
Neural Transfer Learning For NLP
No ratings yet
Neural Transfer Learning For NLP
329 pages
Master Inspera
No ratings yet
Master Inspera
45 pages
Nikhil PHD EG Thesis Final Revision Smallsize noCV
No ratings yet
Nikhil PHD EG Thesis Final Revision Smallsize noCV
164 pages
Validation and Optimization of 3d-Human Body Pose Estimation Approaches For Use in Motion Analysis, Ab. Sami Noorzad and Malek Zedan
No ratings yet
Validation and Optimization of 3d-Human Body Pose Estimation Approaches For Use in Motion Analysis, Ab. Sami Noorzad and Malek Zedan
87 pages
Early Prediction of Student Performance Using Neural Networks FINAL
No ratings yet
Early Prediction of Student Performance Using Neural Networks FINAL
55 pages
Deep Learning Enhancement and Privacy-Preserving Deep Learning - A
No ratings yet
Deep Learning Enhancement and Privacy-Preserving Deep Learning - A
116 pages
A Selective Overview of Deep Learning: Jianqing Fan Cong Ma Yiqiao Zhong April 16, 2019
No ratings yet
A Selective Overview of Deep Learning: Jianqing Fan Cong Ma Yiqiao Zhong April 16, 2019
37 pages
A COMPARATIVE ANALYSIS OF DEEP LEARNING MODEL FOR FLOWER RECOGNITION AND HEALTH PREDICTION
No ratings yet
A COMPARATIVE ANALYSIS OF DEEP LEARNING MODEL FOR FLOWER RECOGNITION AND HEALTH PREDICTION
87 pages
Jasper Busschers - Thesis Final
No ratings yet
Jasper Busschers - Thesis Final
39 pages
Shreya Ghosh MS Thesis Final Revised
No ratings yet
Shreya Ghosh MS Thesis Final Revised
64 pages
CraterNet - A Fully Convolutional Neural Network For Lunar Crater Detection Based On Remotely Sensed Data
No ratings yet
CraterNet - A Fully Convolutional Neural Network For Lunar Crater Detection Based On Remotely Sensed Data
159 pages
Tfm Fabian Eitel
No ratings yet
Tfm Fabian Eitel
121 pages
Lagergren Rosengren 2020
No ratings yet
Lagergren Rosengren 2020
108 pages
Kirkvik Acit2022
No ratings yet
Kirkvik Acit2022
155 pages
Ali Thesis
No ratings yet
Ali Thesis
125 pages
L D G M: Earning EEP Enerative Odels
No ratings yet
L D G M: Earning EEP Enerative Odels
84 pages
Machine Learning Applications in Graduation Prediction Topic
No ratings yet
Machine Learning Applications in Graduation Prediction Topic
86 pages
MastersThesis%20%284%29
No ratings yet
MastersThesis%20%284%29
51 pages
BBBB
No ratings yet
BBBB
8 pages
Cheatsheets For Deep Learning 1650192034
No ratings yet
Cheatsheets For Deep Learning 1650192034
95 pages
llm-book (1)
No ratings yet
llm-book (1)
161 pages
Data Augmentation For Supervised Learning With Generative Adversa
No ratings yet
Data Augmentation For Supervised Learning With Generative Adversa
60 pages
tesi
No ratings yet
tesi
106 pages
Deep Learning for Image Spam Detection
No ratings yet
Deep Learning for Image Spam Detection
44 pages
Deep-Learning-book-part1
No ratings yet
Deep-Learning-book-part1
100 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
168 pages
Wang_asu_0010N_21448
No ratings yet
Wang_asu_0010N_21448
81 pages
Understanding Deep Learning Chitta Ranjan
No ratings yet
Understanding Deep Learning Chitta Ranjan
13 pages
Predicting User Interaction On Social Media Using Machine Learnin
No ratings yet
Predicting User Interaction On Social Media Using Machine Learnin
76 pages
Twitter_Dataset_for_Advanced_Hate_Speech_Identification__Final
No ratings yet
Twitter_Dataset_for_Advanced_Hate_Speech_Identification__Final
78 pages
Deep Learning Based Sentiment
No ratings yet
Deep Learning Based Sentiment
62 pages
Alice's Adventures in A Differentiable Wonderland
No ratings yet
Alice's Adventures in A Differentiable Wonderland
279 pages
FULLTEXT01
No ratings yet
FULLTEXT01
74 pages
Yu Xiaozhuo
No ratings yet
Yu Xiaozhuo
85 pages
Gene Prediction Using Unsupervised Deep Networks
No ratings yet
Gene Prediction Using Unsupervised Deep Networks
49 pages
Modern ML
No ratings yet
Modern ML
146 pages
CS985 Project FrankMitchell BiP Solutions
No ratings yet
CS985 Project FrankMitchell BiP Solutions
66 pages
Seed1.5-VL Technical Report
No ratings yet
Seed1.5-VL Technical Report
77 pages
Final 1
No ratings yet
Final 1
36 pages
Lbdlu
No ratings yet
Lbdlu
168 pages
PhoCLIP_232_Specialized_Project_OFFICIAL
No ratings yet
PhoCLIP_232_Specialized_Project_OFFICIAL
105 pages
The Little Book of Deep Learning - (François Fleuret) - University of Geneva-2023.compressed
No ratings yet
The Little Book of Deep Learning - (François Fleuret) - University of Geneva-2023.compressed
163 pages
DSA Book by Narshimha
No ratings yet
DSA Book by Narshimha
76 pages
Moseley 2022 Physics-Informed Machine Learning-1
No ratings yet
Moseley 2022 Physics-Informed Machine Learning-1
268 pages
AI Image Classification With Neural Network 221002564 222002074
No ratings yet
AI Image Classification With Neural Network 221002564 222002074
17 pages
Unlocking Statistics for the Social Sciences
From Everand
Unlocking Statistics for the Social Sciences
Norma Sinclair
No ratings yet
Data Empowerment: Harnessing Advanced Mathematical and Statistical Methods for Data Science and Machine Learning
From Everand
Data Empowerment: Harnessing Advanced Mathematical and Statistical Methods for Data Science and Machine Learning
NAGARAJU CHEVURU
No ratings yet
Time-dependent Behaviour and Design of Composite Steel-concrete Structures
From Everand
Time-dependent Behaviour and Design of Composite Steel-concrete Structures
Massimiliano Bocciarelli
No ratings yet
Lifelong Education: Continuous Learning in the Digital Age
From Everand
Lifelong Education: Continuous Learning in the Digital Age
Maia Tobares
No ratings yet
Ai Potential
No ratings yet
Ai Potential
33 pages
2.5.the Histogram Exercise
No ratings yet
2.5.the Histogram Exercise
5 pages
Dileep Kumar KL - Manual Automation Testing QA - 5.2 EXPERIENCE
No ratings yet
Dileep Kumar KL - Manual Automation Testing QA - 5.2 EXPERIENCE
4 pages
Geothermal Energy
93% (15)
Geothermal Energy
27 pages
Log Com - Roblox.client 1686918755
No ratings yet
Log Com - Roblox.client 1686918755
328 pages
file handling
No ratings yet
file handling
23 pages
JustGetIt Internship Tasks
No ratings yet
JustGetIt Internship Tasks
8 pages
Unsupervised Learning - Clustering
No ratings yet
Unsupervised Learning - Clustering
19 pages
Abc PDF
No ratings yet
Abc PDF
149 pages
What are Database Recovery Techniques
No ratings yet
What are Database Recovery Techniques
11 pages
7 Steps To Upgrade IOS Image On Cisco Catalyst Switch or Router
No ratings yet
7 Steps To Upgrade IOS Image On Cisco Catalyst Switch or Router
5 pages
FCP_FMG_AD-7.4-demo
No ratings yet
FCP_FMG_AD-7.4-demo
8 pages
Structural Engg
No ratings yet
Structural Engg
31 pages
Footing Schedule: 3 Storeyed R.C.C Building
No ratings yet
Footing Schedule: 3 Storeyed R.C.C Building
1 page
As A Software Developer in India
No ratings yet
As A Software Developer in India
85 pages
Kki Catalogue 2023
No ratings yet
Kki Catalogue 2023
104 pages
Muslim Hands YEMEN
No ratings yet
Muslim Hands YEMEN
3 pages
Css Interview Questions
100% (1)
Css Interview Questions
7 pages
Liste Des Defauts Ssam
No ratings yet
Liste Des Defauts Ssam
8 pages
How Corona Virus Effects Malaysia Economy From Microeconomics Perspective
No ratings yet
How Corona Virus Effects Malaysia Economy From Microeconomics Perspective
2 pages
Did603A: Multiplatform Experience Design
No ratings yet
Did603A: Multiplatform Experience Design
35 pages
Get Building Web Apps With WordPress WordPress As An Application Framework Brian Messenlehner Free All Chapters
100% (1)
Get Building Web Apps With WordPress WordPress As An Application Framework Brian Messenlehner Free All Chapters
44 pages
Service Refund Coupon: Coupon Number: 21101603044514 Coupon Date: 16-Oct-2021
No ratings yet
Service Refund Coupon: Coupon Number: 21101603044514 Coupon Date: 16-Oct-2021
1 page
Infrared CO2 Sensor Module (Model: MH-Z19C) : User's Manual
No ratings yet
Infrared CO2 Sensor Module (Model: MH-Z19C) : User's Manual
7 pages
Activation Functions
No ratings yet
Activation Functions
4 pages
iDRAC-G8PM9W2 - iDRAC9 - Systesm PDF
No ratings yet
iDRAC-G8PM9W2 - iDRAC9 - Systesm PDF
5 pages
SL650 Brochure 06 2022 Email
No ratings yet
SL650 Brochure 06 2022 Email
4 pages
01b A Level Mathematics Practice Paper A - Pure Mathematics Mark Scheme
No ratings yet
01b A Level Mathematics Practice Paper A - Pure Mathematics Mark Scheme
17 pages
E-Procurement Government of Karnataka
No ratings yet
E-Procurement Government of Karnataka
93 pages