Adversarial Learning
Zhibo Wang†,‡, Xiaowei Dong†, Henry Xue⋆, Zhifei Zhang♯, Weifeng Chiu⋆, Tao Wei⋆, Kui Ren‡
† School of Cyber Science and Engineering, Wuhan University, P. R. China
‡ School of Cyber Science and Technology, Zhejiang University, P. R. China
⋆ Ant Group, ♯ Adobe Research
This manuscript was accepted by CVPR 2022.

Abstract

Prioritizing fairness is of central importance in artificial intelligence (AI) systems, especially for societal applications, e.g., hiring systems should recommend applicants equally from different demographic groups, and risk assessment systems must eliminate racism in criminal justice. Existing efforts towards the ethical development of AI systems have leveraged data science to mitigate biases in the training set or introduced fairness principles into the training process. For a deployed AI system, however, retraining or tuning may not be allowed in practice. By contrast, we propose a more flexible approach, i.e., fairness-aware adversarial perturbation (FAAP), which learns to perturb input data to blind deployed models to fairness-related features, e.g., gender and ethnicity. The key advantage is that FAAP does not modify deployed models in terms of parameters or structures. To achieve this, we design a discriminator to distinguish fairness-related attributes based on latent representations from deployed models. Meanwhile, a perturbation generator is trained against the discriminator, such that no fairness-related features can be extracted from perturbed inputs. Exhaustive experimental evaluation demonstrates the effectiveness and superior performance of the proposed FAAP. In addition, FAAP is validated on real-world commercial deployments (whose model parameters are inaccessible), which shows the transferability of FAAP and foresees the potential of black-box adaptation.

1. Introduction

AI systems have been widely deployed in many high-stakes applications, e.g., face recognition [3, 21], hiring process [14, 15], health care [13], etc. However, some existing AI systems are found to treat individuals unequally based on protected attributes, e.g., ethnicity, gender, and nationality. Such biases are referred to as unfairness. For instance, Amazon realized that its automatic recruitment system exhibited skewness between male and female candidates [12], i.e., male candidates were more likely to be hired than female candidates. COMPAS, an assessment system for recidivism risk, was found to have racial prejudice [6]. Such unfairness has become a subtle and ubiquitous property of AI systems, and it is non-trivial to mitigate, ideally without touching the deployed models.

Many works have been proposed to mitigate unfairness/biases, and they can be divided into three categories according to the stage at which de-biasing is applied, i.e., pre-processing, in-processing, and post-processing. From the pre-processing perspective, [8, 16, 17, 27, 31] mitigate biases in the training dataset, thereby reducing the bias learned during model training. For the in-processing methods, [1, 19, 30] introduce fairness-related penalties into the learning process to train a fairer model. These methods need to retrain or fine-tune the target models, which is unsuitable when the models are already deployed and their training sets are inaccessible. [7] proposed a boosting method that post-processes a deployed deep learning model to produce a new classifier with equal accuracy across different groups of people. However, [7] needs to replace the final classifier and cannot ensure statistical and predictive parity, e.g., that individuals in different groups are treated equally in prediction.

To the best of our knowledge, existing works are not suitable for improving fairness at the inference phase without changing the deep model. Therefore, it is imperative to propose a practical approach that mitigates the unfairness of deployed models without changing their parameters and structures. Since deep models tend to learn spurious
correlations between protected attributes and target labels from training data, e.g., race may correlate with criminal risk, the key to mitigating unfairness is to break such correlations. As we assume the model cannot be modified, the main challenge in achieving this goal is how to prevent the deployed model from extracting fairness-related information from inputs. Intuitively, the only thing we can modify is the input data at the inference stage of deployed models, i.e., perturbing the inputs such that the model cannot recognize the protected attributes.

Based on the above idea, we propose the Fairness-Aware Adversarial Perturbation (FAAP), which learns to perturb input samples to blind deployed models to fairness-related features. As shown in Fig. 1, the deployed model cannot distinguish the fairness-related feature (e.g., gender) from the perturbed input image. Therefore, the predictions will not correlate with the protected attributes. The key idea is that perturbations can remap samples to distribute tightly along the decision hyperplane of protected attributes in the model's latent space, making them difficult to distinguish. To achieve this, we train a generator to produce adversarial perturbations. During the training process, a discriminator is trained to distinguish the protected attributes from the representations of the model, while the generator learns to deceive the discriminator, thus generating fairness-aware perturbations that hide the information of protected attributes from the feature extraction process. Extensive experimental evaluation demonstrates the superior performance of the proposed FAAP and shows its potential in the black-box scenario, i.e., mitigating unfairness of models without access to their parameters.

Figure 1. Illustration on a smile detection model. The original image is falsely recognized due to model unfairness, i.e., a tendency to predict males as "not smiling". The fairness-aware adversarial perturbation generated by FAAP helps the input image hide the protected attribute and get fair treatment.

In summary, the main contributions of this paper are three-fold:

• We make the first attempt to mitigate the unfairness of deployed deep models without changing their parameters and structures. This pushes fairness research towards a more practical scenario.

• We propose the fairness-aware adversarial perturbation (FAAP), which designs a discriminator to distinguish fairness-related attributes based on latent representations from deployed models. Meanwhile, a generator is trained adversarially to perturb input data and prevent the deployed models from extracting fairness-related features. This design effectively decorrelates fairness-related/protected attributes from predictions.

• Extensive experiments demonstrate the superior performance of the proposed FAAP. In addition, evaluation on real-world commercial APIs shows the transferability of FAAP, which indicates the potential of further exploring our method in the black-box scenario.

2. Related work

This section overviews related works on unfairness mitigation, which can be roughly divided according to the stage they target, i.e., pre-processing (data pre-processing before training), in-processing (penalty design during training), and post-processing (prediction adjustment after training).

Pre-processing methods [16, 17, 31] aim to mitigate biases in the training dataset, i.e., fairer training sets would train fairer models. Many methods have been proposed to de-bias training sets by fair data representation transformation or data distribution augmentation. Quadrianto et al. [16] used data-to-data translation to find a middle-ground representation for different gender groups in the training data, so that the model will not learn gender tendencies. Ramaswamy et al. [17] generated paired training data to balance protected attributes, which removes spurious correlations between target labels and protected attributes. Zhang et al. [31] proposed to generate adversarial examples to supplement the training dataset, balancing the data distribution over different protected attributes.

In-processing approaches [1, 18, 19, 29, 30] introduce fairness principles into the training process, i.e., training models with specifically designed fairness penalties/constraints or adversarial mechanisms. Zafar et al. [29] proposed to maximize accuracy under disparate-impact constraints to improve fairness in machine learning. Brian et al. [1] and Zhang et al. [30] enforced the model to produce fair outputs with adversarial training techniques, maximizing accuracy while minimizing the ability of a discriminator to predict the protected attribute. Roh et al. [18] provided a mutual information-based interpretation of an existing adversarial training-based method for improving disparate impact and equalized odds. Sarhan et al. [19] imposed orthogonality and disentanglement constraints on the representation and forced the representation to be agnostic to protected information by entropy maximization, so that the subsequent classifier can make fair predictions based on the learned representation. This line of research aims at getting a fairer model by explicitly changing the training procedure.
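For concreteness, the following is a minimal sketch of this adversarial in-processing idea, not the exact formulation of [1, 30]: an auxiliary head tries to predict the protected attribute from the encoder's representation, and a gradient-reversal layer (one common way to implement the min-max objective) pushes the encoder to discard that information while the task head keeps accuracy. All architectures, names, and hyper-parameters below are illustrative placeholders; note that, unlike FAAP, this recipe modifies the model during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients on the way back."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

encoder   = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU())
task_head = nn.Linear(256, 2)   # target label, e.g., smiling / not smiling
adv_head  = nn.Linear(256, 2)   # protected attribute, e.g., gender
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(task_head.parameters()) + list(adv_head.parameters()),
    lr=1e-4,
)

def train_step(x, y, z, lamb=1.0):
    """Minimize the task loss; the reversed gradient pushes the encoder to hide z."""
    h = encoder(x)
    task_loss = F.cross_entropy(task_head(h), y)
    adv_loss  = F.cross_entropy(adv_head(GradReverse.apply(h, lamb)), z)
    loss = task_loss + adv_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return task_loss.item(), adv_loss.item()
```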
Different from this line of work, our method is applied after the training process and can improve fairness without changing the deployed model.

Post-processing works [7, 10] tend to adjust model predictions according to certain fairness criteria. Lohia et al. [10] proposed a post-processing algorithm that helps a model meet both individual and group fairness criteria on tabular data by detecting biases from model outputs and correspondingly editing protected attributes to adjust model predictions. However, this method needs to change protected attributes at test time, which is hard for computer vision applications. Michael et al. [7] proposed a method that post-processes a pre-trained deep learning model to create a new classifier with equal accuracy for people with different protected attributes. However, [7] needs to replace the final classifier, and equal sub-group accuracy cannot ensure that people in different groups have an equal chance of receiving favorable predictions, e.g., it allows unequal false positive and false negative rates. We borrow ideas from this line of research, but we improve fairness from the data side instead of manipulating the model or its prediction.

Figure 2. The basic idea of the proposed fairness-aware adversarial perturbation (FAAP). Gender bias exists in the left part, i.e., the false positive rate for females is much higher than for males. Without adjusting the decision hyperplanes of the deployed model, FAAP perturbs samples to decorrelate the target label and gender in latent space. In the right part, perturbed samples distribute tightly along the gender hyperplane while preserving the distinguishability along the target-label hyperplane.
3. Preliminaries

3.1. Model fairness

In this paper, we focus on visual classification models because of the exhaustive academic efforts on them, as well as their broad industrial applications. Moreover, it is important to achieve equal treatment for people with different protected attributes, e.g., nationality, gender, and ethnicity. Therefore, demographic parity [28] and equalized odds [4] are adopted to measure model fairness.

In a binary classification task, e.g., criminal prediction, suppose the target label y ∈ Y = {−1, 1} and the protected attribute z ∈ Z = {−1, 1}, where y = 1 is the favourable class (e.g., lower criminal tendency) and z = 1 is the privileged group (e.g., Caucasian).

Definition 1 (Demographic Parity). If the value of z does not influence assigning a sample to the positive class, i.e., the model prediction ŷ = 1 is independent of z, then the classifier satisfies demographic parity:

P(ŷ = 1 ∣ z = −1) = P(ŷ = 1 ∣ z = 1).    (1)

Definition 2 (Equalized Odds). The model prediction is conditionally independent of the protected attribute given the target label, i.e., P(ŷ = 1 ∣ z = −1, y) = P(ŷ = 1 ∣ z = 1, y) for y ∈ {−1, 1}. Samples in both the privileged and unprivileged groups then have the same false positive rate and false negative rate.

3.2. Adversarial examples

Recent studies show that deep learning models are vulnerable to adversarial examples [24]. Given a classification model C(x), the goal of an adversarial attack is to find a small perturbation that generates an adversarial example x′ to mislead the classifier C. More specifically, there are two kinds of adversarial example attacks. For an input x with ground-truth label y, a targeted attack makes C(x′) = y′, where y′ ≠ y is a label specified by the attacker. On the contrary, in an untargeted attack, the attacker misleads the classification model so that C(x′) ≠ y. Typically, the lp norm of the perturbation should be less than a budget ε, i.e., ∥x − x′∥p ≤ ε. Many methods have been proposed to generate adversarial examples, such as PGD [11], CW [2], and GAN-based methods [26].

4. Fairness-aware adversarial perturbation

In this paper, we propose fairness-aware adversarial perturbation (FAAP) to mitigate the unfairness inherent in deep models. This section will overview the proposed FAAP and detail the design of the network and loss functions. Finally, we will discuss the training strategy of FAAP.
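As a rough sketch of how such a setup can be wired together: the deployed model (a feature extractor f plus a classifier head clf) stays frozen, a generator G produces a bounded perturbation, and a discriminator D tries to recover the protected attribute from the latent representation of the perturbed image, while G is trained to keep the task prediction correct and to fool D. The toy networks, the ε budget, and the particular way D is "confused" below are simplifications and placeholders; they roughly play the roles of the paper's target and fairness losses (referred to later as L_G^T and L_G^fair) but are not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the frozen deployed model: feature extractor f and classifier head clf.
f   = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU())
clf = nn.Linear(256, 2)
for p in list(f.parameters()) + list(clf.parameters()):
    p.requires_grad_(False)            # the deployed model is never modified

G = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 3 * 64 * 64), nn.Tanh())  # toy generator
D = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))               # attribute discriminator
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
eps = 0.05                             # perturbation budget (assumed value)

def perturb(x):
    delta = eps * G(x).view_as(x)      # bounded, image-shaped perturbation
    return (x + delta).clamp(0, 1)

def train_step(x, y, z):
    """x: images in [0, 1], y: target labels, z: protected attributes (e.g., gender)."""
    # 1) Discriminator step: try to recover z from latent features of perturbed images.
    h = f(perturb(x)).detach()
    loss_D = F.cross_entropy(D(h), z)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2) Generator step: keep the task prediction correct while hiding z from D.
    h = f(perturb(x))
    loss_task = F.cross_entropy(clf(h), y)    # preserve the deployed model's utility
    loss_fair = -F.cross_entropy(D(h), z)     # one simple way to "fool" the discriminator
    loss_G = loss_task + loss_fair
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```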
Table 1. Results of deployed models before and after being equipped with the proposed FAAP, on CelebA (Tables 1(a) to 1(c)) and LFW (Tables 1(d) to 1(f)). For the fairness criteria DP and DEO, lower is fairer; for accuracy (ACC), higher is better.
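The DP and DEO numbers reported in Table 1 (and later for the commercial APIs) can be computed from binary predictions, ground-truth labels, and the protected attribute. The helper below uses one common formulation, DP as the gap in positive-prediction rates between groups and DEO as the summed gaps in true- and false-positive rates; the paper's exact definitions may normalize these differently, so treat this as an illustrative sketch.

```python
import numpy as np

def fairness_metrics(y_true, y_pred, z):
    """y_true, y_pred in {0, 1}; z in {0, 1} marks the protected group.
    Returns (DP gap, DEO) under one common formulation."""
    y_true, y_pred, z = map(np.asarray, (y_true, y_pred, z))
    dp = abs(y_pred[z == 1].mean() - y_pred[z == 0].mean())  # gap in P(yhat=1 | z)

    def tpr_fpr(g):
        yt, yp = y_true[z == g], y_pred[z == g]
        tpr = yp[yt == 1].mean() if (yt == 1).any() else 0.0
        fpr = yp[yt == 0].mean() if (yt == 0).any() else 0.0
        return tpr, fpr

    (tpr1, fpr1), (tpr0, fpr0) = tpr_fpr(1), tpr_fpr(0)
    deo = abs(tpr1 - tpr0) + abs(fpr1 - fpr0)
    return dp, deo

# Example: dp, deo = fairness_metrics(labels, model_preds, gender)
```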
We provide visual explanations [20] (heat-maps) for the deployed models. We visualize a subset of test images that were originally falsely predicted by the deployed model but are successfully recognized after perturbation in Fig. 4. For each deployed model, we provide explanations on raw data, on images perturbed by G trained with only the target loss L_G^T (without the fairness loss L_G^fair), and on images perturbed by G optimized with the full loss L_G. (1) Normal training model. As shown in Fig. 4(a), for a normal training model, our adversarial perturbation helps the model focus on the right area (the mouth) and make correct predictions. The red area of the images in "only T" deviates little from the mouth. (2) Fair training model. Since this kind of model has less gender bias than the other models, as shown in Fig. 4(b), G optimized only towards improving target-label accuracy yields heat-maps similar to "ours". Both of them help the deployed model focus on the right area. (3) Unfair training model. Unfair training models have a larger gender tendency, so the perturbation in "only T" lets the model make correct predictions but misleads the model to focus on unrelated areas (e.g., the eyes in Fig. 4(c), the hair in Fig. 4(d)). In contrast, our method helps the model focus on the right area and make the right predictions.

T-SNE is a method to visualize high-dimensional data in a low-dimensional view. To better demonstrate that our method hides sensitive information by remapping images close to the protected-attribute decision hyperplane while maintaining their distance to the target-label decision hyperplane in the latent space of the deployed model, we utilize T-SNE to obtain low-dimensional embeddings of the feature representations. More specifically, we extract feature vectors of these images with and without adversarial perturbation and visualize them in a two-dimensional diagram with T-SNE.
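A minimal sketch of this visualization step, assuming a frozen feature extractor f from the deployed model and a perturb function (e.g., the trained generator), both placeholder names, and using scikit-learn's T-SNE:

```python
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

@torch.no_grad()
def embed_2d(f, images, perturb=None):
    """Latent features of (optionally perturbed) images, projected to 2-D with T-SNE."""
    x = images if perturb is None else perturb(images)
    feats = f(x).cpu().numpy()
    return TSNE(n_components=2, init="pca", perplexity=30).fit_transform(feats)

def plot_groups(points, groups, title):
    """Scatter the 2-D embedding, one color per group, e.g., (gender, target label) pairs."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        sel = groups == g
        plt.scatter(points[sel, 0], points[sel, 1], s=4, label=str(g))
    plt.title(title)
    plt.legend()
    plt.show()

# e.g., plot_groups(embed_2d(f, images, perturb), groups, "with FAAP perturbation")
```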
(1) Normal training model. From Fig. 5(a) and …

Figure 5. T-SNE results for three different models on Smiling, Attractive, and Blond Hair in CelebA. The upper row shows the results on the raw data and the bottom row shows the results on images perturbed with FAAP. In each sub-figure, the feature representation in column (1) is extracted from a normal training model, column (2) from a fair training model, column (3) from an LF model, and column (4) from an RG model. (Better viewed in color.)

We further evaluate FAAP on real-world commercial APIs, including Alibaba and Baidu. Alibaba's face analysis API returns binary results, in which "0" means "not smiling" and "1" means "smiling". Baidu's face analysis API returns three categories, named "none", "smile", and "laugh"; we assume "none" means not smiling and the others mean smiling. We find that these APIs exhibit some degree of unfairness, i.e., their DEO is about 0.1. Since this is a totally black-box scenario, we know nothing about the models behind these APIs. We therefore train the generator with model-ensemble techniques, taking the normal training model and the fair training model in Section 5.2 as surrogate models. Then we upload the perturbed images to these APIs and record the results.

Table 2 shows the results of these face analysis APIs on original and perturbed images. From Table 2(a), we can see that FAAP improves DP by 0.0293 and decreases DEO to 0.0368, with only a 0.0026 degradation in accuracy. Likewise, Table 2(b) shows improvements of 0.0411 in DP and 0.0648 in DEO, with a 0.0289 degradation in accuracy, for Baidu. These results show the transferability of FAAP and the potential usage of FAAP in black-box scenarios.
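A minimal sketch of this black-box evaluation loop, under stated assumptions: query_api is a hypothetical wrapper around a commercial face-analysis endpoint (the real Alibaba/Baidu SDK calls are not shown), perturb is the trained generator applied to a uint8 image, and fairness_metrics is the DP/DEO helper sketched after Table 1.

```python
import io
import numpy as np
from PIL import Image

def to_smiling_label(api_result):
    """Map an API response to {0, 1}; for a three-way 'none'/'smile'/'laugh' output,
    everything except 'none' is treated as smiling, as described above."""
    return 0 if api_result == "none" else 1

def evaluate_api(query_api, images, labels, genders, perturb=None):
    """Upload (optionally perturbed) images to a black-box API; report accuracy, DP, DEO."""
    preds = []
    for img in images:                              # img: HxWx3 uint8 array
        if perturb is not None:
            img = perturb(img)
        buf = io.BytesIO()
        Image.fromarray(img).save(buf, format="JPEG")
        preds.append(to_smiling_label(query_api(buf.getvalue())))
    preds = np.array(preds)
    acc = float((preds == np.asarray(labels)).mean())
    dp, deo = fairness_metrics(labels, preds, genders)  # helper defined earlier
    return acc, dp, deo
```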