
Fairness-aware Adversarial Perturbation Towards Bias Mitigation for Deployed Deep Models

Zhibo Wang†,‡, Xiaowei Dong†, Henry Xue⋆, Zhifei Zhang♯, Weifeng Chiu⋆, Tao Wei⋆, Kui Ren‡
† School of Cyber Science and Engineering, Wuhan University, P. R. China
‡ School of Cyber Science and Technology, Zhejiang University, P. R. China
⋆ Ant Group, ♯ Adobe Research
{zhibowang, kuiren}@zju.edu.cn, [email protected], {weifeng.qwf, lenx.wei}@antgroup.com, [email protected], [email protected]

arXiv:2203.01584v1 [cs.LG] 3 Mar 2022

This manuscript was accepted by CVPR 2022.

Abstract

Prioritizing fairness is of central importance in artificial intelligence (AI) systems, especially for societal applications, e.g., hiring systems should recommend applicants equally across demographic groups, and risk assessment systems must eliminate racism in criminal justice. Existing efforts towards the ethical development of AI systems have leveraged data science to mitigate biases in the training set or introduced fairness principles into the training process. For a deployed AI system, however, retraining or fine-tuning may not be feasible in practice. By contrast, we propose a more flexible approach, i.e., fairness-aware adversarial perturbation (FAAP), which learns to perturb input data to blind deployed models on fairness-related features, e.g., gender and ethnicity. The key advantage is that FAAP does not modify deployed models in terms of parameters and structures. To achieve this, we design a discriminator to distinguish fairness-related attributes based on latent representations from deployed models. Meanwhile, a perturbation generator is trained against the discriminator, such that no fairness-related features can be extracted from perturbed inputs. Extensive experimental evaluation demonstrates the effectiveness and superior performance of the proposed FAAP. In addition, FAAP is validated on real-world commercial deployments (inaccessible to model parameters), which shows the transferability of FAAP and foresees the potential of black-box adaptation.

1. Introduction

AI systems have been widely deployed in many high-stakes applications, e.g., face recognition [3, 21], hiring processes [14, 15], health care [13], etc. However, some existing AI systems have been found to treat individuals unequally based on protected attributes, e.g., ethnicity, gender, and nationality. Such biases are referred to as unfairness. For instance, Amazon realized that its automatic recruitment system exhibited skewness between male and female candidates [12], i.e., male candidates were more likely to be recommended for hiring than female candidates. COMPAS, a recidivism risk assessment system, was found to have racial prejudice [6]. Such unfairness has become a subtle and ubiquitous property of AI systems, and it is non-trivial to mitigate it, ideally without touching the deployed models.

Many works have been proposed to mitigate unfairness/biases; they can be divided into three categories according to the stage at which de-biasing is applied, i.e., pre-processing, in-processing, and post-processing. From the perspective of pre-processing, [8, 16, 17, 27, 31] mitigated biases in the training dataset, thus mitigating the bias learned during model training. For in-processing, [1, 19, 30] introduced fairness-related penalties into the learning process to train a fairer model. These methods need to retrain or fine-tune the target models, which is unsuitable when the models are already deployed and their training sets are inaccessible. [7] proposed a boosting method to post-process a deployed deep learning model and produce a new classifier that has equal accuracy across different groups of people. However, [7] needs to replace the final classifier and cannot ensure statistical and predictive parity, e.g., that individuals in different groups are treated equally in prediction.

To the best of our knowledge, existing works are not suitable for improving fairness at the inference phase without changing the deep model. Therefore, it is imperative to propose a practical approach that mitigates the unfairness of deployed models without changing their parameters and structures. Since deep models tend to learn spurious correlations between protected attributes and target labels from training data, e.g., race may correlate with criminal risk, the key to mitigating unfairness is to break such correlations. As we assume the model is not modified, the main challenge in achieving this goal is how to prevent the deployed model from extracting fairness-related information from inputs. Intuitively, the only thing we can modify is the input data during the inference stage of deployed models, i.e., perturbing the inputs such that the model cannot recognize those protected attributes.

Based on the above idea, we propose the Fairness-Aware Adversarial Perturbation (FAAP), which learns to perturb input samples to blind deployed models on fairness-related features. As shown in Fig. 1, the deployed model cannot distinguish the fairness-related feature (e.g., gender) from the perturbed input image. Therefore, the predictions will not correlate with the protected attributes. The key idea is that perturbations can remap samples to tightly distribute along the decision hyperplane of protected attributes in the model's latent space, making them difficult to distinguish. To achieve this, we train a generator to produce adversarial perturbation. During the training process, a discriminator is trained to distinguish the protected attributes from the representations of the model, while the generator learns to deceive the discriminator, thus generating fairness-aware perturbation that hides the information of protected attributes from the feature extraction process. Extensive experimental evaluation demonstrates the superior performance of the proposed FAAP and shows its potential in the black-box scenario, i.e., mitigating unfairness of models without access to their parameters.

Figure 1. Illustration on a smile detection model. The original image ([man] not smiling) is falsely recognized due to model unfairness, i.e., the model tends to predict males as "not smiling". The fairness-aware adversarial perturbation generated by FAAP helps the input image hide the protected attribute and receive fair treatment ([unknown] smiling).

In summary, the main contributions of this paper are three-fold:

• We make the first attempt to mitigate the unfairness of deployed deep models without changing their parameters and structures. This pushes fairness research towards a more practical scenario.

• We propose the fairness-aware adversarial perturbation (FAAP), which designs a discriminator to distinguish fairness-related attributes based on latent representations from deployed models. Meanwhile, a generator is trained adversarially to perturb input data so as to prevent the deployed models from extracting fairness-related features. This design effectively decorrelates fairness-related/protected attributes from predictions.

• Extensive experiments demonstrate the superior performance of the proposed FAAP. In addition, evaluation on real-world commercial APIs shows the transferability of FAAP, which indicates the potential of further exploring our method in the black-box scenario.

2. Related work

This section overviews related works on unfairness mitigation, which can be roughly divided according to the targeted stage, i.e., pre-processing (data pre-processing before training), in-processing (penalty design during training), and post-processing (prediction adjustment after training).

Pre-processing methods [16, 17, 31] aim to mitigate biases in the training dataset, i.e., fairer training sets would train fairer models. Many methods have been proposed to de-bias training sets by fair data representation transformation or data distribution augmentation. Quadrianto et al. [16] used data-to-data translation to find a middle-ground representation for different gender groups in training data, so that the model will not learn a gender tendency. Ramaswamy et al. [17] generated paired training data to balance protected attributes, which removes spurious correlations between the target label and protected attributes. Zhang et al. [31] proposed to generate adversarial examples to supplement the training dataset, balancing the data distribution over different protected attributes.

In-processing approaches [1, 18, 19, 29, 30] introduce fairness principles into the training process, i.e., training models with specially designed fairness penalties/constraints or adversarial mechanisms. Zafar et al. [29] proposed to maximize accuracy under disparate impact constraints to improve fairness in machine learning. Beutel et al. [1] and Zhang et al. [30] enforced the model to produce fair outputs with adversarial training techniques, maximizing accuracy while minimizing the ability of a discriminator to predict the protected attribute. Roh et al. [18] provided a mutual information-based interpretation of an existing adversarial training-based method for improving disparate impact and equalized odds. Sarhan et al. [19] imposed orthogonality and disentanglement constraints on the representation and forced the representation to be agnostic to protected information by entropy maximization, so that the subsequent classifier can make fair predictions based on the learned representation. This line of research aims at obtaining a fairer model by explicitly changing the training procedure.
Different from this line of work, our method is applied after the training process and can improve fairness without changing the deployed model.

Post-processing works [7, 10] tend to adjust model predictions according to certain fairness criteria. Lohia et al. [10] proposed a post-processing algorithm that helps a model meet both individual and group fairness criteria on tabular data by detecting biases from model outputs and correspondingly editing protected attributes to adjust model predictions. However, this method needs to change protected attributes at test time, which is hard for computer vision applications. Kim et al. [7] proposed a method that post-processes a pre-trained deep learning model to create a new classifier with equal accuracy for people with different protected attributes. However, [7] needs to replace the final classifier, and equal sub-group accuracy cannot ensure that people in different groups have an equal chance of receiving favorable predictions, e.g., false positive rates and false negative rates may still be unequal. We borrow ideas from this line of research, but we improve fairness from the data side instead of manipulating the model or its prediction.

Figure 2. The basic idea of the proposed fairness-aware adversarial perturbation (FAAP). Gender bias exists in the left part, i.e., the false positive rate of females is much higher than that of males. Without adjusting the decision hyperplanes of the deployed model, FAAP perturbs samples to decorrelate the target label and gender in latent space. In the right part, perturbed samples tightly distribute along the gender hyperplane while preserving the distinguishability along the target label hyperplane.
3. Preliminaries

3.1. Model fairness

In this paper, we focus on visual classification models because of the extensive academic efforts devoted to them, as well as their broad industrial applications. Moreover, it is important to achieve equal treatment for people with different protected attributes, e.g., nationality, gender, and ethnicity. Therefore, demographic parity [28] and equalized odds [4] are adopted to measure model fairness.

In a binary classification task, e.g., criminal prediction, suppose the target label y ∈ Y = {−1, 1} and the protected attribute z ∈ Z = {−1, 1}, where y = 1 is the favourable class (e.g., lower criminal tendency) and z = 1 is the privileged group (e.g., Caucasian).

Definition 1 (Demographic Parity). If the value of z does not influence assigning a sample to the positive class, i.e., the model prediction satisfies (ŷ = 1) ⊥ z, then the classifier satisfies demographic parity:

$$P(\hat{y} = 1 \mid z = -1) = P(\hat{y} = 1 \mid z = 1) \qquad (1)$$

If a model satisfies demographic parity, samples in both the privileged and unprivileged groups have the same probability of being predicted as positive.

Definition 2 (Equalized Odds). If the value of z cannot influence the positive outcome for samples given y, i.e., (ŷ = 1) ⊥ z | y, then the classifier satisfies equalized odds:

$$P(\hat{y} = 1 \mid y, z = -1) = P(\hat{y} = 1 \mid y, z = 1), \quad \forall y \in \{-1, 1\} \qquad (2)$$

Equalized odds means that a positive output is statistically independent of the protected attribute given the target label. Samples in both the privileged and unprivileged groups have the same false positive rate and false negative rate.
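To make the two criteria concrete, the sketch below computes a demographic parity gap and an equalized odds gap from arrays of model predictions, target labels, and protected attributes, all encoded as ±1 as in the definitions above. The function names and the choice to aggregate the equalized odds gap with a maximum over y are illustrative assumptions, not code or conventions from the paper.

```python
import numpy as np

def demographic_parity_gap(y_pred, z):
    """|P(y_hat=1 | z=-1) - P(y_hat=1 | z=1)| for ±1-encoded predictions and groups."""
    pos = (y_pred == 1)
    return abs(pos[z == -1].mean() - pos[z == 1].mean())

def equalized_odds_gap(y_pred, y_true, z):
    """max over y in {-1, +1} of |P(y_hat=1 | y, z=-1) - P(y_hat=1 | y, z=1)|.

    Assumes every (y, z) subgroup is non-empty.
    """
    gaps = []
    for y in (-1, 1):
        sel = (y_true == y)
        p_unpriv = (y_pred[sel & (z == -1)] == 1).mean()
        p_priv = (y_pred[sel & (z == 1)] == 1).mean()
        gaps.append(abs(p_unpriv - p_priv))
    return max(gaps)
```

The y = 1 term of the equalized odds gap compares true positive rates across groups, while the y = −1 term compares false positive rates, matching Definition 2.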

3.2. Adversarial examples

Recent studies show that deep learning models are vulnerable to adversarial examples [24]. Given a classification model C(x), the goal of adversarial attacks is to find a small perturbation that generates an adversarial example x′ which misleads the classifier C. More specifically, there are two kinds of adversarial example attacks. For an input x with ground-truth label y, a targeted attack makes C(x′) = y′, where y′ ≠ y is a label specified by the attacker. On the contrary, in an untargeted attack, the attacker misleads the classification model so that C(x′) ≠ y. Typically, the lp norm of the perturbation should be less than a budget ε, i.e., ∥x − x′∥p ≤ ε. Many methods have been proposed to generate adversarial examples, such as PGD [11], CW [2], and GAN-based methods [26].
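For reference, a minimal untargeted PGD-style attack under the l∞ constraint described above could look as follows in PyTorch. This is generic background rather than part of FAAP, and the step size, iteration count, and the assumption that inputs lie in [0, 1] are illustrative choices.

```python
import torch
import torch.nn.functional as F

def pgd_untargeted(model, x, y, eps=0.03, alpha=0.01, steps=10):
    """Search for x' with ||x' - x||_inf <= eps such that model misclassifies x'."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the classification loss, then project back into the eps-ball and [0, 1].
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()
```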
4. Fairness-aware adversarial perturbation

In this paper, we propose fairness-aware adversarial perturbation (FAAP) to mitigate unfairness born with deep models. This section first overviews the proposed FAAP, then details the design of the networks and loss functions, and finally discusses the training strategy of FAAP.

4.1. Overview of FAAP

The unfairness could be caused by biases in training sets (e.g., skewed data distribution) and/or loose constraints in the training process. All of these lead to spurious correlations between target labels and protected attributes, e.g., gender and ethnicity. In a dataset, females may have many more positive samples than males. As illustrated in Fig. 2 (left), the model learns such a spurious gender correlation, so that the false positive rate of the target label varies significantly between males and females. Therefore, the key to mitigating unfairness is to break spurious correlations between target labels and protected attributes.
In this paper, we propose the fairness-aware adversarial perturbation (FAAP) to mitigate model unfairness by hiding the information of protected attributes from the feature extraction process, so that the model would not correlate predictions with protected attributes. The basic idea is to leverage adversarial perturbation to remap the original samples to positions close to the decision hyperplane of the protected attribute in the latent space (e.g., onto the gender hyperplane in the figure). Note that the distinguishability of these perturbed samples along the original target label decision hyperplanes should be preserved, as shown in Fig. 2 (right). In this way, the deployed model cannot distinguish the protected attributes from the perturbed images during feature extraction. Therefore, the protected attribute becomes uncorrelated with the target label. In other words, the model would treat samples with different protected attributes fairly.

The pipeline of FAAP is overviewed in Fig. 3, where there are two learnable components: 1) the generator, which perturbs samples to regulate their distribution in the latent space, and 2) the discriminator, which distinguishes the protected attribute. The deployed model is assumed to be a classification model that can be split into a feature extractor (i.e., from image to latent space) and a label predictor (i.e., from latent space to final label). Please note that we freeze the parameters of the deployed model. Sharing the spirit of general GANs, during the training process the discriminator is trained to distinguish the protected attribute from the representations of the model, while the generator learns to fool the discriminator, thus synthesizing fairness-aware perturbation that reduces the information of the protected attribute in the latent representations.

Figure 3. Overview of the proposed FAAP, which consists of two learnable components, i.e., a generator for learning fairness-aware perturbation and a discriminator for distinguishing the protected attribute. The deployed model is split into a feature extractor g and a label predictor f, both kept frozen.

4.2. Loss Functions

In this part, we detail the loss functions of the above-mentioned FAAP. As illustrated in Fig. 3, we assume a classification model that is divided into a feature extractor g and a label predictor f. Given an input x whose true label is y, the predicted label is ŷ = f(g(x)). The generator G produces a perturbation based on the input x to obtain the perturbed input x̂ = x + G(x) subject to ∥x̂ − x∥∞ ≤ ε, and the discriminator D is applied on the latent representation r̂ = g(x̂) to distinguish a certain protected attribute z.

Loss function of D: Intuitively, with a deployed model, the unfairness is mainly caused by the feature extraction process, which tends to correlate the protected attribute with the predicted target label, i.e., carrying distinguishable information about the protected attribute into the latent representations. Thus, the label predictor would utilize that distinguishable sensitive information to bias its final prediction. Based on this hypothesis, we first need to make the discriminator D aware of the protected attribute z in the latent representation, i.e., predicting z as well as possible. With such awareness, the generator G is able to adversarially perturb inputs towards hiding the protected attribute in the latent representation. Therefore, the discriminator loss can be expressed as

$$L_D = J(D(g(\hat{x})), z), \qquad (3)$$

where J(·, ·) denotes cross-entropy, x̂ is the perturbed input, and z indicates the true label of the protected attribute.

Loss functions of G: By contrast, the generator G aims to fool D, and an intuitive solution is to maximize L_D on perturbed samples x̂. However, this would push the latent representations towards the opposite side of the protected attribute, e.g., female flips to male. Therefore, we further let D make random guesses on the representation of x̂ by increasing the entropy of the protected attribute on perturbed samples. The fairness loss can be written as

$$L_G^{fair} = -L_D - \alpha H(D(g(\hat{x}))), \qquad (4)$$

where H(·) calculates the entropy and α > 0 is a relatively small value that controls the regularization of the entropy term. Besides L_G^fair, which encourages fairness-aware perturbation, we also need to preserve the model performance on the target label. The target label prediction loss is

$$L_G^{T} = J(f(g(\hat{x})), y). \qquad (5)$$

Above all, the total loss for the generator G in FAAP consists of L_G^fair and L_G^T, which can be summarized as

$$L_G = L_G^{fair} + \beta L_G^{T}, \qquad (6)$$

where β > 0 balances the performance of target label prediction and fairness.
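A minimal PyTorch-style sketch of Eqs. (3)–(6) is given below, assuming the deployed model is already split into a frozen feature extractor g and label predictor f, that the discriminator D outputs logits over the protected attribute, and that y and z are integer class indices. Clamping G(x) to the ε-ball and the helper names are our illustrative choices, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def entropy(logits):
    # H(D(.)): mean entropy of the discriminator's predictive distribution.
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()

def faap_losses(G, D, g, f, x, y, z, alpha=0.1, beta=1.0, eps=0.05):
    # Perturbed input, kept inside the l_inf ball of radius eps.
    x_hat = x + torch.clamp(G(x), -eps, eps)
    r_hat = g(x_hat)                        # latent representation from the frozen extractor

    d_logits = D(r_hat)
    loss_D = F.cross_entropy(d_logits, z)   # Eq. (3): D recovers the protected attribute

    loss_fair = -loss_D - alpha * entropy(d_logits)   # Eq. (4): fool D, push it towards random guesses
    loss_T = F.cross_entropy(f(r_hat), y)             # Eq. (5): preserve target label prediction
    loss_G = loss_fair + beta * loss_T                # Eq. (6)
    return loss_D, loss_G
```

During training, the two returned losses are used in alternating steps: loss_D updates only D, and loss_G updates only G (see Algorithm 1).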
4.3. Training of FAAP

Based on Eq. 3 and Eq. 6, in the training phase of FAAP the generator and the discriminator are optimized alternately. The generator G plays a min-max game with D, in which D maximizes its ability to predict the protected attribute z while G tries to minimize that ability. At the same time, G tries to let f still recognize the correct target label for the perturbed input data. Therefore, the objective of FAAP can be formulated as follows:

$$\arg\max_{G}\min_{D}\; J(D(\hat{r}), z) + \alpha H(D(\hat{r})) - \beta L_G^{T},$$
$$\text{s.t.}\quad \hat{r} = g(\hat{x}) = g(x + G(x)), \quad \|\hat{x} - x\|_{\infty} \le \epsilon, \qquad (7)$$

where D and G are updated alternately during the optimization. Please note that α is set to 0 when updating D, so that D can focus on distinguishing protected attributes. The detailed training procedure of FAAP can be found in Algorithm 1.

Algorithm 1: Training of FAAP
Input: feature extractor g and label predictor f of a deployed model, loss weights α and β, learning rates η_D and η_G, maximum iteration N, maximum perturbation magnitude ε, training images x, true labels y, and protected attribute labels z.
Output: generator G.
  Initialize the generator G and discriminator D.
  for i = 1, ..., N do
    Get a batch of n inputs x_i with labels y_i and z_i
    Get perturbed inputs x̂_i = x_i + G(x_i)
    Clip x̂_i so that ∥x̂_i − x_i∥∞ ≤ ε
    Get model features r̂_i = g(x̂_i)
    Calculate the discriminator loss L_D = (1/n) Σ_{i=1..n} J(D(r̂_i), z_i)
    Update D ← D − η_D ∇_D L_D
    Calculate the fairness loss L_G^fair = −(1/n) Σ_{i=1..n} [ J(D(r̂_i), z_i) + α H(D(r̂_i)) ]
    Calculate the target label prediction loss L_G^T = (1/n) Σ_{i=1..n} J(f(r̂_i), y_i)
    Get the total loss of G: L_G = L_G^fair + β L_G^T
    Update G ← G − η_G ∇_G L_G
  end for
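The alternating updates of Algorithm 1 could be implemented roughly as follows, reusing the hypothetical faap_losses helper sketched after Section 4.2. The use of two Adam optimizers and the recomputation of the perturbation for the generator step are our assumptions for illustration, not the authors' exact implementation.

```python
import torch

def train_faap(G, D, g, f, loader, alpha=0.1, beta=1.0, eps=0.05,
               lr_D=5e-4, lr_G=5e-4, epochs=30, device="cuda"):
    # g and f belong to the deployed model and stay frozen throughout.
    for p in list(g.parameters()) + list(f.parameters()):
        p.requires_grad_(False)
    opt_D = torch.optim.Adam(D.parameters(), lr=lr_D)
    opt_G = torch.optim.Adam(G.parameters(), lr=lr_G)

    for _ in range(epochs):
        for x, y, z in loader:
            x, y, z = x.to(device), y.to(device), z.to(device)

            # Step 1: update D to recognize the protected attribute (Eq. 3).
            loss_D, _ = faap_losses(G, D, g, f, x, y, z, alpha, beta, eps)
            opt_D.zero_grad()
            loss_D.backward()
            opt_D.step()

            # Step 2: update G to fool D while preserving the target label (Eq. 6).
            _, loss_G = faap_losses(G, D, g, f, x, y, z, alpha, beta, eps)
            opt_G.zero_grad()
            loss_G.backward()
            opt_G.step()
    return G
```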


5. Experimental Evaluation

In this section, we first describe our experimental setup (Section 5.1). Then, we quantitatively (Section 5.2) and qualitatively (Section 5.3) evaluate the proposed FAAP on different deployed models. Finally, we investigate the transferability of the adversarial perturbation generated by FAAP on real-world commercial systems (Section 5.4).

5.1. Experimental Setup

Datasets. We adopt two face datasets in our evaluation, i.e., CelebA (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) and LFW (http://vis-www.cs.umass.edu/lfw/, with attribute annotations provided in [9]), which carry commonly protected attributes such as gender. The CelebA dataset consists of 202,599 images with 40 attributes per image, and LFW has 13,244 images with 73 attributes per image. We take gender as the protected attribute to measure the fairness of model predictions for target labels. In CelebA, Smiling, Attractive, and Blond Hair are chosen as target labels. Similarly, Smiling, Wavy Hair, and Young are selected as the target labels in LFW. We randomly divide the original training set of CelebA into two equal parts for training the deployed model and our FAAP, respectively. LFW is randomly split into a 6k training set, a 3.6k validation set, and the rest as the testing set. For convenience, all images are resized to 224×224.

Training details. To investigate the effectiveness of FAAP in de-biasing models with different extents of unfairness, we train three kinds of deployed models, i.e., a normal training model, a fair training model, and an unfair training model. The normal training model is trained by simply minimizing the loss on the target label. This kind of model will learn the intrinsic bias in the training dataset, e.g., the correlation between Smiling and Male. For the fair training model, we adopt adversarial training techniques [30], which maximize the classifier's ability to predict the target label while minimizing the discriminator's ability to predict the protected attribute. This kind of model has better fairness than the normal training model. To validate our method against even more unfair models, which could result from malicious manipulations, e.g., data poisoning attacks [23] and malicious training [22], we apply two methods to amplify unfairness in deployed models. One is to flip labels (denoted as LF), e.g., randomly flipping the target labels. The other is to reverse the gradients of the discriminator in adversarial fair training (denoted as RG). These manipulations strengthen the spurious correlation between target labels and gender.

For all deployed models, we use ResNet-18 [5] as the base architecture. We train all of these models for 30 epochs with a batch size of 64 using the Adam optimizer with a learning rate of 5e-4. Once the training is finished, we fix the parameters of the deployed models. The generator G in FAAP has a similar architecture to [26]. The discriminator D is connected to the last convolution layer of the feature extractor. To mitigate unfairness without harming the visual quality of a specific image, we set the maximum perturbation magnitude ε to 0.05.
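For concreteness, one way to obtain the frozen feature extractor g and label predictor f from a deployed torchvision ResNet-18 is sketched below. The checkpoint path, the exact split point (the paper attaches the discriminator to the last convolution layer, whereas this sketch works on the pooled 512-dimensional feature for simplicity), and the discriminator architecture are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Load a deployed binary classifier (hypothetical checkpoint path).
deployed = resnet18(num_classes=2)
deployed.load_state_dict(torch.load("deployed_model.pt"))
deployed.eval()

# g: all layers up to and including global average pooling; f: the final linear layer.
g = nn.Sequential(*list(deployed.children())[:-1], nn.Flatten())
f = deployed.fc
for p in deployed.parameters():
    p.requires_grad_(False)

# A small discriminator over the 512-d latent representation of ResNet-18.
D = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 2))
```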
Evaluation metrics. For fairness evaluation, we use the difference in demographic parity (DP) and the difference in equalized odds (DEO). Meanwhile, the accuracy (ACC) of predicting target labels is also reported. DP calculates the absolute difference between the acceptance rates of the two genders. A larger DP indicates that samples in the privileged group have a higher chance of being predicted as positive than those in the unprivileged group. Ideally, DP is equal to zero. By contrast, DEO measures the absolute differences between the false negative rates and between the false positive rates of the two genders. A larger DEO means that samples in the privileged group have higher false positive rates and/or lower false negative rates than those in the unprivileged group. Therefore, the lower the DEO, the better.
5.2. Quantitative Evaluation

Tables 1(a) to 1(c) show quantitative results of deployed models before and after being embedded with the proposed FAAP on CelebA. We evaluate three different target labels, namely Smiling, Attractive, and Blond Hair, with the protected attribute Male ("+1" in Male means male and "-1" means female). Besides, we use three different kinds of models for each target label. As shown in Table 1, there exists gender bias in normal training models, e.g., DP and DEO are larger than 0.5 when the target label is Attractive. Fair training obtains a fairer model by incorporating adversarial fairness techniques into the training procedure. For instance, as shown in Table 1(b), fair training models have much lower DP (reduced from 0.5023 to 0.2745) and DEO (reduced from 0.5683 to 0.0724) than normal training models, with a small drop in ACC (79.56% compared to 82.43%). In contrast, unfair training amplifies gender bias, and these models (LF and RG) show much more unfairness. For example, as shown in Table 1(a), DP and DEO increase to about 0.25 with relatively high ACC (91.48% and 91.76%, compared to 92.61% for the normal training model).

We evaluate our method FAAP on the above deployed models. Not surprisingly, FAAP can improve fairness and maintain target label prediction accuracy for a deployed model. From Tables 1(a) to 1(c), we have the following observations. (1) Normal training model. For a normal training model, FAAP improves fairness while keeping target label accuracy: in Table 1(b), our method improves DP and DEO by 0.2319 and 0.5062, respectively, with an accuracy loss of less than 0.03. (2) Fair training model. When adversarial fair training techniques are applied in the model training phase, our method can further improve the fairness of these models with a slight accuracy drop, e.g., in Table 1(c), FAAP still improves fairness (0.0083 and 0.0544 reduction in DP and DEO, respectively) with slight accuracy degradation (from 94.41% to 94.05%). (3) Unfair training model. For an unfair training model, FAAP can significantly improve fairness with slight accuracy degradation. For instance, in Table 1(a), FAAP decreases DEO to about 0.04 while maintaining ACC above 91%. (4) Comparison between Normal training+FAAP and Fair training. It is better to take model fairness into consideration during the training phase. However, Table 1 shows that a deployed normal training model embedded with FAAP achieves fairness comparable to a fair training model (e.g., FAAP has even better DP and DEO in some cases) with almost the same accuracy (i.e., the difference in ACC is less than 0.3% in most cases). For a deployed model, our method works after the training process without changing the model, in contrast to fair training, which needs to retrain or fine-tune the model. Similar observations can be made in Tables 1(d) to 1(f) on the LFW dataset.
Table 1. Results of deployed models before and after being embedded with the proposed FAAP on CelebA (Tables 1(a) to 1(c)) and LFW (Tables 1(d) to 1(f)). For the fairness criteria DP and DEO, lower is fairer. For accuracy (ACC), higher is better.

(a) CelebA, target label Smiling
Model                        ACC ↑    DP ↓     DEO ↓
Normal training              92.61%   0.1748   0.0774
Normal training+FAAP         92.46%   0.1426   0.0327
Fair training                92.55%   0.1275   0.0308
Fair training+FAAP           92.49%   0.1326   0.0281
Unfair training (LF)         91.48%   0.2638   0.2737
Unfair training (LF)+FAAP    91.87%   0.1268   0.0381
Unfair training (RG)         91.76%   0.2439   0.2306
Unfair training (RG)+FAAP    91.78%   0.1321   0.0369

(b) CelebA, target label Attractive
Model                        ACC ↑    DP ↓     DEO ↓
Normal training              82.43%   0.5023   0.5683
Normal training+FAAP         79.73%   0.2704   0.0621
Fair training                79.56%   0.2745   0.0724
Fair training+FAAP           79.31%   0.2244   0.0434
Unfair training (LF)         81.06%   0.5566   0.7752
Unfair training (LF)+FAAP    79.08%   0.2890   0.1179
Unfair training (RG)         82.24%   0.5547   0.7217
Unfair training (RG)+FAAP    79.37%   0.2550   0.0539

(c) CelebA, target label Blond Hair
Model                        ACC ↑    DP ↓     DEO ↓
Normal training              95.63%   0.1787   0.5299
Normal training+FAAP         94.52%   0.1345   0.1013
Fair training                94.41%   0.1319   0.1587
Fair training+FAAP           94.05%   0.1236   0.1043
Unfair training (LF)         95.41%   0.1733   0.6728
Unfair training (LF)+FAAP    94.49%   0.1449   0.1321
Unfair training (RG)         95.66%   0.2041   0.6200
Unfair training (RG)+FAAP    94.26%   0.1305   0.1209

(d) LFW, target label Smiling
Model                        ACC ↑    DP ↓     DEO ↓
Normal training              90.42%   0.3353   0.1472
Normal training+FAAP         89.80%   0.2910   0.0534
Fair training                90.08%   0.2704   0.0318
Fair training+FAAP           88.75%   0.2646   0.0136
Unfair training (LF)         89.23%   0.3678   0.2340
Unfair training (LF)+FAAP    88.10%   0.3026   0.1076
Unfair training (RG)         90.14%   0.3674   0.2257
Unfair training (RG)+FAAP    89.15%   0.2969   0.0782

(e) LFW, target label Wavy Hair
Model                        ACC ↑    DP ↓     DEO ↓
Normal training              78.69%   0.1707   0.1554
Normal training+FAAP         78.04%   0.1241   0.0651
Fair training                77.98%   0.1337   0.0800
Fair training+FAAP           77.67%   0.1094   0.0595
Unfair training (LF)         78.35%   0.2383   0.2919
Unfair training (LF)+FAAP    77.19%   0.1765   0.1734
Unfair training (RG)         77.59%   0.2724   0.3692
Unfair training (RG)+FAAP    77.10%   0.2128   0.2508

(f) LFW, target label Young
Model                        ACC ↑    DP ↓     DEO ↓
Normal training              83.81%   0.3511   0.5516
Normal training+FAAP         81.34%   0.2281   0.2914
Fair training                83.86%   0.2500   0.2870
Fair training+FAAP           80.71%   0.1515   0.1141
Unfair training (LF)         83.04%   0.4813   0.8196
Unfair training (LF)+FAAP    80.40%   0.2550   0.3786
Unfair training (RG)         83.72%   0.5002   0.8377
Unfair training (RG)+FAAP    82.30%   0.1970   0.3048

5.3. Qualitative Evaluation

In this part, we further provide results from the model explanation approaches Grad-CAM [20] and t-SNE [25] to better illustrate the effectiveness of our method.

Figure 4. Grad-CAM results for three kinds of models when the target label is Smiling in CelebA: (a) normal training model, (b) fair training model, (c) unfair training model (LF), (d) unfair training model (RG). "original" denotes raw data, "only_T" denotes images perturbed by G optimized only on L_G^T without L_G^fair, and "ours" denotes images perturbed with the fairness-aware adversarial perturbation generated by G optimized on L_G. (Better viewed in color.)

Grad-CAM is a model explanation method that visualizes the regions of the input that are important for predictions [20]. In Fig. 4, we visualize a subset of test images that were originally falsely predicted by the deployed model but are successfully recognized after perturbation. For each deployed model, we provide explanations on raw data, on images perturbed by G trained on L_G^T without L_G^fair, and on images perturbed by G optimized on L_G. (1) Normal training model. As shown in Fig. 4(a), for a normal training model, our adversarial perturbation helps the model focus on the right area (the mouth) and make correct predictions. The red area of the "only_T" images deviates little from the mouth. (2) Fair training model. Since this kind of model has less gender bias than the other models, as shown in Fig. 4(b), G optimized only towards improving target label accuracy yields heat-maps similar to "ours". Both of them help the deployed model focus on the right area. (3) Unfair training model. Unfair training models have a stronger gender tendency, so the perturbation in "only_T" lets the model make correct predictions but misleads it to focus on unrelated areas (e.g., the eyes in Fig. 4(c) and the hair in Fig. 4(d)). In contrast, our method helps the model focus on the right area and make the right predictions.
t-SNE is a method for visualizing high-dimensional data in a low-dimensional view. To better demonstrate that our method hides sensitive information by remapping images close to the protected attribute decision hyperplane while maintaining their distance to the target label decision hyperplane in the latent space of the deployed model, we utilize t-SNE to obtain low-dimensional embeddings of the feature representations. More specifically, we extract feature vectors of these images with and without adversarial perturbation and visualize them in a two-dimensional diagram with t-SNE. (1) Normal training model. From Fig. 5(a) and Fig. 5(b) we can see that, for a normal training model, samples with different target labels for smiling and attractive classification are linearly separable in latent space, while samples of different genders are mixed both before and after perturbation. In Fig. 5(c), even though the feature representations of samples (yellow and purple points) in the normal training model are linearly separated by the protected attribute hyperplane when the target label is Blond Hair, FAAP can still effectively hide such sensitive information in the latent feature space. (2) Fair training model. Adversarial fair training can improve fairness; however, it may still slightly separate samples with different protected attributes (as shown in column (2) of Fig. 5(a) and Fig. 5(b)). In such a situation, our FAAP brings these samples closer together. (3) Unfair training model. In unfair training models, feature representations of original images with different protected attributes are almost linearly separable along the protected attribute hyperplane ((male, -1) versus (female, -1), and (male, +1) versus (female, +1)). Once perturbed with the adversarial perturbation generated by FAAP, samples of different genders become almost indistinguishable and mixed, yet they remain well separable along the target label hyperplane.

Figure 5. t-SNE results for three kinds of models on Smiling, Attractive, and Blond Hair in CelebA. The upper row shows the results on raw data and the bottom row shows the results on images perturbed with FAAP. In each sub-figure, the feature representations in column (1) are extracted from a normal training model, column (2) from a fair training model, column (3) from a LF model, and column (4) from a RG model. Colors denote the combination of gender and target label: (male, -1), (male, +1), (female, -1), (female, +1). (Better viewed in color.)

5.4. Transferability of FAAP

To demonstrate the transferability of the adversarial perturbation generated by FAAP, we evaluate it on commercial face analysis APIs. At first, we investigate the fairness of these APIs in predicting "smiling". We upload the testing set (about 20k images) from the CelebA dataset to today's commercial APIs, including Alibaba (https://www.aliyun.com/) and Baidu (https://ai.baidu.com/). Alibaba's face analysis API returns binary results in which "0" means "not smiling" and "1" means "smiling". Baidu's face analysis API returns three categories named "none", "smile", and "laugh"; we assume "none" means not smiling and the others mean smiling. We find that these APIs exhibit some extent of unfairness, i.e., their DEO is about 0.1. Since this is a totally black-box scenario, we know nothing about the models behind these APIs. We therefore train the generator with model ensemble techniques, taking the normal training model and the fair training model from Section 5.2 as surrogate models. Then we upload the perturbed images to these APIs and record the results.
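The paper does not spell out the ensemble objective used for these surrogate models. One plausible reading, sketched below purely as an assumption, is to train the single generator against per-surrogate discriminators by averaging the generator loss over the surrogate models, reusing the hypothetical faap_losses helper from Section 4.2.

```python
def ensemble_generator_loss(G, surrogates, x, y, z, alpha=0.1, beta=1.0, eps=0.05):
    """surrogates: list of (g_i, f_i, D_i) tuples, one per surrogate deployed model."""
    total = 0.0
    for g_i, f_i, D_i in surrogates:
        _, loss_G = faap_losses(G, D_i, g_i, f_i, x, y, z, alpha, beta, eps)
        total = total + loss_G
    return total / len(surrogates)
```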
Table 2 shows the results of these face analysis APIs on original and perturbed images. From Table 2(a), we can see that FAAP improves DP by 0.0293 and decreases DEO to 0.0368 with only a 0.0026 degradation in accuracy. Likewise, Table 2(b) shows improvements of 0.0411 and 0.0648 in DP and DEO with a 0.0289 degradation in accuracy for Baidu. These results show the transferability of FAAP and its potential usage in black-box scenarios.

Table 2. Performance on commercial face analysis APIs.

(a) Results on Alibaba's face analysis API
                      ACC ↑    DP ↓     DEO ↓
original images       90.20%   0.1768   0.0952
after perturbation    89.94%   0.1475   0.0368

(b) Results on Baidu's face analysis API
                      ACC ↑    DP ↓     DEO ↓
original images       90.47%   0.1817   0.1035
after perturbation    87.58%   0.1406   0.0387

6. Conclusion

This paper introduced the Fairness-Aware Adversarial Perturbation (FAAP) to mitigate unfairness in deployed models. More specifically, FAAP learns to perturb inputs, instead of changing the deployed models as in state-of-the-art works, to prevent deployed models from recognizing fairness-related features. To achieve this, we employed a discriminator to distinguish fairness-related attributes from the latent representations of deployed models. Meanwhile, a generator was trained adversarially to deceive the discriminator, thus synthesizing fairness-aware perturbation that can hide the information of protected attributes. Extensive experiments demonstrated that FAAP can effectively mitigate unfairness, e.g., improving DP and DEO by 27.5% and 66.1%, respectively, with only 1.5% accuracy degradation on average for normal training models.

In addition, evaluation on real-world commercial APIs showed significant improvements of 19.5% and 61.9% in DP and DEO with less than 1.7% degradation in accuracy, which indicates the potential usage of the proposed FAAP in the black-box scenario. However, the black-box exploration is a side product of our current design. Since we assume access to the deployed models, even though we do not modify them, it is still impractical for certain real cases such as those commercial APIs. Therefore, we are considering future work on the black-box setting with more specific designs.
References

[1] Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H. Chi. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075, 2017.
[2] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
[3] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[4] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), pages 3323–3331, 2016.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[6] Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. How we analyzed the COMPAS recidivism algorithm. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm, 2016. Online, accessed June 18, 2021.
[7] Michael P. Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 247–254, 2019.
[8] Yi Li and Nuno Vasconcelos. REPAIR: Removing representation bias by dataset resampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9572–9581, 2019.
[9] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3730–3738, 2015.
[10] Pranay K. Lohia, Karthikeyan Natesan Ramamurthy, Manish Bhide, Diptikalyan Saha, Kush R. Varshney, and Ruchir Puri. Bias mitigation post-processing for individual and group fairness. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2847–2851, 2019.
[11] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[12] David Meyer. Amazon reportedly killed an AI recruitment system because it couldn't stop the tool from discriminating against women. https://fortune.com/2018/10/10/amazon-ai-recruitment-bias-women-sexist/, 2018. Online, accessed June 1, 2021.
[13] Beau Norgeot, Benjamin S. Glicksberg, and Atul J. Butte. A call for deep-learning healthcare. Nature Medicine, 25(1):14–15, January 2019.
[14] Alejandro Peña, Ignacio Serna, Aythami Morales, and Julian Fierrez. Bias in multimodal AI: Testbed for fair automatic recruitment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 129–137, 2020.
[15] Chuan Qin, Hengshu Zhu, Tong Xu, Chen Zhu, Liang Jiang, Enhong Chen, and Hui Xiong. Enhancing person-job fit for talent recruitment: An ability-aware neural network approach. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 25–34, 2018.
[16] Novi Quadrianto, Viktoriia Sharmanska, and Oliver Thomas. Discovering fair representations in the data domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[17] Vikram V. Ramaswamy, Sunnie S. Y. Kim, and Olga Russakovsky. Fair attribute classification through latent space de-biasing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[18] Yuji Roh, Kangwook Lee, Steven Whang, and Changho Suh. FR-Train: A mutual information-based approach to fair and robust training. In International Conference on Machine Learning (ICML), pages 8147–8157, 2020.
[19] Mhd Hasan Sarhan, Nassir Navab, Abouzar Eslami, and Shadi Albarqouni. Fairness by learning orthogonal disentangled representations. In European Conference on Computer Vision (ECCV), pages 746–761, 2020.
[20] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
[21] Yichun Shi, Xiang Yu, Kihyuk Sohn, Manmohan Chandraker, and Anil K. Jain. Towards universal representation learning for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[22] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 180–186. Association for Computing Machinery, 2020.
[23] David Solans, Battista Biggio, and Carlos Castillo. Poisoning attacks on algorithmic fairness. In Frank Hutter, Kristian Kersting, Jefrey Lijffijt, and Isabel Valera, editors, Machine Learning and Knowledge Discovery in Databases, pages 162–177, 2021.
[24] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[25] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
[26] Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks. arXiv preprint arXiv:1801.02610, 2018.
[27] Depeng Xu, Shuhan Yuan, Lu Zhang, and Xintao Wu. FairGAN: Fairness-aware generative adversarial networks. In 2018 IEEE International Conference on Big Data (Big Data), pages 570–575, 2018.
[28] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web (WWW), pages 1171–1180, 2017.
[29] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rogriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pages 962–970, 2017.
[30] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340, 2018.
[31] Yi Zhang and Jitao Sang. Towards accuracy-fairness paradox: Adversarial example-based data augmentation for visual debiasing. In Proceedings of the 28th ACM International Conference on Multimedia (MM), pages 4346–4354, 2020.
