Challenges in Algorithmic Fairness When Using Multi Party Computation Models
1 Introduction
Motivation for this position paper Both the use of Secure Multi-Party Computation (MPC) - or, more broadly, Privacy-Enhancing Technologies (PETs) - and techniques for Algorithmic Fairness (in short, fairness) are important and emerging topics in the research area of Responsible AI. A new paradigm of obtaining insights from data without sharing that data is being researched and deployed. At the same time, awareness of the need for fair models is growing.
In general, MPC and fairness pursue similar goals: an ethical way of working with data. Both research areas contribute to the seven key requirements for trustworthy AI set by the European Commission's High-Level Expert Group on AI [11], especially privacy and data governance, as well as diversity, non-discrimination and fairness. However, when we zoom in, the concepts of fairness and privacy can be contradictory.
First of all, measuring fairness can cause privacy issues. For example, to be able to assess a model's fairness with respect to ethnicity, one needs data on ethnic background. This is very sensitive data that needs to be protected. An overview of techniques to measure fairness without the use of such sensitive features is given by Ashurst and Weller [2]. Note that MPC is actually mentioned as a solution there. However, this is out of scope for this article.
On the other hand, when the privacy of a model's input is protected by using MPC, the model is not thereby protected from being unfair. Calvi et al. [6] have recently started the debate on the potential 'unfair side of PETs'. However, their article does not address the challenge of measuring fairness in a setting where the input data is protected, which is what we do in this paper.
In Sections 1.1 and 1.2 we first give short introductions to MPC and fairness. In Section 2 we reflect on some existing fairness strategies, mainly in federated learning settings. In Section 3 we describe three challenges one could run into in practice when assessing fairness in a multi-party setting. Finally, in Section 4 we conclude and discuss potential avenues for future research.
A possible application in the near future might be to analyse financial risks with a group of banks [20]. We will use a specific example of this throughout this article.
Fig. 1 (sub-captions): (c) The result is a global model that each party can query for credit risk scores. (d) Parties get a high- or low-risk classification from the model.
multiple parties. In Sections 2 and 3 we will go into this, but first we will introduce the concept of fairness in general.
Fig. 2: Two possible metrics for fairness. The model classifies each individual as high or low risk; the classifications are put in one figure for fairness assessment. Demographic parity is met in (a), where men and women are classified as high risk in equal ratios. False discovery rate parity is met in (b), where the relative number of false positives is equal for both subgroups.
Nevertheless, statistical parity does not always amount to equal treatment for all groups. There might be an equal proportion of people classified as high-risk defaulters for both genders, but if the number of false positives (people incorrectly classified as defaulters) is much higher for men, one can debate how fair the system is. A vast number of alternative fairness metrics have been proposed in research, such as 'false positive parity' (false discovery rate parity) to fit this case [18, 25]. This paper will not discuss them all, but overviews can be found in e.g. [24, 21, 18, 25]. The example shows that metrics can contradict each other and even be at a trade-off [5]. Saleiro et al. and Ruf and Detyniecki provide practical ways to navigate the different options, in the form of the Aequitas Fairness Tree and the Fairness Compass respectively [25, 17, 24]. Both tools help to select which metric fits which model outcome, with questions such as whether the 'intervention is punitive or assistive', whether you want it to fit a certain representation policy, and who you think is most vulnerable in the situation. This illustrates that the intended deployment of the model determines the right fairness metric, and it highlights the importance of clearly defining the outcome in order to choose a suitable fairness metric.
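To make the two metrics from Figure 2 concrete, the following minimal sketch computes, per gender group, the selection rate used for demographic parity and the false discovery rate used for FDR parity. The records and the helper group_rates are hypothetical illustrations, not the data behind the figure.

```python
# Minimal sketch (hypothetical records): per-group selection rate
# (demographic parity) and false discovery rate (FDR parity).
from collections import defaultdict

def group_rates(records):
    """records: iterable of (group, y_true, y_pred), with 1 = high risk."""
    stats = defaultdict(lambda: {"n": 0, "pred_pos": 0, "false_pos": 0})
    for group, y_true, y_pred in records:
        s = stats[group]
        s["n"] += 1
        if y_pred == 1:
            s["pred_pos"] += 1
            if y_true == 0:
                s["false_pos"] += 1
    return {
        group: {
            # demographic parity compares P(pred = high risk | group)
            "selection_rate": s["pred_pos"] / s["n"],
            # FDR parity compares FP / (FP + TP) per group
            "fdr": s["false_pos"] / s["pred_pos"] if s["pred_pos"] else float("nan"),
        }
        for group, s in stats.items()
    }

# hypothetical individuals: (gender, y_true, y_pred)
records = [
    ("m", 1, 1), ("m", 0, 1), ("m", 1, 1), ("m", 0, 1), ("m", 0, 0), ("m", 1, 0),
    ("f", 1, 1), ("f", 0, 1), ("f", 1, 1), ("f", 0, 1), ("f", 0, 0), ("f", 0, 0),
]
print(group_rates(records))  # equal selection rates and equal FDRs here
```

In this toy dataset both parities happen to hold; in practice the two rates are computed in the same way but can easily diverge, which is what drives the metric choice discussed above.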
An important final note, however, is that fairness is not just a metric, nor is it something static [7, 9]. A fairness assessment only holds for the time, data, model, situation and usage as defined. For example, the predictor-prediction relationship of the model can change, or the data distribution can change when retraining with newly acquired data. Moreover, using a model for a different goal than it was designed for - such as applying a group risk assessment to an individual - increases the risk of flawed predictions. These examples illustrate that fair AI also includes an investigation and documentation of the goals, (ground truth) data, usage (environment), time, policy, regulations, etc. It is essential that fairness is continuously monitored and evaluated throughout the whole AI life cycle.
On top of that, one should realise that bias cannot be completely mitigated, so fairness is also a matter of deciding what is important. This starts with agreements on the goals and usage of a system and requires well-informed, responsible and justifiable decisions. These agreements are necessary to measure the relevant bias risks, as well as to prevent unfairness through misuse of the model. Nonetheless, in an MPC setting it can be difficult to precisely define and control the important factors in such agreements with other parties. This results in the challenges identified in this paper, which concern both the metric definition and the broader context of fairness.
i Each party can locally apply reweighing. This is efficient and fully preserves the privacy of the parties, but lacks a global view of the weights that should be assigned to obtain a globally fair dataset.
ii If the parties are willing to share sensitive attributes and sampling counts with noise, they can use differential privacy to communicate the local weights with each other, at the cost of some additional communication and information leakage for global reweighing (a sketch of both options follows below).

1 Note that there is also a different definition of fairness in MPC protocols that is not the topic of this paper; namely, fairness in MPC can refer to a security model in MPC protocols where either all the parties receive the outcome, or none of them do.
While these are conceptually simple and effective ways to mitigate bias for virtually any model and PET, they only apply to the training data. Therefore, potential fairness issues during the inference phase, when the model is actually used, cannot be prevented by pre-processing alone. Furthermore, collaborative reweighing could prove difficult in an MPC protocol where nothing should leak during the computation. With federated learning, leaking (some) statistical information is more common, since the weights of the local models are allowed to be shared as well. Pessach et al. [23] recently proposed the first privacy-preserving pre-processing mechanism in an MPC setting, in which the distances between the attribute distributions of two groups are decreased on federated data.
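As a rough illustration of the reweighing options i and ii above, the following sketch computes the standard reweighing weights w(a, y) = P(a)P(y)/P(a, y) from local counts (option i) and from Laplace-noised counts aggregated over parties (option ii). The helpers reweighing_weights and noisy_counts are illustrative assumptions, not the protocols of the cited works.

```python
# Sketch of local (i) and noisy global (ii) reweighing; not the method of [23].
import random
from collections import Counter

def reweighing_weights(counts, total):
    """counts: Counter over (sensitive_value, label) pairs; returns w(a, y)."""
    p_a, p_y = Counter(), Counter()
    for (a, y), c in counts.items():
        p_a[a] += c / total
        p_y[y] += c / total
    return {(a, y): (p_a[a] * p_y[y]) / (c / total)
            for (a, y), c in counts.items() if c > 0}

def noisy_counts(counts, epsilon=1.0):
    """Option ii: Laplace-noised counts a party could share (hypothetical)."""
    # difference of two exponentials gives Laplace noise with scale 1/epsilon
    lap = lambda scale: random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return Counter({k: max(c + lap(1 / epsilon), 1e-9) for k, c in counts.items()})

# option i: purely local weights for one party
local = Counter({("f", 1): 10, ("f", 0): 40, ("m", 1): 30, ("m", 0): 20})
print(reweighing_weights(local, sum(local.values())))

# option ii: aggregate noisy counts from several parties into global weights
parties = [local, Counter({("f", 1): 5, ("f", 0): 10, ("m", 1): 25, ("m", 0): 60})]
global_counts = sum((noisy_counts(c) for c in parties), Counter())
print(reweighing_weights(global_counts, sum(global_counts.values())))
```

The trade-off from the text is visible here: option i never communicates anything but can only correct the local distribution, while option ii leaks (noisy) counts in exchange for weights that reflect the joint data.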
In-processing Instead of altering the training data, in-processing techniques aim to make the model fair during the training phase. In this strategy, a certain fairness objective that can be optimized for is added to the training process. One prominent example in regular centralized machine learning is the prejudice remover. Intuitively, this adds a fairness term to the loss function of the training procedure, such that the loss penalizes models that overfit on and are biased towards a certain sensitive feature. Again, Abay et al. [1] proposed a way to perform prejudice removal in federated learning. In the straightforward approach, each party simply uses the prejudice remover during its local training step, after which the aggregation step remains untouched. Similar strategies have been proposed by, for example, [26, 27, 10, 13]. This is also known as local debiasing. While local debiasing is easy to apply in a federated learning setting, it can be hard to tune the parameters without leaking sensitive information, such that the model remains accurate enough while mitigating unwanted bias. It is expected that similar trade-offs will be observed with other in-processing methods. Furthermore, it is not yet clear how similar strategies can be applied to other MPC protocols. The local debiasing approach was extended by Ezzeldin et al. in 2023 [12]: conceptually, they additionally let the parties assess the fairness of the global model on their local datasets and update their aggregation weight accordingly. Intuitively, parties that are more in line with global fairness will get a higher weight during the next aggregation round.
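A minimal sketch of what local debiasing could look like at a single party, assuming a logistic-regression model and a demographic-parity penalty added to the local loss; the function local_train and the weight eta are illustrative assumptions, not the exact regularizer of [1] or the aggregation scheme of [12].

```python
# Sketch: local training with loss = BCE + eta * (score gap between groups)^2.
import numpy as np

def local_train(X, y, s, eta=1.0, lr=0.1, steps=500):
    """X: features, y: labels (0/1), s: binary sensitive attribute (0/1)."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))           # predicted probabilities
        grad_bce = X.T @ (p - y) / len(y)           # gradient of the BCE loss
        gap = p[s == 1].mean() - p[s == 0].mean()   # demographic-parity gap
        dp = p * (1 - p)                            # derivative of the sigmoid
        grad_gap = (X[s == 1] * dp[s == 1][:, None]).mean(axis=0) \
                 - (X[s == 0] * dp[s == 0][:, None]).mean(axis=0)
        w -= lr * (grad_bce + eta * 2 * gap * grad_gap)  # penalised update
    return w

# hypothetical local data for one party; in federated learning the returned
# weights would be sent to the usual aggregation step unchanged.
X = np.random.default_rng(1).normal(size=(200, 3))
s = (X[:, 0] > 0).astype(int)
y = ((X[:, 1] + 0.5 * X[:, 0]) > 0).astype(int)
print(local_train(X, y, s, eta=2.0))
```

Tuning eta is exactly the difficulty mentioned above: a large value suppresses the gap but hurts accuracy, and choosing it collaboratively may require sharing information about the local data.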
Post-processing Perhaps the most prominent example of post-processing is the landmark paper on equality of opportunity by Hardt, Price and Srebro [16], where a model is first trained using a regular training process, after which the model is adjusted to be fair by analysing ROC curves. With this strategy the training phase remains untouched, and thus it is likely to be supported in more PET settings than the in-processing techniques. However, access to the predicted label, sensitive attribute and target label is assumed, which might not always be the case in a PET setting. A post-processing solution specifically tailored to federated learning was proposed in 2021 by Luo et al. [19]. They let each party share statistics about their dataset with a central server, which can then compute the global distributions. After that, virtual data points are sampled from these distributions and used to adjust the model. This will be difficult to achieve with MPC, as parties would need to sample from secret weight distributions.
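The following sketch illustrates the flavour of threshold-style post-processing under the assumption that scores, target labels and the sensitive attribute are available in the clear; it is a simplification for illustration, not the ROC-based construction of [16] or the calibration method of [19]. The helper per_group_thresholds is a hypothetical name.

```python
# Sketch: pick a per-group score threshold so that each group's true positive
# rate lands close to a common target (a simplified post-processing step).
import numpy as np

def per_group_thresholds(scores, y_true, groups, target_tpr=0.8):
    thresholds = {}
    for g in np.unique(groups):
        mask = (groups == g) & (y_true == 1)
        candidates = np.sort(scores[mask])      # candidate thresholds
        best, best_gap = 0.5, float("inf")
        for t in candidates:
            tpr = (scores[mask] >= t).mean()    # TPR of this group at threshold t
            if abs(tpr - target_tpr) < best_gap:
                best, best_gap = t, abs(tpr - target_tpr)
        thresholds[g] = best
    return thresholds

rng = np.random.default_rng(0)
scores = rng.uniform(size=300)                              # model scores
y_true = (scores + rng.normal(scale=0.3, size=300) > 0.6).astype(int)
groups = rng.integers(0, 2, size=300)                       # sensitive attribute
print(per_group_thresholds(scores, y_true, groups))
```

Precisely the three inputs of this sketch (scores, target labels, sensitive attribute) are what a PET setting may refuse to reveal, which is why post-processing is not automatically available there.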
Is the source of the issue the local data distribution, the model, or the metric? In order to investigate the root of the problem and its consequences for others, the whole setting and assessment would have to be disclosed. The next problem is then figuring out possible technical adjustments to the model that solve the fairness issue. Modifying the model might be the solution for one party, yet create new problems for others. How does one return to the drawing board with all the necessary information, representing each party's best interests?
Assessing a fairness metric at once for all the data is technically possible as a secure computation. However, one is then unable to answer the crucial questions for choosing a fairness metric, as illustrated earlier. Take the example with different outcomes for the same MPC output. For the assistive outcome, you will want to check the group rates for correct low-risk classification (true positives): men and women should be given the extra credit at roughly equal rates. For the punitive outcome, it is more important to look at the false positives in the high-risk class (the false discovery rate from Figure 2b): here, neither men nor women should be wrongly denied a loan more often than the other. The example shows how different fairness metrics would apply. Such disagreements on the definition of fairness can also exist outside this undesirable setting. For instance, one of the banks may have the ambition to reach a 50/50 division of issued loans in terms of gender, which would demand a new perspective on which fairness metric fits best. Summarizing, because fairness metrics are frequently incompatible, there is no 'general' fairness for all parties if these parties do not use the model with the same intention.
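The tension can also be shown numerically: the same set of predictions can satisfy one of these parities and violate the other. The sketch below uses hypothetical records in which the true positive rate (relevant for the assistive outcome) is equal for both groups while the false discovery rate (relevant for the punitive outcome) is not.

```python
# Sketch (hypothetical records): equal TPR for both groups, unequal FDR.
def tpr_and_fdr(records):
    """records: iterable of (group, y_true, y_pred); returns per-group (TPR, FDR)."""
    out = {}
    for g in {r[0] for r in records}:
        rows = [r for r in records if r[0] == g]
        tp = sum(1 for _, yt, yp in rows if yt == 1 and yp == 1)
        fp = sum(1 for _, yt, yp in rows if yt == 0 and yp == 1)
        fn = sum(1 for _, yt, yp in rows if yt == 1 and yp == 0)
        out[g] = (tp / (tp + fn), fp / (fp + tp))
    return out

records = (
    [("m", 1, 1)] * 2 + [("m", 1, 0)] * 2 + [("m", 0, 1)] * 2 + [("m", 0, 0)] * 2 +
    [("f", 1, 1)] * 1 + [("f", 1, 0)] * 1 + [("f", 0, 1)] * 4 + [("f", 0, 0)] * 2
)
print(tpr_and_fdr(records))  # TPR 0.5 for both groups, FDR 0.5 vs 0.8
```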
Even in a setting where the banks agree on one way to use the model, one outcome and one fairness metric, some questions remain unanswered. Firstly, a general outcome might not hold for each individual party, because fairness is relative to the data distribution. The otherwise equal setting would make it easier to discover that the distribution is at the root of this problem, but it does not solve it. It may even be undesirable to use a general measurement for the outcome over a specific subgroup (covered in Challenge 2). Secondly, there is a matter of ownership: who is responsible for the fairness assessment? Can one or each bank be part of that, or should an external party be involved as an overseeing eye? The first may be difficult to choose; the second is not ideal in a PET scenario where no one is supposed to oversee all data. Note that while this section is illustrated with an example from the financial sector, the challenge is much broader and will also occur in other settings such as the medical domain. For example, a similar challenge occurs in an MPC setting where a model is used to prioritize patients at general practitioners, and one GP uses a positive advice of the model to prioritize patients while another GP uses a negative advice of the model to further delay seeing a patient.
in the collaborative computation. The issues for the banks illustrate how a collaboration in an MPC setting makes it hard to assess fairness for each party individually, but also as a collective. An MPC setting requires more elaborate agreements. In order to perform a general fairness assessment, the different parties should be aligned to the point of using the model in the exact same manner, for the exact same outcome and with the same intentions and policies. It is a challenge to reach such an alignment, including responsibilities, and one can wonder whether it is desirable if data distributions differ per party. Measuring fairness individually would require full disclosure of goals and usage, but also of the data distribution, in order to meaningfully investigate a case of unfairness. This conflicts with the objective of MPC. Further research should look for the right way to reach such an agreement, considering all these factors.
As explained in Section 1.2 and the first challenge, choosing a fairness metric is an essential first step. But even when a fairness metric is chosen, the way fairness is measured on distributed data is not straightforward. One practical challenge that the multiple parties can run into is the difference between global and local fairness. These concepts have been described mainly in the context of federated learning, for example in [15] and [26]. In the context of MPC, local fairness means that fairness is measured over the dataset of one single party, while global fairness is measured over the entire dataset. We will now illustrate that having one of the two does not automatically imply the other.
Global but no local fairness Suppose the three banks in the example in Section 1.1 have agreed upon using the metric false discovery rate (FDR) parity for men and women. In Figure 2b, we have seen an example where 8 persons in total are classified as high risk. We see that the FDR is equal for men and women (50%) over the entire dataset. Therefore, the model is globally fair. However, if we look at the populations of the individual banks A and B in Figure 3, we get a different view. For bank A, the FDR for women is 0 and for men it is 1. Therefore, there is no local fairness for this metric at bank A, and for bank B it is the other way around. What the consequences of this difference are in practice depends on the context, but in general one should be aware that a model can be fair on the global dataset while it is not on a local dataset. Local fairness can be desirable, so that a bank knows it treats different groups of its clients equally.
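The contrast between global and local fairness can be reproduced with a few lines of code on hypothetical records chosen to mirror the situation described for Figures 2b and 3 (the actual figure data is not reproduced here).

```python
# Sketch: FDR parity holds on the pooled data but not at the individual banks.
from collections import defaultdict

# (bank, gender, y_true, y_pred) for the individuals classified as high risk
high_risk = [
    ("A", "m", 0, 1), ("A", "m", 0, 1), ("A", "f", 1, 1), ("A", "f", 1, 1),
    ("B", "f", 0, 1), ("B", "f", 0, 1), ("B", "m", 1, 1), ("B", "m", 1, 1),
]

def fdr_by(records, key):
    fp, pos = defaultdict(int), defaultdict(int)
    for bank, gender, y_true, _ in records:
        k = key(bank, gender)
        pos[k] += 1
        fp[k] += (y_true == 0)          # false positive among the high-risk class
    return {k: fp[k] / pos[k] for k in pos}

print(fdr_by(high_risk, key=lambda b, g: g))       # global: FDR 0.5 for m and f
print(fdr_by(high_risk, key=lambda b, g: (b, g)))  # per bank: FDR 0 vs 1
```

The only difference between the two calls is the grouping key, which is exactly the choice the parties have to make explicit when they agree on a fairness assessment.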
Intersectional fairness The example might seem extreme, but note that when data distributions among parties differ, such differences can actually occur as different outcomes of the same model. If, for example, bank A and bank B have populations of different ages, and age and gender are related in the model's prediction, this can cause different predictions across those features. Therefore, this topic is connected to the topic of intersectional fairness. In the case that bank A only has younger people and bank B only has older people, the model can be fair on gender and fair on age, but not fair on the intersection of those two features.
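A small hypothetical example of such an intersectional gap is sketched below: selection rates that are equal per gender and per age group, but far apart per (gender, age) subgroup. The numbers are invented for illustration only.

```python
# Sketch: marginally fair on gender and on age, unfair on their intersection.
# (gender, age_group) -> (number classified high risk, subgroup size)
outcomes = {
    ("m", "young"): (8, 10), ("f", "young"): (2, 10),   # e.g. bank A's clients
    ("m", "old"):   (2, 10), ("f", "old"):   (8, 10),   # e.g. bank B's clients
}

def selection_rate(keys):
    high = sum(outcomes[k][0] for k in keys)
    size = sum(outcomes[k][1] for k in keys)
    return high / size

for g in ("m", "f"):            # marginal gender rates: both 0.50
    print(g, selection_rate([k for k in outcomes if k[0] == g]))
for a in ("young", "old"):      # marginal age rates: both 0.50
    print(a, selection_rate([k for k in outcomes if k[1] == a]))
for k in outcomes:              # intersections: 0.8 vs 0.2
    print(k, selection_rate([k]))
```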
As was noted before, using a model with a different objective than it was designed for increases the risk of flawed predictions and possible unfairness. A model that was designed for classification of groups should not be used for an individual case, and vice versa. When retraining the model later, the wrong classifications from this misused model will end up in the new training dataset. This is an example of how a (polluting) feedback loop can occur, in which the input of a system changes over time, thereby changing the system itself. Such loops do not only occur through misuse of a model, but also naturally due to time, environment and usage. They can affect the different elements of an AI pipeline, from features to user. These feedback loops can cause bias to sneak into the data, so that a static fairness measurement will not hold. Other examples, as described by Pagan et al. [22] and applied to our example, are:
1. a sampling feedback loop, where the decision whom to issue a loan might cause one of the gender groups to no longer apply for loans;
2. an individual feedback loop, where a requester of a loan decides to spend less money because they know they have been denied a loan multiple times (assuming 'money spent' is a predictor);
3. a feature feedback loop, where a predictor of repaying a loan is the risk classification a person has received before. In other words, if being classified
4 Discussion
In this paper, we have reflected upon the challenges around (measuring) algorithmic fairness when using MPC models. We have concluded that technical solutions at the intersection of PETs and fairness are mainly focused on federated learning, and not yet on MPC protocols. We identified three challenges that can occur in practice. The key takeaway is that the parties need to agree on what fairness means for them. Furthermore:
– To perform one general fairness assessment for all parties involved in MPC, multiple criteria need to be met. The different parties should be aligned to the point of using the model in the exact same manner, for the exact same outcome and under the same intentions and policies. Individual assessment poses the unsolvable situation in which one party finds unfairness but is not able to change the model.
– What a fairness metric means in a multi-party setting depends on the data that is used. In a multi-party setting, this means that if the federated data differs, different fairness metrics can be applicable.
– When multiple parties use a model in their own systems, it is inevitable that feedback loops will occur. To monitor, evaluate and act upon their effects on the fairness of a model effectively, alignment and insight into each party's usage are needed.
In future research, we hope that technical solutions can be found for some of the challenges around fairness assessment in an MPC setting. Still, there are both technical and non-technical open questions, such as:
– How can fairness mitigation techniques (pre-, in- and post-processing) in federated learning be applied in other MPC protocols?
– How can multiple parties agree upon the way they measure fairness and how they act upon it? The field of research in PETs and Data Spaces might provide guidelines for building agreements around MPC protocols [4].
– Which mitigation techniques from intersectional fairness can be applied to the issue of local vs. global fairness in MPC protocols?
– Which techniques for achieving local and/or global fairness in federated learning can be applied in an MPC protocol?
– How can possible feedback loops of the separate systems in an MPC collaboration be handled, so that their effects on fairness can be evaluated and monitored?
With this position paper, we hope to have paved the way for a promising new research direction: the necessary integration of fairness techniques into MPC research.
Acknowledgments. The research activities that led to this paper were supported by
TNO’s Appl.AI programme.
References
1. Abay, A., Zhou, Y., Baracaldo, N., Rajamoni, S., Chuba, E., Ludwig, H.: Mitigating
bias in federated learning. arXiv preprint arXiv:2012.02447 (2020)
2. Ashurst, C., Weller, A.: Fairness without demographic data: A survey of approaches. In: Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization. pp. 1–12 (2023)
3. Baum, C., Chiang, J.H.y., David, B., Frederiksen, T.K.: Sok: Privacy-enhancing
technologies in finance. Cryptology ePrint Archive (2023)
4. BDVA, CoE DSC: Leveraging the benefits of combining data spaces and privacy enhancing technologies (March 2024), https://bdva.eu/news/bdva-and-coe-dsc-joint-white-paper-on-combining-data-spaces-and-pets/
5. Braun, C.: Fairness in machine learning (Jan 2024), https://dida.do/blog/fairness-in-ml
6. Calvi, A., Malgieri, G., Kotzinos, D.: The unfair side of privacy enhancing technologies: addressing the trade-offs between PETs and fairness. In: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. pp. 2047–2059. FAccT '24, Association for Computing Machinery, New York, NY, USA (2024). https://doi.org/10.1145/3630106.3659024
7. D’Amour, A., Srinivasan, H., Atwood, J., Baljekar, P., Sculley, D., Halpern, Y.:
Fairness is not static: deeper understanding of long term fairness via simulation
studies. In: Proceedings of the 2020 Conference on Fairness, Accountability, and
Transparency. pp. 525–534 (2020)
8. Das, S., Stanton, R., Wallace, N.: Algorithmic fairness. Annual Review of Financial
Economics 15(1), 565–593 (2023)
9. Deldjoo, Y., Jannach, D., Bellogín, A., Difonzo, A., Zanzonelli, D.: Fairness in recommender systems: research landscape and future directions. User Modeling and User-Adapted Interaction 34, 1–50 (2023). https://doi.org/10.1007/s11257-023-09364-z
10. Du, W., Xu, D., Wu, X., Tong, H.: Fairness-aware agnostic federated learning. In:
Proceedings of the 2021 SIAM International Conference on Data Mining (SDM).
pp. 181–189. SIAM (2021)
11. European Commission, Directorate-General for Communications Networks, Content and Technology: Ethics guidelines for trustworthy AI (2019), https://data.europa.eu/doi/10.2759/346720
12. Ezzeldin, Y.H., Yan, S., He, C., Ferrara, E., Avestimehr, A.S.: Fairfed: Enabling
group fairness in federated learning. In: Proceedings of the AAAI conference on
artificial intelligence. vol. 37(6), pp. 7494–7502 (June 2023)
13. Gálvez, B.R., Granqvist, F., van Dalen, R., Seigel, M.: Enforcing fairness in private
federated learning via the modified method of differential multipliers. In: NeurIPS
2021 Workshop Privacy in Machine Learning (2021)
14. Gohar, U., Cheng, L.: A survey on intersectional fairness in machine learning:
Notions, mitigation, and challenges. arXiv preprint arXiv:2305.06969 (2023)
15. Hamman, F., Dutta, S.: Demystifying local and global fairness trade-offs in federated learning using information theory. In: Federated Learning and Analytics in Practice: Algorithms, Systems, Applications, and Opportunities (2023)
16. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning.
Advances in neural information processing systems 29 (2016)
17. Aequitas tool
18. Korteling, W., Drie, R.v., Veenman, C.: Fair AI: State-of-the-art overview of the literature (December 2022), https://publications.tno.nl/publication/34640564/UJuRDe/TNO-2023-R10060.pdf
19. Luo, M., Chen, F., Hu, D., Zhang, Y., Liang, J., Feng, J.: No fear of heterogeneity:
Classifier calibration for federated learning with non-iid data. Advances in Neural
Information Processing Systems 34, 5972–5984 (2021)
20. Maxwell, N.: Innovation and discussion paper: Case studies of the use of privacy preserving analysis to tackle financial crime (January 2021), https://www.future-fis.com
21. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey
on bias and fairness in machine learning. ACM Comput. Surv. 54(6) (jul 2022).
https://doi.org/10.1145/3457607, https://doi.org/10.1145/3457607
22. Pagan, N., Baumann, J., Elokda, E., De Pasquale, G., Bolognani, S., Hannák,
A.: A classification of feedback loops and their relation to biases in automated
decision-making systems. In: Proceedings of the 3rd ACM Conference on Equity
and Access in Algorithms, Mechanisms, and Optimization. pp. 1–14 (2023)
23. Pessach, D., Tassa, T., Shmueli, E.: Fairness-driven private collaborative machine
learning. ACM Transactions on Intelligent Systems and Technology 15(2), 1–30
(2024)
24. Ruf, B., Detyniecki, M.: Towards the right kind of fairness in ai. arXiv preprint
arXiv:2102.08453 (2021)
25. Saleiro, P., Kuester, B., Stevens, A., Anisfeld, A., Hinkson, L., London, J., Ghani,
R.: Aequitas: A bias and fairness audit toolkit. CoRR abs/1811.05577 (2018),
http://arxiv.org/abs/1811.05577
26. Wang, G., Payani, A., Lee, M., Kompella, R.: Mitigating group bias in federated
learning: Beyond local fairness. arXiv preprint arXiv:2305.09931 (2023)
27. Zhang, D.Y., Kou, Z., Wang, D.: Fairfl: A fair federated learning approach to
reducing demographic bias in privacy-sensitive classification models. In: 2020 IEEE
International Conference on Big Data (Big Data). pp. 1051–1060. IEEE (2020)