Comparing Fair Ranking Metrics
AMIFA RAJ, People and Information Research Team, Boise State University
CONNOR WOOD, People and Information Research Team, Boise State University
ANANDA MONTOLY∗ , Smith College
MICHAEL D. EKSTRAND, People and Information Research Team, Boise State University
Ranking is a fundamental aspect of recommender systems. However, ranked outputs can be susceptible to various biases; some of
these may cause disadvantages to members of protected groups. Several metrics have been proposed to quantify the (un)fairness of
rankings, but to date there has been no direct comparison of these metrics. This makes it difficult to decide which fairness metrics are applicable to specific scenarios and to assess the extent to which the metrics agree or disagree. In this paper, we describe several fair
ranking metrics in a common notation, enabling direct comparison of their approaches and assumptions, and empirically compare
them on the same experimental setup and data set. This comparative analysis identifies key similarities and differences among the selected fair ranking metrics.
Additional Key Words and Phrases: fair ranking, fairness metrics, group fairness
1 INTRODUCTION
Recommender systems are valuable for helping users find information and resources that are relevant to their personal
tastes, and match content providers with the users who will appreciate their work, but like all machine learning systems,
they can also perpetuate biases throughout their design, operation, training, and evaluation [13].
One high-impact way this can manifest in recommender systems is in disparate exposure: the system exposes its items
(and their creators) to users by providing them in recommendation lists (or rankings), and this visibility affects what
users consume, purchase, and know about. Disparate exposure can disadvantage content creators on either an individual
or group basis. Popularity bias, for example, advantages creators based on their prior popularity. The system may also,
however, provide greater or lesser exposure along lines associated with historical and ongoing social discrimination,
such as gender or race.
Several metrics have been proposed to measure the fairness of a recommender system’s ranked outputs with respect
to the people producing the items it recommends, but so far there has been no systematic integration or comparison of
them. In this paper, we fill this gap through three contributions: (1) describing and comparing exposure- and rank-
fairness metrics in a unified framework; (2) identifying gaps between their original presentation and the practicalities of
applying them to recommender systems; and (3) directly comparing their outcomes with the same data and experimental
setting.
constructs there are a variety of different metrics and techniques [9]. Recommender systems introduce the further
complication of being multisided environments, with different stakeholders having different fairness concerns [3]. One
way these biases can manifest is through disparate exposure of recommended works by different creators or groups of
creators (an aspect of provider fairness [3]).
The ranked nature of recommendation outputs further complicates fairness due to position bias: users are more likely
to see and engage with recommendations at the top of a list [5]. Slight changes to ranking may lead to large changes in
the attention paid to a result and the economic return to its creator.
In this paper, we focus on group-fairness of ranked outputs, adopting the common frame inspired by United States
anti-discrimination law of a “protected group”: a class of people who share a trait upon which a recommendation or
classification should not be mediated [17]. This includes discrimination on the basis of race, gender, religion, and similar
traits. A few of the metrics are also applicable to individual fairness, which we note in their discussion.
To account for the differing attention that different rank positions afford, we use user models (often with logarithmic, exponential, or geometric decay) to compute a position weight vector $\vec{a}_L$ for ranking $L$.
Yang and Stoyanovich [18] measure the group fairness of a ranking by averaging, over successive prefixes, a distance $d$ between the prefix's group distribution and that of the whole list; we refer to this family as prefix fairness ($\mathrm{PreF}_d$):
$$\mathrm{PreF}_d(L) = \frac{1}{Z} \sum_{k=10,20,\dots}^{N} \frac{d(L_{1..k}, L)}{\log_2 k} \qquad (1)$$
where the normalizing scalar $Z = \max_{L'} \mathrm{PreF}'_d(L')$ (taken over all $L'$ with the same length and group composition as $L$, where $\mathrm{PreF}'_d$ is the prefix fairness function without the normalizer). This scales $\mathrm{PreF}_d$ to the range $[0, 1]$, with 1 representing maximum unfairness.
Instantiating $\mathrm{PreF}_d$ with different distance functions yields the different members of this metric family. Two distance functions compare the proportion and ratio of group memberships, respectively:
$$d_{ND}(L_{1..k}, L) = \left| \frac{|G^+_{1..k}|}{k} - \frac{|G^+_L|}{N} \right| \qquad\qquad d_{RD}(L_{1..k}, L) = \left| \frac{|G^+_{1..k}|}{|G^-_{1..k}|} - \frac{|G^+_L|}{|G^-_L|} \right|$$
A third compares lists using K-L divergence:
$$d_{KL}(L_{1..k}, L) = D_{KL}(P_{L_{1..k}} \,\|\, P_L) = \sum_g P_{L_{1..k}}(g) \log_2 \frac{P_{L_{1..k}}(g)}{P_L(g)}$$
where $P_{L_{1..k}} = \left\langle \frac{|G^+_{1..k}|}{k}, \frac{|G^-_{1..k}|}{k} \right\rangle$ and $P_L = \left\langle \frac{|G^+_L|}{N}, \frac{|G^-_L|}{N} \right\rangle$. $d_{KL}$ has the advantage of generalizing to more than two groups. None of the metrics work when $G^-_L = \emptyset$, and $d_{RD}$ does not work when there are too few members of $G^-_L$.
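To make the prefix computation concrete, the following is a minimal Python sketch of the normalized-difference variant for a single binary-labeled ranking; the function names are ours, and the normalizer follows the "all majority items before any protected items" construction discussed in Sec. 3.4, so this is an illustration rather than the original authors' implementation.

```python
import math

def pref_nd_raw(is_protected):
    """Unnormalized PreF_ND: at each prefix cutoff k = 10, 20, 30, ...,
    compare the protected-group proportion in the k-item prefix to the
    proportion in the whole list, discounting cutoff k by log2(k)."""
    n = len(is_protected)
    full_prop = sum(is_protected) / n
    total = 0.0
    for k in range(10, n + 1, 10):
        prefix_prop = sum(is_protected[:k]) / k
        total += abs(prefix_prop - full_prop) / math.log2(k)
    return total

def pref_nd(is_protected):
    """Normalized to [0, 1] (1 = maximally unfair). Z is the score of the
    maximally unfair arrangement with the same group composition, i.e.
    all non-protected items ranked before all protected items."""
    z = pref_nd_raw(sorted(is_protected))  # False (non-protected) sorts first
    return pref_nd_raw(is_protected) / z if z > 0 else 0.0
```

For example, `pref_nd([i % 4 == 0 for i in range(100)])` scores a 100-item list in which every fourth item comes from the protected group.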
10-item windows are a coarse means of prioritizing the fairness of early positions in the ranking. Sapiezynski et al.
[12] refine the notion of a fair ranking in two ways: they replace the list prefix averaging with weights attached to each
rank position (which they call attention), and generalize from the distribution of the entire ranked list to a population
estimator p̂, a distribution over groups that is considered optimally fair. p̂ can be computed from the set of all relevant
items, equivalent to the use of the whole ranking in Eq. 1; the set of all content producers; the population at large; or
other means of assessing the target group membership.
Given the alignment matrix and a suitably normalized position weight vector, $\epsilon_L = G_L^T \vec{a}_L$ is a distribution representing the cumulative exposure each group receives from the ranked list $L$. The resulting unfairness metric, which we call Attention-Weighted Rank Fairness (AWRF), is the difference between this exposure distribution and the population estimator. To measure a system's overall behavior, we aggregate exposure in expectation over rankings and users:
$$\epsilon_u = \mathrm{E}_\pi[\epsilon_L] = \sum_L \pi(L \mid u)\, \epsilon_L \qquad\qquad \epsilon = \mathrm{E}_{\pi\rho}[\epsilon_L] = \sum_u \rho(u) \sum_L \pi(L \mid u)\, \epsilon_L$$
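As a concrete illustration of the exposure computation, here is a minimal Python sketch for a single ranked list, using a geometric browsing model (the decay AWRF uses in our experiments; see Sec. 4). The function names, the default patience parameter, and the final comparison statistic are our illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def geometric_weights(n, gamma=0.5):
    """Position weights under a geometric (RBP-style) browsing model:
    position k is examined with probability proportional to gamma**(k-1).
    Weights are normalized so the group exposure below is a distribution."""
    w = gamma ** np.arange(n)
    return w / w.sum()

def group_exposure(alignment, gamma=0.5):
    """Cumulative group exposure eps_L = G_L^T a_L for one ranked list.
    `alignment` is an (n_items x n_groups) array; rows may encode soft or
    partial group membership."""
    alignment = np.asarray(alignment, dtype=float)
    a = geometric_weights(alignment.shape[0], gamma)
    return alignment.T @ a

def awrf_gap(alignment, p_hat, gamma=0.5):
    """Signed difference between the list's exposure distribution and the
    population estimator p_hat. The comparison statistic here is an
    illustrative choice, not necessarily the exact AWRF formulation."""
    return group_exposure(alignment, gamma) - np.asarray(p_hat, dtype=float)
```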
There are several ways to measure the fairness of group exposure operationalized in this manner. Singh and Joachims [15] define three, which we extend to measure a system's overall behavior instead of per-query fairness. The first is a demographic parity metric, measuring the difference in exposure between two groups:¹
$$\mathrm{DP} = \frac{\epsilon(G^+)}{\epsilon(G^-)} \qquad (3)$$
The second, the disparate treatment ratio,² compares each group's exposure to its expected relevance $\Upsilon$:
$$\mathrm{DTR} = \frac{\epsilon(G^+)/\Upsilon(G^+)}{\epsilon(G^-)/\Upsilon(G^-)} \qquad (4)$$
The third, the disparate impact ratio, compares the discounted gain contributed by each group's members to overall group utility ($\Gamma(G) = \sum_{i \in G} \mathrm{E}_{\pi\rho}[a_L(i)\, y_{ui}]$):
$$\mathrm{DIR} = \frac{\Gamma(G^+)/\Upsilon(G^+)}{\Gamma(G^-)/\Upsilon(G^-)} \qquad (5)$$
In each, they used logarithmic decay, with $a_L(i) = \log_2(1 + L(i))^{-1}$. As ratios, values greater than 1 indicate a bias towards the protected group.

¹The original paper presented a constraint, not a metric, for demographic parity; we have implemented it as a ratio to be consistent with the other metrics.
²We question this choice of terminology, because disparate exposure may result from treatment or it may be an emergent effect, but we use the metric names as described by Singh and Joachims [15]. They justify these terms by considering any discrepancy in exposure to be a treatment.
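Once the per-group totals ε (exposure), Γ (discounted gain), and ϒ (expected relevance) have been accumulated over users and rankings, the ratio metrics reduce to simple arithmetic. The sketch below assumes those totals are already computed; all names are ours.

```python
import math

def dp(exposure_pos, exposure_neg):
    """Demographic parity, implemented as a ratio of group exposures."""
    return exposure_pos / exposure_neg

def dtr(exposure_pos, exposure_neg, util_pos, util_neg):
    """Disparate treatment ratio: exposure relative to expected relevance."""
    return (exposure_pos / util_pos) / (exposure_neg / util_neg)

def dir_ratio(gain_pos, gain_neg, util_pos, util_neg):
    """Disparate impact ratio: discounted gain relative to expected relevance."""
    return (gain_pos / util_pos) / (gain_neg / util_neg)

def log_discount(rank):
    """Logarithmic position weight a_L(i) = 1 / log2(1 + rank), with rank >= 1."""
    return 1.0 / math.log2(1 + rank)
```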
Biega et al. [2] amortized exposure over a sequence of rankings from the system's query log in a metric called amortized attention. As discussed above, we can treat the sequence as a sequence of draws from the distribution of users and rankings and replace their sums with expectations (the sum over the sequence, divided by the sequence length, is equivalent to the expectation). They also compared rank exposure to the estimated relevance, instead of ground-truth relevance assessments; this avoids sparsity problems, but makes the metric dependent on both the accuracy and the fairness of the relevance predictions. Given the expected relevance $\Upsilon$ (computed as above), they set the goal that $\frac{\epsilon(G_1)}{\Upsilon(G_1)} = \frac{\epsilon(G_2)}{\Upsilon(G_2)}$ for all pairs of groups $G_1, G_2$; deviation from this goal becomes the Inequity of Amortized Attention (IAA) metric.
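To illustrate the equity goal (though not the exact amortized formulation, which we do not reproduce here), the sketch below computes each group's exposure-to-relevance ratio and summarizes inequity as the largest pairwise gap; that summary statistic is our own illustrative choice, not Biega et al.'s metric.

```python
from itertools import combinations

def exposure_relevance_ratios(exposure, relevance):
    """Per-group exposure-to-relevance ratios; the equity goal asks for
    these to be equal across all groups. Inputs are dicts keyed by group."""
    return {g: exposure[g] / relevance[g] for g in exposure}

def max_pairwise_gap(exposure, relevance):
    """Largest absolute gap between any two groups' ratios; this one-number
    summary is an illustrative choice, not the original amortized metric."""
    ratios = exposure_relevance_ratios(exposure, relevance)
    return max(abs(ratios[a] - ratios[b]) for a, b in combinations(ratios, 2))
```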
The decomposition in Eq. 8 yields two component metrics, the expected exposure disparity $\mathrm{EED} = \|\epsilon\|_2^2$ (analogous to Demographic Parity above) and the expected exposure relevance $\mathrm{EER} = 2\epsilon^T \epsilon^*$. They propose two models for the position weights, a cascade model based on expected reciprocal rank and a geometric model from rank-biased precision.
Neither IAA nor the EE metrics distinguish between groups that are over- or under-exposed; for both, 0 is perfectly
fair and larger values are unfair.
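Given expected exposure vectors ε for the system policy and ε* for the ideal policy, the expected exposure components are simple inner products; a minimal numpy sketch follows (variable and function names are ours).

```python
import numpy as np

def expected_exposure_components(eps, eps_ideal):
    """Decompose the squared exposure gap ||eps - eps*||^2 into
    EED = ||eps||^2 (exposure disparity) and EER = 2 eps^T eps*
    (exposure relevance); ||eps*||^2 is constant for a fixed ideal policy."""
    eps = np.asarray(eps, dtype=float)
    eps_ideal = np.asarray(eps_ideal, dtype=float)
    eed = float(eps @ eps)
    eer = float(2.0 * (eps @ eps_ideal))
    return eed, eer
```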
The common thread between these metrics, articulated by Diaz et al. [5], is that for a fixed information need, differences in exposure between items with the same relevance grade result in disparate impact. The only way to address this inequity is by varying the rankings returned by the system, as with a stochastic ranking policy.
3.4 Discussion
With the same motivation of identifying bias in ranked output, the metrics present different approaches, assumptions,
and implications. Neither PreFd nor AWRF accounts for relevance, so using these metrics in isolation for evaluation or
optimization may reduce ranking quality; they are best suited for measuring the relative fairness of rankings optimized
for utility, particularly when there are large relevant sets. PreFd also breaks down in more edge cases around the relative
sizes of groups than AWRF, suggesting AWRF should be preferred for this use case.
Most of the metrics in the exposure family account for both fairness and relevance in their final values, e.g. measuring
the extent to which exposure is disproportional to relevance, but use different sources of relevance information (at least
in their original presentations). IAA uses the system’s predicted relevance ŷ; this has the advantage of sidestepping the
sparsity of available relevance judgements (particularly a problem in recommender systems, where the vast majority of
items, including many relevant items, are unrated), but means that if the relevance estimates are biased in an unfair way,
the metric’s assessment of fairness will be impacted. DTR, DIR, and the Expected Exposure metrics use ground-truth
relevance judgements, making them dependent on the availability of ratings or judgements. Most of these metrics can
be implemented with other relevance judgements, although EEL and EER are not applicable to estimated relevance
because the system will generally rank in order of estimated relevance, thus always obtaining ideal rankings.
Some metrics (e.g. AWRF and EE*) represent group membership with alignment vectors, allowing for non-binary
group associations and ambiguity; others (such as most of the PreF family, DP, DTR, and DIR: anything using differences or ratios) require binary class assignment with an identified protected class, limiting their applicability.
One striking finding is how deeply similar most of the metrics we consider are. The fundamental construct, weighted exposure, is the same; they differ primarily in how they relate exposure to relevance and how they aggregate and compare exposure distributions.
Two outstanding challenges in applying most of these metrics are handling missing data (both relevance and group
membership) and setting metric parameters. Most metrics depend on parameter values (e.g. stopping probability and
patience parameters for geometric or cascade models) that need to be properly configured, introducing complexity to
their application.
In realistic recommender experiments, relevance and group membership data are missing for many items. For many
metrics, we can treat items with unknown relevance as irrelevant (y = 0), and keep unknown-group items for the
purpose of computing attention weights but exclude them from further analysis, or treat “unknown” as an additional
group identity. This approach is problematic for the PreF family, though, because the metrics treat a list with fewer than
10 known-group items as maximally fair, and the straightforward way of computing Z — make the ranking maximally
unfair by putting all majority items before any protected items — does not work in the face of missing data.
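As a concrete example of this missing-data convention for the exposure metrics, here is a small pandas sketch; the column names and helper functions are illustrative, not our exact pipeline.

```python
import pandas as pd

def build_alignment(items: pd.DataFrame) -> pd.DataFrame:
    """`items` has columns 'item' and 'gender' (which may contain NaN).
    Unknown-gender items are routed to an explicit 'unknown' column so they
    still absorb attention weight but can be excluded from the comparison."""
    gender = items['gender'].fillna('unknown')
    return pd.get_dummies(gender).set_index(items['item'])

def fill_relevance(recs: pd.DataFrame, truth: pd.DataFrame) -> pd.DataFrame:
    """Treat recommended items with no relevance judgement as irrelevant (y = 0)."""
    merged = recs.merge(truth[['user', 'item', 'rating']],
                        on=['user', 'item'], how='left')
    merged['rating'] = merged['rating'].fillna(0)
    return merged
```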
Additional constructs have been proposed that may address some of these concerns or advance the state of fair
ranking in other ways. Beutel et al. [1] present a pairwise definition of rank fairness that may be easier to apply with
missing relevance and/or group membership data. It requires further adaptation to fit within our current experimental setting, which we leave for future work.
4 EMPIRICAL COMPARISON
To empirically compare the different metrics we have discussed, we use them each to measure the fairness of book
recommendations with regard to author gender. Due to the difficulties applying PreFd discussed in Sec. 3 (missing data and numerous edge-case breakdowns), we exclude that family from our empirical analysis. For all other metrics, we use a continuation probability of 0.5 (for both geometric and cascade models) and a cascade stopping probability of 0.5, following Diaz et al. [5]. For all metrics with a protected group, female authors were G+.
For AWRF, we used the distribution of male and female authors among the set of books in the data set as the population
estimator. For IAA and the EE metrics, we included unknown gender as a third author group.
We use LensKit for Python [6] to generate recommendations for users in the GoodReads book data [16], with data
integration, recommendation algorithms, and hyperparameter tunings from Ekstrand and Kluver [7]. We refer the
[Figure 1: bar charts of each fairness metric's value (AWRF, DP, DTR, DIR, IAA, EED, EEL, EER; the EE panels under both ERR and RBP user models) for each algorithm (BPR, II, UU, MF) on Splits 1 and 5.]
Fig. 1. Outcomes of fairness metrics. For AWRF, higher is biased towards the protected group; for DIR, DP, and DTR, 1 is neutral, with lower values biased towards the protected group; for EED, EEL, and IAA, larger is less fair overall.
reader to that paper for full details on the integration strategy, data set statistics (including gender distributions),
algorithm configuration strategy, and crucial limitations; we use only the implicit-feedback GoodReads data.
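For orientation, the sketch below shows the recommendation-generation step, assuming the LensKit 0.x batch API that was current when [6] was published; the file path, algorithm choice, neighborhood size, and user sampling are placeholders rather than our actual experimental configuration (see Ekstrand and Kluver [7] for that).

```python
import pandas as pd
from lenskit import batch
from lenskit.algorithms import Recommender, item_knn

# Hypothetical input file; ratings need 'user', 'item', and 'rating' columns.
ratings = pd.read_parquet('goodreads-implicit.parquet')

# Item-based CF (II) as a placeholder; see [7] for the actual configurations.
algo = Recommender.adapt(item_knn.ItemItem(20))
algo.fit(ratings)

users = ratings['user'].drop_duplicates().sample(5000, random_state=42)
recs = batch.recommend(algo, users, 100)   # top-100 recommendations per user
```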
We created two samples of 5000 users each for our experiment; in the ‘Split 1’ sample, users had at least 5 ratings, 1
of which we held out as a test rating, while in ‘Split 5’, users had at least 10 ratings with 5 held out. Author gender
information for the books is extracted from the Virtual International Authority File (VIAF)³ as described by Ekstrand and Kluver [7].
We measured the fairness of the recommendations produced for these users by four collaborative filtering (CF) algorithms: user-based CF (UU, [8]); item-based CF (II, [4]); matrix factorization (MF, [10]); and Bayesian Personalized Ranking (BPR, [11]). Figure 1 shows the outcomes of the fairness metrics on the generated recommendations.
The differences in definition of fair ranking and the meaning of direction in these metrics makes it difficult to
interpret and directly compare results. A few points are apparent, though:
• The algorithms do not show large differences on most metrics, with the notable exception of II.
• AWRF and the D* family both agree that item-item is the most advantageous to female authors (as larger values
are more biased towards the protected group). They disagree, however, as to whether it is because it is more fair
(the D* family, approaching 1) or whether it is “unfairly” biased towards female authors (AWRF of about 4; recall
that AWRF is a z-statistic).
• IAA shows one algorithm standing out, but it is a different algorithm for each split, even though the metric does
not use the test data and thus should be unaffected by the size of the relevant set.
• EE is heavily influenced by the size of the relevant set, and this was more important than the choice of user model.⁴
There is no clear consensus or agreement between the metrics on the relative fairness of the algorithms we tested. In
this analysis, we have attempted to follow the original definitions of each metric as closely as possible; as seen in Sec.
3, however, there are numerous places where metrics could be made more similar to each other (e.g. using the same
position weights). For example, AWRF uses a geometric decay, while D* uses logarithmic rank discounting. Future work
will explore the relative impact of these component decisions versus other aspects of the design of these metrics, and
seek to better understand their implications and relative theoretical justifiability to provide a more robust foundation
for measuring fair ranking.
³ http://viaf.org/viaf/data/
⁴ Single-item relevant sets do not work at all for EE in its original formulation; taking expectation over multiple users enables it to work, but has drawbacks (shared with IAA) that we are still exploring.
5 CONCLUSION
This paper presents a comparative analysis among several fairness metrics recently introduced to measure fair ranking.
We discussed the metric formulations and implications in an integrated framework and presented the first (to our
knowledge) empirical comparison of fair ranking metrics for multiple recommendation algorithms with a common
data set and fairness goal. Our results did not show any consensus between metrics. They generally agreed that one algorithm (II) differed in its fairness dynamics from the others, but disagreed on how that difference related to equity.
This work opens up several directions for future research. An immediate first advance is to adapt the metrics to be
more similar to each other and study the effects of individual metric design decisions (such as the position weighting or
the source of relevance data) on metric behavior. There is also work to do on handling missing or sparse item relevance information and on allowing ambiguous or multiple group associations. Furthermore, considering alternative ranking models may introduce more complexity in measuring fair ranking.
Significant progress has been made in the last 2–3 years on measuring the fairness of rankings, but more work is
needed in order to understand how best to design and apply these metrics.
ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grant No. IIS 17-51278.
REFERENCES
[1] Alex Beutel, Jilin Chen, Tulsee Doshi, Hai Qian, Li Wei, Yi Wu, Lukasz Heldt, Zhe Zhao, Lichan Hong, Ed H Chi, et al. 2019. Fairness in recommendation
ranking through pairwise comparisons. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
2212–2220.
[2] Asia J Biega, Krishna P Gummadi, and Gerhard Weikum. 2018. Equity of attention: Amortizing individual fairness in rankings. In Proceedings of the
41st International ACM SIGIR Conference on Research and Development in Information Retrieval. 405–414.
[3] Robin Burke. 2017. Multisided Fairness for Recommendation. (July 2017). arXiv:1707.00093 [cs.CY] http://arxiv.org/abs/1707.00093
[4] Mukund Deshpande and George Karypis. 2004. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems (TOIS) 22,
1 (2004), 143–177.
[5] Fernando Diaz, Bhaskar Mitra, Michael D Ekstrand, Asia J Biega, and Ben Carterette. 2020. Evaluating Stochastic Rankings with Expected Exposure.
arXiv preprint arXiv:2004.13157 (2020).
[6] Michael D Ekstrand. 2018. The LKPY package for recommender systems experiments: Next-generation tools and lessons learned from the LensKit
project. arXiv preprint arXiv:1809.03125 (2018).
[7] Michael D. Ekstrand and Daniel Kluver. 2020. Exploring Author Gender in Book Rating and Recommendation. CoRR abs/1808.07586v2 (2020).
arXiv:1808.07586v2 https://md.ekstrandom.net/pubs/bag-extended
[8] Jonathan L. Herlocker, Joseph A. Konstan, Al Borchers, and John Riedl. 1999. An Algorithmic Framework for Performing Collaborative Filtering. In
Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing
Machinery, New York, NY, USA, 230–237.
[9] Shira Mitchell, Eric Potash, Solon Barocas, Alexander D’Amour, and Kristian Lum. 2018. Prediction-Based Decisions and Fairness: A Catalogue of
Choices, Assumptions, and Definitions. (Nov. 2018). arXiv:1811.07867 [stat.AP] http://arxiv.org/abs/1811.07867
[10] István Pilászy, Dávid Zibriczky, and Domonkos Tikk. 2010. Fast als-based matrix factorization for explicit and implicit feedback datasets. In
Proceedings of the fourth ACM conference on Recommender systems. 71–78.
[11] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian personalized ranking from implicit feedback.
arXiv preprint arXiv:1205.2618 (2012).
[12] Piotr Sapiezynski, Wesley Zeng, Ronald E Robertson, Alan Mislove, and Christo Wilson. 2019. Quantifying the Impact of User Attention on Fair Group Representation in Ranked Lists. In Companion Proceedings of The 2019 World Wide Web Conference. 553–562.
[13] Solon Barocas and Andrew D Selbst. 2016. Big Data’s Disparate Impact. California Law Review 104 (30 September 2016), 671–732.
[14] Andrew D Selbst, Danah Boyd, Sorelle A Friedler, Suresh Venkatasubramanian, and Janet Vertesi. 2019. Fairness and Abstraction in Sociotechnical
Systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency - FAT* ’19 (Atlanta, GA, USA). ACM Press, New York, New
York, USA, 59–68. https://doi.org/10.1145/3287560.3287598
[15] Ashudeep Singh and Thorsten Joachims. 2018. Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining. 2219–2228.
[16] Mengting Wan and Julian McAuley. 2018. Item recommendation on monotonic behavior chains. In Proceedings of the 12th ACM Conference on
Recommender Systems. 86–94.
[17] A Xiang and I Raji. 2019. On the Legal Compatibility of Fairness Definitions. NeurIPS (2019).
[18] Ke Yang and Julia Stoyanovich. 2017. Measuring fairness in ranked outputs. In Proceedings of the 29th International Conference on Scientific and
Statistical Database Management. 1–6.