
Comparing Fair Ranking Metrics

AMIFA RAJ, People and Information Research Team, Boise State University
CONNOR WOOD, People and Information Research Team, Boise State University
ANANDA MONTOLY∗, Smith College
MICHAEL D. EKSTRAND, People and Information Research Team, Boise State University
∗ Work conducted during a Research Experience for Undergraduates with Boise State University.
Ranking is a fundamental aspect of recommender systems. However, ranked outputs can be susceptible to various biases; some of
these may cause disadvantages to members of protected groups. Several metrics have been proposed to quantify the (un)fairness of
rankings, but there has not been to date any direct comparison of these metrics. This complicates deciding what fairness metrics are
applicable for specific scenarios, and assessing the extent to which metrics agree or disagree. In this paper, we describe several fair
ranking metrics in a common notation, enabling direct comparison of their approaches and assumptions, and empirically compare
them on the same experimental setup and data set. Our work provides a direct comparative analysis identifying the similarities and differences of the selected fair ranking metrics.

Additional Key Words and Phrases: fair ranking, fairness metrics, group fairness

1 INTRODUCTION
Recommender systems are valuable for helping users find information and resources that are relevant to their personal
tastes, and match content providers with the users who will appreciate their work, but like all machine learning systems,
they can also perpetuate biases throughout their design, operation, training, and evaluation [13].
One high-impact way this can manifest in recommender systems is in disparate exposure: the system exposes its items
(and their creators) to users by providing them in recommendation lists (or rankings), and this visibility affects what
users consume, purchase, and know about. Disparate exposure can disadvantage content creators on either an individual
or group basis. Popularity bias, for example, advantages creators based on their prior popularity. The system may also,
however, provide greater or lesser exposure along lines associated with historical and ongoing social discrimination,
such as gender or race.
Several metrics have been proposed to measure the fairness of a recommender system’s ranked outputs with respect
to the people producing the items it recommends, but so far there has been no systematic integration or comparison of
them. In this paper, we fill this gap through three contributions: (1) describing and comparing exposure- and rank-
fairness metrics in a unified framework; (2) identifying gaps between their original presentation and the practicalities of
applying them to recommender systems; and (3) directly comparing their outcomes with the same data and experimental
setting.

2 BACKGROUND AND POSITION


In order to correct machine learning reflections of systemic societal biases, it is essential to identify and measure
them. Fairness is hard to quantify; as an essentially contested social construct [14], there is no one correct or objective
definition. There are different ways in which a system can be unfair, each leading to different metrics. The goal of
group fairness is to ensure similar outcomes for members of protected groups as members of non-protected groups,
while individual fairness ensures that two similar individuals are also able to achieve similar outcomes; within these

constructs there are a variety of different metrics and techniques [9]. Recommender systems introduce the further
complication of being multisided environments, with different stakeholders having different fairness concerns [3]. One
way these biases can manifest is through disparate exposure of recommended works by different creators or groups of
creators (an aspect of provider fairness [3]).
The ranked nature of recommendation outputs further complicates fairness due to position bias: users are more likely
to see and engage with recommendations at the top of a list [5]. Slight changes to ranking may lead to large changes in
the attention paid to a result and the economic return to its creator.
In this paper, we focus on group-fairness of ranked outputs, adopting the common frame inspired by United States
anti-discrimination law of a “protected group”: a class of people who share a trait upon which a recommendation or
classification should not be mediated [17]. This includes discrimination on the basis of race, gender, religion, and similar
traits. A few of the metrics are also applicable to individual fairness, which we note in their discussion.

3 FAIR RANKING METRICS


Our first contribution is to describe several fair ranking constructs and metrics from the existing literature in a
common framework and notation to enable direct comparison and illuminate their commonalities and differences. Some
constructs assess fairness within a single ranking, which can then be aggregated to compute system fairness; others
directly assess the fairness of a sequence or distribution of rankings. One other key difference is in how they relate
fairness to relevance: some only measure the fair distribution of opportunity, and must be integrated with other metrics
to account for relevance, while others incorporate relevance into the metric definition itself. Table 1 summarizes the
metrics we consider. Metric names are chosen based on their functionality, purpose, and comparability within our
synthesis; in some cases, we use the original name, but in others we assign a new name, either because the original paper used a generic name (e.g. “unfairness”) or because the original name does not make for clear exposition in comparison to the other metrics.

3.1 Problem Formulation


We consider a recommender system that recommends items $i_1, i_2, \ldots, i_n \in I$ to users $u_1, u_2, \ldots, u_m \in U$. A recommendation list comes in the form of a ranked list $L$ of $N$ items from $I$; $L(i)$ is the ranking position of item $i$ in $L$. Items may have an associated relevance score $y_{ui}$, such as rating value or whether the item was consumed, and the system may estimate this by a predictor $\hat{y}_{ui}$.

To enable the measurement of group fairness where content providers fall into one (or more) of $g$ groups, each item is associated with an alignment vector $G_i \in [0, 1]^g$ (s.t. $\|G_i\|_1 = 1$) indicating its group membership; generalizing from a categorical variable to a vector allows for encoding either multiple membership or uncertainty about membership [12]. $G_L$ is an alignment matrix whose rows correspond to the items of $L$ and whose columns are groups. In the case of definitively-known membership in a binary pair of groups, $G^+$ denotes the set of items in the “protected” group and $G^-$ the remaining items.
The goal of many of these metrics is to measure the exposure or attention each item, content producer, or group
receives, and assess the fairness of this distribution. Authors vary in their choice of term for the resulting construct;
following Biega et al. [2] we distinguish them with the idea that exposure is the opportunity for attention. The system
exposes items to the user by placing them in the ranked list, and the user may or may not give them attention; with
available data, what we can measure is the fairness of exposure. To incorporate the differing opportunity for attention
different rank positions afford, we use user models (often logarithmic, exponential, or geometric decay) to compute a position weight vector $\vec{a}_L$ for ranking $L$.
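To make these shared building blocks concrete, the following sketch (ours, not code from any of the metric papers; the function names and array shapes are illustrative assumptions) computes the group-exposure vector $\epsilon_L = G_L^T \vec{a}_L$ for a single ranking under geometric or logarithmic position weights.

```python
import numpy as np

def geometric_weights(n, p=0.5):
    """Geometric (RBP-style) decay: a_L(i) = p * (1 - p)^(rank - 1)."""
    ranks = np.arange(1, n + 1)
    return p * (1 - p) ** (ranks - 1)

def logarithmic_weights(n):
    """Logarithmic (DCG-style) decay: a_L(i) = 1 / log2(1 + rank)."""
    ranks = np.arange(1, n + 1)
    return 1.0 / np.log2(1 + ranks)

def group_exposure(alignment, weights):
    """eps_L = G_L^T a_L: exposure accumulated by each group in one ranking.

    alignment: (N, g) alignment matrix G_L; row i holds the alignment vector
               of the item at rank i + 1.
    weights:   (N,) position-weight vector a_L.
    """
    return alignment.T @ weights

# Example: a five-item ranking over two groups (protected group in column 0).
G_L = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [0, 1]], dtype=float)
eps_L = group_exposure(G_L, geometric_weights(len(G_L)))
```

The metrics below differ mainly in how they normalize, aggregate, and compare vectors of this form.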

Table 1. Summary of fair ranking metrics.

Metric(s)       Goal                                           Weighting       Relevance?^a   Binary?
PreF_d [18]     Each prefix representative of whole ranking    —               No             Dep. on d^b
AWRF [12]       Weighted representation matches population     Geometric       No             No
DP [15]         Exposure equal across groups                   Logarithmic     No             No
DTR [15]        Exposure proportional to relevance             Logarithmic     Yes            No
DIR [15]        Discounted gain proportional to relevance      Logarithmic     Yes            No
IAA [2]         Exposure proportional to predicted relevance   Geometric       Predicted      No
EEL, EER [5]    Exposure matches ideal (from relevance)        Cascade, Geom.  Yes            No
EED [5]         Exposure well-distributed (low ∥·∥₂²)          Cascade, Geom.  No             No

^a Cascade weighting also incorporates relevance into exposure, even if exposure is not compared to relevance.
^b d_ND and d_RD both require binary groups, but d_KL generalizes.

3.2 Single-List Metrics


Yang and Stoyanovich [18] present three related measures of statistical parity between groups in a single ranking. To
prioritize parity at the top of the list, these measures average the parity over successive prefixes of the ranking; we call
them the prefix fairness family. Given a distance function $d$ that compares the composition of a list prefix $L_k$ to that of the whole list $L$, prefix fairness is defined as

$$\text{PreF}_d(L) = \frac{1}{Z} \sum_{k = 10, 20, 30, \ldots}^{N} \frac{d(L_k, L)}{\log_2 k} \qquad (1)$$

where the normalizing scalar $Z = \max_{L'} \text{PreF}'_d(L')$ (taken over all $L'$ with the same length and group composition as $L$, where $\text{PreF}'_d$ is the prefix fairness function without the normalizer). This scales $\text{PreF}_d$ to the range $[0, 1]$, with 1 representing maximum unfairness.
Instantiating $\text{PreF}_d$ with different distance functions yields the different members of this metric family. Two distance functions compare the proportion and ratio of group memberships, respectively: $d_{ND}(L_k, L) = \left| \frac{|G^+_{1..k}|}{k} - \frac{|G^+_L|}{N} \right|$ and $d_{RD}(L_k, L) = \left| \frac{|G^+_{1..k}|}{|G^-_{1..k}|} - \frac{|G^+_L|}{|G^-_L|} \right|$. A third compares lists using K-L divergence: $d_{KL}(L_k, L) = D_{KL}(P_{L_k} \| P_L) = \sum_g P_{L_k}(g) \log_2 \frac{P_{L_k}(g)}{P_L(g)}$, where $P_{L_k} = \left\langle \frac{|G^+_{1..k}|}{k}, \frac{|G^-_{1..k}|}{k} \right\rangle$ and $P_L = \left\langle \frac{|G^+_L|}{N}, \frac{|G^-_L|}{N} \right\rangle$. $d_{KL}$ has the advantage of generalizing to more than two groups. None of the metrics work when $G^-_L = \emptyset$, and $d_{RD}$ does not work when there are too few members of $G^-_L$.
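As a concrete illustration, here is a minimal sketch (ours) of the prefix-fairness computation with the $d_{KL}$ distance; the normalizer $Z$ is assumed to be supplied by the caller, e.g. by evaluating the unnormalized sum on a maximally unfair reordering of $L$.

```python
import numpy as np

def group_proportions(alignment):
    """Proportion of each group among the given (prefix of the) items."""
    return alignment.sum(axis=0) / len(alignment)

def d_kl(p, q, eps=1e-12):
    """D_KL(p || q) in bits, with a small epsilon to avoid log(0); the metric
    genuinely breaks down when a group is absent from the whole list."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log2(p / q)))

def prefix_fairness(alignment, d=d_kl, step=10, Z=1.0):
    """PreF_d(L) = (1/Z) * sum over k = 10, 20, ... of d(L_k, L) / log2(k)."""
    N = len(alignment)
    whole = group_proportions(alignment)
    total = sum(d(group_proportions(alignment[:k]), whole) / np.log2(k)
                for k in range(step, N + 1, step))
    return total / Z
```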
10-item windows are a coarse means of prioritizing the fairness of early positions in the ranking. Sapiezynski et al.
[12] refine the notion of a fair ranking in two ways: they replace the list prefix averaging with weights attached to each
rank position (which they call attention), and generalize from the distribution of the entire ranked list to a population
estimator p̂, a distribution over groups that is considered optimally fair. p̂ can be computed from the set of all relevant
items, equivalent to the use of the whole ranking in Eq. 1; the set of all content producers; the population at large; or
other means of assessing the target group membership.

Given the alignment matrix and a suitably normalized position weight vector, $\epsilon_L = G_L^T \vec{a}_L$ is a distribution that represents the cumulative exposure of group alignment of the ranked list $L$. The resulting unfairness metric, which we call Attention-Weighted Rank Fairness (AWRF), is the difference between this exposure distribution and the population estimator:

$$\text{AWRF}(L) = d(\epsilon_L, \hat{p}) \qquad (2)$$
The distance function d depends on application context; for binary groups with an identifiable protected class, they
use the z-statistic, with values greater than 0 representing bias towards the protected group.
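A minimal sketch of AWRF follows, under simplifying assumptions of our own: geometric position weights normalized to sum to 1, and a plain signed difference between the protected group's exposure and its share of the population estimator $\hat{p}$; the original work instead uses a z-statistic in the binary case, and any suitable distance $d(\epsilon_L, \hat{p})$ can be substituted.

```python
import numpy as np

def awrf(alignment, p_hat, p=0.5):
    """alignment: (N, g) matrix G_L with the protected group in column 0;
    p_hat: length-g population estimator (a distribution over groups)."""
    n = len(alignment)
    weights = p * (1 - p) ** np.arange(n)   # geometric decay over ranks 1..N
    weights = weights / weights.sum()       # normalize so eps_L is a distribution
    eps_L = alignment.T @ weights           # cumulative group exposure
    return eps_L[0] - p_hat[0]              # > 0: biased towards the protected group
```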

3.3 Distribution and Sequence Metrics


Both PreF_d and AWRF measure a single list, and neither directly accounts for the relevance of the ranked results. The exposure family of metrics operates over distributions or sequences of rankings, and directly incorporates relevance in
various ways. The intuition behind the relevance factor, articulated independently by Singh and Joachims [15] and Biega
et al. [2], is that exposure should be proportional to relevance: if an item or a group contributes 10% of the relevance to a
user (and/or query), it should receive 10% of the exposure. Another key insight is that fair exposure cannot be achieved
in a single ranking, because the drop-off in the value of rank positions is not the same as the drop-off in relevance. An item may be 5% more relevant than another but receive 50% more expected attention.
These metrics evaluate a stochastic ranker, modeled as a distribution $\pi(L|u)$ over rankings [5, 15]. We extend this to also model the arrival of users as a distribution $\rho(u)$, so a sequence of recommendations $L_1, L_2, \ldots, L_n$ in response to users [2] is a series of draws from the distribution $\rho(u)\pi(L|u)$. The group exposure within a single ranking from Eq. 2, $\epsilon_L = G_L^T \vec{a}_L$, is the fundamental building block of these metrics, along with its expected values:

$$\epsilon_u = E_\pi[\epsilon_L] = \sum_L \pi(L|u)\, \epsilon_L$$

$$\epsilon = E_{\pi\rho}[\epsilon_L] = \sum_u \rho(u) \sum_L \pi(L|u)\, \epsilon_L$$
There are several ways to measure the fairness of group exposure operationalized in this manner. Singh and Joachims [15] define three, which we extend to measure a system's overall behavior instead of per-query fairness. The first is a demographic parity metric, measuring the difference in exposure between two groups:¹

$$\text{DP} = \epsilon(G^+) / \epsilon(G^-) \qquad (3)$$

The second, which they call the disparate treatment ratio², measures violation of the goal that each group's exposure is proportional to its utility (measured by $\Upsilon(G) = E_\rho\left[\frac{1}{|G|} \sum_{i \in G} y_{ui}\right]$):

$$\text{DTR} = \frac{\epsilon(G^+) / \Upsilon(G^+)}{\epsilon(G^-) / \Upsilon(G^-)} \qquad (4)$$

The third, the disparate impact ratio, compares the discounted gain contributed by each group's members to overall group utility ($\Gamma(G) = \sum_{i \in G} E_{\pi\rho}[a_L(i)\, y_{ui}]$):

$$\text{DIR} = \frac{\Gamma(G^+) / \Upsilon(G^+)}{\Gamma(G^-) / \Upsilon(G^-)} \qquad (5)$$

In each, they used logarithmic decay, with $a_L(i) = \left(\log_2(1 + L(i))\right)^{-1}$. As ratios, values greater than 1 indicate a bias towards the protected group.

¹ The original paper presented a constraint, not a metric, for demographic parity; we have implemented it as a ratio to be consistent with the other metrics.
² We question this choice of terminology, because disparate exposure may result from treatment or it may be an emergent effect, but we use the metric names as described by Singh and Joachims [15]. They justify these terms by considering any discrepancy in exposure to be a treatment.
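The following sketch (ours) computes DP, DTR, and DIR under simplifying assumptions: one deterministic ranking per user, so the expectations over $\pi$ and $\rho$ reduce to averages over users, and group utility is estimated only from the items that appear in the rankings.

```python
import numpy as np

def log_weight(rank):
    """Logarithmic position weight a_L(i) = 1 / log2(1 + rank)."""
    return 1.0 / np.log2(1 + rank)

def exposure_ratios(rankings, relevance, protected):
    """rankings: dict user -> list of item ids in rank order;
    relevance: dict (user, item) -> y_ui (missing pairs treated as 0);
    protected: dict item -> True if the item belongs to the protected group."""
    eps = {True: 0.0, False: 0.0}    # exposure per group
    gam = {True: 0.0, False: 0.0}    # discounted gain per group (Gamma)
    rel = {True: 0.0, False: 0.0}    # summed relevance per group
    cnt = {True: 0, False: 0}        # ranked items per group
    for user, items in rankings.items():
        for rank, item in enumerate(items, start=1):
            g = protected[item]
            y = relevance.get((user, item), 0.0)
            eps[g] += log_weight(rank)
            gam[g] += log_weight(rank) * y
            rel[g] += y
            cnt[g] += 1
    ups = {g: rel[g] / cnt[g] for g in (True, False)}   # rough stand-in for Upsilon(G)
    dp = eps[True] / eps[False]
    dtr = (eps[True] / ups[True]) / (eps[False] / ups[False])
    dir_ = (gam[True] / ups[True]) / (gam[False] / ups[False])
    return dp, dtr, dir_
```

Because the three measures are ratios, the per-user normalization constants cancel and are omitted in this sketch.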
Biega et al. [2] amortized exposure over a sequence of rankings from the system’s query log in a metric called
amortized attention. As discussed above, we can treat the sequence as a sequence of draws from the distribution of users
and rankings and replace their sums with expectations (as summing over the sequence is equivalent to expectation
divided by sequence length). They also compared rank exposure to the estimated relevance, instead of ground truth
relevance assessments; this avoids sparsity problems, but makes the metric dependent on both the accuracy and the
fairness of the relevance predictions. Given the expected relevance $\Upsilon$ (computed as above), they set the goal that $\frac{\epsilon(G_1)}{\Upsilon(G_1)} = \frac{\epsilon(G_2)}{\Upsilon(G_2)}$ for all pairs of groups $G_1, G_2$; this then becomes the Inequity of Amortized Attention metric:

$$\text{IAA} = \|\epsilon - \Upsilon\|_1 \qquad (6)$$

They use a geometric decay for position weights, so $a_L(i) = p(1 - p)^{L(i) - 1}$.

The primary focus of the amortized attention framework is individual fairness, where each item (or author) is exposed in proportion to its relevance; group fairness here is a groupwise aggregate of individual fairness.
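A minimal sketch (ours) of a group-level IAA computation follows, assuming geometric position weights and system-predicted relevance scores, with exposure and relevance each normalized to a distribution over groups before the $L_1$ comparison; the original formulation amortizes over individual subjects in a query log.

```python
import numpy as np

def iaa(alignments, predicted, p=0.5):
    """alignments: list of (N, g) alignment matrices, one per ranking;
    predicted: list of (N,) predicted-relevance vectors in the same rank order."""
    n_groups = alignments[0].shape[1]
    exposure = np.zeros(n_groups)
    relevance = np.zeros(n_groups)
    for G_L, y_hat in zip(alignments, predicted):
        a_L = p * (1 - p) ** np.arange(len(G_L))   # geometric position weights
        exposure += G_L.T @ a_L                    # attention each group receives
        relevance += G_L.T @ y_hat                 # predicted relevance each group contributes
    exposure /= exposure.sum()
    relevance /= relevance.sum()
    return float(np.abs(exposure - relevance).sum())   # || eps - Upsilon ||_1
```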
Diaz et al. [5] integrate relevance in a different way. Rather than directly relating exposure to relevance, they use relevance to derive a target exposure, based on an ideal policy $\tau$ that assigns equal probability to all rankings that place items in non-increasing order of relevance and zero probability to all other rankings. This target exposure $\epsilon^*$ is the expected exposure under the ideal policy, so $\epsilon^* = E_{\tau\rho}[\epsilon_L]$. They then take the squared Euclidean distance between the system's expected exposure and the target exposure, yielding the Expected Exposure Loss metric:

$$\text{EEL} = \|\epsilon - \epsilon^*\|_2^2 \qquad (7)$$

$$= \|\epsilon\|_2^2 - 2\epsilon^T \epsilon^* + \|\epsilon^*\|_2^2 \qquad (8)$$

The decomposition in Eq. 8 yields two component metrics, the expected exposure disparity $\text{EED} = \|\epsilon\|_2^2$ (analogous to Demographic Parity above) and the expected exposure relevance $\text{EER} = 2\epsilon^T \epsilon^*$. They propose two models for the position weights, a cascade model based on expected reciprocal rank and a geometric model from rank-biased precision.

Neither IAA nor the EE metrics distinguish between groups that are over- or under-exposed; for both, 0 is perfectly fair and larger values indicate greater unfairness.
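For a single information need with graded relevance and geometric position weights, a minimal sketch (ours) of the expected-exposure metrics looks as follows; the caller supplies the system's per-item expected exposure (for a deterministic ranker, simply the position weight of each item's rank), and tied relevance grades share the average weight of the block of positions they occupy under the ideal policy.

```python
import numpy as np

def geometric_weights(n, p=0.5):
    return p * (1 - p) ** np.arange(n)

def target_exposure(relevance, p=0.5):
    """Per-item expected exposure under the ideal policy tau (relevance-sorted,
    with ties sharing the average weight of their block of positions)."""
    relevance = np.asarray(relevance, dtype=float)
    n = len(relevance)
    w = geometric_weights(n, p)
    order = np.argsort(-relevance)          # non-increasing relevance
    target = np.empty(n)
    start = 0
    while start < n:
        end = start
        while end < n and relevance[order[end]] == relevance[order[start]]:
            end += 1
        target[order[start:end]] = w[start:end].mean()
        start = end
    return target

def expected_exposure_metrics(alignment, system_exposure, relevance, p=0.5):
    """alignment: (N, g) matrix; system_exposure: (N,) per-item expected exposure."""
    eps = alignment.T @ system_exposure                 # system group exposure
    eps_star = alignment.T @ target_exposure(relevance, p)
    eel = float(np.sum((eps - eps_star) ** 2))          # EEL = ||eps - eps*||_2^2
    eed = float(np.sum(eps ** 2))                       # EED = ||eps||_2^2
    eer = float(2 * eps @ eps_star)                     # EER = 2 eps^T eps*
    return eel, eed, eer
```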
The common thread between these metrics, articulated by Diaz et al. [5], is that for a fixed information need, a difference in exposure between items with the same relevance grade results in disparate impact. The only way to address this inequity is by varying the rankings returned by the system, as with a stochastic policy.

3.4 Discussion
With the same motivation of identifying bias in ranked output, the metrics present different approaches, assumptions,
and implications. Neither PreF_d nor AWRF accounts for relevance, so using these metrics in isolation for evaluation or optimization may reduce ranking quality; they are best suited for measuring the relative fairness of rankings already optimized for utility, particularly when there are large relevant sets. PreF_d also breaks down in more edge cases around the relative sizes of groups than AWRF does, suggesting AWRF should be preferred for this use case.

Most of the metrics in the exposure family account for both fairness and relevance in their final values, e.g. measuring
the extent to which exposure is disproportional to relevance, but use different sources of relevance information (at least
in their original presentations). IAA uses the system’s predicted relevance ŷ; this has the advantage of sidestepping the
sparsity of available relevance judgements (particularly a problem in recommender systems, where the vast majority of
items, including many relevant items, are unrated), but means that if the relevance estimates are biased in an unfair way,
the metric’s assessment of fairness will be impacted. DTR, DIR, and the Expected Exposure metrics use ground-truth
relevance judgements, making them dependent on the availability of ratings or judgements. Most of these metrics can
be implemented with other relevance judgements, although EEL and EER are not applicable to estimated relevance
because the system will generally rank in order of estimated relevance, thus always obtaining ideal rankings.
Some metrics (e.g. AWRF and EE*) represent group membership with alignment vectors, allowing for non-binary
group associations and ambiguity; others (such as most of the PreF family, DP, DTR, and DIR - anything using differences
or ratios) require binary class assignment with an identified protected class, limiting their applicability.
One of the striking things is how deeply similar most of the metrics we consider are. The fundamental construct —
weighted exposure — is the same, and they differ primarily in how they relate exposure to relevance and how they
aggregate and compare exposure distributions.
Two outstanding challenges in applying most of these metrics are handling missing data (both relevance and group
membership) and setting metric parameters. Most metrics depend on parameter values (e.g. stopping probability and
patience parameters for geometric or cascade models) that need to be properly configured, introducing complexity to
their application.
In realistic recommender experiments, relevance and group membership data are missing for many items. For many
metrics, we can treat items with unknown relevance as irrelevant (y = 0), and keep unknown-group items for the
purpose of computing attention weights but exclude them from further analysis, or treat “unknown” as an additional
group identity. This approach is problematic for the PreF family, though, because the metrics treat a list with fewer than
10 known-group items as maximally fair, and the straightforward way of computing Z — make the ranking maximally
unfair by putting all majority items before any protected items — does not work in the face of missing data.
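A small sketch (ours, with hypothetical names) of these conventions: unknown relevance defaults to zero, and unknown group membership is encoded as an extra alignment column so the item still occupies its rank position (and thus its position weight) but can be excluded from the group comparison or reported as its own group.

```python
import numpy as np

def alignment_with_unknown(groups, n_groups):
    """groups: per-item group index, or None when membership is unknown."""
    G = np.zeros((len(groups), n_groups + 1))   # last column is the "unknown" group
    for row, g in enumerate(groups):
        G[row, n_groups if g is None else g] = 1.0
    return G

def relevance_or_zero(judgments, user, item):
    """Treat items with no recorded judgment as irrelevant (y = 0)."""
    return judgments.get((user, item), 0.0)
```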
Additional constructs have been proposed that may address some of these concerns or advance the state of fair
ranking in other ways. Beutel et al. [1] present a pairwise definition of rank fairness that may be easier to apply with
missing relevance and/or group membership data. It requires further adaptation to fit within our current experimental
setting, which we leave for future work.

4 EMPIRICAL COMPARISON
To empirically compare the different metrics we have discussed, we use each of them to measure the fairness of book recommendations with regard to author gender. Due to the difficulties in applying PreF_d discussed in Sec. 3 (missing data and numerous edge-case breakdowns), we exclude that family from our empirical analysis. For all other metrics, we use a continuation probability of 0.5 (for both geometric and cascade models) and a cascade stopping probability of 0.5, following Diaz et al. [5]. For all metrics with a protected group, female authors were $G^+$.
For AWRF, we used the distribution of male and female authors among the set of books in the data set as the population
estimator. For IAA and the EE metrics, we included unknown gender as a third author group.
We use LensKit for Python [6] to generate recommendations for users in the GoodReads book data [16], with data integration, recommendation algorithms, and hyperparameter tuning from Ekstrand and Kluver [7].

[Figure 1: bar charts of the metric values (AWRF, DP, DTR, DIR, IAA, EED, EEL, EER) for the BPR, II, UU, and MF algorithms, broken out by split and, for the EE metrics, by the ERR and RBP user models.]
Fig. 1. Outcomes of fairness metrics. For AWRF, higher is biased towards the protected group; for DIR, DP, and DTR, 1 is neutral, with higher values biased towards the protected group; for EED, EEL, and IAA, larger is less fair overall.

We refer the reader to that paper for full details on the integration strategy, data set statistics (including gender distributions), algorithm configuration strategy, and crucial limitations; we use only the implicit-feedback GoodReads data.
We created two samples of 5000 users each for our experiment; in the ‘Split 1’ sample, users had at least 5 ratings, 1
of which we held out as a test rating, while in ‘Split 5’, users had at least 10 ratings with 5 held out. Author gender information for the books is extracted from the Virtual International Authority File (VIAF)³ as described by Ekstrand and Kluver [7].
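A minimal sketch (ours, with hypothetical column names, not the authors' pipeline) of how such per-user samples and holdouts can be constructed with pandas:

```python
import pandas as pd

def make_split(ratings, min_ratings, n_test, n_users=5000, seed=42):
    """ratings: DataFrame with 'user' and 'item' columns (implicit feedback)."""
    sizes = ratings.groupby('user').size()
    eligible = sizes[sizes >= min_ratings].index
    users = pd.Series(eligible).sample(n_users, random_state=seed)
    sample = ratings[ratings['user'].isin(users)]
    test = sample.groupby('user', group_keys=False).sample(n=n_test, random_state=seed)
    train = sample.drop(test.index)
    return train, test

# 'Split 1': at least 5 ratings per user, 1 held out; 'Split 5': at least 10, 5 held out.
# split1 = make_split(ratings, min_ratings=5, n_test=1)
# split5 = make_split(ratings, min_ratings=10, n_test=5)
```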
We measured the fairness of the recommendations produced for these users by four collaborative filtering (CF) algorithms: user-based CF (UU, [8]); item-based CF (II, [4]); matrix factorization (MF, [10]); and Bayesian Personalized Ranking (BPR, [11]). Figure 1 shows the outcomes of the fairness metrics on the generated recommendations.
The differences in the definition of fair ranking and in the meaning of direction across these metrics make it difficult to
interpret and directly compare results. A few points are apparent, though:
• The algorithms do not show large differences on most metrics, with the notable exception of II.
• AWRF and the D* family both agree that item-item is the most advantageous to female authors (as larger values
are more biased towards the protected group). They disagree, however, as to whether it is because it is more fair
(the D* family, approaching 1) or whether it is “unfairly” biased towards female authors (AWRF of about 4; recall
that AWRF is a z-statistic).
• IAA shows one algorithm standing out, but it is a different algorithm for each split, even though the metric does
not use the test data and thus should be unaffected by the size of the relevant set.
• EE is heavily influenced by the size of the relevant set, and this was more important than the choice of user model.⁴
There is no clear consensus or agreement between the metrics on the relative fairness of the algorithms we tested. In
this analysis, we have attempted to follow the original definitions of each metric as closely as possible; as seen in Sec.
3, however, there are numerous places where metrics could be made more similar to each other (e.g. using the same
position weights). For example, AWRF uses a geometric decay, while D* uses logarithmic rank discounting. Future work
will explore the relative impact of these component decisions versus other aspects of the design of these metrics, and
seek to better understand their implications and relative theoretical justifiability to provide a more robust foundation
for measuring fair ranking.

³ http://viaf.org/viaf/data/
⁴ Single-item relevant sets do not work at all for EE in its original formulation; taking the expectation over multiple users enables it to work, but has drawbacks — shared with IAA — that we are still exploring.

5 CONCLUSION
This paper presents a comparative analysis of several fairness metrics recently introduced to measure fair ranking.
We discussed the metric formulations and implications in an integrated framework and presented the first (to our
knowledge) empirical comparison of fair ranking metrics for multiple recommendation algorithms with a common
data set and fairness goal. Our results did not show any consensus between metrics. They generally agreed that one algorithm (II) was different in its fairness dynamics from the others, but disagreed on how that related to equity.
This work opens up several directions for future research. An immediate first advance is to adapt the metrics to be
more similar to each other and study the effects of individual metric design decisions (such as the position weighting or
the source of relevance data) on metric behavior. There is also work to do on handling missing or sparse relevance information for items and on allowing ambiguous or multiple group associations. Furthermore, considering alternative ranking models
may introduce more complexity in measuring fair ranking.
Significant progress has been made in the last 2–3 years on measuring the fairness of rankings, but more work is
needed in order to understand how best to design and apply these metrics.

ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grant No. IIS 17-51278.

REFERENCES
[1] Alex Beutel, Jilin Chen, Tulsee Doshi, Hai Qian, Li Wei, Yi Wu, Lukasz Heldt, Zhe Zhao, Lichan Hong, Ed H Chi, et al. 2019. Fairness in recommendation
ranking through pairwise comparisons. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
2212–2220.
[2] Asia J Biega, Krishna P Gummadi, and Gerhard Weikum. 2018. Equity of attention: Amortizing individual fairness in rankings. In Proceedings of the
41st International ACM SIGIR Conference on Research and Development in Information Retrieval. 405–414.
[3] Robin Burke. 2017. Multisided Fairness for Recommendation. (July 2017). arXiv:1707.00093 [cs.CY] http://arxiv.org/abs/1707.00093
[4] Mukund Deshpande and George Karypis. 2004. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems (TOIS) 22,
1 (2004), 143–177.
[5] Fernando Diaz, Bhaskar Mitra, Michael D Ekstrand, Asia J Biega, and Ben Carterette. 2020. Evaluating Stochastic Rankings with Expected Exposure.
arXiv preprint arXiv:2004.13157 (2020).
[6] Michael D Ekstrand. 2018. The LKPY package for recommender systems experiments: Next-generation tools and lessons learned from the LensKit
project. arXiv preprint arXiv:1809.03125 (2018).
[7] Michael D. Ekstrand and Daniel Kluver. 2020. Exploring Author Gender in Book Rating and Recommendation. CoRR abs/1808.07586v2 (2020).
arXiv:1808.07586v2 https://md.ekstrandom.net/pubs/bag-extended
[8] Jonathan L. Herlocker, Joseph A. Konstan, Al Borchers, and John Riedl. 1999. An Algorithmic Framework for Performing Collaborative Filtering. In
Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing
Machinery, New York, NY, USA, 230–237.
[9] Shira Mitchell, Eric Potash, Solon Barocas, Alexander D’Amour, and Kristian Lum. 2018. Prediction-Based Decisions and Fairness: A Catalogue of
Choices, Assumptions, and Definitions. (Nov. 2018). arXiv:1811.07867 [stat.AP] http://arxiv.org/abs/1811.07867
[10] István Pilászy, Dávid Zibriczky, and Domonkos Tikk. 2010. Fast als-based matrix factorization for explicit and implicit feedback datasets. In
Proceedings of the fourth ACM conference on Recommender systems. 71–78.
[11] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian personalized ranking from implicit feedback.
arXiv preprint arXiv:1205.2618 (2012).
[12] Piotr Sapiezynski, Wesley Zeng, Ronald E Robertson, Alan Mislove, and Christo Wilson. 2019. Quantifying the Impact of User Attention on Fair
Group Representation in Ranked Lists. In Companion Proceedings of The 2019 World Wide Web Conference. 553–562.
[13] A Selbst and S Barocas. 2016. Big Data’s Disparate Impact. California Law Review 104 (30 September 2016), 671–732.
[14] Andrew D Selbst, Danah Boyd, Sorelle A Friedler, Suresh Venkatasubramanian, and Janet Vertesi. 2019. Fairness and Abstraction in Sociotechnical
Systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency - FAT* ’19 (Atlanta, GA, USA). ACM Press, New York, New
York, USA, 59–68. https://doi.org/10.1145/3287560.3287598
[15] Ashudeep Singh and Thorsten Joachims. 2018. Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining. 2219–2228.

[16] Mengting Wan and Julian McAuley. 2018. Item recommendation on monotonic behavior chains. In Proceedings of the 12th ACM Conference on
Recommender Systems. 86–94.
[17] A Xiang and I Raji. 2019. On the Legal Compatibility of Fairness Definitions. NeurIPS (2019).
[18] Ke Yang and Julia Stoyanovich. 2017. Measuring fairness in ranked outputs. In Proceedings of the 29th International Conference on Scientific and
Statistical Database Management. 1–6.
