How Do Recommendation Models Amplify Popularity Bias? An Analysis From The Spectral Perspective
Abstract
Recommendation Systems (RS) are often plagued by popularity bias. When training a recommendation model on a typically long-tailed dataset, the model tends not only to inherit this bias but to exacerbate it, resulting in the over-representation
of popular items in the recommendation lists. This study conducts comprehensive
empirical and theoretical analyses to expose the root causes of this phenomenon,
yielding two core insights: 1) Item popularity is memorized in the principal spec-
trum of the score matrix predicted by the recommendation model; 2) The dimension
collapse phenomenon amplifies the relative prominence of the principal spectrum,
thereby intensifying the popularity bias.
Building on these insights, we propose a novel debiasing strategy that leverages
a spectral norm regularizer to penalize the magnitude of the principal singular
value. We have developed an efficient algorithm to expedite the calculation of the
spectral norm by exploiting the spectral property of the score matrix. Extensive
experiments across seven real-world datasets and three testing paradigms have been
conducted to validate the superiority of the proposed method.
1 Introduction
Recommender Systems (RS), with their capability to offer personalized suggestions, have found
applications across various domains [18, 44, 72]. Nevertheless, their effectiveness in personalization
is significantly compromised by popularity bias [12]. This bias emerges when recommendation data
showcases a long-tailed distribution of item interaction frequencies. Subsequently, recommendation
models trained on such data tend to inherit and even amplify this bias, leading to an overwhelming
presence of popular items in recommendation results [56, 68, 74]. This notorious effect not only
undermines the accuracy and fairness of recommendation [5, 4], but also exacerbates the Matthew
Effect and the filter bubble through the user-system feedback loop [37, 23, 22].
[Figure 1 omitted: (a) the principal singular vector q1 captures item popularity r; (b) dimension collapse augments the relative prominence of the principal singular vector q1.]
Figure 1: Illustration of two core insights.
Given the detrimental impact of popularity bias amplification, a thorough understanding of its root
causes is crucial. Although some recent studies have endeavored to elucidate this, their investigations
exhibit significant limitations: 1) Some researchers [56, 68, 55] have investigated popularity bias
amplification through causal graphs. However, they merely postulate causal relations between item
popularity and model predictions without deeply exploring the underlying mechanisms behind the
relations. Moreover, their analyses depend on hypothesized causal graphs, which may be flawed due
to the widespread presence of unmeasured confounders [21, 67]. 2) Other studies [73, 69, 11, 65]
revealed that graph neural networks (GNNs) can exacerbate popularity bias. However, these analyses
are limited to GNNs as applied in specific graph-based recommendation models, rather than the
mechanisms of generic recommendation models.
To bridge this research gap, we undertake extensive theoretical and empirical studies on popularity
bias amplification. By investigating the spectrum of the ranking score matrix over all users and items
predicted by recommendation models, we present the following insights:
1) Memorization Effect. When training a recommendation model on long-tailed data, the information
of item popularity is memorized in the principal spectrum (Figure 1(a)). Empirically, we observe that
the principal singular vector of the score matrix closely aligns with item popularity, with a cosine
similarity consistently exceeding 0.98 across multiple representative recommendation models and
datasets. Theoretically, we derive the lower bound of this cosine similarity, demonstrating that the
similarity converges to one for highly long-tailed training datasets.
2) Amplification Effect. The phenomenon known as dimension collapse augments the relative prominence of the principal spectrum that captures item popularity, leading to bias amplification
(Figure 1(b)). We reveal that dimension collapse is pervasive in RS due to two primary reasons: i)
The deliberate low-rank setting of user/item embeddings, employed either to conserve memory or
to counteract overfitting, amplifies the impact of the principal spectrum; ii) The inherent training
dynamics of gradient-based optimization prioritize the learning of the principal dimension, while the
singular values of other dimensions are easily underestimated. Our further theoretical and empirical
analyses establish the relationship between dimension collapse and popularity bias — larger principal singular values relative to the other singular values lead to more popular items in the recommendations.
Our analysis not only explains the underlying mechanisms of bias amplification but also paves the way
for the development of an innovative strategy to counteract this effect. Recognizing that the essence
of this amplification lies in the undue contribution of the principal spectrum, we introduce a spectral
norm regularizer [60] aimed at directly restraining the magnitude of the principal singular value.
However, the direct computation of the spectral norm necessitates exhaustive processing of a large
score matrix and numerous iterative procedures [60, 52], inducing significant computational costs. To
address this challenge, we further develop an accelerated strategy by leveraging the intrinsic spectrum
properties of the score matrix and matrix transformation techniques. Consequently, our method
effectively mitigates popularity bias while imposing limited computational overhead. It is worth
noting that our method is model-agnostic and can be easily integrated into various recommendation
models. We conduct extensive experiments on seven real-world datasets with three representative
backbones, validating the superiority of our proposed method over existing debiasing methods.
2 Preliminaries
Task Formulation. This work mainly focuses on collaborative filtering (CF) [58], a widely-used
recommendation scenario. Consider a RS with a user set U and an item set I. Let n and m denote the
total number of users and items. Historical interactions can be expressed by a matrix Y ∈ {0, 1}n×m ,
where the element yui indicates whether user u has interacted with item i (e.g., click). For convenience, we define the number of interactions of an item as ri = ∑u∈U yui, and collect the ri over all items into a popularity vector r. The RS aims to suggest items to users based on their potential interests.
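For concreteness, the popularity vector can be obtained as the column sums of the interaction matrix. A minimal NumPy sketch on a hypothetical toy matrix:

```python
import numpy as np

# Toy interaction matrix Y (n = 4 users, m = 3 items); entries are hypothetical.
Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [1, 0, 0]])

# Item popularity r_i = sum_u y_ui, i.e., r = Y^T e with e a vector of ones.
e = np.ones(Y.shape[0])
r = Y.T @ e
print(r)  # [4. 2. 1.] -- item 0 is the most popular
```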
Recommendation Models. Embedding-based models are widely utilized in RS [58]. Such models
convert user/item attributes (e.g., IDs) into d-dimensional representations (uu , vi ), and make predic-
tions using the embedding similarity [58]. Given that the inner product is a conventional similarity
metric due to its efficiency in retrieval and superior performance [57, 61, 34], this work also focuses
on the inner product for analysis. Specifically, the model’s predicted scores can be formulated as
ŷui = µ(uu⊤vi), where µ(.) denotes an activation function like Sigmoid. ŷui represents a user’s
preference for an item, which is then used for ranking to generate recommendations. For clarity
of presentation, we also employ matrix notation. Let matrices Ŷ, U, V represent scores over all
user-item combinations, embeddings over all users and items, respectively. Model predictions can be
succinctly expressed as Ŷ = µ(UV⊤ ).
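The scoring scheme above can be sketched as follows; the embedding values are random placeholders and the Sigmoid is one possible choice of µ(.):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, m, d = 4, 3, 8             # toy numbers of users, items, embedding size
U = rng.normal(size=(n, d))   # user embeddings u_u (rows)
V = rng.normal(size=(m, d))   # item embeddings v_i (rows)

# Predicted score matrix Y_hat = mu(U V^T); each row is ranked per user.
Y_hat = sigmoid(U @ V.T)
top1 = Y_hat.argmax(axis=1)   # top-1 recommended item for each user
```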
Objective Functions. Common choices of loss functions for training a recommendation model
include point-wise loss such as BCE and MSE [45], and pair-wise loss like BPR [46]. It is worth
noting that BPR can be reconceptualized as a specialized pointwise loss if we construct hyper-item
spaces and consider a pair of items as a unique hyper-item (please refer to Appendix A.1.1 for more
details). As such, for convenience, this work focuses on point-wise loss for analysis. But we also
discuss how our proposed debiased method adapts to BPR loss (Appendix A.1.2) and validate its
effectiveness through experiments (Section 5).
Popularity Bias Amplification. Items’ interaction frequency in recommendation data often follows a long-tailed distribution [8, 17, 51]. For instance, in the Douban dataset, a mere 20% of the most popular items account for 86.3% of all interactions. When models are trained on such skewed data, they tend to absorb and amplify this bias, frequently over-prioritizing popular items in their recommendations. For example, on the Douban dataset with the MF model, the 20% most popular items occupy over 99.7% of the recommendation slots, while a mere 0.6% of the most popular items occupy more than 63% (cf. Appendix B.1 for more examples). This notorious effect significantly impacts recommendation accuracy and fairness, and may even pose detrimental effects on the entire ecosystem of RS [12]. Thus, understanding the underlying mechanisms behind this effect is crucial.
3.1.1 Empirical Study. To discern how recommendation models memorize item popularity, we designed the following experiment: 1) We trained three representative recommendation models, MF [38], LightGCN [27] and XSimGCL [62], on three real-world datasets (cf. Section 5 for experimental details); 2) We then performed SVD on the predicted score matrix, Ŷ = PΣQ⊤ = ∑1≤k≤L σk pk qk⊤, where L = min(n, m) and σ1 ≥ σ2 ≥ ... ≥ σL. We further computed the cosine similarity between the right principal singular vector q1 and the item popularity r.
The outcomes are showcased in Table 1. From these experiments, we draw an impressive observation:
Observation 1. The principal right singular vector q1 of the matrix Ŷ aligns significantly with the
item popularity r. The cosine similarity consistently surpasses 0.98 over multiple recommendation
models and datasets.
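The measurement behind Observation 1 can be reproduced in miniature. The sketch below substitutes a synthetic interaction matrix with power-law item popularity for a trained model’s score matrix (an assumption made purely for illustration; the paper’s numbers come from trained MF, LightGCN and XSimGCL models):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 100

# Synthetic stand-in: interactions drawn from power-law item popularity.
# (Illustration only; the reported similarities come from trained models.)
p = 0.8 / np.arange(1, m + 1)                 # interaction probability per item
Y_hat = (rng.uniform(size=(n, m)) < p).astype(float)

r = Y_hat.sum(axis=0)                          # item popularity (column sums)
_, sigma, QT = np.linalg.svd(Y_hat, full_matrices=False)
q1 = QT[0]                                     # principal right singular vector

# |cos(r, q1)|; q1 is unit-norm, abs() removes the SVD sign ambiguity.
cos = abs(r @ q1) / np.linalg.norm(r)
print(round(float(cos), 3))                    # typically close to 1 for long-tailed data
```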
Given the orthogonal nature of different singular vectors, we can deduce that item popularity is almost entirely captured in the principal spectrum. This intriguing phenomenon elucidates how the recommendation model assimilates item popularity from the data and how this popularity influences recommendation outcomes.

Table 1: The cosine similarity between the principal singular vector (q1) and the item popularity (r) under different backbones and loss functions.

Backbone   | Movielens           | Douban              | Globo
           | MSE   BCE   BPR     | MSE   BCE   BPR     | MSE   BCE   BPR
MF         | 0.993 0.988 0.991   | 0.992 0.991 0.993   | 0.993 0.989 0.992
LightGCN   | 0.992 0.991 0.992   | 0.990 0.988 0.990   | 0.992 0.990 0.991
XSimGCL    | 0.998 0.994 0.995   | 0.991 0.990 0.992   | 0.992 0.985 0.989
3.1.2 Theoretical Analyses. Prior to the theoretical validation of observation 1, we posit a power-law
hypothesis pertaining to recommendation data:
Hypothesis 1. The interaction frequency of items in recommendation data follows a power-law distribution (a.k.a. Zipf’s law) described by rg ∝ g^(−α).
[Figure 2 omitted: (a) exploring dimensions; (b) exploring different epochs; (c) exploring singular values.]
Figure 2: Illustration of how dimension collapse impacts popularity bias on Movielens: (a)-(b) the proportion of popular items in recommendations and the ratio of the largest singular value (σ1² / ∑1≤k≤L σk²) with varying embedding dimensions and training epochs, respectively; (c) how singular values evolve during training.
Here rg signifies the popularity of the g-th most popular item, and α is a shape parameter indicating
the distribution’s slope. Power-law, as a typical long-tailed distribution, is prevalent across various
natural and man-made phenomena [16]. Recent studies assert that item popularity in RS also aligns
with this ubiquitous principle [8, 17, 51]. Then we have the following important theorem:
Theorem 1 (Popularity Memorization Effect). Given an embedding-based recommendation model
with sufficient capacity, when training the model on the data with power-law item popularity, the
cosine similarity between item popularity r and the principal singular vector q1 of the predicted
score matrix is bounded by:

cos(r, q1) ≥ (σ1 / √(rmax ζ(2α))) · √(1 − rmax(ζ(α) − 1) / σ1²)    (1)

For α > 2, this can be further bounded by:

cos(r, q1) ≥ √((2 − ζ(α)) / ζ(2α))    (2)

where rmax is the popularity of the most popular item, and ζ(α) is the Riemann zeta function, ζ(α) = ∑_{j=1}^{∞} 1/j^α.
Proof can be found in Appendix A.2. Notably, as the long-tailed nature of item popularity intensifies
(i.e., α → ∞ suggesting ζ(α) → 1), the right side of Eq. (2) converges to one, implying a near-perfect
alignment between r and q1. Even when the data isn’t markedly skewed and has a considerable ζ(α), we typically observe σ1² to vastly exceed rmax, e.g., 5.6 × 10⁵ vs. 4.6 × 10³ in the Movielens dataset (with more examples presented in Appendix B.2). Thus, from Eq. (1), a high similarity between r
and q1 emerges. This theorem provides theoretical validation for our observation 1.
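The behavior of the bound in Eq. (2) is easy to check numerically. The sketch below approximates ζ(.) with a truncated series (an approximation introduced here for illustration):

```python
import math

def zeta(s, terms=100_000):
    # Truncated-series approximation of the Riemann zeta function (s > 1);
    # the neglected tail is below 1/(2 * terms**2) for s >= 3.
    return sum(j ** -s for j in range(1, terms + 1))

def cos_lower_bound(alpha):
    # Eq. (2): lower bound on cos(r, q1), valid for alpha > 2.
    return math.sqrt((2 - zeta(alpha)) / zeta(2 * alpha))

for a in (3, 5, 10):
    print(a, round(cos_lower_bound(a), 4))  # approaches 1 as alpha grows (~0.886 at alpha = 3)
```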
Earlier discussions illuminate that the principal spectrum memorizes item popularity. In this sub-
section, we reveal the phenomenon of dimension collapse in recommendation systems (RS), which
amplifies the effect of the principal spectrum, leading to popularity bias amplification.
3.2.1 Empirical Study. The occurrence of dimension collapse in RS is largely attributable to two
factors: 1) explicit low-rank configuration of user/item embeddings [39, 27], and 2) intrinsic training
dynamics associated with gradient-based optimization [15, 48, 6]. Here, we present experiments to
validate these points and examine their impacts on popularity bias.
Impact of Low-Rank Configuration. Figure 2(a) displays the proportion of popular items in recommendations from well-trained MF models with varying embedding dimensions d. We also present the magnitude of the largest singular value σ1 compared with the other singular values. We report σ1² / ∑1≤k≤L σk² as it is easily calculable, where the denominator equals the sum of the diagonal elements of Ŷ⊤Ŷ. We observe:
Observation 2. As the embedding dimension d decreases, the relative prominence of the principal singular value increases (σ1² / ∑1≤k≤L σk² ↑) and the recommendation increasingly favors popular items.
This observation reveals the impact of low-rank embeddings. A smaller d squeezes the dimensions
(causing singular values of more dimensions to become zero), thereby relatively amplifying the effect
of the principal spectrum. Consequently, item popularity contributes more significantly to ranking,
resulting in more severe popularity bias.
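This squeezing effect follows directly from the definition of the ratio: truncating to rank d removes mass from the denominator while leaving σ1 untouched. A small sketch, using a random matrix as a stand-in for Ŷ:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for a predicted score matrix; the argument holds for any matrix.
Y_hat = rng.normal(size=(50, 30)) + 1.0

sigma = np.linalg.svd(Y_hat, compute_uv=False)   # singular values, descending

# A rank-d embedding model can realize at most d non-zero singular values,
# so truncating to rank d raises the share of the principal one.
dims = (30, 8, 2, 1)
ratios = [float(sigma[0] ** 2 / np.sum(sigma[:d] ** 2)) for d in dims]
for d, ratio in zip(dims, ratios):
    print(d, round(ratio, 3))                    # the ratio grows as d shrinks
```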
Dimension Collapse from Gradient Optimization. Figure 2(c) illustrates the evolution of singular values as training progresses using a gradient-based optimizer; Figure 2(b) offers a dynamic view of popularity bias and the ratio σ1² / ∑1≤k≤L σk² over the training procedure. We observe:
Observation 3. The principal singular value grows preferentially and swiftly, while others exhibit a
more gradual increment. Notably, many singular values appear to be far from convergence even at
the end of the training process. Accordingly, popularity bias is severe at the beginning but exhibits a
relative decline as training advances. But even at the end of training, unless an extensive number of
epochs are employed (which could result in computational overhead and potential over-fitting), the
bias remains pronounced.
This phenomenon reveals the dynamics of singular values during gradient optimization — the principal dimension is prioritized, while the singular values of other dimensions are easily underestimated. This
inherent mechanism could readily lead to dimension collapse, relatively enhancing the impact of the
principal spectrum, and thereby inducing popularity bias.
3.2.2 Theoretical Analyses. In this subsection, we focus on establishing a theoretical relationship
between singular values and the ratio of popular items in recommendations. For readers interested in
the theoretical support of the impact of gradient optimization, we refer them to Appendix A.4, which is relatively straightforward, invoking recent gradient theory [15, 48]. For convenience, our
analysis here concentrates on the ratio of the most popular item in top-1 recommendations. We have:
Theorem 2 (Popularity bias amplification). Given hypothesis 1 and nearly perfect alignment between
q1 and r, the ratio of the most popular item in top-1 recommendations over all users is bounded by:
η ≥ (1/n) · ϕ( √(2ζ(2α)) / (1 − 2^(−α)) · (∑1≤k≤L σk / σ1 − 1) )    (3)

where ϕ(a) = ∑u∈U I[p1u > a] is an inverse cumulative function counting the number of elements p1u in the left principal singular vector p1 that exceed a given value a, and I[.] denotes an indicator function.
The detailed proof is available in Appendix A.3. This theorem vividly showcases the influence of dimension collapse on popularity bias. Essentially, as dimension collapse intensifies the relative prominence of the principal singular value (i.e., σ1 / ∑1≤k≤L σk ↑), the input of the function ϕ(.) decreases (i.e., √(2ζ(2α)) / (1 − 2^(−α)) · (∑1≤k≤L σk / σ1 − 1) ↓). Given the monotonically decreasing nature of ϕ(.), dimension collapse thus escalates the ratio of the most popular items in recommendations. Interestingly, the theorem also illustrates the impact of a long-tailed distribution on popularity bias. A larger α (indicating a more skewed item popularity distribution) decreases the value of √(2ζ(2α)) / (1 − 2^(−α)), further elevating the lower bound of the ratio and intensifying the bias.
4 Proposed Method
4.1 ReSN: Regulation with Spectral Norm
The above analyses elucidate the essence of the popularity bias amplification — the undue influence
of the principal spectrum. To counteract this, we propose ReSN with leveraging Spectral Norm
Regularizer to penalize the magnitude of principal singular value:
LReSN = LR(Y, Ŷ) + β ||Ŷ||₂²    (4)

where LR(Y, Ŷ) is the original recommendation loss, ||.||₂ denotes the spectral norm of a matrix, measuring its principal singular value, and β controls the contribution of the regularizer.
However, there are practical challenges: 1) the n×m dimensional matrix Ŷ can become exceptionally
large, often comprising billions of entries, making direct calculations computationally untenable; 2)
Existing methods to determine the gradient of the spectral norm are iterative [60, 52], which further
adds computational overhead.
To circumvent these challenges, we make two refinements. Firstly, given the alignment of the principal singular vector q1 with item popularity r, the calculation of the spectral norm can be simplified as ||Ŷ||₂² = ||Ŷq1||² ≈ ||Ŷr||²/||r||², where ||.|| denotes the L2-norm of a vector. This transforms the computation of the spectral norm of a matrix into a simple L2-norm of a vector, avoiding iterative algorithms by leveraging the singular vector property. Further, the item popularity r can be quickly computed via r = Y⊤e, where e represents an n-dimensional vector filled with ones.
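A quick numerical sanity check of this simplification, using a synthetic near-rank-1 score matrix whose principal direction is built to align with a hypothetical popularity vector (mirroring Observation 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 100

# Hypothetical score matrix whose principal right direction aligns with a
# power-law popularity vector r, as Observation 1 suggests for trained models.
r = 1.0 / np.arange(1, m + 1)
p = rng.uniform(0.5, 1.5, size=n)                # per-user scale factor
Y_hat = np.outer(p, r) + 1e-3 * rng.normal(size=(n, m))

# Exact squared spectral norm (needs a full SVD of the n x m matrix)...
exact = np.linalg.norm(Y_hat, 2) ** 2
# ...versus the cheap surrogate ||Y_hat r||^2 / ||r||^2 from the text.
approx = np.linalg.norm(Y_hat @ r) ** 2 / np.linalg.norm(r) ** 2
print(exact, approx)   # the two agree closely when q1 aligns with r
```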
Secondly, we exploit the low-rank nature of the matrix Ŷ. For models based on embeddings, Ŷ
can be expressed as Ŷ = µ(UV⊤ ), where U and V represent the embeddings associated with
users and items, respectively, and µ(.) designates an activation function. Our approach turns to
penalize the spectral norm of the matrix before the introduction of the activation function. This
is motivated by the ease of computation: ||UV⊤ ||22 = ||U(V⊤ q̃1 )||2 , where q̃1 denotes the right
principal vector of the matrix UV⊤. By adopting this method, we circumvent the computationally intensive task of processing the entire matrix Ŷ. Nonetheless, this method introduces a challenge: accurately computing q̃1, since it doesn’t inherently align with item popularity. To rectify this, we may simply mirror the calculation q1 ← Y⊤e/||Y⊤e|| with q̃1 ← VU⊤e/||VU⊤e||. This approach is suggested by our Observation 1 and Theorem 1: a matrix’s principal singular vector tends to align with its column-sum vector, especially when that vector exhibits a long-tailed distribution. The discussions presented in Appendix B.3 validate the precision of this strategy. In essence, our ReSN optimizes the following
loss function:
L̃ReSN = LR(Y, Ŷ) + (β / ||VU⊤e||²) · ||UV⊤VU⊤e||²    (5)
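A minimal sketch of the regularizer term in Eq. (5) (toy embeddings; the recommendation loss and training loop are omitted), computed right-to-left so that only O((n + m)d) matrix-vector products occur:

```python
import numpy as np

def resn_regularizer(U, V, beta):
    # Regularizer term of Eq. (5): beta * ||U V^T V U^T e||^2 / ||V U^T e||^2,
    # evaluated right-to-left so only matrix-vector products occur.
    e = np.ones(U.shape[0])
    t = V @ (U.T @ e)          # V U^T e: unnormalized surrogate for q~1
    s = U @ (V.T @ t)          # U V^T (V U^T e)
    return beta * (s @ s) / (t @ t)

rng = np.random.default_rng(0)
U = rng.normal(size=(100, 16))  # toy user embeddings
V = rng.normal(size=(50, 16))   # toy item embeddings
reg = resn_regularizer(U, V, beta=0.1)
# reg never exceeds beta * ||U V^T||_2^2, since q~1 only approximates
# the true principal singular vector.
```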
4.2 Discussions
The proposed ReSN has the following merits: 1) Model-Agnostic. ReSN is model-agnostic and easy to implement. Given that it introduces merely a regularization term, it can be easily plugged into existing embedding-based methods with minimal code augmentation. 2) Efficiency. The regularizer can be computed quickly from right to left — it predominantly requires the multiplication of an n × d (or m × d) matrix with a vector. With a time complexity of O((n + m)d), ReSN is highly efficient; Appendix B.5 also provides empirical evidence that the additional time for calculating the regularizer is negligible. 3) Suitable for BPR Loss. As delineated in Section 2, while
BPR can be regarded as a specialized point-wise loss, it involves the concept of hyper-items. Our
theoretical analyses presented in Appendix A.1.2 demonstrate ReSN can be a logical regularizer even
for the BPR loss.
Compared with Other Regularizers: 1) Recent studies [9, 69, 53] have employed regularizers to alleviate the dimensional collapse of user/item embeddings. Our ReSN diverges from these methods in two key aspects. Firstly, ReSN imposes constraints directly on the prediction matrix, unlike the embedding-matrix constraints utilized in these methods; this distinction is significant due to the inherent spectral gap between the embeddings and the prediction matrix. Secondly, ReSN explicitly modulates the influence of the principal spectrum that captures popularity information, while these methods mainly focus on promoting embedding uniformity. ReSN directly and solely mitigates the impact of the memorized popularity signal, thus demonstrating high efficacy in mitigating popularity bias, whereas the others may disrupt the spectral structure of the prediction, potentially compromising model accuracy. 2) Other researchers have introduced various regularizers tailored to combat popularity bias [74, 36, 47]. However, these approaches are often heuristic, applying strong constraints to model predictions that may break the model’s original spectrum. While this can mitigate popularity bias, it may also impair the model’s ability to capture other useful signals, significantly compromising recommendation accuracy. In contrast, our ReSN is a light and theoretically grounded approach — it only modulates the influence of the principal spectrum.
Table 2: Performance comparison in terms of NDCG between ReSN and other baselines across seven datasets and three testing paradigms. “Com” (Common) denotes the paradigm where the training and test datasets are partitioned randomly; “Deb” (Debiased) denotes the paradigm where a debiased test dataset is constructed based on item popularity; “Uni” (Uniform-exposure) denotes the paradigm where the test data is uniformly exposed. The best result is bolded and the runner-up is underlined. The mark ‘*’ denotes that the improvement achieved by ReSN over the best baseline is significant with p < 0.05.
Movielens Douban Yelp2018 Gowalla Globo Yahoo Coat
Com Deb Com Deb Com Deb Com Deb Com Deb Uni Uni
MF 0.3572 0.1490 0.0440 0.0116 0.0416 0.0164 0.1182 0.0438 0.1709 0.0028 0.6672 0.5551
Zerosum 0.3309 0.1411 0.0434 0.0110 0.0415 0.0137 0.1063 0.0421 0.1630 0.0036 0.6665 0.5633
MACR 0.3732 0.1647 0.0441 0.0145 0.0404 0.0208 0.1107 0.0545 0.1782 0.0253 0.6714 0.5661
PDA 0.3688 0.1662 0.0446 0.0171 0.0437 0.0229 0.1283 0.0675 0.1725 0.0243 0.6756 0.5676
InvCF 0.3723 0.1567 0.0450 0.0152 0.0433 0.0183 0.1302 0.0592 0.1671 0.0194 0.6519 0.5715
IPL 0.3618 0.1621 0.0442 0.0173 0.0419 0.0219 0.1318 0.0623 0.1715 0.0203 0.6691 0.5602
ReSN 0.3857* 0.1745* 0.0456* 0.0186* 0.0445* 0.0254* 0.1343* 0.0703* 0.1682 0.0256* 0.6792* 0.5871*
5 Experiments
Datasets and Metrics. We adopt seven real-world datasets, Yelp2018 [27], Douban [50], Movielens [63], Gowalla [26], Globo [20], Yahoo!R3 [38] and Coat [49], for evaluating model performance. Details about these datasets are provided in Appendix C.1.
We adopt three representative testing paradigms for comprehensive evaluations: 1) Common: We
employ the conventional testing paradigm in RS, wherein the datasets are randomly partitioned into
training (70%), validation (10%), and testing (20%). We also report the accuracy-fairness trade-off in
this setting. 2) Debiased: Closely following [56, 7, 71], we sample a debiased test set where items
are uniformly distributed, aiming to evaluate the model’s efficacy in mitigating popularity bias. 3)
Uniform-exposure: We also adopt the uniform-exposure paradigm for model testing, following the recent work [64]. Notably, the datasets Yahoo!R3 and Coat each contain a small test set collected through a
random recommendation policy. Such data isolate the popularity bias from uneven exposure, offering
a more precise estimation of user preferences. Consequently, we train our recommendation model on
conventionally biased data and then test it on these uniformly-exposed data.
For evaluation metrics, we adopt the widely-used NDCG@K for evaluating accuracy [32]. We adopt K = 5 for the Yahoo and Coat datasets and K = 20 for the other datasets, following recent work [27, 64, 62]. We observe similar results with other metrics. We also employ the ratio of popular/unpopular items to illustrate the severity of popularity bias in recommendations. Here we closely follow recent work [68] to define popular and unpopular items: we sort the items by popularity in descending order and divide them into five groups such that the aggregated popularity of the items within each group is the same. We define the items in the most popular groups as popular items, and the others as unpopular.
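One way to implement this grouping is sketched below; as an assumption for illustration, only the first (most popular) of the five groups is labelled popular:

```python
import numpy as np

def popular_mask(r, num_groups=5):
    # Sort items by popularity (descending) and cut where the first group's
    # aggregated popularity reaches 1/num_groups of the total.
    order = np.argsort(-r)
    cum = np.cumsum(r[order])
    cut = int(np.searchsorted(cum, cum[-1] / num_groups, side="left")) + 1
    mask = np.zeros(len(r), dtype=bool)
    mask[order[:cut]] = True
    return mask

# Hypothetical long-tailed popularity vector (total popularity = 200).
r = np.array([100., 50., 25., 12., 6., 3., 2., 1., 1.])
mask = popular_mask(r)
print(mask.sum(), "popular item(s) out of", len(r))  # the head item alone holds 1/5 of the mass
```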
Baselines. The following methods are compared: 1) MACR (KDD’21 [56]), PDA (SIGIR’21
[68]): the representative causality-based debiasing methods, which posit a causal graph [42] for the
recommendation procedure and leverage causal inference to mitigate popularity bias accordingly;
2) InvCF (WWW’23 [64]): the SOTA method that addresses popularity bias by disentangling
the popularity from user preference. 3) Zerosum (RecSys’22 [47]) and IPL (SIGIR’23 [36]): representative regularizer-based methods, which penalize score differences or constrain the ratio of the predicted preference to the exposure.
For fair comparisons, we implement all compared methods with uniform MF backbone and MSE
loss. We also explore the performance with other backbones and losses in subsection 5.3. Besides
above baselines, we also compare our method with methods for mitigating dimension collapse, including nCL [9] and DirectAU [53], and, when using GNN-based backbones, with debiasing methods tailored for GNNs, including APDA [73] and GCFlogdet [69].
Figure 3: Pareto curves of the compared methods (MF, ReSN, InvCF, PDA, Zerosum, MACR, IPL) on Movielens-1M and Douban, illustrating the trade-off between accuracy (NDCG@20) and fairness (ratio of non-popular items).

Figure 4: NDCG@20 comparison with methods for addressing Dimension Collapse.

           Movielens  Douban  Gowalla
MF         0.1529     0.0116  0.0438
nCL        0.1572     0.0112  0.0451
DirectAU   0.0169     0.0131  0.0622
ReSN       0.1788     0.0188  0.0712
Parameter Settings. The embedding dimension d is 256, while other dimensions are explored in Appendix B.4. Grid search is utilized to find the optimal hyperparameters. More details are given in Appendix C.2.
Comparison under three testing paradigms. Table 2 showcases the NDCG@K comparison
across seven datasets over three testing paradigms. Under the Common testing paradigm, our
ReSN, with few exceptions, consistently outperforms compared methods. This superior performance
can be attributed to the rigorous theoretical foundations of ReSN, which pinpoint and address the
root cause of bias amplification. By curbing this bias amplification, ReSN achieves significant
improvements in recommendation accuracy. Transitioning to the Debiased and Uniform-exposure
testing paradigms, the improvements by ReSN become even more impressive, demonstrating its
effectiveness in mitigating popularity bias.
Exploring Accuracy-fairness Trade-off. Given the conventional accuracy-fairness trade-off observed in RS, we delve deeper into examining this effect across various methods. After training the various methods with differing hyper-parameters (details of hyper-parameter tuning are in Appendix C.2), we depict the Pareto frontier in Figure 3. It highlights the relationship between accuracy
(NDCG@20) and fairness (ratio of unpopular items) under the Common testing paradigm. Here,
positions in the top-right corner indicate superior performance. We observe that ReSN exhibits a
more favorable Pareto curve in comparison to other baselines. When fairness is held constant, ReSN
showcases superior accuracy. Conversely, when accuracy is fixed, ReSN delivers enhanced fairness.
This suggests that ReSN effectively navigates the fairness-accuracy trade-off, primarily through its capability to counteract popularity bias amplification — it only mitigates the effect of the principal spectrum without disturbing the other spectral components.
Compared with the Methods on Tackling Dimension Collapse. Figure 4 shows the results of our ReSN compared with existing methods for tackling dimension collapse under the debiased testing paradigm. nCL and DirectAU can indeed mitigate popularity bias. However, their performance is inferior to
ReSN. The reason is that our ReSN is designed for debiasing, directly modulating the effect of the
item popularity on predictions, and thus yielding better performance.
Figure 5: NDCG@20 comparison with different loss functions.

           Movielens        Douban           Gowalla
           +BCE    +BPR     +BCE    +BPR     +BCE    +BPR
MF         0.1529  0.1540   0.0117  0.0120   0.0432  0.0431
Zerosum    0.1472  0.1498   0.0109  0.0106   0.0423  0.0425
MACR       0.1682  0.1629   0.0155  0.0149   0.0574  0.0546
PDA        0.1635  0.1633   0.0176  0.0173   0.0661  0.0675
InvCF      0.1574  0.1582   0.0153  0.0154   0.0553  0.0583
IPL        0.1612  0.1628   0.0173  0.0177   0.0612  0.0626
ReSN       0.1788  0.1693   0.0188  0.0180   0.0712  0.0702

Figure 6: The proportion of popular items in recommendations, the ratio of the largest singular value (σ1² / ∑1≤k≤L σk²), and NDCG@20 with varying β on Movielens and Douban.
6 Related Work
Analyses on Popularity Bias. In RS, items frequently exhibit a long-tailed distribution in terms of
interaction frequency. Models trained on skewed data are susceptible to inheriting and exacerbating
such bias [28, 74, 4, 5, 75]. The crux of tackling popularity bias lies in understanding why and how
recommendation models intensify popularity bias. Several recent efforts aim to elucidate this. Among
these, causality-based investigations stand out. For instance, Zhang et al. [68] developed a causal
graph of the data generative process, attributing the amplification of popularity bias to a confounding
effect; Wei et al. [56] presented an alternate causal graph, exploring the direct and indirect causal
influence of popularity bias on predictions; Wang et al. [55] used yet another causal graph to highlight
bias amplification, focusing on the impact of users’ historical long-tailed distribution across item
groups. Despite their contributions, a common limitation among these causality-based methods is
their surface-level engagement with the causal relationships among variables, rather than delving
deeper into the underlying mechanisms. For example, these studies usually operate on the assumption
that item popularity directly affects predictions. However, the specifics of how and why predictions
memorize and are influenced by item popularity remain largely unexplored. Worse still, their
effectiveness hinges on the accuracy of their respective causal graphs, which might not always align
with real-world scenarios due to the widespread presence of unmeasured confounders [21, 33].
There have been other theoretical investigations into popularity bias. For instance, Zhu et al. [74]
demonstrate that model predictions inherit item popularity, yet they fail to elucidate the amplification.
Moreover, their conclusions rely on the strong assumption that preference scores follow the same
distribution across different user-item pairs. The study by [41] sheds light on the limited expressiveness
of low-rank embeddings, offering clues about popularity bias in recommendations. Yet, it does not
factor in the impact of long-tailed training data. In fact, popularity bias originates from long-tailed
data [68, 74] and is amplified during training, making it more serious than the theoretical analyses
presented in [41] suggest. Other efforts (Chen et al. [13], Kim et al. [30]) examine popularity bias through
embedding magnitudes, but their theoretical analyses apply only to the early stages of training.
Further researchers delve into how graph neural networks amplify popularity bias, whether through
influence functions [10], the hub effect [73], or dimensional collapse [69]. However, their conclusions
cannot be extended to general recommendation models.
Methods on Tackling Popularity Bias. Recent efforts on addressing popularity bias are mainly four
types: 1) Causality-driven methods assume a causal graph to identify popularity bias and employ
causal inference techniques for rectification. While they have demonstrated efficacy, their success
is closely tied to the accuracy of the causal graph. This poses challenges due to the prevalence of
unmeasured confounders [21, 33, 40]. 2) Propensity-based methods [49, 25, 19, 66, 54] adjust the
data distribution by reweighting the training data instances. While this approach directly negates
popularity bias in the data, it may inadvertently obscure other valuable signals, such as item quality.
Consequently, these methods often underperform compared to causality-driven ones. 3) Regularizer-
based methods [3, 74, 47, 36, 29] constrain predictions by introducing regularization terms. For
example, Zhu et al. [74] employ a Pearson coefficient regularizer to diminish the correlation between
item popularity and model predictions; Rhee et al. [47] propose to regularize the score differences;
[36] constrains the predictions with the IPL criterion. As discussed in Section 4.2, their constraints are too
strong and may significantly compromise accuracy. 4) Disentanglement-based methods [64, 59, 14]
aim to learn disentangled embeddings that segregate the influence of popularity from genuine
user preferences. While promising, achieving a perfect disentanglement of popularity bias from true
preferences remains a formidable challenge in RS.
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin,
Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale
machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI
16), pages 265–283, 2016.
[2] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley interdisciplinary reviews:
computational statistics, 2(4):433–459, 2010.
[3] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. Controlling popularity bias in learning-to-
rank recommendation. In Proceedings of the eleventh ACM conference on recommender systems, pages
42–46, 2017.
[4] Himan Abdollahpouri, Masoud Mansoury, Robin Burke, and Bamshad Mobasher. The impact of popularity
bias on fairness and calibration in recommendation. arXiv preprint arXiv:1910.05755, 2019.
[5] Himan Abdollahpouri, Masoud Mansoury, Robin Burke, and Bamshad Mobasher. The connection between
popularity bias, calibration, and fairness in recommendation. In Proceedings of the 14th ACM Conference
on Recommender Systems, pages 726–731, 2020.
[6] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization.
Advances in Neural Information Processing Systems, 32, 2019.
[7] Stephen Bonner and Flavian Vasile. Causal embeddings for recommendation. In Proceedings of the 12th
ACM conference on recommender systems, pages 104–112, 2018.
[8] Rodrigo Borges and Kostas Stefanidis. On measuring popularity bias in collaborative filtering data. In
Proceedings of the Workshops of the EDBT/ICDT 2020 Joint Conference, Copenhagen, Denmark, March
30, 2020, volume 2578, 2020.
[9] Huiyuan Chen, Vivian Lai, Hongye Jin, Zhimeng Jiang, Mahashweta Das, and Xia Hu. Towards mitigating
dimensional collapse of representations in collaborative filtering. arXiv preprint arXiv:2312.17468, 2023.
[10] Jiajia Chen, Jiancan Wu, Jiawei Chen, Xin Xin, Yong Li, and Xiangnan He. How graph convolutions
amplify popularity bias for recommendation? arXiv preprint arXiv:2305.14886, 2023.
[11] Jiajia Chen, Jiancan Wu, Jiawei Chen, Xin Xin, Yong Li, and Xiangnan He. How graph convolutions
amplify popularity bias for recommendation? Frontiers of Computer Science, 18(5):185603, 2024.
[12] Jiawei Chen, Hande Dong, Xiang Wang, Fuli Feng, Meng Wang, and Xiangnan He. Bias and debias in
recommender system: A survey and future directions. ACM Transactions on Information Systems, 41(3):
1–39, 2023.
[13] Jiawei Chen, Junkang Wu, Jiancan Wu, Xuezhi Cao, Sheng Zhou, and Xiangnan He. Adap-τ : Adaptively
modulating embedding magnitude for recommendation. In Proceedings of the ACM Web Conference 2023,
pages 1085–1096, 2023.
[14] Zhihong Chen, Jiawei Wu, Chenliang Li, Jingxu Chen, Rong Xiao, and Binqiang Zhao. Co-training
disentangled domain adaptation network for leveraging popularity bias in recommenders. In Proceedings
of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval,
pages 60–69, 2022.
[15] Hung-Hsu Chou, Carsten Gieshoff, Johannes Maly, and Holger Rauhut. Gradient descent for deep matrix
factorization: Dynamics and implicit bias towards low rank. Applied and Computational Harmonic
Analysis, page 101595, 2023.
[16] Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law distributions in empirical data.
SIAM review, 51(4):661–703, 2009.
[17] Ludovik Çoba, Panagiotis Symeonidis, and Markus Zanker. Visual analysis of recommendation per-
formance. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pages 362–363,
2017.
[18] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In
Proceedings of the 10th ACM conference on recommender systems, pages 191–198, 2016.
[19] Quanyu Dai, Zhenhua Dong, and Xu Chen. Debiased recommendation with neural stratification. AI Open,
3:213–217, 2022.
[20] Gabriel de Souza Pereira Moreira, Felipe Ferreira, and Adilson Marques da Cunha. News session-based
recommendations using deep neural networks. In Proceedings of the 3rd workshop on deep learning for
recommender systems, pages 15–23, 2018.
[21] Sihao Ding, Peng Wu, Fuli Feng, Yitong Wang, Xiangnan He, Yong Liao, and Yongdong Zhang. Addressing
unmeasured confounder for recommendation with sensitivity analysis. In Proceedings of the 28th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining, pages 305–315, 2022.
[22] Chongming Gao, Kexin Huang, Jiawei Chen, Yuan Zhang, Biao Li, Peng Jiang, Shiqi Wang, Zhong
Zhang, and Xiangnan He. Alleviating matthew effect of offline reinforcement learning in interactive
recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and
Development in Information Retrieval, page 238–248, 2023.
[23] Chongming Gao, Shiqi Wang, Shijun Li, Jiawei Chen, Xiangnan He, Wenqiang Lei, Biao Li, Yuan Zhang,
and Peng Jiang. Cirs: Bursting filter bubbles by counterfactual interactive recommender system. ACM
Transactions on Information Systems, 42(1):1–27, 2023.
[24] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics,
pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
[25] Alois Gruson, Praveen Chandar, Christophe Charbuillet, James McInerney, Samantha Hansen, Damien
Tardieu, and Ben Carterette. Offline evaluation to make decisions about playlist recommendation algorithms.
In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages
420–428, 2019.
[26] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative
filtering. In Proceedings of the 26th international conference on world wide web, pages 173–182, 2017.
[27] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. Lightgcn: Simplifying
and powering graph convolution network for recommendation. In Proceedings of the 43rd International
ACM SIGIR conference on research and development in Information Retrieval, pages 639–648, 2020.
[28] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. What recommenders rec-
ommend: an analysis of recommendation biases and possible countermeasures. User Modeling and
User-Adapted Interaction, 25:427–491, 2015.
[29] Jiarui Jin, Zexue He, Mengyue Yang, Weinan Zhang, Yong Yu, Jun Wang, and Julian McAuley. Inforank:
Unbiased learning-to-rank via conditional mutual information minimization. In Proceedings of the ACM
on Web Conference 2024, page 1350–1361, 2024.
[30] Dain Kim, Jinhyeok Park, and Dongwoo Kim. Test time embedding normalization for popularity bias
mitigation. arXiv preprint arXiv:2308.11288, 2023.
[31] Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In International
Conference on Learning Representations, 2015.
[32] Walid Krichene and Steffen Rendle. On sampled metrics for item recommendation. In Proceedings of the
26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1748–1757,
2020.
[33] Haoxuan Li, Yanghao Xiao, Chunyuan Zheng, and Peng Wu. Balancing unobserved confounding with a
few unbiased ratings in debiased recommendations. In Proceedings of the ACM Web Conference 2023,
pages 1305–1313, 2023.
[34] Zihan Lin, Changxin Tian, Yupeng Hou, and Wayne Xin Zhao. Improving graph collaborative filtering
with neighborhood-enriched contrastive learning. In Proceedings of the ACM Web Conference 2022, pages
2320–2329, 2022.
[35] Seppo Linnainmaa. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16
(2):146–160, 1976.
[36] Yuanhao Liu, Qi Cao, Huawei Shen, Yunfan Wu, Shuchang Tao, and Xueqi Cheng. Popularity debiasing
from exposure to interaction in collaborative filtering. In Proceedings of the 46th International ACM SIGIR
Conference on Research and Development in Information Retrieval, page 1801–1805, 2023.
[37] Masoud Mansoury, Himan Abdollahpouri, Mykola Pechenizkiy, Bamshad Mobasher, and Robin Burke.
Feedback loop and bias amplification in recommender systems. In Proceedings of the 29th ACM interna-
tional conference on information & knowledge management, pages 2145–2148, 2020.
[38] Benjamin M Marlin and Richard S Zemel. Collaborative prediction and ranking with non-random missing
data. In Proceedings of the third ACM conference on Recommender systems, pages 5–12, 2009.
[39] Andriy Mnih and Russ R Salakhutdinov. Probabilistic matrix factorization. Advances in neural information
processing systems, 20, 2007.
[40] Wentao Ning, Reynold Cheng, Xiao Yan, Ben Kao, Nan Huo, Nur Al Hasan Haldar, and Bo Tang.
Debiasing recommendation with personal popularity. In Proceedings of the ACM on Web Conference 2024,
page 3400–3409, 2024.
[41] Naoto Ohsaka and Riku Togashi. Curse of "low" dimensionality in recommender systems. In Proceedings
of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval,
page 537–547, 2023.
[43] S Unnikrishna Pillai, Torsten Suel, and Seunghun Cha. The perron-frobenius theorem: some of its
applications. IEEE Signal Processing Magazine, 22(2):62–75, 2005.
[44] Jiezhong Qiu, Jian Tang, Hao Ma, Yuxiao Dong, Kuansan Wang, and Jie Tang. Deepinf: Social influence
prediction with deep learning. In Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, pages 2110–2119, 2018.
[45] Steffen Rendle. Item recommendation from implicit feedback. In Recommender Systems Handbook.
Springer, 2022.
[46] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian
personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty
in Artificial Intelligence, pages 452–461, 2009.
[47] Wondo Rhee, Sung Min Cho, and Bongwon Suh. Countering popularity bias by regularizing score
differences. In Proceedings of the 16th ACM Conference on Recommender Systems, pages 145–155, 2022.
[48] Andrew M Saxe, James L McClelland, and Surya Ganguli. A mathematical theory of semantic development
in deep neural networks. Proceedings of the National Academy of Sciences, 116(23):11537–11546, 2019.
[49] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Rec-
ommendations as treatments: Debiasing learning and evaluation. In international conference on machine
learning, pages 1670–1679. PMLR, 2016.
[50] Weiping Song, Zhiping Xiao, Yifan Wang, Laurent Charlin, Ming Zhang, and Jian Tang. Session-based
social recommendation via dynamic graph attention networks. In Proceedings of the Twelfth ACM
international conference on web search and data mining, pages 555–563, 2019.
[51] Harald Steck. Item popularity and recommendation accuracy. In Proceedings of the fifth ACM conference
on Recommender systems, pages 125–132, 2011.
[52] Shuhan Sun, Zhiyong Xu, and Jianlin Zhang. Spectral norm regularization for blind image deblurring.
Symmetry, 13(10):1856, 2021.
[53] Chenyang Wang, Yuanqing Yu, Weizhi Ma, Min Zhang, Chong Chen, Yiqun Liu, and Shaoping Ma.
Towards representation alignment and uniformity in collaborative filtering. In Proceedings of the 28th
ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1816–1825, 2022.
[54] Lei Wang, Chen Ma, Xian Wu, Zhaopeng Qiu, Yefeng Zheng, and Xu Chen. Causally debiased time-aware
recommendation. In Proceedings of the ACM on Web Conference 2024, page 3331–3342, 2024.
[55] Wenjie Wang, Fuli Feng, Xiangnan He, Xiang Wang, and Tat-Seng Chua. Deconfounded recommendation
for alleviating bias amplification. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge
Discovery & Data Mining, pages 1717–1725, 2021.
[56] Tianxin Wei, Fuli Feng, Jiawei Chen, Ziwei Wu, Jinfeng Yi, and Xiangnan He. Model-agnostic counterfac-
tual reasoning for eliminating popularity bias in recommender system. In Proceedings of the 27th ACM
SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1791–1800, 2021.
[57] Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. Self-
supervised graph learning for recommendation. In Proceedings of the 44th international ACM SIGIR
conference on research and development in information retrieval, pages 726–735, 2021.
[58] Le Wu, Xiangnan He, Xiang Wang, Kun Zhang, and Meng Wang. A survey on accuracy-oriented neural
recommendation: From collaborative filtering to information-rich recommendation. IEEE Transactions on
Knowledge and Data Engineering, 35(5):4425–4445, 2022.
[59] Guipeng Xv, Chen Lin, Hui Li, Jinsong Su, Weiyao Ye, and Yewang Chen. Neutralizing popularity bias in
recommendation models. In Proceedings of the 45th International ACM SIGIR Conference on Research
and Development in Information Retrieval, pages 2623–2628, 2022.
[60] Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generalizability of deep
learning. arXiv preprint arXiv:1705.10941, 2017.
[61] Junliang Yu, Hongzhi Yin, Xin Xia, Tong Chen, Lizhen Cui, and Quoc Viet Hung Nguyen. Are graph
augmentations necessary? simple graph contrastive learning for recommendation. In Proceedings of the
45th international ACM SIGIR conference on research and development in information retrieval, pages
1294–1303, 2022.
[62] Junliang Yu, Xin Xia, Tong Chen, Lizhen Cui, Nguyen Quoc Viet Hung, and Hongzhi Yin. Xsimgcl: To-
wards extremely simple graph contrastive learning for recommendation. IEEE Transactions on Knowledge
and Data Engineering, 2023.
[63] Wenhui Yu and Zheng Qin. Graph convolutional network for recommendation with low-pass collaborative
filters. In International Conference on Machine Learning, pages 10936–10945. PMLR, 2020.
[64] An Zhang, Jingnan Zheng, Xiang Wang, Yancheng Yuan, and Tat-Seng Chua. Invariant collaborative
filtering to popularity distribution shift. In Proceedings of the ACM Web Conference 2023, pages 1240–1251,
2023.
[65] An Zhang, Wenchang Ma, Pengbo Wei, Leheng Sheng, and Xiang Wang. General debiasing for graph-
based collaborative filtering via adversarial graph dropout. In Proceedings of the ACM on Web Conference
2024, page 3864–3875, 2024.
[66] Fan Zhang and Qijie Shen. A model-agnostic popularity debias training framework for click-through rate
prediction in recommender system. In Proceedings of the 46th International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 1760–1764, 2023.
[67] Xiang Zhang, Douglas E Faries, Hu Li, James D Stamey, and Guido W Imbens. Addressing unmeasured
confounding in comparative observational research. Pharmacoepidemiology and drug safety, 27(4):
373–382, 2018.
[68] Yang Zhang, Fuli Feng, Xiangnan He, Tianxin Wei, Chonggang Song, Guohui Ling, and Yongdong
Zhang. Causal intervention for leveraging popularity bias in recommendation. In Proceedings of the
44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages
11–20, 2021.
[69] Yifei Zhang, Hao Zhu, Zixing Song, Piotr Koniusz, Irwin King, et al. Mitigating the popularity bias of
graph collaborative filtering: A dimensional collapse perspective. In Thirty-seventh Conference on Neural
Information Processing Systems, 2023.
[70] Zihao Zhao, Jiawei Chen, Sheng Zhou, Xiangnan He, Xuezhi Cao, Fuzheng Zhang, and Wei Wu. Popularity
bias is not always evil: Disentangling benign and harmful bias for recommendation. IEEE Transactions on
Knowledge and Data Engineering, 2022.
[71] Yu Zheng, Chen Gao, Xiang Li, Xiangnan He, Yong Li, and Depeng Jin. Disentangling user interest and
conformity for recommendation with causal embedding. In Proceedings of the Web Conference 2021,
pages 2980–2991, 2021.
[72] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han
Li, and Kun Gai. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM
SIGKDD international conference on knowledge discovery & data mining, pages 1059–1068, 2018.
[73] Huachi Zhou, Hao Chen, Junnan Dong, Daochen Zha, Chuang Zhou, and Xiao Huang. Adaptive popularity
debiasing aggregator for graph collaborative filtering. pages 7–17, 2023.
[74] Ziwei Zhu, Yun He, Xing Zhao, Yin Zhang, Jianling Wang, and James Caverlee. Popularity-opportunity
bias in collaborative filtering. In Proceedings of the 14th ACM International Conference on Web Search
and Data Mining, pages 85–93, 2021.
[75] Ziwei Zhu, Yun He, Xing Zhao, and James Caverlee. Evolution of popularity bias: Empirical study and
debiasing. arXiv preprint arXiv:2207.03372, 2022.
A Theoretical Analysis
A.1 Analysis of BPR loss
Then we can construct a hyper-item space denoted as $\mathcal{I}' = \mathcal{I} \times \mathcal{I}$ derived from item pairs, define the embeddings of hyper-items as $\mathbf{v}'_{ij} = \mathbf{v}_i - \mathbf{v}_j$, and assign new observed interactions to the combinations of users and hyper-items, i.e., $y'_{u,ij} = 1$ for $y_{ui} = 1 \,\&\, y_{uj} = 0$, and $y'_{u,ij} = 0$ for $y_{ui} = 0 \,\&\, y_{uj} = 1$.
With this transformation, the BPR loss can be effectively reframed as:
$$
\begin{aligned}
\mathcal{L}_{BPR} &= -\sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}, y_{ui}=1} \sum_{j \in \mathcal{I}, y_{uj}=0} \log\big(\mu(\mathbf{u}_u^\top \mathbf{v}_i - \mathbf{u}_u^\top \mathbf{v}_j)\big) \\
&= -\sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}, y_{ui}=1} \sum_{j \in \mathcal{I}, y_{uj}=0} \log\big(\mu(\mathbf{u}_u^\top (\mathbf{v}_i - \mathbf{v}_j))\big) \\
&= -\frac{1}{2} \sum_{u \in \mathcal{U}} \Big[ \sum_{i \in \mathcal{I}, y_{ui}=1} \sum_{j \in \mathcal{I}, y_{uj}=0} \log\big(\mu(\mathbf{u}_u^\top (\mathbf{v}_i - \mathbf{v}_j))\big) + \sum_{i \in \mathcal{I}, y_{ui}=0} \sum_{j \in \mathcal{I}, y_{uj}=1} \log\big(\mu(\mathbf{u}_u^\top (\mathbf{v}_j - \mathbf{v}_i))\big) \Big] \\
&= -\frac{1}{2} \sum_{u \in \mathcal{U}} \Big[ \sum_{(i,j) \in \mathcal{I}', y'_{u,ij}=1} \log\big(\mu(\mathbf{u}_u^\top \mathbf{v}'_{ij})\big) + \sum_{(i,j) \in \mathcal{I}', y'_{u,ij}=0} \log\big(\mu(-\mathbf{u}_u^\top \mathbf{v}'_{ij})\big) \Big]
\end{aligned}
$$
If the activation function $\mu(\cdot)$ is taken as the sigmoid function $\sigma(\cdot)$, we have:
$$
\begin{aligned}
\mathcal{L}_{BPR} &= -\frac{1}{2} \sum_{u \in \mathcal{U}} \Big[ \sum_{(i,j) \in \mathcal{I}', y'_{u,ij}=1} \log\big(\sigma(\mathbf{u}_u^\top \mathbf{v}'_{ij})\big) + \sum_{(i,j) \in \mathcal{I}', y'_{u,ij}=0} \log\big(1 - \sigma(\mathbf{u}_u^\top \mathbf{v}'_{ij})\big) \Big] \\
&= -\frac{1}{2} \sum_{u \in \mathcal{U}} \sum_{(i,j) \in \mathcal{I}'} \Big[ y'_{u,ij} \log\big(\sigma(\mathbf{u}_u^\top \mathbf{v}'_{ij})\big) + (1 - y'_{u,ij}) \log\big(1 - \sigma(\mathbf{u}_u^\top \mathbf{v}'_{ij})\big) \Big]
\end{aligned} \tag{7}
$$
From Equation 7, the BPR loss can be reframed as the BCE loss over the hyper-item space $\mathcal{I}'$. This
allows analyses on point-wise losses to be easily extended to the BPR loss.
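This equivalence is easy to verify numerically. The sketch below (a toy example with one user and random embeddings, not the paper's code) compares the standard BPR loss against the hyper-item BCE form of Equation 7:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items = 4, 5
u = rng.normal(size=d)             # one user's embedding u_u
V = rng.normal(size=(n_items, d))  # item embeddings v_i
y = np.array([1, 0, 1, 0, 0])      # this user's observed interactions

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]

# Standard BPR loss over (positive, negative) item pairs.
bpr = -sum(np.log(sigmoid(u @ V[i] - u @ V[j])) for i in pos for j in neg)

# BCE loss over the hyper-item space I' = I x I with v'_ij = v_i - v_j:
# y'_(u,ij) = 1 for (y_ui, y_uj) = (1, 0) and 0 for (0, 1); the factor 1/2
# accounts for each pair appearing in both orientations.
bce = 0.0
for i in range(n_items):
    for j in range(n_items):
        s = sigmoid(u @ (V[i] - V[j]))
        if y[i] == 1 and y[j] == 0:
            bce -= np.log(s)
        elif y[i] == 0 and y[j] == 1:
            bce -= np.log(1.0 - s)
bce *= 0.5

assert np.isclose(bpr, bce)
```

The two losses agree exactly, so conclusions derived for point-wise (BCE-style) losses carry over to BPR.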
Similarly, for the denominator:
$$
\|\mathbf{V}'\mathbf{U}^\top \mathbf{e}\|^2 = \sum_{i,j \in \mathcal{I}} \big(\mathbf{v}_i \mathbf{U}^\top \mathbf{e} - \mathbf{v}_j \mathbf{U}^\top \mathbf{e}\big)^2 = 2m \sum_{i \in \mathcal{I}} \big(\mathbf{v}_i \mathbf{U}^\top \mathbf{e}\big)^2 - 2 \Big( \sum_{i \in \mathcal{I}} \mathbf{v}_i \mathbf{U}^\top \mathbf{e} \Big)^2 \tag{9}
$$
We can deduce that $\|\mathbf{V}'\mathbf{U}^\top \mathbf{e}\|^2$ can be approximated by $\|\mathbf{V}\mathbf{U}^\top \mathbf{e}\|^2$. Consequently, ReSN emerges as a
logical regularizer even for the BPR loss. This assertion is also validated by our experiments.
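The algebraic step in Eq. (9) rests on the identity $\sum_{i,j}(a_i - a_j)^2 = 2m\sum_i a_i^2 - 2(\sum_i a_i)^2$, which can be sanity-checked in a few lines (here $a_i$ stands in for the scalar $\mathbf{v}_i\mathbf{U}^\top\mathbf{e}$):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 7
a = rng.normal(size=m)  # a_i plays the role of the scalar v_i U^T e

# Pairwise expansion: sum over all (i, j) of (a_i - a_j)^2.
lhs = sum((a[i] - a[j]) ** 2 for i in range(m) for j in range(m))
# Closed form used in Eq. (9): 2m * sum_i a_i^2 - 2 * (sum_i a_i)^2.
rhs = 2 * m * np.sum(a**2) - 2 * np.sum(a) ** 2

assert np.isclose(lhs, rhs)
```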
3) We further utilize the property of $\mathbf{Y}$ to give the lower bound of $\mathbf{e}^\top \dot{\mathbf{p}}_1$ as $\mathbf{e}^\top \dot{\mathbf{p}}_1 \ge \dot\sigma_1 \sqrt{1 - \frac{r_{max}(\zeta(\alpha)-1)}{\dot\sigma_1^2}}$;
4) Finally, we demonstrate $\dot\sigma_1^2 \ge r_{max}$, and give $\cos(\mathbf{r}, \mathbf{q}_1) = \cos(\mathbf{r}, \dot{\mathbf{q}}_1) \ge \sqrt{\frac{2-\zeta(\alpha)}{\zeta(2\alpha)}}$ when $\alpha > 2$.
Part 1: the spectral relation between $\mathbf{Y}$ and $\hat{\mathbf{Y}}$. Here we focus on an embedding-based model with
sufficient capacity and optimize it with the MSE loss: $\mathcal{L}_R = \|\mathbf{Y} - \hat{\mathbf{Y}}\|_F^2$. Other losses, such as the BCE
loss, can be approximated by the MSE loss via Taylor expansion [35]. Based on the theory of
PCA (Principal Component Analysis) [2], the optimal $\hat{\mathbf{Y}}$ will have the same spectrum as the principal
dimensions of $\mathbf{Y}$; that is, their principal singular values and vectors match up. Because of
this relation, when analyzing the principal spectral properties of $\hat{\mathbf{Y}}$, we can instead shift our focus to $\mathbf{Y}$,
making our job easier.
Part 2: preliminary bound of $\cos(\mathbf{r}, \mathbf{q}_1)$. Note that $\mathbf{r}$ can be written as $\mathbf{r} = \mathbf{Y}^\top \mathbf{e}$, where $\mathbf{e}$ denotes
the $n$-dimensional vector filled with ones, and $\mathbf{Y}$ can be written as:
$$
\mathbf{Y} = \dot{\mathbf{P}} \dot{\boldsymbol{\Sigma}} \dot{\mathbf{Q}}^\top = \sum_{k=1}^{L} \dot\sigma_k \dot{\mathbf{p}}_k \dot{\mathbf{q}}_k^\top \tag{10}
$$
Part 3: bound of $\mathbf{e}^\top \dot{\mathbf{p}}_1$. We first demonstrate that for the matrix $\mathbf{Y}$, we can always find a non-negative
principal singular vector $\dot{\mathbf{q}}_1$. Define the matrix $\mathbf{Z} = \mathbf{Y}^\top \mathbf{Y}$; each element $z_{kl}$ is
non-negative. Note that:
$$
\dot\sigma_1^2 = \max_{\|\dot{\mathbf{q}}\|=1} \dot{\mathbf{q}}^\top \mathbf{Y}^\top \mathbf{Y} \dot{\mathbf{q}} = \max_{\|\dot{\mathbf{q}}\|=1} \sum_{k=1}^{m} \sum_{l=1}^{m} \dot q_k z_{kl} \dot q_l \tag{14}
$$
Suppose we have a principal singular vector $\mathbf{q}'$ with negative elements, and let the positions of these negative
elements form the set $S = \{k \mid q'_k < 0\}$. We can always construct a new $m$-dimensional vector $\mathbf{h}$ whose
$k$-th element is $-q'_k$ if $k \in S$, and $q'_k$ otherwise. By construction, $\mathbf{h}$ is non-negative, and we have:
$$
\sum_{k=1}^{m} \sum_{l=1}^{m} h_k z_{kl} h_l \ge \sum_{k=1}^{m} \sum_{l=1}^{m} q'_k z_{kl} q'_l = \dot\sigma_1^2 \tag{15}
$$
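The conclusion of this argument can be observed numerically: for a binary interaction matrix $\mathbf{Y}$, all entries of $\mathbf{Z} = \mathbf{Y}^\top\mathbf{Y}$ are non-negative, and (up to the global sign ambiguity of SVD) the principal right singular vector comes out entrywise non-negative. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
# Random binary interaction matrix Y (users x items).
Y = (rng.random((40, 15)) < 0.3).astype(float)

Z = Y.T @ Y
assert np.all(Z >= 0)  # all co-occurrence counts are non-negative

_, _, Vt = np.linalg.svd(Y, full_matrices=False)
q1 = Vt[0]
q1 = q1 if q1.sum() >= 0 else -q1  # resolve the sign ambiguity of SVD

# Perron-Frobenius-style conclusion: the principal singular vector of Y
# can always be chosen entrywise non-negative.
assert np.all(q1 >= -1e-9)
```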
It means the sum of any row of the matrix $\mathbf{B}^\top \mathbf{B}$ is smaller than $r_{max}(\zeta(\alpha) - 1)$. According to
the Perron-Frobenius theorem [43], the largest eigenvalue of $\mathbf{B}^\top \mathbf{B}$ is bounded with
$\lambda_1(\mathbf{B}^\top \mathbf{B}) \le r_{max}(\zeta(\alpha) - 1)$. Further, we have:
$$
\sum_{i=2}^{m} (\dot q_{1i})^2 = \frac{1}{\dot\sigma_1^2} \|\mathbf{B}^\top \dot{\mathbf{p}}_1\|_2^2 \le \frac{1}{\dot\sigma_1^2} \lambda_{max}(\mathbf{B}^\top \mathbf{B}) \le \frac{r_{max}}{\dot\sigma_1^2} (\zeta(\alpha) - 1) \tag{17}
$$
Considering that $\dot{\mathbf{q}}_1$ is normalized, we can derive the lower bound of $\mathbf{e}^\top \dot{\mathbf{p}}_1$ as:
$$
\mathbf{e}^\top \dot{\mathbf{p}}_1 \ge \dot\sigma_1 \dot q_{11} \ge \dot\sigma_1 \sqrt{1 - \frac{r_{max}}{\dot\sigma_1^2} (\zeta(\alpha) - 1)} \tag{18}
$$
Integrating Eq. (18) into Eq. (13), we finally get the lower bound of $\cos(\mathbf{r}, \dot{\mathbf{q}}_1)$ as:
$$
\cos(\mathbf{r}, \dot{\mathbf{q}}_1) \ge \frac{\dot\sigma_1^2}{r_{max}\sqrt{\zeta(2\alpha)}} \sqrt{1 - \frac{r_{max}(\zeta(\alpha) - 1)}{\dot\sigma_1^2}} \tag{19}
$$
Part 4: demonstrating $\dot\sigma_1^2 \ge r_{max}$. Recall the matrix $\mathbf{Z} = \mathbf{Y}^\top \mathbf{Y}$, and let $l$ be the ID of the item with
the highest popularity. It is easy to see that the $(l,l)$-th element of $\mathbf{Z}$ is $z_{ll} = r_{max}$. Let $\mathbf{v}$ be the
one-hot vector whose $l$-th element is one; we have:
$$
r_{max} = \mathbf{v}^\top \mathbf{Z} \mathbf{v} \le \max_{\|\dot{\mathbf{q}}\|=1} \dot{\mathbf{q}}^\top \mathbf{Z} \dot{\mathbf{q}} = \dot\sigma_1^2 \tag{20}
$$
Given $\alpha > 2$, we have $\zeta(\alpha) \le 2$. Considering $\dot\sigma_1^2 \ge r_{max}$, Eq. (19) can be further bounded with:
$$
\cos(\mathbf{r}, \dot{\mathbf{q}}_1) \ge \sqrt{\frac{2 - \zeta(\alpha)}{\zeta(2\alpha)}} \tag{21}
$$
Due to the alignment of the principal singular values and vectors of $\mathbf{Y}$ and $\hat{\mathbf{Y}}$, we have:
$$
\cos(\mathbf{r}, \mathbf{q}_1) \ge \sqrt{\frac{2 - \zeta(\alpha)}{\zeta(2\alpha)}} \tag{22}
$$
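Both conclusions can be illustrated on synthetic long-tailed data. In the sketch below, the sampling scheme (interaction probability decaying as a power law) is an illustrative assumption rather than one of the paper's datasets; it checks that $\sigma_1^2 \ge r_{max}$ and that the popularity vector $\mathbf{r}$ aligns strongly with the principal right singular vector $\mathbf{q}_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 2000, 500
# Illustrative long-tailed data: item i is consumed with probability ~ i^(-1.2).
p = 0.9 * np.arange(1, n_items + 1, dtype=float) ** (-1.2)
Y = (rng.random((n_users, n_items)) < p).astype(float)

r = Y.T @ np.ones(n_users)  # item popularity vector r = Y^T e
r_max = r.max()
_, s, Vt = np.linalg.svd(Y, full_matrices=False)
sigma1, q1 = s[0], Vt[0]

# Part 4's intermediate claim: the squared principal singular value
# dominates the popularity of the most popular item.
assert sigma1**2 >= r_max

# The memorization effect: item popularity is encoded in the principal
# right singular vector of Y (and hence of the optimal Y_hat).
cos = abs(r @ q1) / np.linalg.norm(r)
assert cos > 0.8
```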
A.3 Proof of Theorem 2
Let $S$ be the set of users for whom the most popular item occupies the top-1 recommendation, and let $l$ be the
ID of the most popular item. $S$ can be written as:
$$
S = \{u \in \mathcal{U} \mid \hat{y}_{ul} > \hat{y}_{ui}, \ \forall i \in \mathcal{I} \setminus \{l\}\} \tag{23}
$$
The ratio of the most popular item occupying the top-1 recommendation can be written as $\eta = |S|/n$.
We then transform this condition as follows:
Here we begin by invoking the gradient dynamics theorem from [15, 48, 6] to elucidate the dimension
collapse phenomenon, and then develop a theorem to illustrate how the singular values impact the
popularity bias in recommendations.
Figure 7: Long-tailed distribution of item popularity (interaction count versus item index) in the Movielens, Douban, and Globo recommendation datasets.
Table 4: The proportion of interactions of the top 20% of popular items in the total number of
interactions.

      Movielens-1M   Douban   Globo
%     67.3           86.3     90.5
Theorem 3 (Trajectory of singular values (Eq. (6) in [48])). When training an MF model via gradient
flow (gradient descent with an infinitesimally small learning rate), i.e., $\frac{d}{dt}\mathbf{U}(t) = -\frac{d\mathcal{L}}{d\mathbf{U}}$, $\frac{d}{dt}\mathbf{V}(t) = -\frac{d\mathcal{L}}{d\mathbf{V}}$,
the trajectory of the singular values during the learning process obeys:
$$
\sigma_k(t) = \frac{s_k e^{2 s_k t}}{e^{2 s_k t} - 1 + s_k / \sigma_k(0)} \tag{30}
$$
where $s_k$ signifies the terminal value of the $k$-th singular value, i.e., $\sigma_k(t) \to s_k$ as $t \to \infty$.
This theorem describes a sigmoidal trajectory that begins at some initial value $\sigma_k(0)$ at time $t = 0$
and rises to $s_k$, with a growth speed that depends on the respective convergence value. This is consistent
with the phenomenon presented in Figure 2(c), i.e., larger singular values are prioritized. Small
singular values require much more time to reach their optimum, easily resulting in dimension collapse.
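The trajectory of Eq. (30) can be sketched directly; the initial value and evaluation time below are arbitrary illustrative choices. After the same amount of training time, a mode with a large terminal singular value is essentially converged while a small one has barely moved:

```python
import numpy as np

def singular_value_trajectory(t, s_k, sigma0):
    """Eq. (30): sigmoidal growth from sigma_k(0) to the terminal value s_k."""
    return s_k * np.exp(2 * s_k * t) / (np.exp(2 * s_k * t) - 1 + s_k / sigma0)

sigma0 = 0.01  # common small initialization sigma_k(0)
t = 5.0        # same training time for both modes

large = singular_value_trajectory(t, s_k=10.0, sigma0=sigma0)
small = singular_value_trajectory(t, s_k=0.1, sigma0=sigma0)

# At t = 0 the trajectory starts exactly at sigma_k(0).
assert np.isclose(singular_value_trajectory(0.0, 10.0, sigma0), sigma0)
assert large / 10.0 > 0.99  # large mode: essentially at its terminal value
assert small / 0.1 < 0.5    # small mode: still far from its optimum
```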
B Additional Experiments
B.1 Long-tailed Distribution and Bias Amplification in Recommendations
Figure 7 shows the distribution of item popularity (the number of interactions of an item) in the
three benchmark datasets. It presents a significant long-tail distribution: a small portion of popular
items at the head have a high number of interactions, while the majority of items in the tail have very
few interactions. Table 4 shows the proportion of interactions of the top 20% popular items to all
interactions. In all three datasets, the interactions of the top 20% popular items accounted for over
60% of all interactions, and in the Globo dataset, it even exceeded 90%.
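The head concentration reported in Table 4 already emerges for idealized Zipf-like popularity counts; the exponent below is an illustrative assumption:

```python
import numpy as np

# Zipf-like interaction counts (illustrative): popularity of item i ~ i^(-1).
n_items = 10_000
r = np.arange(1, n_items + 1, dtype=float) ** (-1.0)

r_sorted = np.sort(r)[::-1]
top20_share = r_sorted[: n_items // 5].sum() / r_sorted.sum()

# Even a Zipf exponent of 1 puts well over half of all interactions
# in the top 20% of items, as observed in the real datasets.
assert top20_share > 0.6
```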
Figure 8 shows the popularity bias amplification effect in the three benchmark datasets. In all three
datasets, a mere 3% of the most popular items, accounting for 20% of total interactions, occupy
over 40% of recommendation slots; in the Douban dataset, the share even reaches 60%. The disparity
between the proportion of interactions and that in the recommendation results effectively demonstrates
the amplification effect of popularity bias.
Table 5 presents the values of $\sigma_1^2$ and $r_{max}$ on three benchmark datasets and different backbone
models. It can be observed that across these real datasets, $\sigma_1^2$ is significantly larger than
$r_{max}$, by a factor of 100 or more. Combining the bound given by Eq. (1), $\cos(\mathbf{r}, \mathbf{q}_1) \ge$
Figure 8: Illustration of popular bias amplification: We divide items into five groups according to
their popularity as recent work [68], and focus on the most popular group. The chart displays four
bars, representing the ratio of the items in the most popular group, the percentage of interactions
originating from these items in the training set, and the percentages of these items appearing in
recommendations from MF and LightGCN, respectively.
Table 5: The value of $\sigma_1^2$ and $r_{max}$ in different datasets and recommendation models.

            Movielens              Douban                 Globo
            σ1²        r_max       σ1²        r_max       σ1²        r_max
MF          5.6 × 10⁵  4.6 × 10³   1.2 × 10⁷  4.7 × 10⁴   8.3 × 10⁶  5.0 × 10⁴
LightGCN    5.7 × 10⁵  4.6 × 10³   1.2 × 10⁷  4.9 × 10⁴   8.4 × 10⁶  5.1 × 10⁴
$\frac{\sigma_1^2}{r_{max}\sqrt{\zeta(2\alpha)}} \sqrt{1 - \frac{r_{max}(\zeta(\alpha)-1)}{\sigma_1^2}}$), even if the data is not markedly skewed, i.e., $\alpha$ is not very large and
$\zeta(\alpha)$ is not very close to 1, there is still a significant similarity between the item popularity vector $\mathbf{r}$ and
the principal singular vector $\mathbf{q}_1$ due to the considerable ratio between $\sigma_1^2$ and $r_{max}$. This observation
also helps to explain the prevalence of the popularity memorization effect across recommendation models
and datasets.
The estimated regularization term is an accurate surrogate for the spectral norm $\|\mathbf{U}\mathbf{V}^\top\|_2^2$, which
validates the accuracy and rationality of the proposed method.
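Such a surrogate can be evaluated without ever materializing the $n \times m$ score matrix. The sketch below uses a generic power iteration to obtain an estimated principal vector $\tilde{\mathbf{q}}_1$ and then scores $\|\mathbf{U}(\mathbf{V}^\top\tilde{\mathbf{q}}_1)\|^2$; it is a minimal illustration of the idea, not necessarily the exact acceleration used by ReSN:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, d = 1000, 800, 32
U = rng.normal(size=(n, d))  # user embeddings
V = rng.normal(size=(m, d))  # item embeddings

# Brute force: materialize the n x m score matrix and take its top singular value.
exact = np.linalg.norm(U @ V.T, ord=2) ** 2

# Cheap surrogate: power iteration on Y_hat^T Y_hat = V U^T U V^T, applied
# factor by factor so the n x m score matrix is never formed.
q = rng.normal(size=m)
for _ in range(200):
    q = V @ (U.T @ (U @ (V.T @ q)))
    q /= np.linalg.norm(q)
surrogate = np.sum((U @ (V.T @ q)) ** 2)  # ||U (V^T q~1)||^2 ~= sigma_1^2

assert surrogate <= exact * (1 + 1e-9)  # Rayleigh quotient never exceeds sigma_1^2
assert surrogate >= 0.95 * exact        # and converges to it under power iteration
```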
To further evaluate the performance of ReSN, we explored the performance of the model under
different embedding dimensions, as shown in Figure 9. It can be seen that, with the increase of
embedding dimensions, the performance of all models gradually improves, and our ReSN can
outperform the comparison methods under all different embedding dimensions. This further validates
the effectiveness of ReSN in terms of performance.
Table 6: Comparison between the actual spectral norm and the estimated approximation.

                 MSE                                  BCE
Datasets    ||UV⊤||₂²    ||U(V⊤q̃₁)||²         ||UV⊤||₂²    ||U(V⊤q̃₁)||²
Figure 9: Performance comparison across different embedding dimensions in the Movielens and
Douban datasets.
To further validate the effectiveness of our acceleration strategy, we test the running time per epoch
of ReSN and of the original brute-force strategy for computing the gradient of the spectral norm.
We also present the baseline MF for comparison. The results are presented in Table 7. As can be
seen, our acceleration strategy achieves impressive speed-ups of over 3500× and 27000× on the two
datasets, respectively. Moreover, compared with MF, our ReSN does not incur much computational
overhead.
C Experimental Settings
C.1 Datasets
• Movielens-1M [63]: Movielens is a widely used dataset from [63], collected from
MovieLens2 . We use the 1M version. We transform the explicit ratings into implicit feedback
by treating all user-item ratings as positive interactions.
• Douban [50]: This dataset is collected from Douban3 , a popular review website in China.
We transform explicit data into implicit feedback using the same method as applied in Movielens.
• Globo [20]: This dataset is a popular dataset collected from the news recommendation
website Globo.com4 .
• Yelp2018 [27] & Gowalla [26]: Gowalla is a check-in dataset obtained from Gowalla, and
Yelp2018 is from the 2018 edition of the Yelp challenge, containing Yelp's business reviews
and user data. For a fair comparison, these two datasets are used exactly as in [27].
2 https://movielens.org/
3 https://www.douban.com/
4 http://g1.globo.com/
Table 7: Running time comparison (s / Epoch).
Movielens Douban
MF 0.177 2.098
ReSN 0.181 2.124
ReSN-Direct 649 59239
Speedup Ratio 3585 27890
• Yahoo!R3 [38] & Coat [49]: These two datasets are obtained from the Yahoo music and
Coat shopping recommendation services, respectively. Both contain a training set of biased
rating data collected from normal user interactions and a test set of unbiased rating data
containing user ratings on randomly selected items. The ratings are translated into implicit
feedback, i.e., interactions with ratings larger than 3 are regarded as positive samples.
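The rating-to-implicit conversion above can be sketched in a few lines of pandas; the toy data frame below is hypothetical, and the threshold follows the text (ratings larger than 3 become positive samples).

```python
import pandas as pd

# Hypothetical rating log with columns (user, item, rating).
ratings = pd.DataFrame({
    "user":   [0, 0, 1, 1, 2],
    "item":   [10, 11, 10, 12, 11],
    "rating": [5, 2, 4, 3, 1],
})

# Keep only interactions with rating > 3 as positive implicit feedback.
implicit = ratings[ratings["rating"] > 3][["user", "item"]]
```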
Following the standard 10-core setting, we filter out users and items with fewer than 10 interactions,
and we report the statistics of the above datasets after this standardization in Table 8.
We implement ReSN in TensorFlow [1], with parameters initialized via Xavier initialization [24].
We optimize all models with Adam [31]. A grid search is conducted to determine the optimal
hyperparameter setting for each model. Specifically, the learning rate is searched in {1e−2 , 1e−3 , 2e−4 }
and the weight decay in {1e−7 , 1e−6 , 1e−5 , 1e−4 , 1e−3 }. For the LightGCN backbone, we use
three graph convolution layers to obtain the best results, with or without dropout to prevent
over-fitting. For ReSN, the regularizer coefficient β is tuned in the range of
{1e−4 , 1e−3 , 1e−2 , 1e−1 , 5e−1 , 1.0, 5.0}. For the compared methods, we closely follow the
configurations provided in their respective publications to ensure their optimal performance.
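The grid search described above amounts to an exhaustive sweep over the listed ranges. A minimal sketch, assuming a placeholder `train_and_eval` callback that trains one configuration and returns its validation metric (that callback and the function name are hypothetical):

```python
from itertools import product

# Search ranges mirroring the text.
learning_rates = [1e-2, 1e-3, 2e-4]
weight_decays  = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
betas          = [1e-4, 1e-3, 1e-2, 1e-1, 5e-1, 1.0, 5.0]

def grid_search(train_and_eval):
    """Exhaustively try every (lr, wd, beta) combination and keep the best."""
    best_cfg, best_score = None, float("-inf")
    for lr, wd, beta in product(learning_rates, weight_decays, betas):
        score = train_and_eval(lr=lr, wd=wd, beta=beta)
        if score > best_score:
            best_cfg, best_score = (lr, wd, beta), score
    return best_cfg, best_score
```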
For the Pareto curve experiments, besides tuning the learning rate and weight decay, we also perform
the following hyperparameter tuning: 1) for PDA, we select the results of tuning γ and γ̃; 2) for
MACR, we select the results of tuning the coefficient c; 3) for InvCF, since the differences after
tuning α, λ1 , and λ2 were not significant, we report its result as a single point; 4) for Zerosum,
we adjusted its regularization coefficient, but its results varied greatly and oscillated, so we only
report its best overall performance as a single point; 5) for IPL, we select the results of tuning the
regularization coefficient λf . All experiments are conducted on a server with Intel(R) Xeon(R)
Gold 6254 CPUs.
D Notations
We summarize the notations used in this paper as follows: uppercase bold letters represent matrices
(e.g., Y); lowercase bold letters represent vectors (e.g., r); ∥ · ∥2 denotes the spectral norm of
a matrix, i.e., its largest singular value; and ∥ · ∥ denotes the L2-norm of a vector. Table 9
provides a more detailed enumeration of the notations used in this paper.
Table 9: Notations in this paper.

Notations        Descriptions
u                a user in the user set U
i                an item in the item set I
n                the number of users in U
m                the number of items in I
yui              whether user u has interacted with item i
Y                the observed interaction matrix
ri               the number of interactions of item i, i.e., the popularity of item i
r                the vector of item popularity over all items
uu , vi          embedding vectors of user u and item i
U, V             embedding matrices of all users and items
Ŷ                predicted score matrix over all user-item pairs, i.e., Ŷ = µ(UV⊤ )
σk , pk , qk     the k-th largest singular value of Ŷ and its corresponding left and right singular vectors
α                the shape parameter signifying the severity of the long tail of item popularity
ζ(α)             the Riemann zeta function, i.e., ζ(α) = Σ∞_{j=1} 1/j^α
e                an n-dimensional vector filled with ones