
Understanding the Role of Cross-Entropy Loss in Fairly Evaluating Large Language Model-based Recommendation

Cong Xu∗, East China Normal University, Shanghai, China ([email protected])
Zhangchi Zhu∗, East China Normal University, Shanghai, China ([email protected])
Jun Wang, East China Normal University, Shanghai, China ([email protected])
Jianyong Wang, Tsinghua University, Beijing, China ([email protected])
Wei Zhang, East China Normal University, Shanghai, China ([email protected])

∗ Equal contribution

arXiv:2402.06216v2 [cs.IR] 22 Feb 2024

ABSTRACT
Large language models (LLMs) have gained much attention in the recommendation community; some studies have observed that LLMs, fine-tuned by the cross-entropy loss with a full softmax, could achieve state-of-the-art performance already. However, these claims are drawn from unobjective and unfair comparisons. In view of the substantial quantity of items in reality, conventional recommenders typically adopt a pointwise/pairwise loss function instead for training. This substitute however causes severe performance degradation, leading to under-estimation of conventional methods and over-confidence in the ranking capability of LLMs.
In this work, we theoretically justify the superiority of cross-entropy, and showcase that it can be adequately replaced by some elementary approximations with certain necessary modifications. The remarkable results across three public datasets corroborate that even in a practical sense, existing LLM-based methods are not as effective as claimed for next-item recommendation. We hope that these theoretical understandings in conjunction with the empirical results will facilitate an objective evaluation of LLM-based recommendation in the future. Our code is available at https://github.com/MTandHJ/CE-SCE-LLMRec.

CCS CONCEPTS
• Information systems → Recommender systems; • Computer systems organization → Neural networks.

KEYWORDS
recommendation, large language model, cross-entropy, evaluation

ACM Reference Format:
Cong Xu, Zhangchi Zhu, Jun Wang, Jianyong Wang, and Wei Zhang. 2024. Understanding the Role of Cross-Entropy Loss in Fairly Evaluating Large Language Model-based Recommendation. In Proceedings of ACM Conference (Conference’17). ACM, New York, NY, USA, 16 pages. https://doi.org/XXXXXXX.XXXXXXX

[Figure 1: Recommendation performance comparisons, plotted as HR@10 versus NDCG@10 on Beauty and Yelp for SASRec trained with BCE, BPR, and CE, alongside P5 (CID+IID), LlamaRec, and E4SRec. The marker size depicts the number of model parameters: 60M for P5 (CID+IID) [19], 7B for LlamaRec [53] and E4SRec [28], and merely ≤ 1M for SASRec [21].]

1 INTRODUCTION
With the growth of the Internet, the amount of information being generated every moment is far beyond human discernment. Recommender systems are thus developed to help humans quickly and accurately ascertain the items of interest, and have played important roles in diverse applications, including e-commerce [55], online news [13], and education [50]. Due to inherent differences in data types and recommendation goals, different tasks are typically handled using various techniques and separate models. For example, graph neural networks [23] have dominated collaborative filtering [15, 32], while Transformer [48] becomes increasingly popular in sequential recommendation [21, 44].
Recently, the prosperity of Large Language Models (LLMs) [35, 38, 46, 47] suggests a promising direction towards universal recommenders [12, 27]. Equipped with carefully designed prompts, they show great potential in explainable and cross-domain recommendations [9, 10]. Nevertheless, there still exist non-negligible gaps [1, 22] between LLMs and conventional methods unless domain-specific knowledge is injected. Some studies have observed ‘compelling’ results after fine-tuning [28, 53], and hastily affirmed LLM-based recommenders’ ranking capability.

However, the comparisons therein are not objective and fair enough, leading to under-estimation of conventional recommenders and over-confidence in LLMs. Recall that the next-token prediction objective used for LLM pre-training (and fine-tuning), by its nature, is a cross-entropy loss that needs a full softmax over the entire corpus. In view of the substantial quantity of items in reality, conventional methods typically adopt a pointwise/pairwise loss function (e.g., BCE and BPR). This compromise however causes significant performance degradation. As shown in Figure 1, SASRec trained with cross-entropy outperforms LLMs by a large margin, while falling behind with BCE or BPR. Such superior results relying on cross-entropy however cannot serve as direct evidence to challenge the ranking capability of existing LLM-based recommenders, since the full softmax is intractable to calculate in practice.
In this work, we re-emphasize the ability to optimize ranking metrics for a desired recommendation loss, and then unveil the corresponding limitations of some approximations to cross-entropy. To achieve effective and practical approximations, we introduce some novel alternatives with theoretical analysis. In summary, the innovative insights and technical contributions are as follows:
• Minimizing cross-entropy is equivalent to maximizing a lower bound of Normalized Discounted Cumulative Gain (NDCG) and Reciprocal Rank (RR). One can thus expect that the ranking capability would be gradually enhanced as cross-entropy is optimized during training. We further show that dynamic truncation on the normalizing term yields a tighter bound and potentially better performance. This fact highlights the importance of optimizing these ranking metrics, and the cross-entropy loss is arguably adequate for this purpose. The challenge that remains unsolved is how to realize the approximation in an effective and practical manner, so the comparison with LLM-based recommenders is meaningful in reality. After revisiting the limitations of some well-known approximations, a rather simple solution will be presented.
• Noise contrastive estimation (NCE) [14] with the default setting fails to optimize a meaningful bound in the early stages of training. Before the advent of subword segmentation algorithms [25], the training of neural language models also struggled to circumvent an explicit normalization over the entire vocabulary. Mnih et al. [34] thus resorted to a simplified NCE that fixes the normalizing term estimate as a constant value of 1. This suggestion however introduces training difficulties in recommendation: sampling more negative samples should accelerate the training, yet the opposite occurs. This intriguing phenomenon is attributed to the weak connection between NCE and NDCG (RR). Because NCE grows exponentially fast w.r.t. the number of positively scored items, a meaningless bound is encountered in the early training stages. This conclusion suggests adjusting the estimate of the normalizing term to a slightly larger value, which shows promising empirical performance but lacks consistent applicability. Next, we introduce a more reliable loss.
• Scaling up the sampled normalizing term provides an effective and practical approximation to cross-entropy. Since the normalizing term of cross-entropy is intractable in reality, a direct way is to approximate it by (uniformly) sampling part of the items (a.k.a. the sampled softmax loss [49]). To further mitigate the magnitude loss caused by sampling, we multiply it by an additional weight so the sampled term is scaled up. Indeed, this modification can also be understood as a special case of importance sampling [2], in which the proposal distribution assigns a higher probability mass to the target item. Unlike NCE, this Scaled Cross-Entropy (dubbed SCE) yields a bound mainly determined by the current rank of the target item, making it meaningful even in the early training stages. Empirically, sampling only a few negative samples per iteration is sufficient to achieve results comparable to using cross-entropy with a full softmax.
Based on these approximations for cross-entropy, we conduct a comprehensive investigation to assess the true ranking capability of both conventional and LLM-based recommenders. The experimental results presented in Section 5 suggest the over-confidence in existing LLM-based methods. Even without considering the model sizes, they are still far inferior to conventional methods in next-item recommendation. Apart from the potential of explainability and cross-domain transferability, further investigation and exploration are necessary to assess the true ranking capability of LLM-based recommenders.

2 RELATED WORK
Recommender systems are developed to enable users to quickly and accurately ascertain relevant items. The primary principle is to learn underlying interests from user information, especially historical interactions. Collaborative filtering [16, 40] performs personalized recommendation by mapping users and items into the same latent space in which interacted pairs are close. Beyond static user representations, sequential recommendation [26, 42] focuses on capturing dynamic interests from item sequences. Early efforts such as GRU4Rec [17] and Caser [45] respectively apply recurrent neural networks (RNNs) and convolutional neural networks (CNNs) to sequence modeling. Recently, Transformer [8, 48] becomes increasingly popular in recommendation due to its parallel efficiency and superior performance. For example, SASRec [21] and BERT4Rec [44] respectively employ unidirectional and bidirectional self-attention. Differently, Zhou et al. [56] present FMLP-Rec to denoise the item sequences through learnable filters so that state-of-the-art performance can be obtained by mere MLP modules.
LLM for recommendation has gained a lot of attention recently because: 1) the next-token generation feature is technically easy to extend to next-item recommendation (i.e., sequential recommendation); 2) the immense success of LLMs in natural language processing promises the development of universal recommenders. Some studies [7, 10] have demonstrated the powerful zero/few-shot ability of LLMs (e.g., GPT [35]), especially their potential in explainable and cross-domain recommendations [9, 10]. Nevertheless, there is a consensus [1, 22, 54] that without domain-specific knowledge learned by fine-tuning, LLM-based recommenders still stay far behind conventional models.
As an early effort, P5 [12] unifies multiple recommendation tasks into a sequence-to-sequence paradigm. Based on the foundation model of T5 [38], each task can be activated through some specific prompts. Hua et al. [19] take a further step beyond P5 by examining the impact of various ID indexing methods, and a combination of collaborative and independent indexing stands out.

Recently, more LLM-based recommenders [28, 30, 37, 53] built on Llama [46] or Llama2 [47] have been developed. For example, LlamaRec [53] proposes a two-stage framework based on Llama2 to rerank the candidates retrieved by conventional models. To enable the LLM to correctly identify items, E4SRec [28] incorporates ID embeddings trained by conventional sequential models through a linear adaptor, and applies LoRA [18] for parameter-efficient fine-tuning.
Cross-entropy and its approximations [2, 3, 14, 31, 40] have been extensively studied. The most related works are: 1) Bruch et al. [4] theoretically connected cross-entropy to some ranking metrics; 2) Wu et al. [49] further found its desirable property in alleviating popularity bias; and recently, 3) Klenitskiy et al. [24] and Petrov et al. [36] respectively applied cross-entropy and a generalized BCE loss to eliminate the performance gap between SASRec and BERT4Rec. Differently, we are to 1) understand the superiority of cross-entropy as well as the limitations of its approximations; 2) identify a viable approximation according to these findings; and 3) facilitate an objective evaluation of LLM-based recommendation by acknowledging the true capability of conventional models.

3 PRELIMINARIES
Given a query q that encompasses some user information, a recommender system aims to retrieve some items v ∈ I that would be of interest to the user. In sequential recommendation, the recommender predicts the next item v_{t+1} based on historical interactions q = [v_1, v_2, ..., v_t]. The crucial component is to develop a scoring function s_{qv} := s_θ(q, v) to accurately model the relevance of a query q to a candidate item v. A common paradigm is to map them into the same latent space through some models parameterized via θ, followed by an inner product operation for similarity calculation. Then, top-ranked items based on these scores will be prioritized for recommendation. Typically the desired recommender is trained to minimize an objective function over all observed interactions D:

    min_θ  E_{(q, v+) ∼ D} [ℓ(q, v+; θ)],

where v+ indicates the target item for the query q, and the loss function ℓ considered in this paper is of the form

    ℓ(q, v+; θ) := −log [ exp(s_θ(q, v+)) / Z_θ(q) ] = −s_θ(q, v+) + log Z_θ(q).    (1)

Here Z_θ(q) depicts the ‘normalizing’ term specified to the query q. For clarity, we will omit q and θ hereafter if no ambiguity is raised.
It is worth noting that the reformulation in Eq. (1) makes it easy to understand the subtle differences between a wide range of loss functions. Table 1 covers a selection of approximations: Binary Cross-Entropy (BCE) and Bayesian Personalized Ranking (BPR) [40] are widely used in recommendation for their low costs; Importance Sampling (IS) [2, 20], Noise Contrastive Estimation (NCE) [14, 34], and NEGative sampling (NEG) [33] are the cornerstones of the subsequent methods [3, 6, 11, 29, 43, 51, 52]. Additional details regarding their mechanisms are provided in Appendix B. Note that in this work we only involve some elementary approximations, as the primary purpose is not to develop a complex loss. For more advanced loss functions, please refer to [5].

[Figure 2: Performance comparison based on tighter bounds for NDCG: NDCG@10 and training loss on Beauty and MovieLens-1M as η increases from 0.1 to 5. The dashed line represents the results trained by CE (namely the case of η → +∞).]

4 THE ROLE OF CROSS-ENTROPY LOSS IN OPTIMIZING RANKING CAPABILITY
BCE and BPR are commonly employed in training recommender systems due to their high efficiency. However, the substantial performance gaps shown in Figure 1 suggest their poor alignment with cross-entropy. Needless to say, claims based on the comparison with these inferior alternatives are unconvincing. In this section, we are to showcase the substitutability of cross-entropy by 1) highlighting the importance of implicitly optimizing the ranking metrics for a recommendation loss; 2) introducing some practical modifications to boost the effectiveness of some elementary approximations. Due to space constraints, the corresponding proofs are deferred to Appendix C.
SASRec [21], one of the most prominent sequential models, will serve as the baseline to empirically elucidate the conclusions in this part. All results are summarized based on 5 independent runs.

4.1 Cross-Entropy for Some Ranking Metrics
The capability to prioritize items aligning with the user’s interests is essential for recommender systems. Denoting by r+ := r(v+) = |{v ∈ I : s_v ≥ s_{v+}}| the predicted rank of the target item v+, the metric of Normalized Discounted Cumulative Gain (NDCG)¹ is often employed to assess the sorting quality. For the next-item recommendation considered in this paper, NDCG is simplified to

    NDCG(r+) = 1 / log2(1 + r+).

It increases as the target item v+ is ranked higher, and reaches the maximum when v+ is ranked first (i.e., r+ = 1). Consequently, the average quality computed over the entire test set serves as an indicator of the ranking capability. Notably, Reciprocal Rank (RR) is another popular ranking metric, and we leave the definition and results to the Appendix since the corresponding findings are very similar to those of NDCG. The following proposition suggests that cross-entropy is a soft proxy to these ranking metrics.

Proposition 4.1. For a target item v+ which is ranked as r+, the following inequality holds true for any n ≥ r+:

    −log NDCG(r+) ≤ ℓ_{CE-n},    (2)

where

    ℓ_{CE-n} := −s_{v+} + log Σ_{r(v) ≤ n} exp(s_v).

¹ In practice, it is deemed meaningless when r+ exceeds a pre-specified threshold k (e.g., k = 1, 5, 10). Hence, the widely adopted NDCG@k metric is modified to assign zero reward to these poor ranking results.
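To make Eq. (1) and Proposition 4.1 concrete, here is a minimal PyTorch sketch (our own illustration rather than the authors' released code; the item count and target index are arbitrary) that computes the full-softmax cross-entropy for one query and numerically checks the special case n = |I| of the bound, i.e., −log NDCG(r+) ≤ ℓ_CE.

```python
import torch

def full_softmax_ce(scores: torch.Tensor, target: int) -> torch.Tensor:
    """Eq. (1): -s_{v+} + log sum_{v in I} exp(s_v), computed over the full item set."""
    return -scores[target] + torch.logsumexp(scores, dim=0)

def neg_log_ndcg(scores: torch.Tensor, target: int) -> torch.Tensor:
    """-log NDCG(r+) with r+ = |{v : s_v >= s_{v+}}|, as defined in Section 4.1."""
    rank = (scores >= scores[target]).sum().float()   # r+ >= 1
    return -torch.log(1.0 / torch.log2(1.0 + rank))

# Toy check of the bound with randomly scored items.
scores = torch.randn(12101)   # one score per item, e.g., |I| of Beauty
target = 42                   # arbitrary index of the target item v+
assert neg_log_ndcg(scores, target) <= full_softmax_ce(scores, target) + 1e-5
```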

Table 1: Cross-entropy loss and its approximations. The bounding probabilities are obtained in the case of uniform sampling. More conclusions in terms of Reciprocal Rank (RR) can be found in Appendix C.

Loss  | Formulation | ‘Normalizing’ term Z | Complexity | P(−log NDCG(r+) ≤ ℓ∗) ≥
ℓ_CE  | −log [ exp(s_{v+}) / Σ_{v∈I} exp(s_v) ] | Σ_{v∈I} exp(s_v) | O(|I|d) | 1
ℓ_BCE | −log σ(s_{v+}) − log(1 − σ(s_{v−})) | (1 + exp(s_{v+}))(1 + exp(s_{v−})) | O(d) | -
ℓ_BPR | −log σ(s_{v+} − s_{v−}) | exp(s_{v+}) + exp(s_{v−}) | O(d) | -
ℓ_NCE | −log σ(s′_{v+}) − Σ_{i=1}^{K} log(1 − σ(s′_{v_i})) | (1 + exp(s′_{v+})) Π_{i=1}^{K} (1 + exp(s′_{v_i})) | O(Kd) | 1 − m (1 − |S′_+|/|I|)^{⌊K/m⌋}
ℓ_NEG | −log σ(s_{v+}) − Σ_{i=1}^{K} log(1 − σ(s_{v_i})) | (1 + exp(s_{v+})) Π_{i=1}^{K} (1 + exp(s_{v_i})) | O(Kd) | 1 − m (1 − |S_+|/|I|)^{⌊K/m⌋}
ℓ_IS  | −log [ exp(s_{v+} − log Q(v+)) / Σ_{i=1}^{K} exp(s_{v_i} − log Q(v_i)) ] | Σ_{i=1}^{K} exp(s_{v_i} − log Q(v_i) + log Q(v+)) | O(Kd) | 1 − 2^m (1 − r+/|I|)^{⌊K/2^m⌋}
ℓ_SCE | −log [ exp(s_{v+}) / (exp(s_{v+}) + α Σ_{i=1}^{K} exp(s_{v_i})) ] | exp(s_{v+}) + α Σ_{i=1}^{K} exp(s_{v_i}) | O(Kd) | 1 − (1/α) 2^m (1 − r+/|I|)^{⌊αK/2^m⌋}

We can draw from Proposition 4.1 that −log NDCG(r+) would be strictly bounded by CE-like losses, as long as all items ranked before v+ are retained in the normalizing term. In other words, minimizing these CE-like losses is equivalent to maximizing a lower bound of NDCG. Because cross-entropy is a special case that retains all items (i.e., n = |I|), we readily have the following corollary:

Corollary 4.2 ([4]). Minimizing the cross-entropy loss ℓ_CE is equivalent to maximizing a lower bound of normalized discounted cumulative gain.

Therefore, satisfactory ranking capability can be expected if ℓ_CE for all queries is minimized. Since the superiority of cross-entropy possibly stems from its connection to some ranking metrics, one may hypothesize that optimizing a tighter bound with a smaller value of n ≪ |I| allows greater performance gains. However, the condition n ≥ r+ cannot be consistently satisfied via a constant value of n since r+ dynamically changes during training. Alternatively, an adaptive truncation can be employed for this purpose:

    ℓ_{CE-η} := −s_{v+} + log Σ_{s_v − s_+ ≥ −η|s_+|} exp(s_v),  η ≥ 0.    (3)

Note that this η-truncated loss retains only items whose scores are not lower than s_+ − η|s_+|, so a tighter bound will be obtained as η drops to 0. Specifically, this η-truncated loss becomes ℓ_{CE-r+} (i.e., the tightest case) when η = 0, and approaches ℓ_CE when η → +∞. Figure 2 illustrates how NDCG@10 varies as η gradually increases from 0.1 to 5. There are two key observations:
1. SASRec performs worst in the tightest case of η ≈ 0. This can be attributed to the instability of the η-truncated loss. On the one hand, ℓ_{CE-η} will rapidly collapse to 0 for those easily recognized targets, in which case all other items are excluded from the normalizing term except the target itself. On the other hand, due to this strict truncation, only a few negative items are encountered during training, and thus over-fitting is more likely to occur.
2. Once η is large enough to overcome the training instability, SASRec begins to enjoy the benefits from tightness and achieves its best performance around η ≈ 0.7. Further increasing η however leads to a similar effect as cross-entropy, along with slightly degraded performance due to the suboptimal tightness.
In spite of the minor performance gains, the complexity of these tighter bounds is still equal to or even higher than that of cross-entropy. The challenge that remains unsolved is how to realize the approximation in an effective and practical manner. To this end, we will introduce two practical alternatives to cross-entropy, one based on noise contrastive estimation [14, 34], and the other based on the sampled softmax loss [49]. Their different ways of approximating the cross-entropy loss lead to distinct properties during training. Some specific modifications focusing on the normalizing term estimates are then developed to enhance their ability to optimize NDCG and RR.

4.2 Revisiting Noise Contrastive Estimation
Noise Contrastive Estimation (NCE) is widely used in training neural language models for bypassing an explicit normalization over the entire vocabulary. It requires the model to discriminate the target from an easy-to-sample noise distribution. In the case of uniform sampling, it can be formulated as follows:

    ℓ_NCE := −s_{v+} + log [ (1 + exp(s′_{v+})) Π_{i=1}^{K} (1 + exp(s′_{v_i})) ],

where s′_v = s_v − c − log(K/|I|). In the original implementation of NCE [14], c is a trainable parameter as an estimate of log Z_CE. However, this strategy is infeasible for conditional probability models like language models and sequential recommendation, where one c_q would need to be determined for each text (query). Mnih et al. [34] therefore fixed c ≡ 1 for all texts during training, and NEGative sampling (NEG) used in Word2Vec [33] further simplifies it by replacing s′ with s directly; that is,

    ℓ_NEG := −s_{v+} + log [ (1 + exp(s_{v+})) Π_{i=1}^{K} (1 + exp(s_{v_i})) ].

However, we observe that both NCE (c = 1) and NEG introduce training difficulties as the number of negative samples increases.
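For reference, the uniform-sampling NCE (with a constant estimate c of the log-normalizing term) and NEG losses above can be written as the following sketch; this is our own illustration of the stated formulas (using softplus(-x) = −log σ(x)), not the paper's implementation.

```python
import math
import torch
import torch.nn.functional as F

def nce_loss(pos_score, neg_scores, num_items, c=1.0):
    """NCE with uniform noise: s'_v = s_v - c - log(K / |I|);
    loss = -log sigma(s'_{v+}) - sum_i log(1 - sigma(s'_{v_i}))."""
    K = neg_scores.numel()
    shift = c + math.log(K / num_items)
    return F.softplus(-(pos_score - shift)) + F.softplus(neg_scores - shift).sum()

def neg_sampling_loss(pos_score, neg_scores):
    """NEG drops the correction term and uses the raw scores directly."""
    return F.softplus(-pos_score) + F.softplus(neg_scores).sum()

# One query with K = 500 uniformly sampled negatives (Beauty-like setting).
pos_score, neg_scores = torch.tensor(2.3), torch.randn(500)
print(nce_loss(pos_score, neg_scores, num_items=12101, c=10.0))
print(neg_sampling_loss(pos_score, neg_scores))
```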

[Figure 3: NDCG@10 performance of NCE (c = 1) and NEG across different numbers of negative samples K ∈ {10, 50, 100, 500}: (a) Beauty, (b) MovieLens-1M.]

The NDCG@10 metric shown in Figure 3 remains almost constant at the beginning of training, and consumes more iterations to converge as K increases. This contradicts the usual understanding that sampling more negative samples would accelerate the convergence of training. We believe that this intriguing phenomenon stems mainly from the exponential growth of the normalizing term, which consequently yields a rather weak bound:

Theorem 4.3. Let v+ be a target item which is ranked as r+ ≤ 2^{2^m} − 1 for some m ∈ N, and

    S_+ := {v ∈ I : s_v ≥ 0},    S′_+ := {v ∈ I : s′_v ≥ 0}.

If we uniformly sample K items for training, then with probability at least

    1 − m (1 − |S′_+|/|I|)^{⌊K/m⌋},  if ℓ∗ = ℓ_NCE,
    1 − m (1 − |S_+|/|I|)^{⌊K/m⌋},   if ℓ∗ = ℓ_NEG,    (4)

we have

    −log NDCG(r+) ≤ ℓ∗.    (5)

From Theorem 4.3, we have the following conclusions:
1. Notice that Eq. (4) for NCE (NEG) is mainly determined by the size of |S′_+| (|S_+|) rather than the current rank r+. As a result, NCE and NEG can easily bound NDCG as long as the number of items with non-negative scores is large enough. This is more common in the early stages of training, in which case item representations are poorly distributed in the latent space. In other words, the bounds at the beginning are too weak to be meaningful for improving the model ranking capability. The so-called training difficulties are actually a stage in narrowing the gap.
2. Figure 3 also suggests that the standstill duration of NCE is significantly longer than that of NEG, for example, 150 epochs versus 70 epochs on Beauty if 500 negative samples are sampled for training. Note that NEG can be regarded as a special case of NCE by fixing c = log(|I|/K), a value higher than 1 if the experimental settings described in Figure 3 are applied. As such, according to Theorem 4.3, NCE with c = 1 will suffer from a weaker bound than NEG, thereby requiring more iterations for convergence.
Overall, NCE and NEG converge slower as K increases because of the exponential growth of their normalizing terms w.r.t. the sizes of |S′_+| and |S_+|. One feasible modification is to adopt a moderately large c so that the sizes remain acceptable even if numerous negative items are sampled. As shown in Table 2, the number of epochs required for convergence decreases as the value of c increases from 1 to 10. But a larger value once again hinders the training process.

Table 2: The convergence epoch and the highest NDCG@10 metric achieved among the 500 epochs (with K = 500).

              | Beauty NDCG@10 | Beauty Epoch | MovieLens-1M NDCG@10 | MovieLens-1M Epoch
NCE (c = 1)   | 0.0547 | 400-470 | 0.1817 | 280-440
NCE (c = 5)   | 0.0545 | 195-280 | 0.1812 | 100-200
NCE (c = 10)  | 0.0571 | 95-115  | 0.1817 | 80-160
NCE (c = 50)  | 0.0410 | ≥ 500   | 0.1816 | ≥ 420
NCE (c = 100) | 0.0171 | ≥ 500   | 0.1676 | ≥ 500

Recall that c is an estimate of log Z_CE. Setting c ≥ 50 implies a hypothesis of Z_CE ≥ e^50, which is obviously difficult to achieve for most models. As a rule of thumb, the hyper-parameter c should be chosen carefully, and c = 10 appears a good choice according to the experiments in Section 5. Next we will introduce a more reliable variant of the sampled softmax loss [49]. As opposed to NCE and NEG, it yields a bound determined by the current rank r+.

4.3 Scaling Up the Sampled Normalizing Term
Since the normalizing term of cross-entropy is intractable in reality, a direct way is to approximate it by (uniformly) sampling part of the items from I:

    Ẑ(α) = exp(s_{v+}) + α Σ_{i=1}^{K} exp(s_{v_i}),  α ≥ 1.    (6)

The resulting Scaled Cross-Entropy (SCE) becomes

    ℓ_SCE := −s_{v+} + log Ẑ(α).

Note that here we scale up the sampled normalizing term by a pre-specified weight α. For the case of α = 1, this approximation (a.k.a. the sampled softmax loss [49]) implies a (K + 1)-class classification task, and has been used in previous studies [24, 49]. But it is worth noting that they have inherent differences. We modify it via a weight α to remedy the magnitude loss resulting from sampling, so the scaled loss is more likely to bound NDCG and RR:

Theorem 4.4. Under the same conditions as stated in Theorem 4.3, the inequality (5) holds for SCE with a probability of at least

    1 − (1/α) 2^m (1 − r+/|I|)^{⌊αK/2^m⌋}.    (7)
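A minimal sketch of SCE as defined in Eq. (6), assuming uniformly sampled negative scores for a single query (our illustration; names and shapes are ours):

```python
import math
import torch

def sce_loss(pos_score: torch.Tensor, neg_scores: torch.Tensor, alpha: float = 100.0) -> torch.Tensor:
    """l_SCE = -s_{v+} + log( exp(s_{v+}) + alpha * sum_i exp(s_{v_i}) )."""
    # Folding log(alpha) into the negative logits lets logsumexp handle the sum stably.
    logits = torch.cat([pos_score.view(1), neg_scores + math.log(alpha)])
    return -pos_score + torch.logsumexp(logits, dim=0)

# alpha = 1 recovers the plain sampled softmax loss; alpha > 1 scales up the
# sampled normalizing term to compensate for the items left out by sampling.
pos_score, neg_scores = torch.tensor(1.7), torch.randn(100)   # e.g., K = 100
print(sce_loss(pos_score, neg_scores, alpha=100.0))
```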

[Figure 4: NDCG@10 performance under various weight α (α = 1, 5, 100) as the number of negative samples K varies, on Beauty and MovieLens-1M.]

In addition to the promising bounding probability achieved through the weight α, we can also observe from Theorem 4.4 that Eq. (7) is directly governed by the current rank r+. In contrast to NCE (c = 1) and NEG, SCE is expected to be meaningful even in the early stages of training. Certainly, SCE is not perfect: the scaling operation inevitably raises concerns about the high variance problem. As depicted in Figure 4, if negative samples are very rare, a larger weight α tends to worsen the ranking capability. Fortunately, the high variance problem appears less significant as K slightly increases (e.g., K ≥ 50 for Beauty and K ≥ 10 for MovieLens-1M). Notably, sampling 100 negative samples for α = 100 produces comparable performance to using 500 negative samples for α = 1 on the Beauty dataset.
Connection to importance sampling. While SCE may not make sense at first glance, it is indeed closely related to importance sampling [2, 41], a widely used technique for cross-entropy approximation. Given a proposal distribution Q over all items, it corrects the approximation as follows:

    ℓ_IS := −log [ exp(s_{v+} − log Q(v+)) / Σ_{i=1}^{K} exp(s_{v_i} − log Q(v_i)) ],  v_i ∼ Q, i = 1, 2, ..., K.

It allows for an unbiased estimation if the proposal distribution Q is precisely identical to the underlying data distribution. But for conditional probability models like sequential recommendation, achieving the optimum requires additional overhead for each query. Hence, some heuristic designs based on popularity sampling [6, 29] are adopted more often in practice. We point out that SCE is morphologically equivalent to these designs with

    Q(v) = α / (|I| − 1 + α)  if v = v+,
    Q(v) = 1 / (|I| − 1 + α)  if v ≠ v+.    (8)

Consequently, scaling up the sampled normalizing term can be considered as assigning a higher probability mass to the target item v+. This partially explains why SCE is effective: a well-trained scoring function should skew towards the target item. Another subtle difference is that the normalizing term for importance sampling may not include the target term, while SCE always preserves it. This is practically necessary to avoid an unstable training process.

Table 3: Dataset statistics.

Dataset      | #Users | #Items | #Interactions | Density | Avg. Length
Beauty       | 22,363 | 12,101 | 198,502       | 0.07%   | 8.9
MovieLens-1M | 6,040  | 3,416  | 999,611       | 4.84%   | 165.5
Yelp         | 30,431 | 20,033 | 316,354       | 0.05%   | 10.4

Table 4: Model statistics. The number of parameters is estimated based on the Beauty dataset.

Model             | Foundation Model | Architecture | Embedding Size | #Params
P5 (CID+IID) [19] | T5               | Transformer  | 512            | 60M
POD [27]          | T5               | Transformer  | 512            | 60M
LlamaRec [53]     | Llama2           | Transformer  | 4096           | 7B
E4SRec [28]       | Llama2           | Transformer  | 4096           | 7B
GRU4Rec [17]      | -                | RNN          | 64             | 0.80M
Caser [45]        | -                | CNN          | 64             | 3.80M
SASRec [21]       | -                | Transformer  | 64             | 0.83M
BERT4Rec [44]     | -                | Transformer  | 64             | 1.76M
FMLP-Rec [56]     | -                | MLP          | 64             | 0.92M

4.4 Computational Complexity
The major cost of cross-entropy lies in the inner product and softmax operations. It has a complexity of O(|I|d), where d denotes the embedding size before similarity calculation. In contrast, the approximations require a lower cost of O(Kd), with a corresponding additional overhead of O(K) for uniform sampling. Overall, it is profitable if K ≪ |I|.

5 EXPERIMENTS
In this section, we are to reveal the true ranking capability of conventional recommenders by using the modified Noise Contrastive Estimation (NCE) and the proposed Scaled Cross-Entropy (SCE). Thus, the current advancements made by LLM-based recommenders can also be assessed objectively.

5.1 Experimental Setup
This part introduces the datasets, evaluation metrics, baselines, and implementation details.
Datasets. To ensure the reliability of the conclusions, we select three public datasets from different scenarios, including the Beauty, MovieLens-1M, and Yelp datasets. Beauty is an e-commerce dataset extracted from Amazon reviews known for high sparsity; MovieLens-1M is a movie dataset with a much longer sequence length; Yelp collects abundant meta-data suitable for multi-task training. Following [12, 56], we filter out users and items with less than 5 interactions, and the validation set and test set are split in a leave-one-out fashion, namely the last interaction for testing and the penultimate one for validation. The dataset statistics are presented in Table 3.
Evaluation metrics. For each user, the scores returned by the recommender will be sorted in descending order to generate candidate lists. In addition to the aforementioned NDCG@k (Normalized Discounted Cumulative Gain), HR@k (Hit Rate), which quantifies the proportion of successful hits among the top-k recommended candidates, will also be included in this paper due to its widespread use in other studies [12, 21, 56].
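As a reference for how these metrics can be computed under the leave-one-out protocol, here is a small NumPy sketch (our own illustration, not the paper's evaluation script); `all_scores` holds one row of item scores per user and `targets` the held-out item ids.

```python
import numpy as np

def hr_ndcg_at_k(all_scores: np.ndarray, targets: np.ndarray, k: int = 10):
    """HR@k and NDCG@k for next-item recommendation with a single target per user."""
    target_scores = all_scores[np.arange(len(targets)), targets]
    # Rank of the target among all items (1 = best); ties are counted pessimistically.
    ranks = (all_scores >= target_scores[:, None]).sum(axis=1)
    hits = ranks <= k
    hr = hits.mean()
    ndcg = np.where(hits, 1.0 / np.log2(1.0 + ranks), 0.0).mean()
    return hr, ndcg

# Toy usage: 3 users, 8 items.
rng = np.random.default_rng(0)
scores = rng.standard_normal((3, 8))
targets = np.array([2, 5, 0])
print(hr_ndcg_at_k(scores, targets, k=5))
```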

Table 5: Overall performance comparison on the Beauty, MovieLens-1M, and Yelp datasets. The best results of each block are
marked in bold. ‘ ▲% over CE/LLM’ represents the relative gap between respective best results.

Beauty MovieLens-1M Yelp


HR@5 HR@10 NDCG@5 NDCG@10 HR@5 HR@10 NDCG@5 NDCG@10 HR@5 HR@10 NDCG@5 NDCG@10
LLM
POD 0.0185 0.0245 0.0125 0.0146 0.0422 0.0528 0.0291 0.0326 0.0476 0.0564 0.0330 0.0358
P5 (CID+IID) 0.0569 0.0791 0.0403 0.0474 0.2225 0.3131 0.1570 0.1861 0.0289 0.0453 0.0200 0.0252
LlamaRec 0.0591 0.0862 0.0405 0.0492 0.1757 0.2836 0.1113 0.1461 0.0416 0.0605 0.0306 0.0367
E4SRec 0.0527 0.0753 0.0376 0.0448 0.1871 0.2765 0.1234 0.1522 0.0309 0.0473 0.0207 0.0260
CE
GRU4Rec 0.0474 0.0690 0.0329 0.0398 0.2247 0.3201 0.1542 0.1850 0.0275 0.0463 0.0171 0.0231
Caser 0.0435 0.0614 0.0303 0.0361 0.2181 0.3049 0.1520 0.1800 0.0283 0.0383 0.0211 0.0243
SASRec 0.0713 0.0986 0.0510 0.0597 0.2221 0.3131 0.1518 0.1812 0.0476 0.0696 0.0345 0.0415
BERT4Rec 0.0509 0.0747 0.0347 0.0423 0.1978 0.2922 0.1330 0.1634 0.0355 0.0540 0.0243 0.0303
FMLP-Rec 0.0717 0.0988 0.0507 0.0594 0.2287 0.3243 0.1585 0.1893 0.0512 0.0759 0.0364 0.0444
▲% over LLM 21.4% 14.6% 25.7% 21.3% 2.8% 3.6% 0.9% 1.7% 7.5% 25.6% 10.3% 21.0%
BCE
GRU4Rec 0.0214 0.0376 0.0134 0.0186 0.1595 0.2490 0.1023 0.1310 0.0157 0.0273 0.0098 0.0135
Caser 0.0282 0.0434 0.0185 0.0234 0.1639 0.2476 0.1078 0.1348 0.0304 0.0428 0.0224 0.0264
SASRec 0.0429 0.0671 0.0275 0.0353 0.1594 0.2492 0.1040 0.1329 0.0325 0.0501 0.0225 0.0281
BERT4Rec 0.0245 0.0415 0.0152 0.0207 0.1241 0.2021 0.0789 0.1039 0.0223 0.0379 0.0138 0.0188
FMLP-Rec 0.0460 0.0710 0.0301 0.0381 0.1800 0.2722 0.1173 0.1469 0.0460 0.0651 0.0330 0.0391
▲% over CE -35.9% -28.1% -41.0% -36.2% -21.3% -16.1% -26.0% -22.4% -10.0% -14.3% -9.4% -11.9%
▲% over LLM -22.1% -17.7% -25.8% -22.6% -19.1% -13.1% -25.3% -21.1% -3.3% 7.6% -0.1% 6.5%
NCE
GRU4Rec 0.0434 0.0652 0.0288 0.0359 0.2273 0.3184 0.1541 0.1834 0.0241 0.0418 0.0148 0.0205
Caser 0.0377 0.0567 0.0253 0.0314 0.2213 0.3106 0.1523 0.1811 0.0296 0.0405 0.0220 0.0255
SASRec 0.0686 0.0961 0.0485 0.0573 0.2177 0.3135 0.1479 0.1788 0.0471 0.0682 0.0344 0.0412
BERT4Rec 0.0487 0.0734 0.0324 0.0404 0.1960 0.2933 0.1311 0.1624 0.0389 0.0574 0.0271 0.0330
FMLP-Rec 0.0693 0.0964 0.0491 0.0578 0.2291 0.3279 0.1567 0.1885 0.0512 0.0760 0.0364 0.0444
▲% over CE -3.4% -2.4% -3.6% -3.2% 0.2% 1.1% -1.1% -0.4% 0.0% 0.1% 0.0% 0.1%
▲% over LLM 17.3% 11.8% 21.3% 17.4% 2.9% 4.7% -0.2% 1.3% 7.5% 25.7% 10.3% 21.1%
SCE
GRU4Rec 0.0489 0.0694 0.0344 0.0410 0.2309 0.3248 0.1587 0.1891 0.0290 0.0487 0.0183 0.0246
Caser 0.0456 0.0628 0.0322 0.0377 0.2274 0.3135 0.1586 0.1864 0.0293 0.0404 0.0218 0.0253
SASRec 0.0698 0.0968 0.0500 0.0587 0.2273 0.3186 0.1567 0.1862 0.0472 0.0693 0.0339 0.0410
BERT4Rec 0.0540 0.0776 0.0372 0.0449 0.2078 0.3014 0.1405 0.1707 0.0414 0.0612 0.0283 0.0346
FMLP-Rec 0.0703 0.0979 0.0502 0.0591 0.2372 0.3284 0.1648 0.1942 0.0517 0.0779 0.0357 0.0441
▲% over CE -2.0% -0.9% -1.5% -1.1% 3.7% 1.3% 4.0% 2.6% 1.0% 2.6% -2.0% -0.6%
▲% over LLM 19.0% 13.6% 23.8% 20.0% 6.6% 4.9% 5.0% 4.4% 8.6% 28.8% 8.1% 20.2%

Besides, we also provide the Mean Reciprocal Rank (MRR) results in Appendix D.
Baselines. Although LLM itself has surprising zero-shot recommendation ability, there still exist non-negligible gaps [1, 22] unless domain-specific knowledge is injected. Hence, only LLM-based recommenders enhanced by fine-tuning will be compared in this paper: P5 (CID+IID) [19], POD [27], LlamaRec [53], and E4SRec [28]. Specifically, the first two methods take T5 as the foundation model, while the last two methods fine-tune Llama2 for efficient sequential recommendation. Additionally, five sequential models including FMLP-Rec [56], Caser [45], GRU4Rec [17], SASRec [21], and BERT4Rec [44] are considered here to unveil the true capability of conventional methods. They cover various architectures so as to comprehensively validate the effectiveness of the proposed approximation methods. Table 4 presents an overview of the model statistics. Notice that for LlamaRec and E4SRec we employ Llama2-7B instead of Llama2-13B as the foundation model in order to improve training efficiency. This results in minor performance differences in practice.
Implementation details. According to the discussion in Section 4, we set c = 10 for NCE and α = 100 for SCE, and less than 5% of all items will be sampled for both approximation objectives. Specifically, K = 500 on the Beauty dataset, and K = 100 on the MovieLens-1M and Yelp datasets. We find that training 200 epochs is sufficient for cross-entropy to converge, while sometimes NCE and SCE need 300 epochs. Other hyper-parameters of NCE and SCE completely follow cross-entropy. Because the data preprocessing scripts provided with POD may lead to information leakage [39], we assign random integer IDs to items rather than sequentially incrementing integer IDs. The maximum sequence length and embedding size have a direct impact on representation capability and inference efficiency, so we will discuss them separately in Section 5.3.
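Putting these choices together, one SCE training step under the reported setting (α = 100; K = 500 on Beauty, K = 100 on MovieLens-1M and Yelp) might look like the sketch below; the `model.score` interface, the batch layout, and the lack of de-duplication against the target are our simplifying assumptions, not the released code.

```python
import math
import torch

K, ALPHA, NUM_ITEMS = 500, 100.0, 12101   # Beauty-like setting

def sce_training_step(model, queries, targets, optimizer):
    """One batched SCE update: score the target and K uniformly sampled negatives per query."""
    negatives = torch.randint(0, NUM_ITEMS, (targets.size(0), K))         # uniform sampling (collisions ignored)
    pos_scores = model.score(queries, targets.unsqueeze(1)).squeeze(1)    # (B,)
    neg_scores = model.score(queries, negatives)                          # (B, K)
    logits = torch.cat([pos_scores.unsqueeze(1), neg_scores + math.log(ALPHA)], dim=1)
    loss = (-pos_scores + torch.logsumexp(logits, dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```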

[Figure 5: Relative gaps between SCE and NCE for GRU4Rec, Caser, SASRec, BERT4Rec, and FMLP-Rec on Beauty, MovieLens-1M, and Yelp: (a) HR@10, (b) NDCG@10.]

[Figure 6: P5 (CID+IID) versus SASRec(+) across various maximum sequence length L (NDCG@10 on Beauty and MovieLens-1M). SASRec+ uses a sliding window similar to P5 to augment each sequence.]

Table 6: The impact of embedding size d.

            | d   | Beauty HR@10 | Beauty NDCG@10 | MovieLens-1M HR@10 | MovieLens-1M NDCG@10
P5(CID+IID) | 512 | 0.0791       | 0.0474         | 0.3131             | 0.1861
SASRec+     | 64  | 0.0937       | 0.0574         | 0.3271             | 0.1915
SASRec+     | 512 | 0.0963       | 0.0589         | 0.3360             | 0.1999

5.2 Overall Performance Evaluation
In this part, we fulfill our primary objective by presenting the comparison between LLM-based recommenders and conventional recommenders. Considering the expensive training cost of LLM-based models, their results reported in Table 5 are based on a single run, while the results of other conventional methods are averaged over 5 independent runs.
Conventional methods using cross-entropy outperform LLM-based recommenders. Let us focus on the first three blocks in Table 5, where the use of cross-entropy greatly improves the conventional methods’ recommendation performance. In particular, SASRec and FMLP-Rec demonstrate superior performance compared to LLMs, but fall significantly behind if cross-entropy is replaced with BCE. Hence, previous affirmative arguments about the LLMs’ recommendation performance are rooted in unobjective and unfair comparisons, wherein BCE or BPR are commonly used for training conventional models. Moreover, the inflated model size (from 60M of P5 to 7B of LlamaRec) only yields negligible improvements in some of the metrics. The rich world knowledge and powerful reasoning ability seem to be of limited use here due to the emphasis on personalization in sequential recommendation. In conclusion, even after fine-tuning, LLM-based recommenders still fail to surpass state-of-the-art conventional models.
Comparable effectiveness can be achieved using practical approximations. Conducting a full softmax over all items for cross-entropy may be infeasible in practice. Fortunately, the last two blocks in Table 5 show the substitutability of cross-entropy by applying the modified NCE or the proposed SCE. To showcase the effectiveness of these substitutes, we intentionally sample a rather conservative number of negative samples, and thus there remains a slight gap compared to cross-entropy. Nevertheless, the superior results of NCE and SCE re-emphasize the clear gap that exists between LLM-based and conventional recommenders.
SCE is more consistent and reliable than NCE. As discussed in Section 4.2, NCE is sensitive to the choice of c: extremely small or large values might impede learning and degrade performance. The inconsistent performance gains from NCE can also verify this conclusion. Figure 5 clearly demonstrates that NCE can contribute competitive performance to SASRec and FMLP-Rec, but underperforms SCE across a variety of models (e.g., GRU4Rec and BERT4Rec) and datasets (e.g., Beauty and MovieLens-1M). For Caser on Beauty and GRU4Rec on Yelp, replacing SCE with NCE results in a performance degradation of even ≥15%.
Additional experiments on BPR loss and MRR metrics are shown in the Appendix. They draw the same conclusions as above.

5.3 Other Factors for Objective Evaluation
Due to the difference in model design, it is challenging to conduct evaluations on a completely identical testbed. To clarify the reliability of the results in Table 5, we further investigate two key factors: maximum sequence length² L and embedding size d. According to the conclusions above, SCE is employed to train SASRec in the following. Results for CE and NCE can be found in Appendix E.
Maximum sequence length. Following the routine of previous studies [21, 44], conventional models like SASRec are allowed to make predictions based on the last L = 200 interactions on MovieLens-1M and L = 50 on the other datasets. In contrast, LLM-based recommenders are confined to L = 20 on all three datasets for training efficiency. We argue that this difference does not change the conclusion because, as can be seen in Figure 6, P5 does not exhibit much better performance with access to more historical interactions. Notably, SASRec’s performance is stable on Beauty, but on MovieLens-1M, it deteriorates significantly when L gets smaller. This phenomenon primarily arises from the fact that the original implementation of SASRec only utilizes the most recent L interactions for training, whereas for P5 each sequence is divided into multiple segments. Consequently, P5 is able to access far more interactions than SASRec during training, especially on the MovieLens-1M dataset known for long sequence length. If we apply a similar strategy that augments each sequence using a sliding window, the resulting SASRec+ then performs consistently across diverse L.
² The maximum sequence length refers to the maximum number of historical interactions used for next-item prediction. Other tokens like prompts are not included.
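The sliding-window augmentation behind SASRec+ can be sketched as follows (our illustration of the described strategy with an assumed stride of 1, not the authors' preprocessing script):

```python
def sliding_window_augment(sequence, max_len, stride=1):
    """Split one interaction sequence into overlapping segments of length max_len,
    so that early interactions also contribute training examples (similar to how
    P5 divides each sequence into multiple segments)."""
    if len(sequence) <= max_len:
        return [list(sequence)]
    return [list(sequence[start:start + max_len])
            for start in range(0, len(sequence) - max_len + 1, stride)]

# A length-7 history with max_len = 5 yields 3 overlapping training segments.
print(sliding_window_augment([1, 2, 3, 4, 5, 6, 7], max_len=5))
```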

Embedding size. Models with a larger embedding size have stronger representation capability and thus potentially better recommendation performance. According to the discussion above, we examine its impact under the same setting of L = 20. In Table 6, when the embedding size of SASRec+ is increased from 64 to 512, the obtained performance gains are marginal. In view of its costly overhead, such improvement is not attractive in practice. This also implies that existing LLM-based recommenders are over-parameterized in terms of ranking capability.

6 CONCLUSION
In this work, we bridge the theoretical and empirical performance
gaps between cross-entropy and some of its approximations through
a modified noise contrastive estimation loss and an effective scaled
cross-entropy loss. Based on these practical approximations, we
showcase that existing LLM-based recommenders are not as ef-
fective as claimed. The innovative understandings and extensive
experiments can be expected to facilitate an objective evaluation of
LLM-based recommendation in the future.

REFERENCES
[1] Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. TALLRec: An effective and efficient tuning framework to align large language model with recommendation. In ACM Conference on Recommender Systems (RecSys). ACM, 1007–1014.
[2] Yoshua Bengio and Jean-Sébastien Senécal. 2008. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks (TNNLS) 19, 4 (2008), 713–722.
[3] Guy Blanc and Steffen Rendle. 2018. Adaptive sampled softmax with kernel based sampling. In International Conference on Machine Learning (ICML) (Proceedings of Machine Learning Research, Vol. 80). PMLR, 589–598.
[4] Sebastian Bruch, Xuanhui Wang, Michael Bendersky, and Marc Najork. 2019. An analysis of the softmax cross entropy loss for learning-to-rank with binary relevance. In ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR). ACM, 75–78.
[5] Chong Chen, Weizhi Ma, Min Zhang, Chenyang Wang, Yiqun Liu, and Shaoping Ma. 2023. Revisiting negative sampling vs. non-sampling in implicit recommendation. ACM Transactions on Information Systems (TOIS) 41, 1 (2023), 1–25.
[6] Jin Chen, Defu Lian, Binbin Jin, Kai Zheng, and Enhong Chen. 2022. Learning recommenders for implicit feedback with importance resampling. In ACM Web Conference (WWW). ACM, 1997–2005.
[7] Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun, Xiao Zhang, and Jun Xu. 2023. Uncovering ChatGPT’s Capabilities in Recommender Systems. In ACM Conference on Recommender Systems (RecSys). ACM, 1126–1132.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 4171–4186.
[9] Yue Feng, Shuchang Liu, Zhenghai Xue, Qingpeng Cai, Lantao Hu, Peng Jiang, Kun Gai, and Fei Sun. 2023. A large language model enhanced conversational recommender system. arXiv preprint arXiv:2308.06212 (2023).
[10] Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. 2023. Chat-rec: Towards interactive and explainable llms-augmented recommender system. arXiv preprint arXiv:2303.14524 (2023).
[11] Yingbo Gao, David Thulke, Alexander Gerstenberger, Khoa Viet Tran, Ralf Schlüter, and Hermann Ney. 2021. On sampling-based training criteria for neural language modeling. In Annual Conference of the International Speech Communication Association (Interspeech). ISCA, 1877–1881.
[12] Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (RLP): A unified pretrain, personalized prompt & predict paradigm (P5). In ACM Conference on Recommender Systems (RecSys). ACM, 299–315.
[13] Shansan Gong and Kenny Q. Zhu. 2022. Positive, negative and neutral: Modeling implicit feedback in session-based news recommendation. In International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1185–1195.
[14] Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In International Conference on Artificial Intelligence and Statistics (AISTATS). JMLR Workshop and Conference Proceedings, 297–304.
[15] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yong-Dong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM, 639–648.
[16] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In International Conference on World Wide Web (WWW). ACM, 173–182.
[17] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks. In International Conference on Learning Representations (ICLR).
[18] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR).
[19] Wenyue Hua, Shuyuan Xu, Yingqiang Ge, and Yongfeng Zhang. 2023. How to index item ids for recommendation foundation models. In Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP). ACM, 195–204.
[20] Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Association for Computational Linguistics (ACL). Association for Computer Linguistics, 1–10.
[21] Wang-Cheng Kang and Julian J. McAuley. 2018. Self-attentive sequential recommendation. In IEEE International Conference on Data Mining (ICDM). IEEE Computer Society, 197–206.
[22] Wang-Cheng Kang, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed Chi, and Derek Zhiyuan Cheng. 2023. Do LLMs understand user preferences? Evaluating llms on user rating prediction. arXiv preprint arXiv:2305.06474 (2023).
[23] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR).
[24] Anton Klenitskiy and Alexey Vasilev. 2023. Turning dross into gold loss: is BERT4Rec really better than SASRec?. In ACM Conference on Recommender Systems (RecSys). ACM, 1120–1125.
[25] Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 66–75.
[26] Jiacheng Li, Yujie Wang, and Julian J. McAuley. 2020. Time interval aware self-attention for sequential recommendation. In International Conference on Web Search and Data Mining (WSDM). ACM, 322–330.
[27] Lei Li, Yongfeng Zhang, and Li Chen. 2023. Prompt distillation for efficient llm-based recommendation. In ACM International Conference on Information and Knowledge Management (CIKM). ACM, 1348–1357.
[28] Xinhang Li, Chong Chen, Xiangyu Zhao, Yong Zhang, and Chunxiao Xing. 2023. E4SRec: An elegant effective efficient extensible solution of large language models for sequential recommendation. arXiv preprint arXiv:2312.02443 (2023).
[29] Defu Lian, Qi Liu, and Enhong Chen. 2020. Personalized ranking with importance sampling. In ACM Web Conference (WWW). ACM / IW3C2, 1093–1103.
[30] Jiayi Liao, Sihang Li, Zhengyi Yang, Jiancan Wu, Yancheng Yuan, Xiang Wang, and Xiangnan He. 2023. LLaRA: Aligning large language models with sequential recommenders. arXiv preprint arXiv:2312.02445 (2023).
[31] Zhuang Ma and Michael Collins. 2018. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. In Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 3698–3707.
[32] Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, and Xiuqiang He. 2021. UltraGCN: Ultra simplification of graph convolutional networks for recommendation. In ACM International Conference on Information and Knowledge Management (CIKM). ACM, 1253–1262.
[33] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (NeurIPS) 26 (2013), 3111–3119.
[34] Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In International Conference on Machine Learning (ICML).
[35] OpenAI. 2023. GPT models documentation. https://platform.openai.com/docs/models/overview.
[36] Aleksandr Vladimirovich Petrov and Craig MacDonald. 2023. gSASRec: Reducing Overconfidence in Sequential Recommendation Trained with Negative Sampling. In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys 2023, Singapore, Singapore, September 18-22, 2023. ACM, 116–128.
[37] Junyan Qiu, Haitao Wang, Zhaolin Hong, Yiping Yang, Qiang Liu, and Xingxing Wang. 2023. ControlRec: Bridging the semantic gap between language model and personalized recommendation. arXiv preprint arXiv:2311.16441 (2023).
[38] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR) 21 (2020), 140:1–140:67.
[39] Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q Tran, Jonah Samost, et al. 2023. Recommender systems with generative retrieval. arXiv preprint arXiv:2305.05065 (2023).
[40] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, 452–461.
[41] Christian P Robert, George Casella, and George Casella. 1999. Monte Carlo statistical methods. Vol. 2. Springer.
[42] Guy Shani, David Heckerman, Ronen I Brafman, and Craig Boutilier. 2005. An mdp-based recommender system. Journal of Machine Learning Research (JMLR) 6, 9 (2005), 1265–1295.
[43] Wentao Shi, Jiawei Chen, Fuli Feng, Jizhi Zhang, Junkang Wu, Chongming Gao, and Xiangnan He. 2023. On the theories behind hard negative sampling for recommendation. In ACM Web Conference (WWW). ACM, 812–822.
[44] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In ACM International Conference on Information and Knowledge Management (CIKM). ACM, 1441–1450.
[45] Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In ACM International Conference on Web Search and Data Mining (WSDM). ACM, 565–573.

[46] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. CoRR abs/2302.13971 (2023).
[47] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. CoRR abs/2307.09288 (2023).
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS). 5998–6008.
[49] Jiancan Wu, Xiang Wang, Xingyu Gao, Jiawei Chen, Hongcheng Fu, Tianyu Qiu, and Xiangnan He. 2023. On the effectiveness of sampled softmax loss for item recommendation. ACM Transactions on Information Systems (TOIS) (2023). https://doi.org/10.1145/3637061
[50] Siyu Wu, Jun Wang, and Wei Zhang. 2024. Contrastive personalized exercise recommendation with reinforcement learning. IEEE Transactions on Learning Technologies (TLT) 17 (2024), 691–703.
[51] Ji Yang, Xinyang Yi, Derek Zhiyuan Cheng, Lichan Hong, Yang Li, Simon Xiaoming Wang, Taibai Xu, and Ed H. Chi. 2020. Mixed negative sampling for learning two-tower neural networks in recommendations. In ACM Web Conference (WWW). ACM / IW3C2, 441–447.
[52] Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed H. Chi. 2019. Sampling-bias-corrected neural modeling for large corpus item recommendations. In ACM Conference on Recommender Systems (RecSys). ACM, 269–277.
[53] Zhenrui Yue, Sara Rabhi, Gabriel de Souza Pereira Moreira, Dong Wang, and Even Oldridge. 2023. LlamaRec: Two-stage recommendation using large language models for ranking. arXiv preprint arXiv:2311.02089 (2023).
[54] Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023. Recommendation as instruction following: A large language model empowered recommendation approach. arXiv preprint arXiv:2305.07001 (2023).
[55] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). 1059–1068.
[56] Kun Zhou, Hui Yu, Wayne Xin Zhao, and Ji-Rong Wen. 2022. Filter-enhanced MLP is all you need for sequential recommendation. In ACM Web Conference (WWW). ACM, 2388–2399.

Contents
Abstract 1
1 Introduction 1
2 Related Work 2
3 Preliminaries 3
4 The Role of Cross-Entropy Loss in Optimizing Ranking Capability 3
4.1 Cross-Entropy for Some Ranking Metrics 3
4.2 Revisiting Noise Contrastive Estimation 4
4.3 Scaling Up the Sampled Normalizing Term 5
4.4 Computational Complexity 6
5 Experiments 6
5.1 Experimental Setup 6
5.2 Overall Performance Evaluation 7
5.3 Other Factors for Objective Evaluation 8
6 Conclusion 9
References 10
Contents 11
A Experimental Setup 11
B Overview of Loss Function 12
C Proofs 12
C.1 Proof of Proposition 4.1 12
C.2 Proof of Theorem 4.3 and 4.4 13
C.3 Proof of Eq. (8) 14
C.4 Proofs Regarding Reciprocal Rank (RR) 14
D Empirical Results on Reciprocal Rank (RR) 15
E Maximum Sequence Length 15

A EXPERIMENTAL SETUP
Baselines. Although an LLM by itself exhibits surprising zero-shot recommendation ability, non-negligible gaps remain [1, 22] unless domain-specific knowledge is injected. Hence, only LLM-based recommenders enhanced by fine-tuning are compared in this paper:
• P5 (CID+IID) [19] unifies multiple tasks (e.g., sequential recommendation and rating prediction) into a sequence-to-sequence paradigm. The combined use of collaborative and independent indexing creates LLM-compatible item IDs.
• POD [27] bridges IDs and words by distilling long discrete prompts into a few continuous prompts. It also suggests a task-alternated training strategy for efficiency.
• LlamaRec [53] aims to address the slow inference caused by autoregressive generation. Given candidates retrieved by conventional models, it reranks them with the Llama 2 foundation model.
• E4SRec [28] incorporates ID embeddings trained by conventional sequential models through a linear adaptor, and applies LoRA [18] for parameter-efficient fine-tuning.
Additionally, five sequential models, covering MLP, CNN, RNN, and Transformer architectures, are considered here to uncover the true capability of conventional methods:
• GRU4Rec [17] applies RNNs to recommendation, with specific modifications made to cope with data sparsity.
• Caser [45] treats the embedding matrix as an 'image' and captures local patterns by utilizing convolutional filters.
• SASRec [21] and BERT4Rec [44] are two pioneering works equipped with unidirectional and bidirectional self-attention, respectively. By their nature, SASRec predicts the next item based on previously interacted items, while BERT4Rec is optimized through a cloze task.
• FMLP-Rec [56] denoises item sequences in the frequency domain. Although FMLP-Rec consists solely of MLPs, it exhibits superior performance compared to Transformer-based models.
An overview of the model statistics can be found in Table 4. Note that for LlamaRec and E4SRec we employ Llama2-7B instead of Llama2-13B as the foundation model for training efficiency; this results in minor performance differences.
Implementation details. The LLM-based recommenders are implemented with their released source code, while the code for the conventional models is available at https://anonymous.4open.science/r/1025. Owing to differences in model design, it is challenging to conduct evaluations on a completely identical testbed. We therefore follow the routine of previous studies [21, 44]: the maximum sequence length is L = 200 on MovieLens-1M and L = 50 on Beauty and Yelp, whereas L = 20 on all three datasets for the LLM-based models. Empirically, this difference does not alter the conclusions.
Training strategies. For SASRec and BERT4Rec, each item sequence is trained once per epoch. As a result, only the most recent L historical interactions are accessed during training, which negatively impacts performance when sequence lengths are typically longer than L. LLM-based approaches and some other conventional methods therefore divide each sequence into multiple sub-sequences.
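The sketch below illustrates the two ways of constructing training data mentioned above. It is a minimal, self-contained example written for this discussion; the helper names are ours and do not come from the released code. `truncate_last_L` mirrors the SASRec/BERT4Rec setting in which only the most recent L interactions are ever seen, whereas `split_into_subsequences` slides a window over the full history so that every interaction serves as a prediction target, as the LLM-based methods (and the SASRec+ variant reported in Table 8) do.

```python
from typing import List, Tuple

def truncate_last_L(seq: List[int], L: int) -> List[int]:
    # One training sequence per user: only the most recent L interactions are kept.
    return seq[-L:]

def split_into_subsequences(seq: List[int], L: int) -> List[Tuple[List[int], int]]:
    # Slide a window over the full history so that every interaction is used
    # as a prediction target at least once.
    samples = []
    for t in range(1, len(seq)):
        prefix = seq[max(0, t - L):t]
        samples.append((prefix, seq[t]))  # (input sub-sequence, next item)
    return samples

history = [3, 8, 15, 2, 9, 27, 4]
print(truncate_last_L(history, L=5))
print(split_into_subsequences(history, L=5))
```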
B OVERVIEW OF LOSS FUNCTION
Cross-Entropy (CE), also known as the negative log-likelihood (NLL) loss, can be formulated as follows:

\ell_{CE} = -\log \frac{\exp(s_{v^+})}{\sum_{v \in \mathcal{I}} \exp(s_v)} = -s_{v^+} + \log Z_{CE}, \qquad Z_{CE} := \sum_{v \in \mathcal{I}} \exp(s_v).

This is also the de facto objective commonly used in the pre-training (and fine-tuning) of LLMs.

Binary Cross-Entropy (BCE). BCE samples one negative item v^- for each target v^+, which requires the recommender to possess excellent pointwise scoring capability:

\ell_{BCE} = -\log \sigma(s_{v^+}) - \log\big(1 - \sigma(s_{v^-})\big)
           = -\log \frac{\exp(s_{v^+})}{1 + \exp(s_{v^+})} - \log \frac{1}{1 + \exp(s_{v^-})}
           = -s_{v^+} + \log Z_{BCE}, \qquad Z_{BCE} := \big(1 + \exp(s_{v^+})\big)\big(1 + \exp(s_{v^-})\big).

Here \sigma : \mathbb{R} \to [0, 1] denotes the sigmoid function.
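For concreteness, the two objectives above can be written down in a few lines. The following is an illustrative NumPy sketch for a single training instance, not the training code used in the experiments; the scores and the sampled negative are made up.

```python
import numpy as np

def log_sigmoid(x):
    # numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def ce_loss(scores, target):
    # full softmax cross-entropy: -s_+ + log(sum_v exp(s_v)) over the whole catalogue
    return -scores[target] + np.logaddexp.reduce(scores)

def bce_loss(s_pos, s_neg):
    # -log(sigmoid(s_+)) - log(1 - sigmoid(s_-)) for one sampled negative
    return -log_sigmoid(s_pos) - log_sigmoid(-s_neg)

rng = np.random.default_rng(0)
scores = rng.normal(size=5000)            # scores over the item catalogue
target = 42                               # index of the ground-truth next item
negative = int(rng.integers(0, 5000))     # one uniformly sampled negative
print(ce_loss(scores, target), bce_loss(scores[target], scores[negative]))
```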
Bayesian Personalized Ranking (BPR) [40] also samples one negative item v^- for each target v^+, but it aims to maximize the probability that v^+ is chosen in preference to v^-:

\ell_{BPR} = -\log \sigma(s_{v^+} - s_{v^-})
           = -\log \frac{\exp(s_{v^+})}{\exp(s_{v^+}) + \exp(s_{v^-})}
           = -s_{v^+} + \log Z_{BPR}, \qquad Z_{BPR} := \exp(s_{v^+}) + \exp(s_{v^-}).

Importance Sampling (IS) [2, 20] is a widely used technique for approximating CE. It corrects the approximation error via a proposal distribution Q:

\ell_{IS} = -\log \frac{\exp\big(s_{v^+} - \log Q(v^+)\big)}{\sum_{i=1}^{K} \exp\big(s_{v_i} - \log Q(v_i)\big)}
          = -\log \frac{\exp(s_{v^+})}{\sum_{i=1}^{K} \exp\big(s_{v_i} - \log Q(v_i) + \log Q(v^+)\big)}
          = -s_{v^+} + \log Z_{IS}, \qquad Z_{IS} := \sum_{i=1}^{K} \exp\big(s_{v_i} - \log Q(v_i) + \log Q(v^+)\big).

In addition to the uniform distribution, distributions derived from item popularity [29] are also a commonly used choice.

Noise Contrastive Estimation (NCE) [14, 34] requires the model to discriminate the target v^+ from an easy-to-sample noise distribution:

\ell_{NCE} = -\log \sigma(s'_{v^+}) - \sum_{i=1}^{K} \log\big(1 - \sigma(s'_{v_i})\big)
           = -\log \frac{\exp(s'_{v^+})}{1 + \exp(s'_{v^+})} - \sum_{i=1}^{K} \log \frac{1}{1 + \exp(s'_{v_i})}
           = -s'_{v^+} + \log Z_{NCE}, \qquad Z_{NCE} := \big(1 + \exp(s'_{v^+})\big) \prod_{i=1}^{K} \big(1 + \exp(s'_{v_i})\big).

In the case of uniform sampling, s'_v = s_v - c - \log \frac{K}{|\mathcal{I}|}, where c is a trainable parameter serving as an estimate of \log Z_{CE}.

NEGative sampling (NEG) [33] is a special case of NCE obtained by fixing c = \log \frac{|\mathcal{I}|}{K}:

\ell_{NEG} = -\log \sigma(s_{v^+}) - \sum_{i=1}^{K} \log\big(1 - \sigma(s_{v_i})\big)
           = -\log \frac{\exp(s_{v^+})}{1 + \exp(s_{v^+})} - \sum_{i=1}^{K} \log \frac{1}{1 + \exp(s_{v_i})}
           = -s_{v^+} + \log Z_{NEG}, \qquad Z_{NEG} := \big(1 + \exp(s_{v^+})\big) \prod_{i=1}^{K} \big(1 + \exp(s_{v_i})\big).
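The pairwise and sampled objectives admit equally small sketches. The snippet below renders the formulas above with a uniform proposal Q; it is only an illustration under our own naming, and the sampled negatives may occasionally collide with the target, which a real implementation would filter out.

```python
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)

def bpr_loss(s_pos, s_neg):
    # -log(sigmoid(s_+ - s_-)) for one sampled negative
    return -log_sigmoid(s_pos - s_neg)

def is_loss(s_pos, s_negs, log_q_pos, log_q_negs):
    # -log[ exp(s_+ - log Q(v+)) / sum_i exp(s_i - log Q(v_i)) ]
    num = s_pos - log_q_pos
    den = np.logaddexp.reduce(s_negs - log_q_negs)
    return -(num - den)

def nce_loss(s_pos, s_negs, c, n_items):
    # corrected scores s' = s - c - log(K / |I|) under uniform sampling
    K = len(s_negs)
    corr = c + np.log(K / n_items)
    return -log_sigmoid(s_pos - corr) - log_sigmoid(-(s_negs - corr)).sum()

def neg_loss(s_pos, s_negs):
    # NEG: NCE with c fixed to log(|I| / K), i.e., uncorrected scores
    return -log_sigmoid(s_pos) - log_sigmoid(-s_negs).sum()

rng = np.random.default_rng(1)
n_items, K = 5000, 100
scores = rng.normal(size=n_items)
target, negs = 42, rng.integers(0, n_items, size=K)
log_q = np.log(1.0 / n_items)  # uniform proposal Q
print(bpr_loss(scores[target], scores[int(negs[0])]),
      is_loss(scores[target], scores[negs], log_q, log_q),
      nce_loss(scores[target], scores[negs], c=0.0, n_items=n_items),
      neg_loss(scores[target], scores[negs]))
```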
C PROOFS
This section completes the proofs regarding the connection between the aforementioned loss functions and ranking metrics. Before delving into the proofs, it is important to note that this paper focuses on metrics for next-item recommendation, as this setting is most consistent with the next-token generation nature of LLMs.

C.1 Proof of Proposition 4.1
Proposition C.1. For a target item v^+ which is ranked as r^+, the following inequality holds true for any n \ge r^+:

-\log \mathrm{NDCG}(r^+) \le \ell_{CE\text{-}n},  (9)

where

\ell_{CE\text{-}n} := -s_{v^+} + \log \sum_{r(v) \le n} \exp(s_v).  (10)

Proof. Notice that \log_2(1 + x) \le x holds true for any x \ge 1. Hence, we have

\mathrm{NDCG}(r^+) = \frac{1}{\log_2(1 + r^+)} \ge \frac{1}{r^+} = \frac{1}{1 + \sum_{v \ne v^+} \delta(s_v > s_{v^+})} = \frac{1}{1 + \sum_{r(v) < r^+} \delta(s_v > s_{v^+})}
\ge \frac{1}{1 + \sum_{r(v) < r^+} \exp(s_v - s_{v^+})} = \frac{\exp(s_{v^+})}{\exp(s_{v^+}) + \sum_{r(v) < r^+} \exp(s_v)} \ge \frac{\exp(s_{v^+})}{\sum_{r(v) \le n} \exp(s_v)},

where \delta(\text{condition}) = 1 if the given condition is true and 0 otherwise, and the second-to-last inequality holds because \exp(s_v - s_{v^+}) \ge 1 when s_v > s_{v^+}. Taking the negative logarithm of both sides yields Eq. (9). □
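Proposition C.1 is straightforward to sanity-check numerically. The following sketch (catalogue size, scores, and the target index are arbitrary) computes the target's rank, the quantity -log NDCG(r^+), and the truncated loss \ell_{CE-n} for several n \ge r^+, and asserts Eq. (9).

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.normal(size=500)                    # scores s_v over the catalogue
target = 123
rank = 1 + int((scores > scores[target]).sum())  # r+ of the target item

neg_log_ndcg = np.log(np.log2(1.0 + rank))       # -log NDCG(r+) = log(log2(1 + r+))
order = np.argsort(-scores)                      # item indices sorted by score

for n in range(rank, scores.size + 1, 50):
    kept = order[:n]                             # items ranked no worse than n (includes v+)
    loss_ce_n = -scores[target] + np.log(np.exp(scores[kept]).sum())
    assert neg_log_ndcg <= loss_ce_n + 1e-9      # Eq. (9)
print("rank:", rank, "- bound verified for all tested n >=", rank)
```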
C.2 Proof of Theorems 4.3 and 4.4
First, let us introduce some lemmas that give lower bounds on the loss functions.

Lemma C.2. Let \xi_1 be the number of sampled items with non-negative corrected scores; that is,
\xi_1 = |\{v_i : s'_{v_i} \ge 0, \ i = 1, 2, \ldots, K\}|.  (11)
Then, we have
\ell_{NCE} \ge \xi_1 \log 2.  (12)

Proof. According to the definition of NCE, it follows that
\ell_{NCE} = -s'_{v^+} + \log \Big( (1 + \exp(s'_{v^+})) \prod_{i=1}^{K} (1 + \exp(s'_{v_i})) \Big)
           = \log \Big( (1 + \exp(-s'_{v^+})) \prod_{i=1}^{K} (1 + \exp(s'_{v_i})) \Big)
           \ge \log \prod_{i=1}^{K} (1 + \exp(s'_{v_i})) = \sum_{i=1}^{K} \log(1 + \exp(s'_{v_i}))
           \ge \sum_{i=1}^{K} \delta(s'_{v_i} \ge 0) \log 2 = \xi_1 \log 2.  □

Lemma C.3. Let \xi_2 be the number of sampled items with non-negative scores; that is,
\xi_2 = |\{v_i : s_{v_i} \ge 0, \ i = 1, 2, \ldots, K\}|.  (13)
Then, we have
\ell_{NEG} \ge \xi_2 \log 2.  (14)

Proof. The proof is completely the same as that of Lemma C.2. □

Lemma C.4. Let \xi_3 be the number of sampled items with scores not lower than that of the target; that is,
\xi_3 = |\{v_i : s_{v_i} \ge s_{v^+}, \ i = 1, 2, \ldots, K\}|.  (15)
Then, we have
\ell_{SCE} \ge \log(1 + \alpha \xi_3).  (16)

Proof. According to the definition of SCE, it follows that
\ell_{SCE} = -s_{v^+} + \log \Big( \exp(s_{v^+}) + \alpha \sum_{i=1}^{K} \exp(s_{v_i}) \Big)
           = \log \Big( 1 + \alpha \sum_{i=1}^{K} \exp(s_{v_i} - s_{v^+}) \Big)
           \ge \log \Big( 1 + \alpha \sum_{i=1}^{K} \delta(s_{v_i} \ge s_{v^+}) \Big) = \log(1 + \alpha \xi_3).  □

Lemma C.5. Let \xi_4 be the number of sampled items with scores not lower than that of the target; that is,
\xi_4 = |\{v_i : s_{v_i} \ge s_{v^+}, \ i = 1, 2, \ldots, K\}|.  (17)
Then, adopting the convention \log 0 := \lim_{\epsilon \to 0} \log \epsilon = -\infty, we have
\ell_{IS} \ge \log(\xi_4)  (18)
if the proposal distribution Q(v) \equiv 1/|\mathcal{I}|.

Proof. With a uniform proposal, the correction terms in \ell_{IS} cancel, so according to the definition of importance sampling it follows that
\ell_{IS} = -s_{v^+} + \log \sum_{i=1}^{K} \exp(s_{v_i})
          = \log \sum_{i=1}^{K} \exp(s_{v_i} - s_{v^+})
          \ge \log \sum_{i=1}^{K} \delta(s_{v_i} \ge s_{v^+}) = \log(\xi_4).  □

Lemma C.6. Let \xi \sim \mathcal{B}(K, p) denote a random variable representing the number of successes over K binomial trials with success probability p. Then, we have
\mathbb{P}(\xi \ge m) \ge 1 - m(1 - p)^{\lfloor K/m \rfloor}, \quad \forall m = 0, 1, \ldots, K.  (19)

Proof. (The argument follows the response posted at https://math.stackexchange.com/questions/3626472/upper-bound-on-binomial-distribution.) Divide the K independent binomial trials into m disjoint groups, each containing at least \lfloor K/m \rfloor trials. If \xi < m, then at least one of the groups must contain no successes; formally, we have
\mathbb{P}(\xi < m) \le \mathbb{P}\Big( \bigcup_{i=1}^{m} \{\text{no successes observed in group } i\} \Big)  (20)
\le \sum_{i=1}^{m} \mathbb{P}(\{\text{no successes observed in group } i\})  (21)
\le m(1 - p)^{\lfloor K/m \rfloor}.  (22)
Hence, the proof is completed by noting the fact that
\mathbb{P}(\xi \ge m) = 1 - \mathbb{P}(\xi < m).  (23)  □
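Since Lemma C.6 is the only probabilistic ingredient of Theorems C.7 and C.9 below, it is worth a quick simulation. The snippet below (K, p, and the tested values of m are chosen arbitrarily) estimates P(ξ ≥ m) by Monte Carlo and compares it with the closed-form lower bound of Eq. (19).

```python
import numpy as np

rng = np.random.default_rng(0)
K, p = 300, 0.02            # number of sampled items and per-trial success probability
trials = 200_000

xi = rng.binomial(K, p, size=trials)          # draws of xi ~ B(K, p)
for m in (1, 2, 3, 5):
    empirical = (xi >= m).mean()
    lower = 1.0 - m * (1.0 - p) ** (K // m)   # Eq. (19)
    print(f"m={m}: P(xi>={m}) ~ {empirical:.4f} >= bound {lower:.4f}")
    assert empirical >= lower - 3e-3          # small slack for Monte Carlo noise
```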
Theorem C.7. Let v^+ be a target item which is ranked as r^+ \le 2^{2^m} - 1 for some m \in \mathbb{N}, and
S_+ := \{v \in \mathcal{I} : s_v \ge 0\}, \quad S'_+ := \{v \in \mathcal{I} : s'_v \ge 0\}.
If we uniformly sample K items for training, then with a probability of at least

1 - m \big(1 - |S'_+|/|\mathcal{I}|\big)^{\lfloor K/m \rfloor},                 if \ell_* = \ell_{NCE};
1 - m \big(1 - |S_+|/|\mathcal{I}|\big)^{\lfloor K/m \rfloor},                  if \ell_* = \ell_{NEG};
1 - \frac{2^m}{\alpha} \big(1 - r^+/|\mathcal{I}|\big)^{\lfloor \alpha K/2^m \rfloor},   if \ell_* = \ell_{SCE};
1 - 2^m \big(1 - r^+/|\mathcal{I}|\big)^{\lfloor K/2^m \rfloor},                if \ell_* = \ell_{IS},   (24)

we have
-\log \mathrm{NDCG}(r^+) \le \ell_*.  (25)

Proof. As r^+ \le 2^{2^m} - 1 for some m \in \mathbb{N}, we immediately have
-\log \mathrm{NDCG}(r^+) \le m \log 2.
Now, let us prove the conclusions one by one.

Case 1 (NCE). According to Lemma C.2 we know that \ell_{NCE} \ge \xi_1 \log 2. Therefore, Eq. (25) holds true for NCE as long as \xi_1 \ge m. Formally, we have
\mathbb{P}\big(-\log \mathrm{NDCG}(r^+) \le \ell_{NCE}\big) \ge \mathbb{P}(\xi_1 \ge m).  (26)
Also notice that uniformly sampling from \mathcal{I} yields a probability of p = |S'_+|/|\mathcal{I}| that the corrected score of the sampled item is non-negative. Therefore, based on Lemma C.6, we have
\mathbb{P}(\xi_1 \ge m) \ge 1 - m\big(1 - |S'_+|/|\mathcal{I}|\big)^{\lfloor K/m \rfloor}.  (27)

Case 2 (NEG). The proof is completely the same as that of NCE.

Case 3 (SCE). Analogously, Lemma C.4 implies that
\mathbb{P}\big(-\log \mathrm{NDCG}(r^+) \le \ell_{SCE}\big)  (28)
\ge \mathbb{P}\big(\xi_3 \ge (2^m - 1)/\alpha\big)  (29)
\ge \mathbb{P}\big(\xi_3 \ge 2^m/\alpha\big) = \mathbb{P}\big(\xi_3 \ge \lceil 2^m/\alpha \rceil\big).  (30)
Also notice that uniformly sampling from \mathcal{I} yields a probability of p = r^+/|\mathcal{I}| that the score of the sampled item is not lower than that of the target (i.e., that it is one of the top-r^+ ranked items). Therefore, based on Lemma C.6, we have
\mathbb{P}\big(\xi_3 < 2^m/\alpha\big) \le \frac{2^m}{\alpha}\big(1 - r^+/|\mathcal{I}|\big)^{\lfloor K/\lfloor 2^m/\alpha \rfloor \rfloor}  (31)
\le \frac{2^m}{\alpha}\big(1 - r^+/|\mathcal{I}|\big)^{\lfloor \alpha K/2^m \rfloor}.  (32)

Case 4 (IS). The proof for importance sampling is similar to that of SCE.

The proof has been completed now. □

C.3 Proof of Eq. (8)
Here, we show that \ell_{SCE} is morphologically equivalent to \ell_{IS} if

Q(v) = \begin{cases} \dfrac{\alpha}{|\mathcal{I}| - 1 + \alpha} & \text{if } v = v^+, \\ \dfrac{1}{|\mathcal{I}| - 1 + \alpha} & \text{if } v \ne v^+. \end{cases}

Proof. Under this choice of Q, we have

\ell_{SCE} = -\log \frac{\exp(s_{v^+})}{\exp(s_{v^+}) + \alpha \sum_{i=1}^{K} \exp(s_{v_i})}
           = -\log \frac{\frac{|\mathcal{I}|-1+\alpha}{\alpha}\exp(s_{v^+})}{\frac{|\mathcal{I}|-1+\alpha}{\alpha}\exp(s_{v^+}) + (|\mathcal{I}|-1+\alpha)\sum_{i=1}^{K}\exp(s_{v_i})}
           = -\log \frac{\exp\big(s_{v^+} - \log \frac{\alpha}{|\mathcal{I}|-1+\alpha}\big)}{\exp\big(s_{v^+} - \log \frac{\alpha}{|\mathcal{I}|-1+\alpha}\big) + \sum_{i=1}^{K}\exp\big(s_{v_i} - \log \frac{1}{|\mathcal{I}|-1+\alpha}\big)},

which is morphologically equivalent to \ell_{IS} with K + 1 items in the normalizing term. □
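The equivalence shown in C.3 can also be checked numerically. The sketch below evaluates \ell_{SCE} (in the form used in the proof of Lemma C.4) and the corresponding IS-style expression with the target kept among the K + 1 items in the normalizing term; the catalogue size, K, and α are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_items, K, alpha = 10_000, 64, 5.0
s_pos = rng.normal()
s_neg = rng.normal(size=K)

# SCE: -s_+ + log( exp(s_+) + alpha * sum_i exp(s_i) )
sce = -s_pos + np.log(np.exp(s_pos) + alpha * np.exp(s_neg).sum())

# IS with Q(v+) = alpha / (|I| - 1 + alpha) and Q(v) = 1 / (|I| - 1 + alpha) otherwise,
# keeping the target inside the normalizing term (K + 1 items in total).
log_q_pos = np.log(alpha / (n_items - 1 + alpha))
log_q_neg = np.log(1.0 / (n_items - 1 + alpha))
num = s_pos - log_q_pos
den = np.logaddexp(num, np.log(np.exp(s_neg - log_q_neg).sum()))
is_like = -(num - den)

assert np.isclose(sce, is_like)
print(sce, is_like)
```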
C.4 Proofs Regarding Reciprocal Rank (RR)
Reciprocal Rank (RR) is another popular metric used to measure ranking capability, defined as follows:
\mathrm{RR}(r^+) = \frac{1}{r^+}.  (33)
We provide some theoretical conclusions here and leave the empirical results to the next section. First, we connect RR to the cross-entropy loss, as Proposition 4.1 does for NDCG.

Proposition C.8. For a target item v^+ which is ranked as r^+, the following inequality holds true for any n \ge r^+:
-\log \mathrm{RR}(r^+) \le \ell_{CE\text{-}n},  (34)
where
\ell_{CE\text{-}n} := -s_{v^+} + \log \sum_{r(v) \le n} \exp(s_v).

Proof.
\mathrm{RR}(r^+) = \frac{1}{r^+} = \frac{1}{1 + \sum_{v \ne v^+} \delta(s_v > s_{v^+})} = \frac{1}{1 + \sum_{r(v) < r^+} \delta(s_v > s_{v^+})}
\ge \frac{1}{1 + \sum_{r(v) < r^+} \exp(s_v - s_{v^+})} = \frac{\exp(s_{v^+})}{\exp(s_{v^+}) + \sum_{r(v) < r^+} \exp(s_v)}
\ge \frac{\exp(s_{v^+})}{\sum_{r(v) \le n} \exp(s_v)}.  □
Next, we establish a connection between RR and NCE, NEG, SCE, and importance sampling, similar to what Theorem C.7 does for NDCG.

Theorem C.9. Let v^+ be a target item which is ranked as r^+ \le 2^m for some m \in \mathbb{N}, and
S_+ := \{v \in \mathcal{I} : s_v \ge 0\}, \quad S'_+ := \{v \in \mathcal{I} : s'_v \ge 0\}.
If we uniformly sample K items for training, then with a probability of at least

1 - m \big(1 - |S'_+|/|\mathcal{I}|\big)^{\lfloor K/m \rfloor},                 if \ell_* = \ell_{NCE};
1 - m \big(1 - |S_+|/|\mathcal{I}|\big)^{\lfloor K/m \rfloor},                  if \ell_* = \ell_{NEG};
1 - \frac{2^m}{\alpha} \big(1 - r^+/|\mathcal{I}|\big)^{\lfloor \alpha K/2^m \rfloor},   if \ell_* = \ell_{SCE};
1 - 2^m \big(1 - r^+/|\mathcal{I}|\big)^{\lfloor K/2^m \rfloor},                if \ell_* = \ell_{IS},   (35)

we have
-\log \mathrm{RR}(r^+) \le \ell_*.  (36)

Proof. As r^+ \le 2^m for some m \in \mathbb{N}, we immediately have
-\log \mathrm{RR}(r^+) \le m \log 2.
The conclusions can then be proved in the same way as in Theorem C.7. □

Remark 1. It is worth noting that the inequality for RR is achieved under a stricter condition than that for NDCG. For the same rank r^+, NDCG allows for a smaller value of m, thereby yielding slightly higher bounding probabilities. However, this nuance does not undermine the fact that the same conclusions can be drawn from the two metrics; the empirical observations in the next section verify this.
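To make Remark 1 concrete, the snippet below evaluates the SCE case of the bounding probabilities in Eqs. (24) and (35) for one made-up configuration of the catalogue size, K, α, and r^+; the only difference between the two theorems is the smallest admissible m.

```python
import numpy as np

def bound_sce(m, r_plus, n_items, K, alpha):
    # 1 - (2^m / alpha) * (1 - r+/|I|)^floor(alpha*K / 2^m), cf. Eqs. (24)/(35)
    return 1.0 - (2.0**m / alpha) * (1.0 - r_plus / n_items) ** np.floor(alpha * K / 2.0**m)

n_items, K, alpha, r_plus = 12_000, 500, 100, 30      # illustrative numbers only

m_ndcg = int(np.ceil(np.log2(np.log2(1 + r_plus))))   # smallest m with r+ <= 2^(2^m) - 1
m_rr = int(np.ceil(np.log2(r_plus)))                  # smallest m with r+ <= 2^m

print("NDCG (Thm C.7):", bound_sce(m_ndcg, r_plus, n_items, K, alpha))
print("RR   (Thm C.9):", bound_sce(m_rr, r_plus, n_items, K, alpha))
```

Because RR forces a larger m for the same r^+, its bound is slightly looser, which is precisely the nuance noted in Remark 1.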
D EMPIRICAL RESULTS ON RECIPROCAL RANK (RR)
In this part, we provide the empirical results on Reciprocal Rank (RR), which are very similar to those on NDCG. Note that the Mean Reciprocal Rank (MRR) reported below is the average performance over all users.

[Figure 7 (panels: Beauty, MovieLens-1M; y-axis: MRR@10; x-axis: η ∈ {0.1, 1, 5}): Performance comparison based on tighter bounds for MRR. The dashed line represents the results trained by CE (namely the case of η → +∞).]

[Figure 8 (panels: Beauty, MovieLens-1M; y-axis: MRR@10; x-axis: K; curves: α = 1, 5, 100): MRR@10 performance under various weights α.]

Table 7: MRR@5 and MRR@10 comparison on the Beauty, MovieLens-1M, and Yelp datasets. The best results of each block are marked in bold. '▲% over CE/LLM' represents the relative gap between the respective best results.

Loss  Method        Beauty            MovieLens-1M      Yelp
                    MRR@5   MRR@10    MRR@5   MRR@10    MRR@5   MRR@10
LLM   POD           0.0107  0.0116    0.0249  0.0264    0.0267  0.0292
      P5(CID+IID)   0.0345  0.0376    0.1312  0.1429    0.0164  0.0186
      LlamaRec      0.0344  0.0380    0.0903  0.1046    0.0270  0.0295
      E4SRec        0.0326  0.0356    0.1025  0.1144    0.0174  0.0196
CE    GRU4Rec       0.0281  0.0310    0.1311  0.1438    0.0137  0.0161
      Caser         0.0260  0.0284    0.1302  0.1417    0.0187  0.0200
      SASRec        0.0443  0.0479    0.1287  0.1408    0.0302  0.0331
      BERT4Rec      0.0294  0.0325    0.1118  0.1242    0.0207  0.0231
      FMLP-Rec      0.0438  0.0473    0.1354  0.1481    0.0316  0.0348
      ▲% over LLM   28.1%   26.0%     3.2%    3.6%      16.9%   18.1%
BPR   GRU4Rec       0.0125  0.0147    0.0862  0.0977    0.0086  0.0102
      Caser         0.0143  0.0163    0.1016  0.1132    0.0185  0.0201
      SASRec        0.0248  0.0279    0.0934  0.1045    0.0137  0.0158
      BERT4Rec      0.0121  0.0141    0.0690  0.0795    0.0119  0.0138
      FMLP-Rec      0.0269  0.0298    0.1034  0.1153    0.0286  0.0314
      ▲% over CE    -39.2%  -37.7%    -23.6%  -22.1%    -9.6%   -9.8%
      ▲% over LLM   -22.1%  -21.5%    -21.2%  -19.3%    5.7%    6.5%
BCE   GRU4Rec       0.0108  0.0129    0.0836  0.0953    0.0079  0.0094
      Caser         0.0154  0.0174    0.0894  0.1005    0.0198  0.0214
      SASRec        0.0225  0.0257    0.0859  0.0977    0.0192  0.0215
      BERT4Rec      0.0122  0.0144    0.0641  0.0743    0.0111  0.0131
      FMLP-Rec      0.0248  0.0281    0.0967  0.1088    0.0287  0.0312
      ▲% over CE    -43.9%  -41.2%    -28.6%  -26.5%    -9.2%   -10.5%
      ▲% over LLM   -28.1%  -26.0%    -26.3%  -23.8%    6.1%    5.7%
NCE   GRU4Rec       0.0241  0.0270    0.1300  0.1420    0.0118  0.0141
      Caser         0.0213  0.0238    0.1296  0.1415    0.0194  0.0209
      SASRec        0.0419  0.0455    0.1249  0.1377    0.0303  0.0330
      BERT4Rec      0.0271  0.0303    0.1098  0.1226    0.0232  0.0257
      FMLP-Rec      0.0425  0.0460    0.1329  0.1460    0.0316  0.0348
      ▲% over CE    -4.0%   -3.8%     -1.8%   -1.4%     -0.1%   0.0%
      ▲% over LLM   23.0%   21.2%     1.3%    2.2%      16.8%   18.1%
SCE   GRU4Rec       0.0296  0.0323    0.1349  0.1475    0.0148  0.0173
      Caser         0.0277  0.0300    0.1360  0.1474    0.0193  0.0208
      SASRec        0.0435  0.0470    0.1335  0.1457    0.0295  0.0324
      BERT4Rec      0.0318  0.0349    0.1184  0.1308    0.0239  0.0265
      FMLP-Rec      0.0436  0.0472    0.1410  0.1531    0.0304  0.0339
      ▲% over CE    -1.5%   -1.3%     4.2%    3.4%      -3.7%   -2.8%
      ▲% over LLM   26.1%   24.3%     7.5%    7.2%      12.6%   14.9%

According to Proposition C.8, -log RR(r^+) would be strictly bounded by CE-like losses as long as all items ranked before v^+ are retained in the normalizing term. Following the strategy used for NDCG, we perform the adaptive truncation of Eq. (3) to investigate the effect of a tighter bound on RR. Figure 7 depicts a curve similar to that of NDCG: it increases as the training instability is gradually overcome, and decreases as the objective approaches cross-entropy (i.e., η → +∞).

In addition, Figure 8 and Figure 9 respectively demonstrate the effectiveness of SCE and the limitations of NCE and NEG. By sampling more negatives per iteration, SCE enjoys superior performance, whereas both NCE and NEG suffer from the training difficulties caused by their weak bounds in the early training stages. Thus, SCE is arguably a simpler and preferable approximation to cross-entropy.

Finally, Table 7 reports the overall comparisons w.r.t. MRR; the corresponding conclusions drawn from Table 5 can also be observed here.
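For reference, the MRR@k values reported in Table 7 and Figures 7-9 can be computed as follows under the next-item protocol; this is a generic sketch of the metric, not the authors' evaluation script.

```python
import numpy as np

def mrr_at_k(all_scores, targets, k=10):
    """Mean reciprocal rank truncated at k.

    all_scores: (num_users, num_items) predicted scores
    targets:    (num_users,) index of the held-out next item per user
    """
    target_scores = all_scores[np.arange(len(targets)), targets][:, None]
    ranks = 1 + (all_scores > target_scores).sum(axis=1)  # ties ignored for brevity
    rr = np.where(ranks <= k, 1.0 / ranks, 0.0)           # items ranked beyond k contribute 0
    return rr.mean()

scores = np.random.default_rng(3).normal(size=(4, 100))
targets = np.array([5, 17, 42, 99])
print(mrr_at_k(scores, targets, k=10))
```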
[Figure 9 (panels: (a) Beauty, (b) MovieLens-1M; sub-plots: NCE (c = 1) and NEG; y-axis: MRR@10; x-axis: epoch; curves: K = 10, 50, 100, 500): MRR@10 performance of NCE and NEG across different numbers of negative samples.]

E MAXIMUM SEQUENCE LENGTH

Table 8: Recommendation performance under various maximum sequence lengths L. The best results of each block are marked in bold. In contrast to SASRec, SASRec+ augments each sequence using a sliding window, so all interactions are learned during training.

Loss  Method        Beauty                                  MovieLens-1M
                    L    HR@5    HR@10   NDCG@5  NDCG@10    L    HR@5    HR@10   NDCG@5  NDCG@10
LLM   P5(CID+IID)   10   0.0554  0.0785  0.0391  0.0466     20   0.2225  0.3131  0.1570  0.1861
                    20   0.0569  0.0791  0.0403  0.0474     50   0.2300  0.3204  0.1585  0.1877
                    50   0.0559  0.0784  0.0405  0.0478     100  0.1560  0.2252  0.1035  0.1258
                    100  0.0557  0.0776  0.0391  0.0461     200  0.1382  0.1949  0.0919  0.1095
CE    SASRec        10   0.0673  0.0938  0.0479  0.0565     20   0.1715  0.2481  0.1161  0.1407
                    20   0.0691  0.0973  0.0492  0.0583     50   0.1975  0.2827  0.1341  0.1616
                    50   0.0713  0.0986  0.0510  0.0597     100  0.2110  0.3017  0.1446  0.1738
                    100  0.0715  0.0995  0.0511  0.0601     200  0.2221  0.3131  0.1518  0.1812
      SASRec+       10   0.0670  0.0933  0.0477  0.0562     20   0.2309  0.3233  0.1590  0.1887
                    20   0.0691  0.0952  0.0490  0.0574     50   0.2336  0.3243  0.1614  0.1906
                    50   0.0685  0.0952  0.0485  0.0571     100  0.2314  0.3246  0.1589  0.1890
                    100  0.0688  0.0955  0.0492  0.0578     200  0.2295  0.3221  0.1589  0.1889
NCE   SASRec        10   0.0654  0.0923  0.0460  0.0547     20   0.1703  0.2448  0.1132  0.1372
                    20   0.0676  0.0949  0.0476  0.0564     50   0.1905  0.2775  0.1281  0.1562
                    50   0.0686  0.0961  0.0485  0.0573     100  0.2079  0.3001  0.1406  0.1703
                    100  0.0678  0.0955  0.0478  0.0567     200  0.2177  0.3135  0.1479  0.1788
      SASRec+       10   0.0649  0.0912  0.0458  0.0542     20   0.2267  0.3236  0.1560  0.1873
                    20   0.0652  0.0915  0.0467  0.0552     50   0.2291  0.3217  0.1572  0.1871
                    50   0.0653  0.0911  0.0462  0.0545     100  0.2297  0.3262  0.1574  0.1886
                    100  0.0659  0.0913  0.0466  0.0548     200  0.2293  0.3246  0.1572  0.1880
SCE   SASRec        10   0.0679  0.0935  0.0479  0.0562     20   0.1760  0.2514  0.1199  0.1442
                    20   0.0690  0.0957  0.0491  0.0577     50   0.2041  0.2869  0.1397  0.1664
                    50   0.0698  0.0968  0.0500  0.0587     100  0.2182  0.3084  0.1496  0.1787
                    100  0.0707  0.0970  0.0506  0.0591     200  0.2273  0.3186  0.1567  0.1862
      SASRec+       10   0.0682  0.0936  0.0488  0.0569     20   0.2312  0.3231  0.1602  0.1899
                    20   0.0679  0.0937  0.0491  0.0574     50   0.2319  0.3271  0.1607  0.1915
                    50   0.0684  0.0934  0.0492  0.0572     100  0.2342  0.3284  0.1620  0.1924
                    100  0.0687  0.0935  0.0496  0.0576     200  0.2331  0.3271  0.1617  0.1921
