Data-Efficient Fine-Tuning for LLM-based Recommendation

Xinyu Lin, Wenjie Wang∗, Yongqi Li, Shuo Yang, Fuli Feng, Yinwei Wei, and Tat-Seng Chua
National University of Singapore, Singapore
[email protected]
ABSTRACT
Leveraging Large Language Models (LLMs) for recommendation has recently garnered considerable attention, where fine-tuning plays a key role in LLMs' adaptation. However, the cost of fine-tuning LLMs on rapidly expanding recommendation data limits their practical application. To address this challenge, few-shot fine-tuning offers a promising approach to quickly adapt LLMs to new recommendation data. We propose the task of data pruning for efficient LLM-based recommendation, aimed at identifying representative samples tailored for LLMs' few-shot fine-tuning. While coreset selection is closely related to the proposed task, existing coreset selection methods often rely on suboptimal heuristic metrics or entail costly optimization on large-scale recommendation data.

To tackle these issues, we introduce two primary objectives for the data pruning task in the context of LLM-based recommendation: 1) high accuracy aims to identify the influential samples that can lead to high overall performance; and 2) high efficiency underlines the low costs of the data pruning process. To pursue the two objectives, we propose a novel data pruning method incorporating two scores, namely influence score and effort score, to efficiently identify the influential samples. Particularly, the influence score is introduced to accurately estimate the influence of removing each sample on the overall performance. To achieve low costs of the data pruning process, we employ a small-sized surrogate model to replace LLMs to obtain the influence score. Considering the potential gap between the surrogate model and LLMs, we further propose an effort score to prioritize some hard samples specifically for LLMs. We instantiate the proposed method on two competitive LLM-based recommender models, and empirical results on three real-world datasets validate the effectiveness of our proposed method. In particular, our method uses only 2% of the samples to surpass full-data fine-tuning, reducing time costs by 97%.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Data Pruning, LLM-based Recommendation, Efficient Fine-tuning

ACM Reference Format:
Xinyu Lin, Wenjie Wang, Yongqi Li, Shuo Yang, Fuli Feng, Yinwei Wei, and Tat-Seng Chua. 2024. Data-efficient Fine-tuning for LLM-based Recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24), July 14–18, 2024, Washington, DC, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3626772.3657807

∗Corresponding author. This work is supported by the CCCD Key Lab of Ministry of Culture and Tourism.

1 INTRODUCTION
Leveraging Large Language Models (LLMs) for recommendation has demonstrated promising efficacy across various tasks, including Click-Through Rate (CTR) prediction [4], sequential recommendation [35], and explainable recommendation [11]. To build LLM-based recommender models, it is crucial to fine-tune LLMs on recommendation data for two primary reasons: 1) there exists a significant gap between previous LLMs' tuning tasks and the recommendation tasks [4], and 2) the rapid and continuous update of recommendation data necessitates frequent fine-tuning of LLMs [38]. For example, there are approximately 160 million new videos and 942 billion interactions emerging on TikTok per day¹. Thus, frequent fine-tuning is imperative to incorporate up-to-date item information and enhance user behavior comprehension.

¹https://www.tiktok.com/transparency/.
Figure 1: (a) Few-shot performance on MicroLens-50K, revealing that BIGRec achieves remarkable performance with only hundreds of samples. (b) The comparison of training costs between an LLM (BIGRec) and a surrogate model (SASRec), showing the low costs of surrogate models. The statistics are based on NVIDIA RTX A5000 on Games:

Model    GPU (GiB)       Time
BIGRec   18.60 × 4 GPU   36.87h
SASRec   1.61 × 1 GPU    0.45h
% Red.   97.84%          98.78%
However, fine-tuning LLMs on large-scale recommendation data demands substantial computational resources and time costs [26], thereby diminishing the practicality of LLM-based recommender models in real-world applications. As such, it is essential to enhance the fine-tuning efficiency of LLM-based recommender models.

Fortunately, the rich world knowledge encoded in LLMs offers a promising solution for efficient fine-tuning: few-shot fine-tuning. Previous studies have uncovered that LLMs have the potential to quickly adapt to recommendation tasks by fine-tuning on randomly sampled few-shot data [3, 4, 27] (Figure 1(a)), significantly reducing training time and computational costs. Despite its efficiency, the randomly sampled data may lack sufficient representativeness to enable LLMs to effectively comprehend new items and user behaviors. To combat this issue, we introduce the task of data pruning for efficient LLM-based recommendation, which aims to identify representative samples tailored for LLMs' few-shot fine-tuning.

A closely related literature to this data pruning task is coreset selection [13]. It tries to select a small but representative subset from the full data, aiming to achieve comparable performance. Existing coreset selection methods generally fall into two categories²: 1) Heuristic methods select hard or diverse samples based on pre-defined metrics [30, 34, 49]. Such heuristic methods do not estimate the impact of selected samples on empirical risk, possibly leading to suboptimal coreset selection. 2) Optimization-based methods mainly optimize the selection of subsets to minimize the empirical risk [5, 50]. However, these methods are inapplicable to large-scale recommendation datasets due to the complex and costly bi-level or discrete optimization problem [17]. Worse still, both heuristic and optimization-based methods rely on the model well-trained by the full data to select the coreset, e.g., calculating pre-defined scores or optimizing the data subset based on the well-trained model (cf. Section 2). As such, it is infeasible to directly apply these methods to LLM-based recommendation because of the high training costs of LLMs on the large-scale full recommendation data.

To overcome the above issues, we summarize two principal objectives for data pruning in the context of LLM-based recommendation: 1) high accuracy, which focuses on selecting the samples that can lead to low empirical risk; and 2) high efficiency, which emphasizes the low costs of the data pruning process, i.e., eliminating the dependency of well-trained LLMs on the full data. Nevertheless, pursuing the two objectives faces two challenges:

• To achieve high accuracy, it is essential to measure the influence of removing each training sample on the empirical risk. However, assessing the influence of all samples is costly, as it requires leaving-one-out retraining for each sample [43].
• To achieve high efficiency, one possible solution is to train a surrogate model for sample selection, e.g., a small-sized traditional recommender model, which can drastically reduce the GPU memory usage and the training time compared to LLMs (see Figure 1(b)). However, there exists a gap between LLMs and surrogate models, attributable to their divergent capabilities in learning user behaviors (refer to Figure 3). As such, influential samples selected by surrogate models might deviate from the ones on LLMs, potentially hurting the adaptation of LLMs.

To address the challenges, we propose a novel Data pruning method to Efficiently identify the influentiAl samples for LLM-based Recommender fine-tuning (shorted as DEALRec). DEALRec leverages two scores, namely influence score and effort score, to identify the influential samples. The influence score is formulated to estimate the influence of removing each training sample on the empirical risk. It is calculated by extending the influence function [15] via chain rules and second-order optimization techniques [24]. To efficiently calculate the influence score for all samples, DEALRec employs a simple yet effective symmetric property to accelerate the calculation, requiring only the estimation once for all samples (cf. Section 3.1). Thereafter, DEALRec uses a traditional recommender model as a surrogate model to obtain the influence score and introduces the effort score to mitigate the gap between the surrogate model and LLMs. The effort score is obtained by calculating the gradient norm of a sample loss w.r.t. the parameters of LLMs, intuitively measuring the effort of LLMs to fit a specific sample. By regularizing the influence score with the effort score, DEALRec identifies the influential samples that encompass both the representativeness of the full data and the significance to LLMs. We instantiate DEALRec on two LLM-based recommender models and conduct extensive experiments on three real-world datasets, validating the superiority of DEALRec in terms of both efficiency and accuracy. The code and datasets are available at https://github.com/Linxyhaha/DEALRec.

In summary, this work offers three major contributions:
• We introduce a data pruning task to identify the influential samples tailored for efficient LLM-based recommender fine-tuning, unlocking the remarkable potential of applying LLM-based recommender models to real-world platforms.
• We propose a novel data pruning method to discover the influential samples for LLM-based recommendation, which effectively and efficiently assesses the influence of removing a sample on empirical risk.
• We conduct extensive experiments on three real-world datasets, demonstrating the effectiveness of DEALRec in achieving both high efficiency and accuracy.

²More detailed related work is discussed and compared in Sections 4 and 5.

2 TASK FORMULATION
In this section, we first introduce LLM-based recommender models and uncover the challenge of real-world applicability. Thereafter, we formulate the task of data pruning for LLM-based recommendation and compare the related work on coreset selection.
• LLM-based recommender models. To leverage the competent capabilities of LLMs, LLM-based recommendation typically utilizes powerful LLMs directly as the recommender models. Since LLMs are not particularly trained on the recommendation data, fine-tuning is the necessary and key step for LLMs to learn the item knowledge and understand user behavior. Let U and I denote the sets of users and items, respectively. We present each training sample, i.e., user sequence, as s = (x, y), where x = [i_1, i_2, ..., i_{|x|}] is the user's historical interactions in chronological order, and y is the next interacted item of the user³, where {i_1, ..., i_{|x|}, y} ⊂ I. Formally, given the user sequences of the training set D = {s_u | u ∈ U}, the target is to fine-tune an LLM for recommendation tasks. The learnable parameters (φ ∈ Φ) of an LLM are optimized by minimizing the negative log-likelihood of the next interacted item y conditioned on input x:

$$\min_{\phi \in \Phi} \Big\{ \mathcal{L}^{LLM}_{\phi} = -\sum_{t=1}^{|y|} \log P_{\phi}(y_t \mid y_{<t}, x) \Big\}, \qquad (1)$$

where y_t denotes the t-th token of y, and y_{<t} represents the token sequence preceding y_t.

³Our main focus lies in sequential recommendation, which holds notable practical significance by intricately considering the temporal aspect in real-world scenarios.
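To make Eq. (1) concrete, the sketch below computes the next-item negative log-likelihood for a single user sequence with a causal language model. It is a minimal illustration under stated assumptions, not the paper's released implementation: the prompt template, the small `gpt2` backbone, and the helper name `next_item_nll` are placeholders (BIGRec and TIGER use their own templates and larger backbones).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder backbone for illustration; the paper fine-tunes larger LLMs.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def next_item_nll(history_titles, next_title):
    """Eq. (1): -sum_t log P_phi(y_t | y_<t, x) for one sample s = (x, y)."""
    prompt = "The user has played: " + ", ".join(history_titles) + ". Next: "
    x_ids = tokenizer(prompt, return_tensors="pt").input_ids
    y_ids = tokenizer(next_title, return_tensors="pt").input_ids
    input_ids = torch.cat([x_ids, y_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : x_ids.size(1)] = -100  # mask the prompt x; loss covers only y
    out = model(input_ids=input_ids, labels=labels)
    return out.loss * y_ids.size(1)  # rescale the mean NLL to the sum in Eq. (1)

loss = next_item_nll(["Stardew Valley", "Terraria"], "Hollow Knight")
loss.backward()  # gradients w.r.t. the learnable parameters phi
```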
While fine-tuning LLMs has demonstrated effectiveness in recommendation tasks [29], its practical application is hindered by the high resource costs required by LLMs and the continuous influx of new recommendation data [38]. Hence, it is essential to enhance the efficiency of LLM-based recommender fine-tuning.
• Data pruning for efficient LLM-based recommendation. To achieve efficient LLM-based recommendation, a promising approach is to reduce the costs by few-shot fine-tuning with randomly selected samples [4]. Nevertheless, the random samples might lose some crucial information for LLMs to acquire the latest information on user behavior or items, e.g., trending items. In this light, we introduce the task of data pruning for efficient LLM-based recommendation, which aims to identify a set of representative samples particularly for LLMs' few-shot fine-tuning. Formally, given all training samples D = {s_u | u ∈ U}, the target of data pruning is to select a subset S ⊂ D, such that the LLMs trained on the subset S can yield good performance on the testing set. The size of S is controlled by the given selection ratio r, i.e., |S| = r|D|.
• Retrospect of coreset selection. As the closely related work to this data pruning task, coreset selection methods generally fall into two groups:
1) Heuristic methods [7, 10, 44] typically design some heuristic strategies to select samples based on an empirical minimizer:

$$\mathcal{S} = H(\hat{\theta}, \mathcal{D}), \quad \text{s.t.} \quad \hat{\theta} = \arg\min_{\theta \in \Theta} \mathcal{L}(\theta, \mathcal{D}), \qquad (2)$$

where L(·) is the loss function of the task, e.g., image classification [16] or CTR prediction [14], and H(·) denotes the heuristic strategy such as selecting samples with larger prediction entropy [7], or clustering the samples based on the sample representations [6]. However, this group of methods designs the strategy H(·) intuitively and fails to explicitly consider the influence of a sample on the empirical risk. This might lead to suboptimal selection, thereby declining the performance of the model trained by the selected subset. A minimal example of such a strategy is sketched after this retrospect.
2) Optimization-based methods [5, 22, 23, 48] mainly utilize bi-level optimization techniques to learn the best subset chosen for training:

$$\mathcal{S}^{*} = \arg\min_{\mathcal{S} \subset \mathcal{D}} \mathcal{L}(\hat{\theta}, \mathcal{D}), \quad \text{s.t.} \quad \hat{\theta} = \arg\min_{\theta \in \Theta} \mathcal{L}(\theta, \mathcal{S}). \qquad (3)$$

Besides, there is also some work that employs discrete optimization problems based on the empirical minimizer θ̂ in Eq. (2). Nevertheless, they struggle to be applied to large-scale datasets, e.g., recommendation data, due to the complex solving of the optimization problem [17].
Furthermore, as shown in Eq. (2-3), previous coreset selection methods usually require the model to be trained over the original training samples D, which however is infeasible for LLM-based recommender models due to the continuous influx of data and the high resource costs of LLMs (cf. Section 1).
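For concreteness, the sketch below instantiates Eq. (2) with one common heuristic strategy H(·): score every sample by the prediction entropy of the empirical minimizer and keep the top r|D| [7]. The classifier interface and the entropy criterion are illustrative assumptions; this is not the implementation of any specific baseline in this paper.

```python
import torch
import torch.nn.functional as F

def entropy_coreset(model, dataset, ratio):
    """Eq. (2): S = H(theta_hat, D), with H(.) = keep the r|D| samples whose
    predictions under the well-trained model have the largest entropy."""
    model.eval()
    scores = []
    with torch.no_grad():
        for x, _ in dataset:  # dataset yields (input, label) pairs
            probs = F.softmax(model(x.unsqueeze(0)), dim=-1).squeeze(0)
            scores.append(-(probs * probs.clamp_min(1e-12).log()).sum().item())
    k = int(ratio * len(dataset))  # |S| = r|D|
    return sorted(range(len(dataset)), key=scores.__getitem__, reverse=True)[:k]
```

Note that the selection depends on a model already well-trained on the full data D, which is precisely the dependency that makes such methods impractical for LLM-based recommendation.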
• Drawing upon the above insights, we consider two objectives for data pruning: 1) high accuracy emphasizes the low empirical risk of the model trained on the selected samples, and 2) high efficiency focuses on the low costs of the data pruning process, breaking free from the heavy fine-tuning of LLMs for data pruning.

Figure 2: Overview of DEALRec. DEALRec first trains a surrogate model on the full training samples. Subsequently, it calculates the influence score, which is then regularized by the effort score, to identify influential samples.

3 DEALREC
To pursue efficient LLM-based recommendation, we propose a novel data pruning method DEALRec, which involves two key components, i.e., the influence score to estimate the influence on empirical risk, and the effort score as a regularization to mitigate the gap between the surrogate model and LLMs. The overview of our method is presented in Figure 2.

3.1 Influence Score
To achieve good overall performance with the model trained on the pruned dataset S, the key lies in the ability to assess the influence on the empirical risk, i.e., overall performance, caused by removing a sample in training. However, simply assessing the influence by removing each sample is impractical, because it requires brute-force leaving-one-out retraining for n = |D| times. To overcome this challenge, we propose an efficient approximation of the influence for all samples by extending influence on parameter change (i.e., a classic result from influence functions [24]) via the chain rule and second-order optimization techniques. We further utilize the symmetric property to speed up the calculation of the influence score.
• Influence on parameter change. To estimate the influence on empirical risk for each sample, we first start with the classic result [28] from research on influence functions [8], which gives us the estimation of the parameter change caused by upweighting a sample s for training. Considering a training sample s is upweighted by a small ε, the empirical minimizer can be rewritten as:

$$\hat{\theta}_{\epsilon,s} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{s_i \in \mathcal{D}} \mathcal{L}(s_i, \theta) + \epsilon\, \mathcal{L}(s, \theta). \qquad (4)$$

According to [28], the influence of upweighting a sample s on the parameter change is then given as:

$$\mathcal{I}_{\mathrm{param}}(s) = \frac{\mathrm{d}\hat{\theta}_{\epsilon,s}}{\mathrm{d}\epsilon}\bigg|_{\epsilon=0} = -H^{-1}_{\hat{\theta}}\, \nabla_{\theta} \mathcal{L}(s, \hat{\theta}), \qquad (5)$$

where $H_{\hat{\theta}} = \frac{1}{n}\sum_{s_i \in \mathcal{D}} \nabla^{2}_{\theta} \mathcal{L}(s_i, \hat{\theta})$ is the Hessian of the empirical loss at $\hat{\theta}$.

Algorithm 1 Procedure of HVP Estimation
Input: Original training dataset D, parameters of a well-trained model θ̂, iteration number T.
1: Compute v = (1/n) Σ_i ∇_θ L(s_i, θ̂) for all i ∈ {1, ..., n}.
2: Initialize H̃₀⁻¹ v = v.
3: for all t ∈ {1, ..., T} do
4:     Randomly sample a training sample s_t ∈ D;
5:     Calculate ∇²_θ L(s_t) as the unbiased estimator of H;
6:     H̃_t⁻¹ v ← v + [I − ∇²_θ L(s_t)] H̃_{t−1}⁻¹ v;   ⊲ Eq. (10)
7: H⁻¹ v ← H̃_T⁻¹ v.

A consolidated sketch of this estimation, together with the score computation and sample selection, is given below.
… the data coverage will be improved to ensure a high-probability bound for the empirical risk (refer to [57] for detailed proof). In detail, we first divide the samples into K groups according to their overall scores. We then iteratively sample n_s user sequences from the group with the fewest samples and discard that group after sampling, where n_s is the average sampling budget for all groups …

⁴We obtain the effort scores for the surrogate model by calculating the gradient norm of the parameters of the surrogate model (Eq. (12)).
⁵The learnable parameters can be either the whole parameters of LLMs or the learnable parameters from parameter-efficient training, e.g., LoRA [19].
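To tie the pieces together, here is a consolidated Python sketch of Algorithm 1, the influence estimate of Eq. (5), the effort score of footnote 4, and the coverage-enhanced selection described in the fragment above. Two loud assumptions, since the intervening derivation is omitted here: `loss_fn(model, sample)` returning a scalar loss is a hypothetical interface, and the overall score `influence + lam * effort` is inferred from the paper's wording about regularizing the influence score with the effort score (and the λ studied in Figure 7), not the exact published formula. The efficiency trick is visible in `influence_scores`: one inverse-HVP, estimated once, serves every sample via a dot product.

```python
import numpy as np
import torch

def flat_grad(loss, params, create_graph=False):
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def hvp(loss, params, vec):
    # Hessian-vector product via double backprop: H @ vec without forming H.
    grad = flat_grad(loss, params, create_graph=True)
    return flat_grad(torch.dot(grad, vec), params).detach()

def inverse_hvp(model, loss_fn, dataset, v, T=1000):
    # Algorithm 1: H_t^{-1} v = v + (I - H_{s_t}) H_{t-1}^{-1} v, using the Hessian
    # of one random sample as the unbiased estimator of H (lines 4-6). Practical
    # implementations add damping/scaling for convergence (cf. [1]).
    params = [p for p in model.parameters() if p.requires_grad]
    ihvp = v.clone()                                              # line 2
    for _ in range(T):                                            # line 3
        s_t = dataset[torch.randint(len(dataset), (1,)).item()]   # line 4
        ihvp = v + ihvp - hvp(loss_fn(model, s_t), params, ihvp)  # line 6
    return ihvp                                                   # line 7

def influence_scores(model, loss_fn, dataset):
    # Eq. (5) chained to the empirical risk; sign/scale follow the convention of
    # the omitted derivation, so treat these values as a ranking score.
    params = [p for p in model.parameters() if p.requires_grad]
    n = len(dataset)
    v = sum(flat_grad(loss_fn(model, s), params) for s in dataset) / n
    ihvp = inverse_hvp(model, loss_fn, dataset, v)
    return [-torch.dot(ihvp, flat_grad(loss_fn(model, s), params)).item() / n
            for s in dataset]

def effort_score(model, loss_fn, sample):
    # Footnote 4: gradient norm of the sample loss w.r.t. the model parameters.
    params = [p for p in model.parameters() if p.requires_grad]
    return flat_grad(loss_fn(model, sample), params).norm().item()

def coverage_enhanced_selection(influence, effort, n_select, lam=0.5, K=10):
    # Divide samples into K groups by overall score, then iteratively draw n_s
    # sequences from the group with the fewest samples and discard that group.
    overall = np.asarray(influence) + lam * np.asarray(effort)  # assumed combination
    groups = [list(g) for g in np.array_split(np.argsort(overall), K)]
    selected, budget = [], n_select
    while groups and budget > 0:
        n_s = budget // len(groups)  # average budget over the remaining groups
        smallest = min(groups, key=len)
        groups.remove(smallest)
        picked = np.random.choice(smallest, size=min(n_s, len(smallest)),
                                  replace=False)
        selected.extend(int(i) for i in picked)
        budget -= len(picked)
    return selected
```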
Table 1: Statistics of the three datasets.

Datasets        # Users   # Items   # Interactions   Density
Games           49,156    17,332    342,329          0.04%
MicroLens-50K   49,887    19,217    359,048          0.04%
Book            88,263    86,272    5,303,707        0.07%

1) Few-shot fine-tuning fine-tunes LLM-based recommender models with limited samples at a fixed size, e.g., 1024-shot, obtained via different data pruning methods. 2) Full fine-tuning utilizes all samples to fine-tune LLM-based recommender models without data pruning.
Table 2: Overall performance comparison between the baselines and DEALRec instantiated on two competitive LLM-based
recommender models on three datasets. For each backend model, the bold results highlight the best results while the second-best
ones are underlined. ∗ implies the improvements over the second-best results are statistically significant (𝑝-value < 0.01) under
one-sample t-tests. We run all experiments three times with different random seeds and report the averaged results.
Games MicroLens-50K Book
1024-shot (𝒓=2%) 1024-shot (𝒓=2%) 1024-shot (𝒓=1%)
Methods R@10 R@20 N@10 N@20 R@20 R@50 N@20 N@50 R@20 R@50 N@20 N@50
BIGRec
TF-DCon   0.0102 0.0157 0.0062 0.0078   0.0066 0.0099 0.0027 0.0034   0.0104 0.0144 0.0083 0.0092
RecRanker 0.0112 0.0166 0.0074 0.0090   0.0024 0.0042 0.0011 0.0014   0.0108 0.0145 0.0090 0.0097
CCS       0.0164 0.0246 0.0097 0.0122   0.0096 0.0131 0.0041 0.0049   0.0110 0.0145 0.0088 0.0096
GraNd     0.0158 0.0250 0.0098 0.0125   0.0014 0.0032 0.0006 0.0010   0.0102 0.0136 0.0080 0.0087
EL2N      0.0154 0.0256 0.0098 0.0128   0.0096 0.0045 0.0041 0.0016   0.0107 0.0149 0.0085 0.0094
Random    0.0163 0.0241 0.0100 0.0122   0.0108 0.0151 0.0044 0.0054   0.0099 0.0134 0.0083 0.0090
DEALRec   0.0181* 0.0276* 0.0115* 0.0142*   0.0124* 0.0160* 0.0055* 0.0064*   0.0117* 0.0155* 0.0096* 0.0104*
TIGER
TF-DCon   0.0051 0.0074 0.0033 0.0040   0.0006 0.0057 0.0002 0.0013   0.0028 0.0051 0.0020 0.0027
RecRanker 0.0028 0.0045 0.0019 0.0024   0.0043 0.0064 0.0011 0.0014   0.0027 0.0052 0.0018 0.0025
CCS       0.0050 0.0084 0.0031 0.0041   0.0026 0.0061 0.0010 0.0013   0.0026 0.0048 0.0018 0.0024
GraNd     0.0042 0.0053 0.0027 0.0030   0.0006 0.0014 0.0003 0.0005   0.0008 0.0020 0.0006 0.0010
EL2N      0.0034 0.0048 0.0024 0.0029   0.0011 0.0016 0.0004 0.0004   0.0005 0.0015 0.0004 0.0007
Random    0.0062 0.0102 0.0039 0.0051   0.0037 0.0059 0.0011 0.0014   0.0033 0.0066 0.0022 0.0031
DEALRec   0.0074* 0.0114* 0.0062* 0.0074*   0.0058* 0.0076* 0.0020* 0.0020*   0.0039* 0.0076* 0.0026* 0.0037*
Table 3: Performance comparison between DEALRec under 1024-shot fine-tuning and the full fine-tuning of BIGRec in
terms of both accuracy and time costs. “%Improve.” denotes the relative improvement achieved by DEALRec compared to the
full fine-tuning. Models are trained for 50 epochs with the early stopping strategy.
Games MicroLens-50K Book
R@10↑ R@20↑ N@10↑ N@20↑ Time↓ R@20↑ R@50↑ N@20↑ N@50↑ Time↓ R@20↑ R@50↑ N@20↑ N@50↑ Time↓
Full 0.0169 0.0233 0.0102 0.0120 36.87h 0.0081 0.0136 0.0038 0.0053 66.64h 0.0076 0.0108 0.0060 0.0068 84.77h
DEALRec 0.0181 0.0276 0.0115 0.0142 1.67h 0.0124 0.0160 0.0055 0.0064 1.23h 0.0117 0.0155 0.0096 0.0104 1.93h
% Improve. 7.10% 18.45% 12.75% 18.33% -95.47% 53.09% 17.65% 44.74% 20.75% -98.15% 53.95% 43.52% 60.00% 52.94% -97.72%
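To read Table 3: each "%Improve." entry is the relative change of DEALRec over full fine-tuning. For instance, on Games, R@10 improves by (0.0181 − 0.0169)/0.0169 ≈ 7.10%, while training time drops from 36.87h to 1.67h, i.e., (1.67 − 36.87)/36.87 ≈ −95.47%.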
4.2 Overall Performance (RQ1)
The results of the baselines and DEALRec with two competitive backend LLM-based recommender models on three datasets under few-shot fine-tuning (1024 samples) are presented in Table 2, from which we have the following observations:

• All methods with BIGRec typically yield better performance than those with TIGER, which is attributed to two reasons: 1) BIGRec employs a larger LLM (i.e., LLaMA-7B) compared to TIGER, thereby benefiting from the stronger generalization ability of large-sized LLMs [27]; and 2) BIGRec leverages item titles to present the user sequence, leading to better utilization of world knowledge in LLMs. In contrast, TIGER learns extra item tokens for LLMs. This might result in cold-start item issues, since only limited item tokens are learned while others remain randomly initialized under the few-shot fine-tuning setting.
• Among all coreset selection baselines, difficulty-based methods (GraNd, EL2N) generally perform better than diversity-based methods (TF-DCon, RecRanker). This is reasonable since diversity-based methods merely heuristically encourage selecting users with divergent preferences, which lacks an assessment of their contributions to the model training. In contrast, GraNd and EL2N use pre-defined metrics to measure the sample difficulty and select the samples with larger scores, which encourages selecting the samples that are more informative for models' optimization. Besides, CCS improves EL2N in most cases, as it maintains easy samples for selection, thus compensating for the knowledge of recommendation data from high-density areas.
• Another interesting observation is that random sampling yields competitive performance or even outperforms other coreset selection methods in some cases, which might be attributed to two possible reasons: 1) Uniformly selected user sequences preserve high coverage of the original training distribution compared to other baselines, which ensures a high probability of a guaranteed bound for low empirical risk [57]. This observation is also consistent with the findings in [13]. 2) The inferior performance of some coreset selection methods might also be caused by the implementation settings (Section 4.1.3), where they may suffer from the learning ability gap between the surrogate model and LLMs (cf. Section 3.2).
• DEALRec significantly outperforms all coreset selection methods across the three datasets. The consistent performance improvements on both backend models validate the superiority of DEALRec in identifying influential samples for LLMs' adaptation to the recommendation data. The superior performance is attributed to: 1) the accurate and efficient estimation of the influence on empirical risk, i.e., overall performance, caused by removing a sample in training; and 2) the gap regularization based on the effort score to penalize the easy samples for LLMs. By emphasizing the non-trivial samples specifically for LLMs, gap regularization alleviates the learning ability gap between the surrogate model and the LLMs.
Figure 4: Ablation study of the influence score, effort score, and coverage-enhanced sample selection strategy (variants: Greedy, w/o IS, w/o δs, DEALRec).
Figure 5: Performance of DEALRec with different selection ratio r w.r.t. accuracy and efficiency on Games: (a) effect of r w.r.t. Recall; (b) effect of r w.r.t. time costs.
Figure 6: Performance of DEALRec over easy to difficult samples (Group 1 to Group 3): (a) performance w.r.t. Recall@20; (b) performance w.r.t. NDCG@20.
Figure 7: Performance of DEALRec with different λ: (a) effect of λ w.r.t. Recall; (b) effect of λ w.r.t. NDCG.

… each group, which validates the effectiveness of DEALRec in considering the influence on overall performance.

5.2 Coreset Selection
Coreset selection has been widely studied in both traditional machine learning and deep learning [47, 50], benefiting many downstream tasks such as data-efficient learning [44], neural architecture search [40], and active learning [39]. It aims to select a small but representative subset from the full data that can lead to comparable model performance. Previous work mainly falls into two groups: 1) Heuristic methods [7, 10, 44] typically assume difficult or diverse samples are informative for model training. 2) Optimization-based methods [21, 25, 50] leverage bi-level or discrete optimization techniques to optimize the data subset that can minimize the empirical risk. However, heuristic methods might be suboptimal since they overlook the impact of selected samples on empirical risk, while optimization-based methods fail to be applied to LLM-based recommendation due to the cumbersome calculation for complex optimization. Furthermore, previous methods usually rely on the training of the model on full data for selection, which is infeasible for LLM-based recommendation (cf. Section 2).
• Data Condensation [56] is another potential solution to achieve data-efficient training. However, it is intrinsically different from our proposed task of data pruning. While it aims to synthesize a small but informative dataset [55], our task targets identifying existing samples that are representative. Besides, previous work mainly works for continuous data, which is inapplicable to LLM-based recommendation [48]. TF-DCon [49] is recently proposed for content-based recommendation, and we compare it in Section 4.2.
REFERENCES
[1] Naman Agarwal, Brian Bullins, and Elad Hazan. 2016. Second-order stochastic optimization in linear time. stat 1050 (2016), 15.
[2] Sharat Agarwal, Himanshu Arora, Saket Anand, and Chetan Arora. 2020. Contextual diversity for active learning. In ECCV. Springer, 137–153.
[3] Keqin Bao, Jizhi Zhang, Wenjie Wang, Yang Zhang, Zhengyi Yang, Yancheng Luo, Fuli Feng, Xiangnan He, and Qi Tian. 2023. A bi-step grounding paradigm for large language models in recommendation systems. arXiv:2308.08434.
[4] Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. TALLRec: An effective and efficient tuning framework to align large language model with recommendation. In RecSys. ACM.
[5] Zalán Borsos, Mojmir Mutny, and Andreas Krause. 2020. Coresets via bilevel optimization for continual learning and streaming. NeurIPS 33 (2020), 14879–14890.
[6] Chengliang Chai, Jiayi Wang, Nan Tang, Ye Yuan, Jiabin Liu, Yuhao Deng, and Guoren Wang. 2023. Efficient coreset selection with cluster-based methods. In KDD. ACM, 167–178.
[7] C Coleman, C Yeh, S Mussmann, B Mirzasoleiman, P Bailis, P Liang, J Leskovec, and M Zaharia. 2020. Selection via Proxy: Efficient Data Selection for Deep Learning. In ICLR.
[8] R Dennis Cook. 1977. Detection of influential observation in linear regression. Technometrics 19, 1 (1977), 15–18.
[9] Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun, Xiao Zhang, and Jun Xu. 2023. Uncovering ChatGPT's capabilities in recommender systems. In RecSys. ACM, 1126–1132.
[10] Vitaly Feldman and Chiyuan Zhang. 2020. What neural networks memorize and why: Discovering the long tail via influence estimation. NeurIPS 33 (2020), 2881–2891.
[11] Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. 2023. Chat-REC: Towards interactive and explainable LLMs-augmented recommender system. arXiv:2303.14524.
[12] Yuqi Gong, Xichen Ding, Yehui Su, Kaiming Shen, Zhongyi Liu, and Guannan Zhang. 2023. An Unified Search and Recommendation Foundation Model for Cold-Start Scenario. In CIKM. 4595–4601.
[13] Chengcheng Guo, Bo Zhao, and Yanbing Bai. 2022. DeepCore: A comprehensive library for coreset selection in deep learning. In DEXA. Springer, 181–195.
[14] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A factorization-machine based neural network for CTR prediction. In IJCAI. 1725–1731.
[15] Frank R Hampel. 1974. The influence curve and its role in robust estimation. Journal of the American Statistical Association 69, 346 (1974), 383–393.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. IEEE, 770–778.
[17] Muyang He, Shuo Yang, Tiejun Huang, and Bo Zhao. 2023. Large-scale Dataset Pruning with Dynamic Uncertainty. arXiv:2306.05175.
[18] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In SIGIR. 639–648.
[19] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv:2106.09685.
[20] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In ICDM. IEEE, 197–206.
[21] Krishnateja Killamsetty, Sivasubramanian Durga, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. 2021. Grad-Match: Gradient matching based data subset selection for efficient deep model training. In ICML. PMLR, 5464–5474.
[22] Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. 2021. GLISTER: Generalization based data subset selection for efficient and robust learning. In AAAI, Vol. 35. 8110–8118.
[23] Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, and Rishabh Iyer. 2021. RETRIEVE: Coreset selection for efficient and robust semi-supervised learning. NeurIPS 34 (2021), 14488–14501.
[24] Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In ICML. PMLR, 1885–1894.
[25] Suraj Kothawade, Vishal Kaushal, Ganesh Ramakrishnan, Jeff Bilmes, and Rishabh Iyer. 2022. PRISM: A Unified Framework of Parameterized Submodular Information Measures for Targeted Data Subset Selection and Summarization. In AAAI.
[26] Lei Li, Yongfeng Zhang, and Li Chen. 2023. Prompt distillation for efficient LLM-based recommendation. In CIKM. 1348–1357.
[27] Xinyu Lin, Wenjie Wang, Yongqi Li, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. 2023. A multi-facet paradigm to bridge large language model and recommendation. arXiv:2310.06491.
[28] Robert F Ling. 1984. Residuals and influence in regression.
[29] Qijiong Liu, Nuo Chen, Tetsuya Sakai, and Xiao-Ming Wu. 2024. ONCE: Boosting Content-based Recommendation with Both Open- and Closed-source Large Language Models. In WSDM. ACM.
[30] Sichun Luo, Bowei He, Haohan Zhao, Yinya Huang, Aojun Zhou, Zongpeng Li, Yuanzhang Xiao, Mingjie Zhan, and Linqi Song. 2023. RecRanker: Instruction Tuning Large Language Model as Ranker for Top-k Recommendation. arXiv:2312.16018.
[31] Zheqi Lv, Wenqiao Zhang, Zhengyu Chen, Shengyu Zhang, and Kun Kuang. 2024. Intelligent Model Update Strategy for Sequential Recommendation. In WWW. ACM.
[32] Zheqi Lv, Wenqiao Zhang, Shengyu Zhang, Kun Kuang, Feng Wang, Yongwei Wang, Zhengyu Chen, Tao Shen, Hongxia Yang, Beng Chin Ooi, et al. 2023. DUET: A Tuning-Free Device-Cloud Collaborative Parameters Generation Framework for Efficient Device Model Generalization. In WWW. ACM, 3077–3085.
[33] Yongxin Ni, Yu Cheng, Xiangyan Liu, Junchen Fu, Youhua Li, Xiangnan He, Yongfeng Zhang, and Fajie Yuan. 2023. A Content-Driven Micro-Video Recommendation Dataset at Scale. arXiv:2309.15379.
[34] Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. 2021. Deep learning on a data diet: Finding important examples early in training. NeurIPS 34 (2021), 20596–20607.
[35] Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q Tran, Jonah Samost, et al. 2023. Recommender Systems with Generative Retrieval. In NeurIPS. Curran Associates, Inc.
[36] Xubin Ren, Wei Wei, Lianghao Xia, Lixin Su, Suqi Cheng, Junfeng Wang, Dawei Yin, and Chao Huang. 2024. Representation learning with large language models for recommendation. In WWW. ACM.
[37] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI. AUAI Press, 452–461.
[38] Noveen Sachdeva, Mehak Dhaliwal, Carole-Jean Wu, and Julian McAuley. 2022. Infinite recommendation networks: A data-centric approach. NeurIPS 35 (2022), 31292–31305.
[39] Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In ICLR.
[40] Jae-hun Shim, Kyeongbo Kong, and Suk-Ju Kang. 2021. Core-set sampling for efficient neural architecture search. arXiv:2107.06869.
[41] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In CIKM. 1441–1450.
[42] Weiwei Sun, Lingyong Yan, Xinyu Ma, Pengjie Ren, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent. In EMNLP. ACL, 14918–14937.
[43] Haoru Tan, Sitong Wu, Fei Du, Yukang Chen, Zhibin Wang, Fan Wang, and Xiaojuan Qi. 2023. Data Pruning via Moving-one-Sample-out. arXiv:2310.14664.
[44] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. 2018. An empirical study of example forgetting during deep neural network learning. arXiv:1812.05159.
[45] Wenjie Wang, Xinyu Lin, Liuhui Wang, Fuli Feng, Yunshan Ma, and Tat-Seng Chua. 2023. Causal Disentangled Recommendation Against User Preference Shifts. TOIS (2023).
[46] Wenjie Wang, Xinyu Lin, Liuhui Wang, Fuli Feng, Yinwei Wei, and Tat-Seng Chua. 2023. Equivariant Learning for Out-of-Distribution Cold-start Recommendation. In MM. 903–914.
[47] Kai Wei, Rishabh Iyer, and Jeff Bilmes. 2015. Submodularity in data subset selection and active learning. In ICML. PMLR, 1954–1963.
[48] Jiahao Wu, Wenqi Fan, Shengcai Liu, Qijiong Liu, Rui He, Qing Li, and Ke Tang. 2023. Dataset condensation for recommendation. arXiv:2310.01038.
[49] Jiahao Wu, Qijiong Liu, Hengchang Hu, Wenqi Fan, Shengcai Liu, Qing Li, Xiao-Ming Wu, and Ke Tang. 2023. Leveraging Large Language Models (LLMs) to Empower Training-Free Dataset Condensation for Content-Based Recommendation. arXiv:2310.09874.
[50] Shuo Yang, Zeke Xie, Hanyu Peng, Min Xu, Mingming Sun, and Ping Li. 2023. Dataset pruning: Reducing training data by examining generalization influence. In ICLR.
[51] Yuhao Yang, Chao Huang, Lianghao Xia, Chunzhen Huang, Da Luo, and Kangyi Lin. 2023. Debiased Contrastive Learning for Sequential Recommendation. In WWW. 1063–1073.
[52] Honglei Zhang, He Liu, Haoxuan Li, and Yidong Li. 2024. TransFR: Transferable Federated Recommendation with Pre-trained Language Models. arXiv:2402.01124.
[53] Honglei Zhang, Fangyuan Luo, Jun Wu, Xiangnan He, and Yidong Li. 2023. LightFR: Lightweight federated recommendation with privacy-preserving matrix factorization. TOIS 41, 4 (2023), 1–28.
[54] Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023. Recommendation as instruction following: A large language model empowered recommendation approach. arXiv:2305.07001.
[55] Bo Zhao and Hakan Bilen. 2023. Dataset condensation with distribution matching. In WACV. IEEE, 6514–6523.
[56] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. 2020. Dataset Condensation with Gradient Matching. In ICLR.
[57] Haizhong Zheng, Rui Liu, Fan Lai, and Atul Prakash. 2022. Coverage-centric Coreset Selection for High Pruning Rates. In ICLR.