
Large Language Models meet Collaborative Filtering: An Efficient All-round LLM-based Recommender System

Sein Kim∗, Hongseok Kang∗, Seungyoon Choi (Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea)
Donghyun Kim, Minchul Yang (NAVER Corporation, Seongnam, Republic of Korea)
Chanyoung Park† (Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea)

ABSTRACT
Collaborative filtering recommender systems (CF-RecSys) have shown successful results in enhancing the user experience on social media and e-commerce platforms. However, as CF-RecSys struggles under cold scenarios with sparse user-item interactions, recent strategies have focused on leveraging modality information of users/items (e.g., text or images) based on pre-trained modality encoders and Large Language Models (LLMs). Despite their effectiveness under cold scenarios, we observe that they underperform simple traditional collaborative filtering models under warm scenarios due to the lack of collaborative knowledge. In this work, we propose an efficient All-round LLM-based Recommender system, called A-LLMRec, that excels not only in the cold scenario but also in the warm scenario. Our main idea is to enable an LLM to directly leverage the collaborative knowledge contained in a pre-trained state-of-the-art CF-RecSys so that the emergent ability of the LLM as well as the high-quality user/item embeddings that are already trained by the state-of-the-art CF-RecSys can be jointly exploited. This approach yields two advantages: (1) model-agnostic, allowing for integration with various existing CF-RecSys, and (2) efficiency, eliminating the extensive fine-tuning typically required for LLM-based recommenders. Our extensive experiments on various real-world datasets demonstrate the superiority of A-LLMRec in various scenarios, including cold/warm, few-shot, cold user, and cross-domain scenarios. Beyond the recommendation task, we also show the potential of A-LLMRec in generating natural language outputs based on its understanding of the collaborative knowledge by performing a favorite genre prediction task. Our code is available at https://github.com/ghdtjr/A-LLMRec.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Recommender System, Large Language Models, Collaborative Filtering

ACM Reference Format:
Sein Kim, Hongseok Kang, Seungyoon Choi, Donghyun Kim, Minchul Yang, and Chanyoung Park. 2024. Large Language Models meet Collaborative Filtering: An Efficient All-round LLM-based Recommender System. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '24), August 25–29, 2024, Barcelona, Spain. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3637528.3671931

1 INTRODUCTION
With the recent exponential growth in the number of users and items, collaborative filtering models [14, 15, 20, 40] encounter the long-standing cold-start problem [1, 43, 52], stemming from the inherent sparsity of user-item interaction data. In other words, for users/items with few interactions, it becomes challenging to construct collaborative knowledge with other similar users/items, leading to suboptimal recommendation performance, especially in cold-start scenarios. To overcome this issue, recent studies have focused on leveraging modality information of users/items (e.g., user demographics, item titles, descriptions, or images) to enhance recommendation performance under cold-start scenarios. Specifically, MoRec [51] utilizes pre-trained modality encoders (e.g., BERT [9] or Vision-Transformer [10]) to project raw modality features of items (e.g., item texts or images), thereby replacing the item embeddings typically used in collaborative filtering recommendation models. Similarly, CTRL [25] considers tabular data and its textual representation as two different modalities and uses them to pre-train collaborative filtering recommendation models through a contrastive learning objective, which is then fine-tuned for specific recommendation tasks.

∗ Both authors contributed equally to this research.
† Corresponding author.
¹ An item is categorized as 'warm' if it falls within the top 35% of interactions, and if it falls within the bottom 35%, it is classified as a 'cold' item.
² After training each model using all the available data in the training set, we separately evaluate on cold and warm items in the test set.

This work is licensed under a Creative Commons Attribution International 4.0 License.
KDD '24, August 25–29, 2024, Barcelona, Spain
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0490-1/24/08
https://doi.org/10.1145/3637528.3671931


[Figure 1 (bar charts, Hit@1 on AMZ. Movies and AMZ. Video Games): Comparisons between the collaborative filtering model (SASRec), the modality-aware model (i.e., MoRec), and the LLM-based model (i.e., TALLRec) under the cold/warm¹ scenarios on the Amazon Movies/Video Games datasets (Hit@1)².]

Despite the effectiveness of modality-aware recommender systems in cold scenarios, the recent emergence of Large Language Models (LLMs), known for their rich pre-trained knowledge and advanced language understanding capabilities, has attracted significant interest in the recommendation domain to effectively extract and integrate modality information [37, 48]. Early studies on LLM-based recommendation [12, 16, 44] have employed OpenAI-GPT with In-context Learning [4]. This approach adapts to new tasks or information based on the context provided within the input prompt and demonstrates the potential of LLMs as a recommender system. Moreover, to bridge the gap between the training tasks of LLMs and recommendation tasks, TALLRec [2] fine-tunes LLMs with recommendation data using LoRA [18]. This approach has empirically demonstrated that, in cold scenarios and cross-domain scenarios, fine-tuned LLMs outperform traditional collaborative filtering models.

Although modality-aware and LLM-based recommender systems have proven effective in cold scenarios with limited user-item interactions, we argue that these methods suffer from the lack of collaborative knowledge due to their heavy reliance on textual information [51]. Consequently, when abundant user-item interactions are available (i.e., the warm scenario), modality-aware and LLM-based recommenders are rather inferior to simple traditional collaborative filtering models. As shown in Figure 1, while the modality-aware recommender (i.e., MoRec) and the LLM-based recommender (i.e., TALLRec) significantly outperform the traditional collaborative filtering model (i.e., SASRec [20]) in the cold scenario, they are outperformed by the traditional collaborative filtering model in the warm scenario. This is mainly because the textual information becomes less important in the warm scenario, where ID-based collaborative filtering models excel at modeling popular items [6, 51]. However, while excelling in the cold scenario is crucial, the majority of user interactions and the revenue are predominantly generated from already existing and active items (i.e., warm items) in real-world applications of recommendation systems, which contribute up to 90% of interactions in offline industrial data [8, 49]. Furthermore, as demonstrated by DCBT [49], modeling both warm and cold items is essential for improving overall user engagement, which is evidenced by A/B testing with real-world industrial data. This implies that the warm scenario should not be overlooked.

In this paper, we propose an efficient all-round LLM-based recommender system, called A-LLMRec (All-round LLM-based Recommender system), that excels not only in the cold scenario but also in the warm scenario (hence, all-round recommender system). Our main idea is to enable an LLM to directly leverage the collaborative knowledge contained in a pre-trained state-of-the-art collaborative filtering recommender system (CF-RecSys) so that the emergent ability [45] of the LLM, as well as the high-quality user/item embeddings that are already trained by the state-of-the-art CF-RecSys, can be jointly exploited. More precisely, we devise an alignment network that aligns the item embeddings of the CF-RecSys with the token space of the LLM, aiming at transferring the collaborative knowledge learned from a pre-trained CF-RecSys to the LLM, enabling it to understand and utilize the collaborative knowledge for the downstream recommendation task.

The key innovation of A-LLMRec is that it requires the fine-tuning of neither the CF-RecSys nor the LLM, and that the alignment network is the only neural network that is trained in A-LLMRec, which comes with the following two crucial advantages:
(1) (Model-agnostic) A-LLMRec allows any existing CF-RecSys to be integrated, which implies that services using their own recommender models can readily utilize the power of the LLM. Besides, any updates of the recommender models can be easily reflected by simply replacing the old models, which makes the model practical in reality.
(2) (Efficiency) A-LLMRec is efficient in that the alignment network is the only trainable neural network, while TALLRec [2] requires the fine-tuning of the LLM with LoRA [18]. As a result, A-LLMRec trains approximately 2.53 times and inferences 1.71 times faster than TALLRec, while also outperforming both TALLRec and CF-RecSys in both cold and warm scenarios.

Our extensive experiments on various real-world datasets demonstrate the superiority of A-LLMRec, revealing that aligning high-quality user/item embeddings with the token space of the LLM is the key for solving not only cold/warm scenarios but also few-shot, cold user, and cross-domain scenarios. Lastly, beyond the recommendation task, we perform a language generation task, i.e., favorite genre prediction, to demonstrate that A-LLMRec can generate natural language outputs based on the understanding of users and items through the aligned collaborative knowledge from the CF-RecSys. Our main contributions are summarized as follows:
• We present an LLM-based recommender system, called A-LLMRec, that directly leverages the collaborative knowledge contained in a pre-trained state-of-the-art recommender system.
• A-LLMRec requires the fine-tuning of neither the CF-RecSys nor the LLM, while only requiring an alignment network to be trained to bridge between them.
• Our extensive experiments demonstrate that A-LLMRec outperforms not only the conventional CF-RecSys in the warm scenario but also the LLMs in the cold scenario.

2 RELATED WORK
2.1 Collaborative Filtering
Collaborative Filtering (CF) is the cornerstone of recommendation systems, fundamentally relying on leveraging users' historical
preferences to inform future suggestions. The key idea is to rely on similar users/items for recommendations. The emergence of matrix factorization marked a significant advancement in CF, as evidenced by numerous studies [19, 22, 38], demonstrating its superiority in capturing the latent factors underlying user preferences. This evolution continued with the introduction of Probabilistic Matrix Factorization (PMF) [5, 33] and Singular Value Decomposition (SVD) [30, 54], which integrate probabilistic and decomposition techniques to further refine the predictive capabilities of CF models. AutoRec [39] and Neural Matrix Factorization (NMF) [15] utilized deep learning to enhance CF by capturing complex user-item interaction patterns. Recently, [7, 21, 34, 36] proposed modeling collaborative filtering based on sequential interaction history. Caser [41] and NextItNet [50] utilize Convolutional Neural Networks (CNNs) [23] to capture the local sequence information, treating an item sequence as images. While these methods effectively capture user preferences using interaction history, including user and item IDs, they overlook the potential of the modality information of the user/item, which could enhance model performance and offer a deeper analysis of user behaviors.

2.2 Modality-aware Recommender Systems
Modality-aware recommenders utilize modality information such as item titles, descriptions, or images to enhance the recommendation performance mainly under cold scenarios. Initially, CNNs were used to extract visual features, modeling human visual preferences based on Mahalanobis distance [31]. With advancements in pre-trained modality encoders like BERT [9, 27, 29, 47, 51] and ResNet/Vision-Transformer [10, 11], modality-aware recommender systems have accelerated research by utilizing modality knowledge on recommendation tasks. For example, NOVA [27] and DMRL [28] proposed non-invasive fusion and disentangled fusion of modality, respectively, by carefully integrating pure item embeddings and text-integrated item embeddings using the attention mechanism. MoRec [51] leverages modality encoders to project raw modality features, thereby replacing item embeddings used in collaborative filtering models. As for the pre-training based models, Liu et al. [29] constructs user-user and item-item co-interaction graphs to extract collaborative knowledge, then integrates them with user/item text information through an attention mechanism in an auto-regressive manner, and CTRL [25] pre-trains the collaborative filtering models using paired tabular data and textual data through a contrastive learning objective, subsequently fine-tuning them for recommendation tasks. Most recently, RECFORMER [24] proposed to model user preferences and item features as language representations based on the Transformer architecture by formulating the sequential recommendation task as the next item sentence prediction task, where the item key-value attributes are flattened into a sentence.

2.3 LLM-based Recommender Systems
Recently, research on LLMs has gained prominence in the field of modality-aware recommendation systems, with LLM-based recommendations emerging as a significant area of focus. The pre-trained knowledge and the reasoning power of LLMs based on the advanced comprehension of language are shown to be effective for recommendation tasks, and many approaches have been proposed leveraging the LLM as a recommender system. More precisely, [12, 16, 44] utilize LLMs with In-context Learning [4], adapting to new tasks or information based on the context provided within the input prompt. For example, Sanner et al. [37] employs In-context Learning for recommendation tasks, exploring various prompting styles such as completion, instructions, and few-shot prompts based on item texts and user descriptions. Gao et al. [12] assigns the role of a recommender expert to rank items that meet users' needs through prompting and conducts zero-shot recommendations. These studies empirically demonstrated the potential of LLMs using their rich item information and natural language understanding in the recommendation domain. However, these approaches often underperform traditional recommendation models [20, 40], due to the gap between the natural language downstream tasks used for training LLMs and the recommendation task [2]. To bridge this gap, TALLRec [2] employs the Parameter Efficient Fine-Tuning (PEFT) method, also known as LoRA [18]. This methodology enables TALLRec to demonstrate enhanced efficacy, surpassing traditional collaborative filtering recommendation models, particularly in mitigating the challenges posed by the cold start dilemma and in navigating the complexities of cross-domain recommendation scenarios. However, it is important to note that since TALLRec simply converts the conventional recommendation task into an instruction text and uses it for fine-tuning, it still fails to explicitly capture the collaborative knowledge that is crucial in warm scenarios.

3 PROBLEM FORMULATION
In this section, we introduce a formal definition of the problem including the notations and the task description.

Notations. Let $\mathcal{D}$ denote the historical user-item interaction dataset $(\mathcal{U}, \mathcal{I}, \mathcal{T}, \mathcal{S}) \in \mathcal{D}$, where $\mathcal{U}$, $\mathcal{I}$, $\mathcal{T}$, and $\mathcal{S}$ denote the set of users, items, item titles/descriptions, and item sequences, respectively. $\mathcal{S}^u = (i^u_1, i^u_2, \cdots, i^u_k, \cdots, i^u_{|\mathcal{S}^u|}) \in \mathcal{S}$ is a sequence of item interactions of a user $u \in \mathcal{U}$, where $i^u_k$ denotes the $k$-th interaction of user $u$, and this corresponds to the index of the interacted item in the item set $\mathcal{I}$. Moreover, each item $i \in \mathcal{I}$ is associated with title and description text $(t^i, d^i) \in \mathcal{T}$.

Task: Sequential Recommendation. The goal of sequential recommendation is to predict the next item to be interacted with by a user based on the user's historical interaction sequence. Given a set of user historical interaction sequences $\mathcal{S} = \{\mathcal{S}^1, \mathcal{S}^2, \cdots, \mathcal{S}^{|\mathcal{U}|}\}$, where $\mathcal{S}^u$ denotes the sequence of user $u$, the subset $\mathcal{S}^u_{1:k} \subseteq \mathcal{S}^u$ represents the sequence of user $u$ from the first to the $k$-th item, denoted as $\mathcal{S}^u_{1:k} = (i^u_1, i^u_2, \cdots, i^u_k)$. Given an item embedding matrix $\mathbf{E} \in \mathbb{R}^{|\mathcal{I}| \times d}$, the embedding matrix of items in $\mathcal{S}^u_{1:k}$ is denoted by $\mathbf{E}^u_{1:k} = (\mathbf{E}_{i^u_1}, \mathbf{E}_{i^u_2}, \ldots, \mathbf{E}_{i^u_k}) \in \mathbb{R}^{k \times d}$, where $\mathbf{E}_{i^u_j}$ denotes the $i^u_j$-th row of $\mathbf{E}$. This sequence embedding matrix is fed into a collaborative filtering recommender (e.g., SASRec [20]) to learn and predict the next item in the user behavior sequence $\mathcal{S}^u_{1:k}$ as follows:

$\max_{\Theta} \prod_{u \in \mathcal{U}} \prod_{k=1}^{|\mathcal{S}^u|-1} p\big(i^u_{k+1} \mid \mathcal{S}^u_{1:k}; \Theta\big)$   (1)

where $p(i^u_{k+1} \mid \mathcal{S}^u_{1:k}; \Theta)$ represents the probability of the $(k+1)$-th interaction of user $u$ conditioned on the user's historical interaction sequence $\mathcal{S}^u_{1:k}$, and $\Theta$ denotes the set of learnable parameters of the
collaborative filtering recommender (CF-RecSys). By optimizing $\Theta$ to maximize Equation 1, the model can obtain the probability of the next items for user $u$, over all possible items.

It is important to note that although we mainly focus on the sequential recommendation task in this work, A-LLMRec can also be readily applied to non-sequential recommendation tasks by simply replacing the backbone CF-RecSys, e.g., from SASRec [20] (sequential) to NCF [15] (non-sequential), which will be demonstrated in the experiments (Section 5.4.3).

[Figure 2: (a) Framework Overview, (b) Stage 1, (c) Stage 2. (a) is the overview of A-LLMRec. (b) and (c) are the detailed architecture of Stage 1 and Stage 2, respectively.]

4 PROPOSED METHOD: A-LLMREC
In this section, we propose A-LLMRec, a novel LLM-based recommender framework that aligns a frozen pre-trained collaborative filtering recommender (CF-RecSys) with a frozen LLM, aiming to enhance the recommendation performance not only in the cold scenario but also in the warm scenario. To bridge the modality gap, A-LLMRec aligns collaborative knowledge of the CF-RecSys with the token space of the LLM. Our approach involves two pre-training stages: (1) aligning collaborative and textual knowledge with a frozen CF-RecSys (Section 4.1), and (2) a recommendation stage with a frozen LLM (Section 4.2) in which the joint collaborative and textual knowledge is projected onto the LLM.

4.1 Alignment between Collaborative and Textual Knowledge (Stage-1)
In this section, we introduce how to align the item embeddings from a frozen CF-RecSys with their associated text information to capture both collaborative and textual knowledge. We employ a pre-trained Sentence-BERT (SBERT) [35] model, which is fine-tuned during training, to extract text embeddings from textual information associated with items³. Then, we introduce two encoders, i.e., an item encoder $f_I^{enc}$ and a text encoder $f_T^{enc}$, each containing a 1-layer Multi-Layer Perceptron (MLP), to align the item embeddings from a frozen CF-RecSys with the text embeddings from SBERT. Given an item $i$, the item encoder $f_I^{enc}: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ encodes an item embedding $\mathbf{E}_i \in \mathbb{R}^d$ into a latent item embedding $\mathbf{e}_i \in \mathbb{R}^{d'}$, i.e., $\mathbf{e}_i = f_I^{enc}(\mathbf{E}_i)$, while the text encoder $f_T^{enc}: \mathbb{R}^{768} \rightarrow \mathbb{R}^{d'}$ encodes a text embedding $\mathbf{Q}_i \in \mathbb{R}^{768}$ from SBERT, whose output dimension size is 768, into a latent text embedding $\mathbf{q}_i \in \mathbb{R}^{d'}$, i.e., $\mathbf{q}_i = f_T^{enc}(\mathbf{Q}_i)$. Then, we perform latent space matching between item embeddings and text embeddings as follows:

$\mathcal{L}_{\text{matching}} = \mathbb{E}_{\mathcal{S}^u \in \mathcal{S}}\,\mathbb{E}_{i \in \mathcal{S}^u}\big[\mathrm{MSE}(\mathbf{e}_i, \mathbf{q}_i)\big] = \mathbb{E}_{\mathcal{S}^u \in \mathcal{S}}\,\mathbb{E}_{i \in \mathcal{S}^u}\big[\mathrm{MSE}\big(f_I^{enc}(\mathbf{E}_i), f_T^{enc}(\mathbf{Q}_i)\big)\big]$   (2)

where $\mathbf{Q}_i = \mathrm{SBERT}(\text{``Title: } t^i \text{, Description: } d^i\text{''})$ denotes the encoded representation of item text (i.e., item title and description) by SBERT, and MSE is the mean squared error loss. That is, we match the item embeddings from a frozen CF-RecSys and the text embeddings from SBERT in the latent space of the encoders, so as to align the semantics of items and their associated texts for later use in the LLM.

³ Although using a larger language model, such as OPT [53] and LLaMA [42], would further enhance the quality of the text embeddings, we adopt SBERT for efficiency.

4.1.1 Avoiding Over-smoothed Representation. On the other hand, simply optimizing the latent space matching loss defined in Equation 2 would result in over-smoothed representations, i.e., the encoders would be trained to produce similar outputs (i.e., $\mathbf{e}_i \approx \mathbf{q}_i$) to minimize $\mathcal{L}_{\text{matching}}$. In an extreme case, the output of the encoders would be collapsed to a trivial representation by assigning their weights to all zeros. Hence, to prevent this issue and preserve the original information of the item and its associated text embedding, we add a decoder to each of the encoders and introduce reconstruction losses as follows:

$\mathcal{L}_{\text{item-recon}} = \mathbb{E}_{\mathcal{S}^u \in \mathcal{S}}\,\mathbb{E}_{i \in \mathcal{S}^u}\big[\mathrm{MSE}\big(\mathbf{E}_i, f_I^{dec}(f_I^{enc}(\mathbf{E}_i))\big)\big]$   (3)
$\mathcal{L}_{\text{text-recon}} = \mathbb{E}_{\mathcal{S}^u \in \mathcal{S}}\,\mathbb{E}_{i \in \mathcal{S}^u}\big[\mathrm{MSE}\big(\mathbf{Q}_i, f_T^{dec}(f_T^{enc}(\mathbf{Q}_i))\big)\big]$   (4)

where $f_I^{dec}$ and $f_T^{dec}$ are the decoders added to the encoders $f_I^{enc}$ and $f_T^{enc}$, respectively. In Section 5.3.1, we empirically demonstrate the benefit of introducing the reconstruction losses.

4.1.2 Recommendation Loss. Besides aligning the collaborative knowledge from the user-item interactions with the textual knowledge from the associated text information, we introduce a recommendation loss to explicitly incorporate the collaborative knowledge, while informing the model about the recommendation task. Specifically, the recommendation loss is defined as follows [20]:

$\mathcal{L}_{\text{rec}} = -\sum_{\mathcal{S}^u \in \mathcal{S}} \Big[ \log\big(\sigma(s(\mathbf{x}^u_{|\mathcal{S}^u|-1}, f_I^{dec}(f_I^{enc}(\mathbf{E}_{i^u_{|\mathcal{S}^u|}}))))\big) + \log\big(1 - \sigma(s(\mathbf{x}^u_{|\mathcal{S}^u|-1}, f_I^{dec}(f_I^{enc}(\mathbf{E}_{i^{u,-}_{|\mathcal{S}^u|}}))))\big) \Big]$   (5)

where $\mathbf{x}^u_{|\mathcal{S}^u|-1} = \text{CF-RecSys}(\mathcal{S}^u_{1:|\mathcal{S}^u|-1}) \in \mathbb{R}^d$ is the user representation extracted from the collaborative filtering recommender
system, i.e., CF-RecSys, obtained after the user $u$ has interacted with the last item in the sequence $\mathcal{S}^u_{1:|\mathcal{S}^u|-1}$, and $\mathbf{E}_{i^{u,-}_{|\mathcal{S}^u|}} \in \mathbb{R}^d$ is the embedding of a negative item of $i^u_{|\mathcal{S}^u|}$, i.e., $i^{u,-}_{|\mathcal{S}^u|}$, and $s(\mathbf{a}, \mathbf{b})$ is a dot product between $\mathbf{a}$ and $\mathbf{b}$.

4.1.3 Final Loss of Stage-1. Finally, the final objective of Stage-1, i.e., $\mathcal{L}_{\text{stage-1}}$, is the sum of the matching loss defined in Equation 2, the reconstruction losses defined in Equations 3 and 4, and the recommendation loss in Equation 5:

$\mathcal{L}_{\text{stage-1}} = \mathcal{L}_{\text{matching}} + \alpha \mathcal{L}_{\text{item-recon}} + \beta \mathcal{L}_{\text{text-recon}} + \mathcal{L}_{\text{rec}}$   (6)

where $\alpha$ and $\beta$ are the coefficients that control the importance of each term. Note that for efficiency in training, we only considered the last item in $\mathcal{S}^u$ for each user $u$ to minimize $\mathcal{L}_{\text{stage-1}}$. However, considering all items in the sequence further enhances the recommendation performance, which will be shown in Section 5.4.2.

4.1.4 Joint Collaborative-Text Embedding. Having trained the autoencoder based on Equation 6, we consider $\mathbf{e}_i = f_I^{enc}(\mathbf{E}_i)$ as the joint collaborative-text embedding (shortly, joint embedding) of item $i$, which will be passed to the LLM as input. The joint embedding introduces the collaborative and textual knowledge to LLMs, which will be described in Section 4.2.
It is important to note that when encountering new items that have not been seen during the training of the collaborative filtering recommender, we can instead rely on the text encoder $f_T^{enc}$ to extract the joint collaborative-text embedding, i.e., $\mathbf{q}_i = f_T^{enc}(\mathbf{Q}_i)$. Since the two encoders $f_I^{enc}$ and $f_T^{enc}$ are jointly trained to match their latent spaces, we expect the joint embedding $\mathbf{q}_i$ to not only capture the textual knowledge but also to implicitly capture the collaborative knowledge. In summary, we use $\mathbf{e}_i = f_I^{enc}(\mathbf{E}_i)$ as the joint collaborative-text embedding by default, but we use $\mathbf{q}_i = f_T^{enc}(\mathbf{Q}_i)$ when item $i$ lacks interactions, i.e., cold item, few-shot, and cross-domain scenarios, which will be demonstrated in the experiments in Section 5.2.2, Section 5.2.4, and Section 5.2.5, respectively.

4.2 Alignment between Joint Collaborative-Text Embedding and LLM (Stage-2)
Recall that in Stage-1 we obtained the joint collaborative-text embeddings by aligning the collaborative knowledge with item textual information. Our goal in Stage-2 is to align these joint embeddings with the token space of the LLM (Section 4.2.1), and to design a prompt that allows the LLM to solve the recommendation task by leveraging the learned collaborative knowledge (Section 4.2.2). Figure 2 shows the overall architecture of Stage-2. Note that the component trained in Stage-1, which is also utilized in Stage-2, i.e., $f_I^{enc}$, is frozen in Stage-2.

4.2.1 Projecting Collaborative Knowledge onto the Token Space of the LLM. We first project the user representations $\mathbf{x}_u \in \mathbb{R}^d$ and the joint collaborative-text embeddings $\mathbf{e}_i \in \mathbb{R}^{d'}$ obtained from Stage-1 onto the token space of the LLM, i.e., $\mathbb{R}^{d^{token}}$. By doing so, we allow the LLM to take them as inputs. More precisely, we introduce two 2-layer MLPs, i.e., $F_U: \mathbb{R}^d \rightarrow \mathbb{R}^{d^{token}}$ and $F_I: \mathbb{R}^{d'} \rightarrow \mathbb{R}^{d^{token}}$, to project the user representations and the joint collaborative-text embeddings to the token space of the LLM, respectively, as follows:

$\mathbf{O}_u = F_U(\mathbf{x}_u), \quad \mathbf{O}_i = F_I(\mathbf{e}_i)$   (7)

where $\mathbf{O}_u \in \mathbb{R}^{d^{token}}$ and $\mathbf{O}_i \in \mathbb{R}^{d^{token}}$ are the projected embeddings of the representation of user $u$ and the joint collaborative-text embedding of item $i$, and they can now be used as inputs to LLM prompts, which allows the LLM to perform recommendation without any fine-tuning.

[Figure 3: An example prompt of A-LLMRec designed for the Amazon Movies dataset. LLM input: "[User Representation] is a user representation. This user has watched [HISTORY (Item Titles, Item Emb)] in the past. Recommend a movie for this user to watch next from the following set of movie titles, [CANDIDATE (Item Titles, Item Emb)]. The recommendation is". LLM output: "[Next Item Title]". For other datasets, we keep the same format but adjust the verbs and nouns to fit the context (e.g., 'watched' → 'bought', 'movie' → 'item').]

4.2.2 Prompt Design for Integrating Collaborative Knowledge. Prompt engineering helps in understanding the capabilities and limitations of LLMs, enabling them to perform complex tasks such as question answering and arithmetic reasoning [4, 46]. Recent studies on LLM-based recommender systems have shown that carefully crafted prompts enhance the performance of LLMs [2, 16, 37]. However, as existing LLM-based recommender systems focus on cold scenarios with few user-item interactions, their prompts mainly consider ways to incorporate modality information (e.g., item description text), while overlooking the collaborative knowledge. To this end, we introduce a novel approach to prompt design for LLM-based recommender systems, which combines collaborative knowledge with recommendation instructions (see Figure 3). This is done by directly incorporating the user representations $\mathbf{O}_u$ and the joint collaborative-text embeddings $\mathbf{O}_i$ into the textual prompts in the token embedding space. In other words, as $\mathbf{O}_u$ and $\mathbf{O}_i$ have been projected into the LLM token space, they can be considered as ordinary tokens used by the LLM and readily incorporated within a prompt. To facilitate the understanding of the LLM regarding the given user, which is crucial for personalized recommendation, we place the projected user representation $\mathbf{O}_u$ at the beginning of the prompt to provide the LLM with the information about users, which is analogous to soft prompts [26]. Moreover, we add the projected joint embedding of an item $\mathbf{O}_i$ next to its title. This structured prompt then serves as an input to the LLM, with the expected output being recommendations tailored to the user. The learning objective of Stage-2 is given as follows:

$\max_{\theta} \sum_{\mathcal{S}^u \in \mathcal{S}} \sum_{k=1}^{|y^u|} \log\big(P_{\theta,\Theta}(y^u_k \mid p_u, y^u_{<k})\big)$   (8)

where $\theta$ denotes the learnable parameters of $F_U$ and $F_I$, $\Theta$ is the frozen parameters of the LLM, and $p_u$ and $y^u$ are the input prompt and the next item title of user $u$, respectively. $y^u_k$ is the $k$-th token of $y^u$, and $y^u_{<k}$ represents the tokens before $y^u_k$. Note that we only use the last item of each user sequence to train Equation 8 for efficiency.
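To make the Stage-1 objective concrete, the following is a minimal PyTorch-style sketch of the alignment network and Equations 2-6. The layer widths follow Table 3, but the module layout, the single-positive/single-negative batching, and all variable names are illustrative assumptions rather than the authors' exact implementation (which is available in the linked repository).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stage1Aligner(nn.Module):
    """Sketch of the Stage-1 alignment network (Eqs. 2-6).

    d: CF-RecSys embedding size, d_latent: encoder output size (d'),
    d_sbert: SBERT output size. Values here mirror Table 3 but are assumptions.
    """
    def __init__(self, d=50, d_latent=128, d_sbert=768):
        super().__init__()
        self.f_enc_I = nn.Linear(d, d_latent)        # item encoder (1-layer MLP)
        self.f_enc_T = nn.Linear(d_sbert, d_latent)  # text encoder (1-layer MLP)
        self.f_dec_I = nn.Linear(d_latent, d)        # item decoder
        self.f_dec_T = nn.Linear(d_latent, d_sbert)  # text decoder

    def forward(self, E_i, Q_i, x_u, E_pos, E_neg, alpha=0.5, beta=0.5):
        # E_i: (B, d) frozen CF-RecSys item embeddings; Q_i: (B, 768) SBERT text
        # embeddings; x_u: (B, d) user representations from the frozen CF-RecSys;
        # E_pos / E_neg: (B, d) embeddings of the positive / negative next item.
        e_i, q_i = self.f_enc_I(E_i), self.f_enc_T(Q_i)

        # Eq. (2): latent-space matching between item and text embeddings
        loss_match = F.mse_loss(e_i, q_i)

        # Eqs. (3)-(4): reconstruction losses against over-smoothed representations
        loss_item = F.mse_loss(self.f_dec_I(e_i), E_i)
        loss_text = F.mse_loss(self.f_dec_T(q_i), Q_i)

        # Eq. (5): recommendation loss with a positive and a negative next item
        pos = (x_u * self.f_dec_I(self.f_enc_I(E_pos))).sum(-1)
        neg = (x_u * self.f_dec_I(self.f_enc_I(E_neg))).sum(-1)
        loss_rec = -(torch.log(torch.sigmoid(pos) + 1e-8)
                     + torch.log(1 - torch.sigmoid(neg) + 1e-8)).mean()

        # Eq. (6): final Stage-1 objective
        return loss_match + alpha * loss_item + beta * loss_text + loss_rec
```

In this sketch, $\mathbf{E}_i$ and $\mathbf{x}_u$ come from the frozen CF-RecSys, $\mathbf{Q}_i$ comes from SBERT (which, as described above, is fine-tuned alongside the alignment network), and alpha and beta correspond to the coefficients of Equation 6.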
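In the same spirit, the sketch below illustrates one way Stage-2 could realize Equation 7 and the prompt of Figure 3 at the token-embedding level: ordinary tokens are embedded by the frozen LLM's embedding table, while the [User Representation] and [Item Emb] slots are filled with the projected vectors $\mathbf{O}_u$ and $\mathbf{O}_i$. The placeholder scheme, the Hugging Face-style calls, the ReLU activation, and the OPT-6.7B hidden size of 4096 are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class Stage2Projector(nn.Module):
    """Sketch of F_U and F_I (Eq. 7): 2-layer MLPs into the LLM token space."""
    def __init__(self, d=50, d_latent=128, d_token=4096):
        super().__init__()
        self.F_U = nn.Sequential(nn.Linear(d, d_token), nn.ReLU(),
                                 nn.Linear(d_token, d_token))
        self.F_I = nn.Sequential(nn.Linear(d_latent, d_token), nn.ReLU(),
                                 nn.Linear(d_token, d_token))

def build_prompt_embeds(llm, tokenizer, projector, x_u, joint_item_embs, parts):
    """Assemble inputs_embeds for the frozen LLM from a mixed prompt.

    `parts` mixes plain strings with ("user",) and ("item", idx) markers, e.g.
    [("user",), " is a user representation. This user has watched ",
     "Scrooged ", ("item", 3), ...]. This marker scheme is an illustrative
    assumption, not the paper's exact prompt-assembly code.
    """
    tok_emb = llm.get_input_embeddings()  # frozen token-embedding table
    pieces = []
    for part in parts:
        if isinstance(part, str):
            ids = tokenizer(part, add_special_tokens=False,
                            return_tensors="pt").input_ids[0]
            pieces.append(tok_emb(ids))                       # ordinary tokens
        elif part[0] == "user":
            pieces.append(projector.F_U(x_u).unsqueeze(0))    # O_u at the start
        else:                                                 # ("item", idx)
            o_i = projector.F_I(joint_item_embs[part[1]])
            pieces.append(o_i.unsqueeze(0))                   # O_i next to its title
    return torch.cat(pieces, dim=0).unsqueeze(0)  # (1, seq_len, d_token)
```

The resulting tensor can be passed to the frozen LLM as `inputs_embeds`, and only $F_U$ and $F_I$ receive gradients from the next-item-title generation objective of Equation 8, matching the description above.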


Table 1: Overall model performance (Hit@1) over various datasets. The best performance is denoted in bold. Columns are grouped into collaborative filtering (NCF, NextItNet, GRU4Rec, SASRec), modality-aware (MoRec, CTRL, RECFORMER), and LLM-based (LLM-Only, TALLRec, MLP-LLM, A-LLMRec) models.

Dataset | NCF | NextItNet | GRU4Rec | SASRec | MoRec | CTRL | RECFORMER | LLM-Only | TALLRec | MLP-LLM | A-LLMRec
Movies and TV | 0.4273 | 0.5855 | 0.5215 | 0.6154 | 0.4130 | 0.3467 | 0.4865 | 0.0121 | 0.2345 | 0.5838 | 0.6237
Video Games | 0.3159 | 0.4305 | 0.4026 | 0.5402 | 0.4894 | 0.2354 | 0.4925 | 0.0168 | 0.4403 | 0.4788 | 0.5282
Beauty | 0.2957 | 0.4231 | 0.4131 | 0.5298 | 0.4997 | 0.3963 | 0.4878 | 0.0120 | 0.5542 | 0.5548 | 0.5809
Toys | 0.1849 | 0.1415 | 0.1673 | 0.2359 | 0.1728 | 0.1344 | 0.2871 | 0.0141 | 0.0710 | 0.3225 | 0.3336

Table 2: Statistics of the datasets after preprocessing. Avg. Len denotes the average sequence length of users.

Dataset | #Users | #Items | #Interactions | Avg. Len
Movies and TV | 297,498 | 59,944 | 3,409,147 | 11.46
Video Games | 64,073 | 33,614 | 598,509 | 8.88
Beauty | 9,930 | 6,141 | 63,953 | 6.44
Toys | 30,831 | 61,081 | 282,213 | 9.15

Table 3: Hyperparameter specifications of A-LLMRec.

Dataset | Learning rate (Stage 1) | Learning rate (Stage 2) | Embedding dim (CF-RecSys) d | Embedding dim (f_I^enc, f_T^enc) d' | alpha | beta
Movies and TV | 0.0001 | 0.0001 | 50 | 128 | 0.5 | 0.5
Video Games | 0.0001 | 0.0001 | 50 | 128 | 0.5 | 0.5
Beauty | 0.0001 | 0.0001 | 50 | 128 | 0.5 | 0.2
Toys | 0.0001 | 0.0001 | 50 | 128 | 0.5 | 0.2

5 EXPERIMENTS
5.1 Experimental Setup
Datasets. For comprehensive evaluations, we used four datasets from the Amazon datasets [13, 32], i.e., Movies and TV, Video Games, Beauty, and Toys, which contain comprehensive textual information including "title" and "description." Note that we deliberately selected datasets with varying statistics in terms of the number of users and items to conduct an extensive analysis of the models. The statistics for each dataset after preprocessing are presented in Table 2, and we describe the details regarding data preprocessing as follows:
• Movies and TV: To evaluate the models on a large scale, we select about 300K users and 60K items. Following existing studies [20, 51], we removed users and items with fewer than 5 interactions.
• Video Games: To evaluate the models on moderate-scale data, which is smaller than the Movies and TV dataset, we select about 64K users and 33K items, removing users and items with fewer than 5 interactions, as in the Movies and TV dataset.
• Beauty: To compose a small and cold dataset, we select about 9K users and 6K items, removing users and items with fewer than 4 interactions. To retain some information from user-item feedback, we categorized user ratings by treating items rated above 3 as positive and all others, including non-interacted items, as negative.
• Toys: For the evaluation of the models where the number of items is larger than the number of users, unlike the other datasets, we select about 30K users and 61K items, with the number of items being about twice as large as the number of users, and remove users and items with fewer than 4 interactions. Similar to the Beauty dataset, to preserve some information from user-item feedback, we categorize positive and negative items with the criterion of rating 3.

Baselines. We compare A-LLMRec with the following baselines that can be categorized into three types: collaborative filtering recommender systems (NCF [15], NextItNet [50], GRU4Rec [17], and SASRec [20]), modality-aware recommender systems (MoRec [51], CTRL [25], and RECFORMER [24]), and LLM-based recommender systems (LLM-Only, TALLRec [2], and MLP-LLM). For more detail regarding the baselines, please refer to Appendix A.

Evaluation Setting. We divide user sequences into training, validation, and test sets. For each user sequence, the most recently interacted item, denoted as $i^u_{|\mathcal{S}^u|}$, is used as the test set, while the second most recent user interaction item, $i^u_{|\mathcal{S}^u|-1}$, is used as the validation set. The remaining sequence of items is used as the training set. To evaluate the performance of sequential recommendation models, we add 19 randomly selected non-interacted items to the test set, so that the test set of each user contains 1 positive item and 19 negative items. For quantitative comparison, we employ a widely used metric, Hit Ratio at 1 (Hit@1), for all experiments.

Implementation Details. Although A-LLMRec is model-agnostic, in this work we adopt OPT-6.7B [53] as the backbone LLM and SASRec [20] as the pre-trained CF-RecSys. For fair comparisons, we also used OPT-6.7B as the backbone LLM for the other LLM-based models (i.e., LLM-Only, TALLRec [2], and MLP-LLM). Moreover, we use SASRec as the CF-RecSys in the other modality-aware models (i.e., MoRec [51] and CTRL [25]), and fix the dimension of item and model embeddings to 50 for all the methods and datasets. For RECFORMER [24], we follow the paper and employ Longformer [3] as the backbone network. We set the batch size to 128 for all collaborative filtering-based and modality-aware models. Moreover, the batch size is set to 32 for Stage-1 of A-LLMRec, and 4 for MLP-LLM, TALLRec, and Stage-2 of A-LLMRec. We trained Stage-1 of A-LLMRec for 10 epochs and Stage-2 of A-LLMRec for 5 epochs, and TALLRec is trained for a maximum of 5 epochs. We use the Adam optimizer to train the models on all datasets. For hyperparameters, we tune the models in the following ranges: learning rates η1, η2 in {0.01, 0.001, 0.0005, 0.0001} for each training stage, and coefficients α, β in {0.1, 0.2, 0.5, 0.75, 1.0}; we report the best-performing hyperparameters for each dataset in Table 3. We use four NVIDIA GeForce A6000 48GB GPUs to train the LLM-based models on the Movies and TV dataset, and one NVIDIA GeForce A6000 48GB GPU for the other datasets, covering LLM-based and other models alike.
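As a concrete illustration of the evaluation setting described above, the following small Python sketch computes Hit@1 for one user with 1 held-out positive and 19 randomly sampled non-interacted negatives; the `score_fn` interface and the sampling details are assumptions made for illustration, not the exact evaluation code.

```python
import random

def hit_at_1(score_fn, user_history, positive_item, all_items, num_negatives=19, seed=None):
    """Hit@1 with 1 held-out positive and 19 sampled non-interacted items.

    score_fn(user_history, item) -> float is the model being evaluated
    (e.g., a CF-RecSys scorer or the LLM-based ranker).
    """
    rng = random.Random(seed)
    seen = set(user_history) | {positive_item}
    negatives = rng.sample([i for i in all_items if i not in seen], num_negatives)
    candidates = [positive_item] + negatives
    best = max(candidates, key=lambda item: score_fn(user_history, item))
    return 1.0 if best == positive_item else 0.0

# Averaging hit_at_1 over all test users yields the Hit@1 values reported in the tables.
```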
5.2 Performance Comparison
For comprehensive evaluations of A-LLMRec, we perform evaluations under various scenarios, i.e., the general scenario (Sec. 5.2.1),
the cold/warm item scenario (Sec. 5.2.2), the cold user scenario (Sec. 5.2.3), the few-shot training scenario (Sec. 5.2.4), and the cross-domain scenario (Sec. 5.2.5).

Table 4: Results (Hit@1) on the cold/warm item scenario. A-LLMRec (SBERT) is a variant of A-LLMRec that uses q instead of e for inference.

Model | Movies and TV (Cold) | Movies and TV (Warm) | Video Games (Cold) | Video Games (Warm) | Beauty (Cold) | Beauty (Warm)
SASRec | 0.2589 | 0.6787 | 0.1991 | 0.5764 | 0.1190 | 0.6312
MoRec | 0.2745 | 0.4395 | 0.2318 | 0.4977 | 0.2145 | 0.5425
CTRL | 0.1517 | 0.3840 | 0.2074 | 0.2513 | 0.1855 | 0.4711
RECFORMER | 0.3796 | 0.5449 | 0.3039 | 0.5377 | 0.3387 | 0.5133
TALLRec | 0.2654 | 0.2987 | 0.3950 | 0.4897 | 0.5462 | 0.6124
A-LLMRec | 0.5714 | 0.6880 | 0.4263 | 0.5970 | 0.5605 | 0.6414
A-LLMRec (SBERT) | 0.5772 | 0.6802 | 0.4359 | 0.5792 | 0.5591 | 0.6405

Table 5: Results (Hit@1) on the cold user scenario.

Model | Movies and TV | Video Games | Beauty
SASRec | 0.2589 | 0.4048 | 0.4459
MoRec | 0.3918 | 0.3572 | 0.4815
CTRL | 0.2273 | 0.1737 | 0.3902
RECFORMER | 0.4481 | 0.3989 | 0.4644
TALLRec | 0.2143 | 0.3895 | 0.5202
MLP-LLM | 0.4909 | 0.3960 | 0.5276
A-LLMRec | 0.5272 | 0.4160 | 0.5337

Table 6: Results (Hit@1) on the few-shot training scenario on various datasets (K: number of users in the training set).

Dataset | K | SASRec | MoRec | TALLRec | A-LLMRec | A-LLMRec (SBERT)
Movies and TV | 256 | 0.2111 | 0.2208 | 0.1846 | 0.2880 | 0.2963
Movies and TV | 128 | 0.1537 | 0.1677 | 0.1654 | 0.2518 | 0.2722
Video Games | 256 | 0.1396 | 0.1420 | 0.2321 | 0.2495 | 0.2607
Video Games | 128 | 0.1089 | 0.1157 | 0.1154 | 0.1608 | 0.1839
Beauty | 256 | 0.2243 | 0.2937 | 0.3127 | 0.3467 | 0.3605
Beauty | 128 | 0.1813 | 0.2554 | 0.2762 | 0.3099 | 0.3486

5.2.1 Overall Performance. The results of the recommendation task on the four datasets are given in Table 1. We have the following observations: 1) A-LLMRec outperforms other LLM-based recommender systems that do not consider the collaborative knowledge from user-item interactions (i.e., LLM-Only and TALLRec), implying that the collaborative knowledge is crucial for improving the performance of recommendation in general. 2) We observe that MLP-LLM, which replaces the alignment module of A-LLMRec with a simple MLP, underperforms A-LLMRec. This implies that bridging between the CF-RecSys and the LLM is a challenging problem and that our proposed two-stage alignment module is beneficial. 3) 'LLM-Only' performs the worst among the LLM-based models, implying that naively adopting an LLM based on a prompt designed for the recommendation task is not sufficient. Note that the prompt used by 'LLM-Only' is exactly the same as the prompt shown in Figure 3 without the user representation and item embeddings. This again demonstrates the importance of incorporating collaborative knowledge into the LLM for improving the recommendation performance. 4) While TALLRec fine-tunes the LLM for the recommendation task, it underperforms a collaborative filtering model, SASRec. This highlights that the text information alone may not generate sufficient knowledge for capturing collaborative knowledge effectively even with fine-tuning the LLM. This again demonstrates the superiority of our alignment module. 5) Although the modality-aware models (MoRec and CTRL) use SASRec as the backbone CF-RecSys, they underperform SASRec. Moreover, RECFORMER struggles to outperform SASRec despite using Longformer for item text attributes, due to the emphasis on textual information in similarity matching between user and item sentences. This shows that the modality knowledge might hinder the learning of collaborative knowledge, leading to performance degradation.

5.2.2 Cold/Warm Item Scenarios. This section evaluates the models under cold/warm item scenarios. Items are labeled as 'warm' if they belong to the top 35% of interactions, while those in the bottom 35% are labeled as 'cold' items. After training each model using all the available data in the training set, we separately evaluate cold and warm items in the test set (Table 4). We make the following observations: 1) A-LLMRec outperforms all other baselines across both scenarios, which demonstrates that our alignment network indeed allows the LLM to understand and utilize the collaborative knowledge. 2) On the other hand, TALLRec outperforms SASRec only under the cold scenario, whereas SASRec outperforms TALLRec only under the warm scenario. This demonstrates the importance of capturing both the collaborative knowledge and the text information to excel in both cold/warm scenarios. 3) A-LLMRec (SBERT) outperforms A-LLMRec under the cold item scenario, while A-LLMRec generally outperforms A-LLMRec (SBERT) under the warm item scenario. As discussed in Section 4.1.4, this implies that the joint collaborative-text embedding obtained from the text encoder given the text information (i.e., $\mathbf{q}_i = f_T^{enc}(\mathbf{Q}_i)$) is more useful than that obtained from the item encoder given the item embedding (i.e., $\mathbf{e}_i = f_I^{enc}(\mathbf{E}_i)$).

5.2.3 Cold User Scenarios. Besides evaluations under the cold item scenario, we additionally conduct evaluations under the cold user scenario (Table 5). To simulate the cold user scenario, we sample users who have interacted with exactly three items, where the last item in the sequence serves as the test set. Then, we use the models trained on the entire set of users except for the sampled users to perform inference on the sampled users. We observe that A-LLMRec consistently outperforms other models in the cold user scenario, while SASRec struggles to perform well, especially on a large dataset, i.e., Movies and TV, due to the lack of collaborative knowledge from users. Moreover, LLM-based models demonstrate superior performance in handling cold users as text information becomes useful under cold scenarios.

5.2.4 Few-shot Training Scenario. To investigate the impact of unseen/new items on recommendation models, we conduct experiments on a few-shot training scenario where the number of users in the training set is extremely limited to only K users, i.e., K-shot (Table 6). Under this scenario, we expect the models to encounter a large amount of unseen/new items at the inference stage, which would make it hard to provide accurate recommendations. We have the following observations: 1) A-LLMRec outperforms all other baselines under the few-shot scenario. Despite being trained with an extremely small amount of users, A-LLMRec relies on the CF-RecSys to capture the collaborative knowledge, which is combined with the textual knowledge of items, leading to superior performance in few-shot learning. 2) A-LLMRec (SBERT) outperforms A-LLMRec,
implying again that using the text encoder to extract the joint text-collaborative knowledge is useful when items lack interactions. 3) Under the few-shot scenario, LLM-based models outperform the CF-RecSys, i.e., SASRec, due to the textual understanding of the LLM, which helps extract information from the text of unseen items, while the CF-RecSys suffers from the lack of collaborative knowledge regarding unseen/new items.

Table 7: Results (Hit@1) on a cross-domain scenario (i.e., pre-trained: Movies and TV, evaluation: Video Games).

 | SASRec | MoRec | RECFORMER | TALLRec | A-LLMRec | A-LLMRec (SBERT)
Movies and TV → Video Games | 0.0506 | 0.0624 | 0.0847 | 0.0785 | 0.0901 | 0.1203

5.2.5 Cross-domain Scenario. To further investigate the generalization ability of A-LLMRec, we evaluate the models on the cross-domain scenario, where the models are evaluated on datasets that have not been used for training (Table 7). Specifically, we pre-train the models on the Movies and TV dataset and perform evaluations on the Video Games dataset. We have the following observations: 1) A-LLMRec outperforms all the baselines in the cross-domain scenario, and A-LLMRec (SBERT) performs particularly well. This is again attributed to the text encoder that becomes useful when collaborative information is lacking. 2) SASRec underperforms the modality-aware models and LLM-based models, indicating that using textual knowledge is crucial for the cross-domain scenario due to the lack of collaborative information.

5.3 Ablation Studies
In this section, we show ablation studies for our model. We mainly analyze the effect of each component in A-LLMRec regarding Stage-1 (Section 5.3.1) and Stage-2 (Section 5.3.2).

Table 8: Ablation studies on Stage-1 of A-LLMRec (Hit@1).

Ablation | Movies and TV | Beauty | Toys
A-LLMRec | 0.6237 | 0.5809 | 0.3336
w/o L_matching | 0.5838 | 0.5548 | 0.3225
w/o L_item-recon & L_text-recon | 0.5482 | 0.5327 | 0.3204
w/o L_rec | 0.6130 | 0.5523 | 0.1541
Freeze SBERT | 0.6173 | 0.5565 | 0.1720

5.3.1 Effect of Components in Stage-1. This section presents the experimental results showing the benefit of each component during Stage-1. Across all datasets, the exclusion of any loss resulted in decreased performance. We make the following observations: 1) Removing L_matching defined in Equation 2 results in a significant performance decline across all datasets. This implies that the alignment between the item and the text information is effective and that the LLM can comprehend item textual information in the joint collaborative-text embeddings to enhance recommendation capabilities. 2) Removing L_item-recon and L_text-recon leads to a performance drop, owing to the risk of over-smoothed representations (i.e., e ≈ q), as discussed in Section 4.1.1. 3) We observe that removing L_rec leads to a performance drop. Since L_rec is introduced to explicitly incorporate the collaborative knowledge while informing the model about the recommendation task, the performance drop indicates the reduction of collaborative knowledge between items and users, which is crucial for recommendation tasks. 4) Lastly, we kept SBERT frozen while training A-LLMRec. We observe that freezing SBERT leads to poor performance across all datasets. This implies that fine-tuning SBERT facilitates the text embeddings to adapt to the recommendation task.

Table 9: Ablation study on Stage-2 of A-LLMRec (Hit@1).

Row | Ablation | Movies and TV | Video Games | Beauty | Toys
(1) | A-LLMRec | 0.6237 | 0.5282 | 0.5809 | 0.3336
(2) | A-LLMRec w/o user representation | 0.5925 | 0.5121 | 0.5547 | 0.3217
(3) | A-LLMRec w/o joint embedding | 0.1224 | 0.4773 | 0.5213 | 0.2831
(4) | A-LLMRec with random joint embedding | 0.1200 | 0.4729 | 0.5427 | 0.0776

5.3.2 Effect of the Alignment Method in Stage-2. Recall that a user representation and item embeddings are injected into the LLM prompt as shown in Figure 3. In this section, we verify the benefit of injecting them into the prompt (rows (2)-(4) in Table 9). We have the following observations across all datasets: 1) The absence of either the user representation (row (2)) or the joint embedding (row (3)) from the prompt leads to a reduction in performance. Notably, the exclusion of the joint embedding results in a more substantial decrease, underscoring its significant role in transferring collaborative knowledge. Moreover, as joint embeddings also capture the textual information about items, their exclusion is particularly detrimental. 2) When we replace the joint embedding with a randomly initialized embedding (row (4)), which means A-LLMRec is trained with item embeddings without collaborative knowledge, we observe performance degradation across all datasets. This indicates the importance of leveraging the collaborative knowledge for recommendation.

5.4 Model Analysis
5.4.1 Train/Inference Speed. Recall that A-LLMRec requires the fine-tuning of neither the CF-RecSys nor the LLM. Specifically, A-LLMRec is efficient in that the alignment network is the only trainable neural network, while TALLRec [2] requires the fine-tuning of the LLM with LoRA. In this section, we compare the training and the inference time of A-LLMRec and TALLRec. As for the training time, we measured the total time spent until the end of training, and as for the inference time, we measured the time spent per mini-batch. Table 10 shows that A-LLMRec exhibits significantly faster training and inference time compared with TALLRec. Notably, a more substantial improvement is observed in training time, since A-LLMRec does not require the LLM to be fine-tuned unlike TALLRec, which demonstrates the applicability of LLMs to large-scale recommendation datasets. Moreover, the faster inference time demonstrates the practicality of A-LLMRec in real-world scenarios, especially in the context of real-time recommendation services where inference time is critically important.

5.4.2 Training with All Items in Each Sequence. Recall that for efficiency in training, we used only the last item of each user sequence when optimizing the final loss in Stage-1 (Equation 6) and Stage-2 (Equation 8) of A-LLMRec. In this section, we report the recommendation performance in terms of Hit@1 and the train/inference speed when using all items in each user sequence for optimization (see
[User Representation] is a user representation.


Table 10: Train/Inference time comparison (Beauty dataset). This user has watched [An American Christmas Carol (Item Emb), This user has watched [An American Christmas
Darker Than Black (Item Emb), Scrooged (Item Emb), The Burns Carol, Darker Than Black, Scrooged, The Burns
and Allen Show – Collection (Item Emb) … ] in the past. Specify and Allen Show – Collection, … ] in the past.
Train time (min) Inference time (sec/batch) Hit@1 the genres this user would enjoy watching. Specify the genres this user would enjoy watching.

Comedy, Drama, Family


TALLRec 588.58 3.36 0.5542 I’m not sure what you mean by this, but “An
This user would like to receive recommendations for the
American Christmas Carol” is a good movie !
A-LLMRec 232.5 1.98 0.5809 following categories of content: Family Life, Children’s
Programming and Animation
A-LLMRecall 643.33 1.98 0.6002
This user has watched [Meet Jon Doe, Mr Sith Goes
[User Representation] is a user representation.
to Washington, Mary Poppins, … ] in the past.
This user has watched [Meet Jon Doe (Item Emb), Mr Sith Goes
Specify the genres this user would enjoy watching.
to Washington (Item Emb), Mary Poppins (Item Emb) … ] in the
Table 11: Results showing A-LLMRec is model-agnostic. past. Specify the genres this user would enjoy watching.
Please specify what genres this user enjoys
Action, Comedy, Drama, Family watching in order to help us improve our
Model Beauty Toys “Mission: Impossible: The 5 Movie Collection (2015)” is recommendations for you!
a film from the series “Action”, this user will watch it
SASRec 0.5298 0.2359
(a) A-LLMRec (b) LLM-Only
A-LLMRec (SASRec) 0.5809 0.3336
NextItNet 0.4231 0.1415
A-LLMRec (NextItNet) 0.5642 0.3203 Figure 4: A-LLMRec v.s. LLM-Only on the favorite genre pre-
GRU4Rec 0.4131 0.1673 diction task (Movies and TV dataset used).
A-LLMRec (GRU4Rec) 0.5542 0.3089
watching. The only difference in the prompt is that while LLM-only
NCF 0.2957 0.1849
is only given titles of movies watched by the user in the past, A-
A-LLMRec (NCF) 0.5431 0.3263
LLMRec is given the user representation and item embeddings along
A-LLMRecall in Table 10). We observe that as expected the recom- with the movie titles. In Figure 4, we observe that A-LLMRec in-
mendation performance is further improved when using all items deed generates proper answers, while LLM-Only fails to do so. We
in each user sequence. However, considering that the training time attribute this to the fact that the item embeddings of the CF-RecSys
also increased approximately 3 times, the improvement seems mar- are well aligned with the token space of the LLM, which enables
ginal. It is important to note that since vanilla A-LLMRec is trained the LLM to understand and utilize collaborative knowledge. Note
based on only the last item in each user sequence, there is a large that although we also experimented with TALLRec, we were not
amount of unseen/new items that appear in the test set4 . However, able to obtain valid outputs. We conjecture that since the LLM in
valilla A-LLMRec still showed comparable performance with A- TALLRec is fine-tuned via an instruction-tuning process that makes
LLMRecall , implying the generalization ability of A-LLMRec. the model provide responses as part of the recommendation task,
generating valid natural language outputs has become a non-trivial
5.4.3 A-LLMRec is Model-Agnostic. Although A-LLMRec adopts task. Please refer to Appendix B for the results of TALLRec.
SASRec as the backbone CF-RecSys, it can be replaced with any
existing collaborative filtering recommender systems, thanks to the 6 CONCLUSION
model-agnostic property. Hence, we adopt three other collaborative In this paper, we propose a novel LLM-based recommender system,
filtering recommender systems including two sequential recom- named A-LLMRec. The main idea is to enable LLMs to utilize the
menders (i.e., NextItNet and GRU4Rec), and one non-sequential collaborative knowledge from pre-trained CF-RecSys. By doing
recommender (i.e., NCF) to A-LLMRec. We make the following so, A-LLMRec outperforms existing CF-RecSys, modality-aware
observations from Table 11. 1) Adopting the SASRec backbone per- recommender systems, and LLM-based recommenders under vari-
forms the best, which is expected since SASRec outperforms other ous scenarios including cold/warm items, cold user, few-shot, and
CF-RecSys in their vanilla versions. This implies that transferring cross-domain scenarios. Moreover, we also demonstrate that the
high-quality collaborative knowledge can enhance the performance two advantages originated from fine-tuning neither pre-trained
of A-LLMRec. 2) Adopting A-LLMRec to any backbone improves CF-RecSys nor LLMs, i,e, Model-agnostic and efficiency. Lastly, we
the performance of the vanilla model. This implies that if the SOTA show the potential of A-LLMRec in generating natural language
model changes in the future, our framework has the potential to tasks based on the understanding of collaborative knowledge from
further improve performance by replacing the existing CF-RecSys CF-RecSys. For future work, we plan to further enhance the ability
in the framework. 3) We observe that while the performance differ- of the LLM in A-LLMRec based on advanced prompt engineering
ence between SASRec and NCF is nearly double when they operate such as chain-of-thought prompting [46].
as standalone CF-RecSys, the integration with A-LLMRec, which
Ethics Statement To the best of our knowledge, this paper aligns
leverages the modality of item text information and the intensive
with the KDD Code of Ethics without any ethical concerns. The
capabilities of LLM, reduces this performance gap.
5.4.4 Beyond Recommendation: Language Generation Task (Favorite genre prediction). To validate whether A-LLMRec can generate natural language outputs based on the understanding of users and items through the aligned collaborative knowledge from the CF-RecSys, we conduct a favorite genre prediction task (Figure 4). That is, given the same prompt format, we ask the LLM-based models (i.e., A-LLMRec and LLM-Only), both using OPT-6.7B as the backbone LLM, to predict the movie genres that a given user would enjoy watching. A-LLMRec generates valid genre predictions whereas LLM-Only fails to do so, indicating that the collaborative embeddings produced by A-LLMRec are well aligned with the token space of the LLM, which enables the LLM to understand and utilize collaborative knowledge. Note that although we also experimented with TALLRec, we were not able to obtain valid outputs. We conjecture that since the LLM in TALLRec is fine-tuned via an instruction-tuning process that makes the model provide responses as part of the recommendation task, generating valid natural language outputs becomes a non-trivial task. Please refer to Appendix B for the results of TALLRec.
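As a rough illustration of what it means for the aligned embeddings to live in the token space of the LLM, the sketch below splices projected item embeddings directly into the token embeddings of a genre-prediction prompt before generation. The placeholder item_embs tensor, the prompt wording, and the reliance on generate(inputs_embeds=...) (available in recent versions of HuggingFace transformers for decoder-only models) are assumptions for illustration, not the exact pipeline used in our experiments.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
llm = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b", torch_dtype=torch.float16)

left = "This user has watched ["
right = "] in the past. Specify the genres this user would enjoy watching."

emb = llm.get_input_embeddings()
left_emb = emb(tok(left, return_tensors="pt").input_ids)                              # (1, L1, d)
right_emb = emb(tok(right, return_tensors="pt", add_special_tokens=False).input_ids)  # (1, L2, d)

# Placeholder for joint item embeddings already projected to the LLM hidden size d (assumption)
item_embs = torch.zeros(1, 4, llm.config.hidden_size, dtype=left_emb.dtype)

inputs_embeds = torch.cat([left_emb, item_embs, right_emb], dim=1)
out = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))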
6 CONCLUSION
In this paper, we propose a novel LLM-based recommender system named A-LLMRec. The main idea is to enable LLMs to utilize the collaborative knowledge from a pre-trained CF-RecSys. By doing so, A-LLMRec outperforms existing CF-RecSys, modality-aware recommender systems, and LLM-based recommenders under various scenarios including cold/warm item, cold user, few-shot, and cross-domain scenarios. Moreover, we demonstrate the two advantages that originate from fine-tuning neither the pre-trained CF-RecSys nor the LLM, i.e., model-agnosticity and efficiency. Lastly, we show the potential of A-LLMRec on natural language generation tasks based on its understanding of the collaborative knowledge from the CF-RecSys. For future work, we plan to further enhance the ability of the LLM in A-LLMRec through advanced prompt engineering such as chain-of-thought prompting [46].

Ethics Statement. To the best of our knowledge, this paper aligns with the KDD Code of Ethics without any ethical concerns. The datasets and codes employed in our research are publicly available.

ACKNOWLEDGMENTS
This work was supported by NAVER Corporation, the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2024-00335098), and the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (NRF-2022M3J6A1063021).

REFERENCES
[1] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling Popularity Bias in Learning-to-Rank Recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems (Como, Italy) (RecSys ’17). Association for Computing Machinery, New York, NY, USA, 42–46. https://doi.org/10.1145/3109859.3109912
[2] Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. TALLRec: An effective and efficient tuning framework to align large language model with recommendation. arXiv preprint arXiv:2305.00447 (2023).
[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020).
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[5] Allison JB Chaney, David M Blei, and Tina Eliassi-Rad. 2015. A probabilistic model for using social networks in personalized item recommendation. In Proceedings of the 9th ACM Conference on Recommender Systems. 43–50.
[6] Jiawei Chen, Hande Dong, Yang Qiu, Xiangnan He, Xin Xin, Liang Chen, Guli Lin, and Keping Yang. 2021. AutoDebias: Learning to Debias for Recommendation (SIGIR ’21). Association for Computing Machinery, New York, NY, USA, 21–30. https://doi.org/10.1145/3404835.3462919
[7] Chen Cheng, Haiqin Yang, Michael R. Lyu, and Irwin King. 2013. Where you like to go next: successive point-of-interest recommendation. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (Beijing, China) (IJCAI ’13). AAAI Press, 2605–2611.
[8] Robert G. Cooper and Scott J. Edgett. 2012. Best Practices in the Idea-to-Launch Process and Its Governance. Research Technology Management 55, 2 (2012), 43–54. https://www.jstor.org/stable/26586220
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
[11] Xiaoyu Du, Zike Wu, Fuli Feng, Xiangnan He, and Jinhui Tang. 2022. Invariant Representation Learning for Multimedia Recommendation. In Proceedings of the 30th ACM International Conference on Multimedia (Lisboa, Portugal) (MM ’22). Association for Computing Machinery, New York, NY, USA, 619–628. https://doi.org/10.1145/3503161.3548405
[12] Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. 2023. Chat-REC: Towards interactive and explainable LLMs-augmented recommender system. arXiv preprint arXiv:2303.14524 (2023).
[13] Ruining He and Julian McAuley. 2016. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. In Proceedings of the 25th International Conference on World Wide Web (Montréal, Québec, Canada) (WWW ’16). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 507–517. https://doi.org/10.1145/2872427.2883037
[14] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
[15] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
[16] Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. 2023. Large language models as zero-shot conversational recommenders. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 720–730.
[17] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[18] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
[19] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining. IEEE, 263–272.
[20] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
[21] Sein Kim, Namkyeong Lee, Donghyun Kim, Minchul Yang, and Chanyoung Park. 2023. Task Relation-aware Continual User Representation Learning. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’23). Association for Computing Machinery, New York, NY, USA, 1107–1119. https://doi.org/10.1145/3580305.3599516
[22] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger (Eds.), Vol. 25. Curran Associates, Inc.
[24] Jiacheng Li, Ming Wang, Jin Li, Jinmiao Fu, Xin Shen, Jingbo Shang, and Julian McAuley. 2023. Text Is All You Need: Learning Language Representations for Sequential Recommendation (KDD ’23). Association for Computing Machinery, New York, NY, USA, 1258–1267. https://doi.org/10.1145/3580305.3599519
[25] Xiangyang Li, Bo Chen, Lu Hou, and Ruiming Tang. 2023. CTRL: Connect Tabular and Language Model for CTR Prediction. arXiv preprint arXiv:2306.02841 (2023).
[26] Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
[27] Chang Liu, Xiaoguang Li, Guohao Cai, Zhenhua Dong, Hong Zhu, and Lifeng Shang. 2021. Noninvasive self-attention for side information fusion in sequential recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4249–4256.
[28] Fan Liu, Huilin Chen, Zhiyong Cheng, Anan Liu, Liqiang Nie, and Mohan Kankanhalli. 2022. Disentangled multimodal representation learning for recommendation. IEEE Transactions on Multimedia (2022).
[29] Zhuang Liu, Yunpu Ma, Matthias Schubert, Yuanxin Ouyang, and Zhang Xiong. 2022. Multi-Modal Contrastive Pre-training for Recommendation. In Proceedings of the 2022 International Conference on Multimedia Retrieval (Newark, NJ, USA) (ICMR ’22). Association for Computing Machinery, New York, NY, USA, 99–108. https://doi.org/10.1145/3512527.3531378
[30] Chih-Chao Ma. 2008. A guide to singular value decomposition for collaborative filtering. Computer (Long Beach, CA) 2008 (2008), 1–14.
[31] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-Based Recommendations on Styles and Substitutes (SIGIR ’15). Association for Computing Machinery, New York, NY, USA, 43–52. https://doi.org/10.1145/2766462.2767755
[32] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 43–52.
[33] Andriy Mnih and Russ R Salakhutdinov. 2007. Probabilistic matrix factorization. Advances in Neural Information Processing Systems 20 (2007).
[34] Yunhak Oh, Sukwon Yun, Dongmin Hyun, Sein Kim, and Chanyoung Park. 2023. MUSE: Music Recommender System with Shuffle Play Recommendation Enhancement. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23). Association for Computing Machinery, New York, NY, USA, 1928–1938. https://doi.org/10.1145/3583780.3614976
[35] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019).
[36] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized Markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web (Raleigh, North Carolina, USA) (WWW ’10). Association for Computing Machinery, New York, NY, USA, 811–820. https://doi.org/10.1145/1772690.1772773
[37] Scott Sanner, Krisztian Balog, Filip Radlinski, Ben Wedin, and Lucas Dixon. 2023. Large language models are competitive near cold-start recommenders for language- and item-based preferences. In Proceedings of the 17th ACM Conference on Recommender Systems. 890–896.
[38] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web. 285–295.
[39] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web. 111–112.
[40] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1441–1450.
[41] Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 565–573.
[42] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[43] Maksims Volkovs, Guangwei Yu, and Tomi Poutanen. 2017. DropoutNet: Addressing Cold Start in Recommender Systems. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/dbd22ba3bd0df8f385bdac3e9f8be207-Paper.pdf
[44] Lei Wang and Ee-Peng Lim. 2023. Zero-Shot Next-Item Recommendation using Large Pretrained Language Models. arXiv preprint arXiv:2304.03153 (2023).
[45] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022).
[46] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
[47] Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019. MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video (MM ’19). Association for Computing Machinery, New York, NY, USA, 1437–1445. https://doi.org/10.1145/3343031.3351034
[48] Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, et al. 2023. A Survey on Large Language Models for Recommendation. arXiv preprint arXiv:2305.19860 (2023).
[49] Jieyu Yang, Liang Zhang, Yong He, Ke Ding, Zhaoxin Huan, Xiaolu Zhang, and Linjian Mo. 2023. DCBT: A Simple But Effective Way for Unified Warm and Cold Recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 3369–3373. https://doi.org/10.1145/3539618.3591856
[50] Fajie Yuan, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M Jose, and Xiangnan He. 2019. A simple convolutional generative network for next item recommendation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 582–590.
[51] Zheng Yuan, Fajie Yuan, Yu Song, Youhua Li, Junchen Fu, Fei Yang, Yunzhu Pan, and Yongxin Ni. 2023. Where to Go Next for Recommender Systems? ID- vs. Modality-based Recommender Models Revisited (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 2639–2649. https://doi.org/10.1145/3539618.3591932
[52] Sukwon Yun, Kibum Kim, Kanghoon Yoon, and Chanyoung Park. 2022. LTE4G: Long-tail experts for graph neural networks. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2434–2443.
[53] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
[54] Xun Zhou, Jing He, Guangyan Huang, and Yanchun Zhang. 2015. SVD-based incremental approaches for recommender systems. J. Comput. System Sci. 81, 4 (2015), 717–733.


[Figure 5 panels: for three example users, the genre-prediction prompt is built from the user's watch history (with item embeddings injected for A-LLMRec, and item titles only for the text-based baselines), and the outputs of (a) A-LLMRec, (b) LLM-Only, and (c) TALLRec are shown. A-LLMRec returns valid genres such as "Horror, Mystery/Thriller", "Westerns, Action & Adventure", and "Action, Adventure/Fantasy", whereas LLM-Only and TALLRec fail to produce valid genre predictions.]
Figure 5: A-LLMRec, LLM-Only, and TALLRec on the favorite genre prediction task (Movies and TV dataset used).
LLM Input: This user has watched [HISTORY (Item Titles)] in the past. Recommend a movie for this user to watch next from the following set of movie titles, [CANDIDATE (Item Titles)]. The recommendation is
LLM Output: [Next Item Title]

Figure 6: An example prompt designed for the Amazon Movies dataset used by LLM-based models, i.e., the TALLRec and LLM-Only models.
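For clarity, a minimal sketch of how the Figure 6 prompt could be assembled for the text-only baselines is shown below; the function name and the example titles are illustrative assumptions, not part of the released code.

def build_prompt(history_titles, candidate_titles):
    """Assemble a Figure 6-style prompt from item titles (illustrative sketch)."""
    history = ", ".join(history_titles)
    candidates = ", ".join(candidate_titles)
    return (
        f"This user has watched [{history}] in the past. "
        f"Recommend a movie for this user to watch next from the following set of "
        f"movie titles, [{candidates}]. The recommendation is"
    )

# Example usage with hypothetical history and candidate sets
prompt = build_prompt(
    ["White House Down", "Thor: The Dark World", "Ant-Man"],
    ["Sleeping Beauty", "The Fisher King", "Psycho 3"],
)
print(prompt)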
Table 12: Source code links of the baseline methods.

Methods       Source code
SASRec        https://github.com/pmixer/SASRec.pytorch
NextItNet     https://github.com/syiswell/NextItNet-Pytorch
GRU4Rec       https://github.com/hungpthanh/GRU4REC-pytorch
RECFORMER     https://github.com/AaronHeee/RecFormer
TALLRec       https://github.com/SAI990323/TALLRec
A-LLMRec      https://github.com/ghdtjr/A-LLMRec

A BASELINES
(1) Collaborative filtering recommender systems
• NCF [15] combines neural networks (MLP) to capture the collaborative information. Note that NCF is a two-tower model comprised of separate components for the user and item embedding matrices.
• NextItNet [50] proposes a temporal convolutional network that utilizes 1D-dilated convolutional layers and residual connections to capture the long-term dependencies inherent in interaction sequences.
• GRU4Rec [17] adopts RNNs to model user behavior sequences for session-based recommendations.
• SASRec [20] is our main baseline, a state-of-the-art collaborative filtering recommender system (CF-RecSys) that adopts a self-attention encoding method to model user preferences from user behavior sequences (a minimal illustrative sketch of such an encoder follows this list).
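The sketch below outlines a SASRec-style self-attention encoder over an item interaction sequence, with hypothetical dimensions and without training details; it is meant only to illustrate the kind of backbone A-LLMRec builds on, not to reproduce the referenced implementation.

import torch
import torch.nn as nn

class TinySASRec(nn.Module):
    """Minimal SASRec-style encoder: item + position embeddings, causal self-attention."""
    def __init__(self, num_items: int, dim: int = 64, max_len: int = 50, num_layers: int = 2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, dim, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=2, dim_feedforward=dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        # item_ids: (batch, seq_len) with 0 used as padding
        seq_len = item_ids.size(1)
        pos = torch.arange(seq_len, device=item_ids.device)
        h = self.item_emb(item_ids) + self.pos_emb(pos)
        # additive causal mask: -inf strictly above the diagonal
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                       device=item_ids.device), diagonal=1)
        h = self.encoder(h, mask=causal)
        return h[:, -1]  # last position serves as the user representation

# Example: score all items for a toy interaction sequence
model = TinySASRec(num_items=1000)
user_repr = model(torch.tensor([[3, 17, 256, 42]]))   # (1, 64)
scores = user_repr @ model.item_emb.weight.T           # (1, 1001)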
(2) Modality-aware recommender systems
• MoRec [51] employs a pre-trained SBERT to utilize the text information of items to generate the initial embeddings for items that will be used in collaborative filtering models. We utilize SASRec as the backbone model of MoRec.
• CTRL [25] employs a two-stage learning process: the first stage involves contrastive learning on the textual information of items to initialize the backbone model, and the second stage fine-tunes the model on recommendation tasks. We use SASRec as the backbone model of CTRL.
• RECFORMER [24] models user preferences and item features using the Transformer architecture, transforming sequential recommendation into a task of predicting the next item as if predicting the next sentence, by converting item attributes into a sentence format.
(3) LLM-based recommender systems
• LLM-Only utilizes an open-source LLM, OPT [53], with prompts related to recommendation tasks as shown in Figure 6. In our experiments, we adopt the 6.7B version of OPT for all LLM-based recommenders.
• TALLRec [2] is our main baseline, which learns the recommendation task based on prompts consisting solely of text and fine-tunes the LLM using LoRA. Their approach involves providing the user interaction history and one target item and determining whether the user will prefer this target item. This simpler task necessitates only a brief prompt for the LLM. In contrast, our recommendation task requires a more extensive prompt, which results in a smaller batch size (the same as that of A-LLMRec) for training TALLRec. We use the prompt shown in Figure 6.
• MLP-LLM is an additionally designed LLM-based recommendation model for analysis. Compared with A-LLMRec, this model directly connects the user and item embeddings from the frozen CF-RecSys to the LLM using only MLP layers, instead of the auto-encoders in A-LLMRec that involve various techniques to align the collaborative knowledge of the CF-RecSys with the LLM (a minimal sketch of such a projection follows this list). Note that we use the prompt shown in Figure 3.
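The following is a minimal sketch of the kind of MLP projection used by the MLP-LLM variant, mapping a frozen CF-RecSys embedding into the LLM's token-embedding dimension; the layer sizes, the hidden width, and the class name are assumptions for illustration.

import torch
import torch.nn as nn

class CFToLLMProjector(nn.Module):
    """Two-layer MLP mapping a frozen CF embedding (e.g., 64-d from SASRec)
    into the LLM token-embedding space (e.g., 4096-d for OPT-6.7B)."""
    def __init__(self, cf_dim: int = 64, llm_dim: int = 4096, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cf_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, cf_emb: torch.Tensor) -> torch.Tensor:
        return self.net(cf_emb)

# Example: project one frozen item embedding so it can be spliced into the prompt
projector = CFToLLMProjector()
item_token = projector(torch.randn(1, 64))   # (1, 4096), ready to concatenate with token embeddings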
B LANGUAGE GENERATION TASK
In Figure 5, we present additional favorite genre prediction results for the experiment described in Section 5.4.4. As mentioned in Section 5.4.4, TALLRec could not generate valid natural language outputs because its LLM is fine-tuned via an instruction-tuning process, which makes it able to respond only to the particular prompts used during instruction tuning. The additional results indicate that A-LLMRec can generate the favorite genres for the users based on its understanding of the aligned user representations and item embeddings, while LLM-Only fails to do so.

C REPRODUCIBILITY
For implementing the baselines, we followed the official code published by the authors, as detailed in Table 12. Refer to our source code and instructions for reproducing the results reported in the experiments.
