Zero-Shot Recommendation As Language Modeling
Damien Sileo et al.
KU Leuven, Belgium
[email protected]
arXiv:2112.04184v1 [cs.CL] 8 Dec 2021
1 Introduction
Recommender systems predict an affinity score between users and items. Current recommender systems are based on content-based filtering (CB), collaborative filtering (CF), or a combination of both. CF recommender systems rely on (USER, ITEM, INTERACTION) triplets; CB relies on (ITEM, FEATURES) pairs. Both system types require a costly structured data collection step. Meanwhile, web users express themselves about various items in an unstructured way: they share lists of their favorite items and ask for recommendations on web forums, as in (1)², which hints at a similarity between the enumerated movies.
(1) Films like Beyond the Black Rainbow, Lost River, Suspiria, and The Neon Demon.
The web also contains a lot of information about the items themselves, such as synopses or reviews for movies. Language models such as GPT-2 [14] are trained on large web corpora to generate plausible text. We hypothesize that they can make use of this unstructured knowledge to make recommendations by estimating the plausibility of items being grouped together in a prompt. An LM can estimate the probability of a word sequence, P(w_1, ..., w_n). Neural language models are trained over a large corpus of documents: the network parameters Θ are optimized to maximize the next-word prediction likelihood over k-length sequences sampled from the corpus. The loss writes as follows:

L_LM = \sum_i -\log P(w_i \mid w_{i-k} \dots w_{i-1}; \Theta)    (1)
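The shape of this objective can be made concrete with a toy count-based bigram model (a sketch under illustrative assumptions: a two-sentence corpus and add-one smoothing stand in for the neural network and Θ; GPT-2 learns the same loss by gradient descent):

```python
import math
from collections import Counter

# Toy corpus of item-mention sentences (hypothetical examples, not from WebText)
corpus = [
    "films like suspiria and the neon demon".split(),
    "films like lost river and suspiria".split(),
]
vocab = {w for sent in corpus for w in sent}

# Count-based stand-in for the parameters Theta: bigram and context counts
bigram = Counter((a, b) for s in corpus for a, b in zip(s, s[1:]))
context = Counter(a for s in corpus for a in s[:-1])

def p_next(prev, w):
    # P(w_i | w_{i-1}) with add-one smoothing (context length k = 1 here)
    return (bigram[(prev, w)] + 1) / (context[prev] + len(vocab))

def lm_loss(sentence):
    # L_LM = sum_i -log P(w_i | w_{i-1}; Theta), as in equation (1)
    return sum(-math.log(p_next(a, b)) for a, b in zip(sentence, sentence[1:]))

# A frequently co-mentioned sequence gets a lower loss (higher probability)
print(lm_loss("films like suspiria".split()))
```

Sequences resembling the training corpus receive a lower loss, which is exactly the signal the recommendation scheme below exploits.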
¹ https://colab.research.google.com/drive/...?usp=sharing
² https://www.reddit.com/r/MovieSuggestions/...lost river/
We build a prompt p_{u,i} from the names of movies liked by a user u, followed by the name of a candidate movie, where <m_i> is the name of movie m_i and <m_1>, ..., <m_n> are those of randomly ordered movies liked by u. We then directly use R̂_{u,i} = P_Θ(p_{u,i}) as a relevance score to sort items for user u.
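The scoring scheme can be sketched generically: rank candidate items by the (log-)probability an LM assigns to the full prompt. The "Films like" template and the helper names below are illustrative assumptions, not the paper's exact code; with GPT-2, `log_prob` would sum token log-probabilities:

```python
def rank_by_prompt_probability(liked_movies, candidates, log_prob):
    """Rank candidates by R_hat(u, i) = P_Theta(p_{u,i}).

    log_prob: callable returning log P_Theta(text) under some LM
    (a hypothetical interface for this sketch).
    """
    scored = []
    for movie in candidates:
        # p_{u,i}: liked movies followed by the candidate, comma-separated
        prompt = "Films like " + ", ".join(liked_movies + [movie])
        scored.append((log_prob(prompt), movie))
    # Higher prompt probability -> more relevant candidate
    return [m for _, m in sorted(scored, reverse=True)]

# Placeholder scorer for demonstration only; a real system plugs in an LM here
toy_log_prob = len
print(rank_by_prompt_probability(
    ["Suspiria", "The Neon Demon"], ["Lost River", "Jackass"], toy_log_prob))
```

The design is deliberately model-agnostic: any language model that exposes sequence likelihoods can be dropped in without training.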
Our contributions are as follows: (i) we propose a model for recommendation with standard LMs; (ii) we derive prompt structures from a corpus analysis and compare their impact on recommendation accuracy; (iii) we compare LM-based recommendation with next sentence prediction (NSP) [12] and a standard supervised matrix factorization method [9,15].
2 Related work
Zero-shot prediction with language models Neural language models have been used for zero-shot inference on many NLP tasks [14,2]. For example, Radford et al. [14] manually construct a prompt structure to translate text, e.g. Translate English to French: "cheese" =>, and use the language model completions to find the best translations. Petroni et al. [13] show that masked language models can act as a knowledge base when part of a triplet is used as input, e.g. Paris is in <mask>. Here, we apply LM-based prompts to recommendation.
Hybrid and zero-shot recommendation The cold-start problem [17], i.e. dealing with new users or items, is a long-standing problem in recommender systems, usually addressed by hybridizing CF-based and CB-based systems. Previous work [20,10,5,6] introduced models for zero-shot recommendation, but they use zero-shot prediction in a different sense than ours: they train on a set of (USER, ITEM, INTERACTION) triplets and perform zero-shot predictions on new users or items with known attributes. These methods still require (USER, ITEM, INTERACTION) or (ITEM, FEATURES) tuples for training. To our knowledge, the only attempt to perform recommendations without such data at all is from Penha et al. [12], who showed that BERT [3] next sentence prediction (NSP) can be used to predict the most plausible movie after a prompt. NSP is not available in all language models and requires specific pretraining. Their work is designed as a probing of BERT's knowledge about common items, and it lacks a comparison with a standard recommendation model, which we address here.
3 Experiments
3.1 Setup
Dataset We use the standard MovieLens 1M dataset [8], with 1M ratings from 0.5 to 5, 6040 users, and 3090 movies, in our experiments. We address the relevance prediction task³, so we consider a rating r as positive if r ≥ 4.0 and as negative if r ≤ 2.5, and we discard the other ratings. We select users with at least 21 positive ratings and 4 negative ratings, and thus obtain 2716 users. We randomly select 20% of them as test users⁴. For each user, 1 positive and 4 negative ratings are reserved for evaluation, and the goal is to give the highest relevance score to the positively rated item. We use 5 positive ratings per user unless mentioned otherwise. We remove the years from the movie titles and reorder the articles (a, the) in the titles provided in the dataset (e.g. Matrix, The (1999) → The Matrix).
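The title normalization step could be implemented as follows (a hypothetical helper, assuming the MovieLens "Title, The (Year)" convention; the extra article An is an assumption beyond the a/the pair mentioned above):

```python
import re

def clean_title(raw):
    # Drop the trailing year, e.g. "Matrix, The (1999)" -> "Matrix, The"
    title = re.sub(r"\s*\(\d{4}\)\s*$", "", raw)
    # Move a trailing article back to the front: "Matrix, The" -> "The Matrix"
    m = re.match(r"^(.*), (The|A|An)$", title)
    if m:
        title = f"{m.group(2)} {m.group(1)}"
    return title

print(clean_title("Matrix, The (1999)"))  # -> The Matrix
```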
Evaluation metric We use mean average precision at rank 1 (MAP@1) [18], i.e. the rate of correct first-ranked predictions averaged over test users, chosen for its interpretability.
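Since each test user has exactly one relevant item among the five candidates, MAP@1 reduces to top-1 accuracy; a minimal sketch:

```python
def map_at_1(ranked_lists, relevant_items):
    # Fraction of users whose top-ranked item is their single relevant item
    hits = sum(ranked[0] == rel for ranked, rel in zip(ranked_lists, relevant_items))
    return hits / len(ranked_lists)

# Two test users: the first ranking is correct, the second is not
print(map_at_1([["The Matrix", "Alexander"], ["Jackass", "Amelie"]],
               ["The Matrix", "Amelie"]))  # -> 0.5
```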
Pretrained language models In our experiments we use the GPT-2 [14] language models, which are publicly available in several sizes. GPT-2 is trained with LM pretraining (equation 1) on the WebText corpus [14], which contains 8 million pages covering various domains. Unless mentioned otherwise, we use the GPT-2 base model, with 117M parameters.
We analyze the Reddit comments from May 2015⁵ to find out how web users mention lists of movies in web text. This analysis provides prompt candidates for LM-based recommendation.

³ Item relevance could be mapped to ratings but we do not address rating prediction here.
⁴ Training users are only used for the matrix factorization baseline.
⁵ https://www.kaggle.com/reddit/reddit-comments-may-2015
[Fig. 2: MAP@1 of LM models with a varying number of movies per user sampled in the input prompt. Axes: MAP@1 (0.25–0.40) vs. #Movies per user (0–20).]
Figure 2 shows that increasing the number of ratings per user has diminishing returns and leads to increasing instability, so specifying n ≈ 5 movies seems to give the best results with the least user input. After 5 items, adding more items might make the prompt less natural, even though the LM seems to adapt when the number of items keeps increasing. It is also interesting to note that even with an empty prompt, accuracy is above chance level, because the LM captures some information about movie popularity.
We now use matrix factorization as a baseline, with the Bayesian Personalized Ranking (BPR) algorithm [15]. Users and items are mapped to d randomly initialized latent factors, and their dot product is used as a relevance score trained with a ranking loss. We use the Cornac [16] implementation with the default hyperparameters⁶: d = 10 and a learning rate of 0.001.
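A minimal sketch of one BPR update, assuming the paper's d = 10 and learning rate 0.001 (this is not the Cornac implementation; it only illustrates the dot-product score and the pairwise ranking loss -log σ(x̂_ui - x̂_uj)):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d, lr = 4, 6, 10, 0.001  # d and lr follow the paper's defaults

U = rng.normal(scale=0.1, size=(n_users, d))  # user latent factors
V = rng.normal(scale=0.1, size=(n_items, d))  # item latent factors

def score(u, i):
    # Relevance score is the dot product of user and item factors
    return U[u] @ V[i]

def bpr_step(u, i, j):
    # One SGD step on -log sigma(score(u,i) - score(u,j)),
    # pushing liked item i above unliked item j for user u
    x = score(u, i) - score(u, j)
    g = 1.0 / (1.0 + np.exp(x))  # = sigma(-x), gradient magnitude
    u_old = U[u].copy()
    U[u] += lr * g * (V[i] - V[j])
    V[i] += lr * g * u_old
    V[j] -= lr * g * u_old

for _ in range(5000):  # user 0 prefers item 1 over item 2
    bpr_step(0, 1, 2)
```

After these updates, item 1 scores above item 2 for user 0; a full implementation would also sample triplets from the data and add regularization.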
We also compare the GPT-2 LM to BERT next sentence prediction [12], which models affinity scores as R̂_{u,i} = BERT_NSP(p_u, <m_i>), where p_u is a prompt containing movies liked by u. BERT was pretrained with a contiguous-sentence prediction task [3], and Penha et al. [12] proposed to use it as a way to probe BERT for recommendation capabilities.
[Fig. 3: MAP@1 for BPR models trained with an increasing number of users (0–500), compared to the zero-shot language models (with 0 training users): BPR [15], BERT-base NSP [12], BERT-large NSP [12], GPT-2-base, and GPT-2-medium. BERT-base and BERT-large respectively have 110M and 340M parameters; GPT-2-base and GPT-2-medium have 117M and 345M parameters. Axes: MAP@1 (0.20–0.50) vs. #Users in training set.]
⁶ https://cornac.readthedocs.io/en/latest/models.html#bayesian-personalized-ranking-bpr; we experimented with other hyperparameter configurations but did not observe significant changes.
Up to this point, we have used the LM to score the likelihood of sequences. An LM can also be used directly for text generation, unlike BERT. We show here completions of prompts randomly sampled from our dataset, generated with greedy decoding.
Prompt (P1): Forrest Gump, Blade Runner, Modern Times, Amelie, Lord of the Rings
The Return of the King, Shaun of the Dead, Alexander, Pan’s Labyrinth, Cashback,
Avatar:
Completion (C1): 3, The Hunger Games: Mockingjay Part 2, King Arthur, A Feast for
Crows, The Hunger Games: Catching Fire, Jackass, Jackass 2, King Arthur
Prompt (P2): Independence Day, Winnie the Pooh and the Blustery Day, Raiders of
the Lost Ark, Star Wars Episode VI - Return of the Jedi, Quiet Man, Game, Labyrinth,
Return to Oz, Song of the South, Matrix:
Completion (C2): and many more. The list can be read by clicking on the relevant
section at the left of the image. To access the list of releases
Some prompts, e.g. (P1), generate valid movie names, but others, like (P2), do not. LM-based recommenders therefore need a post-processing step to match movie names in the sampled generations.
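Such post-processing could, for instance, fuzzy-match generated spans against the catalog of known titles. A sketch using Python's difflib, where the comma-based splitting and the 0.8 cutoff are arbitrary assumptions for illustration:

```python
import difflib

def match_titles(generation, catalog, cutoff=0.8):
    # Map each comma-separated span of the generated text to the closest
    # known catalog title; spans with no close enough match are dropped
    matched = []
    for span in generation.split(","):
        hit = difflib.get_close_matches(span.strip(), catalog, n=1, cutoff=cutoff)
        if hit:
            matched.append(hit[0])
    return matched

catalog = ["The Hunger Games: Catching Fire", "King Arthur", "The Matrix"]
print(match_titles("King Arthur, A Feast for Crows, The Hunger Games: Catching Fire",
                   catalog))
```

Here the book title "A Feast for Crows" is discarded because it matches no catalog entry, while the two movie names are recovered.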
4 Conclusion
We showed that standard language models can be used to perform item recommendations without any adaptation, and that they are competitive with supervised matrix factorization when the number of users is very low (fewer than 100). LMs can therefore be used to kickstart recommender systems if items are frequently discussed in the training corpora. Further research could explore ways to adapt LMs for recommendation or to combine them with matrix factorization into hybrid systems. Another use of our findings would be to generate movie recommendation datasets by mining web data, which could feed standard supervised recommendation techniques.
5 Acknowledgements
This work is part of the CALCULUS project, which is funded by the ERC Advanced Grant H2020-ERC-2017 ADG 788506⁷.
References
1. Barkan, O., Koenigstein, N.: Item2vec: Neural item embedding for collaborative filtering.
In: 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing
(MLSP). pp. 1–6 (2016). https://doi.org/10.1109/MLSP.2016.7738886
2. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan,
T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E.,
Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever,
I., Amodei, D.: Language models are few-shot learners (2020)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidi-
rectional transformers for language understanding. In: Proceedings of the 2019 Con-
ference of the North American Chapter of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–
4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019).
https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423
4. Devooght, R., Bersini, H.: Long and short-term recommendations with recurrent neu-
ral networks. p. 13–21. UMAP ’17, Association for Computing Machinery, New York,
NY, USA (2017). https://doi.org/10.1145/3079628.3079670, https://doi.org/10.
1145/3079628.3079670
5. Ding, H., Ma, Y., Deoras, A., Wang, Y., Wang, H.: Zero-shot recommender systems (2021)
6. Feng, P.J., Pan, P., Zhou, T., Chen, H., Luo, C.: Zero shot on the cold-start prob-
lem: Model-agnostic interest learning for recommender systems. In: Proceedings of
the 30th ACM International Conference on Information & Knowledge Management. p.
474–483. CIKM ’21, Association for Computing Machinery, New York, NY, USA (2021).
https://doi.org/10.1145/3459637.3482312, https://doi.org/10.1145/3459637.
3482312
7. Guàrdia-Sebaoun, E., Guigue, V., Gallinari, P.: Latent trajectory modeling: A light and ef-
ficient way to introduce time in recommender systems. In: Proceedings of the 9th ACM
Conference on Recommender Systems. pp. 281–284 (2015)
8. Harper, F.M., Konstan, J.A.: The movielens datasets: History and context. ACM Trans. Inter-
act. Intell. Syst. 5(4) (Dec 2015). https://doi.org/10.1145/2827872, https://doi.org/
10.1145/2827872
9. Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems.
Computer 42(8), 30–37 (2009)
10. Li, J., Jing, M., Lu, K., Zhu, L., Yang, Y., Huang, Z.: From zero-shot learning to cold-
start recommendation. Proceedings of the AAAI Conference on Artificial Intelligence
33(01), 4189–4196 (Jul 2019). https://doi.org/10.1609/aaai.v33i01.33014189, https://
ojs.aaai.org/index.php/AAAI/article/view/4324
11. Li, Z., Zhao, H., Liu, Q., Huang, Z., Mei, T., Chen, E.: Learning from history and present:
Next-item recommendation via discriminatively exploiting user behaviors. In: Proceedings
of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Min-
ing. pp. 1734–1743 (2018)
⁷ https://calculus-project.eu/
12. Penha, G., Hauff, C.: What does bert know about books, movies and music? probing bert
for conversational recommendation. In: Fourteenth ACM Conference on Recommender
Systems. p. 388–397. RecSys ’20, Association for Computing Machinery, New York,
NY, USA (2020). https://doi.org/10.1145/3383313.3412249, https://doi.org/10.
1145/3383313.3412249
13. Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., Miller, A.: Language
models as knowledge bases? In: Proceedings of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP). pp. 2463–2473. Association for Computa-
tional Linguistics, Hong Kong, China (Nov 2019). https://doi.org/10.18653/v1/D19-1250,
https://aclanthology.org/D19-1250
14. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language
models are unsupervised multitask learners (2019), https://openai.com/blog/
better-language-models/
15. Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: Bpr: Bayesian personalized
ranking from implicit feedback. p. 452–461. UAI ’09, AUAI Press, Arlington, Virginia, USA
(2009)
16. Salah, A., Truong, Q.T., Lauw, H.W.: Cornac: A comparative framework for multimodal
recommender systems. Journal of Machine Learning Research 21(95), 1–5 (2020)
17. Schein, A.I., Popescul, A., Ungar, L.H., Pennock, D.M.: Methods and metrics
for cold-start recommendations. In: Proceedings of the 25th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval.
p. 253–260. SIGIR ’02, Association for Computing Machinery, New York, NY,
USA (2002). https://doi.org/10.1145/564376.564421, https://doi.org/10.1145/
564376.564421
18. Schröder, G., Thiele, M., Lehner, W.: Setting goals and choosing metrics for recommender
system evaluations. In: UCERSTI2 workshop at the 5th ACM conference on recommender
systems, Chicago, USA. vol. 23, p. 53 (2011)
19. Sun, F., Liu, J., Wu, J., Pei, C., Lin, X., Ou, W., Jiang, P.: Bert4rec: Sequential rec-
ommendation with bidirectional encoder representations from transformer. In: Proceed-
ings of the 28th ACM International Conference on Information and Knowledge Man-
agement. p. 1441–1450. CIKM ’19, Association for Computing Machinery, New York,
NY, USA (2019). https://doi.org/10.1145/3357384.3357895, https://doi.org/10.
1145/3357384.3357895
20. Volkovs, M., Yu, G., Poutanen, T.: Dropoutnet: Addressing cold start in recommender sys-
tems. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S.,
Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran As-
sociates, Inc. (2017), https://proceedings.neurips.cc/paper/2017/file/
dbd22ba3bd0df8f385bdac3e9f8be207-Paper.pdf