Self-supervised Learning for Large-scale Item Recommendations
Tiansheng Yao, Xinyang Yi, Derek Zhiyuan Cheng, Felix Yu, Ting Chen, Aditya Menon
Lichan Hong, Ed H. Chi, Steve Tjoa, Jieqi (Jay) Kang, Evan Ettinger
Google Inc., United States
ABSTRACT
… embedding space through neural networks for both queries and items from user feedback data. However, with millions to billions of items in the corpus, users tend to provide feedback for a very small set of them, causing a power-law distribution. This makes the feedback data for long-tail items extremely sparse.

Inspired by the recent success of self-supervised representation learning research in both computer vision and natural language understanding, we propose a multi-task self-supervised learning (SSL) framework for large-scale item recommendations. The framework is designed to tackle the label sparsity problem by learning better latent relationships among item features. Specifically, SSL improves item representation learning and also serves as additional regularization to improve generalization. Furthermore, we propose a novel data augmentation method that utilizes feature correlations within the proposed framework.

We evaluate our framework using two real-world datasets with 500M and 1B training examples respectively. Our results demonstrate the effectiveness of SSL regularization and show its superior performance over state-of-the-art regularization techniques. We have also launched the proposed techniques in a web-scale commercial app-to-app recommendation system, with significant improvements in top-tier business metrics demonstrated in A/B experiments on live traffic. Our online results also verify our hypothesis that the framework indeed improves model performance even more on slices that lack supervision.
1 INTRODUCTION
Recently, neural-net models have emerged to the main stage of modern recommendation systems throughout both industry (see, e.g., [18, 31, 39, 42]) and academia ([8, 32]). Compared to conventional approaches like matrix factorization [1, 21, 22], gradient boosted decision trees [4, 29], and logistic regression based recommenders [19], these deep models handle categorical features more effectively. They also enable more complex data representations and introduce more non-linearity to better fit the complex data of recommenders.

A particular recommendation task we focus on in this paper is to identify the most relevant items given a query from a huge item catalog. This general problem of large-scale item recommendation arises in a wide variety of applications. Depending on the type of the query, a recommendation task could be: (i) personalized recommendation, when the query is a user; (ii) item-to-item recommendation, when the query is also an item; and (iii) search, when the query is a piece of free text. To model the interactions between a query and an item, a well-known approach leverages embedding-based neural networks. In the two-tower architecture, a neural network encodes a set of item features into an embedding, thus making it applicable even for indexing cold-start items. Moreover, the two-tower DNN architecture enables efficient serving for a large corpus of items in real time, by converting the top-k nearest neighbor search problem to Maximum-Inner-Product-Search (MIPS) [9], which is solvable in sublinear complexity.

Embedding-based deep models typically have a large number of parameters because they are built with high-dimensional embeddings that represent high-cardinality sparse features such as topics or item IDs. In much of the existing literature, the loss functions for training these models are formulated as supervised learning problems, where the supervision comes from collected labels (e.g., clicks). Modern recommendation systems collect billions to trillions of footprints from users, providing a huge amount of training data for building deep models. However, when it comes to modeling a huge catalogue of items on the order of millions (e.g., songs and apps [28]) to even billions (e.g., videos on YouTube [10]), data can still be highly sparse for certain slices due to:
• Highly-skewed data distribution: The interactions between queries and items are often highly skewed, following a power-law distribution [30], so a small set of popular items receives most of the interactions. This leaves the training data for long-tail items very sparse.
• Lack of explicit user feedback: Users often provide plenty of implicit positive feedback such as clicks and thumbs-up. However, they are much less likely to provide explicit feedback such as item ratings, feedback on user happiness, and relevance scores.

Self-supervised learning (SSL) offers a different angle for improving deep representation learning via unlabeled data. The basic idea is to enhance training data with various data augmentations, and to add supervised tasks that predict or reconstruct the original examples as auxiliary tasks. Self-supervised learning has been widely used in the areas of Computer Vision (CV) [15, 25, 33] and Natural Language Understanding (NLU) [12, 24]. An example work in CV [25] proposed to rotate images at random and train a model to predict how each augmented input image was rotated. In NLU, the masked language modeling task was introduced in the BERT model to help improve the pre-training of language models. Similarly, other pre-training tasks, such as predicting surrounding sentences and linked sentences in Wikipedia articles, have also been used to improve dual-encoder type models in NLU [3]. Compared to conventional supervised learning, self-supervised learning provides complementary objectives that eliminate the prerequisite of manually collecting labels. In addition, SSL enables autonomous discovery of good semantic representations by exploiting the internal relationships of input features.

Despite the wide adoption in computer vision and natural language understanding, …
• Dropout. For categorical features with multiple values, we drop out each value with a certain probability. This further reduces the input information and increases the hardness of the SSL task.

The masking step can be interpreted as a special case of dropout with a 100% dropout rate. One strategy is the complementary masking pattern, where we split the feature set into two mutually exclusive feature sets for the two augmented examples. Specifically, we could randomly split the feature set into two disjoint subsets. We call this method Random Feature Masking (RFM), and will use it as one of our baselines. We now introduce Correlated Feature Masking (CFM), where we further explore feature correlations when creating masking patterns.

Mutual Information of Categorical Features. If the set of masked features is chosen at random, (h, g) are essentially sampled from the 2^k different masking patterns over the whole feature set of k features. Different masking patterns naturally lead to different effects for the SSL task. For instance, the SSL contrastive learning task may exploit the shortcut of highly correlated features …

Heterogeneous Sample Distributions. The marginal item distribution from D_train typically follows a power-law. Therefore, using the training item distribution for L_self would cause the learned feature relationships to be biased towards head items. Instead, we sample items uniformly from the corpus for L_self; in other words, D_item is the uniform item distribution. In practice, we find that using heterogeneous distributions for the main and SSL tasks is critical for SSL to achieve superior performance.

Loss for Main Task. There could be many choices for the main loss depending on the objectives. In this paper, we consider the batch softmax loss used in both recommenders [39] and NLP [16] for optimizing top-k accuracy. In detail, let q_i and x_i be the embeddings of the i-th query and item after being encoded by the two neural networks. Then, for a batch of pairs {(q_i, x_i)}_{i=1}^N and temperature τ, the batch softmax cross-entropy loss is

\mathcal{L}_{\mathrm{main}} = -\frac{1}{N} \sum_{i \in [N]} \log \frac{\exp\left(s(\mathbf{q}_i, \mathbf{x}_i)/\tau\right)}{\sum_{j \in [N]} \exp\left(s(\mathbf{q}_i, \mathbf{x}_j)/\tau\right)}. \qquad (6)
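To make Equation (6) concrete, the following is a minimal sketch of how the batch softmax loss could be implemented in TensorFlow. It is our illustration rather than the authors' code; the names (batch_softmax_loss, query_emb, item_emb) are assumptions, and corrections for in-batch sampling bias that production systems typically apply are omitted.

```python
import tensorflow as tf

def batch_softmax_loss(query_emb, item_emb, temperature=0.07):
    """Batch softmax cross-entropy loss of Equation (6).

    query_emb, item_emb: [N, d] L2-normalized embeddings of the N positive
    (query, item) pairs in the batch. For query i, its paired item i is the
    positive and the other N - 1 in-batch items act as negatives.
    """
    # s(q_i, x_j) / tau for all pairs in the batch: an [N, N] logit matrix.
    logits = tf.matmul(query_emb, item_emb, transpose_b=True) / temperature
    # The positive item for query i is item i, so the label is the row index.
    labels = tf.range(tf.shape(logits)[0])
    losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    return tf.reduce_mean(losses)
```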
Figure 3: Model architecture: Two-tower model with SSL. In the SSL task, we apply feature masking and dropout on the item
features to learn item embeddings. The whole item tower (in red) is shared with the supervised task.
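As a concrete companion to Figure 3, the sketch below shows one possible way to wire up a shared item tower with feature masking and dropout for the two augmented examples, a contrastive SSL loss, and the combined multi-task objective (referred to in the text as L = L_main + α · L_self, Equation (5)). This is our own minimal illustration under assumptions: the feature names and vocabulary sizes mirror the AAI features described later, the layer sizes are arbitrary, and the random disjoint split shown corresponds to RFM; CFM would instead group highly correlated features into the same half.

```python
import random
import tensorflow as tf

ITEM_FEATURES = ["id", "developer_name", "categories", "title_unigram"]
VOCAB_SIZES = {"id": 100_000, "developer_name": 20_000,
               "categories": 500, "title_unigram": 50_000}  # illustrative sizes
EMBED_DIM = 64

class ItemTower(tf.keras.Model):
    """Item encoder shared by the supervised and SSL tasks (red tower in
    Figure 3): embed each categorical feature, average multi-valued features,
    concatenate, and apply an MLP."""

    def __init__(self):
        super().__init__()
        self.tables = {f: tf.keras.layers.Embedding(VOCAB_SIZES[f], EMBED_DIM)
                       for f in ITEM_FEATURES}
        self.mlp = tf.keras.Sequential([
            tf.keras.layers.Dense(1024, activation="relu"),
            tf.keras.layers.Dense(128),
        ])

    def call(self, features, keep_features=None, dropout_rate=0.0, training=False):
        # features[f]: [B, k_f] padded integer ids (padding handled naively here).
        parts = []
        for f in ITEM_FEATURES:
            emb = tf.reduce_mean(self.tables[f](features[f]), axis=1)   # [B, d]
            if keep_features is not None and f not in keep_features:
                emb = tf.zeros_like(emb)                                # feature masking
            if training and dropout_rate > 0.0:
                emb = tf.nn.dropout(emb, rate=dropout_rate)             # feature dropout
            parts.append(emb)
        return tf.nn.l2_normalize(self.mlp(tf.concat(parts, axis=-1)), axis=-1)

def contrastive_loss(z1, z2, temperature=0.1):
    """Batch softmax over two views: the two views of the same item are
    positives, all other in-batch items are negatives."""
    logits = tf.matmul(z1, z2, transpose_b=True) / temperature
    labels = tf.range(tf.shape(logits)[0])
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits))

def ssl_loss(item_tower, item_features, dropout_rate=0.3):
    """Complementary feature masking + dropout, then contrastive learning.
    Items fed here should be sampled uniformly from the corpus
    (heterogeneous distributions, as discussed above)."""
    feats = ITEM_FEATURES[:]
    random.shuffle(feats)                                   # one random split per batch (RFM)
    half = len(feats) // 2
    keep_h, keep_g = set(feats[:half]), set(feats[half:])   # complementary masks
    z1 = item_tower(item_features, keep_features=keep_h,
                    dropout_rate=dropout_rate, training=True)
    z2 = item_tower(item_features, keep_features=keep_g,
                    dropout_rate=dropout_rate, training=True)
    return contrastive_loss(z1, z2)

def total_loss(query_emb, item_emb, item_tower, uniform_item_features, alpha=1.0):
    """Multi-task objective: L = L_main + alpha * L_self (cf. Equation (5))."""
    l_main = contrastive_loss(query_emb, item_emb, temperature=0.07)  # Eq. (6)
    l_self = ssl_loss(item_tower, uniform_item_features)
    return l_main + alpha * l_self
```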
Other Baselines. As mentioned in Section 2, we use two-tower DNNs as the baseline model for the main task. The two-tower model has the unique property of encoding item features, compared to classic matrix factorization (MF) and classification models. While the latter two methods are also applicable to large-scale item retrieval, they only learn item embeddings based on IDs, and thus do not fit our proposal of using SSL to exploit item feature relations.

4 OFFLINE EXPERIMENTS
We provide empirical results to demonstrate the effectiveness of our proposed self-supervised framework both on an academic public dataset and on data from an actual large-scale recommendation product. The experiments are designed to answer the following research questions:
• RQ1: Does the proposed SSL framework improve deep models for recommendations?
• RQ2: SSL is designed to improve the primary supervised task through the introduced SSL task on unlabeled examples. What is the impact of the amount of training data on the improvement from SSL?
• RQ3: How do the SSL parameters, i.e., the loss multiplier α and the dropout rate in data augmentation, affect model quality?
• RQ4: How does RFM perform compared to CFM? What is the benefit of leveraging feature correlations in data augmentation?
The above questions are addressed in order in Sections 4.3 - 4.5.

4.1 Datasets
We conduct experiments on two large-scale datasets that both come with a rich set of item metadata features. We formulate their primary supervised task as an item-to-item recommendation problem to study the effects of SSL on training recommender (in this case, retrieval) models. See Appendix .1 for the statistics of these two datasets.

Wikipedia [14]: The first dataset focuses on the problem of link prediction between Wikipedia pages. It consists of pairs of pages (x, y) ∈ X × X, where x is a source page and y is a destination page linked from x. The goal is to predict the set of pages that are likely to be linked to a given source page from the whole corpus of web pages. Each page is represented by a feature vector x = (x_id, x_ngrams, x_cats), where all the features are categorical. Here, x_id denotes the one-hot encoding of the page URL, x_ngrams denotes a bag-of-words representation of the set of n-grams of the page's title, and x_cats denotes a bag-of-words representation of the categories that the page belongs to. We partitioned the dataset into training and evaluation sets using a (90%, 10%) split, following the same treatment as [23] and [39].

App-to-App Install (AAI): The AAI dataset was collected from the app landing pages of a commercial mobile app store. On a particular app's (the seed app's) landing page, the app installs (candidate apps) from the section of recommended apps were collected. Each training example represents a seed-candidate pair, denoted as (x_seed, x_candidate), together with their metadata features. The goal is to recommend highly similar apps given a seed app. This is also formulated as an item-to-item recommendation problem via a multi-class classification loss. Note that we only collect positive examples, i.e., x_candidate is an installed app from the landing page of x_seed. Impressed recommended apps with no installs are ignored, since we consider them to be weak positives rather than negatives for building retrieval models. Each item (app) is represented by a feature vector x with the following features:
• id: Application id as a one-hot categorical feature.
• developer_name: Name of the app developer as a one-hot categorical feature.
• categories: Semantic categories of the app as a multi-hot categorical feature.
• title_unigram: Uni-grams of the app title as a multi-hot categorical feature.
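For concreteness, a single AAI training example could look like the following. Every value here is invented purely for illustration; only the feature layout matches the description above.

```python
# One hypothetical (seed, candidate) training example from AAI.
seed_app = {
    "id": ["com.example.podcasts"],               # one-hot categorical
    "developer_name": ["Example Studios"],        # one-hot categorical
    "categories": ["Audio", "News & Magazines"],  # multi-hot categorical
    "title_unigram": ["example", "podcasts"],     # multi-hot categorical
}
candidate_app = {
    "id": ["com.example.music"],
    "developer_name": ["Example Studios"],
    "categories": ["Audio", "Music"],
    "title_unigram": ["example", "music"],
}
# Positive pair: the candidate app was installed from the seed app's landing page.
training_example = (seed_app, candidate_app)
```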
4.2 Experiment Setup
Backbone Network. For the main task that predicts relevant items given the query, we use the two-tower DNN to encode query and item features (see Figure 1) as the backbone network. The item-to-item recommendation problem is formalized as a multi-class classification problem, using the batch softmax loss presented in Equation (6) as the loss function. For a discussion of the choice of backbone network, we refer readers to the related parts of Section 2 and Section 3.3.

Hyper-parameters. For the backbone two-tower DNN, we search over hyper-parameters such as the learning rate, the softmax temperature (τ), and the model architecture, choosing the configuration that gives the highest Recall@50 on the validation set. Note that the training batch size in batch softmax is critical for model quality, as it determines the number of negatives used for each positive item. Throughout this section, we use batch sizes of 1024 and 4096 for Wikipedia and AAI respectively. We also tuned the number of hidden layers, the hidden layer sizes, and the softmax temperature τ for the baseline models.
For the Wikipedia dataset, we use softmax temperature τ = 0.07 and hidden layers with sizes [1024, 128]. For AAI, we use τ = 0.06 and hidden layers [1024, 256]. Note that the dimension of the last hidden layer is also the dimension of the final query and item embeddings. All models are trained with the Adagrad [13] optimizer with a learning rate of 0.01.

We consider two SSL parameters: 1) the SSL loss multiplier α in Equation (5), and 2) the feature dropout rate, denoted as d_r, in the second phase of data augmentation (see Section 3.2). For each augmentation method (e.g., CFM, RFM), we conduct a grid search over the two parameters with ranges α = [0.1, 0.3, 1.0, 3.0] and d_r = [0.1, 0.2, ..., 0.9], and report the best result.

Evaluation. To evaluate the recommendation performance given a seed item, we retrieve the top-K items with the highest cosine similarity from the whole corpus and evaluate the quality based on the K retrieved items. Note that this is a relatively challenging task, given the sparsity of the dataset and the large number of items in the corpus. We adopt the popular standard metrics Recall@K and mean average precision (MAP@K) to evaluate recommendation performance [18]. For each experiment configuration, we ran the experiment 5 times and report the average.
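As a reference for how these metrics can be computed, here is a small sketch. Exact conventions for MAP@K vary across papers, so this should be read as one common definition rather than the precise evaluation code used by the authors.

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k ranked list."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def average_precision_at_k(ranked, relevant, k):
    """AP@k: precision at each relevant hit, averaged (one common convention)."""
    if not relevant:
        return 0.0
    relevant_set, hits, score = set(relevant), 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant_set:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k)

def evaluate(seed_embs, item_embs, ground_truth, k=50):
    """Mean Recall@k and MAP@k over all seed items.

    seed_embs: [Q, d] and item_embs: [M, d], both L2-normalized so that the
    dot product equals cosine similarity; ground_truth[q] lists the relevant
    item indices for seed q.
    """
    scores = seed_embs @ item_embs.T                 # [Q, M] cosine similarities
    top_k = np.argsort(-scores, axis=1)[:, :k]       # indices of the k best items
    recalls, aps = [], []
    for q, relevant in enumerate(ground_truth):
        ranked = top_k[q].tolist()
        recalls.append(recall_at_k(ranked, relevant, k))
        aps.append(average_precision_at_k(ranked, relevant, k))
    return float(np.mean(recalls)), float(np.mean(aps))
```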
4.3 Effectiveness of SSL with Correlated Feature Masking
To answer RQ1, we first evaluate the impact of SSL on model quality. We focus on using CFM followed by dropout as the data augmentation technique; we will show the superior performance of CFM over other variants in Section 4.5. We consider three baseline methods:
• Baseline: Vanilla backbone network with the two-tower DNN architecture.
• Feature Dropout (FD) [35]: Backbone model with random feature dropout on the item tower in the supervised learning task. The feature dropout on item features can be treated as data augmentation. Unlike our approach, FD does not have the additional SSL regularization.
• Spread-out Regularization (SO) [41]: Backbone model with spread-out regularization on the item tower. The SO regularization shares a similar contrastive loss with our SSL framework. However, it applies contrastive learning on the original examples without any data augmentation, and is thus different from our approach.
The latter two methods are chosen because they are (1) model-agnostic and scalable for industrial-size recommendation systems, and (2) compatible with categorical sparse features for improving generalization. In addition, FD can be viewed as an ablation study to isolate the potential improvement from contrastive learning. Similarly, SO is included to isolate the improvement from feature augmentation.

We observe that with the full datasets (see Table 1), CFM consistently performs the best compared with the non-SSL regularization techniques. On AAI, CFM out-performs the next best method by 8.69% relatively, and on Wikipedia by 3.98%. This helps answer RQ1: the proposed SSL framework and tasks indeed improve model performance for recommenders. Comparing CFM with SO shows that data augmentation is critical for the SSL regularization to achieve better performance; without any data augmentation, the proposed SSL method reduces to SO. By comparing CFM and FD, we find that feature augmentation is more effective when applied to the SSL task than when applied to the supervised task as a standard regularization technique. Note that FD, a well-known approach for improving generalization in some cases, applies feature augmentation together with supervised training.

Table 1: Results on the full Wikipedia and AAI datasets.

Wikipedia
Method       MAP@10   MAP@50   Recall@10   Recall@50
Baseline     0.0171   0.0229   0.0537      0.1930
FD [35]      0.0172   0.0229   0.0535      0.1912
SO [41]      0.0176   0.0235   0.0549      0.1956
Our method   0.0183   0.0243   0.057       0.2009

AAI
Method       MAP@10   MAP@50   Recall@10   Recall@50
Baseline     0.1257   0.1363   0.2793      0.4983
FD [35]      0.1278   0.1384   0.2840      0.5058
SO [41]      0.1300   0.1406   0.2870      0.5076
Our method   0.1413   0.1522   0.3078      0.5355

Head-tail Analysis. To understand the gain from SSL, we further break down the overall performance by looking at item slices of different popularity. The split of the head and tail test sets is described in Appendix .2. Our hypothesis is that SSL generally helps improve the performance on slices without much supervision (e.g., tail items). The results evaluated on the tail and head test sets are reported in Table 3. We observe that the proposed SSL methods improve the performance for both head and tail item recommendations, with larger gains on the tail items. For instance, on AAI, CFM improves Recall@10 on tail items by 51.5%, while the improvement on head items is 8.57%.

Effects of SSL Parameters (RQ3). Figure 5 summarizes the Recall@50 evaluated on the Wikipedia and AAI datasets w.r.t. the regularization strength α. It also shows the results of SO, which shares the same regularization parameter. We observe that with increasing α, the model performance drops below the baseline model (shown as a dashed line) after a certain threshold. This is expected, since a large SSL weight α leads to the multi-task loss L being dominated by α · L_self in Equation (5). By further comparing our approach with SO, we show that the SSL-based regularization outperforms SO over a wide range of α. Figure 6 shows the model performance across different dropout rates d_r. It also shows DO, which shares the same parameter. As d_r increases, the model performance of DO continues to deteriorate. For most choices of d_r (except d_r = 0.1), DO is worse than the baseline. For the SSL task with feature dropout, the model performance peaks at d_r = 0.3 and then deteriorates as we further increase the dropout rate. The model starts to under-perform the baseline when d_r is too large. This observation aligns with our expectation: when the rate is too large, too little input information remains to learn meaningful representations through SSL.

Visualization of Item Representations. We visualize the learned app embeddings from the AAI dataset in a t-SNE plot; we postpone the detailed setup to Appendix .3. As shown in Figure 4, we clearly see that the app embeddings learned with our SSL regularization are better clustered according to their categories than their counterparts from the baseline, which demonstrates that the representations learned through SSL have stronger semantic structure. This partially explains the gain from SSL.
Table 2: Experiment results trained on the sparse (10% down-sampled) Wikipedia and AAI datasets.

10% Wikipedia Dataset
Method       MAP@10   MAP@50   Recall@10   Recall@50
Baseline     0.0077   0.0105   0.0237      0.0924
FD [35]      0.0089   0.0120   0.0272      0.1046
SO [41]      0.0083   0.0112   0.0254      0.0978
Our method   0.0093   0.0126   0.0286      0.1093

10% AAI Dataset
Method       MAP@10   MAP@50   Recall@10   Recall@50
Baseline     0.1112   0.1194   0.2406      0.4068
FD [35]      0.1217   0.1302   0.2611      0.4324
SO [41]      0.1220   0.1308   0.2632      0.4400
Our method   0.1409   0.1507   0.3024      0.5014

Table 3: Results of Wikipedia and AAI on tail and head item slices.

Wikipedia
             Tail                    Head
Method       Recall@10   Recall@50   Recall@10   Recall@50
Baseline     0.0472      0.1621      0.0610      0.2273
FD           0.0474      0.1638      0.0593      0.2212
SO           0.0481      0.1644      0.0606      0.2268
Our method   0.0524      0.1749      0.0619      0.2283

AAI
             Tail                    Head
Method       Recall@10   Recall@50   Recall@10   Recall@50
Baseline     0.0475      0.2333      0.2846      0.4993
FD           0.0727      0.2743      0.2849      0.5069
SO           0.0661      0.2602      0.2879      0.5086
Our method   0.0720      0.2906      0.309       0.537

Figure 4: Comparison of t-SNE plots of app embeddings for (a) the baseline model and (b) our best SSL model.
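The head and tail slices used in Table 3 are based on item popularity (the exact split is described in Appendix .2, which is not part of this excerpt). A simple popularity-based slicing, together with the item-frequency CDF discussed in Appendix .1, could be computed along these lines; the head_fraction threshold below is a placeholder, not the paper's value.

```python
from collections import Counter
import numpy as np

def popularity_slices(training_items, head_fraction=0.01):
    """Split the item vocabulary into head/tail slices by training frequency.

    training_items: iterable of item ids as they appear in training examples.
    head_fraction: fraction of the most frequent items treated as "head"
    (a placeholder; the paper's actual split is defined in Appendix .2).
    """
    counts = Counter(training_items)
    ranked = [item for item, _ in counts.most_common()]
    n_head = max(1, int(len(ranked) * head_fraction))
    head, tail = set(ranked[:n_head]), set(ranked[n_head:])

    # CDF of the K most frequent items (cf. Appendix .1): for a TopPopular
    # recommender, CDF(K) also approximates its Recall@K.
    freqs = np.array([counts[item] for item in ranked], dtype=float)
    cdf = np.cumsum(freqs) / freqs.sum()
    return head, tail, cdf
```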
A APPENDIX
.1 Dataset Statistics. Table 5 shows some basic statistics for the Wikipedia and AAI datasets. Figure 9 shows the CDF of the most frequent items for the two datasets, indicating a highly skewed data distribution. For example, the top 50 items in the AAI dataset collectively account for roughly 10% of the training data. If we consider a naive baseline (i.e., the TopPopular recommender [11]) that recommends the most frequent top-K items for every query, the CDF of the K-th most frequent item essentially represents the Recall@K metric of such a baseline.

Table 5: Corpus sizes of the Wikipedia and the AAI datasets.
Dataset     # queries   # items   # examples
Wikipedia   5.3M        5.3M      490M
AAI         2.4M        2.4M      1B

.3 Visualization of Learned Embeddings. Besides better model performance, we expect the representations learned with SSL to have better quality than their counterparts learned without SSL. To verify this hypothesis, we take the app embeddings learned by the models trained on the AAI dataset and plot them with t-SNE in Figure 4. Apps from different categories are plotted in different colors, as illustrated in the legend of Figure 4. Compared to the apps in the baseline model (Figure 4a), the apps in the best SSL model (Figure 4b) tend to group much better with similar apps from the same category, and the separation between different categories is much clearer. For example, the “Sports & Recreation” apps (in red) are mixed together with “Law & Government” and “Travel” apps in Figure 4a, while in Figure 4b we clearly see the 4 categories of apps grouped among themselves. This indicates that the representations learned with SSL carry more semantic information, which is also why SSL leads to better model performance in our experiments.
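The kind of plot described above can be produced with off-the-shelf tooling. The sketch below is our illustration of the general recipe (scikit-learn's t-SNE plus matplotlib), not the authors' plotting code; the perplexity and sample size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(app_embeddings, app_categories, sample_size=5000, seed=0):
    """Project learned app embeddings to 2-D with t-SNE and color by category.

    app_embeddings: [M, d] array of item-tower embeddings.
    app_categories: length-M list of category labels used for coloring.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(app_embeddings),
                     size=min(sample_size, len(app_embeddings)), replace=False)
    points = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=seed).fit_transform(app_embeddings[idx])

    labels = np.asarray(app_categories)[idx]
    for category in np.unique(labels):
        mask = labels == category
        plt.scatter(points[mask, 0], points[mask, 1], s=3, label=category)
    plt.legend(markerscale=3, fontsize="small")
    plt.title("t-SNE of app embeddings, colored by category")
    plt.show()
```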