A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends
Abstract—Deep supervised learning algorithms typically require a large volume of labeled data to achieve satisfactory performance.
However, the process of collecting and labeling such data can be expensive and time-consuming. Self-supervised learning (SSL),
a subset of unsupervised learning, aims to learn discriminative features from unlabeled data without relying on human-annotated
labels. SSL has garnered significant attention recently, leading to the development of numerous related algorithms. However, there is a
dearth of comprehensive studies that elucidate the connections and evolution of different SSL variants. This paper presents a review of
diverse SSL methods, encompassing algorithmic aspects, application domains, three key trends, and open research questions. Firstly,
we provide a detailed introduction to the motivations behind most SSL algorithms and compare their commonalities and differences.
Secondly, we explore representative applications of SSL in domains such as image processing, computer vision, and natural language
processing. Lastly, we discuss the three primary trends observed in SSL research and highlight the open questions that remain. A
curated collection of valuable resources can be accessed at https://github.com/guijiejie/SSL.
Index Terms—Self-supervised learning, Contrastive learning, Generative model, Representation learning, Transfer learning
1 INTRODUCTION

SSL has the ability to leverage extensive unlabeled data, since the generation of pseudo-labels does not necessitate human annotations. By utilizing these pseudo-labels during training, self-supervised algorithms have demonstrated promising outcomes, resulting in a reduced performance disparity compared to supervised algorithms in downstream tasks. Asano et al. [14] demonstrated that SSL can produce generalizable features that exhibit robust generalization even when applied to a single image.

The advancement of SSL [3], [4], [15]–[24] has exhibited rapid progress, capturing significant attention within the research community (Fig. 2), and is recognized as a crucial element for achieving human-level intelligence [25].

2 ALGORITHMS

This section begins by providing an introduction to SSL, followed by an explanation of the pretext tasks associated with SSL and their integration with other learning paradigms.

Fig. 2: Google Scholar search results for "self-supervised learning". The vertical and horizontal axes denote the number of SSL publications and the year, respectively.

2.1 What is SSL?

The introduction of SSL is attributed to [32] (Fig. 3), who employed this architecture to learn in natural environments featuring diverse modalities. Although the cow image may not warrant a cow label, it is frequently associated with a "moo" sound. The crux lies in the co-occurrence relationship between them.

Subsequently, the machine learning community has advanced the concept of SSL, which falls within the realm of unsupervised learning. SSL involves generating output labels "intrinsically" from input data examples by revealing the relationships between data components or various views of the data. These output labels are derived directly from the data examples. According to this definition, an autoencoder (AE) can be perceived as a type of SSL algorithm, where the output labels correspond to the data itself. AEs have gained extensive usage across multiple domains, including dimensionality reduction and anomaly detection.
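To make the AE-as-SSL view concrete, the following is a minimal PyTorch sketch (illustrative only, not drawn from any cited work) in which the reconstruction target is simply the input itself; the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Minimal autoencoder: the 'label' for each example is the example itself."""
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_loss(model, x):
    # Self-supervised signal: reconstruct x from its own encoding.
    return nn.functional.mse_loss(model(x), x)
```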
Fig. 3: The differences among supervised learning, unsupervised learning, and SSL. The image is reproduced from [32]. SSL utilizes freely derived labels as supervision instead of manually annotated labels.

In the keynote talk at ICLR 2020 [33], Yann LeCun elucidated the concept of SSL as an analogous process to completing missing information (reconstruction). He presented multiple variations as follows: 1) Predict any part of the input from any other part; 2) Predict the future from the past; 3) Predict the invisible from the visible; and 4) Predict any occluded, masked, or corrupted part from all available parts. In summary, a portion of the input is unknown in SSL, and the objective is to predict that particular segment.

Jing et al. [34] expanded the definition of SSL to encompass methods that operate without human-annotated labels. Consequently, any approach devoid of such labels can be categorized under SSL, effectively equating SSL with unsupervised learning. This categorization includes generative adversarial networks (GANs) [35], thereby positioning them within the realm of SSL.

Pretext tasks, also referred to as surrogate or proxy tasks, are a fundamental concept in the field of SSL. The term "pretext" denotes that the task being solved is not the primary objective but serves as a means to generate a robust pre-trained model. Prominent examples of pretext tasks include rotation prediction and instance discrimination, among others. Each pretext task necessitates the use of distinct loss functions to achieve its intended goal. Given the significance of pretext tasks in SSL, we proceed to introduce them in further detail.

2.2 Pretext tasks

This section provides a comprehensive overview of the pretext tasks employed in SSL. A prevalent approach in SSL involves devising pretext tasks for networks to solve, where the networks are trained by optimizing the objective functions associated with these tasks. Pretext tasks typically exhibit two key characteristics. Firstly, deep learning methods are employed to learn features that facilitate the resolution of pretext tasks. Secondly, supervised signals are derived from the data itself, a process known as self-supervision. Commonly employed techniques encompass four categories of pretext tasks: context-based methods, CL, generative algorithms, and contrastive generative methods. In our paper, generative algorithms primarily refer to masked image modeling (MIM) methods.

Fig. 4: Illustration of three common context-based methods: rotation, jigsaw, and colorization.

2.2.1 Context-based methods

Context-based methods rely on the inherent contextual relationships among the provided examples, encompassing aspects such as spatial structures and the preservation of both local and global consistency. We illustrate the concept of context-based pretext tasks using rotation as a simple example [36]. Subsequently, we progressively introduce additional tasks (Fig. 4).

Rotation: Gidaris et al. [7] trained deep neural networks (DNNs) to learn image representations by recognizing random geometric transformations. They streamlined image augmentation by introducing rotations of 0°, 90°, 180°, and 270° to generate three additional images from each original. This method employs rotation angles as self-supervised labels, using a set of K = 4 geometric transformations G = {g(·|y)}_{y=1}^{K}. Here, g(·|y) applies a geometric transformation labeled y to an image X, resulting in a transformed image X^y = g(X|y).

Gidaris et al. utilized a deep convolutional neural network (CNN), F(·), to perform rotation prediction through a four-class categorization task. This CNN processes an input image X^{y*}, with y* being unknown to F(·), and outputs a probability distribution over possible geometric transformations, expressed as

    F(X^{y*} | θ) = {F^y(X^{y*} | θ)}_{y=1}^{K}.  (1)

Here, F^y(X^{y*} | θ) represents the predicted probability for the geometric transformation labeled as y, while θ denotes the learnable parameters of F(·).

Given training instances D = {X_i}_{i=1}^{N}, the training objective can be formulated as

    min_θ (1/N) Σ_{i=1}^{N} L(X_i, θ).  (2)

Here, the loss function is defined as

    L(X_i, θ) = −(1/K) Σ_{y=1}^{K} log(F^y(g(X_i|y) | θ)).  (3)

In [37], the relative rotation angle was confined to the interval of [−30°, 30°]. These rotations were discretized into bins of 3° each, leading to a total of 20 classes (or bins).
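As an illustration of the rotation pretext task of Eqs. (1)-(3), the following is a minimal PyTorch sketch (not the code of [7]); `model` is assumed to be any backbone ending in a four-way classification head.

```python
import torch
import torch.nn.functional as F

def rotation_pretext_loss(model, images):
    """Rotation prediction loss of Eqs. (1)-(3): every image is rotated by
    0, 90, 180, and 270 degrees, and the network must predict which of the
    four rotations was applied."""
    b = images.size(0)
    # g(X|y): rotate by y quarter turns in the spatial plane (dims 2 and 3).
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)], dim=0)
    # Self-supervised labels: y = 0..3, one block of b images per rotation.
    labels = torch.arange(4, device=images.device).repeat_interleave(b)
    logits = model(rotated)  # F(X^y | theta): a 4-way classification head
    return F.cross_entropy(logits, labels)
```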
Colorization: The concept of colorization was initially introduced in [38], and subsequent studies [39]–[41] demonstrated its effectiveness as a pretext task for SSL. Color prediction offers the advantageous feature of requiring freely available training data. In this context, a model can utilize the lightness channel of any color image as input and the corresponding ab color channels in the CIE Lab color space as self-supervised signals. The objective is to predict the ab color channels Y ∈ R^{H×W×2} given an input lightness channel X ∈ R^{H×W×1}. A commonly employed learning objective is

    L = ||Ŷ − Y||_F^2,  (4)

where Y and Ŷ denote the ground truth and predicted values, respectively.

Besides, [38] utilized the multinomial cross-entropy loss instead of (4) to enhance robustness. Upon completing the training process, the ab color channels can be predicted for any grayscale image. Consequently, the lightness channel and the ab color channels can be concatenated to restore the grayscale image to a colorful representation.

Jigsaw: The jigsaw approach leverages jigsaw puzzles as surrogate tasks, operating under the assumption that a model accomplishes these tasks by comprehending the contextual information embedded within the examples. Specifically, images are fragmented into discrete patches, and their positions are randomly rearranged, with the objective of reconstructing the original order. In [42], the impact of scaling two self-supervised methods, namely jigsaw [8], [43] and colorization, was investigated along three dimensions: data size, model capacity, and problem complexity. The results indicated that transfer performance exhibits a log-linear growth pattern in relation to data size. Furthermore, representation quality was found to improve with higher-capacity models and increased problem complexity.

Others: The pretext task employed in [44], [45] involved a conditional motion propagation problem. To enforce a specific constraint on the feature representation process, Noroozi et al. [46] introduced an additional requirement where the sum of feature representations of all image patches should approximate the feature representation of the entire image. While many pretext tasks yield representations that exhibit covariance with image transformations, [47] argued for the importance of semantic representations being invariant to such transformations. In response, they proposed a pretext-invariant representation learning approach that enables the learning of invariant representations through pretext tasks.

2.2.2 Contrastive Learning

Numerous SSL methods based on CL have emerged, building upon the foundation of simple instance discrimination tasks [48], [49]. Notable examples include MoCo v1 [50], MoCo v2 [51], SimCLR v1 [52], and SimCLR v2 [53]. Pioneering algorithms such as MoCo have significantly enhanced the performance of self-supervised pre-training, reaching a level comparable to that of supervised learning, thus rendering SSL highly pertinent for large-scale applications. Early CL approaches were built upon the concept of utilizing negative examples. However, as CL has progressed, a range of methods have emerged that eliminate the need for negative examples. These methods embrace distinct ideas such as self-distillation and feature decorrelation, yet all adhere to the principle of maintaining positive example consistency. The following section outlines the various CL methods currently available (Fig. 5).

2.2.2.1 Negative example-based CL: Negative example-based CL adheres to a pretext task known as instance discrimination, which involves generating distinct views of an instance. In negative example-based CL, views originating from the same instance are treated as positive examples for an anchor sample, while views from different instances serve as negative examples. The underlying principle is to promote proximity between positive examples and maximize the separation between negative examples within the latent space. The definition of positive and negative examples varies depending on factors such as the modality being considered and specific requirements, including spatial and temporal consistency in video understanding or the co-occurrence of modalities in multi-modal learning scenarios. In the context of conventional 2D image CL, image augmentation techniques are utilized to generate diverse views from a single image.

MoCo: He et al. [50] framed CL as a dictionary look-up task. In this framework, a query q and a set of encoded examples {k_0, k_1, k_2, ···} serve as the keys in a dictionary. Assuming that a single key in the dictionary, denoted as k_+, matches the query q, a contrastive loss [57] function is employed. The value of this function is low when q is similar to its positive key k_+ and dissimilar to all other negative keys. In the MoCo v1 [50] framework, the InfoNCE loss function [58], a form of contrastive loss, is utilized, i.e.,

    L_q = −log [ exp(q·k_+/τ) / Σ_{i=0}^{K} exp(q·k_i/τ) ],  (5)

where τ represents the temperature hyper-parameter and (·) denotes the vector product. The summation is computed over one positive example and K negative examples. InfoNCE is derived from noise contrastive estimation (NCE) [59].

MoCo v2 [51] builds upon MoCo v1 [50] and SimCLR v1 [52], incorporating a multilayer perceptron (MLP) projection head and more data augmentations.

SimCLR: SimCLR v1 [52] employs a mini-batch sampling strategy with N instances, wherein a contrastive prediction task is formulated on pairs of augmented instances from the mini-batch, generating a total of 2N instances. Notably, SimCLR v1 does not explicitly select negative instances. Instead, for a given positive pair, the remaining 2(N − 1) augmented instances in the mini-batch are treated as negatives. Let sim(u, v) = u^T v / (||u|| ||v||) represent the cosine similarity between two instances u and v. The loss function of SimCLR v1 for a positive instance pair (i, j) is defined as

    L_{i,j} = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1_{[k≠i]} exp(sim(z_i, z_k)/τ) ],  (6)

where 1_{[k≠i]} ∈ {0, 1} is an indicator function equal to 1 if k ≠ i, and τ denotes the temperature hyper-parameter. The overall loss is computed across all positive pairs, including both (i, j) and (j, i), within the mini-batch.

In MoCo, the features generated by the momentum encoder are stored in a feature queue as negative examples. These negative examples do not undergo gradient updates during backpropagation. Conversely, SimCLR utilizes negative examples from the current mini-batch, and all of them are subjected to gradient updates during backpropagation.
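For concreteness, the following is a minimal PyTorch sketch of the two losses above (Eqs. (5) and (6)); it is illustrative only and omits the momentum encoder, queue update, and projection heads used by the actual MoCo and SimCLR implementations. Embeddings are assumed to be L2-normalized for the InfoNCE variant.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_negs, tau=0.2):
    """InfoNCE of Eq. (5). q and k_pos are (B, d) normalized embeddings of two
    views of the same instances; k_negs is a (K, d) bank of negative keys."""
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)           # (B, 1)
    l_neg = q @ k_negs.t()                                  # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive is index 0
    return F.cross_entropy(logits, labels)

def nt_xent_loss(z1, z2, tau=0.5):
    """SimCLR's loss of Eq. (6), averaged over all 2N anchors in the batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2N, d)
    sim = z @ z.t() / tau                                   # cosine similarities
    sim.fill_diagonal_(float('-inf'))                       # exclude the k = i term
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```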
Fig. 5: Illustration of different CL methods: CL based on negative examples (left), CL based on self-distillation (middle), and CL based on feature decorrelation (right). For a demonstration of the concepts of similarity and dissimilarity, one can refer to [52], [54], while for insights into decorrelation, [55], [56] provide a comprehensive overview.

Both MoCo and SimCLR rely on data augmentation techniques, including cropping, resizing, and color distortion. Notably, SimCLR made a significant contribution by highlighting the crucial role of robust data augmentation in CL, a finding subsequently confirmed by MoCo v2. Additional augmentation methods have also been explored [60]. For instance, in [61], foreground saliency levels were estimated in images, and augmentations were created by selectively copying and pasting image foregrounds onto diverse backgrounds, such as grayscale images with random grayscale levels, texture images, and ImageNet images. Furthermore, views can be derived from various sources, including different modalities such as photos and sounds [62], as well as coherence among different image channels [63].

Minimizing the contrastive loss is known to effectively maximize a lower bound of the mutual information I(x_1; x_2) between the variables x_1 and x_2 [58]. Building upon this understanding, [64] proposes principles for designing diverse views based on information theory. These principles suggest that the views should aim to maximize I(v_1; y) and I(v_2; y) (v_1, v_2, and y denoting the first view, the second view, and the label, respectively), representing the amount of information contained about the task label, while simultaneously minimizing I(v_1; v_2), indicating the shared information between inputs encompassing both task-relevant and irrelevant details. Consequently, the optimal data augmentation method is contingent on the specific downstream task. In the context of dense prediction tasks, [65] introduces a novel approach for generating different views. This study reveals that commonly employed data augmentation methods, as utilized in SimCLR, are more suitable for categorization tasks than for dense prediction tasks such as object detection and semantic segmentation. Consequently, the design of data augmentation methods tailored to specific downstream tasks has emerged as a significant area of exploration.

Given the observed benefits of strong data augmentation in enhancing CL performance [52], there has been a growing interest in leveraging more robust augmentation techniques. However, it is worth noting that solely relying on strong data augmentation can actually lead to a decline in performance [64]. The distortions introduced by strong data augmentation can alter the image structure, resulting in a distribution that differs from that of weakly augmented images. This discrepancy poses optimization challenges. To address the overfitting issue arising from strong data augmentation, [66] proposes an alternative approach. Instead of employing a one-hot distribution, they suggest using the distribution generated by weak data augmentation as a mimic. This mitigates the negative impact of strong data augmentation by aligning the distribution of strongly augmented examples with that of weakly augmented examples.

2.2.2.2 Self-distillation-based CL: Bootstrap Your Own Latent (BYOL) [67] is a prominent self-distillation algorithm designed specifically for self-supervised image representation learning, eliminating the need for negative pairs. This approach employs two identical DNNs, known as Siamese networks, with the same architecture but different weights. One serves as the online network, while the other is the target network. Similar to MoCo [50], BYOL enhances the target network through a gradual averaging of the online network. Siamese networks have emerged as prevalent architectures in contemporary self-supervised visual representation learning models, including SimCLR, BYOL, and SwAV [68]. These models aim to maximize the similarity between two augmented versions of a single image while incorporating specific conditions to mitigate the risk of collapsing solutions.

Simple Siamese (SimSiam) networks, introduced by [69], offer a straightforward approach to learning effective representations in SSL without the need for negative example pairs, large batches, or momentum encoders. Given a data point x and two randomly augmented views x_1 and x_2, an encoder f and an MLP prediction head h process these views. The resulting outputs are denoted as p_1 = h(f(x_1)) and z_2 = f(x_2). The objective of [69] is to minimize their negative cosine similarity:

    D(p_1, z_2) = −(p_1/||p_1||_2) · (z_2/||z_2||_2).  (7)

Here, ||·||_2 represents the l_2-norm. Similar to [67], a symmetric loss [69] is defined as

    L = (1/2)(D(p_1, z_2) + D(p_2, z_1)).  (8)

This loss is defined for the example x, and the overall loss is the average over all examples. Notably, [69] employs a stop-gradient (stopgrad) operation by modifying Eq. (7) as D(p_1, stopgrad(z_2)). This implies that z_2 is treated as a constant. Similarly, Eq. (8) is revised as

    L = (1/2)(D(p_1, stopgrad(z_2)) + D(p_2, stopgrad(z_1))).  (9)
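A minimal PyTorch sketch of the symmetric stop-gradient loss of Eq. (9) follows (illustrative only, not the official SimSiam code); p1 and p2 are prediction-head outputs and z1 and z2 the corresponding encoder outputs, computed by the caller.

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Symmetric SimSiam loss of Eq. (9): negative cosine similarity with a
    stop-gradient (here .detach()) applied to the encoder outputs."""
    def d(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()  # D(p, stopgrad(z))
    return 0.5 * (d(p1, z2) + d(p2, z1))
```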
Fig. 6: Comparison among different Siamese architectures (SimCLR, BYOL, SwAV, and SimSiam). The image is reproduced from [69].

Figure 6 illustrates the distinctions among SimCLR, BYOL, SwAV, and SimSiam. The categorization of BYOL and SimSiam as CL methods is a subject of debate due to their exclusion of negative examples. However, to be consistent with [70], this paper considers BYOL and SimSiam to belong to CL methods.

2.2.2.3 Feature decorrelation-based CL: The objective of feature decorrelation is to learn decorrelated features.

Barlow Twins: Barlow Twins [55] introduced a novel loss function that encourages the similarity of embedding vectors from distorted versions of an example while minimizing redundancy between their components. Similar to other SSL methods such as MoCo [50] and SimCLR [52], Barlow Twins generates two distorted views Y^A and Y^B via a distribution of data augmentations T for each image in a data batch sampled from a dataset, resulting in batches of embeddings Z^A and Z^B. The loss function of Barlow Twins is defined as

    L_BT = Σ_i (1 − C_ii)^2 + λ Σ_i Σ_{j≠i} C_ij^2.  (10)

Here, λ is a hyper-parameter, and C represents the cross-correlation matrix computed between the two batches of embeddings Z^A and Z^B, defined as

    C_ij = Σ_b z^A_{b,i} z^B_{b,j} / ( sqrt(Σ_b (z^A_{b,i})^2) sqrt(Σ_b (z^B_{b,j})^2) ),  (11)

where b indexes the examples in the batch and i, j index the embedding dimensions.

VICReg: VICReg (variance-invariance-covariance regularization) [56] is an approach for training joint embedding architectures that simultaneously considers variance, invariance, and covariance. Similar to Barlow Twins, VICReg generates two distorted views Y^A and Y^B via a distribution of the data augmentation T and gets their embeddings Z^A ∈ R^{n×d} and Z^B ∈ R^{n×d}. Let the subscript j index the embedding dimension, and let d and n represent the dimensionality of the vectors in Z^A and the batch size, respectively. The main contribution of VICReg is the variance preservation term, which explicitly prevents a collapse due to a shrinkage of the embedding vectors toward zero. The variance regularization term v in VICReg is defined as a hinge loss function applied to the standard deviation of the embeddings along the batch dimension:

    v(Z^A) = (1/d) Σ_{j=1}^{d} max(0, γ − S(z^A_j, ε)).  (12)

Here, z^A_j represents the vector composed of each value at dimension j in Z^A, and S represents the regularized standard deviation, defined as

    S(y, ε) = sqrt(Var(y) + ε).  (13)

The constant γ determines the target standard deviation and is set to 1 in the experiments, while ε is a small scalar used to prevent numerical instabilities. This criterion encourages the variance within the current batch to be equal to or greater than γ for every dimension, thereby preventing collapse scenarios where all data are mapped to the same vector.

The invariance criterion s in VICReg, which captures the similarity between Z^A and Z^B, is defined as the mean-squared Euclidean distance between each pair of data without any normalization:

    s(Z^A, Z^B) = (1/n) Σ_{b=1}^{n} ||z^A_b − z^B_b||_2^2.  (14)

In addition, the covariance criterion c(Z) in VICReg is defined as

    c(Z) = (1/d) Σ_{i≠j} [C(Z)]_{i,j}^2,  (15)

where C(Z) represents the covariance matrix of Z. The overall loss of VICReg is a weighted sum of the variance, invariance, and covariance terms:

    L = s(Z^A, Z^B) + α (v(Z^A) + v(Z^B)) + β (c(Z^A) + c(Z^B)).  (16)
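The two feature decorrelation objectives above can be sketched in a few lines of PyTorch; the snippet below is illustrative only (hyper-parameter values are assumptions) and shows the Barlow Twins loss of Eqs. (10)-(11) together with the VICReg variance term of Eqs. (12)-(13).

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-9):
    """Barlow Twins loss of Eqs. (10)-(11): push the cross-correlation matrix
    of the two embedding batches toward the identity."""
    n = z_a.size(0)
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)   # standardize along the batch
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)
    c = (z_a.t() @ z_b) / n                          # (d, d) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag

def vicreg_variance_term(z, gamma=1.0, eps=1e-4):
    """VICReg variance regularization of Eqs. (12)-(13): hinge on the
    per-dimension standard deviation computed over the batch."""
    std = torch.sqrt(z.var(dim=0) + eps)
    return torch.clamp(gamma - std, min=0).mean()
```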
TABLE 2: Categorization of MIM methods based on the reconstruction target.
Low-Level Targets: ViT [5], MAE [70], SimMIM [101] (raw pixels); MaskFeat [106] (HOG).
High-Level Targets: BEiT [99], CAE [100] (VQ-VAE tokens); PeCo [107] (VQ-GAN tokens).
Self-Distillation: data2vec [108], SdAE [109] (the model's own features).
Contrastive / Multi-modal Teacher: MimCo [110] (MoCo v3); BEiT v2 [111] (CLIP).
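Most of the MIM methods in Table 2 share the same training skeleton: randomly mask patch tokens and reconstruct a target for the masked positions. The sketch below illustrates this skeleton for a low-level (raw pixel) target in PyTorch; it is a simplified illustration rather than the implementation of any specific method, and the tensor layout is an assumption.

```python
import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """MAE-style random masking: keep a random subset of patch tokens per
    sample and return a binary mask marking the dropped (masked) patches."""
    b, n, d = patch_tokens.shape
    num_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n, device=patch_tokens.device)
    ids_keep = noise.argsort(dim=1)[:, :num_keep]            # random subset per sample
    visible = torch.gather(patch_tokens, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, device=patch_tokens.device)
    mask.scatter_(1, ids_keep, 0.0)                          # 1 = masked, 0 = visible
    return visible, mask

def masked_pixel_loss(pred_patches, target_patches, mask):
    """Mean-squared reconstruction error, computed on masked patches only."""
    per_patch = (pred_patches - target_patches).pow(2).mean(dim=-1)   # (b, n)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```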
Generative pre-training has also evolved in the video domain. BEVT [115] decouples video representation learning into spatial representation learning and temporal dynamics learning. It first undertakes masked image modeling on image data, followed by a joint approach of masked image modeling and masked video modeling on video data. This accelerates training and achieves results comparable to those of strongly-supervised baselines. Similarly, VideoMAE [116] extends the MAE to videos and discovers that an extremely high masking ratio (90% to 95%) is permissible in video mask modeling. Moreover, it remains effective even on very small datasets, consisting of only 3,000 to 4,000 videos. OmniMAE [117] demonstrates that a unified model can be concurrently trained across multiple visual modalities, breaking the paradigm of previously studying different modes in isolation. This significantly streamlines the training process, enabling more efficient development of large-scale model architectures. SiamMAE [118] indicates that, contrary to images that are (approximately) isotropic, the temporal dimension is unique, necessitating an asymmetric approach to processing temporal and spatial information, as not all spatiotemporal orientations are equally probable.

MIM has demonstrated significant potential in pre-training vision transformers [119]–[121]. However, in prior works, the random masking of image patches led to an underutilization of valuable semantic information essential for effective visual representation learning. Liu et al. [122] introduced an attention-driven masking strategy to explore improvements over random masking for insufficient semantic utilization.

2.2.4 Contrastive Generative Methods

As stated in [123], contrastive models tend to be data-hungry and vulnerable to overfitting issues, whereas generative models encounter data-filling challenges and exhibit inferior data scaling capabilities when compared to contrastive models. While contrastive models often focus on global views [83], overlooking internal structures within images, MIM primarily models local relationships. The divergent characteristics and challenges encountered in contrastive self-supervised learning and generative self-supervised learning have motivated researchers to explore the combination of these two kinds of approaches.

To elaborate further, let us compare the challenges faced by contrastive self-supervised methods and generative self-supervised methods. Generative self-supervised methods are characterized as data-filling approaches [124]. For a model of a certain size, when the dataset reaches a certain magnitude, further scaling of the data does not lead to significant performance gains in generative self-supervised methods. In contrast, recent studies have revealed the potential of data scaling to enhance the performance of CL [125]. As data increases, CL shows substantial performance improvements, demonstrating remarkable generalization without additional fine-tuning on downstream tasks. However, the scenario differs in low-data regimes. Contrastive models may find shortcuts with trivial representations that overfit the limited data [50], thus leading to inconsistent improvements in generalization performance for downstream tasks using pre-trained models with contrastive self-supervised methods [123]. On the other hand, generative methods are more adept at handling low-data scenarios and can even achieve notable performance improvements when data is extremely scarce, such as with only 10 images [126].

Several endeavors have sought to integrate both types of algorithms [123], [127]. In [127], GANs are employed for online data augmentation in CL. The study devises a contrastive module that learns view-invariant features for generation and introduces a view-invariant loss function to facilitate learning between original and generated views. On the other hand, [98] draws inspiration from both BEiT and DINO [83]. It modifies the tokenizer of BEiT to an online distilled teacher while integrating cross-view distillation from the DINO framework. As a result, iBOT [98] significantly enhances linear probing accuracy compared to MIM-only baselines. RePre [128] integrates local feature learning into self-supervised vision transformers through reconstructive pre-training, an approach that enhances contrastive frameworks. This is achieved by incorporating an additional branch dedicated to reconstructing raw image pixels, which operates concurrently with the established contrastive objective. CMAE [129] concurrently performs CL and MIM tasks. To align CL with MIM effectively, CMAE introduces two
novel components: pixel shifting for generating plausible positive views, and a feature decoder for enhancing the features of contrastive pairs. This approach significantly improves the quality of representation and transfer performance compared to its MIM-only counterparts. SiameseIM [130] does not simply merge the objectives of CL and MIM, but rather utilizes the views generated by CL as the target for MIM reconstruction in the latent space.

Despite attempts to combine both types of approaches, naive combinations may not always yield performance gains and can even perform worse than the generative model baseline, thereby exacerbating the issue of representation over-fitting [123]. The performance degradation could be attributed to the disparate properties of CL and generative methods. For instance, CL methods typically exhibit longer attention distances, whereas generative methods tend to favor local attention [131]. In light of this challenge, RECON [123] emerges as a solution by training generative modeling to guide CL, thereby leveraging the benefits of both paradigms.

2.2.5 Summary

As described above, numerous pretext tasks for SSL have been devised, with several significant milestone variants depicted in Fig. 8. Several other pretext tasks are available [132], [133], encompassing diverse approaches such as relative patch location [134], noise prediction [135], feature clustering [136]–[138], cross-channel prediction [139], and combining different cues [140]. Kolesnikov et al. [141] conducted a comprehensive investigation of previously proposed SSL pretext tasks, yielding significant insights. Besides, Krähenbühl et al. [142] proposed an alternative approach to pretext tasks and demonstrated the ease of obtaining data from video games.

It has been observed that context-based approaches exhibit limited applicability due to their inferior performance. In the realm of visual SSL, two dominant types of algorithms are CL and MIM. While visual CL may encounter overfitting issues, CL algorithms that incorporate multi-modality, exemplified by CLIP [2], have gained popularity.

2.3 Combinations with other learning paradigms

It is essential to acknowledge that the advancements in SSL did not occur in isolation; instead, they have been the result of continuous development over time. In this section, we provide a comprehensive list of relevant learning paradigms that, when combined with SSL, contribute to a clearer understanding of their collective impact.

2.3.1 GANs

GANs represent classical unsupervised learning methods and were among the most successful approaches in this domain before the surge of SSL techniques. The integration of GANs with SSL offers various avenues, with self-supervised GANs (SS-GAN) serving as one such example. The GANs' objective function [35], [143] is given as

    min_G max_D V(G, D) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].  (18)

The SS-GAN [144] is defined by combining the objective functions of GANs with the concept of rotation [7]:

    L_G(G, D) = −V(G, D) − α E_{x∼p_G} E_{r∼R}[log Q_D(R = r | x^r)],  (19)

    L_D(G, D) = V(G, D) − β E_{x∼p_data} E_{r∼R}[log Q_D(R = r | x^r)],  (20)

where V(G, D) represents the objective function of GANs as given in Eq. (18), and r ∼ R refers to a rotation selected from a set of possible rotations, similar to the concept presented in [7]. Here, x^r denotes an image x rotated by r degrees, and Q_D(R | x^r) corresponds to the discriminator's predictive distribution over the angles of rotation for a given example x. Notably, rotation [7] serves as a classical SSL method. The SS-GAN incorporates rotation awareness into the GANs' generation process by integrating the rotation prediction task during training.

2.3.2 Semi-supervised learning

SSL and semi-supervised learning are contrasting paradigms that can be effectively combined. One notable example of this combination is self-supervised semi-supervised learning (S4L) [145]. In S4L, the objective function is given by

    L = min_θ L_l(D_l, θ) + w L_u(D_u, θ).  (21)

This means optimizing the corresponding loss objectives on a labeled dataset D_l and an unlabeled dataset D_u. L_l is the categorization loss (e.g., cross-entropy), L_u stands for the self-supervised loss (e.g., the rotation task in Eq. (3)), and θ denotes the learnable parameters.

Incorporating SSL as an auxiliary task is a well-established approach in semi-supervised learning. Another classical method to leverage SSL within this context involves implementing SSL on unlabeled data, followed by fine-tuning the resultant model on labeled data, as demonstrated in SimCLR.

To demonstrate the robustness of self-supervision against adversarial perturbations, Hendrycks et al. [146] proposed an overall loss function as a linear combination of supervised and self-supervised losses:

    L(x, y, θ) = L_CE(y, p(y | PGD(x)), θ) + λ L_SS(PGD(x), θ),  (22)

where x is the example, y is the one-hot vector of the ground truth, and θ denotes the model parameters. The adversarial example is generated from x by projected gradient descent (PGD), and adversarial training is implemented with the cross-entropy loss L_CE. L_SS is the self-supervised loss.

2.3.3 Multi-instance learning (MIL)

Miech et al. [13] introduced an extension of the InfoNCE loss (5) for MIL and termed it MIL-NCE:

    max_{f,g} Σ_{i=1}^{n} log [ Σ_{(x,y)∈P_i} e^{f(x)^T g(y)} / ( Σ_{(x,y)∈P_i} e^{f(x)^T g(y)} + Σ_{(x',y')∈N_i} e^{f(x')^T g(y')} ) ],  (23)

where x and y represent a video clip and a narration, respectively. The functions f and g generate embeddings of x and y, respectively. For a specific example indexed by i, P_i denotes the set of positive video/narration pairs, while N_i corresponds to the set of negative video/narration pairs.
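A simplified PyTorch sketch of Eq. (23) is given below. It is illustrative only: it assumes the negative set N_i consists of the same clip paired with all non-positive narrations in the batch (the full method draws negatives more broadly), and it minimizes the negative of the maximization objective.

```python
import torch

def mil_nce_loss(video_emb, text_emb, pos_mask):
    """Simplified MIL-NCE of Eq. (23). video_emb: (B, d) clip embeddings f(x);
    text_emb: (M, d) narration embeddings g(y); pos_mask: (B, M) boolean
    matrix marking the positive narration set P_i of each clip."""
    scores = torch.exp(video_emb @ text_emb.t())     # e^{f(x)^T g(y)}, shape (B, M)
    pos = (scores * pos_mask).sum(dim=1)             # numerator: sum over P_i
    denom = scores.sum(dim=1)                        # denominator: positives + negatives
    return -torch.log(pos / denom).mean()            # minimize the negative objective
```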
2.3.4 Multi-view/multi-modal(ality) learning

Observation plays a vital role in infants' acquisition of knowledge about the world. Notably, they can grasp the concept of apples through observational and comparative processes, which distinguishes their learning approach from traditional supervised algorithms that rely on extensive labeled apple data. This phenomenon was demonstrated by Orhan et al. [22], who gathered perceptual data from infants and employed an SSL algorithm to model how infants learn the concept of "apple". Moreover, infants' learning about the world extends to multi-view and multi-modal(ality) learning [2], encompassing various sensory inputs such as video and audio. Hence, SSL and multi-view/multi-modal(ality) learning converge naturally in infants' learning mechanisms as they explore and comprehend the workings of the world.

2.3.4.1 Multiview CL: The objective function in standard multiview CL, as proposed by Tian et al. [64], is given by

    L_NCE = E[L_q],  (24)

where L_q corresponds to Eq. (5). Multiview CL treats different views of the same sample as positive examples for contrastive learning. Tian et al. [64] introduced both unsupervised and semi-supervised multiview learning based on adversarial learning. Let X̂ denote g(X), i.e., X̂ = g(X). Two encoders, f_1 and f_2, were trained to maximize I_NCE(X̂_1, X̂_{2:3}) as stated in Eq. (24). A flow-based model g was trained to minimize I_NCE(X̂_1, X̂_{2:3}), and {X_1, X_{2:3}} is obtained by splitting the image over its channels. Formally, the objective function for unsupervised view learning can be expressed as

    min_g max_{f_1, f_2} I_NCE^{f_1, f_2}(g(X)_1, g(X)_{2:3}).  (25)

In the context of semi-supervised view learning, when several labeled examples are available, the objective function is formulated as

    min_{g, c_1, c_2} max_{f_1, f_2} I_NCE^{f_1, f_2}(g(X)_1, g(X)_{2:3}) + L_ce(c_1(g(X)_1), y) + L_ce(c_2(g(X)_{2:3}), y),  (26)

where y represents the labels, c_1 and c_2 are classifiers, and L_ce denotes the cross-entropy. Further relevant works can be found in [63], [64], [147]. Table 3 summarizes different SSL losses.

2.3.4.2 Images and text: In the study conducted by Gomez et al. [148], the authors employed a topic modeling framework to project the text of an article into the topic probability space. This semantic-level representation was then utilized as the self-supervised signal for training CNN models on images. On a similar note, CLIP [2] leverages a CL-style pre-training task to predict the correspondence between captions and images. Benefiting from the CL paradigm, CLIP is capable of training models from scratch on an extensive dataset comprising 400 million image-text pairs collected from the internet. Consequently, CLIP's advancements have significantly propelled multi-modal learning to the forefront of research attention.

2.3.4.3 Point clouds and other modalities: Several SSL methods have been proposed for joint learning of 3D point cloud features and 2D image features by leveraging cross-modality and cross-view correspondences through triplet and cross-entropy losses [149]. Additionally, there are efforts to jointly learn view-invariant and mode-invariant characteristics from diverse modalities, such as images, point clouds, and meshes, using heterogeneous networks for 3D data [150]. SSL has also been employed for point cloud datasets, with approaches including CL and clustering based on graph CNNs [151]. Furthermore, AEs have been used for point clouds in works like [113], [114], [152], [153], while capsule networks have been applied to point cloud data in [154].

2.3.5 Test time training

Sun et al. [155] introduced "test time training (TTT) with self-supervision" to enhance the performance of predictive models when the training and test data come from distinct distributions. TTT converts an individual unlabeled test example into an SSL problem, enabling model parameter updates before making predictions. Recently, Gandelsman et al. [156] combined TTT with MAE for improved performance. They argued that, by treating TTT as a one-sample learning problem, optimizing a model for each test input could be addressed using the MAE as

    h_0 = argmin_h (1/n) Σ_{i=1}^{n} L_m(h ∘ f_0(x_i), y_i),  (27)

    f_x, g_x = argmin_{f,g} L_s(g ∘ f(mask(x)), x).  (28)

Here, f and g refer to the encoder and decoder of MAE, respectively, and h denotes the main task head.

TTT achieves an improved bias-variance tradeoff under distribution shifts. A static model heavily depends on training data that may not accurately represent the new test distribution, leading to bias. On the other hand, training a new model from scratch for each test input, ignoring all training data, is undesirable. This approach results in an unbiased representation for each test input but exhibits high variance because it relies on a single example.

2.3.6 Summary

The evolution of SSL is characterized by its dynamic and interconnected nature. Analyzing the amalgamation of various methods allows for a clearer grasp of SSL's developmental trajectory. An exemplar of this success is evident in CLIP, which effectively combines CL with multi-modal learning, leading to remarkable achievements. SSL has been extensively integrated with various machine learning tasks, showcasing its versatility and potential. It has been combined with clustering [68], semi-supervised learning [145], multi-task learning [157], [158], transfer learning [159]–[161], graph NNs [147], [162], [163], reinforcement learning [164]–[166], few-shot learning [167], [168], neural architecture search [169], robust learning [146], [170]–[172], and meta-learning [173], [174]. This diverse integration underscores the widespread applicability and impact of SSL in the machine learning domain.
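Since CLIP recurs throughout this section as the flagship combination of CL and multi-modal learning, a simplified sketch of a CLIP-style symmetric image-text InfoNCE objective is given below. It is illustrative only and is not CLIP's actual implementation, which additionally learns the temperature and trains with very large distributed batches.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, tau=0.07):
    """CLIP-style symmetric InfoNCE: matched image-text pairs (the diagonal)
    are positives; every other pairing in the batch acts as a negative."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```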
TABLE 4: Experimental results of the tested algorithms for linear classification and transfer learning tasks. DB denotes the
default batch size. The symbol “-” indicates the absence or unavailability of the data point in the respective paper. The
subscripts A, R, and V represent AlexNet, ResNet-50, and ViT-B, respectively. The superscript “e” indicates the utilization
of extra data, specifically VOC2012.
Methods Linear Probe Fine-Tuning VOC det VOC seg COCO det COCO seg ADE20K seg DB
Random: 17.1A [8] - 60.2eR [69] 19.8A [8] 36.7R [50] 33.7R [50] - -
R50 Sup 76.5 [68] 76.5 [68] 81.3e [69] 74.4 [67] 40.6 [50] 36.8 [50] - -
ViT-B Sup 82.3 [70] 82.3 [70] - - 47.9 [70] 42.9 [70] 47.4 [70] -
Context-Based:
Jigsaw [8] 45.7R [68] 54.7 61.4R [42] 37.6 - - - 256
Colorization [38] 39.6R [68] 40.7 [7] 46.9 35.6 - - - -
Rotation [7] 38.7 50.0 54.4 39.1 - - - 128
CL Based on Negative Examples:
Examplar [132] 31.5 [48] - - - - - - -
Instdisc [48] 54.0 - 65.4 - - - - 256
MoCo v1 [50] 60.6 - 74.9 - 40.8 36.9 - 256
SimCLR [52] 73.9V [82] - 81.8e [69] - 37.9 [69] 33.3 [69] - 4096
MoCo v2 [51] 72.2 [69] - 82.5e - 39.8 [56] 36.1 [56] - 256
MoCo v3 [82] 76.7 83.2 - - 47.9 [70] 42.7 [70] 47.3 [70] 4096
CL Based on Clustering:
SwAV [68] 75.3 - 82.6e [56] - 41.6 37.8 [56] - 4096
CL Based on Self-distillation:
BYOL [67] 74.3 - 81.4e [69] 76.3 40.4 [56] 37.0 [56] - 4096
SimSiam [69] 71.3 - 82.4e [69] - 39.2 34.4 - 512
DINO [83] 78.2 83.6 [98] - - 46.8 [100] 41.5 [100] 44.1 [99] 1024
CL Based on Feature Decorrelation:
Barlow Twins [55] 73.2 - 82.6e [56] - 39.2 34.3 - 2048
VICReg [56] 73.2 - 82.4e - 39.4 36.4 - 2048
Masked Image Modeling (ViT-B by default):
Context Encoder [104] 21.0A [7] - 44.5A [7] 30.0A - - - -
BEiT v1 [99] 56.7 [111] 83.4 [98] - - 49.8 [70] 44.4 [70] 47.1 [70] 2000
MAE [70] 67.8 83.6 - - 50.3 44.9 48.1 4096
SimMIM [101] 56.7 83.8 - - 52.3Swin−B [244] - 52.8Swin−B [244] 2048
PeCo [107] - 84.5 - - 43.9 39.8 46.7 2048
iBOT [98] 79.5 84.0 - - 51.2 44.2 50.0 1024
MimCo [110] - 83.9 - - 44.9 40.7 48.91 2048
CAE [100] 70.4 83.9 - - 50 44 50.2 2048
data2vec [108] - 84.2 - - - - - 2048
SdAE [109] 64.9 84.1 - - 48.9 43.0 48.6 768
BEiT v2 [111] 80.1 85.5 - - - - 53.1 2048
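Linear probe results such as those in Table 4 are typically obtained by freezing the pre-trained backbone and training only a linear classifier on top of its features. A minimal sketch of this protocol follows (the hyper-parameters are illustrative assumptions; individual papers differ in their exact settings).

```python
import torch
import torch.nn as nn

def linear_probe(backbone, loader, feat_dim, num_classes, epochs=10, lr=0.1):
    """Standard linear-probe protocol: freeze the pre-trained backbone and
    train only a linear classifier on its frozen features."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False
    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)          # frozen representation
            loss = nn.functional.cross_entropy(clf(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```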
The evaluation of object detection on the PASCAL VOC dataset employs mean average precision (mAP), specifically AP50. By default, the object detection task on PASCAL VOC employs VOC2007 for training. However, certain methods employ the combined 07+12 dataset and are annotated with a superscript "e". As for the object detection and instance segmentation tasks on COCO, we adopt the bounding-box AP (APbb) and mask AP (APmk) metrics, in accordance with [50]. The results on video understanding are evaluated using fine-tuned Top-1 accuracy as the metric.

4.2 Summary

First, the linear probe performance of contrastive learning models typically surpasses that of other algorithms, and contrastive learning approaches tend to regard the linear probe as a significant performance metric. This superiority is attributed to contrastive learning generating well-structured latent spaces, wherein distinct categories are effectively separated and similar categories are appropriately clustered.

Secondly, it is observed that pre-trained models using MIM can be fine-tuned to achieve superior performance in most cases. Conversely, pre-trained models based on CL lack this property. One primary reason for this discrepancy lies in the increased susceptibility of CL-based models to overfitting [66], [262], [263]. This observation also extends to the fine-tuning of pre-trained models for downstream tasks. MIM-based approaches consistently exhibit substantial performance enhancements in downstream tasks, while CL-based methods offer comparatively limited assistance.

Thirdly, CL-based methods tend to employ resource-intensive techniques like momentum encoders, memory queues, and multi-crop, significantly increasing the demands on computing, storage, and communication resources. In contrast, MIM-based methods have a more efficient resource utilization, possibly attributed to the absence of example interactions. This advantageous property allows MIM-based algorithms to easily scale up models and data, efficiently leveraging modern GPUs for highly parallel computing. We compare the computational complexity of different SSL methods in Table 1 of the Appendix. Note that the primary source of time complexity and memory consumption is the neural network itself rather than the SSL-specific components, e.g., the calculation of the cross-correlation matrix in Barlow Twins.

5 CONCLUSIONS, FUTURE TRENDS, AND OPEN QUESTIONS

In summary, this comprehensive review offers essential insights into contemporary SSL research, providing newcomers with an overall picture of the field. The paper presents a thorough survey of SSL from three main perspectives: algorithms, applications, and future trends.
Contrastive Methods
Method | Pre-training Dataset | Backbone | Linear Probe | UCF101 [254] Linear / Fine-tune | HMDB51 [255] Linear / Fine-tune
DSM [256] | K400 | R3D34 | - | - / 78.2 | - / 52.8
TCE [257] | K400 | R50 | - | - / 71.2 | - / 36.6
CoCRL [216] | K400 | S3D-G | - | 74.5 [258] / 87.9 | 46.1 [258] / 54.6
CoCRL | K400 | 2×S3D-G | - | - / 90.6 | - / 62.9
VTHCL [259] | K400 | R3D50 | 37.8 [260] | - / 82.1 | - / 49.2
CVRL [261] | K400 | R3D50 | 66.1 | 89.2 / 92.2 | 57.3 / 66.7
CVRL | K600 | R3D50 | 70.4 | 90.6 / 93.4 | 59.7 / 68.0
ρBYOL [260] | K400 | R3D50 | 71.5 | - / 95.5 | - / 73.6
ρBYOL | K400 | S3D-G | - | - / 96.3 | - / 75.0
BraVe [258] | K400 | R3D50 | - | 90.6 / 93.7 | 65.1 / 72.0
BraVe | K600 | R3D50 | 69.1 | 91.9 / 94.4 | 67.6 / 73.9

Masked Image Modeling Methods
Method | Pre-training Dataset | Backbone | K400 [251] | SSv2 [252] | AVA [253]
MaskFeat [106] | K400 | MViTv2-L/312 | 86.4 | 74.4 | 37.5
BEVT [115] | K400 | Swin-B | 76.2 | 67.1 | -
BEVT | IN1K + K400 | Swin-B | 80.6 | 70.6 | -
VideoMAE [116] | K400 | ViT-B | 80.0 | 68.5 | 26.7
VideoMAE | SSv2 | ViT-B | 69.6 | 79.6 | -
VideoMAE | SSv2 | ViT-L | - | 75.4 | 34.3
MAE-ST [112] | K400 | ViT-L | 84.8 | 72.1 | 32.3
OmniMAE [117] | IN1K + K400 | ViT-B | 80.8 | 69.0 | -
OmniMAE | IN1K + SSv2 | ViT-B | 80.6 | 69.5 | -
OmniMAE | IN1K + SSv2 | ViT-L | 84.0 | 74.2 | -
stream visual SSL algorithms, classifying them into four major types: context-based methods, generative methods, contrastive methods, and contrastive generative methods. Furthermore, we investigate the correlation between SSL and other learning paradigms. Lastly, we delve into future trends and open problems, as outlined below.

Main trends: Firstly, the theoretical cloud still looms over SSL. How can we understand different SSL algorithms and unify them in the same way physics seeks to unify the four fundamental forces? [54] analyzed the key properties of contrastive learning based on negative samples, enhancing the understanding of representation distributions. [78] rethought contrastive learning from the perspective of spectral decomposition, providing a high-level understanding of why contrastive learning is effective. [264] examined practical properties of contrastive losses, and InfoMin [64] indicated that the design of views should take downstream tasks into account. [265] investigated why distillation-based methods do not collapse. [266] demonstrated the duality between negative-example-based contrastive learning and covariance-regularization-based methods such as Barlow Twins, indicating that the latter can be seen as contrastive between the dimensions of the embeddings rather than between the samples. [267] demonstrated that introducing discrete sparse overcomplete representations for SSL can improve generalization. [268] presented the connections and distinctions among various SSL methods from the perspective of gradients. We anticipate that new theoretical studies will aid in comprehending and unifying various SSL approaches, particularly in harmonizing CL-based methods with MIM-based methods.

Secondly, a crucial question arises concerning the automatic design of an optimal pretext task to enhance the performance of a fixed downstream task. Various methods have been proposed to address this challenge, including the pixel-to-propagation consistency method [65] and dense contrastive learning [269]. However, this problem remains insufficiently resolved, and further theoretical investigations are warranted in this direction.

Thirdly, there is a pressing need for a unified SSL paradigm that encompasses multiple modalities. MIM has demonstrated remarkable progress in vision tasks, akin to the success of masked language models in NLP, suggesting the possibility of unifying learning paradigms. Additionally, the ViT architecture bridges the gap between the visual and verbal modalities, enabling the construction of a unified transformer model for both CV and NLP tasks. Recent endeavors [108], [270] have sought to unify SSL models, yielding impressive results in downstream tasks and showing broad applicability. Nevertheless, NLP has advanced further in leveraging SSL models, prompting the CV community to draw inspiration from NLP approaches to effectively harness the potential of pre-trained models.

Open problems: Firstly, can SSL effectively leverage vast amounts of unlabeled data? How does it consistently benefit from additional unlabeled data, and how can we determine the theoretical inflection point?

Secondly, it is pertinent to explore the interconnection between SSL and multi-modality learning, as both methodologies share resemblances with the cognitive processes observed in infants. Consequently, a critical inquiry arises: how can these two approaches be synergistically integrated to forge a robust and comprehensive learning model?
Thirdly, determining the optimal or recommended SSL algorithm poses a challenge, as there is no universally applicable solution. The ideal selection of an algorithm should align with the specific problem structure, yet practical situations often complicate this process. Consequently, the development of a checklist to aid users in identifying the most suitable method under particular circumstances warrants investigation and should be pursued as a promising avenue for future research.

Fourthly, the assumption that unlabeled data invariably leads to improved outcomes warrants scrutiny. Our hypothesis challenges this notion, especially concerning semi-supervised learning methods, as the no-free-lunch theorem comes into play. Performance degradation can arise when model assumptions fail to align with the underlying problem structure. For instance, if a model assumes a substantial separation between decision boundaries and regions of high data density, it may perform poorly on data originating from heavily overlapping Cauchy distributions, since the decision boundary would then traverse dense areas (the short sketch below makes this mismatch concrete). However, preemptively identifying such mismatches remains an intricate and unresolved matter that merits further research.
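As a concrete illustration of this point, the following minimal sketch (our own example, not an experiment from this survey; the locations ±0.5 and the unit scale are arbitrary assumptions) uses SciPy to show that, for two heavily overlapping Cauchy class-conditional densities with equal priors, the optimal decision boundary falls in the densest region of the data, so a low-density-separation assumption is violated from the outset.

    # Illustrative only: two overlapping Cauchy class-conditional densities whose
    # optimal decision boundary lies in a high-density region, violating the
    # low-density-separation assumption behind many semi-/self-supervised methods.
    import numpy as np
    from scipy.stats import cauchy

    loc0, loc1, scale = -0.5, 0.5, 1.0            # assumed class-conditional parameters
    xs = np.linspace(-5.0, 5.0, 2001)
    p0 = cauchy.pdf(xs, loc=loc0, scale=scale)    # density of class 0
    p1 = cauchy.pdf(xs, loc=loc1, scale=scale)    # density of class 1

    idx = int(np.argmin(np.abs(p0 - p1)))         # with equal priors, the optimal boundary
    boundary = xs[idx]                            # is where the densities cross
    marginal = 0.5 * (p0 + p1)                    # overall (class-balanced) data density

    print(f"optimal decision boundary at x = {boundary:.2f}")
    print(f"data density at boundary / peak density = {marginal[idx] / marginal.max():.2f}")
    # The ratio is ~1.0: the boundary runs straight through the densest area, so a
    # method that pushes boundaries away from dense regions is misled, and adding
    # more unlabeled data drawn from these distributions does not fix the mismatch.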
REFERENCES

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 248–255, 2009.
[2] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in Int. Conf. Mach. Learn., pp. 8748–8763, 2021.
[3] L. Ericsson, H. Gouk, and T. M. Hospedales, “How well do self-supervised models transfer?,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5414–5423, 2021.
[4] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” IEEE T. Knowl. Data Eng., 2022.
[5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Int. Conf. Learn. Represent., 2021.
[6] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in IEEE Int. Conf. Comput. Vis., pp. 4489–4497, 2015.
[7] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” in Int. Conf. Learn. Represent., pp. 1–14, 2018.
[8] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in Eur. Conf. Comput. Vis., pp. 69–84, 2016.
[9] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsupervised learning using temporal order verification,” in Eur. Conf. Comput. Vis., pp. 527–544, 2016.
[10] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman, “Learning and using the arrow of time,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 8052–8060, 2018.
[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[12] X. Zeng, Y. Pan, M. Wang, J. Zhang, and Y. Liu, “Realistic face reenactment via self-supervised disentangling of identity and pose,” in AAAI Conf. Artif. Intell., pp. 12154–12163, 2020.
[13] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman, “End-to-end learning of visual representations from uncurated instructional videos,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9879–9889, 2020.
[14] Y. M. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in Int. Conf. Learn. Represent., 2020.
[15] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[16] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Int. Conf. Mach. Learn., pp. 1096–1103, 2008.
[17] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in IEEE Int. Conf. Robot. Autom., pp. 3406–3413, 2016.
[18] Y. Li, M. Paluri, J. M. Rehg, and P. Dollár, “Unsupervised learning of edges,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1619–1627, 2016.
[19] D. Li, W.-C. Hung, J.-B. Huang, S. Wang, N. Ahuja, and M.-H. Yang, “Unsupervised visual representation learning by graph-based consistent constraints,” in Eur. Conf. Comput. Vis., pp. 678–694, 2016.
[20] H. Lee, S. J. Hwang, and J. Shin, “Rethinking data augmentation: Self-supervision and self-distillation,” arXiv preprint arXiv:1910.05872, 2019.
[21] B. Zoph, G. Ghiasi, T.-Y. Lin, Y. Cui, H. Liu, E. D. Cubuk, and Q. Le, “Rethinking pre-training and self-training,” in Neural Inf. Process. Syst., pp. 1–13, 2020.
[22] A. E. Orhan, V. V. Gupta, and B. M. Lake, “Self-supervised learning through the eyes of a child,” in Neural Inf. Process. Syst., pp. 9960–9971, 2020.
[23] J. Mitrovic, B. McWilliams, J. Walker, L. Buesing, and C. Blundell, “Representation learning via invariant causal mechanisms,” in Int. Conf. Learn. Represent., pp. 1–19, 2021.
[24] T. Hua, W. Wang, Z. Xue, S. Ren, Y. Wang, and H. Zhao, “On feature decorrelation in self-supervised learning,” in IEEE Int. Conf. Comput. Vis., pp. 9598–9608, 2021.
[25] VentureBeat, “Yann LeCun, Yoshua Bengio: Self-supervised learning is key to human-level intelligence.” https://cacm.acm.org/news/244720-yann-lecun-yoshua-bengio-self-supervised-learning-is-key-to-human-level-intelligence/fulltext.
[26] J. Yu, H. Yin, X. Xia, T. Chen, J. Li, and Z. Huang, “Self-supervised learning for recommender systems: A survey,” arXiv preprint arXiv:2203.15876, 2022.
[27] Y. Liu, M. Jin, S. Pan, C. Zhou, Y. Zheng, F. Xia, and P. Yu, “Graph self-supervised learning: A survey,” IEEE T. Knowl. Data Eng., 2022.
[28] H. H. Mao, “A survey on self-supervised pre-training for sequential transfer learning in neural networks,” arXiv preprint arXiv:2007.00800, 2020.
[29] M. C. Schiappa, Y. S. Rawat, and M. Shah, “Self-supervised learning for videos: A survey,” arXiv preprint arXiv:2207.00419, 2022.
[30] G.-J. Qi and M. Shah, “Adversarial pretraining of self-supervised deep networks: Past, present and future,” arXiv preprint arXiv:2210.13463, 2022.
[31] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon, “A survey on contrastive self-supervised learning,” Technologies, vol. 9, no. 1, pp. 1–22, 2020.
[32] V. R. de Sa, “Learning classification with unlabeled data,” in Neural Inf. Process. Syst., pp. 112–119, 1994.
[33] Y. LeCun and Y. Bengio, “Reflections from the turing award winners.” https://iclr.cc/virtual 2020/speaker 7.html.
[34] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 11, pp. 4037–4058, 2021.
[35] J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye, “A review on generative adversarial networks: Algorithms, theory, and applications,” IEEE T. Knowl. Data Eng., 2022.
[36] T. Nathan Mundhenk, D. Ho, and B. Y. Chen, “Improvements to context based self-supervised learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9339–9348, 2018.
[37] P. Agrawal, J. Carreira, and J. Malik, “Learning to see by moving,” in IEEE Int. Conf. Comput. Vis., pp. 37–45, 2015.
[38] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in Eur. Conf. Comput. Vis., pp. 649–666, 2016.
[39] G. Larsson, M. Maire, and G. Shakhnarovich, “Learning representations for automatic colorization,” in Eur. Conf. Comput. Vis., pp. 577–593, 2016.
[40] R. Zhang, J.-Y. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. Efros, “Real-time user-guided image colorization with learned deep priors,” arXiv preprint arXiv:1705.02999, 2017.
[41] G. Larsson, M. Maire, and G. Shakhnarovich, “Colorization as a proxy task for visual understanding,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6874–6883, 2017.
[42] P. Goyal, D. Mahajan, A. Gupta, and I. Misra, “Scaling and benchmarking self-supervised visual representation learning,” in IEEE Int. Conf. Comput. Vis., pp. 6391–6400, 2019.
[43] U. Ahsan, R. Madhok, and I. Essa, “Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition,” in Proc. Winter Conf. Appl. Comput. Vis., pp. 179–189, 2019.
[44] X. Zhan, X. Pan, Z. Liu, D. Lin, and C. C. Loy, “Self-supervised learning via conditional motion propagation,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1881–1889, 2019.
[45] K. Wang, L. Lin, C. Jiang, C. Qian, and P. Wei, “3d human pose machines with self-supervised learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 5, pp. 1069–1082, 2019.
[46] M. Noroozi, H. Pirsiavash, and P. Favaro, “Representation learning by learning to count,” in IEEE Int. Conf. Comput. Vis., pp. 5898–5906, 2017.
[47] I. Misra and L. v. d. Maaten, “Self-supervised learning of pretext-invariant representations,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6707–6717, 2020.
[48] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3733–3742, 2018.
[49] N. Zhao, Z. Wu, R. W. Lau, and S. Lin, “What makes instance discrimination good for transfer learning?,” in Int. Conf. Learn. Represent., pp. 1–11, 2021.
[50] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9729–9738, 2020.
[51] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv preprint arXiv:2003.04297, 2020.
[52] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Int. Conf. Mach. Learn., pp. 1597–1607, 2020.
[53] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton, “Big self-supervised models are strong semi-supervised learners,” in Neural Inf. Process. Syst., pp. 1–13, 2020.
[54] T. Wang and P. Isola, “Understanding contrastive representation learning through alignment and uniformity on the hypersphere,” in Int. Conf. Mach. Learn., pp. 9929–9939, 2020.
[55] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” in Int. Conf. Mach. Learn., 2021.
[56] A. Bardes, J. Ponce, and Y. LeCun, “Vicreg: Variance-invariance-covariance regularization for self-supervised learning,” in Int. Conf. Learn. Represent., pp. 1–12, 2022.
[57] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1735–1742, 2006.
[58] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2019.
[59] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Int. Conf. Artif. Intell. Statist., pp. 297–304, 2010.
[60] M. Zheng, S. You, F. Wang, C. Qian, C. Zhang, X. Wang, and C. Xu, “Ressl: Relational self-supervised learning with weak augmentation,” arXiv preprint arXiv:2107.09282, 2021.
[61] N. Zhao, Z. Wu, R. W. Lau, and S. Lin, “Distilling localization for self-supervised representation learning,” in AAAI Conf. Artif. Intell., pp. 10990–10998, 2021.
[62] R. Arandjelovic and A. Zisserman, “Objects that sound,” in Eur. Conf. Comput. Vis., pp. 435–451, 2018.
[63] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in Eur. Conf. Comput. Vis., pp. 776–794, 2020.
[64] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, “What makes for good views for contrastive learning,” in Neural Inf. Process. Syst., pp. 1–13, 2020.
[65] Z. Xie, Y. Lin, Z. Zhang, Y. Cao, S. Lin, and H. Hu, “Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 16684–16693, 2021.
[66] X. Wang and G.-J. Qi, “Contrastive learning with stronger augmentations,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–12, 2022.
[67] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al., “Bootstrap your own latent: A new approach to self-supervised learning,” in Neural Inf. Process. Syst., pp. 1–14, 2020.
[68] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” in Neural Inf. Process. Syst., 2020.
[69] X. Chen and K. He, “Exploring simple siamese representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 15750–15758, 2021.
[70] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 16000–16009, 2022.
[71] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic, “On mutual information maximization for representation learning,” in Int. Conf. Learn. Represent., pp. 1–12, 2020.
[72] N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and H. Khandeparkar, “A theoretical analysis of contrastive unsupervised representation learning,” in Int. Conf. Mach. Learn., pp. 5628–5637, 2019.
[73] Y. Yang and Z. Xu, “Rethinking the value of labels for improving class-imbalanced learning,” in Neural Inf. Process. Syst., 2020.
[74] Y.-H. H. Tsai, Y. Wu, R. Salakhutdinov, and L.-P. Morency, “Self-supervised learning from a multi-view perspective,” arXiv preprint arXiv:2006.05576, 2020.
[75] C.-Y. Chuang, J. Robinson, L. Yen-Chen, A. Torralba, and S. Jegelka, “Debiased contrastive learning,” in Int. Conf. Learn. Represent., 2020.
[76] J. D. Lee, Q. Lei, N. Saunshi, and J. Zhuo, “Predicting what you already know helps: Provable self-supervised learning,” arXiv preprint arXiv:2008.01064, 2020.
[77] S. Chen, G. Niu, C. Gong, J. Li, J. Yang, and M. Sugiyama, “Large-margin contrastive learning with distance polarization regularizer,” in Int. Conf. Mach. Learn., pp. 1673–1683, 2021.
[78] J. Z. HaoChen, C. Wei, A. Gaidon, and T. Ma, “Provable guarantees for self-supervised deep learning with spectral contrastive loss,” in Neural Inf. Process. Syst., Nov. 2021.
[79] C. Tosh, A. Krishnamurthy, and D. Hsu, “Contrastive learning, multi-view redundancy, and linear models,” in Algorithmic Learning Theory, pp. 1179–1206, 2021.
[80] C. Wei, K. Shen, Y. Chen, and T. Ma, “Theoretical analysis of self-training with deep networks on unlabeled data,” in Int. Conf. Learn. Represent., pp. 1–15, 2021.
[81] Y. Tian, “Deep contrastive learning is provably (almost) principal component analysis,” arXiv preprint arXiv:2201.12680, 2022.
[82] X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised visual transformers,” in IEEE Int. Conf. Comput. Vis., pp. 9640–9649, 2021.
[83] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in IEEE Int. Conf. Comput. Vis., pp. 9650–9660, 2021.
[84] Y. Wang, X. Shen, S. X. Hu, Y. Yuan, J. L. Crowley, and D. Vaufreydaz, “Self-supervised transformers for unsupervised object discovery using normalized cut,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 14543–14553, 2022.
[85] E. Hoffer, I. Hubara, and N. Ailon, “Deep unsupervised learning through spatial contrasting,” arXiv preprint arXiv:1610.00243, 2016.
[86] Y. Xu, Q. Zhang, J. Zhang, and D. Tao, “Regioncl: exploring contrastive region pairs for self-supervised representation learning,” in Eur. Conf. Comput. Vis., pp. 477–494, Springer, 2022.
[87] M. Yang, M. Liao, P. Lu, J. Wang, S. Zhu, H. Luo, Q. Tian, and X. Bai, “Reading and writing: Discriminative and generative modeling for self-supervised text recognition,” arXiv preprint arXiv:2207.00193, 2022.
[88] R. Zhu, B. Zhao, J. Liu, Z. Sun, and C. W. Chen, “Improving contrastive learning by visualizing feature transformation,” in IEEE Int. Conf. Comput. Vis., pp. 10306–10315, 2021.
[89] M. Yang, Y. Li, Z. Huang, Z. Liu, P. Hu, and X. Peng, “Partially view-aligned representation learning with noise-robust contrastive loss,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1134–1143, 2021.
[90] A. Islam, C.-F. Chen, R. Panda, L. Karlinsky, R. Radke, and R. Feris, “A broad study on the transferability of visual representations with contrastive learning,” in IEEE Int. Conf. Comput. Vis., pp. 8845–8855, 2021.
[91] J. Li, C. Xiong, and S. C. Hoi, “Learning from noisy data with robust representation learning,” in IEEE Int. Conf. Comput. Vis., pp. 9485–9494, 2021.
[92] L. Jing, P. Vincent, Y. LeCun, and Y. Tian, “Understanding dimensional collapse in contrastive self-supervised learning,” in Int. Conf. Learn. Represent., pp. 1–11, 2022.
[93] J. Zhang, X. Xu, F. Shen, Y. Yao, J. Shao, and X. Zhu, “Video representation learning with graph contrastive augmentation,” in ACM Int. Conf. Multimedia, pp. 3043–3051, 2021.
[94] Q. Hu, X. Wang, W. Hu, and G.-J. Qi, “Adco: Adversarial contrast for efficient learning of unsupervised representations from self-trained negative adversaries,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2021.
[95] Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, and D. Larlus, “Hard negative mixing for contrastive learning,” in Neural Inf. Process. Syst., pp. 1–12, 2020.
[96] S. Purushwalkam and A. Gupta, “Demystifying contrastive self-supervised learning: Invariances, augmentations and dataset biases,” in Neural Inf. Process. Syst., pp. 1–12, 2020.
[97] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” in Neural Inf. Process. Syst., pp. 18661–18673, 2020.
[98] J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong, “ibot: Image bert pre-training with online tokenizer,” in Int. Conf. Learn. Represent., pp. 1–12, 2022.
[99] H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,” in Int. Conf. Learn. Represent., pp. 1–13, 2022.
[100] X. Chen, M. Ding, X. Wang, Y. Xin, S. Mo, Y. Wang, S. Han, P. Luo, G. Zeng, and J. Wang, “Context autoencoder for self-supervised representation learning,” arXiv preprint arXiv:2202.03026, 2022.
[101] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, “Simmim: A simple framework for masked image modeling,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9653–9663, 2022.
[102] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
[103] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, P. Dhariwal, D. Luan, and I. Sutskever, “Generative pretraining from pixels,” in Int. Conf. Mach. Learn., pp. 1691–1703, 2020.
[104] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2536–2544, 2016.
[105] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in Int. Conf. Mach. Learn., pp. 8821–8831, 2021.
[106] C. Wei, H. Fan, S. Xie, C.-Y. Wu, A. Yuille, and C. Feichtenhofer, “Masked feature prediction for self-supervised visual pre-training,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 14668–14678, 2022.
[107] X. Dong, J. Bao, T. Zhang, D. Chen, W. Zhang, L. Yuan, D. Chen, F. Wen, and N. Yu, “Peco: Perceptual codebook for bert pre-training of vision transformers,” arXiv preprint arXiv:2111.12710, 2021.
[108] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “Data2vec: A general framework for self-supervised learning in speech, vision and language,” arXiv preprint arXiv:2202.03555, 2022.
[109] Y. Chen, Y. Liu, D. Jiang, X. Zhang, W. Dai, H. Xiong, and Q. Tian, “Sdae: Self-distillated masked autoencoder,” in Eur. Conf. Comput. Vis., pp. 108–124, 2022.
[110] Q. Zhou, C. Yu, H. Luo, Z. Wang, and H. Li, “Mimco: Masked image modeling pre-training with contrastive teacher,” in ACM Int. Conf. Multimedia, pp. 4487–4495, 2022.
[111] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, “Beit v2: Masked image modeling with vector-quantized visual tokenizers,” arXiv preprint arXiv:2208.06366, 2022.
[112] C. Feichtenhofer, H. Fan, Y. Li, and K. He, “Masked autoencoders as spatiotemporal learners,” arXiv preprint arXiv:2205.09113, 2022.
[113] Y. Liang, S. Zhao, B. Yu, J. Zhang, and F. He, “Meshmae: Masked autoencoders for 3d mesh data analysis,” in Eur. Conf. Comput. Vis., pp. 37–54, 2022.
[114] Y. Pang, W. Wang, F. E. Tay, W. Liu, Y. Tian, and L. Yuan, “Masked autoencoders for point cloud self-supervised learning,” in Eur. Conf. Comput. Vis., pp. 604–621, 2022.
[115] R. Wang, D. Chen, Z. Wu, Y. Chen, X. Dai, M. Liu, Y.-G. Jiang, L. Zhou, and L. Yuan, “Bevt: Bert pretraining of video transformers,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 14733–14743, 2022.
[116] Z. Tong, Y. Song, J. Wang, and L. Wang, “Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training,” Neural Inf. Process. Syst., vol. 35, pp. 10078–10093, 2022.
[117] R. Girdhar, A. El-Nouby, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, “Omnimae: Single model masked pretraining on images and videos,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10406–10417, June 2023.
[118] A. Gupta, J. Wu, J. Deng, and L. Fei-Fei, “Siamese masked autoencoders,” in Neural Inf. Process. Syst., Nov. 2023.
[119] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, et al., “Swin transformer v2: Scaling up capacity and resolution,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 12009–12019, 2022.
[120] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision transformer backbones for object detection,” in Eur. Conf. Comput. Vis., pp. 280–296, 2022.
[121] Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “Vitpose: Simple vision transformer baselines for human pose estimation,” in Neural Inf. Process. Syst., pp. 38571–38584, 2022.
[122] Z. Liu, J. Gui, and H. Luo, “Good helper is around you: Attention-driven masked image modeling,” in AAAI Conf. Artif. Intell., pp. 1799–1807, 2023.
[123] Z. Qi, R. Dong, G. Fan, Z. Ge, X. Zhang, K. Ma, and L. Yi, “Contrast with reconstruct: Contrastive 3d representation learning guided by generative pretraining,” arXiv preprint arXiv:2302.02318, 2023.
[124] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, Y. Wei, Q. Dai, and H. Hu, “On data scaling in masked image modeling,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10365–10374, 2023.
[125] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
[126] X. Kong and X. Zhang, “Understanding masked image modeling via learning occlusion invariant feature,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6241–6251, 2023.
[127] H. Chen, Y. Wang, B. Lagadec, A. Dantcheva, and F. Bremond, “Joint generative and contrastive learning for unsupervised person re-identification,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2004–2013, 2021.
[128] L. Wang, F. Liang, Y. Li, H. Zhang, W. Ouyang, and J. Shao, “Repre: Improving self-supervised vision transformer with reconstructive pre-training,” Jan. 2022.
[129] Z. Huang, X. Jin, C. Lu, Q. Hou, M.-M. Cheng, D. Fu, X. Shen, and J. Feng, “Contrastive masked autoencoders are stronger vision learners,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–13, 2023.
[130] C. Tao, X. Zhu, W. Su, G. Huang, B. Li, J. Zhou, Y. Qiao, X. Wang, and J. Dai, “Siamese image modeling for self-supervised visual representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2132–2141, 2023.
[131] Z. Xie, Z. Geng, J. Hu, Z. Zhang, H. Hu, and Y. Cao, “Revealing the dark secrets of masked image modeling,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 14475–14485, 2023.
[132] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, “Discriminative unsupervised feature learning with convolutional neural networks,” in Neural Inf. Process. Syst., pp. 766–774, 2014.
[133] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox, “Discriminative unsupervised feature learning with exemplar convolutional neural networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 9, pp. 1734–1747, 2015.
[134] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in IEEE Int. Conf. Comput. Vis., pp. 1422–1430, 2015.
[135] P. Bojanowski and A. Joulin, “Unsupervised learning by predicting noise,” in Int. Conf. Mach. Learn., 2017.
[136] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in Int. Conf. Mach. Learn., pp. 478–487, 2016.
[137] J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of deep representations and image clusters,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5147–5156, 2016.
[138] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Eur. Conf. Comput. Vis., pp. 132–149, 2018.
[139] R. Zhang, P. Isola, and A. A. Efros, “Split-brain autoencoders: Unsupervised learning by cross-channel prediction,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1058–1067, 2017.
[140] X. Wang, K. He, and A. Gupta, “Transitive invariance for self-supervised visual representation learning,” in IEEE Int. Conf. Comput. Vis., pp. 1329–1338, 2017.
[141] A. Kolesnikov, X. Zhai, and L. Beyer, “Revisiting self-supervised visual representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1920–1929, 2019.
[142] P. Krähenbühl, “Free supervision from video games,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2955–2964, 2018.
[143] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Neural Inf. Process. Syst., pp. 2672–2680, 2014.
[144] T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby, “Self-supervised gans via auxiliary rotation loss,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 12154–12163, 2019.
[145] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer, “S4l: Self-supervised semi-supervised learning,” in IEEE Int. Conf. Comput. Vis., pp. 1476–1485, 2019.
[146] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song, “Using self-supervised learning can improve model robustness and uncertainty,” in Neural Inf. Process. Syst., pp. 15663–15674, 2019.
[147] K. Hassani and A. H. Khasahmadi, “Contrastive multi-view representation learning on graphs,” in Int. Conf. Mach. Learn., 2020.
[148] L. Gomez, Y. Patel, M. Rusiñol, D. Karatzas, and C. Jawahar, “Self-supervised learning of visual features through embedding images into text topic spaces,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4230–4239, 2017.
[149] L. Jing, Y. Chen, L. Zhang, M. He, and Y. Tian, “Self-supervised feature learning by cross-modality and cross-view correspondences,” arXiv preprint arXiv:2004.05749, 2020.
[150] L. Jing, Y. Chen, L. Zhang, M. He, and Y. Tian, “Self-supervised modal and view invariant feature learning,” arXiv preprint arXiv:2005.14169, 2020.
[151] L. Zhang and Z. Zhu, “Unsupervised feature learning for point cloud understanding by contrasting and clustering using graph convolutional neural networks,” in International Conference on 3D Vision, pp. 395–404, 2019.
[152] Y. Yang, C. Feng, Y. Shen, and D. Tian, “Foldingnet: Point cloud auto-encoder via deep grid deformation,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 206–215, 2018.
[153] M. Gadelha, R. Wang, and S. Maji, “Multiresolution tree networks for 3d point cloud processing,” in Eur. Conf. Comput. Vis., pp. 103–118, 2018.
[154] Y. Zhao, T. Birdal, H. Deng, and F. Tombari, “3d point capsule networks,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1009–1018, 2019.
[155] Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt, “Test-time training with self-supervision for generalization under distribution shifts,” in Int. Conf. Mach. Learn., 2020.
[156] Y. Gandelsman, Y. Sun, X. Chen, and A. A. Efros, “Test-time training with masked autoencoders,” arXiv preprint arXiv:2209.07522, 2022.
[157] J. J. Sun, A. Kennedy, E. Zhan, D. J. Anderson, Y. Yue, and P. Perona, “Task programming: Learning data efficient behavior representations,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2876–2885, 2021.
[158] Z. Ren and Y. Jae Lee, “Cross-domain self-supervised multi-task feature learning using synthetic imagery,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 762–771, 2018.
[159] K. Saito, D. Kim, S. Sclaroff, and K. Saenko, “Universal domain adaptation through self supervision,” in Neural Inf. Process. Syst., pp. 1–11, 2020.
[160] Y. Sun, E. Tzeng, T. Darrell, and A. A. Efros, “Unsupervised domain adaptation through self-supervision,” arXiv preprint arXiv:1909.11825, 2019.
[161] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash, “Boosting self-supervised learning via knowledge transfer,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9359–9367, 2018.
[162] Z. Hu, Y. Dong, K. Wang, K.-W. Chang, and Y. Sun, “Gpt-gnn: Generative pre-training of graph neural networks,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1857–1867, 2020.
[163] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang, “Self-supervised graph transformer on large-scale molecular data,” in Neural Inf. Process. Syst., 2020.
[164] U. Buchler, B. Brattoli, and B. Ommer, “Improving spatiotemporal self-supervision by deep reinforcement learning,” in Eur. Conf. Comput. Vis., pp. 770–786, 2018.
[165] D. Guo, B. A. Pires, B. Piot, J.-b. Grill, F. Altché, R. Munos, and M. G. Azar, “Bootstrap latent-predictive representations for multitask reinforcement learning,” arXiv preprint arXiv:2004.14646, 2020.
[166] N. Hansen, Y. Sun, P. Abbeel, A. A. Efros, L. Pinto, and X. Wang, “Self-supervised policy adaptation during deployment,” arXiv preprint arXiv:2007.04309, 2020.
[167] S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord, “Boosting few-shot visual learning with self-supervision,” in IEEE Int. Conf. Comput. Vis., pp. 8059–8068, 2019.
[168] J.-C. Su, S. Maji, and B. Hariharan, “Boosting supervision with self-supervision for few-shot learning,” arXiv preprint arXiv:1906.07079, 2019.
[169] C. Li, T. Tang, G. Wang, J. Peng, B. Wang, X. Liang, and X. Chang, “Bossnas: Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search,” in IEEE Int. Conf. Comput. Vis., 2021.
[170] L. Fan, S. Liu, P.-Y. Chen, G. Zhang, and C. Gan, “When does contrastive learning preserve adversarial robustness from pre-training to finetuning?,” in Neural Inf. Process. Syst., 2021.
[171] M. Kim, J. Tack, and S. J. Hwang, “Adversarial self-supervised contrastive learning,” in Neural Inf. Process. Syst., pp. 1–12, 2020.
[172] T. Chen, S. Liu, S. Chang, Y. Cheng, L. Amini, and Z. Wang, “Adversarial robustness: From self-supervised pre-training to fine-tuning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 699–708, 2020.
[173] Y. Lin, X. Guo, and Y. Lu, “Self-supervised video representation learning with meta-contrastive network,” in IEEE Int. Conf. Comput. Vis., pp. 8239–8249, 2021.
[174] Y. An, H. Xue, X. Zhao, and L. Zhang, “Conditional self-supervised learning for few-shot classification,” in Int. Joint Conf. Artif. Intell., pp. 2140–2146, 2021.
[175] S. Pal, A. Datta, and D. D. Majumder, “Computer recognition of vowel sounds using a self-supervised learning algorithm,” Journal of the Anatomical Society of India, pp. 117–123, 1978.
[176] A. Ghosh, N. R. Pal, and S. K. Pal, “Self-organization for object extraction using a multilayer neural network and fuzziness measures,” IEEE Transactions on Fuzzy Systems, pp. 54–68, 1993.
[177] A. Sharma, O. Grau, and M. Fritz, “Vconv-dae: Deep volumetric shape learning without object labels,” in Eur. Conf. Comput. Vis., pp. 236–250, 2016.
[178] K. Gong, X. Liang, D. Zhang, X. Shen, and L. Lin, “Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 932–940, 2017.
[179] X. Liang, K. Gong, X. Shen, and L. Lin, “Look into person: Joint body parsing & pose estimation network and a new benchmark,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 4, pp. 871–885, 2018.
[180] X. Zhan, X. Pan, B. Dai, Z. Liu, D. Lin, and C. C. Loy, “Self-supervised scene de-occlusion,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3784–3792, 2020.
[181] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan, “Learning features by watching objects move,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2701–2710, 2017.
[182] Y. Wang, J. Zhang, M. Kan, S. Shan, and X. Chen, “Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 12275–12284, 2020.
[183] Z. Chen, X. Ye, L. Du, W. Yang, L. Huang, X. Tan, Z. Shi, F. Shen, and E. Ding, “Aggnet for self-supervised monocular depth estimation: Go an aggressive step further,” in ACM Int. Conf. Multimedia, pp. 1526–1534, 2021.
[184] H. Chen, B. Lagadec, and F. Bremond, “Ice: Inter-instance contrastive encoding for unsupervised person re-identification,” in IEEE Int. Conf. Comput. Vis., pp. 14960–14969, 2021.
[185] T. Isobe, D. Li, L. Tian, W. Chen, Y. Shan, and S. Wang, “Towards discriminative representation learning for unsupervised person re-identification,” in IEEE Int. Conf. Comput. Vis., pp. 8526–8536, 2021.
[186] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha, “Self-supervised deep visual odometry with online adaptation,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6339–6348, 2020.
[187] W. Wu, Z. Y. Wang, Z. Li, W. Liu, and L. Fuxin, “Pointpwc-net: Cost volume on point clouds for (self-) supervised scene flow estimation,” in Eur. Conf. Comput. Vis., 2020.
[188] G. Xu, Z. Liu, X. Li, and C. C. Loy, “Knowledge distillation meets self-supervision,” arXiv preprint arXiv:2006.07114, 2020.
[189] J. Walker, A. Gupta, and M. Hebert, “Dense optical flow prediction from a static image,” in IEEE Int. Conf. Comput. Vis., pp. 2443–2451, 2015.
[190] F. Zhu, Y. Zhu, X. Chang, and X. Liang, “Vision-language navigation with self-supervised auxiliary reasoning tasks,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10012–10022, 2020.
[191] X. Niu, S. Shan, H. Han, and X. Chen, “Rhythmnet: End-to-end heart rate estimation from face via spatial-temporal representation,” IEEE Trans. Image Process., vol. 29, pp. 2409–2423, 2020.
[192] X. Niu, Z. Yu, H. Han, X. Li, S. Shan, and G. Zhao, “Video-based remote physiological measurement via cross-verified feature disentangling,” in Eur. Conf. Comput. Vis., 2020.
[193] Y. Xie, Z. Wang, and S. Ji, “Noise2same: Optimizing a self-supervised bound for image denoising,” in Neural Inf. Process. Syst., 2020.
[194] T. Huang, S. Li, X. Jia, H. Lu, and J. Liu, “Neighbor2neighbor: Self-supervised denoising from single noisy images,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2021.
[195] C. Yang, Z. Wu, B. Zhou, and S. Lin, “Instance localization for self-supervised detection pretraining,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3987–3996, 2021.
[196] I. Croitoru, S.-V. Bogolin, and M. Leordeanu, “Unsupervised learning from video to detect foreground objects in single images,” in IEEE Int. Conf. Comput. Vis., pp. 4335–4343, 2017.
[197] E. Xie, J. Ding, W. Wang, X. Zhan, H. Xu, Z. Li, and P. Luo, “Detco: Unsupervised contrastive learning for object detection,” arXiv preprint arXiv:2102.04803, 2021.
[198] G. Wu, J. Jiang, X. Liu, and J. Ma, “A practical contrastive learning framework for single image super-resolution,” arXiv preprint arXiv:2111.13924, 2021.
[199] S. Menon, A. Damian, S. Hu, N. Ravi, and C. Rudin, “Pulse: Self-supervised photo upsampling via latent space exploration of generative models,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2437–2445, 2020.
[200] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta, “Learning a predictable and generative vector representation for objects,” in Eur. Conf. Comput. Vis., pp. 484–499, 2016.
[201] D. Jayaraman and K. Grauman, “Learning image representations tied to ego-motion,” in IEEE Int. Conf. Comput. Vis., pp. 1413–1421, 2015.
[202] Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth, optical flow and camera pose,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1983–1992, 2018.
[203] L. Huang, Y. Liu, B. Wang, P. Pan, Y. Xu, and R. Jin, “Self-supervised video representation learning by context and motion decoupling,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 13886–13895, 2021.
[204] K. Hu, J. Shao, Y. Liu, B. Raj, M. Savvides, and Z. Shen, “Contrast and order representations for video self-supervised learning,” in IEEE Int. Conf. Comput. Vis., pp. 7939–7949, 2021.
[205] M. Tschannen, J. Djolonga, M. Ritter, A. Mahendran, N. Houlsby, S. Gelly, and M. Lucic, “Self-supervised learning of video-induced visual invariances,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 13806–13815, 2020.
[206] X. He, Y. Pan, M. Tang, Y. Lv, and Y. Peng, “Learn from unlabeled videos for near-duplicate video retrieval,” in International Conference on Research on Development in Information Retrieval, pp. 1–10, 2022.
[207] T. Han, W. Xie, and A. Zisserman, “Video representation learning by dense predictive coding,” in ICCV Workshops, 2019.
[208] T. Han, W. Xie, and A. Zisserman, “Memory-augmented dense predictive coding for video representation learning,” in Eur. Conf. Comput. Vis., 2020.
[209] B. Fernando, H. Bilen, E. Gavves, and S. Gould, “Self-supervised video representation learning with odd-one-out networks,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3636–3645, 2017.
[210] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang, “Unsupervised representation learning by sorting sequences,” in IEEE Int. Conf. Comput. Vis., pp. 667–676, 2017.
[211] D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang, “Self-supervised spatiotemporal learning via video clip order prediction,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10334–10343, 2019.
[212] S. Benaim, A. Ephrat, O. Lang, I. Mosseri, W. T. Freeman, M. Rubinstein, M. Irani, and T. Dekel, “Speednet: Learning the speediness in videos,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9922–9931, 2020.
[213] Y. Yao, C. Liu, D. Luo, Y. Zhou, and Q. Ye, “Video playback rate perception for self-supervised spatio-temporal representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6548–6557, 2020.
[214] J. Wang, J. Jiao, and Y.-H. Liu, “Self-supervised video representation learning by pace prediction,” in Eur. Conf. Comput. Vis., 2020.
[215] A. Diba, V. Sharma, L. V. Gool, and R. Stiefelhagen, “Dynamonet: Dynamic action and motion network,” in IEEE Int. Conf. Comput. Vis., pp. 6192–6201, 2019.
[216] T. Han, W. Xie, and A. Zisserman, “Self-supervised co-training for video representation learning,” in Neural Inf. Process. Syst., pp. 1–12, 2020.
[217] B. Korbar, D. Tran, and L. Torresani, “Cooperative learning of audio and video models from self-supervised synchronization,” in Neural Inf. Process. Syst., pp. 7763–7774, 2018.
[218] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in IEEE Int. Conf. Comput. Vis., pp. 609–617, 2017.
[219] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “Videobert: A joint model for video and language representation learning,” in IEEE Int. Conf. Comput. Vis., pp. 7464–7473, 2019.
[220] A. Nagrani, C. Sun, D. Ross, R. Sukthankar, C. Schmid, and A. Zisserman, “Speech2action: Cross-modal supervision for action recognition,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10317–10326, 2020.
[221] J. C. Stroud, D. A. Ross, C. Sun, J. Deng, R. Sukthankar, and C. Schmid, “Learning video representations from textual web supervision,” arXiv preprint arXiv:2007.14937, 2020.
[222] J.-B. Alayrac, A. Recasens, R. Schneider, R. Arandjelović, J. Ramapuram, J. De Fauw, L. Smaira, S. Dieleman, and A. Zisserman, “Self-supervised multimodal versatile networks,” arXiv preprint arXiv:2006.16228, 2020.
[223] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine, “Time-contrastive networks: Self-supervised learning from video,” in IEEE Int. Conf. Robot. Autom., pp. 1134–1141, 2018.
[224] X. Wang, A. Jabri, and A. A. Efros, “Learning correspondence from the cycle-consistency of time,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2566–2576, 2019.
[225] X. Li, S. Liu, S. De Mello, X. Wang, J. Kautz, and M.-H. Yang, “Joint-task self-supervised learning for temporal correspondence,” in Neural Inf. Process. Syst., pp. 318–328, 2019.
[226] A. Jabri, A. Owens, and A. A. Efros, “Space-time correspondence as a contrastive random walk,” in Neural Inf. Process. Syst., pp. 19545–19560, 2020.
[227] Z. Lai, E. Lu, and W. Xie, “Mast: A memory-augmented self-supervised tracker,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6479–6488, 2020.
[228] Z. Zhang, S. Lathuiliere, E. Ricci, N. Sebe, Y. Yan, and J. Yang, “Online depth learning against forgetting in monocular videos,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4494–4503, 2020.
[229] D. Luo, C. Liu, Y. Zhou, D. Yang, C. Ma, Q. Ye, and W. Wang, “Video cloze procedure for self-supervised spatio-temporal learning,” in AAAI Conf. Artif. Intell., pp. 11701–11708, 2020.
[230] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord, “Data-efficient image recognition with contrastive predictive coding,” in Int. Conf. Mach. Learn., 2020.
[231] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018.
[232] C. Li, J. Yang, P. Zhang, M. Gao, B. Xiao, X. Dai, L. Yuan, and J. Gao, “Efficient self-supervised vision transformers for representation learning,” arXiv preprint arXiv:2106.09785, 2021.
[233] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Neural Inf. Process. Syst., pp. 3111–3119, 2013.
[234] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders as discriminators rather than generators,” in Int. Conf. Learn. Represent., 2020.
[235] N. Pappas and J. Henderson, “Gile: A generalized input-label embedding for text classification,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 139–155, 2019.
[236] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Pre-training transformers as energy-based cloze models,” arXiv preprint arXiv:2012.08561, 2020.
[237] Z. Wu, S. Wang, J. Gu, M. Khabsa, F. Sun, and H. Ma, “Clear: Contrastive learning for sentence representation,” arXiv preprint arXiv:2012.15466, 2020.
[238] J. Giorgi, O. Nitski, B. Wang, and G. Bader, “Declutr: Deep contrastive learning for unsupervised textual representations,” arXiv preprint arXiv:2006.03659, 2020.
[239] H.-Y. Zhou, C. Lu, S. Yang, X. Han, and Y. Yu, “Preservational learning improves self-supervised medical image models by reconstructing diverse contexts,” in IEEE Int. Conf. Comput. Vis., pp. 3499–3509, 2021.
[240] K. Chaitanya, E. Erdil, N. Karani, and E. Konukoglu, “Contrastive learning of global and local features for medical image segmentation with limited annotations,” in Neural Inf. Process. Syst., 2020.
[241] J. Zhu, Y. Li, Y. Hu, K. Ma, S. K. Zhou, and Y. Zheng, “Rubik’s cube+: A self-supervised feature learning framework for 3d medical image analysis,” Medical Image Analysis, p. 101746, 2020.
[242] O. Manas, A. Lacoste, X. Giró-i Nieto, D. Vazquez, and P. Rodriguez, “Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data,” in IEEE Int. Conf. Comput. Vis., pp. 9414–9423, 2021.
[243] D. Wang, Q. Zhang, Y. Xu, J. Zhang, B. Du, D. Tao, and L. Zhang, “Advancing plain vision transformer toward remote sensing foundation model,” IEEE Trans. Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2022.
[244] J. Liu, X. Huang, Y. Liu, and H. Li, “Mixmim: Mixed and masked image modeling for efficient visual representation learning,” arXiv preprint arXiv:2205.13137, 2022.
[245] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6541–6549, 2017.
[246] Q. Garrido, R. Balestriero, L. Najman, and Y. Lecun, “Rankme: Assessing the downstream performance of pretrained self-supervised representations by their rank,” in Int. Conf. Mach. Learn., pp. 10929–10974, PMLR, July 2023.
[247] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” Int. J. Comput. Vis., vol. 88, pp. 303–338, 2010.
[248] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft coco: Common objects in context,” 2015.
[249] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[250] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” Int. J. Comput. Vis., vol. 127, no. 3, pp. 302–321, 2019.
[251] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
[252] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al., “The “something something” video database for learning and evaluating visual common sense,” in IEEE Int. Conf. Comput. Vis., pp. 5842–5850, 2017.
[253] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al., “Ava: A video dataset of spatio-temporally localized atomic visual actions,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6047–6056, 2018.
[254] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
[255] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a large video database for human motion recognition,” in IEEE Int. Conf. Comput. Vis., pp. 2556–2563, IEEE, 2011.
[256] J. Wang, Y. Gao, K. Li, J. Hu, X. Jiang, X. Guo, R. Ji, and X. Sun, “Enhancing unsupervised video representation learning by decoupling the scene and the motion,” in AAAI Conf. Artif. Intell., vol. 35, pp. 10129–10137, 2021.
[257] J. Knights, B. Harwood, D. Ward, A. Vanderkop, O. Mackenzie-Ross, and P. Moghadam, “Temporally coherent embeddings for self-supervised video representation learning,” in 2020 25th International Conference on Pattern Recognition (ICPR), pp. 8914–8921, IEEE, 2021.
[258] A. Recasens, P. Luc, J.-B. Alayrac, L. Wang, F. Strub, C. Tallec, M. Malinowski, V. Pătrăucean, F. Altché, M. Valko, et al., “Broaden your views for self-supervised video learning,” in IEEE Int. Conf. Comput. Vis., pp. 1255–1265, 2021.
[259] C. Yang, Y. Xu, B. Dai, and B. Zhou, “Video representation learning with visual tempo consistency,” arXiv preprint arXiv:2006.15489, 2020.
[260] C. Feichtenhofer, H. Fan, B. Xiong, R. Girshick, and K. He, “A large-scale study on unsupervised spatiotemporal representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3299–3309, 2021.
[261] R. Qian, T. Meng, B. Gong, M.-H. Yang, H. Wang, S. Belongie, and Y. Cui, “Spatiotemporal contrastive video representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6964–6974, 2021.
[262] J. Robinson, L. Sun, K. Yu, K. Batmanghelich, S. Jegelka, and S. Sra, “Can contrastive learning avoid shortcut solutions?,” in Neural Inf. Process. Syst., pp. 4974–4986, 2021.
[263] Y. Wei, H. Hu, Z. Xie, Z. Zhang, Y. Cao, J. Bao, D. Chen, and B. Guo, “Contrastive learning rivals masked image modeling in fine-tuning via feature distillation,” arXiv preprint arXiv:2205.14141, 2022.
[264] T. Chen, C. Luo, and L. Li, “Intriguing properties of contrastive losses,” in Neural Inf. Process. Syst., vol. 34, pp. 11834–11845, Curran Associates, Inc., 2021.
[265] Y. Tian, X. Chen, and S. Ganguli, “Understanding self-supervised learning dynamics without contrastive pairs,” in Int. Conf. Mach. Learn., pp. 10268–10278, 2021.
[266] Q. Garrido, Y. Chen, A. Bardes, L. Najman, and Y. LeCun, “On the duality between contrastive and non-contrastive self-supervised learning,” in Int. Conf. Learn. Represent., 2023.
[267] S. Lavoie, C. Tsirigotis, M. Schwarzer, A. Vani, M. Noukhovitch, K. Kawaguchi, and A. Courville, “Simplicial embeddings in self-supervised learning and downstream classification,” in Int. Conf. Learn. Represent., 2023.
[268] C. Tao, H. Wang, X. Zhu, J. Dong, S. Song, G. Huang, and J. Dai, “Exploring the equivalence of siamese self-supervised learning via a unified gradient framework,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 14431–14440, 2022.
[269] X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li, “Dense contrastive learning for self-supervised visual pre-training,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3024–3033, 2021.
[270] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Som, et al., “Image as a foreign language: Beit pretraining for all vision and vision-language tasks,” arXiv preprint arXiv:2208.10442, 2022.