A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends
Abstract—Deep supervised learning algorithms typically require a large volume of labeled data to achieve satisfactory performance.
However, the process of collecting and labeling such data can be expensive and time-consuming. Self-supervised learning (SSL),
a subset of unsupervised learning, aims to learn discriminative features from unlabeled data without relying on human-annotated
labels. SSL has garnered significant attention recently, leading to the development of numerous related algorithms. However, there is a
dearth of comprehensive studies that elucidate the connections and evolution of different SSL variants. This paper presents a review of
diverse SSL methods, encompassing algorithmic aspects, application domains, three key trends, and open research questions. Firstly,
we provide a detailed introduction to the motivations behind most SSL algorithms and compare their commonalities and differences.
Secondly, we explore representative applications of SSL in domains such as image processing, computer vision, and natural language
processing. Lastly, we discuss the three primary trends observed in SSL research and highlight the open questions that remain. A
curated collection of valuable resources can be accessed at https://github.com/guijiejie/SSL.
Index Terms—Self-supervised learning, Contrastive learning, Generative model, Representation learning, Transfer learning
1 INTRODUCTION

SSL has the ability to leverage extensive unlabeled data, since the generation of pseudo-labels does not necessitate human annotations. By utilizing these pseudo-labels during training, self-supervised algorithms have demonstrated promising outcomes, resulting in a reduced performance disparity compared to supervised algorithms in downstream tasks. Asano et al. [14] demonstrated that SSL can produce generalizable features that exhibit robust generalization even when applied to a single image.

The advancement of SSL [3], [4], [15]–[24] has exhibited rapid progress, capturing significant attention within the research community (Fig. 2), and is recognized as a crucial element for achieving human-level intelligence [25].

2 ALGORITHMS

This section begins by providing an introduction to SSL, followed by an explanation of the pretext tasks associated with SSL and their integration with other learning paradigms.

Fig. 2: Google Scholar search results for "self-supervised learning". The vertical and horizontal axes denote the number of SSL publications and the year, respectively.

2.1 What is SSL?

The introduction of SSL is attributed to [32] (Fig. 3), who employed this architecture to learn in natural environments featuring diverse modalities. Although the cow image may not warrant a cow label, it is frequently associated with a "moo" sound. The crux lies in the co-occurrence relationship between them.

Subsequently, the machine learning community has advanced the concept of SSL, which falls within the realm of unsupervised learning. SSL involves generating output labels "intrinsically" from input data examples by revealing the relationships between data components or various views of the data. These output labels are derived directly from the data examples. According to this definition, an autoencoder (AE) can be perceived as a type of SSL algorithm, where the output labels correspond to the data itself. AEs have gained extensive usage across multiple domains, including dimensionality reduction and anomaly detection.
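To make the AE-as-SSL view concrete, the following is a minimal PyTorch sketch (illustrative only, not drawn from any cited work) in which the reconstruction target is simply the input itself; the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Minimal autoencoder: the 'label' for each example is the example itself."""
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_loss(model, x):
    # Self-supervised signal: reconstruct x from its own encoding.
    return nn.functional.mse_loss(model(x), x)
```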
Fig. 3: The differences among supervised learning, unsupervised learning, and SSL. The image is reproduced from [32]. SSL utilizes freely derived labels as supervision instead of manually annotated labels.

In the keynote talk at ICLR 2020 [33], Yann LeCun elucidated the concept of SSL as an analogous process to completing missing information (reconstruction). He presented multiple variations as follows: 1) Predict any part of the input from any other part; 2) Predict the future from the past; 3) Predict the invisible from the visible; and 4) Predict any occluded, masked, or corrupted part from all available parts. In summary, a portion of the input is unknown in SSL, and the objective is to predict that particular segment.

Jing et al. [34] expanded the definition of SSL to encompass methods that operate without human-annotated labels. Consequently, any approach devoid of such labels can be categorized under SSL, effectively equating SSL with unsupervised learning. This categorization includes generative adversarial networks (GANs) [35], thereby positioning them within the realm of SSL.

Pretext tasks, also referred to as surrogate or proxy tasks, are a fundamental concept in the field of SSL. The term "pretext" denotes that the task being solved is not the primary objective but serves as a means to generate a robust pre-trained model. Prominent examples of pretext tasks include rotation prediction and instance discrimination, among others. Each pretext task necessitates the use of distinct loss functions to achieve its intended goal. Given the significance of pretext tasks in SSL, we proceed to introduce them in further detail.

2.2 Pretext tasks

This section provides a comprehensive overview of the pretext tasks employed in SSL. A prevalent approach in SSL involves devising pretext tasks for networks to solve, where the networks are trained by optimizing the objective functions associated with these tasks. Pretext tasks typically exhibit two key characteristics. Firstly, deep learning methods are employed to learn features that facilitate the resolution of pretext tasks. Secondly, supervised signals are derived from the data itself, a process known as self-supervision. Commonly employed techniques encompass four categories of pretext tasks: context-based methods, CL, generative algorithms, and contrastive generative methods. In our paper, generative algorithms primarily refer to masked image modeling (MIM) methods.

Fig. 4: Illustration of three common context-based methods: rotation, jigsaw, and colorization.

2.2.1 Context-based methods

Context-based methods rely on the inherent contextual relationships among the provided examples, encompassing aspects such as spatial structures and the preservation of both local and global consistency. We illustrate the concept of context-based pretext tasks using rotation as a simple example [36]. Subsequently, we progressively introduce additional tasks (Fig. 4).

Rotation: Gidaris et al. [7] trained deep neural networks (DNNs) to learn image representations by recognizing random geometric transformations. They streamlined image augmentation by introducing rotations of 0°, 90°, 180°, and 270° to generate three additional images from each original. This method employs rotation angles as self-supervised labels, using a set of K = 4 geometric transformations G = {g(·|y)}_{y=1}^{K}. Here, g(·|y) applies a geometric transformation labeled y to an image X, resulting in a transformed image X^y = g(X|y).

Gidaris et al. utilized a deep convolutional neural network (CNN), F(·), to perform rotation prediction through a four-class categorization task. This CNN processes an input image X^{y*}, with y* being unknown to F(·), and outputs a probability distribution over possible geometric transformations, expressed as

    F(X^{y*} | θ) = {F^y(X^{y*} | θ)}_{y=1}^{K}.  (1)

Here, F^y(X^{y*} | θ) represents the predicted probability for the geometric transformation labeled as y, while θ denotes the learnable parameters of F(·).

Given training instances D = {X_i}_{i=1}^{N}, the training objective can be formulated as

    min_θ (1/N) Σ_{i=1}^{N} L(X_i, θ).  (2)

Here, the loss function is defined as

    L(X_i, θ) = −(1/K) Σ_{y=1}^{K} log(F^y(g(X_i|y) | θ)).  (3)

In [37], the relative rotation angle was confined to the interval of [−30°, 30°]. These rotations were discretized into bins of 3° each, leading to a total of 20 classes (or bins).
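As an illustration of the rotation pretext task of Eqs. (1)-(3), the following is a minimal PyTorch sketch (not the code of [7]); `model` is assumed to be any backbone ending in a four-way classification head.

```python
import torch
import torch.nn.functional as F

def rotation_pretext_loss(model, images):
    """Rotation prediction loss of Eqs. (1)-(3): every image is rotated by
    0, 90, 180, and 270 degrees, and the network must predict which of the
    four rotations was applied."""
    b = images.size(0)
    # g(X|y): rotate by y quarter turns in the spatial plane (dims 2 and 3).
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)], dim=0)
    # Self-supervised labels: y = 0..3, one block of b images per rotation.
    labels = torch.arange(4, device=images.device).repeat_interleave(b)
    logits = model(rotated)  # F(X^y | theta): a 4-way classification head
    return F.cross_entropy(logits, labels)
```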
Colorization: The concept of colorization was initially introduced in [38], and subsequent studies [39]–[41] demonstrated its effectiveness as a pretext task for SSL. Color prediction offers the advantageous feature of requiring freely available training data. In this context, a model can utilize the lightness channel of any color image as input and the corresponding ab color channels in the CIE Lab color space as self-supervised signals. The objective is to predict the ab color channels Y ∈ R^{H×W×2} given an input lightness channel X ∈ R^{H×W×1}. A commonly employed learning objective is

    L = ||Ŷ − Y||_F^2,  (4)

where Y and Ŷ denote the ground truth and predicted values, respectively.

Besides, [38] utilized the multinomial cross-entropy loss instead of (4) to enhance robustness. Upon completing the training process, the ab color channels can be predicted for any grayscale image. Consequently, the lightness channel and the ab color channels can be concatenated to restore the grayscale image to a colorful representation.

Jigsaw: The jigsaw approach leverages jigsaw puzzles as surrogate tasks, operating under the assumption that a model accomplishes these tasks by comprehending the contextual information embedded within the examples. Specifically, images are fragmented into discrete patches, and their positions are randomly rearranged, with the objective of reconstructing the original order. In [42], the impact of scaling two self-supervised methods, namely jigsaw [8], [43] and colorization, was investigated along three dimensions: data size, model capacity, and problem complexity. The results indicated that transfer performance exhibits a log-linear growth pattern in relation to data size. Furthermore, representation quality was found to improve with higher-capacity models and increased problem complexity.

Others: The pretext task employed in [44], [45] involved a conditional motion propagation problem. To enforce a specific constraint on the feature representation process, Noroozi et al. [46] introduced an additional requirement where the sum of feature representations of all image patches should approximate the feature representation of the entire image. While many pretext tasks yield representations that exhibit covariance with image transformations, [47] argued for the importance of semantic representations being invariant to such transformations. In response, they proposed a pretext-invariant representation learning approach that enables the learning of invariant representations through pretext tasks.

2.2.2 Contrastive Learning

Numerous SSL methods based on CL have emerged, building upon the foundation of simple instance discrimination tasks [48], [49]. Notable examples include MoCo v1 [50], MoCo v2 [51], SimCLR v1 [52], and SimCLR v2 [53]. Pioneering algorithms such as MoCo have significantly enhanced the performance of self-supervised pre-training, reaching a level comparable to that of supervised learning, thus rendering SSL highly pertinent for large-scale applications. Early CL approaches were built upon the concept of utilizing negative examples. However, as CL has progressed, a range of methods have emerged that eliminate the need for negative examples. These methods embrace distinct ideas such as self-distillation and feature decorrelation, yet all adhere to the principle of maintaining positive example consistency. The following section outlines the various CL methods currently available (Fig. 5).

2.2.2.1 Negative example-based CL: Negative example-based CL adheres to a pretext task known as instance discrimination, which involves generating distinct views of an instance. In negative example-based CL, views originating from the same instance are treated as positive examples for an anchor sample, while views from different instances serve as negative examples. The underlying principle is to promote proximity between positive examples and maximize the separation between negative examples within the latent space. The definition of positive and negative examples varies depending on factors such as the modality being considered and specific requirements, including spatial and temporal consistency in video understanding or the co-occurrence of modalities in multi-modal learning scenarios. In the context of conventional 2D image CL, image augmentation techniques are utilized to generate diverse views from a single image.

MoCo: He et al. [50] framed CL as a dictionary look-up task. In this framework, a query q and a set of encoded examples {k_0, k_1, k_2, ···} serve as the keys in a dictionary. Assuming that a single key in the dictionary, denoted as k_+, matches the query q, a contrastive loss [57] function is employed. The value of this function is low when q is similar to its positive key k_+ and dissimilar to all other negative keys. In the MoCo v1 [50] framework, the InfoNCE loss function [58], a form of contrastive loss, is utilized, i.e.,

    L_q = −log [ exp(q·k_+/τ) / Σ_{i=0}^{K} exp(q·k_i/τ) ],  (5)

where τ represents the temperature hyper-parameter and (·) denotes the vector product. The summation is computed over one positive example and K negative examples. InfoNCE is derived from noise contrastive estimation (NCE) [59].

MoCo v2 [51] builds upon MoCo v1 [50] and SimCLR v1 [52], incorporating a multilayer perceptron (MLP) projection head and more data augmentations.

SimCLR: SimCLR v1 [52] employs a mini-batch sampling strategy with N instances, wherein a contrastive prediction task is formulated on pairs of augmented instances from the mini-batch, generating a total of 2N instances. Notably, SimCLR v1 does not explicitly select negative instances. Instead, for a given positive pair, the remaining 2(N − 1) augmented instances in the mini-batch are treated as negatives. Let sim(u, v) = u^T v / (||u|| ||v||) represent the cosine similarity between two instances u and v. The loss function of SimCLR v1 for a positive instance pair (i, j) is defined as

    L_{i,j} = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1_{[k≠i]} exp(sim(z_i, z_k)/τ) ],  (6)

where 1_{[k≠i]} ∈ {0, 1} is an indicator function equal to 1 if k ≠ i, and τ denotes the temperature hyper-parameter. The overall loss is computed across all positive pairs, including both (i, j) and (j, i), within the mini-batch.

In MoCo, the features generated by the momentum encoder are stored in a feature queue as negative examples. These negative examples do not undergo gradient updates during backpropagation. Conversely, SimCLR utilizes negative examples from the current mini-batch, and all of them are subjected to gradient updates during backpropagation.
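For concreteness, the following is a minimal PyTorch sketch of the two losses above (Eqs. (5) and (6)); it is illustrative only and omits the momentum encoder, queue update, and projection heads used by the actual MoCo and SimCLR implementations. Embeddings are assumed to be L2-normalized for the InfoNCE variant.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_negs, tau=0.2):
    """InfoNCE of Eq. (5). q and k_pos are (B, d) normalized embeddings of two
    views of the same instances; k_negs is a (K, d) bank of negative keys."""
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)           # (B, 1)
    l_neg = q @ k_negs.t()                                  # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive is index 0
    return F.cross_entropy(logits, labels)

def nt_xent_loss(z1, z2, tau=0.5):
    """SimCLR's loss of Eq. (6), averaged over all 2N anchors in the batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2N, d)
    sim = z @ z.t() / tau                                   # cosine similarities
    sim.fill_diagonal_(float('-inf'))                       # exclude the k = i term
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```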
Fig. 5: Illustration of different CL methods: CL based on negative examples (left), CL based on self-distillation (middle), and CL based on feature decorrelation (right). For a demonstration of the concepts of similarity and dissimilarity, one can refer to [52], [54], while for insights into decorrelation, [55], [56] provide a comprehensive overview.

Both MoCo and SimCLR rely on data augmentation techniques, including cropping, resizing, and color distortion. Notably, SimCLR made a significant contribution by highlighting the crucial role of robust data augmentation in CL, a finding subsequently confirmed by MoCo v2. Additional augmentation methods have also been explored [60]. For instance, in [61], foreground saliency levels were estimated in images, and augmentations were created by selectively copying and pasting image foregrounds onto diverse backgrounds, such as grayscale images with random grayscale levels, texture images, and ImageNet images. Furthermore, views can be derived from various sources, including different modalities such as photos and sounds [62], as well as coherence among different image channels [63].

Minimizing the contrastive loss is known to effectively maximize a lower bound of the mutual information I(x_1; x_2) between the variables x_1 and x_2 [58]. Building upon this understanding, [64] proposes principles for designing diverse views based on information theory. These principles suggest that the views should aim to maximize I(v_1; y) and I(v_2; y) (v_1, v_2, and y denoting the first view, the second view, and the label, respectively), representing the amount of information contained about the task label, while simultaneously minimizing I(v_1; v_2), indicating the shared information between inputs encompassing both task-relevant and irrelevant details. Consequently, the optimal data augmentation method is contingent on the specific downstream task. In the context of dense prediction tasks, [65] introduces a novel approach for generating different views. This study reveals that commonly employed data augmentation methods, as utilized in SimCLR, are more suitable for categorization tasks than for dense prediction tasks such as object detection and semantic segmentation. Consequently, the design of data augmentation methods tailored to specific downstream tasks has emerged as a significant area of exploration.

Given the observed benefits of strong data augmentation in enhancing CL performance [52], there has been a growing interest in leveraging more robust augmentation techniques. However, it is worth noting that solely relying on strong data augmentation can actually lead to a decline in performance [64]. The distortions introduced by strong data augmentation can alter the image structure, resulting in a distribution that differs from that of weakly augmented images. This discrepancy poses optimization challenges. To address the overfitting issue arising from strong data augmentation, [66] proposes an alternative approach. Instead of employing a one-hot distribution, they suggest using the distribution generated by weak data augmentation as a mimic. This mitigates the negative impact of strong data augmentation by aligning the distribution of strongly augmented examples with that of weakly augmented examples.

2.2.2.2 Self-distillation-based CL: Bootstrap Your Own Latent (BYOL) [67] is a prominent self-distillation algorithm designed specifically for self-supervised image representation learning, eliminating the need for negative pairs. This approach employs two identical DNNs, known as Siamese networks, with the same architecture but different weights. One serves as the online network, while the other is the target network. Similar to MoCo [50], BYOL enhances the target network through a gradual averaging of the online network. Siamese networks have emerged as prevalent architectures in contemporary self-supervised visual representation learning models, including SimCLR, BYOL, and SwAV [68]. These models aim to maximize the similarity between two augmented versions of a single image while incorporating specific conditions to mitigate the risk of collapsing solutions.

Simple Siamese (SimSiam) networks, introduced by [69], offer a straightforward approach to learning effective representations in SSL without the need for negative example pairs, large batches, or momentum encoders. Given a data point x and two randomly augmented views x_1 and x_2, an encoder f and an MLP prediction head h process these views. The resulting outputs are denoted as p_1 = h(f(x_1)) and z_2 = f(x_2). The objective of [69] is to minimize their negative cosine similarity:

    D(p_1, z_2) = −(p_1/||p_1||_2) · (z_2/||z_2||_2).  (7)

Here, ||·||_2 represents the l_2-norm. Similar to [67], a symmetric loss [69] is defined as

    L = (1/2)(D(p_1, z_2) + D(p_2, z_1)).  (8)

This loss is defined for the example x, and the overall loss is the average over all examples. Notably, [69] employs a stop-gradient (stopgrad) operation by modifying Eq. (7) as D(p_1, stopgrad(z_2)). This implies that z_2 is treated as a constant. Similarly, Eq. (8) is revised as

    L = (1/2)(D(p_1, stopgrad(z_2)) + D(p_2, stopgrad(z_1))).  (9)
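A minimal PyTorch sketch of the symmetric stop-gradient loss of Eq. (9) follows (illustrative only, not the official SimSiam code); p1 and p2 are prediction-head outputs and z1 and z2 the corresponding encoder outputs, computed by the caller.

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Symmetric SimSiam loss of Eq. (9): negative cosine similarity with a
    stop-gradient (here .detach()) applied to the encoder outputs."""
    def d(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()  # D(p, stopgrad(z))
    return 0.5 * (d(p1, z2) + d(p2, z1))
```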
Fig. 6: Comparison among different Siamese architectures (SimCLR, BYOL, SwAV, and SimSiam). The image is reproduced from [69].

Figure 6 illustrates the distinctions among SimCLR, BYOL, SwAV, and SimSiam. The categorization of BYOL and SimSiam as CL methods is a subject of debate due to their exclusion of negative examples. However, to be consistent with [70], this paper considers BYOL and SimSiam to belong to CL methods.

2.2.2.3 Feature decorrelation-based CL: The objective of feature decorrelation is to learn decorrelated features.

Barlow Twins: Barlow Twins [55] introduced a novel loss function that encourages the similarity of embedding vectors from distorted versions of an example while minimizing redundancy between their components. Similar to other SSL methods such as MoCo [50] and SimCLR [52], Barlow Twins generates two distorted views Y^A and Y^B via a distribution of data augmentations T for each image in a data batch sampled from a dataset, resulting in batches of embeddings Z^A and Z^B. The loss function of Barlow Twins is defined as

    L_BT = Σ_i (1 − C_ii)^2 + λ Σ_i Σ_{j≠i} C_ij^2.  (10)

Here, λ is a hyper-parameter, and C represents the cross-correlation matrix computed between the two batches of embeddings Z^A and Z^B, defined as

    C_ij = Σ_b z^A_{b,i} z^B_{b,j} / ( sqrt(Σ_b (z^A_{b,i})^2) sqrt(Σ_b (z^B_{b,j})^2) ),  (11)

where b indexes the examples in the batch and i, j index the embedding dimensions.

VICReg: VICReg (variance-invariance-covariance regularization) [56] is an approach for training joint embedding architectures that simultaneously considers variance, invariance, and covariance. Similar to Barlow Twins, VICReg generates two distorted views Y^A and Y^B via a distribution of the data augmentation T and gets their embeddings Z^A ∈ R^{n×d} and Z^B ∈ R^{n×d}. Let the subscript j index the embedding dimension, and let d and n represent the dimensionality of the vectors in Z^A and the batch size, respectively. The main contribution of VICReg is the variance preservation term, which explicitly prevents a collapse due to a shrinkage of the embedding vectors toward zero. The variance regularization term v in VICReg is defined as a hinge loss function applied to the standard deviation of the embeddings along the batch dimension:

    v(Z^A) = (1/d) Σ_{j=1}^{d} max(0, γ − S(z^A_j, ε)).  (12)

Here, z^A_j represents the vector composed of each value at dimension j in Z^A, and S represents the regularized standard deviation, defined as

    S(y, ε) = sqrt(Var(y) + ε).  (13)

The constant γ determines the target standard deviation and is set to 1 in the experiments, while ε is a small scalar used to prevent numerical instabilities. This criterion encourages the variance within the current batch to be equal to or greater than γ for every dimension, thereby preventing collapse scenarios where all data are mapped to the same vector.

The invariance criterion s in VICReg, which captures the similarity between Z^A and Z^B, is defined as the mean-squared Euclidean distance between each pair of data without any normalization:

    s(Z^A, Z^B) = (1/n) Σ_{b=1}^{n} ||z^A_b − z^B_b||_2^2.  (14)

In addition, the covariance criterion c(Z) in VICReg is defined as

    c(Z) = (1/d) Σ_{i≠j} [C(Z)]_{i,j}^2,  (15)

where C(Z) represents the covariance matrix of Z. The overall loss of VICReg is a weighted sum of the variance, invariance, and covariance terms:

    L = s(Z^A, Z^B) + α (v(Z^A) + v(Z^B)) + β (c(Z^A) + c(Z^B)).  (16)
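The two feature decorrelation objectives above can be sketched in a few lines of PyTorch; the snippet below is illustrative only (hyper-parameter values are assumptions) and shows the Barlow Twins loss of Eqs. (10)-(11) together with the VICReg variance term of Eqs. (12)-(13).

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-9):
    """Barlow Twins loss of Eqs. (10)-(11): push the cross-correlation matrix
    of the two embedding batches toward the identity."""
    n = z_a.size(0)
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)   # standardize along the batch
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)
    c = (z_a.t() @ z_b) / n                          # (d, d) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag

def vicreg_variance_term(z, gamma=1.0, eps=1e-4):
    """VICReg variance regularization of Eqs. (12)-(13): hinge on the
    per-dimension standard deviation computed over the batch."""
    std = torch.sqrt(z.var(dim=0) + eps)
    return torch.clamp(gamma - std, min=0).mean()
```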
TABLE 2: Categorization of MIM methods based on the reconstruction target.
Low-Level Targets: ViT [5], MAE [70], SimMIM [101] (raw pixels); MaskFeat [106] (HOG).
High-Level Targets: BEiT [99], CAE [100] (VQ-VAE tokens); PeCo [107] (VQ-GAN tokens).
Self-Distillation: data2vec [108], SdAE [109] (the model's own features).
Contrastive / Multi-modal Teacher: MimCo [110] (MoCo v3); BEiT v2 [111] (CLIP).
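Most of the MIM methods in Table 2 share the same training skeleton: randomly mask patch tokens and reconstruct a target for the masked positions. The sketch below illustrates this skeleton for a low-level (raw pixel) target in PyTorch; it is a simplified illustration rather than the implementation of any specific method, and the tensor layout is an assumption.

```python
import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """MAE-style random masking: keep a random subset of patch tokens per
    sample and return a binary mask marking the dropped (masked) patches."""
    b, n, d = patch_tokens.shape
    num_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n, device=patch_tokens.device)
    ids_keep = noise.argsort(dim=1)[:, :num_keep]            # random subset per sample
    visible = torch.gather(patch_tokens, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, device=patch_tokens.device)
    mask.scatter_(1, ids_keep, 0.0)                          # 1 = masked, 0 = visible
    return visible, mask

def masked_pixel_loss(pred_patches, target_patches, mask):
    """Mean-squared reconstruction error, computed on masked patches only."""
    per_patch = (pred_patches - target_patches).pow(2).mean(dim=-1)   # (b, n)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```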
Generative pre-training has also evolved in the video domain. BEVT [115] decouples video representation learning into spatial representation learning and temporal dynamics learning. It first undertakes masked image modeling on image data, followed by a joint approach of masked image modeling and masked video modeling on video data. This accelerates training and achieves results comparable to those of strongly-supervised baselines. Similarly, VideoMAE [116] extends the MAE to videos and discovers that an extremely high masking ratio (90% to 95%) is permissible in video mask modeling. Moreover, it remains effective even on very small datasets, consisting of only 3,000 to 4,000 videos. OmniMAE [117] demonstrates that a unified model can be concurrently trained across multiple visual modalities, breaking the paradigm of previously studying different modes in isolation. This significantly streamlines the training process, enabling more efficient development of large-scale model architectures. SiamMAE [118] indicates that, contrary to images that are (approximately) isotropic, the temporal dimension is unique, necessitating an asymmetric approach to processing temporal and spatial information, as not all spatiotemporal orientations are equally probable.

MIM has demonstrated significant potential in pre-training vision transformers [119]–[121]. However, in prior works, the random masking of image patches led to an underutilization of valuable semantic information essential for effective visual representation learning. Liu et al. [122] introduced an attention-driven masking strategy to explore improvements over random masking for insufficient semantic utilization.

2.2.4 Contrastive Generative Methods

As stated in [123], contrastive models tend to be data-hungry and vulnerable to overfitting issues, whereas generative models encounter data-filling challenges and exhibit inferior data scaling capabilities when compared to contrastive models. While contrastive models often focus on global views [83], overlooking internal structures within images, MIM primarily models local relationships. The divergent characteristics and challenges encountered in contrastive self-supervised learning and generative self-supervised learning have motivated researchers to explore the combination of these two kinds of approaches.

To elaborate further, let us compare the challenges faced by contrastive self-supervised methods and generative self-supervised methods. Generative self-supervised methods are characterized as data-filling approaches [124]. For a model of a certain size, when the dataset reaches a certain magnitude, further scaling of the data does not lead to significant performance gains in generative self-supervised methods. In contrast, recent studies have revealed the potential of data scaling to enhance the performance of CL [125]. As data increases, CL shows substantial performance improvements, demonstrating remarkable generalization without additional fine-tuning on downstream tasks. However, the scenario differs in low-data regimes. Contrastive models may find shortcuts with trivial representations that overfit the limited data [50], thus leading to inconsistent improvements in generalization performance for downstream tasks using pre-trained models with contrastive self-supervised methods [123]. On the other hand, generative methods are more adept at handling low-data scenarios and can even achieve notable performance improvements when data is extremely scarce, such as with only 10 images [126].

Several endeavors have sought to integrate both types of algorithms [123], [127]. In [127], GANs are employed for online data augmentation in CL. The study devises a contrastive module that learns view-invariant features for generation and introduces a view-invariant loss function to facilitate learning between original and generated views. On the other hand, [98] draws inspiration from both BEiT and DINO [83]. It modifies the tokenizer of BEiT to an online distilled teacher while integrating cross-view distillation from the DINO framework. As a result, iBOT [98] significantly enhances linear probing accuracy compared to MIM-only baselines. RePre [128] integrates local feature learning into self-supervised vision transformers through reconstructive pre-training, an approach that enhances contrastive frameworks. This is achieved by incorporating an additional branch dedicated to reconstructing raw image pixels, which operates concurrently with the established contrastive objective. CMAE [129] concurrently performs CL and MIM tasks. To align CL with MIM effectively, CMAE introduces two
novel components: pixel shifting for generating plausible positive views, and a feature decoder for enhancing the features of contrastive pairs. This approach significantly improves the quality of representation and transfer performance compared to its MIM-only counterparts. SiameseIM [130] does not simply merge the objectives of CL and MIM, but rather utilizes the views generated by CL as the target for MIM reconstruction in the latent space.

Despite attempts to combine both types of approaches, naive combinations may not always yield performance gains and can even perform worse than the generative model baseline, thereby exacerbating the issue of representation over-fitting [123]. The performance degradation could be attributed to the disparate properties of CL and generative methods. For instance, CL methods typically exhibit longer attention distances, whereas generative methods tend to favor local attention [131]. In light of this challenge, RECON [123] emerges as a solution by training generative modeling to guide CL, thereby leveraging the benefits of both paradigms.

2.2.5 Summary

As described above, numerous pretext tasks for SSL have been devised, with several significant milestone variants depicted in Fig. 8. Several other pretext tasks are available [132], [133], encompassing diverse approaches such as relative patch location [134], noise prediction [135], feature clustering [136]–[138], cross-channel prediction [139], and combining different cues [140]. Kolesnikov et al. [141] conducted a comprehensive investigation of previously proposed SSL pretext tasks, yielding significant insights. Besides, Krähenbühl et al. [142] proposed an alternative approach to pretext tasks and demonstrated the ease of obtaining data from video games.

It has been observed that context-based approaches exhibit limited applicability due to their inferior performance. In the realm of visual SSL, two dominant types of algorithms are CL and MIM. While visual CL may encounter overfitting issues, CL algorithms that incorporate multi-modality, exemplified by CLIP [2], have gained popularity.

2.3 Combinations with other learning paradigms

It is essential to acknowledge that the advancements in SSL did not occur in isolation; instead, they have been the result of continuous development over time. In this section, we provide a comprehensive list of relevant learning paradigms that, when combined with SSL, contribute to a clearer understanding of their collective impact.

2.3.1 GANs

GANs represent classical unsupervised learning methods and were among the most successful approaches in this domain before the surge of SSL techniques. The integration of GANs with SSL offers various avenues, with self-supervised GANs (SS-GAN) serving as one such example. The GANs' objective function [35], [143] is given as

    min_G max_D V(G, D) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].  (18)

The SS-GAN [144] is defined by combining the objective functions of GANs with the concept of rotation [7]:

    L_G(G, D) = −V(G, D) − α E_{x∼p_G} E_{r∼R}[log Q_D(R = r | x^r)],  (19)

    L_D(G, D) = V(G, D) − β E_{x∼p_data} E_{r∼R}[log Q_D(R = r | x^r)],  (20)

where V(G, D) represents the objective function of GANs as given in Eq. (18), and r ∼ R refers to a rotation selected from a set of possible rotations, similar to the concept presented in [7]. Here, x^r denotes an image x rotated by r degrees, and Q_D(R | x^r) corresponds to the discriminator's predictive distribution over the angles of rotation for a given example x. Notably, rotation [7] serves as a classical SSL method. The SS-GAN incorporates rotation awareness into the GANs' generation process by integrating the rotation prediction task during training.

2.3.2 Semi-supervised learning

SSL and semi-supervised learning are contrasting paradigms that can be effectively combined. One notable example of this combination is self-supervised semi-supervised learning (S4L) [145]. In S4L, the objective function is given by

    L = min_θ L_l(D_l, θ) + w L_u(D_u, θ).  (21)

This means optimizing the corresponding loss objectives on a labeled dataset D_l and an unlabeled dataset D_u. L_l is the categorization loss (e.g., cross-entropy), L_u stands for the self-supervised loss (e.g., the rotation task in Eq. (3)), and θ denotes the learnable parameters.

Incorporating SSL as an auxiliary task is a well-established approach in semi-supervised learning. Another classical method to leverage SSL within this context involves implementing SSL on unlabeled data, followed by fine-tuning the resultant model on labeled data, as demonstrated in SimCLR.

To demonstrate the robustness of self-supervision against adversarial perturbations, Hendrycks et al. [146] proposed an overall loss function as a linear combination of supervised and self-supervised losses:

    L(x, y, θ) = L_CE(y, p(y | PGD(x)), θ) + λ L_SS(PGD(x), θ),  (22)

where x is the example, y is the one-hot vector of the ground truth, and θ denotes the model parameters. The adversarial example is generated from x by projected gradient descent (PGD), and adversarial training is implemented with the cross-entropy loss L_CE. L_SS is the self-supervised loss.

2.3.3 Multi-instance learning (MIL)

Miech et al. [13] introduced an extension of the InfoNCE loss (5) for MIL and termed it MIL-NCE:

    max_{f,g} Σ_{i=1}^{n} log [ Σ_{(x,y)∈P_i} e^{f(x)^T g(y)} / ( Σ_{(x,y)∈P_i} e^{f(x)^T g(y)} + Σ_{(x',y')∈N_i} e^{f(x')^T g(y')} ) ],  (23)

where x and y represent a video clip and a narration, respectively. The functions f and g generate embeddings of x and y, respectively. For a specific example indexed by i, P_i denotes the set of positive video/narration pairs, while N_i corresponds to the set of negative video/narration pairs.
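A simplified PyTorch sketch of Eq. (23) is given below. It is illustrative only: it assumes the negative set N_i consists of the same clip paired with all non-positive narrations in the batch (the full method draws negatives more broadly), and it minimizes the negative of the maximization objective.

```python
import torch

def mil_nce_loss(video_emb, text_emb, pos_mask):
    """Simplified MIL-NCE of Eq. (23). video_emb: (B, d) clip embeddings f(x);
    text_emb: (M, d) narration embeddings g(y); pos_mask: (B, M) boolean
    matrix marking the positive narration set P_i of each clip."""
    scores = torch.exp(video_emb @ text_emb.t())     # e^{f(x)^T g(y)}, shape (B, M)
    pos = (scores * pos_mask).sum(dim=1)             # numerator: sum over P_i
    denom = scores.sum(dim=1)                        # denominator: positives + negatives
    return -torch.log(pos / denom).mean()            # minimize the negative objective
```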
2.3.4 Multi-view/multi-modal(ality) learning

Observation plays a vital role in infants' acquisition of knowledge about the world. Notably, they can grasp the concept of apples through observational and comparative processes, which distinguishes their learning approach from traditional supervised algorithms that rely on extensive labeled apple data. This phenomenon was demonstrated by Orhan et al. [22], who gathered perceptual data from infants and employed an SSL algorithm to model how infants learn the concept of "apple". Moreover, infants' learning about the world extends to multi-view and multi-modal(ality) learning [2], encompassing various sensory inputs such as video and audio. Hence, SSL and multi-view/multi-modal(ality) learning converge naturally in infants' learning mechanisms as they explore and comprehend the workings of the world.

2.3.4.1 Multiview CL: The objective function in standard multiview CL, as proposed by Tian et al. [64], is given by

    L_NCE = E[L_q],  (24)

where L_q corresponds to Eq. (5). Multiview CL treats different views of the same sample as positive examples for contrastive learning. Tian et al. [64] introduced both unsupervised and semi-supervised multiview learning based on adversarial learning. Let X̂ denote g(X), i.e., X̂ = g(X). Two encoders, f_1 and f_2, were trained to maximize I_NCE(X̂_1, X̂_{2:3}) as stated in Eq. (24). A flow-based model g was trained to minimize I_NCE(X̂_1, X̂_{2:3}), and {X_1, X_{2:3}} is obtained by splitting the image over its channels. Formally, the objective function for unsupervised view learning can be expressed as

    min_g max_{f_1, f_2} I_NCE^{f_1, f_2}(g(X)_1, g(X)_{2:3}).  (25)

In the context of semi-supervised view learning, when several labeled examples are available, the objective function is formulated as

    min_{g, c_1, c_2} max_{f_1, f_2} I_NCE^{f_1, f_2}(g(X)_1, g(X)_{2:3}) + L_ce(c_1(g(X)_1), y) + L_ce(c_2(g(X)_{2:3}), y),  (26)

where y represents the labels, c_1 and c_2 are classifiers, and L_ce denotes the cross-entropy. Further relevant works can be found in [63], [64], [147]. Table 3 summarizes different SSL losses.

2.3.4.2 Images and text: In the study conducted by Gomez et al. [148], the authors employed a topic modeling framework to project the text of an article into the topic probability space. This semantic-level representation was then utilized as the self-supervised signal for training CNN models on images. On a similar note, CLIP [2] leverages a CL-style pre-training task to predict the correspondence between captions and images. Benefiting from the CL paradigm, CLIP is capable of training models from scratch on an extensive dataset comprising 400 million image-text pairs collected from the internet. Consequently, CLIP's advancements have significantly propelled multi-modal learning to the forefront of research attention.

2.3.4.3 Point clouds and other modalities: Several SSL methods have been proposed for joint learning of 3D point cloud features and 2D image features by leveraging cross-modality and cross-view correspondences through triplet and cross-entropy losses [149]. Additionally, there are efforts to jointly learn view-invariant and mode-invariant characteristics from diverse modalities, such as images, point clouds, and meshes, using heterogeneous networks for 3D data [150]. SSL has also been employed for point cloud datasets, with approaches including CL and clustering based on graph CNNs [151]. Furthermore, AEs have been used for point clouds in works like [113], [114], [152], [153], while capsule networks have been applied to point cloud data in [154].

2.3.5 Test time training

Sun et al. [155] introduced "test time training (TTT) with self-supervision" to enhance the performance of predictive models when the training and test data come from distinct distributions. TTT converts an individual unlabeled test example into an SSL problem, enabling model parameter updates before making predictions. Recently, Gandelsman et al. [156] combined TTT with MAE for improved performance. They argued that, by treating TTT as a one-sample learning problem, optimizing a model for each test input could be addressed using the MAE as

    h_0 = argmin_h (1/n) Σ_{i=1}^{n} L_m(h ∘ f_0(x_i), y_i),  (27)

    f_x, g_x = argmin_{f,g} L_s(g ∘ f(mask(x)), x).  (28)

Here, f and g refer to the encoder and decoder of MAE, respectively, and h denotes the main task head.

TTT achieves an improved bias-variance tradeoff under distribution shifts. A static model heavily depends on training data that may not accurately represent the new test distribution, leading to bias. On the other hand, training a new model from scratch for each test input, ignoring all training data, is undesirable. This approach results in an unbiased representation for each test input but exhibits high variance because it relies on a single example.

2.3.6 Summary

The evolution of SSL is characterized by its dynamic and interconnected nature. Analyzing the amalgamation of various methods allows for a clearer grasp of SSL's developmental trajectory. An exemplar of this success is evident in CLIP, which effectively combines CL with multi-modal learning, leading to remarkable achievements. SSL has been extensively integrated with various machine learning tasks, showcasing its versatility and potential. It has been combined with clustering [68], semi-supervised learning [145], multi-task learning [157], [158], transfer learning [159]–[161], graph NNs [147], [162], [163], reinforcement learning [164]–[166], few-shot learning [167], [168], neural architecture search [169], robust learning [146], [170]–[172], and meta-learning [173], [174]. This diverse integration underscores the widespread applicability and impact of SSL in the machine learning domain.
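Since CLIP recurs throughout this section as the flagship combination of CL and multi-modal learning, a simplified sketch of a CLIP-style symmetric image-text InfoNCE objective is given below. It is illustrative only and is not CLIP's actual implementation, which additionally learns the temperature and trains with very large distributed batches.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, tau=0.07):
    """CLIP-style symmetric InfoNCE: matched image-text pairs (the diagonal)
    are positives; every other pairing in the batch acts as a negative."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```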
TABLE 4: Experimental results of the tested algorithms for linear classification and transfer learning tasks. DB denotes the
default batch size. The symbol “-” indicates the absence or unavailability of the data point in the respective paper. The
subscripts A, R, and V represent AlexNet, ResNet-50, and ViT-B, respectively. The superscript “e” indicates the utilization
of extra data, specifically VOC2012.
Methods Linear Probe Fine-Tuning VOC det VOC seg COCO det COCO seg ADE20K seg DB
Random: 17.1A [8] - 60.2eR [69] 19.8A [8] 36.7R [50] 33.7R [50] - -
R50 Sup 76.5 [68] 76.5 [68] 81.3e [69] 74.4 [67] 40.6 [50] 36.8 [50] - -
ViT-B Sup 82.3 [70] 82.3 [70] - - 47.9 [70] 42.9 [70] 47.4 [70] -
Context-Based:
Jigsaw [8] 45.7R [68] 54.7 61.4R [42] 37.6 - - - 256
Colorization [38] 39.6R [68] 40.7 [7] 46.9 35.6 - - - -
Rotation [7] 38.7 50.0 54.4 39.1 - - - 128
CL Based on Negative Examples:
Examplar [132] 31.5 [48] - - - - - - -
Instdisc [48] 54.0 - 65.4 - - - - 256
MoCo v1 [50] 60.6 - 74.9 - 40.8 36.9 - 256
SimCLR [52] 73.9V [82] - 81.8e [69] - 37.9 [69] 33.3 [69] - 4096
MoCo v2 [51] 72.2 [69] - 82.5e - 39.8 [56] 36.1 [56] - 256
MoCo v3 [82] 76.7 83.2 - - 47.9 [70] 42.7 [70] 47.3 [70] 4096
CL Based on Clustering:
SwAV [68] 75.3 - 82.6e [56] - 41.6 37.8 [56] - 4096
CL Based on Self-distillation:
BYOL [67] 74.3 - 81.4e [69] 76.3 40.4 [56] 37.0 [56] - 4096
SimSiam [69] 71.3 - 82.4e [69] - 39.2 34.4 - 512
DINO [83] 78.2 83.6 [98] - - 46.8 [100] 41.5 [100] 44.1 [99] 1024
CL Based on Feature Decorrelation:
Barlow Twins [55] 73.2 - 82.6e [56] - 39.2 34.3 - 2048
VICReg [56] 73.2 - 82.4e - 39.4 36.4 - 2048
Masked Image Modeling (ViT-B by default):
Context Encoder [104] 21.0A [7] - 44.5A [7] 30.0A - - - -
BEiT v1 [99] 56.7 [111] 83.4 [98] - - 49.8 [70] 44.4 [70] 47.1 [70] 2000
MAE [70] 67.8 83.6 - - 50.3 44.9 48.1 4096
SimMIM [101] 56.7 83.8 - - 52.3Swin−B [244] - 52.8Swin−B [244] 2048
PeCo [107] - 84.5 - - 43.9 39.8 46.7 2048
iBOT [98] 79.5 84.0 - - 51.2 44.2 50.0 1024
MimCo [110] - 83.9 - - 44.9 40.7 48.91 2048
CAE [100] 70.4 83.9 - - 50 44 50.2 2048
data2vec [108] - 84.2 - - - - - 2048
SdAE [109] 64.9 84.1 - - 48.9 43.0 48.6 768
BEiT v2 [111] 80.1 85.5 - - - - 53.1 2048
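Linear probe results such as those in Table 4 are typically obtained by freezing the pre-trained backbone and training only a linear classifier on top of its features. A minimal sketch of this protocol follows (the hyper-parameters are illustrative assumptions; individual papers differ in their exact settings).

```python
import torch
import torch.nn as nn

def linear_probe(backbone, loader, feat_dim, num_classes, epochs=10, lr=0.1):
    """Standard linear-probe protocol: freeze the pre-trained backbone and
    train only a linear classifier on its frozen features."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False
    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)          # frozen representation
            loss = nn.functional.cross_entropy(clf(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```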
The evaluation of object detection on the PASCAL VOC dataset employs mean average precision (mAP), specifically AP50. By default, the object detection task on PASCAL VOC employs VOC2007 for training. However, certain methods employ the combined 07+12 dataset and are annotated with a superscript "e". As for the object detection and instance segmentation tasks on COCO, we adopt the bounding-box AP (APbb) and mask AP (APmk) metrics, in accordance with [50]. The results on video understanding are evaluated using fine-tuned Top-1 accuracy as the metric.

4.2 Summary

First, the linear probe performance of contrastive learning models typically surpasses that of other algorithms, and contrastive learning approaches tend to regard the linear probe as a significant performance metric. This superiority is attributed to contrastive learning generating well-structured latent spaces, wherein distinct categories are effectively separated and similar categories are appropriately clustered.

Secondly, it is observed that pre-trained models using MIM can be fine-tuned to achieve superior performance in most cases. Conversely, pre-trained models based on CL lack this property. One primary reason for this discrepancy lies in the increased susceptibility of CL-based models to overfitting [66], [262], [263]. This observation also extends to the fine-tuning of pre-trained models for downstream tasks. MIM-based approaches consistently exhibit substantial performance enhancements in downstream tasks, while CL-based methods offer comparatively limited assistance.

Thirdly, CL-based methods tend to employ resource-intensive techniques like momentum encoders, memory queues, and multi-crop, significantly increasing the demands on computing, storage, and communication resources. In contrast, MIM-based methods have a more efficient resource utilization, possibly attributed to the absence of example interactions. This advantageous property allows MIM-based algorithms to easily scale up models and data, efficiently leveraging modern GPUs for highly parallel computing. We compare the computational complexity of different SSL methods in Table 1 of the Appendix. Note that the primary source of time complexity and memory consumption is the neural network itself rather than the SSL-specific components, e.g., the calculation of the cross-correlation matrix in Barlow Twins.

5 CONCLUSIONS, FUTURE TRENDS, AND OPEN QUESTIONS

In summary, this comprehensive review offers essential insights into contemporary SSL research, providing newcomers with an overall picture of the field. The paper presents a thorough survey of SSL from three main perspectives: algorithms, applications, and future trends.
Contrastive Methods
Method | Pre-training Dataset | Backbone | Linear Probe | UCF101 [254] Linear / Fine-tune | HMDB51 [255] Linear / Fine-tune
DSM [256] | K400 | R3D34 | - | - / 78.2 | - / 52.8
TCE [257] | K400 | R50 | - | - / 71.2 | - / 36.6
CoCRL [216] | K400 | S3D-G | - | 74.5 [258] / 87.9 | 46.1 [258] / 54.6
CoCRL | K400 | 2×S3D-G | - | - / 90.6 | - / 62.9
VTHCL [259] | K400 | R3D50 | 37.8 [260] | - / 82.1 | - / 49.2
CVRL [261] | K400 | R3D50 | 66.1 | 89.2 / 92.2 | 57.3 / 66.7
CVRL | K600 | R3D50 | 70.4 | 90.6 / 93.4 | 59.7 / 68.0
ρBYOL [260] | K400 | R3D50 | 71.5 | - / 95.5 | - / 73.6
ρBYOL | K400 | S3D-G | - | - / 96.3 | - / 75.0
BraVe [258] | K400 | R3D50 | - | 90.6 / 93.7 | 65.1 / 72.0
BraVe | K600 | R3D50 | 69.1 | 91.9 / 94.4 | 67.6 / 73.9

Masked Image Modeling Methods
Method | Pre-training Dataset | Backbone | K400 [251] | SSv2 [252] | AVA [253]
MaskFeat [106] | K400 | MViTv2-L/312 | 86.4 | 74.4 | 37.5
BEVT [115] | K400 | Swin-B | 76.2 | 67.1 | -
BEVT | IN1K + K400 | Swin-B | 80.6 | 70.6 | -
VideoMAE [116] | K400 | ViT-B | 80.0 | 68.5 | 26.7
VideoMAE | SSv2 | ViT-B | 69.6 | 79.6 | -
VideoMAE | SSv2 | ViT-L | - | 75.4 | 34.3
MAE-ST [112] | K400 | ViT-L | 84.8 | 72.1 | 32.3
OmniMAE [117] | IN1K + K400 | ViT-B | 80.8 | 69.0 | -
OmniMAE | IN1K + SSv2 | ViT-B | 80.6 | 69.5 | -
OmniMAE | IN1K + SSv2 | ViT-L | 84.0 | 74.2 | -
stream visual SSL algorithms, classifying them into four major types: context-based methods, generative methods, contrastive methods, and contrastive generative methods. Furthermore, we investigate the correlation between SSL and other learning paradigms. Lastly, we delve into future trends and open problems, as outlined below.

Main trends: Firstly, the theoretical cloud still looms over SSL. How can we understand different SSL algorithms and unify them in the same way physics seeks to unify the four fundamental forces? [54] analyzed the key properties of contrastive learning based on negative samples, enhancing the understanding of representation distributions. [78] rethought contrastive learning from the perspective of spectral decomposition, providing a high-level understanding of why contrastive learning is effective. [264] examined practical properties of contrastive losses, and InfoMin [64] indicated that the design of views should take downstream tasks into account. [265] investigated why distillation-based methods do not collapse. [266] demonstrated the duality between negative-example-based contrastive learning and covariance-regularization-based methods such as Barlow Twins, indicating that the latter can be seen as contrastive between the dimensions of the embeddings rather than between the samples. [267] demonstrated that introducing discrete sparse overcomplete representations for SSL can improve generalization. [268] presented the connections and distinctions among various SSL methods from the perspective of gradients. We anticipate that new theoretical studies will aid in comprehending and unifying various SSL approaches, particularly in harmonizing CL-based methods with MIM-based methods.

Secondly, a crucial question arises concerning the automatic design of an optimal pretext task to enhance the performance of a fixed downstream task. Various methods have been proposed to address this challenge, including the pixel-to-propagation consistency method [65] and dense contrastive learning [269]. However, this problem remains insufficiently resolved, and further theoretical investigations are warranted in this direction.

Thirdly, there is a pressing need for a unified SSL paradigm that encompasses multiple modalities. MIM has demonstrated remarkable progress in vision tasks, akin to the success of masked language models in NLP, suggesting the possibility of unifying learning paradigms. Additionally, the ViT architecture bridges the gap between the visual and verbal modalities, enabling the construction of a unified transformer model for both CV and NLP tasks. Recent endeavors [108], [270] have sought to unify SSL models, yielding impressive results in downstream tasks and showing broad applicability. Nevertheless, NLP has advanced further in leveraging SSL models, prompting the CV community to draw inspiration from NLP approaches to effectively harness the potential of pre-trained models.

Open problems: Firstly, can SSL effectively leverage vast amounts of unlabeled data? How does it consistently benefit from additional unlabeled data, and how can we determine the theoretical inflection point?

Secondly, it is pertinent to explore the interconnection between SSL and multi-modality learning, as both methodologies share resemblances with the cognitive processes observed in infants. Consequently, a critical inquiry arises: how can these two approaches be synergistically integrated to forge a robust and comprehensive learning model?
Thirdly, determining the optimal or recommended SSL algorithm poses a challenge, as there is no universally applicable solution. The ideal selection of an algorithm should align with the specific problem structure, yet practical situations often complicate this process. Consequently, the development of a checklist to aid users in identifying the most suitable method under particular circumstances warrants investigation and should be pursued as a promising avenue for future research.

Fourthly, the assumption that unlabeled data invariably leads to improved outcomes warrants scrutiny. Our hypothesis challenges this notion, especially concerning semi-supervised learning methods, as the no-free-lunch theorem comes into play. Performance degradation can arise when model assumptions fail to align with the underlying problem structure. For instance, if a model assumes a substantial separation between decision boundaries and regions of high data density, it may perform poorly on data originating from heavily overlapping Cauchy distributions, since the decision boundary would then traverse dense areas (the short sketch below makes this mismatch concrete). However, preemptively identifying such mismatches remains an intricate and unresolved matter that merits further research.
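As a concrete illustration of this point, the following minimal sketch (our own example, not an experiment from this survey; the locations ±0.5 and the unit scale are arbitrary assumptions) uses SciPy to show that, for two heavily overlapping Cauchy class-conditional densities with equal priors, the optimal decision boundary falls in the densest region of the data, so a low-density-separation assumption is violated from the outset.

    # Illustrative only: two overlapping Cauchy class-conditional densities whose
    # optimal decision boundary lies in a high-density region, violating the
    # low-density-separation assumption behind many semi-/self-supervised methods.
    import numpy as np
    from scipy.stats import cauchy

    loc0, loc1, scale = -0.5, 0.5, 1.0            # assumed class-conditional parameters
    xs = np.linspace(-5.0, 5.0, 2001)
    p0 = cauchy.pdf(xs, loc=loc0, scale=scale)    # density of class 0
    p1 = cauchy.pdf(xs, loc=loc1, scale=scale)    # density of class 1

    idx = int(np.argmin(np.abs(p0 - p1)))         # with equal priors, the optimal boundary
    boundary = xs[idx]                            # is where the densities cross
    marginal = 0.5 * (p0 + p1)                    # overall (class-balanced) data density

    print(f"optimal decision boundary at x = {boundary:.2f}")
    print(f"data density at boundary / peak density = {marginal[idx] / marginal.max():.2f}")
    # The ratio is ~1.0: the boundary runs straight through the densest area, so a
    # method that pushes boundaries away from dense regions is misled, and adding
    # more unlabeled data drawn from these distributions does not fix the mismatch.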
REFERENCES

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 248–255, 2009.
[2] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in Int. Conf. Mach. Learn., pp. 8748–8763, 2021.
[3] L. Ericsson, H. Gouk, and T. M. Hospedales, “How well do self-supervised models transfer?,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5414–5423, 2021.
[4] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” IEEE T. Knowl. Data Eng., 2022.
[5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Int. Conf. Learn. Represent., 2021.
[6] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in IEEE Int. Conf. Comput. Vis., pp. 4489–4497, 2015.
[7] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” in Int. Conf. Learn. Represent., pp. 1–14, 2018.
[8] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in Eur. Conf. Comput. Vis., pp. 69–84, 2016.
[9] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsupervised learning using temporal order verification,” in Eur. Conf. Comput. Vis., pp. 527–544, 2016.
[10] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman, “Learning and using the arrow of time,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 8052–8060, 2018.
[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[12] X. Zeng, Y. Pan, M. Wang, J. Zhang, and Y. Liu, “Realistic face reenactment via self-supervised disentangling of identity and pose,” in AAAI Conf. Artif. Intell., pp. 12154–12163, 2020.
[13] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman, “End-to-end learning of visual representations from uncurated instructional videos,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9879–9889, 2020.
[14] Y. M. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in Int. Conf. Learn. Represent., 2020.
[15] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[16] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Int. Conf. Mach. Learn., pp. 1096–1103, 2008.
[17] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in IEEE Int. Conf. Robot. Autom., pp. 3406–3413, 2016.
[18] Y. Li, M. Paluri, J. M. Rehg, and P. Dollár, “Unsupervised learning of edges,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1619–1627, 2016.
[19] D. Li, W.-C. Hung, J.-B. Huang, S. Wang, N. Ahuja, and M.-H. Yang, “Unsupervised visual representation learning by graph-based consistent constraints,” in Eur. Conf. Comput. Vis., pp. 678–694, 2016.
[20] H. Lee, S. J. Hwang, and J. Shin, “Rethinking data augmentation: Self-supervision and self-distillation,” arXiv preprint arXiv:1910.05872, 2019.
[21] B. Zoph, G. Ghiasi, T.-Y. Lin, Y. Cui, H. Liu, E. D. Cubuk, and Q. Le, “Rethinking pre-training and self-training,” in Neural Inf. Process. Syst., pp. 1–13, 2020.
[22] A. E. Orhan, V. V. Gupta, and B. M. Lake, “Self-supervised learning through the eyes of a child,” in Neural Inf. Process. Syst., pp. 9960–9971, 2020.
[23] J. Mitrovic, B. McWilliams, J. Walker, L. Buesing, and C. Blundell, “Representation learning via invariant causal mechanisms,” in Int. Conf. Learn. Represent., pp. 1–19, 2021.
[24] T. Hua, W. Wang, Z. Xue, S. Ren, Y. Wang, and H. Zhao, “On feature decorrelation in self-supervised learning,” in IEEE Int. Conf. Comput. Vis., pp. 9598–9608, 2021.
[25] VentureBeat, “Yann LeCun, Yoshua Bengio: Self-supervised learning is key to human-level intelligence.” https://cacm.acm.org/news/244720-yann-lecun-yoshua-bengio-self-supervised-learning-is-key-to-human-level-intelligence/fulltext.
[26] J. Yu, H. Yin, X. Xia, T. Chen, J. Li, and Z. Huang, “Self-supervised learning for recommender systems: A survey,” arXiv preprint arXiv:2203.15876, 2022.
[27] Y. Liu, M. Jin, S. Pan, C. Zhou, Y. Zheng, F. Xia, and P. Yu, “Graph self-supervised learning: A survey,” IEEE T. Knowl. Data Eng., 2022.
[28] H. H. Mao, “A survey on self-supervised pre-training for sequential transfer learning in neural networks,” arXiv preprint arXiv:2007.00800, 2020.
[29] M. C. Schiappa, Y. S. Rawat, and M. Shah, “Self-supervised learning for videos: A survey,” arXiv preprint arXiv:2207.00419, 2022.
[30] G.-J. Qi and M. Shah, “Adversarial pretraining of self-supervised deep networks: Past, present and future,” arXiv preprint arXiv:2210.13463, 2022.
[31] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon, “A survey on contrastive self-supervised learning,” Technologies, vol. 9, no. 1, pp. 1–22, 2020.
[32] V. R. de Sa, “Learning classification with unlabeled data,” in Neural Inf. Process. Syst., pp. 112–119, 1994.
[33] Y. LeCun and Y. Bengio, “Reflections from the turing award winners.” https://iclr.cc/virtual 2020/speaker 7.html.
[34] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 11, pp. 4037–4058, 2021.
[35] J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye, “A review on generative adversarial networks: Algorithms, theory, and applications,” IEEE T. Knowl. Data Eng., 2022.
[36] T. Nathan Mundhenk, D. Ho, and B. Y. Chen, “Improvements to context based self-supervised learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9339–9348, 2018.
[37] P. Agrawal, J. Carreira, and J. Malik, “Learning to see by moving,” in IEEE Int. Conf. Comput. Vis., pp. 37–45, 2015.
[38] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in Eur. Conf. Comput. Vis., pp. 649–666, 2016.
[39] G. Larsson, M. Maire, and G. Shakhnarovich, “Learning representations for automatic colorization,” in Eur. Conf. Comput. Vis., pp. 577–593, 2016.
[40] R. Zhang, J.-Y. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. Efros, “Real-time user-guided image colorization with learned deep priors,” arXiv preprint arXiv:1705.02999, 2017.
[41] G. Larsson, M. Maire, and G. Shakhnarovich, “Colorization as a proxy task for visual understanding,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6874–6883, 2017.
[42] P. Goyal, D. Mahajan, A. Gupta, and I. Misra, “Scaling and benchmarking self-supervised visual representation learning,” in IEEE Int. Conf. Comput. Vis., pp. 6391–6400, 2019.
[43] U. Ahsan, R. Madhok, and I. Essa, “Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition,” in Proc. Winter Conf. Appl. Comput. Vis., pp. 179–189, 2019.
[44] X. Zhan, X. Pan, Z. Liu, D. Lin, and C. C. Loy, “Self-supervised learning via conditional motion propagation,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1881–1889, 2019.
[45] K. Wang, L. Lin, C. Jiang, C. Qian, and P. Wei, “3d human pose machines with self-supervised learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 5, pp. 1069–1082, 2019.
[46] M. Noroozi, H. Pirsiavash, and P. Favaro, “Representation learning by learning to count,” in IEEE Int. Conf. Comput. Vis., pp. 5898–5906, 2017.
[47] I. Misra and L. v. d. Maaten, “Self-supervised learning of pretext-invariant representations,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6707–6717, 2020.
[48] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3733–3742, 2018.
[49] N. Zhao, Z. Wu, R. W. Lau, and S. Lin, “What makes instance discrimination good for transfer learning?,” in Int. Conf. Learn. Represent., pp. 1–11, 2021.
[50] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9729–9738, 2020.
[51] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv preprint arXiv:2003.04297, 2020.
[52] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Int. Conf. Mach. Learn., pp. 1597–1607, 2020.
[53] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton, “Big self-supervised models are strong semi-supervised learners,” in Neural Inf. Process. Syst., pp. 1–13, 2020.
[54] T. Wang and P. Isola, “Understanding contrastive representation learning through alignment and uniformity on the hypersphere,” in Int. Conf. Mach. Learn., pp. 9929–9939, 2020.
[55] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” in Int. Conf. Mach. Learn., 2021.
[56] A. Bardes, J. Ponce, and Y. LeCun, “Vicreg: Variance-invariance-covariance regularization for self-supervised learning,” in Int. Conf. Learn. Represent., pp. 1–12, 2022.
[57] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1735–1742, 2006.
[58] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2019.
[59] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Int. Conf. Artif. Intell. Statist., pp. 297–304, 2010.
[60] M. Zheng, S. You, F. Wang, C. Qian, C. Zhang, X. Wang, and C. Xu, “Ressl: Relational self-supervised learning with weak augmentation,” arXiv preprint arXiv:2107.09282, 2021.
[61] N. Zhao, Z. Wu, R. W. Lau, and S. Lin, “Distilling localization for self-supervised representation learning,” in AAAI Conf. Artif. Intell., pp. 10990–10998, 2021.
[62] R. Arandjelovic and A. Zisserman, “Objects that sound,” in Eur. Conf. Comput. Vis., pp. 435–451, 2018.
[63] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in Eur. Conf. Comput. Vis., pp. 776–794, 2020.
[64] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, “What makes for good views for contrastive learning,” in Neural Inf. Process. Syst., pp. 1–13, 2020.
[65] Z. Xie, Y. Lin, Z. Zhang, Y. Cao, S. Lin, and H. Hu, “Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 16684–16693, 2021.
[66] X. Wang and G.-J. Qi, “Contrastive learning with stronger augmentations,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–12, 2022.
[67] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al., “Bootstrap your own latent: A new approach to self-supervised learning,” in Neural Inf. Process. Syst., pp. 1–14, 2020.
[68] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” in Neural Inf. Process. Syst., 2020.
[69] X. Chen and K. He, “Exploring simple siamese representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 15750–15758, 2021.
[70] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 16000–16009, 2022.
[71] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic, “On mutual information maximization for representation learning,” in Int. Conf. Learn. Represent., pp. 1–12, 2020.
[72] N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and H. Khandeparkar, “A theoretical analysis of contrastive unsupervised representation learning,” in Int. Conf. Mach. Learn., pp. 5628–5637, 2019.
[73] Y. Yang and Z. Xu, “Rethinking the value of labels for improving class-imbalanced learning,” in Neural Inf. Process. Syst., 2020.
[74] Y.-H. H. Tsai, Y. Wu, R. Salakhutdinov, and L.-P. Morency, “Self-supervised learning from a multi-view perspective,” arXiv preprint arXiv:2006.05576, 2020.
[75] C.-Y. Chuang, J. Robinson, L. Yen-Chen, A. Torralba, and S. Jegelka, “Debiased contrastive learning,” in Int. Conf. Learn. Represent., 2020.
[76] J. D. Lee, Q. Lei, N. Saunshi, and J. Zhuo, “Predicting what you already know helps: Provable self-supervised learning,” arXiv preprint arXiv:2008.01064, 2020.
[77] S. Chen, G. Niu, C. Gong, J. Li, J. Yang, and M. Sugiyama, “Large-margin contrastive learning with distance polarization regularizer,” in Int. Conf. Mach. Learn., pp. 1673–1683, 2021.
[78] J. Z. HaoChen, C. Wei, A. Gaidon, and T. Ma, “Provable guarantees for self-supervised deep learning with spectral contrastive loss,” in Neural Inf. Process. Syst., Nov. 2021.
[79] C. Tosh, A. Krishnamurthy, and D. Hsu, “Contrastive learning, multi-view redundancy, and linear models,” in Algorithmic Learning Theory, pp. 1179–1206, 2021.
[80] C. Wei, K. Shen, Y. Chen, and T. Ma, “Theoretical analysis of self-training with deep networks on unlabeled data,” in Int. Conf. Learn. Represent., pp. 1–15, 2021.
[81] Y. Tian, “Deep contrastive learning is provably (almost) principal component analysis,” arXiv preprint arXiv:2201.12680, 2022.
[82] X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised visual transformers,” in IEEE Int. Conf. Comput. Vis., pp. 9640–9649, 2021.
[83] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in IEEE Int. Conf. Comput. Vis., pp. 9650–9660, 2021.
[84] Y. Wang, X. Shen, S. X. Hu, Y. Yuan, J. L. Crowley, and D. Vaufreydaz, “Self-supervised transformers for unsupervised object discovery using normalized cut,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 14543–14553, 2022.
[85] E. Hoffer, I. Hubara, and N. Ailon, “Deep unsupervised learning through spatial contrasting,” arXiv preprint arXiv:1610.00243, 2016.
[86] Y. Xu, Q. Zhang, J. Zhang, and D. Tao, “Regioncl: exploring contrastive region pairs for self-supervised representation learning,” in Eur. Conf. Comput. Vis., pp. 477–494, Springer, 2022.
[87] M. Yang, M. Liao, P. Lu, J. Wang, S. Zhu, H. Luo, Q. Tian, and X. Bai, “Reading and writing: Discriminative and generative modeling for self-supervised text recognition,” arXiv preprint arXiv:2207.00193, 2022.
[88] R. Zhu, B. Zhao, J. Liu, Z. Sun, and C. W. Chen, “Improving contrastive learning by visualizing feature transformation,” in IEEE Int. Conf. Comput. Vis., pp. 10306–10315, 2021.
[89] M. Yang, Y. Li, Z. Huang, Z. Liu, P. Hu, and X. Peng, “Partially view-aligned representation learning with noise-robust contrastive loss,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1134–1143, 2021.
[90] A. Islam, C.-F. Chen, R. Panda, L. Karlinsky, R. Radke, and R. Feris, “A broad study on the transferability of visual representations with contrastive learning,” in IEEE Int. Conf. Comput. Vis., pp. 8845–8855, 2021.
[91] J. Li, C. Xiong, and S. C. Hoi, “Learning from noisy data with robust representation learning,” in IEEE Int. Conf. Comput. Vis., pp. 9485–9494, 2021.
[92] L. Jing, P. Vincent, Y. LeCun, and Y. Tian, “Understanding dimensional collapse in contrastive self-supervised learning,” in Int. Conf. Learn. Represent., pp. 1–11, 2022.
[93] J. Zhang, X. Xu, F. Shen, Y. Yao, J. Shao, and X. Zhu, “Video representation learning with graph contrastive augmentation,” in ACM Int. Conf. Multimedia, pp. 3043–3051, 2021.
[94] Q. Hu, X. Wang, W. Hu, and G.-J. Qi, “Adco: Adversarial contrast for efficient learning of unsupervised representations from self-trained negative adversaries,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2021.
[95] Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, and D. Larlus, “Hard negative mixing for contrastive learning,” in Neural Inf. Process. Syst., pp. 1–12, 2020.
[96] S. Purushwalkam and A. Gupta, “Demystifying contrastive self-supervised learning: Invariances, augmentations and dataset biases,” in Neural Inf. Process. Syst., pp. 1–12, 2020.
[97] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” in Neural Inf. Process. Syst., pp. 18661–18673, 2020.
[98] J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong, “ibot: Image bert pre-training with online tokenizer,” in Int. Conf. Learn. Represent., pp. 1–12, 2022.
[99] H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,” in Int. Conf. Learn. Represent., pp. 1–13, 2022.
[100] X. Chen, M. Ding, X. Wang, Y. Xin, S. Mo, Y. Wang, S. Han, P. Luo, G. Zeng, and J. Wang, “Context autoencoder for self-supervised representation learning,” arXiv preprint arXiv:2202.03026, 2022.
[101] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, “Simmim: A simple framework for masked image modeling,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9653–9663, 2022.
[102] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
[103] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, P. Dhariwal, D. Luan, and I. Sutskever, “Generative pretraining from pixels,” in Int. Conf. Mach. Learn., pp. 1691–1703, 2020.
[104] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2536–2544, 2016.
[105] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in Int. Conf. Mach. Learn., pp. 8821–8831, 2021.
[106] C. Wei, H. Fan, S. Xie, C.-Y. Wu, A. Yuille, and C. Feichtenhofer, “Masked feature prediction for self-supervised visual pre-training,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 14668–14678, 2022.
[107] X. Dong, J. Bao, T. Zhang, D. Chen, W. Zhang, L. Yuan, D. Chen, F. Wen, and N. Yu, “Peco: Perceptual codebook for bert pre-training of vision transformers,” arXiv preprint arXiv:2111.12710, 2021.
[108] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “Data2vec: A general framework for self-supervised learning in speech, vision and language,” arXiv preprint arXiv:2202.03555, 2022.
[109] Y. Chen, Y. Liu, D. Jiang, X. Zhang, W. Dai, H. Xiong, and Q. Tian, “Sdae: Self-distillated masked autoencoder,” in Eur. Conf. Comput. Vis., pp. 108–124, 2022.
[110] Q. Zhou, C. Yu, H. Luo, Z. Wang, and H. Li, “Mimco: Masked image modeling pre-training with contrastive teacher,” in ACM Int. Conf. Multimedia, pp. 4487–4495, 2022.
[111] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, “Beit v2: Masked image modeling with vector-quantized visual tokenizers,” arXiv preprint arXiv:2208.06366, 2022.
[112] C. Feichtenhofer, H. Fan, Y. Li, and K. He, “Masked autoencoders as spatiotemporal learners,” arXiv preprint arXiv:2205.09113, 2022.
[113] Y. Liang, S. Zhao, B. Yu, J. Zhang, and F. He, “Meshmae: Masked autoencoders for 3d mesh data analysis,” in Eur. Conf. Comput. Vis., pp. 37–54, 2022.
[114] Y. Pang, W. Wang, F. E. Tay, W. Liu, Y. Tian, and L. Yuan, “Masked autoencoders for point cloud self-supervised learning,” in Eur. Conf. Comput. Vis., pp. 604–621, 2022.
[115] R. Wang, D. Chen, Z. Wu, Y. Chen, X. Dai, M. Liu, Y.-G. Jiang, L. Zhou, and L. Yuan, “Bevt: Bert pretraining of video transformers,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 14733–14743, 2022.
[116] Z. Tong, Y. Song, J. Wang, and L. Wang, “Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training,” Neural Inf. Process. Syst., vol. 35, pp. 10078–10093, 2022.
[117] R. Girdhar, A. El-Nouby, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, “Omnimae: Single model masked pretraining on images and videos,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10406–10417, June 2023.
[118] A. Gupta, J. Wu, J. Deng, and L. Fei-Fei, “Siamese masked autoencoders,” in Neural Inf. Process. Syst., Nov. 2023.
[119] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, et al., “Swin transformer v2: Scaling up capacity and resolution,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 12009–12019, 2022.
[120] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision transformer backbones for object detection,” in Eur. Conf. Comput. Vis., pp. 280–296, 2022.
[121] Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “Vitpose: Simple vision transformer baselines for human pose estimation,” in Neural Inf. Process. Syst., pp. 38571–38584, 2022.
[122] Z. Liu, J. Gui, and H. Luo, “Good helper is around you: Attention-driven masked image modeling,” in AAAI Conf. Artif. Intell., pp. 1799–1807, 2023.
[123] Z. Qi, R. Dong, G. Fan, Z. Ge, X. Zhang, K. Ma, and L. Yi, “Contrast with reconstruct: Contrastive 3d representation learning guided by generative pretraining,” arXiv preprint arXiv:2302.02318, 2023.
[124] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, Y. Wei, Q. Dai, and H. Hu, “On data scaling in masked image modeling,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10365–10374, 2023.
[125] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
[126] X. Kong and X. Zhang, “Understanding masked image modeling via learning occlusion invariant feature,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6241–6251, 2023.
[127] H. Chen, Y. Wang, B. Lagadec, A. Dantcheva, and F. Bremond, “Joint generative and contrastive learning for unsupervised person re-identification,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2004–2013, 2021.
[128] L. Wang, F. Liang, Y. Li, H. Zhang, W. Ouyang, and J. Shao, “Repre: Improving self-supervised vision transformer with reconstructive pre-training,” Jan. 2022.
[129] Z. Huang, X. Jin, C. Lu, Q. Hou, M.-M. Cheng, D. Fu, X. Shen, and J. Feng, “Contrastive masked autoencoders are stronger vision learners,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–13, 2023.
[130] C. Tao, X. Zhu, W. Su, G. Huang, B. Li, J. Zhou, Y. Qiao, X. Wang, and J. Dai, “Siamese image modeling for self-supervised visual representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2132–2141, 2023.
[131] Z. Xie, Z. Geng, J. Hu, Z. Zhang, H. Hu, and Y. Cao, “Revealing the dark secrets of masked image modeling,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 14475–14485, 2023.
[132] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, “Discriminative unsupervised feature learning with convolutional neural networks,” in Neural Inf. Process. Syst., pp. 766–774, 2014.
[133] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox, “Discriminative unsupervised feature learning with exemplar convolutional neural networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 9, pp. 1734–1747, 2015.
[134] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in IEEE Int. Conf. Comput. Vis., pp. 1422–1430, 2015.
[135] P. Bojanowski and A. Joulin, “Unsupervised learning by predicting noise,” in Int. Conf. Mach. Learn., 2017.
[136] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in Int. Conf. Mach. Learn., pp. 478–487, 2016.
[137] J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of deep representations and image clusters,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5147–5156, 2016.
[138] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Eur. Conf. Comput. Vis., pp. 132–149, 2018.
[139] R. Zhang, P. Isola, and A. A. Efros, “Split-brain autoencoders: Unsupervised learning by cross-channel prediction,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1058–1067, 2017.
[140] X. Wang, K. He, and A. Gupta, “Transitive invariance for self-supervised visual representation learning,” in IEEE Int. Conf. Comput. Vis., pp. 1329–1338, 2017.
[141] A. Kolesnikov, X. Zhai, and L. Beyer, “Revisiting self-supervised visual representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1920–1929, 2019.
[142] P. Krähenbühl, “Free supervision from video games,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2955–2964, 2018.
[143] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Neural Inf. Process. Syst., pp. 2672–2680, 2014.
[144] T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby, “Self-supervised gans via auxiliary rotation loss,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 12154–12163, 2019.
[145] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer, “S4l: Self-supervised semi-supervised learning,” in IEEE Int. Conf. Comput. Vis., pp. 1476–1485, 2019.
[146] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song, “Using self-supervised learning can improve model robustness and uncertainty,” in Neural Inf. Process. Syst., pp. 15663–15674, 2019.
[147] K. Hassani and A. H. Khasahmadi, “Contrastive multi-view representation learning on graphs,” in Int. Conf. Mach. Learn., 2020.
[148] L. Gomez, Y. Patel, M. Rusiñol, D. Karatzas, and C. Jawahar, “Self-supervised learning of visual features through embedding images into text topic spaces,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4230–4239, 2017.
[149] L. Jing, Y. Chen, L. Zhang, M. He, and Y. Tian, “Self-supervised feature learning by cross-modality and cross-view correspondences,” arXiv preprint arXiv:2004.05749, 2020.
[150] L. Jing, Y. Chen, L. Zhang, M. He, and Y. Tian, “Self-supervised modal and view invariant feature learning,” arXiv preprint arXiv:2005.14169, 2020.
[151] L. Zhang and Z. Zhu, “Unsupervised feature learning for point cloud understanding by contrasting and clustering using graph convolutional neural networks,” in International Conference on 3D Vision, pp. 395–404, 2019.
[152] Y. Yang, C. Feng, Y. Shen, and D. Tian, “Foldingnet: Point cloud auto-encoder via deep grid deformation,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 206–215, 2018.
[153] M. Gadelha, R. Wang, and S. Maji, “Multiresolution tree networks for 3d point cloud processing,” in Eur. Conf. Comput. Vis., pp. 103–118, 2018.
[154] Y. Zhao, T. Birdal, H. Deng, and F. Tombari, “3d point capsule networks,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1009–1018, 2019.
[155] Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt, “Test-time training with self-supervision for generalization under distribution shifts,” in Int. Conf. Mach. Learn., 2020.
[156] Y. Gandelsman, Y. Sun, X. Chen, and A. A. Efros, “Test-time training with masked autoencoders,” arXiv preprint arXiv:2209.07522, 2022.
[157] J. J. Sun, A. Kennedy, E. Zhan, D. J. Anderson, Y. Yue, and P. Perona, “Task programming: Learning data efficient behavior representations,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2876–2885, 2021.
[158] Z. Ren and Y. Jae Lee, “Cross-domain self-supervised multi-task feature learning using synthetic imagery,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 762–771, 2018.
[159] K. Saito, D. Kim, S. Sclaroff, and K. Saenko, “Universal domain adaptation through self supervision,” in Neural Inf. Process. Syst., pp. 1–11, 2020.
[160] Y. Sun, E. Tzeng, T. Darrell, and A. A. Efros, “Unsupervised domain adaptation through self-supervision,” arXiv preprint arXiv:1909.11825, 2019.
[161] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash, “Boosting self-supervised learning via knowledge transfer,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9359–9367, 2018.
[162] Z. Hu, Y. Dong, K. Wang, K.-W. Chang, and Y. Sun, “Gpt-gnn: Generative pre-training of graph neural networks,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1857–1867, 2020.
[163] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang, “Self-supervised graph transformer on large-scale molecular data,” in Neural Inf. Process. Syst., 2020.
[164] U. Buchler, B. Brattoli, and B. Ommer, “Improving spatiotemporal self-supervision by deep reinforcement learning,” in Eur. Conf. Comput. Vis., pp. 770–786, 2018.
[165] D. Guo, B. A. Pires, B. Piot, J.-b. Grill, F. Altché, R. Munos, and M. G. Azar, “Bootstrap latent-predictive representations for multitask reinforcement learning,” arXiv preprint arXiv:2004.14646, 2020.
[166] N. Hansen, Y. Sun, P. Abbeel, A. A. Efros, L. Pinto, and X. Wang, “Self-supervised policy adaptation during deployment,” arXiv preprint arXiv:2007.04309, 2020.
[167] S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord, “Boosting few-shot visual learning with self-supervision,” in IEEE Int. Conf. Comput. Vis., pp. 8059–8068, 2019.
[168] J.-C. Su, S. Maji, and B. Hariharan, “Boosting supervision with self-supervision for few-shot learning,” arXiv preprint arXiv:1906.07079, 2019.
[169] C. Li, T. Tang, G. Wang, J. Peng, B. Wang, X. Liang, and X. Chang, “Bossnas: Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search,” in IEEE Int. Conf. Comput. Vis., 2021.
[170] L. Fan, S. Liu, P.-Y. Chen, G. Zhang, and C. Gan, “When does contrastive learning preserve adversarial robustness from pre-training to finetuning?,” in Neural Inf. Process. Syst., 2021.
[171] M. Kim, J. Tack, and S. J. Hwang, “Adversarial self-supervised contrastive learning,” in Neural Inf. Process. Syst., pp. 1–12, 2020.
[172] T. Chen, S. Liu, S. Chang, Y. Cheng, L. Amini, and Z. Wang, “Adversarial robustness: From self-supervised pre-training to fine-tuning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 699–708, 2020.
[173] Y. Lin, X. Guo, and Y. Lu, “Self-supervised video representation learning with meta-contrastive network,” in IEEE Int. Conf. Comput. Vis., pp. 8239–8249, 2021.
[174] Y. An, H. Xue, X. Zhao, and L. Zhang, “Conditional self-supervised learning for few-shot classification,” in Int. Joint Conf. Artif. Intell., pp. 2140–2146, 2021.
[175] S. Pal, A. Datta, and D. D. Majumder, “Computer recognition of vowel sounds using a self-supervised learning algorithm,” Journal of the Anatomical Society of India, pp. 117–123, 1978.
[176] A. Ghosh, N. R. Pal, and S. K. Pal, “Self-organization for object extraction using a multilayer neural network and fuzziness measures,” IEEE Transactions on Fuzzy Systems, pp. 54–68, 1993.
[177] A. Sharma, O. Grau, and M. Fritz, “Vconv-dae: Deep volumetric shape learning without object labels,” in Eur. Conf. Comput. Vis., pp. 236–250, 2016.
[178] K. Gong, X. Liang, D. Zhang, X. Shen, and L. Lin, “Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 932–940, 2017.
[179] X. Liang, K. Gong, X. Shen, and L. Lin, “Look into person: Joint body parsing & pose estimation network and a new benchmark,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 4, pp. 871–885, 2018.
[180] X. Zhan, X. Pan, B. Dai, Z. Liu, D. Lin, and C. C. Loy, “Self-supervised scene de-occlusion,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3784–3792, 2020.
[181] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan, “Learning features by watching objects move,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2701–2710, 2017.
[182] Y. Wang, J. Zhang, M. Kan, S. Shan, and X. Chen, “Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 12275–12284, 2020.
[183] Z. Chen, X. Ye, L. Du, W. Yang, L. Huang, X. Tan, Z. Shi, F. Shen, and E. Ding, “Aggnet for self-supervised monocular depth estimation: Go an aggressive step further,” in ACM Int. Conf. Multimedia, pp. 1526–1534, 2021.
[184] H. Chen, B. Lagadec, and F. Bremond, “Ice: Inter-instance contrastive encoding for unsupervised person re-identification,” in IEEE Int. Conf. Comput. Vis., pp. 14960–14969, 2021.
[185] T. Isobe, D. Li, L. Tian, W. Chen, Y. Shan, and S. Wang, “Towards discriminative representation learning for unsupervised person re-identification,” in IEEE Int. Conf. Comput. Vis., pp. 8526–8536, 2021.
[186] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha, “Self-supervised deep visual odometry with online adaptation,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6339–6348, 2020.
[187] W. Wu, Z. Y. Wang, Z. Li, W. Liu, and L. Fuxin, “Pointpwc-net: Cost volume on point clouds for (self-) supervised scene flow estimation,” in Eur. Conf. Comput. Vis., 2020.
[188] G. Xu, Z. Liu, X. Li, and C. C. Loy, “Knowledge distillation meets self-supervision,” arXiv preprint arXiv:2006.07114, 2020.
[189] J. Walker, A. Gupta, and M. Hebert, “Dense optical flow prediction from a static image,” in IEEE Int. Conf. Comput. Vis., pp. 2443–2451, 2015.
[190] F. Zhu, Y. Zhu, X. Chang, and X. Liang, “Vision-language navigation with self-supervised auxiliary reasoning tasks,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10012–10022, 2020.
[191] X. Niu, S. Shan, H. Han, and X. Chen, “Rhythmnet: End-to-end heart rate estimation from face via spatial-temporal representation,” IEEE Trans. Image Process., vol. 29, pp. 2409–2423, 2020.
[192] X. Niu, Z. Yu, H. Han, X. Li, S. Shan, and G. Zhao, “Video-based remote physiological measurement via cross-verified feature disentangling,” in Eur. Conf. Comput. Vis., 2020.
[193] Y. Xie, Z. Wang, and S. Ji, “Noise2same: Optimizing a self-supervised bound for image denoising,” in Neural Inf. Process. Syst., 2020.
[194] T. Huang, S. Li, X. Jia, H. Lu, and J. Liu, “Neighbor2neighbor: Self-supervised denoising from single noisy images,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2021.
[195] C. Yang, Z. Wu, B. Zhou, and S. Lin, “Instance localization for self-supervised detection pretraining,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3987–3996, 2021.
[196] I. Croitoru, S.-V. Bogolin, and M. Leordeanu, “Unsupervised learning from video to detect foreground objects in single images,” in IEEE Int. Conf. Comput. Vis., pp. 4335–4343, 2017.
[197] E. Xie, J. Ding, W. Wang, X. Zhan, H. Xu, Z. Li, and P. Luo, “Detco: Unsupervised contrastive learning for object detection,” arXiv preprint arXiv:2102.04803, 2021.
[198] G. Wu, J. Jiang, X. Liu, and J. Ma, “A practical contrastive learning framework for single image super-resolution,” arXiv preprint arXiv:2111.13924, 2021.
[199] S. Menon, A. Damian, S. Hu, N. Ravi, and C. Rudin, “Pulse: Self-supervised photo upsampling via latent space exploration of generative models,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2437–2445, 2020.
[200] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta, “Learning a predictable and generative vector representation for objects,” in Eur. Conf. Comput. Vis., pp. 484–499, 2016.
[201] D. Jayaraman and K. Grauman, “Learning image representations tied to ego-motion,” in IEEE Int. Conf. Comput. Vis., pp. 1413–1421, 2015.
[202] Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth, optical flow and camera pose,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1983–1992, 2018.
[203] L. Huang, Y. Liu, B. Wang, P. Pan, Y. Xu, and R. Jin, “Self-supervised video representation learning by context and motion decoupling,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 13886–13895, 2021.
[204] K. Hu, J. Shao, Y. Liu, B. Raj, M. Savvides, and Z. Shen, “Contrast and order representations for video self-supervised learning,” in IEEE Int. Conf. Comput. Vis., pp. 7939–7949, 2021.
[205] M. Tschannen, J. Djolonga, M. Ritter, A. Mahendran, N. Houlsby, S. Gelly, and M. Lucic, “Self-supervised learning of video-induced visual invariances,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 13806–13815, 2020.
[206] X. He, Y. Pan, M. Tang, Y. Lv, and Y. Peng, “Learn from unlabeled videos for near-duplicate video retrieval,” in International Conference on Research on Development in Information Retrieval, pp. 1–10, 2022.
[207] T. Han, W. Xie, and A. Zisserman, “Video representation learning by dense predictive coding,” in ICCV Workshops, 2019.
[208] T. Han, W. Xie, and A. Zisserman, “Memory-augmented dense predictive coding for video representation learning,” in Eur. Conf. Comput. Vis., 2020.
[209] B. Fernando, H. Bilen, E. Gavves, and S. Gould, “Self-supervised video representation learning with odd-one-out networks,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3636–3645, 2017.
[210] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang, “Unsupervised representation learning by sorting sequences,” in IEEE Int. Conf. Comput. Vis., pp. 667–676, 2017.
[211] D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang, “Self-supervised spatiotemporal learning via video clip order prediction,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10334–10343, 2019.
[212] S. Benaim, A. Ephrat, O. Lang, I. Mosseri, W. T. Freeman, M. Rubinstein, M. Irani, and T. Dekel, “Speednet: Learning the speediness in videos,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9922–9931, 2020.
[213] Y. Yao, C. Liu, D. Luo, Y. Zhou, and Q. Ye, “Video playback rate perception for self-supervised spatio-temporal representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6548–6557, 2020.
[214] J. Wang, J. Jiao, and Y.-H. Liu, “Self-supervised video representation learning by pace prediction,” in Eur. Conf. Comput. Vis., 2020.
[215] A. Diba, V. Sharma, L. V. Gool, and R. Stiefelhagen, “Dynamonet: Dynamic action and motion network,” in IEEE Int. Conf. Comput. Vis., pp. 6192–6201, 2019.
[216] T. Han, W. Xie, and A. Zisserman, “Self-supervised co-training for video representation learning,” in Neural Inf. Process. Syst., pp. 1–12, 2020.
[217] B. Korbar, D. Tran, and L. Torresani, “Cooperative learning of audio and video models from self-supervised synchronization,” in Neural Inf. Process. Syst., pp. 7763–7774, 2018.
[218] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in IEEE Int. Conf. Comput. Vis., pp. 609–617, 2017.
[219] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “Videobert: A joint model for video and language representation learning,” in IEEE Int. Conf. Comput. Vis., pp. 7464–7473, 2019.
[220] A. Nagrani, C. Sun, D. Ross, R. Sukthankar, C. Schmid, and A. Zisserman, “Speech2action: Cross-modal supervision for action recognition,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10317–10326, 2020.
[221] J. C. Stroud, D. A. Ross, C. Sun, J. Deng, R. Sukthankar, and C. Schmid, “Learning video representations from textual web supervision,” arXiv preprint arXiv:2007.14937, 2020.
[222] J.-B. Alayrac, A. Recasens, R. Schneider, R. Arandjelović, J. Ramapuram, J. De Fauw, L. Smaira, S. Dieleman, and A. Zisserman, “Self-supervised multimodal versatile networks,” arXiv preprint arXiv:2006.16228, 2020.
[223] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine, “Time-contrastive networks: Self-supervised learning from video,” in IEEE Int. Conf. Robot. Autom., pp. 1134–1141, 2018.
[224] X. Wang, A. Jabri, and A. A. Efros, “Learning correspondence from the cycle-consistency of time,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2566–2576, 2019.
[225] X. Li, S. Liu, S. De Mello, X. Wang, J. Kautz, and M.-H. Yang, “Joint-task self-supervised learning for temporal correspondence,” in Neural Inf. Process. Syst., pp. 318–328, 2019.
[226] A. Jabri, A. Owens, and A. A. Efros, “Space-time correspondence as a contrastive random walk,” in Neural Inf. Process. Syst., pp. 19545–19560, 2020.
[227] Z. Lai, E. Lu, and W. Xie, “Mast: A memory-augmented self-supervised tracker,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6479–6488, 2020.
[228] Z. Zhang, S. Lathuiliere, E. Ricci, N. Sebe, Y. Yan, and J. Yang, “Online depth learning against forgetting in monocular videos,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4494–4503, 2020.
[229] D. Luo, C. Liu, Y. Zhou, D. Yang, C. Ma, Q. Ye, and W. Wang, “Video cloze procedure for self-supervised spatio-temporal learning,” in AAAI Conf. Artif. Intell., pp. 11701–11708, 2020.
[230] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord, “Data-efficient image recognition with contrastive predictive coding,” in Int. Conf. Mach. Learn., 2020.
[231] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018.
[232] C. Li, J. Yang, P. Zhang, M. Gao, B. Xiao, X. Dai, L. Yuan, and J. Gao, “Efficient self-supervised vision transformers for representation learning,” arXiv preprint arXiv:2106.09785, 2021.
[233] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Neural Inf. Process. Syst., pp. 3111–3119, 2013.
[234] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders as discriminators rather than generators,” in Int. Conf. Learn. Represent., 2020.
[235] N. Pappas and J. Henderson, “Gile: A generalized input-label embedding for text classification,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 139–155, 2019.
[236] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Pre-training transformers as energy-based cloze models,” arXiv preprint arXiv:2012.08561, 2020.
[237] Z. Wu, S. Wang, J. Gu, M. Khabsa, F. Sun, and H. Ma, “Clear: Contrastive learning for sentence representation,” arXiv preprint arXiv:2012.15466, 2020.
[238] J. Giorgi, O. Nitski, B. Wang, and G. Bader, “Declutr: Deep contrastive learning for unsupervised textual representations,” arXiv preprint arXiv:2006.03659, 2020.
[239] H.-Y. Zhou, C. Lu, S. Yang, X. Han, and Y. Yu, “Preservational learning improves self-supervised medical image models by reconstructing diverse contexts,” in IEEE Int. Conf. Comput. Vis., pp. 3499–3509, 2021.
[240] K. Chaitanya, E. Erdil, N. Karani, and E. Konukoglu, “Contrastive learning of global and local features for medical image segmentation with limited annotations,” in Neural Inf. Process. Syst., 2020.
[241] J. Zhu, Y. Li, Y. Hu, K. Ma, S. K. Zhou, and Y. Zheng, “Rubik’s cube+: A self-supervised feature learning framework for 3d medical image analysis,” Medical Image Analysis, p. 101746, 2020.
[242] O. Manas, A. Lacoste, X. Giró-i Nieto, D. Vazquez, and P. Rodriguez, “Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data,” in IEEE Int. Conf. Comput. Vis., pp. 9414–9423, 2021.
[243] D. Wang, Q. Zhang, Y. Xu, J. Zhang, B. Du, D. Tao, and L. Zhang, “Advancing plain vision transformer toward remote sensing foundation model,” IEEE Trans. Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2022.
[244] J. Liu, X. Huang, Y. Liu, and H. Li, “Mixmim: Mixed and masked image modeling for efficient visual representation learning,” arXiv preprint arXiv:2205.13137, 2022.
[245] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6541–6549, 2017.
[246] Q. Garrido, R. Balestriero, L. Najman, and Y. Lecun, “Rankme: Assessing the downstream performance of pretrained self-supervised representations by their rank,” in Int. Conf. Mach. Learn., pp. 10929–10974, PMLR, July 2023.
[247] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” Int. J. Comput. Vis., vol. 88, pp. 303–338, 2010.
[248] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft coco: Common objects in context,” 2015.
[249] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[250] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” Int. J. Comput. Vis., vol. 127, no. 3, pp. 302–321, 2019.
[251] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
[252] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al., “The “something something” video database for learning and evaluating visual common sense,” in IEEE Int. Conf. Comput. Vis., pp. 5842–5850, 2017.
[253] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al., “Ava: A video dataset of spatio-temporally localized atomic visual actions,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6047–6056, 2018.
[254] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
[255] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a large video database for human motion recognition,” in IEEE Int. Conf. Comput. Vis., pp. 2556–2563, IEEE, 2011.
[256] J. Wang, Y. Gao, K. Li, J. Hu, X. Jiang, X. Guo, R. Ji, and X. Sun, “Enhancing unsupervised video representation learning by decoupling the scene and the motion,” in AAAI Conf. Artif. Intell., vol. 35, pp. 10129–10137, 2021.
[257] J. Knights, B. Harwood, D. Ward, A. Vanderkop, O. Mackenzie-Ross, and P. Moghadam, “Temporally coherent embeddings for self-supervised video representation learning,” in 2020 25th International Conference on Pattern Recognition (ICPR), pp. 8914–8921, IEEE, 2021.
[258] A. Recasens, P. Luc, J.-B. Alayrac, L. Wang, F. Strub, C. Tallec, M. Malinowski, V. Pătrăucean, F. Altché, M. Valko, et al., “Broaden your views for self-supervised video learning,” in IEEE Int. Conf. Comput. Vis., pp. 1255–1265, 2021.
[259] C. Yang, Y. Xu, B. Dai, and B. Zhou, “Video representation learning with visual tempo consistency,” arXiv preprint arXiv:2006.15489, 2020.
[260] C. Feichtenhofer, H. Fan, B. Xiong, R. Girshick, and K. He, “A large-scale study on unsupervised spatiotemporal representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3299–3309, 2021.
[261] R. Qian, T. Meng, B. Gong, M.-H. Yang, H. Wang, S. Belongie, and Y. Cui, “Spatiotemporal contrastive video representation learning,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6964–6974, 2021.
[262] J. Robinson, L. Sun, K. Yu, K. Batmanghelich, S. Jegelka, and S. Sra, “Can contrastive learning avoid shortcut solutions?,” in Neural Inf. Process. Syst., pp. 4974–4986, 2021.
[263] Y. Wei, H. Hu, Z. Xie, Z. Zhang, Y. Cao, J. Bao, D. Chen, and B. Guo, “Contrastive learning rivals masked image modeling in fine-tuning via feature distillation,” arXiv preprint arXiv:2205.14141, 2022.
[264] T. Chen, C. Luo, and L. Li, “Intriguing properties of contrastive losses,” in Neural Inf. Process. Syst., vol. 34, pp. 11834–11845, Curran Associates, Inc., 2021.
[265] Y. Tian, X. Chen, and S. Ganguli, “Understanding self-supervised learning dynamics without contrastive pairs,” in Int. Conf. Mach. Learn., pp. 10268–10278, 2021.
[266] Q. Garrido, Y. Chen, A. Bardes, L. Najman, and Y. LeCun, “On the duality between contrastive and non-contrastive self-supervised learning,” in Int. Conf. Learn. Represent., 2023.
[267] S. Lavoie, C. Tsirigotis, M. Schwarzer, A. Vani, M. Noukhovitch, K. Kawaguchi, and A. Courville, “Simplicial embeddings in self-supervised learning and downstream classification,” in Int. Conf. Learn. Represent., 2023.
[268] C. Tao, H. Wang, X. Zhu, J. Dong, S. Song, G. Huang, and J. Dai, “Exploring the equivalence of siamese self-supervised learning via a unified gradient framework,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 14431–14440, 2022.
[269] X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li, “Dense contrastive learning for self-supervised visual pre-training,” in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3024–3033, 2021.
[270] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Som, et al., “Image as a foreign language: Beit pretraining for all vision and vision-language tasks,” arXiv preprint arXiv:2208.10442, 2022.