Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
Abstract

Contrastive representation learning has been outstandingly successful in practice. In this work, we identify two key properties related to the contrastive loss: (1) alignment (closeness) of features from positive pairs, and (2) uniformity of the induced distribution of the (normalized) features on the hypersphere. We prove that, asymptotically, the contrastive loss optimizes these properties, and analyze their positive effects on downstream tasks. Empirically, we introduce an optimizable metric to quantify each property. Extensive experiments on standard vision and language datasets confirm the strong agreement between both metrics and downstream task performance. Directly optimizing for these two metrics leads to representations with comparable or better performance at downstream tasks than contrastive learning.

Project Page: ssnl.github.io/hypersphere. Code: github.com/SsnL/align_uniform.
1. Introduction

A vast number of recent empirical works learn representations with a unit ℓ2 norm constraint, effectively restricting the output space to the unit hypersphere (Parkhi et al., 2015; Schroff et al., 2015; Liu et al., 2017; Hasnat et al., 2017; Wang et al., 2017; Bojanowski & Joulin, 2017; Mettes et al., 2019; Hou et al., 2019; Davidson et al., 2018; Xu & Durrett, 2018), including many recent unsupervised contrastive representation learning methods (Wu et al., 2018; Bachman et al., 2019; Tian et al., 2019; He et al., 2019; Chen et al., 2020).

Intuitively, having the features live on the unit hypersphere leads to several desirable traits. Fixed-norm vectors are known to improve training stability in modern machine learning, where dot products are ubiquitous (Xu & Durrett, 2018; Wang et al., 2017). Moreover, if features of a class are sufficiently well clustered, they are linearly separable from the rest of the feature space (see Figure 2), a common criterion used to evaluate representation quality.

While the unit hypersphere is a popular choice of feature space, not all encoders that map onto it are created equal. Recent works argue that representations should additionally be invariant to unnecessary details, and preserve as much information as possible (Oord et al., 2018; Tian et al., 2019; Hjelm et al., 2018; Bachman et al., 2019). Let us call these two properties alignment and uniformity (see Figure 1).

[Figure 1: Illustration of alignment and uniformity of feature distributions on the output unit hypersphere. Alignment: similar samples have similar features. Uniformity: feature density spreads over the sphere, preserving maximal information. STL-10 (Coates et al., 2011) images are used for demonstration. Figure inspired by Tian et al. (2019).]

MIT Computer Science & Artificial Intelligence Lab (CSAIL). Correspondence to: Tongzhou Wang <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).
3. Preliminaries on Unsupervised Contrastive Representation Learning

The popular unsupervised contrastive representation learning method (often referred to as contrastive learning in this paper) learns representations from unlabeled data. It assumes a way to sample positive pairs, representing similar samples that should have similar representations. Empirically, the positive pairs are often obtained by taking two independently randomly augmented versions of the same sample, e.g., two crops of the same image (Wu et al., 2018; Hjelm et al., 2018; Bachman et al., 2019; He et al., 2019; Chen et al., 2020).

Let p_data(·) be the data distribution over R^n and p_pos(·, ·) the distribution of positive pairs over R^n × R^n. Based on empirical practices, we assume the following property.

Assumption. Distributions p_data and p_pos should satisfy
• Symmetry: ∀x, y, p_pos(x, y) = p_pos(y, x).
• Matching marginal: ∀x, ∫ p_pos(x, y) dy = p_data(x).

We consider the following specific and widely popular form of contrastive loss for training an encoder f : R^n → S^{m−1}, mapping data to ℓ2-normalized feature vectors of dimension m. This loss has been shown effective by many recent representation learning methods (Logeswaran & Lee, 2018; Wu et al., 2018; Tian et al., 2019; He et al., 2019; Hjelm et al., 2018; Bachman et al., 2019; Chen et al., 2020):

$$
\mathcal{L}_{\text{contrastive}}(f; \tau, M) \triangleq
\mathop{\mathbb{E}}_{\substack{(x, y) \sim p_{\text{pos}} \\ \{x_i^-\}_{i=1}^{M} \overset{\text{i.i.d.}}{\sim} p_{\text{data}}}}
\left[ -\log \frac{e^{f(x)^\top f(y)/\tau}}{e^{f(x)^\top f(y)/\tau} + \sum_i e^{f(x_i^-)^\top f(y)/\tau}} \right],
\tag{1}
$$

where τ > 0 is a scalar temperature hyperparameter, and M ∈ Z_+ is a fixed number of negative samples.

The term contrastive loss has also been used more generally to refer to various objectives based on positive and negative samples, e.g., in Siamese networks (Chopra et al., 2005; Hadsell et al., 2006). In this work, we focus on the specific form in Equation (1) that is widely used in the modern unsupervised contrastive representation learning literature.

Necessity of normalization. Without the norm constraint, the softmax distribution can be made arbitrarily sharp by simply scaling all the features. Wang et al. (2017) provided an analysis of this effect and argued for the necessity of normalization when using feature vector dot products in a cross entropy loss, as in Eqn. (1). Experimentally, Chen et al. (2020) also showed that normalizing outputs leads to superior representations.

The InfoMax principle. Many empirical works are motivated by the InfoMax principle of maximizing I(f(x); f(y)) for (x, y) ∼ p_pos (Tian et al., 2019; Bachman et al., 2019; Wu et al., 2020). Usually they interpret L_contrastive in Eqn. (1) as a lower bound of I(f(x); f(y)) (Oord et al., 2018; Hjelm et al., 2018; Bachman et al., 2019; Tian et al., 2019). However, this interpretation is known to have issues in practice, e.g., maximizing a tighter bound often leads to worse downstream task performance (Tschannen et al., 2019). Therefore, instead of viewing it as a bound, we investigate the exact behavior of directly optimizing L_contrastive in the following sections.

4. Feature Distribution on the Hypersphere

The contrastive loss encourages learned feature representations for positive pairs to be similar, while pushing features from randomly sampled negative pairs apart. Conventional wisdom says that representations should extract the most shared information between positive pairs and remain invariant to other noise factors (Linsker, 1988; Tian et al., 2019; Wu et al., 2020; Bachman et al., 2019). Therefore, the loss should favor the following two properties:
• Alignment: two samples forming a positive pair should be mapped to nearby features, and thus be (mostly) invariant to unneeded noise factors.
• Uniformity: feature vectors should be roughly uniformly distributed on the unit hypersphere S^{m−1}, preserving as much information of the data as possible.

To empirically verify this, we visualize CIFAR-10 (Torralba et al., 2008; Krizhevsky et al., 2009) representations on S^1 (m = 2) obtained via three different methods:
• Random initialization.
• Supervised predictive learning: an encoder and a linear classifier are jointly trained from scratch with cross entropy loss on supervised labels.
• Unsupervised contrastive learning: an encoder is trained w.r.t. L_contrastive with τ = 0.5 and M = 256.

All three encoders share the same AlexNet-based architecture (Krizhevsky et al., 2012), modified to map input images to 2-dimensional vectors in S^1. Both predictive and contrastive learning use standard data augmentations to augment the dataset and sample positive pairs.

Figure 3 summarizes the resulting distributions of validation set features. Indeed, features from unsupervised contrastive learning (bottom in Figure 3) exhibit the most uniform distribution, and are closely clustered for positive pairs.

The form of the contrastive loss in Eqn. (1) also suggests this. We present informal arguments below, followed by more formal treatment in Section 4.2. From the symmetry …
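To make Eqn. (1) concrete, the following is a minimal PyTorch sketch of the contrastive loss for a batch of positive pairs with explicitly sampled negatives. It is our own illustrative reading of the equation, not the training code used in the experiments; the tensor names and shapes are assumptions, and all features are assumed to be ℓ2-normalized already.

import torch
import torch.nn.functional as F

def l_contrastive(x, y, x_neg, tau=0.5):
    # x, y  : [bsz, m] ℓ2-normalized features of positive pairs
    # x_neg : [bsz, M, m] ℓ2-normalized features of M i.i.d. negatives per pair
    pos = (x * y).sum(dim=1, keepdim=True) / tau        # f(x)^T f(y) / τ, shape [bsz, 1]
    neg = torch.einsum('bkm,bm->bk', x_neg, y) / tau    # f(x_i^-)^T f(y) / τ, shape [bsz, M]
    logits = torch.cat([pos, neg], dim=1)               # [bsz, 1 + M]
    # -log of the softmax probability assigned to the positive pair, as in Eqn. (1)
    return -F.log_softmax(logits, dim=1)[:, 0].mean()

With τ = 0.5 and M = 256, this mirrors the contrastive setup used for the S^1 visualization above.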
[Figure 3: Alignment and uniformity of feature distributions of the three encoders on the CIFAR-10 validation set. Alignment panels show histograms of positive-pair feature ℓ2 distances; uniformity panels show feature distributions on S^1, overall and for classes 0, 3, 6, and 9, plotted both as features and as angle histograms.]
Figure 4: Average pairwise G2 potential as a measure of uniformity. Each plot shows 10000 points distributed on S^1, obtained via either applying an encoder on the CIFAR-10 validation set (same as those in Figure 3) or sampling from a distribution on S^1, as described in plot titles. We show the points with Gaussian KDE and the angles with vMF KDE.
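As a quick illustration of this uniformity measure, the toy sketch below (our own example; the function names, sample count, and cluster spread are assumptions) compares the average pairwise G2 potential of roughly uniform versus tightly clustered points on S^1; the more uniform set should score lower.

import math
import torch

def avg_g2_potential(x, t=2):
    # mean over all pairs of exp(-t * ||u - v||^2) for points u, v in x
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean()

def on_circle(angles):
    # map angles to unit vectors on S^1
    return torch.stack([angles.cos(), angles.sin()], dim=1)

uniform = on_circle(torch.rand(2000) * 2 * math.pi)   # spread over the circle
clustered = on_circle(0.1 * torch.randn(2000))        # concentrated near angle 0
print(avg_g2_potential(uniform).item())               # lower: closer to uniform
print(avg_g2_potential(clustered).item())             # higher: far from uniform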
Realizability of perfect uniformity. We note that it is not always possible to achieve perfect uniformity, e.g., when the data manifold in R^n is lower dimensional than the feature space S^{m−1}. Moreover, in the case that p_data and p_pos are formed by sampling augmented samples from a finite dataset, there cannot be an encoder that is both perfectly aligned and perfectly uniform, because perfect alignment implies that all augmentations of a single element have the same feature vector. Nonetheless, perfectly uniform encoder functions do exist under the conditions that n ≥ m − 1 and p_data has bounded density.

We analyze the asymptotics with infinite negative samples. Existing empirical work has established that a larger number of negative samples consistently leads to better downstream task performance (Wu et al., 2018; Tian et al., 2019; He et al., 2019; Chen et al., 2020), and often uses very large values (e.g., M = 65536 in He et al. (2019)). The following theorem confirms that optimizing w.r.t. the limiting loss indeed requires both alignment and uniformity.

Theorem 1 (Asymptotics of L_contrastive). For fixed τ > 0, as the number of negative samples M → ∞, the (normalized) contrastive loss converges to

$$
\lim_{M \to \infty} \mathcal{L}_{\text{contrastive}}(f; \tau, M) - \log M =
-\frac{1}{\tau} \mathop{\mathbb{E}}_{(x, y) \sim p_{\text{pos}}}\!\left[ f(x)^\top f(y) \right]
+ \mathop{\mathbb{E}}_{x \sim p_{\text{data}}}\!\left[ \log \mathop{\mathbb{E}}_{x^- \sim p_{\text{data}}}\!\left[ e^{f(x^-)^\top f(x)/\tau} \right] \right].
\tag{2}
$$

We have the following results:
1. The first term is minimized iff f is perfectly aligned.
2. If perfectly uniform encoders exist, they form the exact minimizers of the second term.
3. For the convergence in Equation (2), the absolute deviation from the limit decays in O(M^{−2/3}).

Proof. See supplementary material.

Relation with L_uniform. The proof of Theorem 1 in the supplementary material connects the asymptotic L_contrastive form with minimizing the average pairwise Gaussian potential, i.e., minimizing L_uniform. Compared with the second term of Equation (2), L_uniform essentially pushes the log outside the outer expectation, without changing the minimizers (perfectly uniform encoders). However, due to its pairwise nature, L_uniform is much simpler in form and avoids the computationally expensive softmax operation in L_contrastive (Goodman, 2001; Bengio et al.; Gutmann & Hyvärinen, 2010; Grave et al., 2017; Chen et al., 2018).

Relation with feature distribution entropy estimation. When p_data is uniform over finite samples {x_1, x_2, …, x_N} (e.g., a collected dataset), the second term in Equation (2) can alternatively be viewed as a resubstitution entropy estimator of f(x) (Ahmad & Lin, 1976), where x follows the underlying distribution p_nature that generates {x_i}_{i=1}^N, via a von Mises-Fisher (vMF) kernel density estimation (KDE):

$$
\begin{aligned}
\mathop{\mathbb{E}}_{x \sim p_{\text{data}}}\!\left[ \log \mathop{\mathbb{E}}_{x^- \sim p_{\text{data}}}\!\left[ e^{f(x^-)^\top f(x)/\tau} \right] \right]
&= \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{N} \sum_{j=1}^{N} e^{f(x_i)^\top f(x_j)/\tau} \\
&= \frac{1}{N} \sum_{i=1}^{N} \log \hat{p}_{\text{vMF-KDE}}(f(x_i)) + \log Z_{\text{vMF}} \\
&\triangleq -\hat{H}(f(x)) + \log Z_{\text{vMF}}, \quad x \sim p_{\text{nature}} \\
&\triangleq -\hat{I}(x; f(x)) + \log Z_{\text{vMF}}, \quad x \sim p_{\text{nature}},
\end{aligned}
$$

where
• p̂_vMF-KDE is the KDE based on samples {f(x_j)}_{j=1}^N using a vMF kernel with κ = τ^{−1},
• Z_vMF is the vMF normalization constant for κ = τ^{−1},
• Ĥ denotes the resubstitution entropy estimator,
• Î denotes the mutual information estimator based on Ĥ, valid since f is a deterministic function.

Relation with the InfoMax principle. Many empirical works are motivated by the InfoMax principle, i.e., maximizing I(f(x); f(y)) for (x, y) ∼ p_pos. However, the interpretation of L_contrastive as a lower bound of I(f(x); f(y)) is known to be inconsistent with its actual behavior in practice (Tschannen et al., 2019). Our results instead analyze the properties of L_contrastive itself. Considering the identity I(f(x); f(y)) = H(f(x)) − H(f(x) | f(y)), we can see that while uniformity indeed favors large H(f(x)), alignment is stronger than merely desiring small H(f(x) | f(y)). Our above analysis instead suggests that L_contrastive optimizes for aligned and information-preserving encoders.

Finally, even for the case where only a single negative sample is used (i.e., M = 1), we can still prove a weaker result, which we describe in detail in the supplementary material.
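Before turning to experiments, the decomposition in Theorem 1 can be checked numerically on a toy scale. The sketch below is our own Monte Carlo example under synthetic features (the distribution, constants, and helper names are all assumptions, not an experiment from the paper): it compares the finite-M loss minus log M against an estimate of the limit in Equation (2), and the gap should shrink as M grows.

import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
m, tau, B = 32, 0.5, 512   # feature dim, temperature, Monte Carlo batch (assumed)

def sphere(*shape):
    # i.i.d. samples exactly uniform on S^(m-1) (normalized Gaussians)
    return F.normalize(torch.randn(*shape, m), dim=-1)

# toy "encoder outputs": positive pairs are nearby perturbations of a shared point
z = sphere(B)
x = F.normalize(z + 0.1 * torch.randn(B, m), dim=-1)
y = F.normalize(z + 0.1 * torch.randn(B, m), dim=-1)

def contrastive(M):
    # finite-M contrastive loss of Eqn. (1), estimated over the batch
    pos = (x * y).sum(-1, keepdim=True) / tau
    neg = torch.einsum('bkm,bm->bk', sphere(B, M), y) / tau
    return -F.log_softmax(torch.cat([pos, neg], dim=1), dim=1)[:, 0].mean()

# right-hand side of Equation (2), estimated with B fresh negatives per anchor
limit = -(x * y).sum(-1).mean() / tau \
    + (torch.logsumexp(sphere(B) @ y.t() / tau, dim=0) - math.log(B)).mean()

for M in (16, 256, 2048):
    print(M, (contrastive(M) - math.log(M)).item(), limit.item())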
5. Experiments

In this section, we empirically verify the hypothesis that alignment and uniformity are desired properties for representations. Recall that our two metrics are

$$
\begin{aligned}
\mathcal{L}_{\text{align}}(f; \alpha) &\triangleq \mathop{\mathbb{E}}_{(x, y) \sim p_{\text{pos}}}\!\left[ \lVert f(x) - f(y) \rVert_2^{\alpha} \right], \\
\mathcal{L}_{\text{uniform}}(f; t) &\triangleq \log \mathop{\mathbb{E}}_{x, y \overset{\text{i.i.d.}}{\sim} p_{\text{data}}}\!\left[ e^{-t \lVert f(x) - f(y) \rVert_2^2} \right].
\end{aligned}
$$
import torch

# bsz : batch size (number of positive pairs)
# d   : latent dim
# x   : Tensor, shape=[bsz, d]
#       latents for one side of positive pairs
# y   : Tensor, shape=[bsz, d]
#       latents for the other side of positive pairs
# lam : hyperparameter balancing the two losses

def lalign(x, y, alpha=2):
    return (x - y).norm(dim=1).pow(alpha).mean()

def lunif(x, t=2):
    sq_pdist = torch.pdist(x, p=2).pow(2)
    return sq_pdist.mul(-t).exp().mean().log()

loss = lalign(x, y) + lam * (lunif(x) + lunif(y)) / 2

Figure 5: PyTorch implementation of L_align and L_uniform.
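A usage note for Figure 5 (our own illustrative sketch; the linear encoder and random inputs are stand-ins for a real model and dataset): both losses assume the features already lie on the unit hypersphere, so encoder outputs should be ℓ2-normalized first.

import torch
import torch.nn.functional as F

bsz, d = 768, 128                           # batch size and output dim as in Table 1
encoder = torch.nn.Linear(3 * 96 * 96, d)   # stand-in for the AlexNet-based encoder

imgs1 = torch.randn(bsz, 3 * 96 * 96)       # one augmentation per image (dummy data)
imgs2 = torch.randn(bsz, 3 * 96 * 96)       # a second, independent augmentation

x = F.normalize(encoder(imgs1), dim=1)      # project onto S^(d-1)
y = F.normalize(encoder(imgs2), dim=1)
loss = lalign(x, y) + (lunif(x) + lunif(y)) / 2   # lam = 1 here (assumed)
loss.backward()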
We conduct extensive experiments with convolutional neural network (CNN) and recurrent neural network (RNN) based encoders on four popular representation learning benchmarks with distinct types of downstream tasks:

• STL-10 (Coates et al., 2011) classification on AlexNet-based encoder outputs or intermediate activations with a linear or k-nearest neighbor (k-NN) classifier.
• NYU-DEPTH-V2 (Silberman et al., 2012) depth prediction on CNN encoder intermediate activations after convolution layers.
• IMAGENET-100 (100 randomly selected classes from IMAGENET) classification on CNN encoder penultimate layer activations with a linear classifier.
• BOOKCORPUS (Zhu et al., 2015) RNN sentence encoder outputs used for Movie Review Sentence Polarity (MR) (Pang & Lee, 2005) and Customer Product Review Sentiment (CR) (Wang & Manning, 2012) binary classification tasks with logistic classifiers.

For image datasets, we follow the standard practice and choose positive pairs as two independent augmentations of the same image. For BOOKCORPUS, positive pairs are chosen as neighboring sentences, following Quick-Thought Vectors (Logeswaran & Lee, 2018).

We perform the majority of our analysis on STL-10 and NYU-DEPTH-V2 encoders, where we calculate L_contrastive with negatives being other samples within the minibatch following the standard practice (Hjelm et al., 2018; Bachman et al., 2019; Tian et al., 2019; Chen et al., 2020), and L_uniform as the logarithm of average pairwise feature potentials, also within the minibatch. Due to their simple forms, these two losses can be implemented in PyTorch (Paszke et al., 2019) with fewer than 10 lines of code, as shown in Figure 5.

To investigate alignment and uniformity properties on recent contrastive representation learning variants and larger datasets, we also analyze IMAGENET-100 encoders trained with Momentum Contrast (MoCo) (He et al., 2019) and BOOKCORPUS encoders trained with Quick-Thought Vectors (Logeswaran & Lee, 2018), with these methods modified to also allow L_align and L_uniform.

We optimize a total of 306 STL-10 encoders, 64 NYU-DEPTH-V2 encoders, 45 IMAGENET-100 encoders, and 108 BOOKCORPUS encoders without supervision. The encoders are optimized w.r.t. weighted combinations of L_contrastive, L_align, and/or L_uniform, with varying
• (possibly zero) weights on the three losses,
• loss hyperparameters: τ for L_contrastive, α for L_align, and t for L_uniform,
• batch size (affecting the number of (negative) pairs for L_contrastive and L_uniform),
• embedding dimension,
• number of training epochs and learning rate,
• initialization (from scratch vs. a pretrained encoder).

See the supplementary material for more experiment details and the exact configurations used.

Both L_align and L_uniform strongly agree with downstream task performance. For each encoder, we measure the downstream task performance and the L_align and L_uniform metrics on the validation set. Figure 6 visualizes the trends between both metrics and representation quality. We observe that the two metrics strongly agree with representation quality overall. In particular, the best performing encoders are exactly the ones with low L_align and L_uniform, i.e., the lower left corners in Figure 6. In the supplementary material, we observe that as long as the ratio between the weights on L_align and L_uniform is not too large (e.g., < 4), the representation quality remains relatively good and insensitive to the exact weight choices.

Directly optimizing only L_align and L_uniform can lead to better representations. As shown in Table 1, encoders trained with only L_align and L_uniform consistently outperform their L_contrastive-trained counterparts, for both tasks. Theoretically, Theorem 1 showed that L_contrastive optimizes alignment and uniformity asymptotically with infinite negative samples. This empirical performance gap suggests that directly optimizing these properties can be superior in practice, where we can only have finite negatives.

L_align and L_uniform causally affect downstream task performance. We take an encoder trained with L_contrastive using a suboptimal temperature τ = 2.5, and finetune it according to L_align and/or L_uniform; a minimal sketch of this setup follows below. Figure 7 visualizes the finetuning trajectories. When only one of alignment and uniformity is optimized, the corresponding metric improves, but both the other metric and the performance degrade. However, when both properties are optimized, the representation quality steadily increases. These trends confirm the causal effect of alignment and uniformity on representation quality, and suggest that directly optimizing them can be a reasonable choice.
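The sketch below is our own minimal rendition of such a finetuning loop (the encoder, data loader, and optimizer settings are stand-ins; lalign and lunif are from Figure 5, and the loss weights follow the ranges shown in Figure 7):

import torch
import torch.nn.functional as F

# stand-in for the encoder pretrained w.r.t. L_contrastive with τ = 2.5
encoder = torch.nn.Linear(3 * 96 * 96, 128)
opt = torch.optim.SGD(encoder.parameters(), lr=1e-3)   # optimizer settings assumed
w_align, w_unif = 0.025, 0.025   # both properties optimized; zero one weight for the ablations

# stand-in loader yielding two augmentations per image
loader = [(torch.randn(768, 3 * 96 * 96), torch.randn(768, 3 * 96 * 96))]

for imgs1, imgs2 in loader:
    x = F.normalize(encoder(imgs1), dim=1)
    y = F.normalize(encoder(imgs2), dim=1)
    loss = w_align * lalign(x, y) + w_unif * (lunif(x) + lunif(y)) / 2
    opt.zero_grad()
    loss.backward()
    opt.step()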
[Figure 6 panels: "Linear Classification on Outputs", "5-NN Classification on fc7", and "Depth Prediction on conv5". Each scatter plot compares encoders trained with the contrastive loss only, with alignment and uniformity losses only, or with all three mixed, with axes L_align(α = 2) and L_uniform(t = 2) and color showing validation accuracy or MSE.]

(a) 306 STL-10 encoders are evaluated with linear classification on output features and 5-nearest neighbor (5-NN) on fc7 activations. Higher accuracy (blue color) is better. (b) 64 NYU-DEPTH-V2 encoders are evaluated with CNN depth regressors on conv5 activations. Lower MSE (blue color) is better.

Figure 6: Metrics and performance of STL-10 and NYU-DEPTH-V2 experiments. Each point represents a trained encoder, with its x- and y-coordinates showing the L_align and L_uniform metrics and its color showing the performance on the validation set. Blue is better for both tasks. Encoders with low L_align and L_uniform are consistently the better performing ones (lower left corners).
Table 1: Encoder evaluations. STL-10: numbers show linear and 5-nearest neighbor (5-NN) classification accuracies. The best result is picked by encoder-output linear classifier accuracy from a 5-fold training set cross validation, among all 150 encoders trained from scratch with 128-dimensional output and 768 batch size. NYU-DEPTH-V2: numbers show depth prediction mean squared error (MSE). The best result is picked based on conv5 layer MSE from a 5-fold training set cross validation, among all 64 encoders trained from scratch with 128-dimensional output and 128 batch size.

                                     STL-10 Validation Set Accuracy ↑                       NYU-DEPTH-V2 Validation Set MSE ↓
                                     Output + Linear   Output + 5-NN   fc7 + Linear   fc7 + 5-NN   conv5    conv4
  Best L_contrastive only            80.46%            78.75%          83.89%         76.33%       0.7024   0.7575
  Best L_align and L_uniform only    81.15%            78.89%          84.43%         76.78%       0.7014   0.7592
[Figure 7 panels: "Finetune with 0.0025 · L_align", "Finetune with 0.0005 · L_uniform", and "Finetune with 0.025 · L_align + 0.025 · L_uniform", each tracking L_uniform(t = 2) (exponentiated), L_align(α = 2), and validation accuracy.]

Figure 7: Finetuning trajectories from an STL-10 encoder trained with L_contrastive using a suboptimal temperature τ = 2.5. Finetuning objectives are weighted combinations of L_align(α = 2) and L_uniform(t = 2). For each intermediate checkpoint, we measure the L_align and L_uniform metrics, as well as the validation accuracy of a linear classifier trained from scratch on the encoder outputs. L_uniform is exponentiated for plotting purposes. Left and middle: performance degrades if only one of alignment and uniformity is optimized. Right: performance improves when both are optimized.
Alignment and uniformity also matter in other contrastive representation learning variants. MoCo (He et al., 2019) and Quick-Thought Vectors (Logeswaran & Lee, 2018) are contrastive representation learning variants that have nontrivial differences from directly optimizing L_contrastive in Equation (1). MoCo introduces a memory queue and a momentum encoder. Quick-Thought Vectors uses two different encoders to encode each sentence in a positive pair, only normalizes encoder outputs during evaluation, and does not use random sampling to obtain minibatches. After modifying them to also allow L_align and L_uniform, we train these methods on IMAGENET-100 and BOOKCORPUS, respectively. Figure 8 shows that the L_align and L_uniform metrics are still correlated with the downstream task performances. Table 2 shows that directly optimizing them also leads to comparable or better representation quality. These results suggest that alignment and uniformity are indeed desirable properties for representations, for both image and text modalities, and are likely connected with general contrastive representation learning methods.
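As one illustration of such a modification (our own sketch of a plausible approach, not the exact MoCo changes used for these experiments), the uniformity term can be computed between current batch features and a MoCo-style memory queue, treating the queue entries as samples from p_data:

import torch

def lunif_with_queue(x, queue, t=2):
    # x     : [bsz, d] ℓ2-normalized query features from the current batch
    # queue : [K, d] ℓ2-normalized features from a MoCo-style memory queue (stand-in)
    # Gaussian potential between batch and queue entries; log of the mean pairwise
    # potential, mirroring the L_uniform form in Figure 5.
    sq_dists = torch.cdist(x, queue, p=2).pow(2)   # [bsz, K] pairwise squared distances
    return sq_dists.mul(-t).exp().mean().log()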
[Figure 8 panels: "Linear Classification on Penultimate Layer", "Movie Review Classification on Outputs", and "Customer Review Classification on Outputs". Each scatter plot compares encoders trained with the contrastive loss only, with alignment and uniformity losses only, or with all three mixed, with axes L_align(α = 2) and L_uniform(t = 2) and color showing validation accuracy.]

(a) 45 IMAGENET-100 encoders are trained with MoCo-based methods, and evaluated with linear classification. (b) 108 BOOKCORPUS encoders are trained with Quick-Thought-Vectors-based methods, and evaluated with logistic binary classification on the Movie Review Sentence Polarity (MR) and Customer Product Review Sentiment (CR) tasks.

Figure 8: Metrics and performance of IMAGENET-100 and BOOKCORPUS experiments. Each point represents a trained encoder, with its x- and y-coordinates showing the L_align and L_uniform metrics and its color showing the validation accuracy. Blue is better. Encoders with low L_align and L_uniform consistently perform well (lower left corners), even though the training methods (based on MoCo and Quick-Thought Vectors) are different from directly optimizing the contrastive loss in Equation (1).
Table 2: Encoder evaluations. IMAGENET-100: numbers show linear classifier accuracies on encoder penultimate layer activations. The best result is picked based on top-1 accuracy from a 3-fold training set cross validation, among all 45 encoders trained from scratch with 128-dimensional output and 128 batch size. BOOKCORPUS: numbers show Movie Review Sentence Polarity (MR) and Customer Product Review Sentiment (CR) classification accuracies of logistic classifiers fit on encoder outputs. The best result is picked based on accuracy from a 5-fold training set cross validation, individually for MR and CR, among all 108 encoders trained from scratch with 1200-dimensional output and 400 batch size.
Acknowledgements

We thank Philip Bachman, Ching-Yao Chuang, Justin Solomon, Yonglong Tian, and Zhenyang Zhang for many helpful comments and suggestions. Tongzhou Wang was supported by the MIT EECS Merrill Lynch Graduate Fellowship.

References

Chen, P. H., Si, S., Kumar, S., Li, Y., and Hsieh, C.-J. Learning to screen for fast softmax inference on large vocabulary neural networks. 2018.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pp. 539-546. IEEE, 2005.

Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215-223, 2011.

Cohn, H. and Kumar, A. Universally optimal distribution of points on spheres. Journal of the American Mathematical Society, 20(1):99-148, 2007.

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. Hyperspherical variational auto-encoders. 34th Conference on Uncertainty in Artificial Intelligence (UAI-18), 2018.

Goodman, J. Classes for fast maximum entropy training. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pp. 561-564. IEEE, 2001.

Götz, M. and Saff, E. B. Note on d-extremal configurations for the sphere in R^{d+1}. In Recent Progress in Multivariate Approximation, pp. 159-162. Springer, 2001.

Grave, E., Joulin, A., Cissé, M., Jégou, H., et al. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1302-1310. JMLR.org, 2017.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297-304, 2010.

Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L., Rezende, D., and Lerchner, A. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

Hou, S., Pan, X., Loy, C. C., Wang, Z., and Lin, D. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 831-839, 2019.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

Landkof, N. S. Foundations of Modern Potential Theory, volume 180. Springer, 1972.

Linsker, R. Self-organization in a perceptual network. Computer, 21(3):105-117, 1988.

Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., and Song, L. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212-220, 2017.

Liu, W., Lin, R., Liu, Z., Liu, L., Yu, Z., Dai, B., and Song, L. Learning towards minimum hyperspherical energy. In Advances in Neural Information Processing Systems, pp. 6222-6233, 2018.

Logeswaran, L. and Lee, H. An efficient framework for learning sentence representations. In International Conference on Learning Representations, 2018.

Mettes, P., van der Pol, E., and Snoek, C. Hyperspherical prototype networks. In Advances in Neural Information Processing Systems, pp. 1485-1495, 2019.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Pang, B. and Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 115-124. Association for Computational Linguistics, 2005.

Parkhi, O. M., Vedaldi, A., and Zisserman, A. Deep face recognition. 2015.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8026-8037, 2019.

Saunshi, N., Plevrakis, O., Arora, S., Khodak, M., and Khandeparkar, H. A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning, pp. 5628-5637, 2019.

Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.

Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.

Wang, F., Xiang, X., Cheng, J., and Yuille, A. L. NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, pp. 1041-1049, 2017.

Wang, S. and Manning, C. D. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pp. 90-94. Association for Computational Linguistics, 2012.

Wu, M., Zhuang, C., Yamins, D., and Goodman, N. On the importance of views in unsupervised representation learning. 2020.

Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733-3742, 2018.

Xu, J. and Durrett, G. Spherical latent spaces for stable variational autoencoders. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4503-4513, 2018.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724, 2015.