
Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere

Tongzhou Wang ¹  Phillip Isola ¹

¹ MIT Computer Science & Artificial Intelligence Lab (CSAIL). Correspondence to: Tongzhou Wang <[email protected]>.
Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract
Contrastive representation learning has been outstandingly successful in practice. In this work, we identify two key properties related to the contrastive loss: (1) alignment (closeness) of features from positive pairs, and (2) uniformity of the induced distribution of the (normalized) features on the hypersphere. We prove that, asymptotically, the contrastive loss optimizes these properties, and analyze their positive effects on downstream tasks. Empirically, we introduce an optimizable metric to quantify each property. Extensive experiments on standard vision and language datasets confirm the strong agreement between both metrics and downstream task performance. Directly optimizing for these two metrics leads to representations with comparable or better performance at downstream tasks than contrastive learning.
Project Page: ssnl.github.io/hypersphere.
Code: github.com/SsnL/align_uniform.

1. Introduction
A vast number of recent empirical works learn representations with a unit ℓ2 norm constraint, effectively restricting the output space to the unit hypersphere (Parkhi et al., 2015; Schroff et al., 2015; Liu et al., 2017; Hasnat et al., 2017; Wang et al., 2017; Bojanowski & Joulin, 2017; Mettes et al., 2019; Hou et al., 2019; Davidson et al., 2018; Xu & Durrett, 2018), including many recent unsupervised contrastive representation learning methods (Wu et al., 2018; Bachman et al., 2019; Tian et al., 2019; He et al., 2019; Chen et al., 2020).

Intuitively, having the features live on the unit hypersphere leads to several desirable traits. Fixed-norm vectors are known to improve training stability in modern machine learning where dot products are ubiquitous (Xu & Durrett, 2018; Wang et al., 2017). Moreover, if features of a class are sufficiently well clustered, they are linearly separable with the rest of feature space (see Figure 2), a common criterion used to evaluate representation quality.

[Figure 1: Illustration of alignment and uniformity of feature distributions on the output unit hypersphere. STL-10 (Coates et al., 2011) images are used for demonstration. Panel "Alignment: Similar samples have similar features." (Figure inspired by Tian et al. (2019).) Panel "Uniformity: Preserve maximal information." (shown as a feature density).]

While the unit hypersphere is a popular choice of feature space, not all encoders that map onto it are created equal. Recent works argue that representations should additionally be invariant to unnecessary details, and preserve as much information as possible (Oord et al., 2018; Tian et al., 2019; Hjelm et al., 2018; Bachman et al., 2019). Let us call these two properties alignment and uniformity (see
Figure 1). Alignment favors encoders that assign similar features to similar samples. Uniformity prefers a feature distribution that preserves maximal information, i.e., the uniform distribution on the unit hypersphere.

[Figure 2: Hypersphere: When classes are well-clustered (forming spherical caps), they are linearly separable. The same does not hold for Euclidean spaces. Panel label: "Hypersphere: Clustered sets are linearly separable", with a linear classifier drawn between the clusters.]

In this work, we analyze the alignment and uniformity properties. We show that a currently popular form of contrastive representation learning in fact directly optimizes for these two properties in the limit of infinite negative samples. We propose theoretically-motivated metrics for alignment and uniformity, and observe strong agreement between them and downstream task performance. Remarkably, directly optimizing for these two metrics leads to comparable or better performance than contrastive learning.

Our main contributions are:
• We propose quantifiable metrics for alignment and uniformity as two measures of representation quality, with theoretical motivations.
• We prove that the contrastive loss optimizes for alignment and uniformity asymptotically.
• Empirically, we find strong agreement between both metrics and downstream task performance.
• Despite being simple in form, our proposed metrics, when directly optimized with no other loss, empirically lead to comparable or better performance at downstream tasks than contrastive learning.

2. Related Work

Unsupervised Contrastive Representation Learning has seen remarkable success in learning representations for image and sequential data (Logeswaran & Lee, 2018; Wu et al., 2018; Oord et al., 2018; Hénaff et al., 2019; Tian et al., 2019; Hjelm et al., 2018; Bachman et al., 2019; Tian et al., 2019; He et al., 2019; Chen et al., 2020). The common motivation behind these works is the InfoMax principle (Linsker, 1988), which we here instantiate as maximizing the mutual information (MI) between two views (Tian et al., 2019; Bachman et al., 2019; Wu et al., 2020). However, this interpretation is known to be inconsistent with the actual behavior in practice, e.g., optimizing a tighter bound on MI can lead to worse representations (Tschannen et al., 2019). What the contrastive loss exactly does remains largely a mystery. Analysis based on the assumption of latent classes provides nice theoretical insights (Saunshi et al., 2019), but unfortunately has a rather large gap with empirical practices: the result that representation quality suffers with a large number of negatives is inconsistent with empirical observations (Wu et al., 2018; Tian et al., 2019; He et al., 2019; Chen et al., 2020). In this paper, we analyze and characterize the behavior of contrastive learning from the perspective of alignment and uniformity properties, and empirically verify our claims with standard representation learning tasks.

Representation learning on the unit hypersphere. Outside contrastive learning, many other representation learning approaches also normalize their features to be on the unit hypersphere. In variational autoencoders, the hyperspherical latent space has been shown to perform better than the Euclidean space (Xu & Durrett, 2018; Davidson et al., 2018). Directly matching uniformly sampled points on the unit hypersphere is known to provide good representations (Bojanowski & Joulin, 2017), agreeing with our intuition that uniformity is a desirable property. Mettes et al. (2019) optimize prototype representations on the unit hypersphere for classification. Hyperspherical face embeddings greatly outperform the unnormalized counterparts (Parkhi et al., 2015; Liu et al., 2017; Wang et al., 2017; Schroff et al., 2015). This empirical success suggests that the unit hypersphere is indeed a nice feature space. In this work, we formally investigate the interplay between the hypersphere geometry and the popular contrastive representation learning.

Distributing points on the unit hypersphere. The problem of uniformly distributing points on the unit hypersphere is a well-studied one. It is often defined as minimizing the total pairwise potential w.r.t. a certain kernel function (Borodachov et al., 2019; Landkof, 1972), e.g., the Thomson problem of finding the minimal electrostatic potential energy configuration of electrons (Thomson, 1904), and minimization of the Riesz s-potential (Götz & Saff, 2001; Hardin & Saff, 2005; Liu et al., 2018). The uniformity metric we propose is based on the Gaussian potential, which can be used to represent a very general class of kernels and is closely related to the universally optimal point configurations (Borodachov et al., 2019; Cohn & Kumar, 2007). Additionally, the best-packing problem on hyperspheres (often called the Tammes problem) is also well studied (Tammes, 1930).
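
As a small illustration of the "total pairwise potential" view described above, the following sketch computes a Riesz s-energy for two hand-picked point sets on the circle S^1. The function names, the choice s = 1 (the Coulomb-type kernel of the Thomson problem, which is classically posed on the 2-sphere rather than the circle), and the two point sets are assumptions made for illustration only; they are not taken from the paper.

```python
import math

def points_on_circle(angles):
    """Map angles (radians) to unit vectors on S^1."""
    return [(math.cos(a), math.sin(a)) for a in angles]

def riesz_energy(points, s=1.0):
    """Total pairwise Riesz s-energy: sum over i<j of 1 / ||u_i - u_j||^s."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += 1.0 / math.dist(points[i], points[j]) ** s
    return total

n = 6
# Illustrative configurations (assumed): a regular hexagon vs. a short arc.
evenly_spaced = points_on_circle([2 * math.pi * k / n for k in range(n)])
clustered = points_on_circle([0.1 * k for k in range(n)])

energy_even = riesz_energy(evenly_spaced)
energy_clustered = riesz_energy(clustered)
print(f"evenly spaced: {energy_even:.3f}, clustered: {energy_clustered:.3f}")
```

Under these assumptions the evenly spaced configuration has the lower energy, which is the sense in which minimizing a repulsive pairwise potential spreads points out.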
3. Preliminaries on Unsupervised Contrastive Representation Learning

The popular unsupervised contrastive representation learning method (often referred to as contrastive learning in this paper) learns representations from unlabeled data. It assumes a way to sample positive pairs, representing similar samples that should have similar representations. Empirically, the positive pairs are often obtained by taking two independently randomly augmented versions of the same sample, e.g., two crops of the same image (Wu et al., 2018; Hjelm et al., 2018; Bachman et al., 2019; He et al., 2019; Chen et al., 2020).

Let $p_{\text{data}}(\cdot)$ be the data distribution over $\mathbb{R}^n$ and $p_{\text{pos}}(\cdot,\cdot)$ the distribution of positive pairs over $\mathbb{R}^n \times \mathbb{R}^n$. Based on empirical practices, we assume the following property.

Assumption. Distributions $p_{\text{data}}$ and $p_{\text{pos}}$ should satisfy
• Symmetry: $\forall x, y,\; p_{\text{pos}}(x, y) = p_{\text{pos}}(y, x)$.
• Matching marginal: $\forall x,\; \int p_{\text{pos}}(x, y)\, dy = p_{\text{data}}(x)$.

We consider the following specific and widely popular form of contrastive loss for training an encoder $f : \mathbb{R}^n \to S^{m-1}$, mapping data to ℓ2 normalized feature vectors of dimension $m$. This loss has been shown effective by many recent representation learning methods (Logeswaran & Lee, 2018; Wu et al., 2018; Tian et al., 2019; He et al., 2019; Hjelm et al., 2018; Bachman et al., 2019; Chen et al., 2020).

$$\mathcal{L}_{\text{contrastive}}(f; \tau, M) \triangleq \mathop{\mathbb{E}}_{\substack{(x,y) \sim p_{\text{pos}} \\ \{x_i^-\}_{i=1}^{M} \overset{\text{i.i.d.}}{\sim} p_{\text{data}}}} \left[ -\log \frac{e^{f(x)^\top f(y)/\tau}}{e^{f(x)^\top f(y)/\tau} + \sum_i e^{f(x_i^-)^\top f(y)/\tau}} \right], \tag{1}$$

where $\tau > 0$ is a scalar temperature hyperparameter, and $M \in \mathbb{Z}_+$ is a fixed number of negative samples.

The term contrastive loss has also been generally used to refer to various objectives based on positive and negative samples, e.g., in Siamese networks (Chopra et al., 2005; Hadsell et al., 2006). In this work, we focus on the specific form in Equation (1) that is widely used in modern unsupervised contrastive representation learning literature.

Necessity of normalization. Without the norm constraint, the softmax distribution can be made arbitrarily sharp by simply scaling all the features. Wang et al. (2017) provided an analysis on this effect and argued for the necessity of normalization when using feature vector dot products in a cross entropy loss, as is done in Eqn. (1). Experimentally, Chen et al. (2020) also showed that normalizing outputs leads to superior representations.

The InfoMax principle. Many empirical works are motivated by the InfoMax principle of maximizing $I(f(x); f(y))$ for $(x, y) \sim p_{\text{pos}}$ (Tian et al., 2019; Bachman et al., 2019; Wu et al., 2020). Usually they interpret $\mathcal{L}_{\text{contrastive}}$ in Eqn. (1) as a lower bound of $I(f(x); f(y))$ (Oord et al., 2018; Hjelm et al., 2018; Bachman et al., 2019; Tian et al., 2019). However, this interpretation is known to have issues in practice, e.g., maximizing a tighter bound often leads to worse downstream task performance (Tschannen et al., 2019). Therefore, instead of viewing it as a bound, we investigate the exact behavior of directly optimizing $\mathcal{L}_{\text{contrastive}}$ in the following sections.

4. Feature Distribution on the Hypersphere

The contrastive loss encourages learned feature representations for positive pairs to be similar, while pushing features from the randomly sampled negative pairs apart. Conventional wisdom says that representations should extract the most shared information between positive pairs and remain invariant to other noise factors (Linsker, 1988; Tian et al., 2019; Wu et al., 2020; Bachman et al., 2019). Therefore, the loss should prefer the following two properties:
• Alignment: two samples forming a positive pair should be mapped to nearby features, and thus be (mostly) invariant to unneeded noise factors.
• Uniformity: feature vectors should be roughly uniformly distributed on the unit hypersphere $S^{m-1}$, preserving as much information of the data as possible.

To empirically verify this, we visualize CIFAR-10 (Torralba et al., 2008; Krizhevsky et al., 2009) representations on $S^1$ ($m = 2$) obtained via three different methods:
• Random initialization.
• Supervised predictive learning: An encoder and a linear classifier are jointly trained from scratch with cross entropy loss on supervised labels.
• Unsupervised contrastive learning: An encoder is trained w.r.t. $\mathcal{L}_{\text{contrastive}}$ with $\tau = 0.5$ and $M = 256$.

All three encoders share the same AlexNet based architecture (Krizhevsky et al., 2012), modified to map input images to 2-dimensional vectors in $S^1$. Both predictive and contrastive learning use standard data augmentations to augment the dataset and sample positive pairs.

Figure 3 summarizes the resulting distributions of validation set features. Indeed, features from unsupervised contrastive learning (bottom in Figure 3) exhibit the most uniform distribution, and are closely clustered for positive pairs.

The form of the contrastive loss in Eqn. (1) also suggests this. We present informal arguments below, followed by more formal treatment in Section 4.2. From the symmetry
of $p$, we can derive

$$\mathcal{L}_{\text{contrastive}}(f; \tau, M) = \mathop{\mathbb{E}}_{(x,y) \sim p_{\text{pos}}} \left[ -f(x)^\top f(y)/\tau \right] + \mathop{\mathbb{E}}_{\substack{(x,y) \sim p_{\text{pos}} \\ \{x_i^-\}_{i=1}^{M} \overset{\text{i.i.d.}}{\sim} p_{\text{data}}}} \left[ \log \left( e^{f(x)^\top f(y)/\tau} + \sum_i e^{f(x_i^-)^\top f(x)/\tau} \right) \right].$$

Because the $\sum_i e^{f(x_i^-)^\top f(x)/\tau}$ term is always positive and bounded below, the loss favors smaller $\mathbb{E}\left[-f(x)^\top f(y)/\tau\right]$, i.e., having more aligned positive pair features. Suppose the encoder is perfectly aligned, i.e., $\mathbb{P}[f(x) = f(y)] = 1$, then minimizing the loss is equivalent to optimizing

$$\mathop{\mathbb{E}}_{\substack{x \sim p_{\text{data}} \\ \{x_i^-\}_{i=1}^{M} \overset{\text{i.i.d.}}{\sim} p_{\text{data}}}} \left[ \log \left( e^{1/\tau} + \sum_i e^{f(x_i^-)^\top f(x)/\tau} \right) \right],$$

which is akin to maximizing pairwise distances with a LogSumExp transformation. Intuitively, pushing all features away from each other should indeed cause them to be roughly uniformly distributed.

[Figure 3: Representations of CIFAR-10 validation set on $S^1$. Alignment analysis: We show the distribution of distance between features of positive pairs (two random augmentations). Uniformity analysis: We plot feature distributions with Gaussian kernel density estimation (KDE) in $\mathbb{R}^2$ and von Mises-Fisher (vMF) KDE on angles (i.e., arctan2(y, x) for each point $(x, y) \in S^1$). The four rightmost plots visualize feature distributions of selected specific classes (Classes 0, 3, 6, and 9). Representation from contrastive learning is both aligned (having low positive pair feature distances) and uniform (evenly distributed on $S^1$).
Panels, each showing a histogram of positive pair feature ℓ2 distances (with mean marked), the overall feature distribution, and the per-class distributions:
(a) Random Initialization. Linear classification validation accuracy: 12.71%.
(b) Supervised Predictive Learning. Linear classification validation accuracy: 57.19%.
(c) Unsupervised Contrastive Learning. Linear classification validation accuracy: 28.60%.]

4.1. Quantifying Alignment and Uniformity

For further analysis, we need a way to measure alignment and uniformity. We propose the following two metrics (losses).

4.1.1. ALIGNMENT

The alignment loss is straightforwardly defined with the expected distance between positive pairs:

$$\mathcal{L}_{\text{align}}(f; \alpha) \triangleq \mathop{\mathbb{E}}_{(x,y) \sim p_{\text{pos}}} \left[ \| f(x) - f(y) \|_2^{\alpha} \right], \quad \alpha > 0.$$
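
To make the alignment loss concrete, here is a minimal pure-Python sketch using the sign convention of Section 5 and Figure 5, where the loss is the plain expected distance. The feature values, the helper names (`normalize`, `align_loss`, `mean_dot`), and the temperature are made up for illustration. For unit vectors, $\|u - v\|_2^2 = 2 - 2u^\top v$, so with $\alpha = 2$ the alignment term $\mathbb{E}[-f(x)^\top f(y)/\tau]$ of the contrastive loss equals $(\mathcal{L}_{\text{align}} - 2)/(2\tau)$; the sketch checks this numerically.

```python
import math

def normalize(v):
    """Project a vector onto the unit sphere (l2 normalization)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def align_loss(xs, ys, alpha=2.0):
    """Mean of ||f(x) - f(y)||_2^alpha over positive pairs (xs[i], ys[i])."""
    return sum(math.dist(x, y) ** alpha for x, y in zip(xs, ys)) / len(xs)

def mean_dot(xs, ys):
    """Mean dot product f(x)^T f(y) over positive pairs."""
    return sum(sum(a * b for a, b in zip(x, y)) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical normalized features for three positive pairs (illustrative values).
xs = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [-1.0, 1.0])]
ys = [normalize(v) for v in ([0.9, 0.1], [0.2, 1.0], [-1.0, 0.8])]

tau = 0.5  # assumed temperature
l_align = align_loss(xs, ys, alpha=2.0)
alignment_term = -mean_dot(xs, ys) / tau      # E[-f(x)^T f(y) / tau]
via_identity = (l_align - 2.0) / (2.0 * tau)  # same quantity through L_align
print(l_align, alignment_term, via_identity)
```

A perfectly aligned encoder gives `align_loss(xs, xs) == 0.0`, the smallest possible value.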
[Figure 4: Average pairwise $G_2$ potential as a measure of uniformity. Each plot shows 10000 points distributed on $S^1$, obtained via either applying an encoder on CIFAR-10 validation set (same as those in Figure 3) or sampling from a distribution on $S^1$, as described in plot titles. We show the points with Gaussian KDE and the angles with vMF KDE.
Plot titles: "Random Initialization Samples", "Supervised Predictive Learning", "Unsupervised Contrastive Learning", "Uniform Distribution Samples", and "0.4·vMF([1, 0], κ = 10^3) + 0.6·vMF([0, −1], κ = 1)". The average $G_2$ values reported across the five plots are 0.8474, 0.3546, 0.2380, 0.2088, and 0.2070.]

4.1.2. UNIFORMITY

We want the uniformity metric to be both asymptotically correct (i.e., the distribution optimizing this metric should converge to the uniform distribution) and empirically reasonable with a finite number of points. To this end, we consider the Gaussian potential kernel (also known as the Radial Basis Function (RBF) kernel) $G_t : S^d \times S^d \to \mathbb{R}_+$ (Cohn & Kumar, 2007; Borodachov et al., 2019):

$$G_t(u, v) \triangleq e^{-t \|u - v\|_2^2} = e^{2t \cdot u^\top v - 2t}, \quad t > 0,$$

and define the uniformity loss as the logarithm of the average pairwise Gaussian potential:

$$\mathcal{L}_{\text{uniform}}(f; t) \triangleq \log \mathop{\mathbb{E}}_{x, y \overset{\text{i.i.d.}}{\sim} p_{\text{data}}} \left[ G_t(f(x), f(y)) \right], \quad t > 0,$$

where $t$ is a fixed parameter.

The average pairwise Gaussian potential is nicely tied with the uniform distribution on the unit hypersphere.

Definition (Uniform distribution on $S^d$). $\sigma_d$ denotes the normalized surface area measure on $S^d$.

First, we show that the uniform distribution is the unique distribution that minimizes the expected pairwise potential.

Proposition 1. For $\mathcal{M}(S^d)$ the set of Borel probability measures on $S^d$, $\sigma_d$ is the unique solution of

$$\min_{\mu \in \mathcal{M}(S^d)} \int_u \int_v G_t(u, v) \, d\mu \, d\mu.$$

Proof. See supplementary material.

In addition, as the number of points goes to infinity, distributions of points minimizing the average pairwise potential converge weak* to the uniform distribution. Recall the definition of the weak* convergence of measures.

Definition (Weak* convergence of measures). A sequence of Borel measures $\{\mu_n\}_{n=1}^{\infty}$ in $\mathbb{R}^p$ converges weak* to a Borel measure $\mu$ if for all continuous functions $f : \mathbb{R}^p \to \mathbb{R}$, we have

$$\lim_{n \to \infty} \int f(x) \, d\mu_n(x) = \int f(x) \, d\mu(x).$$

Proposition 2. For each $N > 0$, the $N$ point minimizer of the average pairwise potential is

$$u_N^* = \mathop{\arg\min}_{u_1, u_2, \ldots, u_N \in S^d} \sum_{1 \le i < j \le N} G_t(u_i, u_j).$$

The normalized counting measures associated with the $\{u_N^*\}_{N=1}^{\infty}$ sequence converge weak* to $\sigma_d$.

Proof. See supplementary material.

Designing an objective minimized by the uniform distribution is in fact nontrivial. For instance, average pairwise dot products or Euclidean distances are simply optimized by any distribution that has zero mean. Among kernels that achieve uniformity at optima, the Gaussian kernel is special in that it is closely related to the universally optimal point configurations and can also be used to represent a general class of other kernels, including the Riesz s-potentials. We refer readers to Borodachov et al. (2019) and Cohn & Kumar (2007) for in-depth discussions on these topics. Moreover, as we show below, $\mathcal{L}_{\text{uniform}}$, defined with the Gaussian kernel, has close connections with $\mathcal{L}_{\text{contrastive}}$.

Empirically, we evaluate the average pairwise potential of various finite point collections on $S^1$ in Figure 4. The values nicely align with our intuitive understanding of uniformity.

4.2. Limiting Behavior of Contrastive Learning

In this section, we formalize the intuition that contrastive learning optimizes alignment and uniformity, and characterize its asymptotic behavior. We consider optimization problems over all measurable encoder functions from the $p_{\text{data}}$ measure in $\mathbb{R}^n$ to the Borel space $S^{m-1}$.

We first define the notion of optimality for these metrics.

Definition (Perfect Alignment). We say an encoder $f$ is perfectly aligned if $f(x) = f(y)$ a.s. over $(x, y) \sim p_{\text{pos}}$.

Definition (Perfect Uniformity). We say an encoder $f$ is perfectly uniform if the distribution of $f(x)$ for $x \sim p_{\text{data}}$ is the uniform distribution $\sigma_{m-1}$ on $S^{m-1}$.
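
The uniformity loss of Section 4.1.2 can be sketched in a few lines of standard-library Python. The snippet below evaluates the logarithm of the average pairwise Gaussian potential (with t = 2) for three point sets on $S^1$. The point sets, the sample size, the random seed, and the function names are assumptions chosen for illustration; they are not the encoders or samples used in Figure 4.

```python
import math
import random

def gaussian_potential(u, v, t=2.0):
    """G_t(u, v) = exp(-t * ||u - v||_2^2)."""
    return math.exp(-t * math.dist(u, v) ** 2)

def uniform_loss(points, t=2.0):
    """Log of the average Gaussian potential over all distinct pairs i < j."""
    n = len(points)
    total = sum(gaussian_potential(points[i], points[j], t)
                for i in range(n) for j in range(i + 1, n))
    return math.log(total / (n * (n - 1) / 2))

def circle(angles):
    """Map angles (radians) to unit vectors on S^1."""
    return [(math.cos(a), math.sin(a)) for a in angles]

rng = random.Random(0)  # fixed seed, for reproducibility of this sketch only
n = 200
evenly_spaced = circle([2 * math.pi * k / n for k in range(n)])
half_circle = circle([rng.uniform(0.0, math.pi) for _ in range(n)])
collapsed = circle([rng.gauss(0.0, 0.05) for _ in range(n)])

loss_even = uniform_loss(evenly_spaced)
loss_half = uniform_loss(half_circle)
loss_collapsed = uniform_loss(collapsed)
print(loss_even, loss_half, loss_collapsed)
```

In this toy setting the loss orders the three sets as expected: lowest for the evenly spread points, higher for points confined to half of the circle, and close to zero (its upper bound) for nearly collapsed points.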
Realizability of perfect uniformity. We note that it is not always possible to achieve perfect uniformity, e.g., when the data manifold in $\mathbb{R}^n$ is lower dimensional than the feature space $S^{m-1}$. Moreover, in the case that $p_{\text{data}}$ and $p_{\text{pos}}$ are formed from sampling augmented samples from a finite dataset, there cannot be an encoder that is both perfectly aligned and perfectly uniform, because perfect alignment implies that all augmentations from a single element have the same feature vector. Nonetheless, perfectly uniform encoder functions do exist under the conditions that $n \ge m - 1$ and $p_{\text{data}}$ has bounded density.

We analyze the asymptotics with infinite negative samples. Existing empirical work has established that a larger number of negative samples consistently leads to better downstream task performance (Wu et al., 2018; Tian et al., 2019; He et al., 2019; Chen et al., 2020), and often uses very large values (e.g., $M = 65536$ in He et al. (2019)). The following theorem nicely confirms that optimizing w.r.t. the limiting loss indeed requires both alignment and uniformity.

Theorem 1 (Asymptotics of $\mathcal{L}_{\text{contrastive}}$). For fixed $\tau > 0$, as the number of negative samples $M \to \infty$, the (normalized) contrastive loss converges to

$$\lim_{M \to \infty} \mathcal{L}_{\text{contrastive}}(f; \tau, M) - \log M = -\frac{1}{\tau} \mathop{\mathbb{E}}_{(x,y) \sim p_{\text{pos}}} \left[ f(x)^\top f(y) \right] + \mathop{\mathbb{E}}_{x \sim p_{\text{data}}} \left[ \log \mathop{\mathbb{E}}_{x^- \sim p_{\text{data}}} \left[ e^{f(x^-)^\top f(x)/\tau} \right] \right]. \tag{2}$$

We have the following results:
1. The first term is minimized iff $f$ is perfectly aligned.
2. If perfectly uniform encoders exist, they form the exact minimizers of the second term.
3. For the convergence in Equation (2), the absolute deviation from the limit decays in $O(M^{-2/3})$.

Proof. See supplementary material.

Relation with $\mathcal{L}_{\text{uniform}}$. The proof of Theorem 1 in the supplementary material connects the asymptotic $\mathcal{L}_{\text{contrastive}}$ form with minimizing average pairwise Gaussian potential, i.e., minimizing $\mathcal{L}_{\text{uniform}}$. Compared with the second term of Equation (2), $\mathcal{L}_{\text{uniform}}$ essentially pushes the log outside the outer expectation, without changing the minimizer (perfectly uniform encoders). However, due to its pairwise nature, $\mathcal{L}_{\text{uniform}}$ is much simpler in form and avoids the computationally expensive softmax operation in $\mathcal{L}_{\text{contrastive}}$ (Goodman, 2001; Bengio et al.; Gutmann & Hyvärinen, 2010; Grave et al., 2017; Chen et al., 2018).

Relation with feature distribution entropy estimation. When $p_{\text{data}}$ is uniform over finite samples $\{x_1, x_2, \ldots, x_N\}$ (e.g., a collected dataset), the second term in Equation (2) can be alternatively viewed as a resubstitution entropy estimator of $f(x)$ (Ahmad & Lin, 1976), where $x$ follows the underlying distribution $p_{\text{nature}}$ that generates $\{x_i\}_{i=1}^{N}$, via a von Mises-Fisher (vMF) kernel density estimation (KDE):

$$\begin{aligned}
\mathop{\mathbb{E}}_{x \sim p_{\text{data}}} \left[ \log \mathop{\mathbb{E}}_{x^- \sim p_{\text{data}}} \left[ e^{f(x^-)^\top f(x)/\tau} \right] \right]
&= \frac{1}{N} \sum_{i=1}^{N} \log \left( \frac{1}{N} \sum_{j=1}^{N} e^{f(x_i)^\top f(x_j)/\tau} \right) \\
&= \frac{1}{N} \sum_{i=1}^{N} \log \hat{p}_{\text{vMF-KDE}}(f(x_i)) + \log Z_{\text{vMF}} \\
&\triangleq -\hat{H}(f(x)) + \log Z_{\text{vMF}}, \quad x \sim p_{\text{nature}} \\
&\triangleq -\hat{I}(x; f(x)) + \log Z_{\text{vMF}}, \quad x \sim p_{\text{nature}},
\end{aligned}$$

where
• $\hat{p}_{\text{vMF-KDE}}$ is the KDE based on samples $\{f(x_j)\}_{j=1}^{N}$ using a vMF kernel with $\kappa = \tau^{-1}$,
• $Z_{\text{vMF}}$ is the vMF normalization constant for $\kappa = \tau^{-1}$,
• $\hat{H}$ denotes the resubstitution entropy estimator,
• $\hat{I}$ denotes the mutual information estimator based on $\hat{H}$, since $f$ is a deterministic function.

Relation with the InfoMax principle. Many empirical works are motivated by the InfoMax principle, i.e., maximizing $I(f(x); f(y))$ for $(x, y) \sim p_{\text{pos}}$. However, the interpretation of $\mathcal{L}_{\text{contrastive}}$ as a lower bound of $I(f(x); f(y))$ is known to be inconsistent with its actual behavior in practice (Tschannen et al., 2019). Our results instead analyze the properties of $\mathcal{L}_{\text{contrastive}}$ itself. Considering the identity $I(f(x); f(y)) = H(f(x)) - H(f(x) \mid f(y))$, we can see that while uniformity indeed favors large $H(f(x))$, alignment is stronger than merely desiring small $H(f(x) \mid f(y))$. Instead, our above analysis suggests that $\mathcal{L}_{\text{contrastive}}$ optimizes for aligned and information-preserving encoders.

Finally, even for the case where only a single negative sample is used (i.e., $M = 1$), we can still prove a weaker result, which we describe in detail in the supplementary material.

5. Experiments

In this section, we empirically verify the hypothesis that alignment and uniformity are desired properties for representations. Recall that our two metrics are

$$\mathcal{L}_{\text{align}}(f; \alpha) \triangleq \mathop{\mathbb{E}}_{(x,y) \sim p_{\text{pos}}} \left[ \| f(x) - f(y) \|_2^{\alpha} \right]$$

$$\mathcal{L}_{\text{uniform}}(f; t) \triangleq \log \mathop{\mathbb{E}}_{x, y \overset{\text{i.i.d.}}{\sim} p_{\text{data}}} \left[ e^{-t \| f(x) - f(y) \|_2^2} \right].$$

We conduct extensive experiments with convolutional neural network (CNN) and recurrent neural network (RNN) based
encoders on four popular representation learning benchmarks with distinct types of downstream tasks:

• STL-10 (Coates et al., 2011) classification on AlexNet-based encoder outputs or intermediate activations with a linear or k-nearest neighbor (k-NN) classifier.
• NYU-Depth-V2 (Nathan Silberman & Fergus, 2012) depth prediction on CNN encoder intermediate activations after convolution layers.
• ImageNet-100 (100 randomly selected classes from ImageNet) classification on CNN encoder penultimate layer activations with a linear classifier.
• BookCorpus (Zhu et al., 2015) RNN sentence encoder outputs used for Movie Review Sentence Polarity (MR) (Pang & Lee, 2005) and Customer Product Review Sentiment (CR) (Wang & Manning, 2012) binary classification tasks with logistic classifiers.

For image datasets, we follow the standard practice and choose positive pairs as two independent augmentations of the same image. For BookCorpus, positive pairs are chosen as neighboring sentences, following Quick-Thought Vectors (Logeswaran & Lee, 2018).

We perform the majority of our analysis on STL-10 and NYU-Depth-V2 encoders, where we calculate $\mathcal{L}_{\text{contrastive}}$ with negatives being other samples within the minibatch following the standard practice (Hjelm et al., 2018; Bachman et al., 2019; Tian et al., 2019; Chen et al., 2020), and $\mathcal{L}_{\text{uniform}}$ as the logarithm of average pairwise feature potentials also within the minibatch. Due to their simple forms, these two losses can be implemented in PyTorch (Paszke et al., 2019) with less than 10 lines of code, as shown in Figure 5.

    import torch

    # bsz : batch size (number of positive pairs)
    # d   : latent dim
    # x   : Tensor, shape=[bsz, d]
    #       latents for one side of positive pairs
    # y   : Tensor, shape=[bsz, d]
    #       latents for the other side of positive pairs
    # lam : hyperparameter balancing the two losses

    def lalign(x, y, alpha=2):
        return (x - y).norm(dim=1).pow(alpha).mean()

    def lunif(x, t=2):
        sq_pdist = torch.pdist(x, p=2).pow(2)
        return sq_pdist.mul(-t).exp().mean().log()

    loss = lalign(x, y) + lam * (lunif(x) + lunif(y)) / 2

Figure 5: PyTorch implementation of $\mathcal{L}_{\text{align}}$ and $\mathcal{L}_{\text{uniform}}$.

To investigate alignment and uniformity properties on recent contrastive representation learning variants and larger datasets, we also analyze ImageNet-100 encoders trained with Momentum Contrast (MoCo) (He et al., 2019) and BookCorpus encoders trained with Quick-Thought Vectors (Logeswaran & Lee, 2018), with these methods modified to also allow $\mathcal{L}_{\text{align}}$ and $\mathcal{L}_{\text{uniform}}$.

We optimize a total of 306 STL-10 encoders, 64 NYU-Depth-V2 encoders, 45 ImageNet-100 encoders, and 108 BookCorpus encoders without supervision. The encoders are optimized w.r.t. weighted combinations of $\mathcal{L}_{\text{contrastive}}$, $\mathcal{L}_{\text{align}}$, and/or $\mathcal{L}_{\text{uniform}}$, with varying
• (possibly zero) weights on the three losses,
• loss hyperparameters: $\tau$ for $\mathcal{L}_{\text{contrastive}}$, $\alpha$ for $\mathcal{L}_{\text{align}}$, and $t$ for $\mathcal{L}_{\text{uniform}}$,
• batch size (affecting the number of (negative) pairs for $\mathcal{L}_{\text{contrastive}}$ and $\mathcal{L}_{\text{uniform}}$),
• embedding dimension,
• number of training epochs and learning rate,
• initialization (from scratch vs. a pretrained encoder).

See the supplementary material for more experiment details and the exact configurations used.

Both $\mathcal{L}_{\text{align}}$ and $\mathcal{L}_{\text{uniform}}$ strongly agree with downstream task performance. For each encoder, we measure the downstream task performance, and the $\mathcal{L}_{\text{align}}$, $\mathcal{L}_{\text{uniform}}$ metrics on the validation set. Figure 6 visualizes the trends between both metrics and representation quality. We observe that the two metrics strongly agree with the representation quality overall. In particular, the best performing encoders are exactly the ones with low $\mathcal{L}_{\text{align}}$ and $\mathcal{L}_{\text{uniform}}$, i.e., the lower left corners in Figure 6. In the supplementary material, we observe that as long as the ratio between weights on $\mathcal{L}_{\text{align}}$ and $\mathcal{L}_{\text{uniform}}$ is not too large (e.g., < 4), the representation quality remains relatively good and insensitive to the exact weight choices.

Directly optimizing only $\mathcal{L}_{\text{align}}$ and $\mathcal{L}_{\text{uniform}}$ can lead to better representations. As shown in Table 1, encoders trained with only $\mathcal{L}_{\text{align}}$ and $\mathcal{L}_{\text{uniform}}$ outperform their $\mathcal{L}_{\text{contrastive}}$-trained counterparts on both tasks in every reported setting except NYU-Depth-V2 conv4 MSE (0.7592 vs. 0.7575). Theoretically, Theorem 1 showed that $\mathcal{L}_{\text{contrastive}}$ optimizes alignment and uniformity asymptotically with infinite negative samples. This empirical performance gap suggests that directly optimizing these properties can be superior in practice, when we can only have finite negatives.

$\mathcal{L}_{\text{align}}$ and $\mathcal{L}_{\text{uniform}}$ causally affect downstream task performance. We take an encoder trained with $\mathcal{L}_{\text{contrastive}}$ using a suboptimal temperature $\tau = 2.5$, and finetune it according to $\mathcal{L}_{\text{align}}$ and/or $\mathcal{L}_{\text{uniform}}$. Figure 7 visualizes the finetuning trajectories. When only one of alignment and uniformity is optimized, the corresponding metric improves, but both the other metric and performance degrade. However, when both properties are optimized, the representation quality steadily increases. These trends confirm the causal effect of alignment and uniformity on the representation quality, and suggest that directly optimizing them can be a reasonable choice.
Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
[Figure 6 plots: three scatter panels titled "Linear Classification on Outputs", "5-NN Classification on fc7", and "Depth Prediction on conv5". Horizontal axis: Luniform (t = 2); vertical axis: Lalign (α = 2); point color: validation accuracy (first two panels) or validation MSE (third panel). Legend: contrastive only; align, uniform only; all three mixed.]

(a) 306 STL-10 encoders are evaluated with linear classification on output features and 5-nearest neighbor (5-NN) on fc7 activations. Higher accuracy (blue color) is better.

(b) 64 NYU-DEPTH-V2 encoders are evaluated with CNN depth regressors on conv5 activations. Lower MSE (blue color) is better.

Figure 6: Metrics and performance of STL-10 and NYU-DEPTH-V2 experiments. Each point represents a trained encoder, with its x- and y-coordinates showing the Luniform and Lalign metrics, respectively, and color showing the performance on the validation set. Blue is better for both tasks. Encoders with low Lalign and Luniform are consistently the better performing ones (lower left corners).

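| Encoder | STL-10 val. accuracy ↑: Output + Linear | STL-10 val. accuracy ↑: Output + 5-NN | STL-10 val. accuracy ↑: fc7 + Linear | STL-10 val. accuracy ↑: fc7 + 5-NN | NYU-DEPTH-V2 val. MSE ↓: conv5 | NYU-DEPTH-V2 val. MSE ↓: conv4 |
|---|---|---|---|---|---|---|
| Best Lcontrastive only | 80.46% | 78.75% | 83.89% | 76.33% | 0.7024 | 0.7575 |
| Best Lalign and Luniform only | 81.15% | 78.89% | 84.43% | 76.78% | 0.7014 | 0.7592 |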

Table 1: Encoder evaluations. STL-10: Numbers show linear and 5-nearest neighbor (5-NN) classification accuracies. The best result is picked based on the accuracy of a linear classifier on encoder outputs from a 5-fold training set cross validation, among all 150 encoders trained from scratch with 128-dimensional output and 768 batch size. NYU-DEPTH-V2: Numbers show depth prediction mean squared error (MSE). The best result is picked based on conv5 layer MSE from a 5-fold training set cross validation, among all 64 encoders trained from scratch with 128-dimensional output and 128 batch size.
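To make the 5-NN evaluation protocol concrete, here is a minimal sketch of a k-nearest-neighbor classifier over fixed encoder features, written in plain Python. The Euclidean distance, the majority-vote rule, and the names `knn_predict` and `knn_accuracy` are assumptions for illustration; the source does not specify these details.

```python
import math
from collections import Counter


def knn_predict(train_feats, train_labels, query, k=5):
    """Majority vote among the k training features closest to `query`."""
    nearest = sorted(
        range(len(train_feats)),
        key=lambda i: math.dist(train_feats[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def knn_accuracy(train_feats, train_labels, val_feats, val_labels, k=5):
    """Fraction of validation features whose predicted label is correct."""
    hits = sum(
        knn_predict(train_feats, train_labels, f, k) == y
        for f, y in zip(val_feats, val_labels)
    )
    return hits / len(val_feats)


# Toy one-dimensional features forming two well-separated classes.
train_feats = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
train_labels = ["a", "a", "a", "b", "b", "b"]
print(knn_accuracy(train_feats, train_labels, [[0.05], [1.15]], ["a", "b"], k=3))
```

Because the encoder is frozen and no classifier is trained, this kind of evaluation probes the geometry of the feature space directly.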
[Figure 7 plots: three panels titled "Finetune with 0.0025 ⋅ align", "Finetune with 0.0005 ⋅ uniform", and "Finetune with 0.025 ⋅ align + 0.025 ⋅ uniform". Horizontal axis: finetune epochs (0 to 12). Curves: uniform (t = 2, exponentiated), align (α = 2), and validation accuracy.]

Figure 7: Finetuning trajectories from an STL-10 encoder trained with Lcontrastive using a suboptimal temperature τ = 2.5. Finetuning objectives are weighted combinations of Lalign (α = 2) and Luniform (t = 2). For each intermediate checkpoint, we measure Lalign and Luniform metrics, as well as validation accuracy of a linear classifier trained from scratch on the encoder outputs. Luniform is exponentiated for plotting purposes. Left and middle: Performance degrades if only one of alignment and uniformity is optimized. Right: Performance improves when both are optimized.

Alignment and uniformity also matter in other contrastive representation learning variants. MoCo (He et al., 2019) and Quick-Thought Vectors (Logeswaran & Lee, 2018) are contrastive representation learning variants that have nontrivial differences from directly optimizing Lcontrastive in Equation (1). MoCo introduces a memory queue and a momentum encoder. Quick-Thought Vectors uses two different encoders to encode each sentence in a positive pair, only normalizes encoder outputs during evaluation, and does not use random sampling to obtain minibatches. After modifying them to also allow Lalign and Luniform, we train these methods on IMAGENET-100 and BOOKCORPUS, respectively. Figure 8 shows that Lalign and Luniform metrics are still correlated with the downstream task performances. Table 2 shows that directly optimizing them also leads to comparable or better representation quality. These results suggest that alignment and uniformity are indeed desirable properties for representations, for both image and text modalities, and are likely connected with general contrastive representation learning methods.
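The two MoCo ingredients named above, a memory queue of past keys and a momentum-updated key encoder, can be sketched in a few lines of plain Python. This is an illustrative toy and not the MoCo code base: parameters and features are plain lists rather than tensors, and the names `contrastive_loss`, `momentum_update`, and `KeyQueue`, as well as the default values of `tau` and `m`, are assumptions.

```python
import math
from collections import deque


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(query, positive_key, negative_keys, tau=0.07):
    """-log softmax probability of the positive key among all keys."""
    logits = [dot(query, positive_key) / tau]
    logits += [dot(query, k) / tau for k in negative_keys]
    peak = max(logits)  # subtract the max for a stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[0]


def momentum_update(key_params, query_params, m=0.999):
    """Key encoder parameters track the query encoder as a moving average."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class KeyQueue:
    """Fixed-size FIFO of past keys that serve as negatives."""

    def __init__(self, max_size):
        self._keys = deque(maxlen=max_size)

    def enqueue(self, keys):
        self._keys.extend(keys)  # oldest keys fall off the front

    def negatives(self):
        return list(self._keys)


queue = KeyQueue(max_size=3)
queue.enqueue([[0.0, 1.0], [-1.0, 0.0]])
loss = contrastive_loss([1.0, 0.0], [1.0, 0.0], queue.negatives(), tau=0.5)
print(loss)
```

In this setup the negatives come from earlier minibatches rather than from the current one, which is one of the differences from directly optimizing Lcontrastive with in-batch negatives.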
[Figure 8 plots: three scatter panels titled "Linear Classification on Penultimate Layer", "Movie Review Classification on Outputs", and "Customer Review Classification on Outputs". Horizontal axis: Luniform (t = 2); vertical axis: Lalign (α = 2); point color: validation accuracy. Legend: contrastive only; align, uniform only; all three mixed.]

(a) 45 IMAGENET-100 encoders are trained with MoCo-based methods, and evaluated with linear classification.

(b) 108 BOOKCORPUS encoders are trained with Quick-Thought-Vectors-based methods, and evaluated with logistic binary classification on Movie Review Sentence Polarity (MR) and Customer Product Review Sentiment (CR) tasks.

Figure 8: Metrics and performance of IMAGENET-100 and BOOKCORPUS experiments. Each point represents a trained encoder, with its x- and y-coordinates showing the Luniform and Lalign metrics, respectively, and color showing the validation accuracy. Blue is better. Encoders with low Lalign and Luniform consistently perform well (lower left corners), even though the training methods (based on MoCo and Quick-Thought Vectors) are different from directly optimizing the contrastive loss in Equation (1).

| Encoder | IMAGENET-100 MoCo-based: top1 val. accuracy ↑ | IMAGENET-100 MoCo-based: top5 val. accuracy ↑ | BOOKCORPUS Quick-Thought-Vectors-based: MR val. accuracy ↑ | BOOKCORPUS Quick-Thought-Vectors-based: CR val. accuracy ↑ |
|---|---|---|---|---|
| Best Lcontrastive only | 72.80% | 91.64% | 77.51% | 83.86% |
| Best Lalign and Luniform only | 74.60% | 92.74% | 73.76% | 80.95% |

Table 2: Encoder evaluations. IMAGENET-100: Numbers show linear classifier accuracies on encoder penultimate layer activations. The best result is picked based on top1 accuracy from a 3-fold training set cross validation, among all 45 encoders trained from scratch with 128-dimensional output and 128 batch size. BOOKCORPUS: Numbers show Movie Review Sentence Polarity (MR) and Customer Product Review Sentiment (CR) classification accuracies of logistic classifiers fit on encoder outputs. The best result is picked based on accuracy from a 5-fold training set cross validation, individually for MR and CR, among all 108 encoders trained from scratch with 1200-dimensional output and 400 batch size.

6. Discussion

Alignment and uniformity are often alluded to as motivations for representation learning methods (see Figure 1). However, a thorough understanding of these properties is lacking in the literature.

Are they in fact related to the representation learning methods? Do they actually agree with the representation quality (measured by downstream task performance)?

In this work, we have presented a detailed investigation of the relation between these properties and the popular paradigm of contrastive representation learning. Through theoretical analysis and extensive experiments, we are able to relate the contrastive loss to the alignment and uniformity properties, and confirm their strong connection with downstream task performances. Remarkably, we have revealed that directly optimizing our proposed metrics often leads to representations of better quality.

Below we summarize several suggestions for future work.

Niceness of the unit hypersphere. Our analysis was based on the empirical observation that representations are often ℓ2 normalized. Existing works have motivated this choice from a manifold mapping perspective (Liu et al., 2017; Davidson et al., 2018) and computation stability (Xu & Durrett, 2018; Wang et al., 2017). However, to the best of our knowledge, the question of why the unit hypersphere is a nice feature space is not yet rigorously answered. One possible direction is to formalize the intuition that connected sets with smooth boundaries are nearly linearly separable in the hyperspherical geometry (see Figure 2), since linear separability is one of the most widely used criteria for representation quality and is related to the notion of disentanglement (Higgins et al., 2018).

Beyond contrastive learning. Our analysis focused on the relationship between contrastive learning and the alignment and uniformity properties on the unit hypersphere. However, the ubiquitous presence of ℓ2 normalization in the representation learning literature suggests that the connection may be more general. In fact, several existing empirical methods are directly related to uniformity on the hypersphere (Bojanowski & Joulin, 2017; Davidson et al., 2018; Xu & Durrett, 2018). We believe that relating a broader class of representations to uniformity and/or alignment on the hypersphere will provide novel insights and lead to better empirical algorithms.

Acknowledgements

We thank Philip Bachman, Ching-Yao Chuang, Justin Solomon, Yonglong Tian, and Zhenyang Zhang for many helpful comments and suggestions. Tongzhou Wang was supported by the MIT EECS Merrill Lynch Graduate Fellowship.

References

Ahmad, I. and Lin, P.-E. A nonparametric estimation of the entropy for absolutely continuous distributions (corresp.). IEEE Transactions on Information Theory, 22(3):372–375, 1976.

Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15509–15519, 2019.

Bengio, Y. et al. Quick training of probabilistic neural nets by importance sampling.

Bojanowski, P. and Joulin, A. Unsupervised learning by predicting noise. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 517–526. JMLR.org, 2017.

Borodachov, S. V., Hardin, D. P., and Saff, E. B. Discrete energy on rectifiable sets. Springer, 2019.

Chen, P. H., Si, S., Kumar, S., Li, Y., and Hsieh, C.-J. Learning to screen for fast softmax inference on large vocabulary neural networks. 2018.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pp. 539–546. IEEE, 2005.

Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223, 2011.

Cohn, H. and Kumar, A. Universally optimal distribution of points on spheres. Journal of the American Mathematical Society, 20(1):99–148, 2007.

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. Hyperspherical variational auto-encoders. 34th Conference on Uncertainty in Artificial Intelligence (UAI-18), 2018.

Goodman, J. Classes for fast maximum entropy training. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 1, pp. 561–564. IEEE, 2001.

Götz, M. and Saff, E. B. Note on d-extremal configurations for the sphere in R^{d+1}. In Recent Progress in Multivariate Approximation, pp. 159–162. Springer, 2001.

Grave, E., Joulin, A., Cissé, M., Jégou, H., et al. Efficient softmax approximation for gpus. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1302–1310. JMLR.org, 2017.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.

Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pp. 1735–1742. IEEE, 2006.

Hardin, D. and Saff, E. Minimal riesz energy point configurations for rectifiable d-dimensional manifolds. Advances in Mathematics, 193(1):174–204, 2005.

Hasnat, M., Bohné, J., Milgram, J., Gentric, S., Chen, L., et al. von mises-fisher mixture model-based deep learning: Application to face verification. arXiv preprint arXiv:1706.04264, 2017.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.

Hénaff, O. J., Razavi, A., Doersch, C., Eslami, S., and Oord, A. v. d. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.

Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L., Rezende, D., and Lerchner, A. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

Hou, S., Pan, X., Loy, C. C., Wang, Z., and Lin, D. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 831–839, 2019.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Landkof, N. S. Foundations of modern potential theory, volume 180. Springer, 1972.

Linsker, R. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.

Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., and Song, L. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 212–220, 2017.

Liu, W., Lin, R., Liu, Z., Liu, L., Yu, Z., Dai, B., and Song, L. Learning towards minimum hyperspherical energy. In Advances in Neural Information Processing Systems, pp. 6222–6233. 2018.

Logeswaran, L. and Lee, H. An efficient framework for learning sentence representations. In International Conference on Learning Representations, 2018.

Mettes, P., van der Pol, E., and Snoek, C. Hyperspherical prototype networks. In Advances in Neural Information Processing Systems, pp. 1485–1495, 2019.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Pang, B. and Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on association for computational linguistics, pp. 115–124. Association for Computational Linguistics, 2005.

Parkhi, O. M., Vedaldi, A., and Zisserman, A. Deep face recognition. 2015.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8026–8037. 2019.

Saunshi, N., Plevrakis, O., Arora, S., Khodak, M., and Khandeparkar, H. A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning, pp. 5628–5637, 2019.

Schroff, F., Kalenichenko, D., and Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, 2015.

Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.

Tammes, P. M. L. On the origin of number and arrangement of the places of exit on the surface of pollen-grains. Recueil des travaux botaniques néerlandais, 27(1):1–84, 1930.

Thomson, J. J. XXIV. On the structure of the atom: an investigation of the stability and periods of oscillation of a number of corpuscles arranged at equal intervals around the circumference of a circle; with application of the results to the theory of atomic structure. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 7(39):237–265, 1904.

Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.

Torralba, A., Fergus, R., and Freeman, W. T. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence, 30(11):1958–1970, 2008.

Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.

Wang, F., Xiang, X., Cheng, J., and Yuille, A. L. Normface: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1041–1049, 2017.

Wang, S. and Manning, C. D. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th annual meeting of the association for computational linguistics: Short papers-volume 2, pp. 90–94. Association for Computational Linguistics, 2012.

Wu, M., Zhuang, C., Yamins, D., and Goodman, N. On the importance of views in unsupervised representation learning. 2020.

Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742, 2018.

Xu, J. and Durrett, G. Spherical latent spaces for stable variational autoencoders. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4503–4513, 2018.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724, 2015.
