
Understanding Deep Contrastive Learning via Coordinate-wise Optimization

Yuandong Tian
Meta AI (FAIR)
[email protected]
arXiv:2201.12680v7 [cs.LG] 20 Nov 2022

Abstract
We show that Contrastive Learning (CL) under a broad family of loss functions
(including InfoNCE) has a unified formulation of coordinate-wise optimization on
the network parameter θ and pairwise importance α, where the max player θ learns
representation for contrastiveness, and the min player α puts more weights on pairs
of distinct samples that share similar representations. The resulting formulation,
called α-CL, unifies not only various existing contrastive losses, which differ by
how sample-pair importance α is constructed, but also is able to extrapolate to give
novel contrastive losses beyond popular ones, opening a new avenue of contrastive
loss design. These novel losses yield performance comparable to (or better than) classic
InfoNCE on CIFAR10, STL-10 and CIFAR-100. Furthermore, we analyze the max player
in detail: we prove that with fixed α, the max player is equivalent to Principal Component
Analysis (PCA) for deep linear networks, and almost all local minima are global and rank-1,
recovering optimal PCA solutions. Finally, we extend our analysis of the max player to
2-layer ReLU networks, showing that its fixed points can have higher ranks. Code is available¹.

1 Introduction
While contrastive self-supervised learning has been shown to learn good features (Chen et al., 2020;
He et al., 2020; Oord et al., 2018) that are in many cases comparable with features learned from
supervised learning, it remains an open problem what features it actually learns, in particular when deep
nonlinear networks are used. Theory on this question is quite sparse, mostly focusing on the loss function (Arora
et al., 2019) and treating the network as a black-box function approximator.
In this paper, we present a novel perspective on contrastive learning (CL) for a broad family of
contrastive loss functions L(θ): minimizing L(θ) corresponds to a coordinate-wise optimization
procedure on an objective E_α(θ) − R(α) with respect to the network parameter θ and the pairwise importance
α on batch samples, where E_α(θ) is an energy function and R(α) is a regularizer, both associated
with the original contrastive loss L. In this view, the max player θ learns a representation to maximize
the contrastiveness of different samples and keep different augmentation views of the same sample
similar, while the min player α puts more weight on pairs of different samples that appear similar
in the representation space, subject to regularization. Empirically, this formulation, named Pair-
weighed Contrastive Learning (α-CL), when coupled with various regularization terms, yields novel
contrastive losses that show comparable (or better) performance on CIFAR10 (Krizhevsky et al., 2009)
and STL-10 (Coates et al., 2011).
We then focus on the behavior of the max player, which performs representation learning by maximizing
the energy function E_α(θ). When the underlying network is deep linear, we show that max_θ E_α(θ) is
the loss function (under re-parameterization) of Principal Component Analysis (PCA) (Wold et al.,
1987), a century-old unsupervised dimension reduction method. To further show that they are equivalent,
we prove that the nonlinear training dynamics of CL with a linear multi-layer feedforward network
¹ https://github.com/facebookresearch/luckmatters/tree/main/ssl/real-dataset

36th Conference on Neural Information Processing Systems (NeurIPS 2022).


Contrastive Loss                                   φ(x)            ψ(x)
InfoNCE (Oord et al., 2018)                        τ log(ε + x)    e^{x/τ}
MINE (Belghazi et al., 2018)                       log(x)          e^x
Triplet (Schroff et al., 2015)                     x               [x + ε]_+
Soft Triplet (Tian et al., 2020c)                  τ log(1 + x)    e^{x/τ + ε}
N+1 Tuplet (Sohn, 2016)                            log(1 + x)      e^x
Lifted Structured (Oh Song et al., 2016)           [log(x)]²_+     e^{x + ε}
Modified Triplet Eqn. 10 (Coria et al., 2020)      x               sigmoid(cx)
Triplet Contrastive Eqn. 2 (Ji et al., 2021)       linear          linear

Figure 1: Problem Setting. Left: Data points (the i-th sample x[i] and its augmented version x[i'], and the j-th sample
x[j]) are sent to networks with weights θ, yielding outputs z[i], z[i'] and z[j]. From the outputs z, we compute the
pairwise squared distance d_ij² between z[i] and z[j] and the intra-class squared distance d_i² between z[i] and z[i']
for contrastive learning with a general family of contrastive losses L_{φ,ψ} (Eqn. 1). Right: Different existing loss
functions correspond to different monotonically increasing functions φ and ψ. Here [x]_+ := max(x, 0).

(MLP) enjoys nice properties: with proper weight normalization, almost all its local optima are
global, achieving optimal PCA objective, and are rank-1. The only difference here is that the data
augmentation provides negative eigen-directions to avoid.
Furthermore, we extend our analysis to 2-layer ReLU networks, to explore the difference between
the rank-1 PCA solution and the solution learned by a nonlinear network. Assuming the data follow
an orthogonal mixture model, the 2-layer ReLU network enjoys dynamics similar to the linear one,
except for a special sticky weight rule that keeps the lower-layer weights non-negative and holds any
weight at zero once it reaches zero. In the case of one hidden node, we prove that the ReLU solution always
picks a single mode from the mixture. In the case of multiple hidden nodes, the resulting solution is
not necessarily rank-1.

2 Related Work

Contrastive learning. While many contrastive learning techniques (e.g., SimCLR (Chen et al.,
2020), MoCo (He et al., 2020), PIRL (Misra & Maaten, 2020), SwAV (Caron et al., 2020), Deep-
Cluster (Caron et al., 2018), Barlow Twins (Zbontar et al., 2021), InstDis (Wu et al., 2018), etc.) have
been proposed empirically and are able to learn good representations for downstream tasks, theoretical
study is relatively sparse, mostly focusing on the loss function itself (Tian et al., 2020b; HaoChen et al.,
2021; Arora et al., 2019), e.g., the relationship of loss functions with mutual information (MI). To
our knowledge, there is no analysis that combines the properties of the neural network with those of the loss
function.
Theoretical analysis of deep networks. Many works focus on the analysis of deep linear networks in
the supervised setting, where labels are given. (Baldi & Hornik, 1989; Zhou & Liang, 2018; Kawaguchi,
2016) analyze the critical points of linear networks. (Saxe et al., 2014; Arora et al., 2018) also analyze
the training dynamics. On the other hand, analyzing nonlinear networks has been a difficult task.
Existing works mostly lie in supervised learning, e.g., the teacher-student setting (Tian, 2020; Allen-Zhu
et al., 2018) and landscape analysis (Safran & Shamir, 2018). For contrastive learning, recent work (Wen & Li,
2021) analyzes the dynamics of 1-layer ReLU networks with a specific weight structure, and (Jing
et al., 2022) analyzes the collapsing behavior of 2-layer linear networks for CL. To the best of our knowledge,
there is no such analysis of deep networks (> 2 layers, linear or nonlinear) in the context of CL.
Connection between Principal Component Analysis (PCA) and Self-supervised Learning. (Lee
et al., 2021) establishes a statistical connection between non-linear Canonical Correlation Analysis
(CCA) and SimSiam (Chen & He, 2020) for any zero-mean encoder, without considering the
aspect of training dynamics. In contrast, we reformulate contrastive learning as a coordinate-wise
optimization procedure with min/max players, in which the max player is a reparameterization of
PCA optimized with gradient descent, and analyze its training dynamics in the presence of specific
neural architectures.

3 Contrastive Learning as Coordinate-wise Optimization
Notation. Suppose we have N pairs of samples {x[i]}_{i=1}^N and {x[i']}_{i=1}^N. Both x[i] and x[i'] are
augmented samples from sample i, and x represents the input batch. These samples are sent to
neural networks, and z[i] and z[i'] are their outputs. The goal of contrastive learning (CL) is to
find the representation that maximizes the squared distance d_ij² := ‖z[i] − z[j]‖²₂/2 between distinct
samples i and j, and minimizes the squared distance d_i² := ‖z[i] − z[i']‖²₂/2 between different data
augmentations x[i] and x[i'] of the same sample i.

3.1 A general family of contrastive loss


We consider minimizing a general family of loss functions L_{φ,ψ}, where φ and ψ are monotonically
increasing and differentiable scalar functions (define ξ_i := Σ_{j≠i} ψ(d_i² − d_ij²) for notation brevity):

    min_θ L_{φ,ψ}(θ) := Σ_{i=1}^N φ(ξ_i) = Σ_{i=1}^N φ( Σ_{j≠i} ψ(d_i² − d_ij²) )        (1)

Both i and j run from 1 to N. With different φ and ψ, Eqn. 1 covers many loss functions (Tbl. 1).
In particular, setting φ(x) = τ log(ε + x) and ψ(x) = exp(x/τ) gives a generalized version of the
InfoNCE loss (Oord et al., 2018):

    L_nce := −τ Σ_{i=1}^N log [ exp(−d_i²/τ) / ( ε exp(−d_i²/τ) + Σ_{j≠i} exp(−d_ij²/τ) ) ]
           =  τ Σ_{i=1}^N log [ ε + Σ_{j≠i} e^{(d_i² − d_ij²)/τ} ]        (2)

where ε > 0 is some constant not related to z[i] and z[i']. ε = 1 has been used in many works (He
et al., 2020; Tian et al., 2020a). Setting ε = 0 yields the SimCLR setting (Chen et al., 2020), where the
denominator does not contain exp(−d_i²/τ). This is also used in (Yeh et al., 2021).
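To make Eqn. 2 concrete, here is a minimal PyTorch sketch (our illustration, not the authors' released code) that computes the generalized InfoNCE loss from a batch of embeddings z and their augmented views z'; the function name and its default arguments are assumptions made for this example.

```python
import torch

def generalized_infonce(z, z_prime, tau=0.5, eps=1.0):
    """Generalized InfoNCE (Eqn. 2); eps = 0 recovers the SimCLR-style denominator."""
    # Intra-sample squared distances d_i^2 between the two augmentations of sample i.
    d_intra = ((z - z_prime) ** 2).sum(dim=1) / 2                      # shape (N,)
    # Pairwise squared distances d_ij^2 between distinct samples.
    d_pair = ((z.unsqueeze(1) - z.unsqueeze(0)) ** 2).sum(dim=2) / 2   # shape (N, N)
    N = z.shape[0]
    off_diag = ~torch.eye(N, dtype=torch.bool, device=z.device)
    # xi_i = sum_{j != i} exp((d_i^2 - d_ij^2) / tau)
    xi = (torch.exp((d_intra.unsqueeze(1) - d_pair) / tau) * off_diag).sum(dim=1)
    return tau * torch.log(eps + xi).sum()
```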

3.2 The other side of gradient descent of contrastive loss

To minimize L_{φ,ψ}, gradient descent follows its negative gradient direction. As a first discovery of
this work, it turns out that the gradient descent direction of the loss function L is the gradient ascent direction
of another energy function E_α:
Theorem 1. For any differentiable mapping z = z(x; θ), gradient descent of L_{φ,ψ} is equivalent to
gradient ascent of the objective E_α(θ) := ½ tr(C_α[z(θ), z(θ)]):

    ∂L_{φ,ψ}/∂θ = − ∂E_α/∂θ |_{α=α(θ)}        (3)

Here the pairwise importance α = α(θ) := {α_ij(θ)} is a function of the input batch x, defined as:

    α_ij(θ) := φ'(ξ_i) ψ'(d_i² − d_ij²) ≥ 0        (4)

where φ', ψ' ≥ 0 are the derivatives of φ, ψ. The contrastive covariance C_α[·, ·] is defined as:

    C_α[a, b] := Σ_{i=1}^N Σ_{j≠i} α_ij (a[i] − a[j])(b[i] − b[j])^T − Σ_{i=1}^N ( Σ_{j≠i} α_ij ) (a[i] − a[i'])(b[i] − b[i'])^T        (5)

That is, minimizing the loss function Lφ,ψ (θ) can be regarded as maximizing the energy function
Eα=sg(α(θ)) (θ) with respect to θ. Here sg(·) means stop-gradient, i.e., the gradient of θ is not
backpropagated into α(θ).

Please check the Supplementary Materials (SM) for all proofs. From the definition of the energy E_α(θ), it
is clear that α_ij determines the importance of each sample pair x[i] and x[j]. For an (i, j)-pair that
"deserves attention", α_ij is large, so that it plays a large role in the contrastive covariance term. In
particular, for the InfoNCE loss with ε = 0, the pairwise importance α takes the following form:

    α_ij = exp(−d_ij²/τ) / Σ_{j≠i} exp(−d_ij²/τ) > 0        (6)

which means that InfoNCE focuses on (i, j)-pairs with small squared distance d_ij². If both φ and ψ
are linear, then α_ij = const and L is a simple subtraction of positive/negative squared distances.
From Thm. 1, an important observation is that when propagating gradient w.r.t. θ using the objective
Eα (θ) during the backward pass, the gradient does not propagate into α(θ), even if α(θ) is a function
of θ in the forward pass. In fact, in Sec. 6 we show that propagating gradient through α(θ) yields
worse empirical performance. This suggests that α should be treated as an independent variable when
optimizing θ. It turns out that if ψ(x) is an exponential function (as in most cases of Tbl. 1), this is
indeed true and α can be determined by a separate optimization procedure:
Theorem 2. If ψ(x) = e^{x/τ}, then the corresponding pairwise importance α (Eqn. 4) is the solution
to the minimization problem:

    α(θ) = argmin_{α∈A} E_α(θ) − R(α),    A := { α : ∀i, Σ_{j≠i} α_ij = τ^{-1} ξ_i φ'(ξ_i), α_ij ≥ 0 }        (7)

Here the regularization is R(α) = R_H(α) := τ Σ_{i=1}^N H(α_i·) = −τ Σ_{i=1}^N Σ_{j≠i} α_ij log α_ij.
P
For InfoNCE, the feasible set A becomes {α : α ≥ 0, Σ_{j≠i} α_ij = ξ_i/(ξ_i + ε)}. This means that if the
i-th sample is already well-separated (small intra-augmentation distance d_i and large inter-sample
distances d_ij), then ξ_i is small, the summation of weights Σ_{j≠i} α_ij associated with sample i is also
small, and such a sample is overall discounted. Setting ε = 0 reduces to the sample-agnostic constraint
(i.e., Σ_{j≠i} α_ij = 1).
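As a quick numerical sanity check of Theorem 2 (our own illustration, for the ε = 0 case), the closed-form α of Eqn. 6 can be compared against a direct numerical minimization of E_α − R_H(α) over the simplex, using a softmax parameterization; the variable names below are hypothetical.

```python
import torch

torch.manual_seed(0)
tau, n = 0.5, 6
c = torch.randn(n)                               # c_j stands for d_ij^2 - d_i^2 for one fixed row i
closed_form = torch.softmax(-c / tau, dim=0)     # Eqn. 6 (epsilon = 0)

# Minimize  sum_j alpha_j c_j + tau * sum_j alpha_j log(alpha_j)  over the simplex.
u = torch.zeros(n, requires_grad=True)
opt = torch.optim.Adam([u], lr=0.05)
for _ in range(2000):
    alpha = torch.softmax(u, dim=0)
    obj = (alpha * c).sum() + tau * (alpha * alpha.log()).sum()
    opt.zero_grad()
    obj.backward()
    opt.step()

numerical = torch.softmax(u, dim=0).detach()
print("max |closed form - numerical|:", (numerical - closed_form).abs().max().item())  # should be ~0
```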
Thm. 2 leads to a novel perspective of coordinate-wise optimization for Contrastive Learning (CL):
Corollary 1 (Contrastive Learning as Coordinate-wise Optimization). If ψ(x) = e^{x/τ}, minimizing
L_{φ,ψ} is equivalent to the following iterative procedure:

    (Min-player α)    α^t = argmin_{α∈A} E_α(θ^t) − R(α)        (8a)
    (Max-player θ)    θ^{t+1} = θ^t + η ∇_θ E_{α^t}(θ)          (8b)

Intuitively, the max player θ (Eqn. 8b) performs a one-step gradient ascent on the objective E_α(θ) −
R(α), learning a representation that maximizes the distance between different samples and minimizes the
distance between different augmentations of the same sample (as suggested by C_α[z, z]). On the other
hand, the min player α (Eqn. 8a) finds the optimal α analytically, assigning high weights to confusing
pairs for the max player to solve.
Relation to max-min formulation. While Corollary 1 looks very similar to a max-min formulation,
important differences exist. Unlike a traditional max-min formulation, in Corollary 1 there is an
asymmetry between θ and α. First, θ only takes one update step along the gradient ascent direction of
max_θ E_α(θ), while α is solved analytically. Second, due to the stop-gradient operator, the gradient
of θ contains no knowledge of how θ changes α. This prevents θ from adapting to α's response to
changes in θ. Both give the min player α an advantage in finding confusing sample pairs more effectively.
Relation to hard-negative samples. While many previous works (Kalantidis et al., 2020; Robinson
et al., 2021) focus on seeking and up-weighting hard samples, Corollary 1 shows that
contrastive losses already have such a mechanism at the batch level, focusing on "hard-negative pairs"
beyond hard-negative samples.
From this formulation, different pairwise importance α corresponds to different loss functions within
the loss family specified by Eqn. 1, and choosing among this family (i.e., different φ and ψ) can
be regarded as choosing different α when optimizing the same objective Eα (θ). Based on this
observation, we now propose the following training framework called α-CL:
Definition 1 (Pair-weighed Contrastive Learning (α-CL)). Optimize θ by gradient ascent: θ^{t+1} =
θ^t + η ∇_θ E_{sg(α^t)}(θ), with the energy E_α(θ) defined in Thm. 1 and the pairwise importance α^t = α(θ^t).

In α-CL, choosing α can be achieved either by implicitly specifying a regularizer R(α) and solving
Eqn. 8a, or by directly specifying a mapping α = α(θ) without any optimization. This opens a novel avenue
for CL loss design. Initial experiments (Sec. 6) show that α-CL gives comparable (or even better)
downstream performance on CIFAR10 and STL-10, compared to the vanilla InfoNCE loss.
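As an illustration of Def. 1, the following PyTorch sketch (ours, not the released implementation; `alpha_cl_step` and its arguments are hypothetical) performs one α-CL update with the InfoNCE-style importance of Eqn. 6: it computes α from the current pairwise distances, detaches it (the stop-gradient sg(·)), and then ascends the energy E_α(θ) = Σ_{i≠j} α_ij (d_ij² − d_i²).

```python
import torch

def alpha_cl_step(encoder, optimizer, x, x_prime, tau=0.5):
    """One alpha-CL update (Def. 1): gradient ascent on E_alpha with alpha under stop-gradient."""
    z, z_prime = encoder(x), encoder(x_prime)
    d_intra = ((z - z_prime) ** 2).sum(dim=1) / 2                      # d_i^2, shape (N,)
    d_pair = ((z.unsqueeze(1) - z.unsqueeze(0)) ** 2).sum(dim=2) / 2   # d_ij^2, shape (N, N)
    N = z.shape[0]
    diag = torch.eye(N, dtype=torch.bool, device=z.device)
    # InfoNCE-style pairwise importance (Eqn. 6), treated as a constant via detach() = sg(.)
    alpha = torch.softmax((-d_pair / tau).masked_fill(diag, float('-inf')), dim=1).detach()
    # Energy E_alpha = sum_{i != j} alpha_ij (d_ij^2 - d_i^2); alpha_ii = 0, so the diagonal vanishes.
    energy = (alpha * (d_pair - d_intra.unsqueeze(1))).sum()
    loss = -energy                       # minimizing -E_alpha = one gradient-ascent step on E_alpha
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return energy.item()
```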

Figure 2: Dynamics of CL with a multilayer (L = 5) linear network (DeepLin) with fixed α. Running the
training dynamics (Lemma 1) quickly leads to convergence of the objective towards the maximal eigenvalue of X_α.
For the dynamics of the singular values of W_l, the largest singular values (solid lines) converge to 1 while the
second largest singular values (dashed lines) decay to 0. (Panels: contrastive covariance X, eigenspectrum of X,
objective over time, singular value dynamics, singular vector alignment.)

4 Representation Learning in Deep Linear CL is PCA


In Corollary 1, optimizing over α is well-understood, since E_α(θ) is linear w.r.t. α and R(α) is in
general a (strongly) concave function. As a result, α has a unique optimum. On the other hand,
understanding the max player max_θ E_α(θ) is important since it performs representation learning in
CL. It is also a hard problem because of non-convex optimization.
We start with the specific case in which z is a deep linear network, i.e., z = W(θ)x, where W is the
equivalent linear mapping of the deep linear network, and θ denotes the parameters to be optimized. Note
that this covers many different kinds of deep linear networks, including VGG-like (Saxe et al., 2014),
ResNet-like (Hardt & Ma, 2017) and DenseNet-like (Huang et al., 2017). For notation brevity, we
define C_α[x] := C_α[x, x].
Corollary 2 (Representation learning in Deep Linear CL reparameterizes Principal Component
Analysis (PCA)). When z = W(θ)x with the constraint W W^T = I, E_α is the objective of Principal
Component Analysis (PCA) with the reparameterization W = W(θ):

    max_θ E_α(θ) = ½ tr( W(θ) X_α W^T(θ) )    s.t.  W W^T = I        (9)

here X_α := C_α[x] is the contrastive covariance of the input x.
As a comparison, in traditional Principal Component Analysis the objective is (Kokiopoulou et al.,
2011): max_W ½ tr(W V_sample[x] W^T) subject to the constraint W W^T = I, where V_sample[x] is
the empirical covariance of the dataset (here, of one batch). Therefore, X_α can be regarded as a
generalized covariance matrix, possibly containing negative eigenvalues. In the case of supervised
CL (i.e., pairs from the same/different labels are treated as positive/negative (Khosla et al., 2020)),
it is connected with Fisher's Linear Discriminant Analysis (Fisher, 1936).
Here we show a mathematically rigorous connection between CL and dimensionality reduction, as
suggested intuitively in (Hadsell et al., 2006). Unlike in traditional PCA, due to the presence of data
augmentation, the contrastive covariance X_α, while symmetric, is not necessarily a PSD matrix.
Nevertheless, the intuition is the same: to find the direction that corresponds to the maximal variation of
the data.
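A small numpy check (our illustration; the matrices are random stand-ins, not data from the paper) of this PCA view: the objective ½ tr(W X W^T) under W W^T = I is maximized by the top eigenvectors of X, even when X is indefinite like the contrastive covariance X_α.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
X = (A + A.T) / 2                          # symmetric and generally indefinite, like X_alpha

eigvals, eigvecs = np.linalg.eigh(X)       # eigenvalues in ascending order
k = 2
W_opt = eigvecs[:, -k:].T                  # rows = top-k eigenvectors, so W_opt @ W_opt.T = I

obj = lambda W: 0.5 * np.trace(W @ X @ W.T)
print("top-k objective:", obj(W_opt), "  0.5 * sum of top-k eigenvalues:", 0.5 * eigvals[-k:].sum())

# A random feasible W (orthonormal rows) does no better.
Q, _ = np.linalg.qr(rng.standard_normal((8, k)))
print("random feasible W:", obj(Q.T), "<=", obj(W_opt))
```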
While it is interesting to discover that CL with a deep linear network is essentially a reparameterization
of PCA, it remains unclear whether such a reparameterization leads to the same solution as PCA, in
particular when the network is deep (and may contain local optima). Also, PCA has an overall
end-to-end constraint W W^T = I, while in network training we instead use normalization layers,
and it is unclear whether the two are equivalent.
In this section, we show that for a specific deep linear model, almost all local maxima of Eqn. 9 are
global, and it indeed solves PCA.

4.1 A concrete deep linear model


We study a concrete deep linear network with parameters/weights θ := {W_l}_{l=1}^L:

    z[i] := W_L W_{L−1} · · · W_1 x[i]        (10)

Here W_l ∈ R^{n_l × n_{l−1}}, n_l is the number of nodes at layer l, z[i] is the output for x[i], and similarly
z[i'] for x[i']. We use θ to represent the collection of weights at all layers. For convenience, we
define the l-th layer activation f_l[i] = W_l f_{l−1}[i]. With this notation, f_0[i] = x[i] is the input and
z[i] = W_L f_{L−1}[i].
We call this setting DeepLin. The Jacobian matrix is W_{>l} := W_L W_{L−1} · · · W_{l+1}, and W := W_{>0} =
W_L W_{L−1} · · · W_1.
Lemma 1. The training dynamics of DeepLin is Ẇ_l = W_{>l}^T W_{>l} W_l C_α[f_{l−1}].

Note that C_α[f_0] = C_α[x] = X_α. Similar to supervised learning (Arora et al., 2018; Du et al.,
2018b), nearby layers are also balanced: d/dt ( W_l W_l^T − W_{l+1}^T W_{l+1} ) = 0.

4.2 Normalization Constraints


Note that if we just run the training dynamics (Lemma 1) without any constraints, ‖W_l‖_F will go to
infinity. Fortunately, empirical works already suggest various ways of normalization to stabilize
network training.
One popular technique in CL is ℓ2 normalization. It is often put right after the output of the network
and before the loss function L (Chen et al., 2020; Grill et al., 2020; He et al., 2020), i.e., ẑ[i] =
z[i]/‖z[i]‖₂. Besides, LayerNorm (Ba et al., 2016) (i.e., f̂[i] = (f[i] − mean(f[i]))/std(f[i])) is
extensively used in Transformer-based models (Xiong et al., 2020). Here we show that for the gradient
flow dynamics of MLP models, such normalization layers conserve ‖W_l‖_F for any layer l below them,
regardless of the loss function.
Lemma 2. For MLP, if the weight W_l is below an ℓ2-norm or LayerNorm layer, then d/dt ‖W_l‖²_F = 0.
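A quick autograd check of Lemma 2 for the ℓ2-normalization case (our illustration with an arbitrary downstream loss): the gradient of the loss is orthogonal to W, so ‖W‖_F is conserved under gradient flow.

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 7, requires_grad=True)   # weight right below the l2-normalization
x = torch.randn(7)
z = W @ x
z_hat = z / z.norm()                        # l2-normalized output
loss = (z_hat * torch.randn(4)).sum()       # any differentiable loss on the normalized output
loss.backward()
# <grad, W> = 0 up to numerical error, hence d/dt ||W||_F^2 = 0 under gradient flow.
print(torch.dot(W.grad.flatten(), W.detach().flatten()).item())
```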

Note that Lemma 2 also holds for nonlinear MLPs with reversible activations, which include ReLU
(see SM). Therefore, without loss of generality, we consider the following complete objective for the
max player with DeepLin (here Θ is the constraint set of the weights due to normalization):

    max_{θ∈Θ} E_α(θ) := ½ tr(W X_α W^T),    Θ := {θ : ‖W_l‖_F = 1, 1 ≤ l ≤ L}        (11)

4.3 Representation Learning with DeepLin is PCA


As one of our main contributions, the following theorem asserts that almost all locally optimal
solutions of Eqn. 11 are global, and the optimal objective corresponds to the PCA objective. Note
that (Kawaguchi, 2016; Laurent & Brecht, 2018) prove that there are no bad local optima for deep linear
networks in supervised learning, while here we give similar results for CL; additionally, we also give the
(simple) rank-1 structure of all local optima.
Theorem 3 (Representation Learning with DeepLin is PCA). If λ_max(X_α) > 0, then for any local
maximum θ ∈ Θ of Eqn. 11 whose W_{>1}^T W_{>1} has a distinct maximal eigenvalue:

    • there exists a set of unit vectors {v_l}_{l=0}^L so that W_l = v_l v_{l−1}^T for 1 ≤ l ≤ L; in particular,
      v_0 is the unit eigenvector corresponding to λ_max(X_α),
    • θ is globally optimal with objective 2E* = λ_max(X_α).

Corollary 3. If we additionally use per-filter normalization (i.e., ‖w_lk‖₂ = 1/√n_l), then Thm. 3
holds and v_l is more constrained: [v_l]_k = ±1/√n_l for 1 ≤ l ≤ L − 1.

Remark. Here we prove that, given fixed α, maximizing E_α(θ) gives rank-1 solutions for deep linear
networks. This conclusion is an extension of (Jing et al., 2022), which shows that weight collapse
happens if θ is a 2-layer linear network and α is fixed. If the pairwise importance α is adversarial, then
it may not lead to a rank-1 solution. In fact, α can magnify minimal eigen-directions and change the
eigenstructure of X_α continuously. We leave this for future work.
>
Note that the condition that “W>1 W>1 has distinct maximal eigenvalue” is important. Otherwise
there are counterexamples. For example, consider 1-layer linear network z = W1 x, and Xα has
duplicated maximal eigenvalues (with u1 and u2 being corresponding orthogonal eigenvectors), then
>
W>1 W>1 = I (i.e., it has degenerated eigenvalues), and for any local maximal W1 , its row vector
can be arbitrary linear combinations of u1 and u2 and thus W1 is not rank-1.

Compared to recent work (Ji et al., 2021) that also relates CL to PCA in the linear representation
setting using constant α, our Theorem 3 makes no statistical assumptions on the data distribution and
augmentation, and operates on the vanilla InfoNCE loss and deep architectures.

5 How Representation Learning Differs in Two-layer ReLU Network


So far we have shown that the max player max_θ E_α(θ) := ½ tr(C_α[z(θ)]) is essentially a PCA
objective when the input-output mapping z = W(θ)x is linear. A natural question arises: what is
the benefit of CL if its representation learning component has such a simple nature? Why can it learn
a good representation in practice, beyond PCA?
For this, nonlinearity is the key, but understanding its role is highly nontrivial. For example, when
the neural network model is nonlinear, Thm. 1 and Corollary 1 hold but Corollary 2 does not. There is
then not even a well-defined X_α, due to the fact that hidden nodes can be switched on/off
given different data inputs. Previous works (Safran & Shamir, 2018; Du et al., 2018a) also show that
with nonlinearity, spurious local optima exist in supervised learning.
Here we take a first step towards analyzing nonlinear cases. We study 2-layer models with the ReLU activation
h(x) = max(x, 0). We show that with a proper data assumption, the 2-layer model shares a
modified version of the dynamics with its linear counterpart, and the contrastive covariance term X_α (and its
eigenstructure) remains well-defined and useful in the nonlinear case.

5.1 The 2-layer ReLU network and data model


We consider the bottom-layer weight W_1 = [w_11, w_12, ..., w_1K]^T with w_1k being the k-th filter.
For brevity, let K = n_1 be the number of hidden nodes. We still consider solutions in the constraint
set Θ (Eqn. 11), since Lemma 2 still holds for ReLU networks. This model is named ReLU2Layer.
In addition, we assume the following data model:
Assumption 1 (Orthogonal mixture model within receptive field R_k). There exists a set of orthonormal
bases {x̄_m}_{m=1}^M so that any input data x[i] = Σ_m a_m[i] x̄_m satisfies the following properties.
Nonnegativeness: a_m[i] ≥ 0; One-hotness: for any k, a_m[i] > 0 for at most one m; and Augmentation only
scales x_k by a (sample-dependent) factor, i.e., x[i'] = γ[i] x[i] with γ[i] > 0.

Since x always appears in inner products with the weight vectors w_1k, with a rotation of coordinates
we can simply set x̄_m = e_m, where e_m is the one-hot vector whose m-th component is 1. In this case,
x ≥ 0 is always a one-hot vector with at most one positive entry.
Intuitively, the model is motivated by sparsity: in each instantiation of x, there is a very small number
of activated modes, and their linear combination becomes the input signal x. As we shall see, even
with this simple model, the dynamics of the ReLU network behaves very differently from the linear case.
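As a concrete (hypothetical) instantiation of Assumption 1 after the rotation x̄_m = e_m, the numpy snippet below samples such a batch: each x[i] activates a single coordinate with a nonnegative coefficient, and its augmentation x[i'] only rescales it by a positive factor.

```python
import numpy as np

def sample_orthogonal_mixture(N, M, seed=0):
    """Toy data satisfying Assumption 1 with x_bar_m = e_m (illustrative, not from the paper)."""
    rng = np.random.default_rng(seed)
    modes = rng.integers(0, M, size=N)          # one active mode per sample (one-hotness)
    amps = rng.uniform(0.5, 1.5, size=N)        # nonnegative coefficients a_m[i]
    X = np.zeros((N, M))
    X[np.arange(N), modes] = amps
    gamma = rng.uniform(0.8, 1.2, size=N)       # augmentation = per-sample positive scaling gamma[i]
    return X, gamma[:, None] * X                # x[i] and x[i'] = gamma[i] * x[i]

X, X_aug = sample_orthogonal_mixture(N=8, M=5)
```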
With this assumption, we only need to consider nonnegative lower-layer weights, and X_α is still a valid
quantity for ReLU2Layer:
Lemma 3 (Evaluation of ReLU2Layer). If Assumption 1 holds, setting w'_1k = max(w_1k, 0) won't
change the output of ReLU2Layer. Furthermore, if W_1 ≥ 0, then the formula for the linear network
E_α = ½ tr(W_2 W_1 X_α W_1^T W_2^T) still holds for ReLU2Layer.

On the other hand, sharing the energy function E_α does not mean that ReLU2Layer is completely
identical to its linear version. In fact, the dynamics follows its linear counterpart, but with an important
modification:
Theorem 4 (Dynamics of ReLU2Layer). If Assumption 1 holds, then the dynamics of ReLU2Layer
with w1k ≥ 0 is equivalent to linear dynamics with the Sticky Weight rule: any component that
reaches 0 stays 0.

As we will see, this modification leads to very different dynamics and local optima in ReLU2Layer
compared to the linear case, even when there is only one ReLU node.

5.2 Dynamics in One ReLU node


Now we consider the dynamics of the simplest case: ReLU2Layer with only 1 hidden node. In this
case, W_{>1}^T W_{>1} is a scalar and thus W_2^T W_2 = tr(W_2^T W_2) = 1. We only need to consider w_1 ∈ R^{n_0},
Figure 3: Theorem 6 shows that training ReLU2Layer can lead to more diverse hidden weight patterns
beyond the rank-1 solution obtained in the linear case (panels: contrastive covariance X over iterations;
converged W_1 and W_2^T W_2).

which is the only weight vector in the lower layer, under the constraint ‖W_1‖_F = ‖w_1‖₂ = 1
(Eqn. 11). We denote this setting as ReLU2Layer1Hid.
The dynamics now becomes very different from the linear setting. For a linear network, according
to Theorem 3, w_1 converges to the largest eigenvector of X_α = C_α[x_1]. For ReLU2Layer1Hid, the
situation differs drastically:
Theorem 5. If Assumption 1 holds, then in ReLU2Layer1Hid, w_1 → e_m for some m.
Intuitively, this theorem is obtained by closely tracing the dynamics. When the number of positive
entries of w_1 is more than 1, the linear dynamics always hits the boundary of the polytope w_1 ≥ 0,
making one of its entries zero, which then sticks to zero due to the sticky weight rule. This procedure repeats
until there is only one surviving positive entry in w_1.
Overall, this simple case already shows that the nonlinear landscape can lead to many local optima:
for any m, w_1 = e_m is a local optimum. Which one the training falls into depends on weight
initialization, and critically affects the properties of the pre-trained model.
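To see Theorem 5 and the sticky weight rule in action, the numpy sketch below (our illustration, not the paper's code) simulates the ReLU2Layer1Hid dynamics ẇ₁ = diag(w₁ > 0) X w₁ with norm projection, using a stand-in X with positive diagonal and negative off-diagonal entries (as produced by the data model); entries of w₁ that hit zero are frozen, and w₁ typically ends up aligned with a single coordinate e_m.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 6
C = rng.uniform(0.1, 0.5, (M, M))
C = (C + C.T) / 2
np.fill_diagonal(C, 0.0)
X = np.diag(rng.uniform(0.5, 1.5, M)) - C        # positive diagonal, negative off-diagonal

w = np.abs(rng.standard_normal(M))               # start nonnegative (Lemma 3)
w /= np.linalg.norm(w)

eta = 0.01
for _ in range(20000):
    active = (w > 0).astype(float)               # sticky weight rule: zero entries stay zero
    w = w + eta * active * (X @ w)               # linear dynamics masked by the active set
    w = np.maximum(w, 0.0)                       # entries crossing zero are clipped (and then stick)
    w /= np.linalg.norm(w)                       # keep ||w_1||_2 = 1

print("surviving coordinates:", np.nonzero(w > 1e-6)[0], " w_1 =", np.round(w, 3))
```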

5.3 Multiple hidden nodes


For more complicated situations like multiple hidden units, completely characterizing the training dynamics
as in Theorem 5 becomes hard (if not impossible). Instead, we focus on fixed point analysis.
For the deep linear model, using multiple hidden units does not lead to any better solutions. According
to Thm. 3, at a local optimum, W_1 = v_1 v_0^T. This means that the weights w_1k, which are the row vectors of
W_1, are just scaled versions of the maximal eigenvector v_0 of X_α. Moreover, this is independent of
the eigenstructure of X_α as long as λ_max(X_α) > 0.
In ReLU2Layer, the situation is a bit different: Thm. 6 shows that the hidden nodes are (slightly)
more diverse. Fig. 3 shows one such example. The intuition is that in the nonlinear case, the rank-1
structure of the critical points may be replaced by low-rank structures.
Theorem 6 (ReLU2Layer encourages diversity). If Assumption 1 holds, then for any local optimum
(W_2, W_1) ∈ Θ of ReLU2Layer with E > 0, either W_1 = v e_m^T for some m and v ≥ 0, or
rank(W_1) > 1.

6 Experiments
We evaluate our α-CL framework (Def. 1) on CIFAR10 (Krizhevsky et al., 2009) and STL-10 (Coates
et al., 2011) with ResNet18 (He et al., 2016), and compare the downstream performance of multiple
losses, with regularizers taking the form R(α) = Σ_i Σ_{j≠i} r(α_ij) under the constraint Σ_{j≠i} α_ij = 1.
Here r can be different concave functions:
• (α-CL-r_H) Entropy regularizer r_H(α_ij) = −τ α_ij log α_ij;
• (α-CL-r_γ) Inverse regularizer r_γ(α_ij) = (τ/(1−γ)) α_ij^{1−γ} (γ > 1);
• (α-CL-r_s) Square regularizer r_s(α_ij) = −(τ/2) α_ij².
Besides, we also compare with the following:
• Minimizing the InfoNCE or quadratic loss: min_θ L(θ) for L ∈ {L_nce, L_quadratic}.
• Setting α as in InfoNCE (Eqn. 6) and backpropagating through α = α(θ) with respect to θ.

                 CIFAR-10                                            STL-10
                 100 epochs      300 epochs      500 epochs          100 epochs      300 epochs      500 epochs
L_quadratic      63.59 ± 2.53    73.02 ± 0.80    73.58 ± 0.82        55.59 ± 4.00    64.97 ± 1.45    67.28 ± 1.21
L_nce            84.06 ± 0.30    87.63 ± 0.13    87.86 ± 0.12        78.46 ± 0.24    82.49 ± 0.26    83.70 ± 0.12
backprop α(θ)    83.42 ± 0.25    87.18 ± 0.19    87.48 ± 0.21        77.88 ± 0.17    81.86 ± 0.30    83.19 ± 0.16
α-CL-r_H         84.27 ± 0.24    87.75 ± 0.25    87.92 ± 0.24        78.53 ± 0.35    82.62 ± 0.15    83.74 ± 0.18
α-CL-r_γ         83.72 ± 0.19    87.51 ± 0.11    87.69 ± 0.09        78.22 ± 0.28    82.19 ± 0.52    83.47 ± 0.34
α-CL-r_s         84.72 ± 0.10    86.62 ± 0.17    86.74 ± 0.15        76.95 ± 1.06    80.64 ± 0.77    81.65 ± 0.59
α-CL-direct      85.11 ± 0.19    87.93 ± 0.16    88.09 ± 0.13        79.32 ± 0.36    82.95 ± 0.17    84.05 ± 0.20

Table 1: Comparison over multiple loss formulations (ResNet18 backbone, batch size 128). Top-1 accuracy
with the linear evaluation protocol. Temperature τ = 0.5 and learning rate 0.01. Bold is the highest performance and
blue the second highest. Each setting is repeated 5 times with different random seeds.

               ResNet18 Backbone                                     ResNet50 Backbone
               100 epochs      300 epochs      500 epochs            100 epochs      300 epochs      500 epochs
CIFAR-100
L_nce          55.70 ± 0.37    59.71 ± 0.36    59.89 ± 0.34          60.16 ± 0.48    65.40 ± 0.31    65.53 ± 0.30
α-CL-direct    57.63 ± 0.07    60.12 ± 0.26    60.27 ± 0.29          62.93 ± 0.28    65.84 ± 0.14    65.87 ± 0.21
CIFAR-10
L_nce          84.06 ± 0.30    87.63 ± 0.13    87.86 ± 0.12          86.39 ± 0.16    89.97 ± 0.14    90.19 ± 0.23
α-CL-direct    85.11 ± 0.19    87.93 ± 0.16    88.09 ± 0.13          87.79 ± 0.25    90.41 ± 0.18    90.50 ± 0.21
STL-10
L_nce          78.46 ± 0.24    82.49 ± 0.26    83.70 ± 0.12          81.64 ± 0.24    86.57 ± 0.17    87.90 ± 0.22
α-CL-direct    79.32 ± 0.36    82.95 ± 0.17    84.05 ± 0.20          83.20 ± 0.25    87.17 ± 0.14    87.85 ± 0.21

Table 2: More experiments with ResNet18/ResNet50 backbones on CIFAR-10, STL-10 and CIFAR-100.
Batch size is 128. For ResNet18 the learning rate is 0.01; for ResNet50 it is 0.001.

• (α-CL-direct) Directly setting α (here p > 1):

    α_ij = exp(−d_ij^p / τ) / Σ_j exp(−d_ij^p / τ)        (12)

For the inverse regularizer r_γ, we pick γ = 2 and τ = 0.5; for directly-set α (α-CL-direct), we pick p = 4 and τ = 0.5;
for the square regularizer, we use τ = 5. All training is performed with the Adam (Kingma & Ba, 2014)
optimizer. The code is written in PyTorch and a single modern GPU suffices for the experiments.
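For concreteness, a minimal PyTorch sketch of the α-CL-direct weighting in Eqn. 12 (our illustration; the function name, the input `d` of pairwise distances, and its shape are assumptions, not the released code):

```python
import torch

def alpha_cl_direct_weights(d, tau=0.5, p=4):
    """Eqn. 12: alpha_ij proportional to exp(-d_ij^p / tau), normalized over j (diagonal excluded)."""
    N = d.shape[0]
    diag = torch.eye(N, dtype=torch.bool, device=d.device)
    logits = (-(d ** p) / tau).masked_fill(diag, float('-inf'))
    # alpha is then used under stop-gradient in the alpha-CL update (Def. 1).
    return torch.softmax(logits, dim=1).detach()
```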
The results are shown in Tbl. 1. We can see that (1) backpropagating through α(θ) is worse,
justifying our perspective of coordinate-wise optimization, (2) our proposed α-CL works for different
regularizers, (3) using different regularizers leads to comparable or better performance than the original
InfoNCE L_nce, and (4) the pairwise importance α does not even need to come from a minimization
process. Instead, we can directly set α based on the pairwise squared distances d_ij² and d_i². For α-CL-
direct, the performance is slightly worse if we do not normalize α_ij (i.e., α_ij := exp(−d_ij^p/τ)). It
seems that for strong performance, dr/dα_ij should go to +∞ when α_ij → 0. Regularizers that do not
satisfy this condition (e.g., the square regularizer r_s) may not work as well.
Tbl. 2 shows more experiments with different backbones (e.g., ResNet50) and more complicated
datasets (e.g., CIFAR-100). Overall, we see consistent gains of α-CL over InfoNCE in early stages of
training (e.g., 1-2 points of absolute percentage gain) and comparable performance at 500 epochs.
More ablations on batch sizes and the exponent p in Eqn. 12 are provided in Appendix B.

7 Conclusion and Future Work


We provide a novel perspective on contrastive learning (CL) through the lens of coordinate-wise opti-
mization and propose a unified framework called α-CL that not only covers a broad family of loss
functions including InfoNCE, but also allows directly setting the importance of sample pairs. Preliminary
experiments on CIFAR10/STL-10/CIFAR-100 show comparable or better performance with the new losses
than with InfoNCE. Furthermore, we prove that with deep linear networks, the representation learning
part is equivalent to Principal Component Analysis (PCA). In addition, we also extend our analysis
to representation learning in 2-layer ReLU networks, shedding light on the important differences between
representation learning in the linear and nonlinear cases.

Future work. Our framework α-CL unifies various loss functions through different choices of the
pairwise importance α, and how to find good choices remains open. Also, we mainly focus on
representation learning with fixed pairwise importance α. However, in actual training, α and θ change
concurrently. Understanding their interaction is an important next step. Finally, removing Assumption 1
in the ReLU analysis is also an open problem to be addressed in future work.

References
Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural
networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
Arora, S., Cohen, N., and Hazan, E. On the optimization of deep networks: Implicit acceleration by
overparameterization. In International Conference on Machine Learning, pp. 244–253. PMLR,
2018.
Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of
contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450,
2016.
Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples
without local minima. Neural networks, 2(1):53–58, 1989.
Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual
information neural estimation. In International Conference on Machine Learning, pp. 531–540.
PMLR, 2018.
Caron, M., Bojanowski, P., Joulin, A., and Douze, M. Deep clustering for unsupervised learning of
visual features. In ECCV, 2018.
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of
visual features by contrasting cluster assignments. In NeurIPS, 2020.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of
visual representations. In International conference on machine learning, pp. 1597–1607. PMLR,
2020.
Chen, X. and He, K. Exploring simple siamese representation learning. In CVPR, 2020.
Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning.
In International conference on artificial intelligence and statistics, 2011.
Coria, J. M., Bredin, H., Ghannay, S., and Rosset, S. A comparison of metric learning loss functions
for end-to-end speaker verification. In International Conference on Statistical Language and
Speech Processing, pp. 137–148. Springer, 2020.
Du, S., Lee, J., Tian, Y., Singh, A., and Poczos, B. Gradient descent learns one-hidden-layer cnn:
Don’t be afraid of spurious local minima. In International Conference on Machine Learning, pp.
1339–1348. PMLR, 2018a.
Du, S. S., Hu, W., and Lee, J. D. Algorithmic regularization in learning deep homogeneous models:
Layers are automatically balanced. arXiv preprint arXiv:1806.00900, 2018b.
Fisher, R. A. The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):
179–188, 1936.
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires,
B. A., Guo, Z. D., Azar, M. G., et al. Bootstrap your own latent: A new approach to self-supervised
learning. NeurIPS, 2020.
Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In
2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06),
volume 2, pp. 1735–1742. IEEE, 2006.

HaoChen, J. Z., Wei, C., Gaidon, A., and Ma, T. Provable guarantees for self-supervised deep
learning with spectral contrastive loss. NeurIPS, 2021.
Hardt, M. and Ma, T. Identity matters in deep learning. ICLR, 2017.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. B. Momentum contrast for unsupervised visual
representation learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 9726–9735, 2020.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional
networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
4700–4708, 2017.
Ji, W., Deng, Z., Nakada, R., Zou, J., and Zhang, L. The power of contrast for feature learning: A
theoretical analysis. arXiv preprint arXiv:2110.02473, 2021.
Jing, L., Vincent, P., LeCun, Y., and Tian, Y. Understanding dimensional collapse in contrastive
self-supervised learning. ICLR, 2022.
Kalantidis, Y., Sariyildiz, M. B., Pion, N., Weinzaepfel, P., and Larlus, D. Hard negative mixing for
contrastive learning. NeurIPS, 2020.
Kawaguchi, K. Deep learning without poor local minima. NeurIPS, 2016.
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan,
D. Supervised contrastive learning. NeurIPS, 2020.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Kokiopoulou, E., Chen, J., and Saad, Y. Trace optimization and eigenproblems in dimension reduction
methods. Numerical Linear Algebra with Applications, 18(3):565–602, 2011.
Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
Laurent, T. and Brecht, J. Deep linear networks with arbitrary loss: All local minima are global. In
International conference on machine learning, pp. 2902–2907. PMLR, 2018.
Lee, J. D., Lei, Q., Saunshi, N., and Zhuo, J. Predicting what you already know helps: Provable
self-supervised learning. Advances in Neural Information Processing Systems, 34, 2021.
Misra, I. and Maaten, L. v. d. Self-supervised learning of pretext-invariant representations. In CVPR,
2020.
Oh Song, H., Xiang, Y., Jegelka, S., and Savarese, S. Deep metric learning via lifted structured feature
embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pp. 4004–4012, 2016.
Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding.
arXiv preprint arXiv:1807.03748, 2018.
Robinson, J., Chuang, C.-Y., Sra, S., and Jegelka, S. Contrastive learning with hard negative samples.
ICLR, 2021.
Safran, I. and Shamir, O. Spurious local minima are common in two-layer relu neural networks. In
International Conference on Machine Learning, pp. 4433–4441. PMLR, 2018.
Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning
in deep linear neural networks. ICLR, 2014.
Schroff, F., Kalenichenko, D., and Philbin, J. Facenet: A unified embedding for face recognition and
clustering. In CVPR, 2015.

Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. In Advances in neural
information processing systems, pp. 1857–1865, 2016.
Tian, Y. A theoretical framework for deep locally connected relu network. arXiv preprint
arXiv:1809.10829, 2018.
Tian, Y. Student specialization in deep relu networks with finite width and input dimension. ICML,
2020.
Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. In Computer Vision–ECCV
2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pp.
776–794. Springer, 2020a.
Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., and Isola, P. What makes for good views for
contrastive learning? NeurIPS, 2020b.
Tian, Y., Yu, L., Chen, X., and Ganguli, S. Understanding self-supervised learning with dual deep
networks. arXiv preprint arXiv:2010.00578, 2020c.
Wen, Z. and Li, Y. Toward understanding the feature learning process of self-supervised contrastive
learning. arXiv preprint arXiv:2105.15134, 2021.
Wold, S., Esbensen, K., and Geladi, P. Principal component analysis. Chemometrics and intelligent
laboratory systems, 2(1-3):37–52, 1987.
Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance
discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pp. 3733–3742, 2018.
Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.
On layer normalization in the transformer architecture. In International Conference on Machine
Learning, pp. 10524–10533. PMLR, 2020.
Yeh, C.-H., Hong, C.-Y., Hsu, Y.-C., Liu, T.-L., Chen, Y., and LeCun, Y. Decoupled contrastive
learning. arXiv preprint arXiv:2110.06848, 2021.
Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow twins: Self-supervised learning via
redundancy reduction. arXiv preprint arxiv:2103.03230, 2021.
Zhou, Y. and Liang, Y. Critical points of linear neural networks: Analytical forms and landscape
properties. In International Conference on Learning Representations, 2018.

A Proofs
A.1 Section 3

Theorem 1. For any differentiable mapping z = z(x; θ), gradient descent of L_{φ,ψ} is equivalent to
gradient ascent of the objective E_α(θ) := ½ tr(C_α[z(θ), z(θ)]):

    ∂L_{φ,ψ}/∂θ = − ∂E_α/∂θ |_{α=α(θ)}        (3)

Here the pairwise importance α = α(θ) := {α_ij(θ)} is a function of the input batch x, defined as:

    α_ij(θ) := φ'(ξ_i) ψ'(d_i² − d_ij²) ≥ 0        (4)

where φ', ψ' ≥ 0 are the derivatives of φ, ψ. The contrastive covariance C_α[·, ·] is defined as:

    C_α[a, b] := Σ_{i=1}^N Σ_{j≠i} α_ij (a[i] − a[j])(b[i] − b[j])^T − Σ_{i=1}^N ( Σ_{j≠i} α_ij ) (a[i] − a[i'])(b[i] − b[i'])^T        (5)

That is, minimizing the loss function L_{φ,ψ}(θ) can be regarded as maximizing the energy function
E_{α=sg(α(θ))}(θ) with respect to θ. Here sg(·) means stop-gradient, i.e., the gradient of θ is not
backpropagated into α(θ).

Proof. By the definition of gradient descent, we have for any component θ of the high-dimensional
vector θ:

    −∂L/∂θ = − Σ_{i=1}^N [ (∂z[i]/∂θ)(∂L/∂z[i]) + (∂z[i']/∂θ)(∂L/∂z[i']) ]        (13)

Here we use the "denominator-layout notation" and treat ∂L/∂z[i] as a column vector and ∂z[i]/∂θ as a
row vector. Using Lemma 4, we have:

    −∂L/∂θ = C_α[ ∂z/∂θ, z^T ]        (14)

On the other hand, treating α as independent of θ, we compute (here o_k is the k-th component of z):

    ∂E_α/∂θ = ½ Σ_k C_α[ ∂o_k/∂θ, o_k ] + ½ Σ_k C_α[ o_k, ∂o_k/∂θ ]        (15)

For scalars x and y, C_α[x, y] = C_α[y, x], and Σ_k C_α[a_k, b_k] = C_α[a, b^T] for a row vector a and a
column vector b. Therefore,

    ∂E_α/∂θ = C_α[ ∂z/∂θ, z^T ]        (16)

Therefore, we have

    ∂E_α/∂θ = −∂L/∂θ        (17)

and the proof is complete.

Theorem 2. If ψ(x) = e^{x/τ}, then the corresponding pairwise importance α (Eqn. 4) is the solution
to the minimization problem:

    α(θ) = argmin_{α∈A} E_α(θ) − R(α),    A := { α : ∀i, Σ_{j≠i} α_ij = τ^{-1} ξ_i φ'(ξ_i), α_ij ≥ 0 }        (7)

Here the regularization is R(α) = R_H(α) := τ Σ_{i=1}^N H(α_i·) = −τ Σ_{i=1}^N Σ_{j≠i} α_ij log α_ij.

Proof. We just need to solve for the internal minimizer w.r.t. α. Note that each α_i· can be optimized
independently.
First, we know that E_α(θ) := ½ tr C_α[z, z] can be written as:

    E_α(θ) = ½ Σ_{i≠j} α_ij [ tr((z[i] − z[j])(z[i] − z[j])^T) − tr((z[i] − z[i'])(z[i] − z[i'])^T) ]        (18)
           = ½ Σ_{i≠j} α_ij [ ‖z[i] − z[j]‖²₂ − ‖z[i] − z[i']‖²₂ ]        (19)
           = Σ_{i≠j} α_ij ( d_ij² − d_i² )        (20)

For each α_i·, applying Lemma 5 with c_ij = d_ij² − d_i², the optimal solution α is:

    α_ij = (1/τ) exp(−c_ij/τ) φ'( Σ_{j≠i} exp(−c_ij/τ) )        (21)
         = (1/τ) exp((d_i² − d_ij²)/τ) φ'( Σ_{j≠i} exp((d_i² − d_ij²)/τ) )        (22)
         = ψ'(d_i² − d_ij²) φ'( Σ_{j≠i} ψ(d_i² − d_ij²) )        (23)
         = ψ'(d_i² − d_ij²) φ'(ξ_i)        (24)

which coincides with Eqn. 4, which comes from the gradient descent rule of the loss function L_{φ,ψ}.
In particular, for InfoNCE, we have φ(x) = τ log(ε + x), φ'(x) = τ/(x + ε) and therefore:

    α_ij = exp((d_i² − d_ij²)/τ) / ( ε + Σ_{j≠i} exp((d_i² − d_ij²)/τ) ) = exp(−d_ij²/τ) / ( ε exp(−d_i²/τ) + Σ_{j≠i} exp(−d_ij²/τ) )        (25)

which is exactly the coefficient α_ij directly computed during the minimization of L_nce. If ε = 0, then
the constraint becomes Σ_{j≠i} α_ij = 1 and we have:

    α_ij = exp(−d_ij²/τ) / Σ_{j≠i} exp(−d_ij²/τ)        (26)

That is, the coefficients α do not depend on the intra-augmentation squared distance d_i².
Corollary 1 (Contrastive Learning as Coordinate-wise Optimization). If ψ(x) = e^{x/τ}, minimizing
L_{φ,ψ} is equivalent to the following iterative procedure:

    (Min-player α)    α^t = argmin_{α∈A} E_α(θ^t) − R(α)        (8a)
    (Max-player θ)    θ^{t+1} = θ^t + η ∇_θ E_{α^t}(θ)          (8b)

Proof. The proof naturally follows from the conclusions of Theorem 1 and Theorem 2.

A.2 Section 4

Corollary 2 (Representation learning in Deep Linear CL reparameterizes Principal Component
Analysis (PCA)). When z = W(θ)x with the constraint W W^T = I, E_α is the objective of Principal
Component Analysis (PCA) with the reparameterization W = W(θ):

    max_θ E_α(θ) = ½ tr( W(θ) X_α W^T(θ) )    s.t.  W W^T = I        (9)

here X_α := C_α[x] is the contrastive covariance of the input x.

Figure 4: Notations on normalization (Sec. A.2.1).

Proof. Notice that in the deep linear setting, z = W(θ)x, where W(θ) does not depend on specific
samples. Therefore, C_α[z, z] = W(θ) C_α[x, x] W^T(θ) = W(θ) X_α W^T(θ).

Lemma 1. The training dynamics of DeepLin is Ẇ_l = W_{>l}^T W_{>l} W_l C_α[f_{l−1}].

Proof. We can start from Eqn. 13 directly and take out J_{>l}^T. This leads to

    Ẇ_l = −J_{>l}^T Σ_{i=1}^N [ (∂L/∂z[i]) f_{l−1}^T[i] + (∂L/∂z[i']) f_{l−1}^T[i'] ] = J_{>l}^T C_α[z, f_{l−1}]        (27)

Using z = J_{≥l} f_{l−1} leads to the conclusion. If the network is linear, then J_{>l}[i] = J_{>l} is a
constant. Then we can take the common factor J_{>l}^T J_{≥l} out of the summation, yielding Ẇ_l = J_{>l}^T J_{≥l} F_{l−1}.
Here F_l := C_α[f_l] is the contrastive covariance at layer l.

A.2.1 Section 4.2


Here we consider the more general case where the deep network is nonlinear. Let h(·) be the
point-wise activation function, so that the network architecture looks like the following:

    z[i] := W_L h(W_{L−1} h(· · · h(W_1 x[i])))        (28)

We consider the case where h(·) satisfies the following constraint:
Definition 2 (Reversibility (Tian et al., 2020c) / Homogeneity (Du et al., 2018b)). The activation
function h(x) satisfies h(x) = h'(x)x.

This is satisfied by linear, ReLU, leaky ReLU and many polynomial activations (with an addi-
tional constant). With this condition, we have f_l[i] = D_l W_l f_{l−1}[i], where D_l = D_l(x[i]) :=
diag[h'(w_lk^T f_{l−1}[i])] ∈ R^{n_l × n_l} is a diagonal matrix. For the ReLU activation, the diagonal entries of D_l
are binary.
Definition 3 (Reversible Layers (Tian et al., 2020c)). A layer is reversible if there exists J[i] so that
f_out[i] = J[i] f_in[i] and g_in[i] = J^T[i] g_out[i] for each sample i.

It is clear that linear layers, ReLU and leaky ReLU are reversible. Lemma 6 tells us that ℓ2-
normalization and LayerNorm are also reversible.

Lemma 2. For MLP, if the weight W_l is below an ℓ2-norm or LayerNorm layer, then d/dt ‖W_l‖²_F = 0.

Proof. See Lemma 7, which proves more general cases.

A.2.2 Section 4.3


Definition 4 (Aligned-rank-1 solution). A solution θ = {W_l}_{l=1}^L is called aligned-rank-1 if there
exists a set of unit vectors {v_l}_{l=0}^L so that W_l = v_l v_{l−1}^T for 1 ≤ l ≤ L.
Theorem 3 (Representation Learning with DeepLin is PCA). If λ_max(X_α) > 0, then for any local
maximum θ ∈ Θ of Eqn. 11 whose W_{>1}^T W_{>1} has a distinct maximal eigenvalue:

    • there exists a set of unit vectors {v_l}_{l=0}^L so that W_l = v_l v_{l−1}^T for 1 ≤ l ≤ L; in particular,
      v_0 is the unit eigenvector corresponding to λ_max(X_α),
    • θ is globally optimal with objective 2E* = λ_max(X_α).

Proof. A necessary condition for θ to be a local maximum is the critical point condition (here λ_{l−1}
is some constant):

    W_{>l}^T W_{>l} W_l F_{l−1} = λ_{l−1} W_l        (29)

Right-multiplying both sides of the critical point condition for W_l by W_l^T and taking the matrix trace, we
have:

    2E(θ) = tr( W_{>l}^T W_{>l} W_l F_{l−1} W_l^T ) = tr( λ_{l−1} W_l W_l^T ) = λ_{l−1}        (30)

Therefore, all λ_{l−1} are the same, denoted as λ, and they are equal to the objective value 2E(θ).
Now let us consider l = 1. Then we have:

    W_{>1}^T W_{>1} W_1 X = λ W_1        (31)

Applying vec(AXB) = (B^T ⊗ A) vec(X), we have:

    (X ⊗ W_{>1}^T W_{>1}) vec(W_1) = λ vec(W_1)        (32)

with the constraint ‖vec(W_1)‖²₂ = ‖W_1‖²_F = 1. Similarly, we have 2E(θ) = λ.
We then prove that λ is the largest eigenvalue of X ⊗ W_{>1}^T W_{>1}. We prove this by contradiction. If not,
then vec(W_1) is not the largest eigenvector, and there is always a direction in which W_1 can move, while
respecting the constraint ‖W_1‖_F = 1 and keeping W_{>1} fixed, that makes E(θ) strictly larger. Therefore,
for any local maximum θ, λ has to be the largest eigenvalue of X ⊗ W_{>1}^T W_{>1}.
Let {v_{0m}} be an orthonormal basis of the eigenspace of λ_max(X) and u be the (unique, by the
assumption) maximal unit eigenvector of W_{>1}^T W_{>1}. Then vec(W_1) = Σ_m c_m v_{0m} ⊗ u where
Σ_m c_m² = 1, or vec(W_1) = v_0 ⊗ u where the unit vector v_0 := Σ_m c_m v_{0m}. Plugging vec(W_1) = v_0 ⊗ u
into Eqn. 32, and noticing that v_0 is still a largest eigenvector of X, we have λ = λ_max(X) ‖W_{>1} u‖²₂.
Now we show that λ_max(W_{>1}^T W_{>1}) = ‖W_{>1}‖²₂ = 1. If not, i.e., ‖W_{>1}‖₂ < 1, then first
by Lemma 9 we know that W_{>1} := {W_L, W_{L−1}, ..., W_2} must not be aligned-rank-1. Since
W_{>1}^T W_{>1} is PSD and has a unique maximal eigenvector u, the eigenvalue associated with u must be
strictly positive and thus W_{>1} u ≠ 0.
Then by Lemma 10, W_{>1} is not a local maximum of J(W_{>1}; u) := ‖W_{>1} u‖₂ under the constraints
‖W_l‖_F = 1, which means that there exists W'_{>1} := {W'_L, W'_{L−1}, ..., W'_2} in the local neighborhood
of W_{>1} so that

    • ‖W'_l‖_F = 1 for 2 ≤ l ≤ L. That is, W'_{>1} is a feasible solution of J.
    • J(W'_{>1}) := ‖W'_{>1} u‖₂ > ‖W_{>1} u‖₂ = J(W_{>1}).

Then let θ' := {W'_L, W'_{L−1}, ..., W'_2, W_1}, which is a feasible solution of DeepLin; we have:

    2E(θ') = vec^T(W_1) (X ⊗ W'^T_{>1} W'_{>1}) vec(W_1)        (33)
           = (v_0^T ⊗ u^T)(X ⊗ W'^T_{>1} W'_{>1})(v_0 ⊗ u)        (34)
           = λ_max(X) ‖W'_{>1} u‖²₂        (35)
           > λ_max(X) ‖W_{>1} u‖²₂ = λ = 2E(θ)        (36)

This means that θ is not a local maximum, which is a contradiction. Note that θ' is not necessarily a
critical point (and Eqn. 29 may not hold for θ').
Therefore, λ_max(W_{>1}^T W_{>1}) = ‖W_{>1}‖²₂ = 1 and thus 2E(θ) = λ = λ_max(X).
Since ‖W_{>1}‖₂ = 1, again by Lemma 9, W_{L:2} is aligned-rank-1 and W_{>1} = v_L v_1^T is also a rank-1
matrix. W_{>1}^T W_{>1} = v_1 v_1^T has a unique maximal eigenvector v_1. Therefore vec(W_1) = v_0 ⊗ v_1, or
W_1 = v_1 v_0^T. As a result, θ := {W_{L:2}, W_1} is aligned-rank-1.
Finally, since all local maxima have the same objective value 2E = λ_max(X), they are all global
maxima.

Remarks. Leveraging similar proof techniques, we can also show that with BatchNorm layers, the
local maxima are more constrained. From Lemma 11 we know that if each hidden node is covered
by BatchNorm, then its fan-in weights are conserved. Therefore, without loss of generality, we
could set the per-filter normalization ‖w_lk‖₂ = 1. In this case we have:

Definition 5 (Aligned-uniform solution). A solution θ is called aligned-uniform if it is aligned-rank-
1 and [v_l]_k = ±1/√n_l for 1 ≤ l ≤ L − 1. The two end-point unit vectors (v_0 and v_L) can still be
arbitrary.

Corollary 3. If we additionally use per-filter normalization (i.e., ‖w_lk‖₂ = 1/√n_l), then Thm. 3
holds and v_l is more constrained: [v_l]_k = ±1/√n_l for 1 ≤ l ≤ L − 1.

Proof. Leveraging Lemma 12 in Theorem 3 yields the conclusion.

Remark. We can see that with BatchNorm, the optimization problem is more constrained, and the
set of local maxima has fewer degrees of freedom. This makes the optimization better behaved.

A.3 Section 5
Lemma 3 (Evaluation of ReLU2Layer). If Assumption 1 holds, setting w'_1k = max(w_1k, 0) won't
change the output of ReLU2Layer. Furthermore, if W_1 ≥ 0, then the formula for the linear network
E_α = ½ tr(W_2 W_1 X_α W_1^T W_2^T) still holds for ReLU2Layer.

Proof. For the first part, we just need to prove that if Assumption 1 holds, then a 2-layer ReLU
network with weights w_1k and W_2 has the same activations as another ReLU network with w'_1k =
max(w_1k, 0) ≥ 0 and W'_2 = W_2.
We compare the two activations:

    f_1k  = max( Σ_m w_1km x_km, 0 )        (37)
    f'_1k = max( Σ_m max(w_1km, 0) x_km, 0 ) = Σ_m max(w_1km, 0) x_km        (38)

The last equality is due to the fact that x_k ≥ 0 (by nonnegativeness). Now we consider two cases.
Case 1. If all w_1k ≥ 0, then obviously they are identical.
Case 2. If there exists m so that w_1km < 0. The only situation in which a difference could happen is for
some specific x_k[i] with x_km[i] > 0. By Assumption 1 (one-hotness), for m' ≠ m, x_km'[i] = 0, so
the gate d_k[i] = I(w_1k^T x_k > 0) = 0. On the other hand, w'_1k^T x_k = 0, so d'_k[i] = 0.
Therefore, in all situations, f_1k = f'_1k.
For the second part, since W_1 ≥ 0 and all inputs x ≥ 0 by nonnegativeness, all gates are open and
the energy E_α of ReLU2Layer is the same as that of the linear model.
Theorem 4 (Dynamics of ReLU2Layer). If Assumption 1 holds, then the dynamics of ReLU2Layer
with w1k ≥ 0 is equivalent to linear dynamics with the Sticky Weight rule: any component that
reaches 0 stays 0.

Proof. Let $w_{1k} \ge 0$ be the $k$-th filter under consideration and $w_{1km} \ge 0$ its $m$-th component. Consider a linear network with the same weights ($w'_{1k} = w_{1k}$ and $W'_2 = W_2$), with only the ReLU activation removed.
Now we consider the gradient rule of the ReLU network and of the corresponding linear network with the sticky weight rule (here $g_k[i]$ is the backpropagated gradient sent to node $k$ for sample $i$, and $d_k[i]$ is the binary gating for sample $i$ at node $k$):
$$\dot{w}_{1km} = \sum_i g_k[i]\, d_k[i]\, x_m[i] \qquad (39)$$
$$\dot{w}'_{1km} = \mathbb{I}(w_{1km} > 0) \sum_i g'_k[i]\, x_m[i] \qquad (40)$$

Thanks to Lemma 13, the forward passes of the two networks are identical, and thus $g_k[i] = g'_k[i]$; so we do not need to consider any difference in the backpropagated gradients.

In the following, we will show that each summand of the two equations is identical.
Case 1. xm [i] = 0. In that case, gk [i]xm [i] = gk [i]dk [i]xm [i] = 0 regardless of whether the gate
dk [i] is open or closed.
Case 2. xm [i] > 0. There are two subcases:
Subcase 1: $d_k[i] = 1$. In this case the ReLU gate of the $k$-th filter is open, so $g'_k[i] x_m[i] = g_k[i] x_m[i] = g_k[i] d_k[i] x_m[i]$. By Assumption 1 (one-hotness), $x_{km'}[i] = 0$ for $m' \ne m$; since $d_k[i] = 1$, it must be that $w_{1km} > 0$ and thus $\mathbb{I}(w_{1km} > 0) = 1$. So the two summands are identical.
Subcase 2: $d_k[i] = 0$. Then $w_{1km}$ must be $0$: otherwise, since $x_k \ge 0$ (nonnegativeness), we would have $w_{1k}^\top x_k[i] \ge w_{1km} x_m[i] > 0$ and the gate of the $k$-th filter would be open. Therefore both summands are $0$: the ReLU one because $d_k[i] = 0$, and the linear one because $\mathbb{I}(w_{1km} > 0) = 0$.
Theorem 5. If Assumption 1 holds, then in ReLU2Layer1Hid, $w_1 \to e_m$ for some $m$.

Proof. In ReLU2Layer1Hid, since there is only one node, we have X = Cα [x1 , x1 ] = Cα [x, x].
By Theorem 4, the dynamics of $w_1$ is the linear dynamics plus the sticky weight rule:
$$\dot{w}_1 = \mathrm{diag}(w_1 > 0)\, X w_1 \qquad (41)$$

By Lemma 3, the negative parts of $w_1$ can be removed without changing the result, so we only consider the nonnegative part of $w_1$ and remove the corresponding rows and columns of $X$.
Note that the linear dynamics $\dot{w}_1 = X w_1$ converges to some maximal eigenvector $y$ (or a scaled version of it, depending on whether there is a norm constraint). By Lemma 14, as long as $X$ is not a scalar, $y$ has at least one negative entry. Therefore, by continuity of the trajectory of the linear dynamics from $w_1$ to $y$, the trajectory must cross the boundary of the polytope $w_1 \ge 0$, which requires all entries to be nonnegative.
After that, according to the sticky weight rule, in the ReLU dynamics, the corresponding component
(say w1m ) stays at zero. We can remove the corresponding m-th row and column of X, and the
process repeats until X becomes a scalar. Then w1 converges to that remaining dimension. Since
w1 ≥ 0, it must be the case that w1 → em for some m.
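This sticky-weight dynamics can also be simulated directly. The sketch below is an illustration we wrote (not the paper's released code): it runs a discretized version of Eqn. 41 on a symmetric matrix with strictly negative off-diagonals, the structure guaranteed by Lemma 15, under a norm constraint, and checks that $w_1$ ends up at (approximately) a coordinate direction $e_m$.

```python
import numpy as np

np.random.seed(0)
d = 5
# symmetric matrix with strictly negative off-diagonals, mimicking X = C_alpha[x, x]
A = -(np.random.rand(d, d) + 0.1)
X = (A + A.T) / 2
np.fill_diagonal(X, np.random.rand(d) + 1.0)

w = np.random.rand(d)
w /= np.linalg.norm(w)

lr = 0.01
for _ in range(20000):
    gate = (w > 0).astype(float)      # sticky weight rule: components at 0 get no update
    w = w + lr * gate * (X @ w)       # Euler step of Eqn. 41
    w = np.maximum(w, 0.0)            # a component that crosses 0 is clamped and sticks
    w /= np.linalg.norm(w)            # keep the norm constraint

print(np.round(w, 3))                 # approximately a coordinate vector e_m
```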
Theorem 6 (ReLU2Layer encourages diversity). If Assumption 1 holds, then for any local optimum $(W_2, W_1) \in \Theta$ of ReLU2Layer with $E > 0$, either $W_1 = v e_m^\top$ for some $m$ and $v \ge 0$, or $\mathrm{rank}(W_1) > 1$.

Proof. We just need to prove that if a local optimal solution $(W_2, W_1)$ satisfies $\mathrm{rank}(W_1) = 1$, then $W_1 = v e_m^\top$ for some $m$ and $v \ge 0$.
Since $\mathrm{rank}(W_1) = 1$ and $\|W_1\|_F = 1$, by Lemma 8 there exist unit vectors $u$ and $v$ such that $W_1 = vu^\top$. Since $W_1 \ge 0$, we can pick $u \ge 0$ and $v \ge 0$: otherwise, if $u$ had both positive and negative elements, then picking any nonzero element of $v$, the corresponding row of $W_1$ would also contain both signs, which is a contradiction.
Note that the objective function is
$$2E = \mathrm{tr}(W_2 F_1 W_2^\top) = \mathrm{tr}(W_2 W_1 X_\alpha W_1^\top W_2^\top) = (u^\top X_\alpha u)\,\|W_2 v\|_2^2 > 0 \qquad (42)$$
Therefore, $u^\top X_\alpha u > 0$ and $\|W_2 v\|_2 > 0$. By Lemma 10, if $W_2$ under the constraint $\|W_2\|_F = 1$ is a local optimum, then $W_2$ is a rank-1 matrix with decomposition $W_2 = b v^\top$, where $\|b\|_2 = 1$.
Then we have $2E = u^\top X_\alpha u > 0$ with $u \ge 0$. From the proof of Lemma 14, we know that $X_\alpha$ has a unique minimal all-positive eigenvector $c > 0$.
If there are $\ge 2$ positive elements in $u$, then we can always create a vector $a$ (with mixed signs in its elements) such that (1) $a$ has the same nonzero support as $u$ and (2) $a^\top c = 0$, i.e., $a$ lies in the orthogonal complement of $c$. Since $c$ is the unique minimal eigenvector, moving $u$ along the direction of $a$ strictly improves $E$, which contradicts the fact that $(W_2, W_1)$ is locally optimal.

Therefore, the unit vector $u$ has only one positive entry, i.e., $u = e_m$ for some $m$. Fig. 3 shows one example of learned weights with rank > 1.

Dataset     Methods                   100 epochs     300 epochs     500 epochs
CIFAR-10    L_nce                     86.84 ± 0.26   89.19 ± 0.15   91.07 ± 0.12
            α-CL-direct (Eqn. 43)     87.74 ± 0.28   89.76 ± 0.26   91.06 ± 0.09
            α-CL-direct (Eqn. 44)     87.91 ± 0.12   89.89 ± 0.18   91.06 ± 0.17
CIFAR-100   L_nce                     60.70 ± 0.40   64.22 ± 0.19   66.84 ± 0.16
            α-CL-direct (Eqn. 43)     63.28 ± 0.31   65.71 ± 0.20   66.73 ± 0.13
            α-CL-direct (Eqn. 44)     63.47 ± 0.06   65.86 ± 0.24   66.57 ± 0.21
STL10       L_nce                     82.09 ± 0.31   86.96 ± 0.19   87.31 ± 0.17
            α-CL-direct (Eqn. 43)     83.00 ± 0.28   87.35 ± 0.28   87.63 ± 0.29
            α-CL-direct (Eqn. 44)     83.20 ± 0.17   87.36 ± 0.12   87.71 ± 0.14

Table 3: Top-1 downstream task accuracy with a ResNet50 backbone and batch size 256. Learning rate is 0.001. We also compare unnormalized α-CL-direct (Eqn. 43) versus (normalized) α-CL-direct (Eqn. 44). The normalized version, which is used in the main text of the paper, performs slightly better.

Exponent p                     p = 2          p = 4          p = 6          p = 8          p = 10
Top-1 accuracy (500 epochs)    83.74 ± 0.18   84.06 ± 0.24   84.08 ± 0.42   83.91 ± 0.28   83.56 ± 0.13

Table 4: Ablation study on different exponents p in STL10 for the normalized pairwise importance (Eqn. 44) in α-CL-direct.

B More Experiments
We also provide experiments with a different batch size (i.e., 256) and ablation studies on the exponent $p$ in the direct version of α-CL. We refer to the following as unnormalized α-CL-direct:
$$\alpha_{ij} = \exp(-d_{ij}^p/\tau) \qquad (43)$$
and to the following as (normalized) α-CL-direct (same as Eqn. 12 in the main text):
$$\alpha_{ij} = \frac{\exp(-d_{ij}^p/\tau)}{\sum_{j} \exp(-d_{ij}^p/\tau)} \qquad (44)$$
By default, we set the exponent $p = 4$ and $\tau = 0.5$.
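For concreteness, a minimal NumPy sketch of these two weighting schemes is given below. It is our own illustration (not the released implementation) and assumes a precomputed pairwise-distance matrix `d` between distinct samples; the diagonal is excluded, since $\alpha_{ii}$ is not used.

```python
import numpy as np

def alpha_cl_direct(d, p=4, tau=0.5, normalize=True):
    """Pairwise importance for alpha-CL-direct.

    d: (N, N) matrix of distances d_ij between distinct samples.
    Returns alpha[i, j] = exp(-d_ij^p / tau) (Eqn. 43), optionally
    normalized over j as in Eqn. 44; diagonal entries are set to zero.
    """
    w = np.exp(-(d ** p) / tau)
    np.fill_diagonal(w, 0.0)                       # alpha_ii is not used
    if normalize:
        w = w / w.sum(axis=1, keepdims=True)       # Eqn. 44
    return w                                       # Eqn. 43 if normalize=False

# toy usage: random embeddings and Euclidean pairwise distances
z = np.random.randn(8, 16)
d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
print(alpha_cl_direct(d).sum(axis=1))              # each row sums to 1
```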

C Other Lemmas
Lemma 4 (Gradient formula of the contrastive loss (Eqn. 1); extension of Lemma 2 in (Jing et al., 2022)). Consider the loss function
$$\min_\theta L_{\phi,\psi}(\theta) := \sum_{i=1}^N \phi\Big(\sum_{j \ne i} \psi(d_i^2 - d_{ij}^2)\Big) \qquad (45)$$
Then for any matrix (or vector) variable $A$, we have:
$$\sum_{i=1}^N \left( \frac{\partial L_{\phi,\psi}}{\partial z[i]} A^\top[i] + \frac{\partial L_{\phi,\psi}}{\partial z[i']} A^\top[i'] \right) = -C_\alpha[z, A] \qquad (46)$$
and
$$\sum_{i=1}^N \left( A[i] \Big(\frac{\partial L_{\phi,\psi}}{\partial z[i]}\Big)^{\!\top} + A[i'] \Big(\frac{\partial L_{\phi,\psi}}{\partial z[i']}\Big)^{\!\top} \right) = -C_\alpha[A, z] \qquad (47)$$
where $C_\alpha[\cdot,\cdot]$ is the contrastive covariance defined as (here $\beta_i := \sum_{j \ne i} \alpha_{ij}$):
$$C_\alpha[x, y] := \sum_{i,j=1}^N \alpha_{ij}(x[i] - x[j])(y[i] - y[j])^\top - \sum_{i=1}^N \beta_i (x[i] - x[i'])(y[i] - y[i'])^\top \qquad (48)$$
and $\alpha$ is defined as follows:
$$\alpha_{ij} := \phi'\Big(\sum_{j \ne i} \psi(d_i^2 - d_{ij}^2)\Big)\, \psi'(d_i^2 - d_{ij}^2) \ge 0 \qquad (49)$$
where $\phi', \psi'$ are the derivatives of $\phi, \psi$.

Proof. Taking the derivative of the loss function $L = L_{\phi,\psi}$ w.r.t. $z[i]$ and $z[i']$, we have:
$$\frac{\partial L}{\partial z[i]} = \sum_{j \ne i} \alpha_{ij}(z[j] - z[i']) + \sum_{j \ne i} \alpha_{ji}(z[j] - z[i]) \qquad (50)$$
$$\frac{\partial L}{\partial z[i']} = \sum_{j \ne i} \alpha_{ij}(z[i'] - z[i]) = \beta_i (z[i'] - z[i]) \qquad (51)$$

We just need to check that plugging Eqn. 50–51 into the left-hand side of Eqn. 46 gives $-C_\alpha[z, A]$, i.e., that
$$\sum_i \Big(\sum_{j\ne i}\alpha_{ij}(z[j]-z[i']) + \sum_{j\ne i}\alpha_{ji}(z[j]-z[i])\Big)A^\top[i] + \sum_i \beta_i (z[i']-z[i])A^\top[i'] \qquad (52)$$
equals $-C_\alpha[z,A]$. Comparing with the definition of $C_\alpha[z,A]$ in Eqn. 48, the terms involving $A^\top[i']$ already match; writing $\Sigma_0 := \sum_{i,j}\alpha_{ij}(z[i]-z[j])(A[i]-A[j])^\top$ for the first summation in Eqn. 48, we only need to check whether the following is true:
$$-\Sigma_0 = \sum_i \Big(\sum_{j\ne i}\alpha_{ij}(z[j]-z[i']) + \sum_{j\ne i}\alpha_{ji}(z[j]-z[i])\Big)A^\top[i] + \sum_i \beta_i (z[i']-z[i])A^\top[i] \qquad (53)$$
which means that
$$-\Sigma_0 = \sum_i \Big(\sum_{j\ne i}\alpha_{ij}(z[j]-z[i]) + \sum_{j\ne i}\alpha_{ji}(z[j]-z[i])\Big)A^\top[i] \qquad (54)$$
Since $\alpha_{ii}(z[i]-z[i]) = 0$ for arbitrarily defined $\alpha_{ii}$, the index $j$ can also take the value $i$, which leads to
$$-\Sigma_0 = \sum_{i,j}\alpha_{ij}(z[j]-z[i])A^\top[i] + \sum_{i,j}\alpha_{ji}(z[j]-z[i])A^\top[i] \qquad (55)$$
Swapping indices in the second term, we have:
$$-\Sigma_0 = \sum_{i,j}\alpha_{ij}(z[j]-z[i])A^\top[i] + \sum_{i,j}\alpha_{ij}(z[i]-z[j])A^\top[j] \qquad (56)$$
$$= \sum_{i,j}\alpha_{ij}(z[j]-z[i])A^\top[i] - \sum_{i,j}\alpha_{ij}(z[j]-z[i])A^\top[j] \qquad (57)$$
$$= -\sum_{i,j}\alpha_{ij}(z[j]-z[i])(A^\top[j]-A^\top[i]) \qquad (58)$$
and the conclusion follows.
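The gradient formulas in Eqn. 50–51 can be verified numerically. The sketch below is our own (not part of the paper); it uses $\phi(x) = \log(\epsilon + x)$ and $\psi(x) = e^x$ as one admissible pair, and assumes the squared distances are defined as $d_i^2 := \tfrac{1}{2}\|z[i]-z[i']\|_2^2$ and $d_{ij}^2 := \tfrac{1}{2}\|z[i]-z[j]\|_2^2$, a convention under which Eqn. 51 holds exactly.

```python
import numpy as np

np.random.seed(0)
N, D, eps = 5, 3, 0.1
z  = np.random.randn(N, D) * 0.3          # z[i]
zp = z + 0.05 * np.random.randn(N, D)     # z[i'] (augmented views)

phi, dphi = (lambda x: np.log(eps + x)), (lambda x: 1.0 / (eps + x))
psi, dpsi = np.exp, np.exp
d2 = lambda a, b: 0.5 * np.sum((a - b) ** 2)    # assumed: squared distances carry a 1/2 factor

def loss(z, zp):
    return sum(phi(sum(psi(d2(z[i], zp[i]) - d2(z[i], z[j]))
                       for j in range(N) if j != i)) for i in range(N))

# alpha_ij from Eqn. 49
alpha = np.zeros((N, N))
for i in range(N):
    s = sum(psi(d2(z[i], zp[i]) - d2(z[i], z[j])) for j in range(N) if j != i)
    for j in range(N):
        if j != i:
            alpha[i, j] = dphi(s) * dpsi(d2(z[i], zp[i]) - d2(z[i], z[j]))
beta = alpha.sum(axis=1)

# Eqn. 51: dL/dz[i'] = beta_i * (z[i'] - z[i]); verify by central finite differences
i, h = 2, 1e-6
grad_fd = np.zeros(D)
for c in range(D):
    zp_plus, zp_minus = zp.copy(), zp.copy()
    zp_plus[i, c] += h
    zp_minus[i, c] -= h
    grad_fd[c] = (loss(z, zp_plus) - loss(z, zp_minus)) / (2 * h)

print(np.allclose(grad_fd, beta[i] * (zp[i] - z[i]), atol=1e-5))    # True
```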


Lemma 5. The following minimization problem:
$$\min_{p_j} \sum_j c_j p_j - \tau H(p) \quad \text{s.t.} \quad \sum_j p_j = \frac{1}{\tau} x_0 \phi'(x_0) \qquad (59)$$
where $H(p) := -\sum_j p_j \log p_j$ is the entropy and $x_0 := \sum_j e^{-c_j/\tau}$, has the closed-form solution:
$$p_j = \frac{1}{\tau} \exp(-c_j/\tau)\, \phi'\Big(\sum_j \exp(-c_j/\tau)\Big) \qquad (60)$$

Proof. Define the following Lagrangian:
$$\mathcal{J}(p, \mu) := \sum_j c_j p_j - \tau H(p) + \mu\Big(\sum_j p_j - \frac{1}{\tau} x_0 \phi'(x_0)\Big) \qquad (61)$$
Taking the derivative w.r.t. $p_j$, we have:
$$\frac{\partial \mathcal{J}}{\partial p_j} = c_j + \tau(\log p_j + 1) - \mu = 0 \qquad (62)$$
which gives the solution
$$p_j = \exp\Big(\frac{\mu}{\tau} - 1\Big)\exp\Big(-\frac{c_j}{\tau}\Big) := Z \exp\Big(-\frac{c_j}{\tau}\Big) \qquad (63)$$
where $Z$ can be computed via the constraint:
$$Z = \frac{1}{\tau}\,\frac{x_0 \phi'(x_0)}{\sum_j e^{-c_j/\tau}} = \frac{1}{\tau}\, \phi'(x_0) \qquad (64)$$
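As a quick sanity check (not from the paper's code), the NumPy snippet below verifies that the closed form of Eqn. 60 satisfies both the constraint in Eqn. 59 and the stationarity condition in Eqn. 62, using $\phi(x) = \tau\log(1+x)$ as an arbitrary smooth choice of $\phi$.

```python
import numpy as np

tau = 0.5
c = np.random.randn(6)                      # arbitrary costs c_j
x0 = np.sum(np.exp(-c / tau))

phi_prime = lambda x: tau / (1.0 + x)       # phi(x) = tau * log(1 + x), so phi'(x) = tau / (1 + x)

# closed-form solution of Eqn. 60
p = np.exp(-c / tau) * phi_prime(x0) / tau

# constraint of Eqn. 59: sum_j p_j = x0 * phi'(x0) / tau
print(np.isclose(p.sum(), x0 * phi_prime(x0) / tau))          # True

# stationarity (Eqn. 62): c_j + tau * (log p_j + 1) is the same constant for all j
kkt = c + tau * (np.log(p) + 1.0)
print(np.allclose(kkt, kkt[0]))                               # True
```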

Lemma 6. The normalization function $y = (x - \mathrm{mean}(x))/\|x\|_2$ has the following forward/backward rule:
$$y = J(x)x, \qquad \frac{\partial y}{\partial x} = J^\top(x) \qquad (65)$$
where $J(x) := \frac{1}{\|P_1^\perp x\|_2} P^\perp_{x,1}$ is a symmetric matrix. For $y = x/\|x\|_2$, the relationship still holds with $J(x) = \frac{1}{\|x\|_2} P_x^\perp$.

Proof. See Theorem 5 in (Tian, 2018).
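For the second case of the lemma ($\ell_2$ normalization), the Jacobian can be checked numerically. The sketch below (our own) verifies by finite differences that $\partial y/\partial x = \frac{1}{\|x\|_2}(I - yy^\top) = \frac{1}{\|x\|_2}P_x^\perp$, and that this matrix is symmetric, consistent with the stated backward rule.

```python
import numpy as np

np.random.seed(0)
d = 5
x = np.random.randn(d)

f = lambda v: v / np.linalg.norm(v)           # y = x / ||x||_2

# finite-difference Jacobian: J_num[i, j] = d y_i / d x_j
eps = 1e-6
J_num = np.zeros((d, d))
for j in range(d):
    e = np.zeros(d); e[j] = eps
    J_num[:, j] = (f(x + e) - f(x - e)) / (2 * eps)

# analytic form: (I - y y^T) / ||x||_2, i.e. the projection P_x^perp scaled by 1/||x||_2
y = f(x)
J_ana = (np.eye(d) - np.outer(y, y)) / np.linalg.norm(x)

print(np.allclose(J_num, J_ana, atol=1e-5))   # True
print(np.allclose(J_ana, J_ana.T))            # symmetric, as stated in Lemma 6
```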


Lemma 7. Suppose the output of a linear layer (with weight matrix $W_l$) connects to an $\ell_2$ normalization or LayerNorm through reversible layers; then $\frac{d}{dt}\|W_l\|_F^2 = 0$.

Proof. From Lemma 6, for each sample $i$ we have the gradient before/after the normalization layer (say it is layer $m$):
$$g_m^n[i] = J_m^{n\top}[i]\, g_m[i] \qquad (66)$$
where $g_m^n[i]$ is the gradient after back-propagating through the normalization, and $g_m[i]$ is the gradient sent from the top level.
Here $J_m^n[i] = \frac{1}{\|P_1^\perp f_m[i]\|_2} P^\perp_{f_m[i],1}$ for LayerNorm and $J_m^n[i] = \frac{1}{\|f_m[i]\|_2} P^\perp_{f_m[i]}$ for $\ell_2$ normalization.
For $W_l$, its gradient update rule is:
$$\dot{W}_l = \sum_i \tilde{g}_l[i]\, f_{l-1}^\top[i] \qquad (67)$$

By reversibility, we know that $\tilde g_l[i] = J_{(\tilde l,m]}^\top[i]\, g_m^n[i]$, where $J_{(\tilde l,m]}[i]$ is the Jacobian from the output of the linear layer $\tilde l$ up to layer $m$, right before the normalization layer. Therefore, we have:
$$\mathrm{tr}(W_l^\top \dot W_l) = \sum_i \mathrm{tr}\big(W_l^\top J_{(\tilde l,m]}^\top[i]\, J_m^{n\top}[i]\, g_m[i]\, f_{l-1}^\top[i]\big) \qquad (68)$$
$$= \sum_i \mathrm{tr}\big(f_{l-1}^\top[i]\, W_l^\top J_{(\tilde l,m]}^\top[i]\, J_m^{n\top}[i]\, g_m[i]\big) \qquad (69)$$
$$= \sum_i \mathrm{tr}\big(f_m^\top[i]\, J_m^{n\top}[i]\, g_m[i]\big) \qquad (70)$$
$$= 0 \qquad (71)$$
The last two equalities are due to reversibility, $f_m[i] = J_{(\tilde l,m]}[i] W_l f_{l-1}[i]$, and the property of the normalization layers, $J_m^n[i] f_m[i] = 0$: a vector projected onto its own orthogonal complement is always zero, $P^\perp_{f_m[i]} f_m[i] = 0$.

Then we have
$$\frac{d}{dt}\|W_l\|_F^2 = \frac{d}{dt}\mathrm{tr}(W_l^\top W_l) = \mathrm{tr}(\dot W_l^\top W_l) + \mathrm{tr}(W_l^\top \dot W_l) = 0 \qquad (72)$$

Lemma 8. For every rank-1 matrix $A$ with $\|A\|_F = 1$, there exist unit vectors $\|u\|_2 = \|v\|_2 = 1$ such that $A = uv^\top$.

Proof. Since $A$ is rank-1, it is clear that there exist $u'$ and $v'$ such that $A = u'v'^\top$. Since $\|A\|_F = 1$, we have $\|A\|_F^2 := \mathrm{tr}(AA^\top) = \|u'\|_2^2 \|v'\|_2^2 = 1$. Therefore, taking $u = u'/\|u'\|_2$ and $v = v'/\|v'\|_2$, we have $A = uv^\top$.
Lemma 9. If $\|W_l\|_F = 1$ for $1 \le l \le L$, then $\|W_L W_{L-1} \cdots W_1\|_2 = 1$ if and only if $W_L, W_{L-1}, \ldots, W_1$ are aligned-rank-1 (Def. 4).

Proof. If $W_L, W_{L-1}, \ldots, W_1$ are aligned-rank-1, then by definition there exist unit vectors $\{v_l\}_{l=0}^L$ such that $W_l = v_l v_{l-1}^\top$. Therefore, $\|W_L W_{L-1}\cdots W_1\|_2^2 = \|v_L v_0^\top\|_2^2 = \lambda_{\max}(v_L v_0^\top v_0 v_L^\top) = \lambda_{\max}(v_L v_L^\top) = 1$.
Then we prove the other direction. Note that
$$\|W_L W_{L-1} \cdots W_1\|_2 \le \prod_{l=1}^L \|W_l\|_2 \le \prod_{l=1}^L \|W_l\|_F = 1 \qquad (73)$$
and equality only holds when all $W_l$ are rank-1. By Lemma 8, for any $l$ there exist unit vectors $v_l', v_{l-1}$ such that $W_l = v_l' v_{l-1}^\top$. To show that they must be aligned (i.e., $v_l = \pm v_l'$), we prove by contradiction.
Suppose $\|W_L W_{L-1}\cdots W_1\|_2 = 1$ but for some $l$, $v_l' \ne \pm v_l$ and thus $|v_l^\top v_l'| < 1$. Then $W_{l+1}W_l = (v_l^\top v_l')\, v_{l+1}' v_{l-1}^\top$ and $\|W_{l+1}W_l\|_2 \le \|W_{l+1}W_l\|_F = |v_l^\top v_l'| < 1$. Therefore, $\|W_L W_{L-1}\cdots W_1\|_2 < 1$, which is a contradiction.

Note that for $W_l = \pm v_l v_{l-1}^\top$, we can always move the signs to either $v_0$ or $v_L$ to fit the definition of aligned-rank-1.
Lemma 10. Consider the following optimization problem with a given fixed vector $u \ne 0$:
$$\max_{\mathcal{W}} \mathcal{J}(\mathcal{W}; u) := \|W_L W_{L-1}\cdots W_1 u\|_2 \quad \text{s.t.} \quad \|W_l\|_F = 1, \qquad (74)$$
where $\mathcal{W} = \{W_L, W_{L-1}, \ldots, W_1\}$. If $\mathcal{W}^*$ is a local maximum (i.e., there exists a neighborhood $N(\mathcal{W}^*)$ of $\mathcal{W}^*$ such that $\mathcal{J}(\mathcal{W}) \le \mathcal{J}(\mathcal{W}^*)$ for any $\mathcal{W} \in N(\mathcal{W}^*)$), and $\mathcal{J}(\mathcal{W}^*) > 0$, then $\mathcal{W}^*$ is an aligned-rank-1 solution (Def. 4).
Proof. Let $v'_{L-1} := W^*_{L-1} W^*_{L-2} \cdots W^*_1 u$. Note that $v'_{L-1} \ne 0$ (otherwise $\mathcal{J}(\mathcal{W}^*)$ would be zero).
Consider the following optimization subproblem (here we optimize over $W_L$ and treat $v'_{L-1}$ as a fixed vector):
$$\max_{W_L} \mathcal{J}(W_L; \mathcal{W}^*_{-L}) = \|W_L v'_{L-1}\|_2 \quad \text{s.t.} \quad \|W_L\|_F = 1 \qquad (75)$$
By local optimality of $\mathcal{W}^*$, $W_L^*$ must be a local maximum of Eqn. 75 and thus a critical point, since both the objective and the constraint are differentiable. Note that $\|W_L v'_{L-1}\|_2$ is a vector 2-norm, and all critical points of Eqn. 75 must satisfy
$$W_L v'_{L-1} v'^\top_{L-1} = \lambda W_L \qquad (76)$$
for some constant $\lambda$. Notice that to satisfy this condition, each row of $W_L$ must be an eigenvector of $v'_{L-1} v'^\top_{L-1}$. For a solution to be a local maximum, $\lambda$ is the largest eigenvalue of $v'_{L-1} v'^\top_{L-1}$, and each row of $W_L$ is a corresponding eigenvector. It is clear that the rank-1 matrix $v'_{L-1} v'^\top_{L-1}$ has a unique maximum eigenvalue $\|v'_{L-1}\|_2^2 > 0$ with a corresponding one-dimensional eigenspace spanned by $v_{L-1} := v'_{L-1}/\|v'_{L-1}\|_2$ (while all other eigenvalues are zero). Therefore $W_L^*$, as the local maximum of Eqn. 75, must have:
$$W_L^* = v_L v_{L-1}^\top \qquad (77)$$
for some $\|v_L\|_2 = 1$.
Now let $v'_{L-2} := W^*_{L-2} \cdots W^*_1 u$. Similarly, $v'_{L-2} \ne 0$ (otherwise $\mathcal{J}(\mathcal{W}^*)$ would be zero), and $v'_{L-1} = W^*_{L-1} v'_{L-2}$. Treating $v'_{L-2}$ as a fixed vector and varying $W_{L-1}$ and $W_L$ simultaneously, since $W_{L:1}$ is a local maximal solution, $W_L^*$ must take the form of Eqn. 77 given any $W_{L-1}^*$, which means that the objective function now becomes
$$\mathcal{J}(W_{L-1}; \mathcal{W}_{-(L-1)}) = \|W_L^* v'_{L-1}\|_2 = \|v_L v_{L-1}^\top v'_{L-1}\|_2 = \|v'_{L-1}\|_2 = \|W_{L-1} v'_{L-2}\|_2 \qquad (78)$$
and the subproblem becomes:
$$\max_{W_{L-1}} \|W_{L-1} v'_{L-2}\|_2 \quad \text{s.t.} \quad \|W_{L-1}\|_F = 1 \qquad (79)$$
Repeating this process, we know $W_{L-1}^*$ must satisfy:
$$W_{L-1}^* = v_{L-1} v_{L-2}^\top \qquad (80)$$
for $v_{L-2} := v'_{L-2}/\|v'_{L-2}\|_2$. This procedure can be repeated until $W_1$, and the proof is complete.
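The statement can also be checked empirically. Below is a small, self-contained projected-gradient-ascent sketch (our own illustration, with arbitrary sizes, not the paper's code): it maximizes $\|W_L\cdots W_1 u\|_2$ under $\|W_l\|_F = 1$ and then inspects the singular values of each $W_l$, which should collapse to an (approximately) rank-1, aligned solution as predicted by the lemma.

```python
import numpy as np

np.random.seed(0)
L, n = 3, 6                                   # number of layers and width (hypothetical sizes)
u = np.random.randn(n)
Ws = [np.random.randn(n, n) for _ in range(L)]
Ws = [W / np.linalg.norm(W) for W in Ws]      # enforce ||W_l||_F = 1

def prod(mats):                               # returns the product of mats = [W_1, ..., W_k] as W_k ... W_1
    P = np.eye(n)
    for W in mats:
        P = W @ P
    return P

lr = 0.05
for _ in range(5000):
    for l in range(L):                        # one projected ascent step per layer
        left, right = prod(Ws[l + 1:]), prod(Ws[:l])
        y = left @ Ws[l] @ right @ u
        grad = np.outer(left.T @ y, right @ u) / np.linalg.norm(y)
        W = Ws[l] + lr * grad
        Ws[l] = W / np.linalg.norm(W)         # project back onto ||W_l||_F = 1

for l, W in enumerate(Ws):                    # each W_l should now be (nearly) rank-1
    s = np.linalg.svd(W, compute_uv=False)
    print(f"W_{l + 1}: sigma_1 = {s[0]:.4f}, sigma_2 = {s[1]:.2e}")
print("objective ||W_L...W_1 u||_2 =", np.linalg.norm(prod(Ws) @ u))
```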

Lemma 11. $\frac{d}{dt}\|w_k\|_2^2 = 0$ if node $k$ is under BatchNorm.

Proof. For BN, it is a layer with reversibility on each filter $k$. We use $f_k, g_k \in \mathbb{R}^N$ to represent the activation/gradient at node $k$ in a batch of size $N$. The forward/backward operation of BN can be written as:
$$f_k^n = J_k f_k, \qquad g_k = J_k^\top g_k^n \qquad (81)$$
Here $J_k = J_k^\top = \frac{1}{\|P_1^\perp f_k\|_2} P^\perp_{f_k,1}$ is the Jacobian matrix at each node $k$.

We check how the weight $w_k$ changes under BatchNorm. Here we have $f_k = h(F_{l-1} w_k)$, where $h$ is a reversible activation and $F_{l-1} \in \mathbb{R}^{N \times n_{l-1}}$ contains all outputs from the last layer. Then we have:
$$\dot{w}_k = \sum_i h_i' g_k[i] f_{l-1}[i] = F_{l-1}^\top D_k g_k = F_{l-1}^\top D_k J_k^\top g_k^n \qquad (82)$$
where $D_k := \mathrm{diag}([h_i']_{i=1}^N) \in \mathbb{R}^{N \times N}$. Due to reversibility, we have $f_k = h(F_{l-1} w_k) = D_k F_{l-1} w_k$. Therefore,
$$w_k^\top \dot{w}_k = w_k^\top F_{l-1}^\top D_k J_k^\top g_k^n = f_k^\top J_k^\top g_k^n = 0 \qquad (83)$$
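As an informal numerical illustration (our own, with hypothetical layer sizes), the PyTorch snippet below checks the consequence of the lemma: when a filter's output passes through BatchNorm, the loss is invariant to rescaling that filter's fan-in weights, so the gradient is orthogonal to the weight vector and gradient flow preserves $\|w_k\|_2$. The printed inner products are only approximately zero because BatchNorm uses a small eps in its variance estimate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, d, h = 64, 10, 4                          # batch size, input dim, hidden width (hypothetical)
x = torch.randn(N, d)

W1 = torch.randn(h, d, requires_grad=True)   # fan-in weights w_k are the rows of W1
bn = nn.BatchNorm1d(h)                       # training mode: uses batch statistics
W2 = torch.randn(1, h, requires_grad=True)

loss = (torch.relu(bn(x @ W1.t())) @ W2.t()).pow(2).mean()
loss.backward()

# <w_k, dL/dw_k> per filter: approximately zero, hence d/dt ||w_k||^2 = 2 <w_k, w_k_dot> ~ 0
print((W1 * W1.grad).sum(dim=1))
```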

Lemma 12 (BatchNorm regularization). Consider the following optimization problem with a fixed vector $u \ne 0$:
$$\max_{\mathcal{W}} \mathcal{J}(\mathcal{W}) := \|W_L W_{L-1} \cdots W_1 u\|_2 \quad \text{s.t.} \quad \|W_L\|_F = 1, \; \|w_{lk}\|_2 = 1/\sqrt{n_l} \qquad (84)$$
where $\mathcal{W} := \{W_L, W_{L-1}, \ldots, W_1\}$ and $w_{lk}$ are the rows of $W_l$ (i.e., the weight of the $k$-th filter at layer $l$). Then Lemma 10 still holds with the aligned-rank-1 condition replaced by the aligned-uniform condition (Def. 5).

Proof. The proof is basically the same. The only difference is that the sub-problem (Eqn. 79) becomes:
$$\max_{W_l} \|W_l v'_{l-1}\|_2 \quad \text{s.t.} \quad \|w_{lk}\|_2 = 1/\sqrt{n_l} \qquad (85)$$
for $1 \le l \le L-1$. The critical point condition now becomes (here $\Lambda$ is a diagonal matrix):
$$W_l v'_{l-1} v'^\top_{l-1} = \Lambda W_l \qquad (86)$$
That is, each row of $W_l$ now has a different constant. Since the eigenvalues of $v'_{l-1} v'^\top_{l-1}$ can only be $0$ or $1$, and $0$ does not work (otherwise the corresponding row of $W_l$ would be a zero vector, violating the row-norm constraint), all diagonal elements of $\Lambda$ have to be $1$. Therefore, $W_l = v_l v_{l-1}^\top$. Due to the row normalization, we have $[v_l]_k = \pm 1/\sqrt{n_l}$ for $1 \le l \le L-1$, while $v_L$ and $v_0$ can still be arbitrary unit vectors.

Lemma 13. If Assumption 1 (nonnegativeness) holds, then a 2-layer ReLU network with weights $w_{1k} \ge 0$ and $W_2$ has the same activations (i.e., $f_l = f_l'$) as its linear network counterpart with the same weights $w'_{1k} = w_{1k}$ and $W'_2 = W_2$.

Proof. Since $W'_2 = W_2$, we only need to prove $f_1 = f'_1$. For each filter $k$, its activation is $f_{1k} = \max(\sum_m w_{1km} x_{km}, 0)$ and $f'_{1k} = \sum_m w'_{1km} x_{km} = \sum_m w_{1km} x_{km}$. By Assumption 1 (nonnegativeness), all $x_{km} \ge 0$. Since $w_{1km} \ge 0$, we have $\sum_m w_{1km} x_{km} \ge 0$ and thus $f_{1k} = f'_{1k}$.

Lemma 14. If Assumption 1 holds, M ≥ 2, x1 covers all M modes, and αij > 0, then the maximal
eigenvector of Xα always contains at least one negative entry.

Proof. Let $X_k := C_\alpha[x_k, x_k]$. By Lemma 15, all off-diagonal elements of $X_k$ are negative. Then $X_k$ can be written as $X_k = \beta I - X_k'$ for some $\beta$, where $X_k'$ is a symmetric matrix whose entries are all positive. By the Perron–Frobenius theorem, $X_k'$ has a unique maximal eigenvector $u_k > 0$ (with all positive entries) and an associated positive eigenvalue $\lambda_k > 0$. Therefore, $u_k > 0$ is also the unique minimal eigenvector of $X_k$. Since $M \ge 2$, there exists a maximal eigenspace in which any maximal eigenvector $y_k$ satisfies $y_k^\top u_k = 0$. By Lemma 16, the theorem holds.
Lemma 15. If the receptive field $R_k$ satisfies Assumption 1, and the collection of $N$ vectors $\{x_k[i]\}_{i=1}^N$ contains all $M$ modes, then all off-diagonal elements of $C_\alpha[x_k, x_k]$ are negative.

Proof. We check every entry of $X_k := C_\alpha[x_k, x_k]$. Let $\beta_i := \sum_{j \ne i} \alpha_{ij}$. For an off-diagonal element $[X_k]_{ml}$ with $m \ne l$, we have:
$$[X_k]_{ml} = \sum_{ij} \alpha_{ij}(x_{km}[i] - x_{km}[j])(x_{kl}[i] - x_{kl}[j]) - \sum_i \beta_i (x_{km}[i] - x_{km}[i'])(x_{kl}[i] - x_{kl}[i']) \qquad (87)$$

Let $A_m := \{i : x_{km}[i] > 0\}$ be the set of samples in which the $m$-th component is strictly positive, and $A_m^c := \{1, 2, \ldots, N\}\setminus A_m$ its complement. By Assumption 1 (one-hotness), if $i \in A_m$ then $i \in A_{m'}^c$ for any $m' \ne m$.
Now we consider several cases for sample i and j:
Case 1: $i, j \in A_m$. Then $i, j \in A_l^c$ for $l \ne m$, which means that $x_{kl}[i] - x_{kl}[j] = 0$.
Case 2: $i, j \in A_m^c$. Then $x_{km}[i] - x_{km}[j] = 0$.
Case 3: $i \in A_m$ and $j \in A_m^c$. Since $j \in A_m^c$, we have $x_{km}[i] - x_{km}[j] = x_{km}[i] > 0$. On the other hand, since $i \in A_m$, we have $i \in A_l^c$ and thus $x_{kl}[i] - x_{kl}[j] = -x_{kl}[j] \le 0$. Therefore, $(x_{km}[i] - x_{km}[j])(x_{kl}[i] - x_{kl}[j]) \le 0$.
Case 4: $i \in A_m^c$ and $j \in A_m$. This is similar to Case 3.
Putting them all together, since $\alpha_{ij} > 0$, we know that
$$\sum_{ij} \alpha_{ij}(x_{km}[i] - x_{km}[j])(x_{kl}[i] - x_{kl}[j]) \le 0 \qquad (88)$$

Furthermore, it is strictly negative, since for $i \in A_m$ and $j \in A_l$ we have
$$(x_{km}[i] - x_{km}[j])(x_{kl}[i] - x_{kl}[j]) = -x_{km}[i]\, x_{kl}[j] < 0 \qquad (89)$$
By our assumption that the $N$ vectors $\{x_k[i]\}_{i=1}^N$ contain all $M$ modes, both $A_m$ and $A_l$ are nonempty, so this is achievable.
For the second summation, by Assumption 1 (augmentation), either $i, i' \in A_m$ or $i, i' \in A_m^c$, so it is always zero for $m \ne l$.
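A small NumPy check (our own construction, under a simplified instance of Assumption 1 in which every sample is one-hot and the augmented view equals the sample, so the augmentation trivially preserves the mode) builds the contrastive covariance of Eqn. 48 for one-hot data with random positive $\alpha$ and confirms that all off-diagonal entries are negative.

```python
import numpy as np

np.random.seed(0)
N, M = 12, 4
# one-hot data covering all M modes, with positive magnitudes
modes = np.random.randint(0, M, size=N)
modes[:M] = np.arange(M)                       # make sure every mode appears
x = np.zeros((N, M))
x[np.arange(N), modes] = np.random.rand(N) + 0.1
x_aug = x.copy()                               # augmented view keeps the same mode (here: identity)

alpha = np.random.rand(N, N) + 1e-3            # alpha_ij > 0
np.fill_diagonal(alpha, 0.0)
beta = alpha.sum(axis=1)

# contrastive covariance C_alpha[x, x] of Eqn. 48
C = np.zeros((M, M))
for i in range(N):
    for j in range(N):
        d = x[i] - x[j]
        C += alpha[i, j] * np.outer(d, d)
    d0 = x[i] - x_aug[i]
    C -= beta[i] * np.outer(d0, d0)

off = C[~np.eye(M, dtype=bool)]
print((off < 0).all())                         # True, matching Lemma 15
```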
Lemma 16. If $v > 0$ is an all-positive $d$-dimensional vector and $u^\top v = 0$, then
$$\min_m u_m \le -\frac{\min_m v_m}{(d-k)\,\|v\|_\infty}\,\|u\|_\infty \qquad (90)$$
where $k$ is the number of nonnegative entries in $u$.

Proof. Let $m_0 := \arg\max_m |u_m|$. If $u_{m_0} = -\|u\|_\infty = \min_m u_m$, then the theorem already holds. Otherwise $u_0 := u_{m_0} \ge 0$, and $u_{m_0}$ is the largest entry of $\{u_m\}$.
Since $\min_m u_m < 0$, by the rearrangement inequality we have:
$$0 = u^\top v = \sum_m u_m v_m \ge \Big(\min_m v_m\Big) u_0 + (d-k)\Big(\max_m v_m\Big)\Big(\min_m u_m\Big) \qquad (91)$$
The conclusion follows.

