Network Binarization via Contrastive Learning
Yuzhang Shang1, Dan Xu2, Ziliang Zong3, Liqiang Nie4, and Yan Yan1⋆
1 Illinois Institute of Technology, USA
2 Hong Kong University of Science and Technology, Hong Kong
3 Texas State University, USA
4 Harbin Institute of Technology, Shenzhen, China
[email protected], [email protected], [email protected], [email protected], [email protected]
1 Introduction
Although deep learning [27] has achieved remarkable success in various computer vision tasks such as image classification [25] and semantic image segmentation [5], its over-parametrization problem makes it computationally expensive
⋆ Corresponding author.
[Figure 1 panels: (a) two transformed views X1 and X2 of a sample are mapped by f(·) to features Z1 and Z2 for comparison; (b) a BNN block (Sign, 1-bit 3x3 Conv, BatchNorm) produces full-precision activations AF and binary activations AB from the same input X for comparison.]
Fig. 1. (a): In contrastive instance learning, the features produced by different transformations of the same sample are contrasted with each other. (b): However, a BNN yields binary activations AB and full-precision activations AF (i.e., two transformations of an image, both from the same BNN) in the same forward pass; thus the BNN itself can act as the two image transformations in the sense of contrastive learning.
2 Related Work
In [22], the researchers introduce the sign function to binarize weights and activations to 1-bit, initiating the study of BNNs. In that work, the straight-through estimator (STE) [2] is utilized to approximate the derivative of the sign function. Following this seminal work, copious studies contribute to improving the performance of BNNs. For example, Rastegari et al. [38] show that the quantization error between the full-precision weights and the corresponding binarized weights is one of the major obstacles degrading the representation capability of BNNs. Reducing the quantization error thus becomes a fundamental research direction for improving the performance of BNNs. XNOR-Net [38] introduces scaling factors computed from the L1 norm of the weights and activations to minimize the quantization error. Inspired by XNOR-Net, XNOR++ [3] further learns both spatial and channel-wise scaling factors to improve performance. Bi-Real [31] proposes double residual connections with full-precision downsampling layers to mitigate the excessive gradient-vanishing issue caused by binarization. ProxyBNN [18] designs a proxy matrix as a basis of the latent parameter space to guide the alignment of the weights with different bit-widths by recovering the smoothness of BNNs. ReActNet [32] implements binarization with MobileNet [21] instead of ResNet and achieves state-of-the-art (SoTA) performance.
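To make this binarization pipeline concrete, the following PyTorch sketch shows a plain sign binarizer with a straight-through estimator (STE); it is a minimal illustration in the spirit of [22,2], not the released code of any cited method, and the gradient-clipping range in the backward pass is an assumed (though common) choice.

```python
import torch


class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Forward: quantize to {-1, +1} (torch.sign maps 0 to 0; a sketch-level detail).
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Backward: pass the gradient through, zeroed outside [-1, 1].
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


# Toy usage: binarize full-precision activations and backpropagate through the STE.
a_f = torch.randn(4, 8, requires_grad=True)
a_b = BinarizeSTE.apply(a_f)
a_b.sum().backward()
print(a_b.shape, a_f.grad.shape)
```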
Nevertheless, we argue that these methods, which focus on narrowing the quantization error and enhancing gradient transmission, have reached a bottleneck (e.g., 1W32A ResNet-18 trained by ProxyBNN achieves 67.7% Top-1 accuracy on ImageNet, while the full-precision version reaches only 68.5%), because they neglect the activations in BNNs, especially the relationship between the binary and latent full-precision activations. We treat these activations as discrete random variables and investigate them under the metric of mutual information. By maximizing this mutual information via contrastive learning, the performance of BNNs is further improved. The experimental results show that CMIM can consistently improve the aforementioned methods when our CMIM module is directly added to them.
3.1 Preliminaries
We consider a $K$-layer multi-layer perceptron (MLP) $f$:
f(\mathbf {W}^1,\cdots ,\mathbf {W}^K;\mathbf {x}) = (\mathbf {W}^{K}\cdot \sigma \cdot \mathbf {W}^{K-1}\cdot \cdots \cdot \sigma \cdot \mathbf {W}^{1})(\mathbf {x}), \label {eq:mlp1} (1)
where $\mathbf{x}$ is the input sample and $\mathbf{W}^k: \mathbb{R}^{d_{k-1}} \mapsto \mathbb{R}^{d_k}$ $(k = 1, \ldots, K)$ stands for the weight matrix connecting the $(k-1)$-th and the $k$-th layer, with $d_{k-1}$ and $d_k$ representing the sizes of the input and output of the $k$-th network layer, respectively. The function $\sigma(\cdot)$ performs an element-wise activation operation on the input feature maps.
Based on these notations, the sectional MLP $f^k(\mathbf{x})$ consisting of the first $k$ layers of $f(\mathbf{x})$ can be represented as:
f^k(\mathbf {W}^1,\cdots ,\mathbf {W}^k;\mathbf {x}) = (\mathbf {W}^{k}\cdot \sigma \cdots \sigma \cdot \mathbf {W}^{1})(\mathbf {x}). \label {eq:mlp2} (2)
The full MLP $f$ can then be seen as a special case of the function sequence $\{f^k\}$ $(k \in \{1, \cdots, K\})$, i.e., $f = f^K$.
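To make the notation concrete, the hypothetical sketch below instantiates the sectional structure of Eq. 2 and records, for every layer $k$, both the full-precision pre-binarization activation $\mathbf{a}_F^k$ and its binarized counterpart $\mathbf{a}_B^k$ in a single forward pass; the class name `BinaryMLP`, the layer sizes, and the placement of the sign function are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn


class BinaryMLP(nn.Module):
    """K-layer MLP exposing (a_F^k, a_B^k) for every layer k in one forward pass."""

    def __init__(self, dims):
        super().__init__()
        # dims = [d_0, d_1, ..., d_K]; W^k maps R^{d_{k-1}} -> R^{d_k}.
        self.layers = nn.ModuleList(
            nn.Linear(dims[k - 1], dims[k]) for k in range(1, len(dims))
        )

    def forward(self, x):
        acts_fp, acts_bin = [], []
        for layer in self.layers:
            x = layer(x)          # full-precision activation a_F^k
            acts_fp.append(x)
            x = torch.sign(x)     # binary activation a_B^k (STE omitted for brevity)
            acts_bin.append(x)
        return acts_fp, acts_bin


model = BinaryMLP([784, 256, 256, 10])
a_F, a_B = model(torch.randn(2, 784))
print(len(a_F), a_F[0].shape, a_B[0].shape)
```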
The mutual information (MI) between two discrete random variables $\mathbf{X}$ and $\mathbf{Y}$ is defined as:
I(\mathbf {X}, \mathbf {Y}) = \sum _{x,y}P_{\mathbf {X}\mathbf {Y}}(x, y)\log \frac {P_{\mathbf {X}\mathbf {Y}}(x, y)}{P_{\mathbf {X}}(x)P_{\mathbf {Y}}(y)}, \label {eq:mi} (3)
where $P_{\mathbf{X}\mathbf{Y}}(x, y)$ is the joint distribution, and $P_{\mathbf{X}}(x) = \sum_{y} P_{\mathbf{X}\mathbf{Y}}(x, y)$ and $P_{\mathbf{Y}}(y) = \sum_{x} P_{\mathbf{X}\mathbf{Y}}(x, y)$ are the marginals of $\mathbf{X}$ and $\mathbf{Y}$, respectively.
Mutual information quantifies the amount of information obtained about one random variable by observing the other. It is typically measured in bits and can be regarded as the reduction in uncertainty about one random variable given knowledge of the other; high mutual information indicates a large reduction in uncertainty and vice versa [26]. In the context of binarization, considering the binary and full-precision activations as random variables, we would like them to share as much information as possible, since the binary activations are derived from their corresponding full-precision activations. Theoretically, the mutual information between these two variables should therefore be maximized.
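As a small numerical illustration of Eq. 3 (a toy check, not part of the method), the snippet below computes the mutual information of a hypothetical 2x2 joint distribution.

```python
import numpy as np

# Toy joint distribution P_XY of two binary variables (rows index x, columns index y).
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
P_x = P_xy.sum(axis=1, keepdims=True)   # marginal P_X(x)
P_y = P_xy.sum(axis=0, keepdims=True)   # marginal P_Y(y)

# Eq. 3: I(X, Y) = sum_{x,y} P_XY(x,y) log( P_XY(x,y) / (P_X(x) P_Y(y)) ), in bits.
mi_bits = np.sum(P_xy * np.log2(P_xy / (P_x * P_y)))
print(f"I(X, Y) = {mi_bits:.3f} bits")   # approximately 0.278 bits for this joint
```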
Our motivation can also be justified from the perspective of RBNN [29]. In RBNN, Lin et al. devise a rotation mechanism leading to around 50% weight flips, which maximizes the information gain $H(\mathbf{a}_B^{k,i})$. Since MI can be written in another form as $I(\mathbf{X}, \mathbf{Y}) = H(\mathbf{X}) - H(\mathbf{X} \mid \mathbf{Y})$, the MI between binary and full-precision activations can be formulated as:
I(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}) = H(\mathbf {a}_B^{k,i}) - H(\mathbf {a}_B^{k,i} \mid \mathbf {a}_F^{k,j}), \label {eq:mi1} (4)
in which maximizing the first term on the right partially maximizes the whole MI. In this work, we aim to maximize the targeted MI as a whole.
Recently, contrastive learning has proven to be an effective approach to MI maximization, and many contrastive-loss-based methods for self-supervised learning have been proposed, such as Deep InfoMax [20], Contrastive Predictive Coding [34], MemoryBank [42], Augmented Multiscale DIM [1], MoCo [16] and SimSiam [7]. These methods are generally rooted in NCE [13] and InfoNCE [20], which can be viewed as optimizing a lower bound on mutual information [36]. Intuitively, the key idea of contrastive learning is to pull representations in positive pairs close and push representations in negative pairs apart in a contrastive space; thus the major obstacle in resorting to the contrastive loss is to define the negative and positive pairs.
Fig. 2. Two images are fed into a BNN, yielding three pairs of binary and full-precision activations. Our goal is to embed the activations into a contrastive space and then learn from the pair correlations with the contrastive learning task in Eq. 13.
Following the NCE formulation [13,41], suppose that for each binary activation we draw one positive full-precision partner and $N-1$ negative ones, so the prior probability of a pair being positive ($D=1$) is $\frac{1}{N}$. The posterior is then:
q(D = 1 \mid \mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}) = \frac {q(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}\mid D=1)\frac {1}{N}}{q(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}\mid D = 1)\frac {1}{N} + q(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}\mid D = 0)\frac {N-1}{N}}. \label {eq:ctl1} (6)
Treating positive pairs as samples from the joint distribution $P(\mathbf{a}_B^{k,i},\mathbf{a}_F^{k,j})$ and negative pairs as samples from the product of marginals, this becomes:
q(D = 1 \mid \mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}) = \frac {P(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j})}{P(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}) + P(\mathbf {a}_B^{k,i})P(\mathbf {a}_F^{k,j})(N-1)}. \label {eq:ctl2} (7)
Taking the logarithm and lower-bounding the denominator by $P(\mathbf{a}_B^{k,i})P(\mathbf{a}_F^{k,j})(N-1)$ gives:
\begin {aligned} \log q(D = 1 \mid \mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}) \leq \log \frac {P(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j})}{P(\mathbf {a}_B^{k,i})P(\mathbf {a}_F^{k,j})} - \log (N-1). \label {eq:ctl3} \end {aligned} (8)
Taking the expectation of both sides of Eq. 8 with respect to $q(\mathbf{a}_B^{k,i},\mathbf{a}_F^{k,j}\mid D=1)$ and rearranging yields:
I(\mathbf {a}_B^{k},\mathbf {a}_F^{k}) \geq \log (N-1) + \mathbb {E}_{q(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}\mid D=1)}\left [\log q(D = 1 \mid \mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j})\right ], \label {eq:ctl4} (9)
where $I(\mathbf{a}_B^{k}, \mathbf{a}_F^{k})$ is the mutual information between the binary and full-precision distributions of our targeted object. Instead of directly maximizing the mutual information, maximizing the lower bound in Eq. 9 is a practical solution.
However, $q(D = 1 \mid \mathbf{a}_B^{k,i},\mathbf{a}_F^{k,j})$ is still hard to estimate. Thus, we introduce a critic function $h$ with parameters $\phi$ (i.e., $h(\mathbf{a}_B^{k,i},\mathbf{a}_F^{k,j};\phi)$), as in previous contrastive learning works [34,20,1,41,6]. Basically, the critic function $h$ needs to map $\mathbf{a}_B^{k}, \mathbf{a}_F^{k}$ to $[0, 1]$ (i.e., discriminate whether a given pair is positive or negative). In practice, we design our critic function for the BNN case based on the critic function in [41]:
h(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}) =\exp (\frac {<\mathbf {a}_B^{k,i}, \mathbf {a}_F^{k,j}>}{\tau })/C, \label {eq:critic} (10)
in which $C = \exp(\frac{\langle \mathbf{a}_B^{k,i}, \mathbf{a}_F^{k,j}\rangle}{\tau}) + N/M$, where $M$ is the number of all possible pairs and $\tau$ is a temperature parameter that controls the concentration level of the distribution [19].
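A minimal sketch of the critic in Eq. 10, assuming the activations are flattened into vectors and that the constant $C$ is formed from the pair's own score plus the ratio $N/M$; the temperature value and argument names below are illustrative assumptions.

```python
import torch


def critic(a_b, a_f, tau=0.1, n_over_m=0.1):
    """Critic h(a_B, a_F) of Eq. 10: maps an activation pair to a score in (0, 1).

    a_b, a_f : flattened binary / full-precision activation vectors.
    tau      : temperature controlling the concentration of the distribution.
    n_over_m : the ratio N / M (negatives over all possible pairs).
    """
    score = torch.exp(torch.dot(a_b, a_f) / tau)
    return score / (score + n_over_m)


# Toy usage: a positive pair (a binary activation with its own full-precision source).
a_f1 = torch.tensor([0.3, -0.4, -0.6])
a_b1 = torch.sign(a_f1)
print(critic(a_b1, a_f1).item())   # close to 1 for a positive pair
```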
The activations of a BNN have particular properties that can be exploited here, i.e.,
\textit {sgn}(\mathbf {a}_F^{k,i}) = \mathbf {a}_B^{k,i}~~~\text {and}~~~<\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,i}> = \Vert \mathbf {a}_F^{k,i} \Vert _1 \label {eq:critic_1} (11)
h(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}) = \exp (\frac {<\textit {sgn}(\mathbf {a}_F^{k,i}), \mathbf {a}_F^{k,j}>}{\tau })= \left \{ \begin {array}{ll} \exp (\frac {\Vert \mathbf {a}_F^{k,i} \Vert _1}{\tau }) & \quad i = j, \\ \exp (\frac {<\textit {sgn}(\mathbf {a}_F^{k,i}), \mathbf {a}_F^{k,j}>}{\tau }) & \quad i \neq j \end {array} \right . \label {eq:critic_new} (12)
Critic in the view of activation flip. Eq. 12 reveals the working mechanism of CMIM from the perspective of activation flips. Specifically, by turning a $+$ activation into $-$, the binary activation in the critic can pull the activations in a positive pair close and push the ones in a negative pair away via the inner product. For example, suppose $\mathbf{a}_F^{k,1} = (0.3, -0.4, -0.6)$ and $\mathbf{a}_F^{k,2} = (0.6, -0.9, 0.7)$, and $\mathbf{a}_B^{k,1} = (+1, -1, -1)$ is the anchor. Then, for the positive pair, $\langle \textit{sgn}(\mathbf{a}_F^{k,1}), \mathbf{a}_F^{k,1}\rangle = 0.3\times(+1) + (-0.4)\times(-1) + (-0.6)\times(-1) = \Vert \mathbf{a}_F^{k,1}\Vert_1$, maximizing their similarity score; and for the negative pair, $\langle \textit{sgn}(\mathbf{a}_F^{k,1}), \mathbf{a}_F^{k,2}\rangle = 0.6\times(+1) + (-0.9)\times(-1) + \overbrace{(0.7)\times(-1)}^{flipped}$, gradually minimizing the score, where the flipped term serves as a penalty for the negative pair. In this way, the binary anchor pulls the positive full-precision activation close and pushes the negative full-precision ones away by flipping numbers in the full-precision activations. Note that this process is iterated during training, so all binary activations can play the role of the anchor, which eventually leads to better representation capacity in the contrastive space.
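The toy numbers above can be verified directly; this short snippet reproduces the two inner products (a sanity check only, not training code).

```python
import torch

a_f1 = torch.tensor([0.3, -0.4, -0.6])   # a_F^{k,1}
a_f2 = torch.tensor([0.6, -0.9, 0.7])    # a_F^{k,2}
anchor = torch.sign(a_f1)                # a_B^{k,1} = (+1, -1, -1)

pos = torch.dot(anchor, a_f1)            # 0.3 + 0.4 + 0.6 = 1.3 = ||a_F^{k,1}||_1
neg = torch.dot(anchor, a_f2)            # 0.6 + 0.9 - 0.7 = 0.8 (last term flipped)
print(pos.item(), a_f1.abs().sum().item(), neg.item())
```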
Loss Function. We define the contrastive loss function $\mathcal{L}^{k}_{NCE}$ between the $k$-th layer's activations $\mathbf{A}_B^k$ and $\mathbf{A}_F^k$ as: $\mathcal{L}^{k}_{NCE} =$
\mathbb {E}_{q(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}\mid D=1)}\left [\log h(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j})\right ] + N\mathbb {E}_{q(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}\mid D=0)}\left [\log (1 - h(\mathbf {a}_B^{k,i},\mathbf {a}_F^{k,j}))\right ]. \label {eq:loss1}
(13)
We comment on the above loss function from the perspective of contrastive learning. The first term, over positive pairs, is optimized to capture more intra-class correlations, while the second term, over negative pairs, serves inter-class decorrelation. Because the pair construction is instance-wise, the number of negative samples can theoretically be as large as the entire training set, e.g., 1.2 million for ImageNet. With these additional hand-crafted contrastive pairs for the proxy optimization problem in Eq. 13, the representation capacity of BNNs can be further improved, as many contrastive learning methods have demonstrated [7,34,20,1].
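As a hedged sketch of how the per-layer objective in Eq. 13 might be turned into a minimization loss over a batch, the function below assumes one positive and N sampled negatives per anchor, with the critic outputs precomputed; the batching, sampling, and clamping details are our assumptions, not the authors' released implementation.

```python
import torch


def nce_loss_layer(h_pos, h_neg, eps=1e-7):
    """Negated Eq. 13 for one layer.

    h_pos : (B,)   critic outputs h(a_B^{k,i}, a_F^{k,i}) for positive pairs.
    h_neg : (B, N) critic outputs for N negative pairs per anchor.
    """
    n_neg = h_neg.shape[1]
    pos_term = torch.log(h_pos.clamp(min=eps)).mean()
    neg_term = n_neg * torch.log((1.0 - h_neg).clamp(min=eps)).mean()
    return -(pos_term + neg_term)   # minimize the negative of the NCE objective


# Toy usage: critic scores lie in (0, 1).
loss_k = nce_loss_layer(torch.rand(8), torch.rand(8, 16))
print(loss_k.item())
```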
Combining the series of NCE losses $\mathcal{L}^{k}_{NCE}$ $(k = 1, \cdots, K)$ from different layers, the overall loss $\mathcal{L}$ can be defined as:
\mathcal {L} = \lambda \sum _{k=1}^{K}\frac {\mathcal {L}^{k}_{NCE}}{\beta ^{K-1-k}} + \mathcal {L}_{cls}, \label {eq:loss2} (14)
[Figure 3 panels: (a) XNOR [38]; (b) IR-Net [37]; (c) RBNN [29]; (d) CMIM (ours); the axes are the two t-SNE dimensions.]
Fig. 3. t-SNE [33] visualization of the activations for 10 random classes in CIFAR-100. Each color represents a different class. The improvement of our method in learning better binary representations is clearly visible.
where $\mathcal{L}_{cls}$ is the classification loss with respect to the ground truth, $\lambda$ is used to control the degree of the NCE loss, and $\beta$ is a coefficient greater than 1. We denote the CMIM loss as $\mathcal{L}_{CMIM} = \sum_{k=1}^{K}\frac{\mathcal{L}^{k}_{NCE}}{\beta^{K-1-k}}$. Hence, $\beta^{K-1-k}$ decreases as $k$ increases, and consequently $\frac{\mathcal{L}^{k}_{NCE}}{\beta^{K-1-k}}$ increases. In this way, the activations of later layers are weighted more heavily and can be substantially retained, which leads to better performance in practice. The complete training process of CMIM is presented in Algorithm 1 in the Supplementary Materials.
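A minimal sketch of the overall objective in Eq. 14, assuming the per-layer NCE losses have already been computed; the values of lambda and beta below are placeholders, not the paper's tuned hyperparameters.

```python
import torch


def cmim_total_loss(nce_losses, cls_loss, lam=0.5, beta=2.0):
    """Eq. 14: L = lambda * sum_k L_NCE^k / beta^(K-1-k) + L_cls.

    nce_losses : list [L_NCE^1, ..., L_NCE^K] of per-layer NCE losses.
    cls_loss   : classification loss w.r.t. the ground truth.
    With beta > 1, later layers (larger k) receive larger effective weight.
    """
    K = len(nce_losses)
    cmim = sum(loss / (beta ** (K - 1 - k)) for k, loss in enumerate(nce_losses, start=1))
    return lam * cmim + cls_loss


# Toy usage with four layers.
total = cmim_total_loss([torch.tensor(1.0)] * 4, torch.tensor(0.7))
print(total.item())
```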
bound on mutual information [36]. In the meantime, Tian et al. [41] and Chen et al. [6] generalize the contrastive idea to the context of knowledge distillation (KD) to pull and push the representations of teacher and student.
Our formulation of CMIM-BNN absorbs the core idea (i.e., constructing appropriate positive and negative pairs for the contrastive loss) of existing contrastive learning methods, especially the contrastive knowledge distillation methods CRD [41] and WCoRD [6]. However, our approach differs from those methods in several aspects. Firstly, our work cannot be treated as a simple application of the teacher-student framework. In KD, the teacher is essentially fixed to offer additional supervision signals and is not optimizable. In our formulation, by contrast, we leverage the exclusive structure of BNNs, where full-precision and binary activations exist in the same forward pass, i.e., only one BNN is involved, without using another network as a teacher. Therefore, the accuracy improvement of the BNN trained by our method comes purely from the activation alignment in a contrastive way, rather than from a more accurate teacher network. Secondly, due to the particular structure of BNNs (Eq. 11), our critic function is largely different from the usual critic in contrastive learning (see Eq. 11 and Eq. 12). Importantly, the critic functions of CRD and WCoRD must utilize a fully-connected layer over the representations to transform them into the same dimension and further normalize them by the L2 norm before the inner product, but ours does not. In the literature of binarization, our designed critic function acts as an activation flip, as discussed below Eq. 12. Thirdly, instead of only using the activations of the final layer, we align the activations layer by layer with a hyperparameter to adjust the weight of each layer as shown in Eq. 14, which is a more suitable design for BNNs. In conclusion, using a contrastive objective as a tool to realize mutual information maximization for network binarization is new.
4 Experiments
decay of 1e-4, and an initial learning rate of 0.1 with a cosine learning rate scheduler (for fair comparison, we also use the Adam optimizer in some ResNet-variant settings).