Lightweight Image Super-Resolution With Information Multi-Distillation Network
Zheng Hui
School of Electronic Engineering, Xidian University
Xi’an, China
[email protected]

Xinbo Gao
School of Electronic Engineering, Xidian University
Xi’an, China
[email protected]
ABSTRACT
In recent years, single image super-resolution (SISR) methods using deep convolutional neural networks (CNNs) have achieved impressive results. Thanks to the powerful representation capabilities of deep networks, numerous previous methods can learn the complex non-linear mapping between low-resolution (LR) image patches and their high-resolution (HR) versions. However, excessive convolutions limit the application of super-resolution technology on devices with low computing power. Besides, super-resolution at an arbitrary scale factor is a critical issue in practical applications that has not been well solved by previous approaches. To address these issues, we propose a lightweight information multi-distillation network (IMDN) built from cascaded information multi-distillation blocks (IMDBs), each of which contains distillation and selective fusion parts. Specifically, the distillation module extracts hierarchical features step by step, and the fusion module aggregates them according to the importance of candidate features, which is evaluated by the proposed contrast-aware channel attention mechanism. To process real images of any size, we develop an adaptive cropping strategy (ACS) to super-resolve block-wise image patches using the same well-trained model. Extensive experiments suggest that the proposed method performs favorably against state-of-the-art SR algorithms in terms of visual quality, memory footprint, and inference time. Code is available at https://github.com/Zheng222/IMDN.

CCS CONCEPTS
• Computing methodologies → Computational photography; Reconstruction; Image processing.

KEYWORDS
image super-resolution; lightweight network; information multi-distillation; contrast-aware channel attention; adaptive cropping strategy

ACM Reference Format:
Zheng Hui, Xinbo Gao, Yunchu Yang, and Xiumei Wang. 2019. Lightweight Image Super-Resolution with Information Multi-distillation Network. In Proceedings of the 27th ACM International Conference on Multimedia (MM’19), October 21–25, 2019, Nice, France. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3343031.3351084

∗ Corresponding author

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM ’19, October 21–25, 2019, Nice, France
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6889-6/19/10. . . $15.00
https://doi.org/10.1145/3343031.3351084

1 INTRODUCTION
Single image super-resolution (SISR) aims at reconstructing a high-resolution (HR) image from its low-resolution (LR) observation, which is inherently ill-posed because many HR images can be downsampled to an identical LR image. To address this problem, numerous image SR methods [11, 12, 25, 27, 36, 38] based on deep neural architectures [7, 9, 23] have been proposed and have shown prominent performance.

Dong et al. [4, 5] first developed a three-layer network (SRCNN) to establish a direct relationship between LR and HR. Then, Wang et al. [31] proposed a neural network based on the conventional sparse coding framework and further designed a progressive upsampling style to produce better SR results at large scale factors (e.g., ×4). Inspired by the VGG model [23] used for ImageNet classification, Kim et al. [12, 13] first pushed the depth of the SR network to 20, and their model outperformed SRCNN by a large margin. This indicates that a deeper model helps to enhance the quality of generated images. To accelerate the training of the deep network, the authors introduced global residual learning with a high initial learning rate. At the same time, they also presented a deeply-recursive convolutional network (DRCN), which applied recursive learning to the SR problem. This approach can significantly reduce the number of model parameters. Similarly, Tai et al. proposed two novel networks: one is a deep recursive residual network (DRRN) [24], and the other is a persistent memory network (MemNet) [25]. The former mainly utilized recursive learning to economize parameters. The latter tackled the long-term dependency problem that existed in previous CNN architectures by using several memory blocks stacked with a densely connected structure [9]. However, these two
algorithms required a long time and huge graphics memory consumption in both the training and testing phases. The primary reason is that the inputs sent to these two models are interpolated versions of LR images and the networks do not adopt any downsampling operations. This scheme brings about a huge computational cost. To increase testing speed and shorten the testing time, Shi et al. [22] first performed most of the mappings in low-dimensional space and designed an efficient sub-pixel convolution to upsample the resolution of feature maps at the end of SR models.

To the same end, Dong et al. proposed fast SRCNN (FSRCNN) [6], which employed a learnable upsampling layer (transposed convolution) to accomplish post-upsampling SR. Afterward, Lai et al. presented the Laplacian pyramid super-resolution network (LapSRN) [14] to progressively reconstruct higher-resolution images. Some other works, such as MS-LapSRN [15] and progressive SR (ProSR) [29], also adopt this progressive upsampling SR framework and achieve relatively high performance. EDSR [18] made a significant breakthrough in terms of SR performance and won the NTIRE 2017 competition [1, 26]. The authors removed some unnecessary modules (e.g., Batch Normalization) of SRResNet [16] to obtain better results. Based on EDSR, Zhang et al. incorporated densely connected blocks [9, 27] into the residual block [7] to construct a residual dense network (RDN). Soon afterward they exploited the residual-in-residual architecture for a very deep model and introduced the channel attention mechanism [8] to form the very deep residual channel attention network (RCAN) [36]. More recently, Zhang et al. also introduced spatial attention (non-local module) into the residual block and constructed the residual non-local attention network (RNAN) [37] for various image restoration tasks.

The major trend of these algorithms is adding more convolution layers to improve performance as measured by PSNR and SSIM [30]. As a result, most of them suffer from large numbers of parameters, huge memory footprints, and slow training and testing speeds. For instance, EDSR [18] has about 43M parameters and 69 layers, and RDN [38], which achieved comparable performance, has about 22M parameters and over 128 layers. Another typical network is RCAN [36], whose depth is up to 400 while its parameters number about 15.59M. These methods are therefore still not suitable for resource-constrained equipment. For mobile devices, the desired practice is to pursue SR performance as high as possible while the available memory and inference time are constrained within a certain range. Many cases, such as video applications, edge devices, and smartphones, require not only performance but also high execution speed. Accordingly, it is important to devise a lightweight but efficient model that meets such demands.

Concerning the reduction of parameters, many approaches adopted a recursive manner or a parameter sharing strategy, such as [13, 24, 25]. Although these methods did reduce the size of the model, they increased the depth or the width of the network to make up for the performance loss caused by the recursive module. This leads to a great deal of computation time when performing SR. To address this issue, a better way is to design lightweight and efficient network structures that avoid the recursive paradigm. Ahn et al. developed CARN-M [2] for mobile scenarios through a cascading network architecture, but at the cost of a substantial reduction in PSNR. Hui et al. [11] proposed an information distillation network (IDN) that explicitly divided the preceding extracted features into two parts: one was retained and the other was further processed. In this way, IDN achieved good performance at a moderate size. But there is still room for improvement in terms of performance.

Another factor that affects the inference speed is the depth of the network. In the testing phase, each layer depends on the previous one. Simply put, the computation of the current layer must wait until the previous calculation is completed, whereas the multiple convolutional operations within each layer can be processed in parallel. Therefore, the depth of the model architecture is an essential factor affecting time performance. This point will be verified in Section 4.

As for solving the SR problem for different scale factors (×2, ×3, ×4) with a single model, previous solutions resized an image to the desired size and then used a fully convolutional network without any downsampling operations. This inevitably leads to a substantial increase in the amount of calculation.

To address the above issues, we propose a lightweight information multi-distillation network (IMDN) for better balancing performance against applicability. Unlike most previous models with few parameters that use a recursive structure, we elaborately design an information multi-distillation block (IMDB) inspired by [11]. The proposed IMDB extracts features at a granular level: at each step (layer) it retains part of the information and further processes the remaining features, as illustrated in Figure 2. To aggregate the features distilled by all steps, we devise a contrast-aware channel attention layer, specifically suited to low-level vision tasks, to enhance the collected refined information. Concretely, we exploit more useful features (edges, corners, textures, etc.) for image restoration. In order to handle SR at an arbitrary scale factor with a single model, we scale the input image to the target size and then employ the proposed adaptive cropping strategy (see Figure 4) to obtain image patches of appropriate size for a lightweight SR model with downsampling layers.

The contributions of this paper can be summarized as follows:

• We propose a lightweight information multi-distillation network (IMDN) for fast and accurate image super-resolution. Thanks to our information multi-distillation block (IMDB) with the contrast-aware channel attention (CCA) layer, we achieve competitive results with a modest number of parameters (refer to Figure 6).
• We propose the adaptive cropping strategy (ACS), which allows a network that includes downsampling operations (e.g., a convolution layer with a stride of 2) to process images of arbitrary size. By adopting this scheme, the computational cost, memory occupation, and inference time can be dramatically reduced when handling SR at an arbitrary magnification.
• We explore factors affecting actual inference time through experiments and find that the depth of the network is related to the execution speed. This can serve as a guideline for lightweight network design. Our model achieves an excellent balance among visual quality, inference speed, and memory occupation.
2 RELATED WORK
2.1 Single image super-resolution
With the rapid development of deep learning, numerous methods based on convolutional neural networks (CNNs) have become the mainstream in SISR. The pioneering work in SR, named SRCNN, was proposed by Dong et al. [4, 5]. SRCNN upscaled the LR image with bicubic interpolation before feeding it into the network, which caused substantial unnecessary computational cost. To address this issue, the authors removed this pre-processing and upscaled the image at the end of the network to reduce the computation in [6]. Lim et al. [18] modified SRResNet [16] to construct a deeper and wider residual network denoted as EDSR. With its smart topology and a significantly large number of learnable parameters, EDSR dramatically advanced SR performance. Zhang et al. [38] introduced channel attention [8] into the residual block to further boost a very deep network (more than 400 layers, without counting the depth of the channel attention modules). Liu [19] explored the effectiveness of the non-local module applied to image restoration. Similarly, Zhang et al. [37] utilized non-local attention to better guide feature extraction in their trunk branch and reach better performance. Very recently, Li et al. [17] exploited a feedback mechanism that enhances low-level representations with high-level ones.

For lightweight networks, Hui et al. [11] developed the information distillation network to better exploit hierarchical features by processing parts of the current feature maps separately. Ahn [2] designed an architecture that implements a cascading mechanism on a residual network to boost performance.

2.2 Attention model
The attention model, which aims at concentrating on the more useful information in features, has been widely used in various computer vision tasks. Hu et al. [8] introduced the squeeze-and-excitation (SE) block, which models channel-wise relationships in a computationally efficient manner and enhances the representational ability of the network, showing its effectiveness on image classification. CBAM [32] modified the SE block to exploit both spatial and channel-wise attention. Wang et al. [28] proposed the non-local module, which generates a wide attention map by calculating the correlation matrix between each pair of spatial points in the feature map; the attention map then guides dense contextual information aggregation.

3 METHOD
3.1 Framework
In this section, we describe our proposed information multi-distillation network (IMDN) in detail. Its graphical depiction is shown in Figure 1(a). The upsampler (see Figure 1(b)) includes one 3 × 3 convolution with $3 \times s^2$ output channels and a sub-pixel convolution. Given an input LR image $I_{LR}$ and its corresponding target HR image $I_{HR}$, the super-resolved image $I_{SR}$ can be generated by

$$I_{SR} = H_{IMDN}\left(I_{LR}\right), \qquad (1)$$

where $H_{IMDN}(\cdot)$ is our IMDN. It is optimized with the mean absolute error (MAE) loss, following most previous works [2, 11, 18, 36, 38]. Given a training set $\{I_i^{LR}, I_i^{HR}\}_{i=1}^{N}$ that has $N$ LR-HR pairs, the loss function of our IMDN can be expressed by

$$L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| H_{IMDN}\left(I_i^{LR}\right) - I_i^{HR} \right\|_1, \qquad (2)$$

where $\Theta$ indicates the updateable parameters of our model and $\|\cdot\|_1$ is the $l_1$ norm. We now give more details about the entire framework.

We first conduct LR feature extraction with one 3 × 3 convolution with 64 output channels. Then, the key component of our network stacks multiple information multi-distillation blocks (IMDBs) and assembles all intermediate features, which are fused by a 1 × 1 convolution layer. This scheme, intermediate information collection (IIC), helps guarantee the integrity of the collected information and can further boost SR performance while adding very few parameters. The final upsampler consists of only one learnable layer and a non-parametric operation (sub-pixel convolution) to save as many parameters as possible.

3.2 Information multi-distillation block
As depicted in Figure 2, our information multi-distillation block (IMDB) is constructed from a progressive refinement module, a contrast-aware channel attention (CCA) layer, and a 1 × 1 convolution that is used to reduce the number of feature channels. The whole block adopts a residual connection. The main idea of this block is to extract useful features little by little, like DenseNet [9]. We now give more details about these modules.

Table 1: PRM architecture. The columns represent layer, kernel size, stride, input channels, and output channels. The symbols C and L denote a convolution layer and Leaky ReLU (α = 0.05), respectively.

Layer   Kernel   Stride   Input_channel   Output_channel
CL      3        1        64              64
CL      3        1        48              64
CL      3        1        48              64
CL      3        1        48              16

3.2.1 Progressive refinement module. As labeled with the gray box in Figure 2, the progressive refinement module (PRM) first adopts a 3 × 3 convolution layer to extract input features for multiple subsequent distillation (refinement) steps. For each step, we employ a channel split operation on the preceding features, which produces two parts. One part is preserved and the other is fed into the next calculation unit. The retained part can be regarded as the refined features. Given the input features $F_{in}^n$, this procedure in the n-th IMDB can be described as

$$\begin{aligned}
F_{refined\_1}^n, F_{coarse\_1}^n &= Split_1^n\left(CL_1^n\left(F_{in}^n\right)\right), \\
F_{refined\_2}^n, F_{coarse\_2}^n &= Split_2^n\left(CL_2^n\left(F_{coarse\_1}^n\right)\right), \\
F_{refined\_3}^n, F_{coarse\_3}^n &= Split_3^n\left(CL_3^n\left(F_{coarse\_2}^n\right)\right), \\
F_{refined\_4}^n &= CL_4^n\left(F_{coarse\_3}^n\right),
\end{aligned} \qquad (3)$$

where $CL_j^n$ denotes the j-th convolution layer (including Leaky ReLU) of the n-th IMDB, $Split_j^n$ indicates the j-th channel split layer
[Figure 1 image: panels (a) IMDN and (b) Upsampler]
Figure 1: The architecture of information multi-distillation network (IMDN). (a) The orange box represents the Leaky ReLU activation function, and the details of the IMDB are shown in Figure 2. (b) s represents the upscale factor.
[Figure 2 image]
Figure 2: The architecture of our proposed information multi-distillation block (IMDB). Here, 64, 48, and 16 all represent the output channels of the convolution layer. “Conv-3” [...] indicates the proposed contrast-aware channel attention (CCA) that is depicted in Figure 3. Each convolution is followed by a Leaky ReLU activation function except for the last 1 × 1 convolution. We omit them for conciseness.

of the n-th IMDB, $F_{refined\_j}^n$ represents the j-th refined features (preserved), and $F_{coarse\_j}^n$ is the j-th coarse features to be further processed. The hyperparameters of the PRM architecture are shown in Table 1. The following stage is concatenating refined features from [...]

$F_{distilled}^n =$ [...]

[Figure 3 image]
Figure 3: Contrast-aware channel attention module.

[...] the global information in these high-level or mid-level vision. Although the average pooling can indeed improve the PSNR value, it lacks the information about structures, textures, and edges that are [...] in Figure 3, the contrast-aware channel attention module is special to low-level vision, e.g., image super-resolution and enhancement. Specifically, we replace global average pooling with the summation of standard deviation and mean (evaluating the contrast degree of a feature map). Let us denote $X = [x_1, \ldots, x_c, \ldots, x_C]$ as the input, which has $C$ feature maps with spatial size of $H \times W$. Therefore, the contrast information value can be calculated by

$$z_c = H_{GC}\left(x_c\right) = \sqrt{\frac{1}{HW} \sum_{(i,j) \in x_c} \left( x_c^{i,j} - \frac{1}{HW} \sum_{(i,j) \in x_c} x_c^{i,j} \right)^2} + \frac{1}{HW} \sum_{(i,j) \in x_c} x_c^{i,j}, \qquad (5)$$

where $z_c$ is the c-th element of the output. $H_{GC}(\cdot)$ indicates the global contrast (GC) information evaluation function. With the assistance of the CCA module, our network can steadily improve the accuracy of SISR.
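The contrast statistic in Equation (5) can be illustrated with a small sketch. The code below is only a minimal, dependency-free illustration under stated assumptions: the function names `global_contrast` and `contrast_descriptor` are hypothetical, feature maps are represented as plain nested lists, and the 1 × 1 convolutions and sigmoid that follow the statistic in Figure 3 are not modeled.

```python
import math


def global_contrast(channel):
    """Standard deviation plus mean of one 2-D feature map, as in Eq. (5).

    `channel` is a list of rows (H x W). The variance uses the 1/(HW)
    normalisation written in the equation (population variance).
    """
    values = [v for row in channel for v in row]
    hw = len(values)
    mean = sum(values) / hw
    variance = sum((v - mean) ** 2 for v in values) / hw
    return math.sqrt(variance) + mean


def contrast_descriptor(feature_maps):
    """Apply the statistic to each of the C channels, giving [z_1, ..., z_C]."""
    return [global_contrast(x_c) for x_c in feature_maps]


# Two toy channels with the same mean (2.0) but different contrast.
flat = [[2.0, 2.0], [2.0, 2.0]]
edgy = [[0.0, 4.0], [0.0, 4.0]]
z = contrast_descriptor([flat, edgy])
print(z)
```

Global average pooling would assign both toy channels the same value, while the contrast statistic separates the flat map from the one containing an edge, which is the motivation the text gives for replacing the pooling operator.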
[Figure 4 image: panels (a) The first image patch and (b) The last image patch]
Figure 4: The diagrammatic sketch of adaptive cropping strategy (ACS). The cropped image patches in the green dotted boxes.

[Figure 5 image]
Figure 5: The network structure of our IMDN_AS. “s2” represents the stride of 2.

[...] details about ACS. This image patch must satisfy

$$\left(\frac{H}{2} + \Delta l_H\right) \% 4 = 0, \qquad \left(\frac{W}{2} + \Delta l_W\right) \% 4 = 0, \qquad (6)$$

where $\Delta l_H$ and $\Delta l_W$ are the extra increments of height and width, respectively. They can be computed by

$$\Delta l_H = padding_H - \left(\frac{H}{2} + padding_H\right) \% 4, \qquad \Delta l_W = padding_W - \left(\frac{W}{2} + padding_W\right) \% 4, \qquad (7)$$

where $padding_H$ and $padding_W$ are preset additional lengths. In general, their values are set by

$$padding_H = padding_W = 4k, \quad k \geq 1. \qquad (8)$$

Here, $k$ is an integer greater than or equal to 1. These four patches can be processed in parallel (they have the same size), after which the outputs are pasted to their original locations and the extra increments ($\Delta l_H$ and $\Delta l_W$) are discarded.

4 EXPERIMENTS
4.1 Datasets and metrics
In our experiments, we use the DIV2K dataset [1], which contains 800 high-quality RGB training images and is widely used in image restoration tasks [18, 36–38]. For evaluation, we use five widely used benchmark datasets: Set5 [3], Set14 [33], BSD100 [20], Urban100 [10], and Manga109 [21]. We evaluate the performance of the super-resolved images using two metrics: peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [30]. As with existing works [2, 11, 12, 18, 24, 36, 38], we calculate the values on the luminance channel (i.e., the Y channel of the YCbCr channels converted from the RGB channels).

Additionally, for any/unknown scale factor experiments, we use the RealSR dataset from the NTIRE2019 Real Super-Resolution Challenge¹. It is a novel dataset of real low and high [...] The training data consists of 60 real low- and high-resolution paired images, and the validation data contains 20 LR-HR pairs. It is noteworthy that the LR and HR images have the same size.

4.2 Implementation details
To obtain LR DIV2K training images, we downscale HR images with the scaling factors (×2, ×3, and ×4) using bicubic interpolation in MATLAB R2017a. The HR image patches with [...] perform random horizontal flip and 90 degree rotation. Our model [...] IMDN and IMDN_AS. We use the PyTorch framework to implement the proposed network on a desktop computer with a 4.2GHz Intel i7-7700K CPU, 64G RAM, and an NVIDIA TITAN Xp GPU (12G memory).

4.3 Model analysis
In this subsection, we investigate model parameters, the effectiveness of IMDB, the intermediate information collection scheme, and the adaptive cropping strategy.

[Figure 6 image: scatter plot of PSNR (dB) against number of parameters (K) for IMDN, CARN, IDN, EDSR-baseline, DRRN, MemNet, LapSRN, VDSR, FSRCNN, and SRCNN]
Figure 6: Trade-off between performance and number of parameters on Set5 ×4 dataset.

4.3.1 Model parameters. To construct a lightweight SR model, the number of network parameters is vital. From Table 5, we can observe that our IMDN, with fewer parameters, achieves comparable or better performance than other state-of-the-art methods, such as EDSR-baseline (CVPRW’17), IDN (CVPR’18), SRMDNF (CVPR’18), and CARN (ECCV’18). We also visualize the trade-off between performance and model size in Figure 6. We can see that our IMDN achieves a better trade-off between performance and model size.
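Returning to the adaptive cropping strategy, the divisibility constraint of Equations (6) to (8) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the use of integer division for H/2 and W/2 on odd image sizes is an assumption, since the text does not say how odd sizes are handled.

```python
def acs_increment(length, k=1):
    """Extra increment for one spatial dimension, following Eqs. (7) and (8).

    `length` is the image height or width; `k` sets the preset padding 4k.
    Integer division for length / 2 is an assumption for odd sizes.
    """
    if k < 1:
        raise ValueError("k must be an integer >= 1")
    padding = 4 * k
    half = length // 2
    return padding - (half + padding) % 4


def acs_patch_size(height, width, k=1):
    """Size of each cropped patch: half the image plus the extra increment."""
    return (height // 2 + acs_increment(height, k),
            width // 2 + acs_increment(width, k))


patch_h, patch_w = acs_patch_size(250, 333)
print(patch_h, patch_w)              # both values are multiples of 4
print(patch_h % 4 == 0, patch_w % 4 == 0)
```

Because the increment always lies between 4k - 3 and 4k, every patch overlaps its neighbours by at least one pixel, and the patch size is divisible by 4, which is what Equation (6) requires of an input to a network with downsampling layers.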
Table 2: Investigations of CCA module and IIC scheme.
[...] 499K 32.11 / 0.8934 28.52 / 0.7797 27.53 / 0.7342 25.90 / 0.7797 30.28 / 0.9054

Table 3: Comparison with original channel attention (CA) and the presented contrast-aware channel attention (CCA).

Module                  Set5      Set14     BSD100    Urban100
IMDN_basic_B4 + CA      32.0821   28.5086   27.5124   25.8829
IMDN_basic_B4 + CCA     32.0964   28.5118   27.5185   25.8916

[Figure 7 image]
Figure 7: The structure of IMDN_basic_B4.

4.3.2 Ablation studies of CCA module and IIC scheme. To quickly [...]ment, named IMDN_B4. When removing the CCA module and IIC scheme, the IMDN_B4 becomes IMDN_basic_B4, as illustrated in Figure 7. From Table 2, we can see that the CCA module leads to a performance improvement (PSNR: +0.09dB, SSIM: +0.0012 for ×4 Manga109) while adding only 2K parameters (an increase of 0.4%). The results compared with the CA module are given in Table 3. To study the efficiency of PRM in IMDB, we replace it with three cascaded 3 × 3 convolution layers (64 channels) and remove the final 1 × 1 convolution (used for fusion). The compared results are given in Table 2. Although this network has more parameters (510K), its performance is much lower than that of our IMDN_basic_B4 (480K), especially on the Urban100 and Manga109 datasets.

Table 4: Quantitative evaluation of VDSR and our IMDN_AS in PSNR, SSIM, LPIPS, running time, and memory occupation.

Method      PSNR    SSIM     LPIPS [35]   Time     Memory
VDSR [12]   28.75   0.8439   0.2417       0.0290   7,855M
IMDN_AS     29.35   0.8595   0.2147       0.0041   3,597M

4.3.3 Investigation of ACS. To verify the efficiency of the proposed adaptive cropping strategy (ACS), we use RealSR training images to train VDSR [12] and our IMDN_AS. The results, evaluated on the RealSR RGB validation dataset, are given in Table 4, and we can easily observe that the presented IMDN_AS achieves better performance in terms of image quality, execution speed, and memory footprint. Accordingly, it also suggests that the proposed ACS is effective for addressing the SR problem at any scale.

4.4 Comparison with state-of-the-arts
We compare our IMDN with 11 state-of-the-art methods: SRCNN [4, 5], FSRCNN [6], VDSR [12], DRCN [13], LapSRN [14], DRRN [24], [...] CARN [2]. Table 5 shows quantitative comparisons for ×2, ×3, and ×4 SR. We can see that our IMDN performs favorably against the other compared approaches on most datasets, especially at the scaling factor of ×2.

Figure 8 shows ×2, ×3 and ×4 visual comparisons on the Set5 and Urban100 datasets. For the “img_67” image from Urban100, we can see that the grid structure is recovered better than by the other methods. It also demon[...]

[...] of convolutions, the total number of parameters can be computed as

$$Params = \sum_{l=1}^{L} \Big( \underbrace{n_{l-1} \cdot n_l \cdot f_l^2}_{conv} + \underbrace{n_l}_{bias} \Big), \qquad (9)$$

where $l$ is the layer index, $L$ denotes the total number of layers, and $f$ represents the spatial size of the filters. The number of convolutional kernels belonging to the l-th layer is $n_l$, and its number of input channels is $n_{l-1}$. Supposing that the spatial size of the output feature maps is $m_l \times m_l$, the time complexity can be roughly calculated by

$$O\left( \sum_{l=1}^{L} n_{l-1} \cdot n_l \cdot f_l^2 \cdot m_l^2 \right). \qquad (10)$$

We assume that the size of the HR image is $m \times m$, and then the computational costs can be calculated by Equation 10 (see Table 7).

4.5.2 Running Time. We use the official codes of the compared methods to test their running time in a feed-forward process. From Table 6, we can see that the actual execution time is related to the depth of the networks. Although EDSR has a large number of parameters (43M), it runs very fast. The only drawback is that it takes up more graphics memory. The main reason should be that the convolution computations within each layer run in parallel. RCAN, in contrast, has only 16M parameters, but its depth is up to 415, which results in very slow inference speed. Compared with CARN [2] and EDSR-baseline [18], [...]

¹ http://www.vision.ee.ethz.ch/ntire19/
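As a rough illustration of Equations (9) and (10), the sketch below applies them to the four PRM convolutions listed in Table 1. It is a worked example under stated assumptions rather than the paper's accounting: the function names are hypothetical, and the 32 × 32 output feature-map size is an arbitrary value chosen only to make Equation (10) concrete.

```python
def conv_params(in_ch, out_ch, kernel):
    """One layer's term in Eq. (9): convolution weights plus biases."""
    return in_ch * out_ch * kernel ** 2 + out_ch


def conv_cost(in_ch, out_ch, kernel, out_size):
    """One layer's term in Eq. (10) for an out_size x out_size output map."""
    return in_ch * out_ch * kernel ** 2 * out_size ** 2


# (kernel, input channels, output channels) for the four CL rows of Table 1.
PRM_LAYERS = [(3, 64, 64), (3, 48, 64), (3, 48, 64), (3, 48, 16)]

prm_params = sum(conv_params(i, o, k) for k, i, o in PRM_LAYERS)

# Assumed feature-map size, for illustration only.
FEATURE_SIZE = 32
prm_cost = sum(conv_cost(i, o, k, FEATURE_SIZE) for k, i, o in PRM_LAYERS)

print(prm_params)  # parameters of one PRM according to Eq. (9)
print(prm_cost)    # multiplications of one PRM at the assumed size, Eq. (10)
```

The parameter count does not depend on the feature-map size, whereas the cost in Equation (10) grows with $m_l^2$, which is why upsampling only at the end of the network keeps the computation low.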
Table 5: Average PSNR/SSIM for scale factor ×2, ×3 and ×4 on datasets Set5, Set14, BSD100, Urban100, and Manga109. Best and
second best results are highlighted and underlined.
[Figure 8 image panels; PSNR/SSIM per method]
Image (dataset, scale)    MemNet [25]     IDN [11]        EDSR-baseline [18]   CARN [2]        IMDN (Ours)
img_67 (Urban100, 2×)     24.98/0.9613    24.68/0.9594    26.01/0.9695         25.96/0.9692    27.75/0.9773
img_76 (Urban100, 3×)     24.97/0.8359    24.95/0.8332    25.85/0.8565         25.92/0.8583    26.19/0.8610
Figure 8: Visual comparisons of IMDN with other SR methods on Set5 and Urban100 datasets.
Table 7: The computational costs. For conciseness, we omit the factor $m^2$. Least and second least computational costs are highlighted and underlined.

Scale   LapSRN [14]   IDN [11]   EDSR-b [18]   CARN [2]   IMDN
×2      112K          175K       341K          157K       173K
×3      76K           75K        172K          90K        78K
×4      76K           51K        122K          76K        45K

For more intuitive comparisons with other approaches, we provide the trade-off between running time and performance on the Set5 dataset for ×4 SR in Figure 9. It shows that our IMDN achieves comparable execution time and the best PSNR value.

[Figure 9 image: scatter plot against execution time (sec) for IMDN, CARN, EDSR-baseline, IDN, DRRN_B1U9, LapSRN, DRCN, and VDSR]

5 CONCLUSION
In this paper, we propose an information multi-distillation network for lightweight and accurate single image super-resolution. We construct a progressive refinement module to extract hierarchical features step by step. In cooperation with the proposed contrast-aware channel attention module, the SR performance is significantly and steadily improved. Additionally, we present the adaptive cropping strategy [...] which is critical for the application of SR algorithms in actual scenes. Numerous experiments have shown that the proposed method achieves a commendable balance between the factors affecting practical use, including visual quality, execution speed, and memory consumption. In the future, this approach will be explored to facilitate other image restoration tasks such as image denoising and enhancement.