
Review DL2019

This document provides an overview of recent deep learning approaches for single image super-resolution (SISR). It reviews representative methods and groups them into two categories: exploring efficient neural network architectures for SISR, and developing effective optimization objectives. For each category, benchmark methods are established and limitations discussed. Representative works overcoming these limitations are then presented and compared based on their contributions and experimental results. The review concludes by discussing current challenges and future trends in leveraging deep learning for SISR.

Uploaded by

Thuan Nguyen
Copyright
© All Rights Reserved
Deep Learning for Single Image Super-Resolution: A Brief Review

Wenming Yang, Xuechen Zhang, Yapeng Tian, Wei Wang, Jing-Hao Xue, Qingmin Liao

arXiv:1808.03344v3 [cs.CV] 12 Jul 2019

Abstract—Single image super-resolution (SISR) is a notoriously challenging ill-posed problem that aims to obtain a high-resolution (HR) output from one of its low-resolution (LR) versions. Recently, powerful deep learning algorithms have been applied to SISR and have achieved state-of-the-art performance. In this survey, we review representative deep learning-based SISR methods and group them into two categories according to their contributions to two essential aspects of SISR: the exploration of efficient neural network architectures for SISR and the development of effective optimization objectives for deep SISR learning. For each category, a baseline is first established, and several critical limitations of the baseline are summarized. Then, representative works on overcoming these limitations are presented based on their original content, as well as our critical exposition and analyses, and relevant comparisons are conducted from a variety of perspectives. Finally, we conclude this review with some current challenges and future trends in SISR that leverage deep learning algorithms.

Index Terms—Single image super-resolution, deep learning, neural networks, objective function

This work was partly supported by the National Natural Science Foundation of China (Nos. 61471216 and 61771276), the National Key Research and Development Program of China (No. 2016YFB0101001) and the Special Foundation for the Development of Strategic Emerging Industries of Shenzhen (Nos. JCYJ20170307153940960 and JCYJ20170817161845824). (Corresponding author: Wenming Yang)
W. Yang, X. Zhang, W. Wang and Q. Liao are with the Department of Electronic Engineering, Graduate School at Shenzhen, Tsinghua University, China (E-mail: {yang.wenming@sz, xc-zhang16@mails, wangwei17@mails, liaoqm@}.tsinghua.edu.cn).
Y. Tian is with the University of Rochester, USA (E-mail: [email protected]).
J.-H. Xue is with the Department of Statistical Science, University College London, UK (E-mail: [email protected]).

I. INTRODUCTION

Deep learning (DL) [1] is a branch of machine learning algorithms that aims at learning the hierarchical representations of data. Deep learning has shown prominent superiority over other machine learning algorithms in many artificial intelligence domains, such as computer vision [2], speech recognition [3], and natural language processing [4]. Generally, the strong capacity of DL to address substantial unstructured data is attributable to two main contributors: the development of efficient computing hardware and the advancement of sophisticated algorithms.

Single image super-resolution (SISR) is a notoriously challenging ill-posed problem because a specific low-resolution (LR) input can correspond to many possible high-resolution (HR) images, and the HR space (in most instances, the natural image space) that we intend to map the LR input to is usually intractable [5]. Previous methods for SISR mainly have two drawbacks: one is the unclear definition of the mapping that we aim to develop between the LR space and the HR space, and the other is the inefficiency of establishing a complex high-dimensional mapping given massive raw data. Benefiting from the strong capacity of extracting effective high-level abstractions that bridge the LR and HR spaces, recent DL-based SISR methods have achieved significant improvements, both quantitatively and qualitatively.

In this survey, we attempt to give an overall review of recent DL-based SISR algorithms. We mainly focus on two areas: efficient neural network architectures designed for SISR and effective optimization objectives for DL-based SISR learning. The reason for this taxonomy is that when we apply DL algorithms to tackle a specified task, it is best for us to consider both the universal DL strategies and the specific domain knowledge. From the perspective of DL, although many other techniques such as data preprocessing [6] and model training techniques are also quite important [7], [8], the combination of DL and domain knowledge in SISR is usually the key to success and is often reflected in the innovations of neural network architectures and optimization objectives for SISR. In each of these two focused areas, based on the benchmark, several representative works are discussed mainly from the perspective of their contributions and experimental results as well as our comments and views.

The rest of the paper is arranged as follows. In Section II, we present relevant background concepts of SISR and DL. In Section III, we survey the literature on exploring efficient neural network architectures for various SISR tasks. In Section IV, we survey the studies on proposing effective objective functions for different purposes. In Section V, we summarize some trends and challenges for DL-based SISR. We conclude this survey in Section VI.

II. BACKGROUND

A. Single Image Super-Resolution

Super-resolution (SR) [9] refers to the task of restoring high-resolution images from one or more low-resolution observations of the same scene. According to the number of input LR images, SR can be classified into single image super-resolution (SISR) and multi-image super-resolution (MISR). Compared with MISR, SISR is much more popular because of its high efficiency. Since an HR image with high perceptual quality has more valuable details, it is widely used in many areas, such as medical imaging, satellite imaging and security imaging. In the typical SISR framework, as depicted in Fig. 1, the LR image y is modeled as follows:

y = (x ⊗ k)↓s + n,    (1)
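The degradation model in (1) can be illustrated with a short sketch. The code below is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: it reads x as the HR image, k as the blur kernel, ↓s as keeping every s-th pixel in each direction, and n as additive Gaussian noise. The function names `convolve2d_same` and `degrade`, the zero padding at the image border, and the Gaussian noise model are all choices made for this example.

```python
import random


def convolve2d_same(image, kernel):
    """Convolve a 2-D list `image` with an odd-sized 2-D `kernel`.

    Zero padding is assumed at the border, so the output has the
    same height and width as the input.
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    # Convolution flips the kernel relative to correlation.
                    ii, jj = i + ph - a, j + pw - b
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[a][b] * image[ii][jj]
            out[i][j] = acc
    return out


def degrade(x, k, s, noise_sigma=0.0, seed=0):
    """Sketch of y = (x ⊗ k)↓s + n for a single-channel image."""
    blurred = convolve2d_same(x, k)               # x ⊗ k
    lr = [row[::s] for row in blurred[::s]]       # ↓s: keep every s-th pixel
    if noise_sigma > 0.0:                         # + n (assumed Gaussian)
        rng = random.Random(seed)
        lr = [[v + rng.gauss(0.0, noise_sigma) for v in row] for row in lr]
    return lr


if __name__ == "__main__":
    hr = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    box = [[1.0 / 9.0] * 3 for _ in range(3)]     # hypothetical 3x3 box blur
    print(degrade(hr, box, s=2, noise_sigma=0.01))
```

Because many different HR images x produce the same y after blurring and subsampling, inverting this function is ill-posed, which is the difficulty the rest of the section addresses.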
Figure 1: Sketch of the overall framework of SISR.

Figure 2: Sketch of the SRCNN architecture.

Figure 3: Sketch of the deconvolution layer used in FSRCNN [48], where ⊛ denotes the convolution operator.

Figure 4: Detailed sketch of ESPCN [49]. The top process with the yellow arrow depicts the ESPCN from the view of zero interpolation, while the bottom process with the black arrow is the original ESPCN; ⊛ denotes the convolution operator.

In (1), x ⊗ k is the convolution between the blur kernel k and the unknown HR image x, ↓s is the downsampling operator with scale factor s, and n is the independent noise term. Solving (1) is an extremely ill-posed problem because one LR input may correspond to many possible HR solutions. To date, mainstream SISR algorithms are mainly divided into three categories: interpolation-based methods, reconstruction-based methods and learning-based methods.

Interpolation-based SISR methods, such as bicubic interpolation [10] and Lanczos resampling [11], are fast and straightforward but suffer from accuracy shortcomings. Reconstruction-based SR methods [12], [13], [14], [15] often adopt sophisticated prior knowledge to restrict the possible solution space, with the advantage of generating flexible and sharp details. However, the performance of many reconstruction-based methods degrades rapidly when the scale factor increases, and these methods are usually time-consuming.

Learning-based SISR methods, also known as example-based methods, have come into focus because of their fast computation and outstanding performance. These methods usually utilize machine learning algorithms to analyze statistical relationships between the LR image and its corresponding HR counterpart from substantial training examples. The Markov random field (MRF) [16] approach was first adopted by Freeman et al. to exploit the abundant real-world images to synthesize visually pleasing image textures. Neighbor embedding methods [17] proposed by Chang et al. took advantage of similar local geometry between LR and HR to restore HR image patches. Inspired by the sparse signal recovery theory [18], researchers applied sparse coding methods [19], [20], [21], [22], [23], [24] to SISR problems. Lately, random forest [25] has also been used to improve the reconstruction performance. Meanwhile, many works combined the merits of reconstruction-based methods with the learning-based approaches to further reduce artifacts introduced by external training examples [26], [27], [28], [29]. Very recently, DL-based SISR algorithms have demonstrated great superiority to reconstruction-based and other learning-based methods.

B. Deep Learning

Deep learning is a branch of machine learning algorithms based on directly learning diverse representations of data [30]. In contrast to traditional task-specific learning algorithms that select useful handcrafted features with expert domain knowledge, deep learning algorithms aim to learn informative hierarchical representations automatically and then leverage them to achieve the final purpose, where the whole learning process can be seen as an entirety [31].

Because of the high approximating capacity and hierarchical property of an artificial neural network (ANN), most modern deep learning models are based on ANNs [32]. Early ANNs can be traced back to perceptron algorithms in the 1960s [33]. Then, in the 1980s, the multilayer perceptron could be trained with the backpropagation algorithm [34], and the convolutional neural network (CNN) [35] and recurrent neural network (RNN) [36], two representative derivatives of the traditional ANN, were introduced to the computer vision and speech recognition fields, respectively. Despite the remarkable progress achieved by ANNs during that period, there were still many deficiencies preventing ANNs from developing further [37], [38]. Thereafter, the rebirth of the modern ANN was marked by pretraining the deep neural network (DNN) with the restricted Boltzmann machine (RBM) proposed by Hinton in 2006 [39]. Consequently, benefiting from the boom of computing power and the development of advanced algorithms, models based on the DNN have achieved remarkable performance in various supervised tasks [40], [41], [2]. Meanwhile, DNN-based unsupervised algorithms such as the deep Boltzmann machine (DBM) [42], variational autoencoder (VAE) [43], [44] and generative adversarial nets (GAN) [45] have attracted much attention owing to their potential to address challenging unlabeled data. Readers can refer to [46] for an extensive analysis of DL.

III. DEEP ARCHITECTURES FOR SISR

In this section, we mainly discuss the efficient architectures proposed for SISR in recent years. First, we set the network architecture of super-resolution CNN (SRCNN) [47], [48] as the benchmark. When we discuss each related architecture in detail, we focus on their universal parts that can apply to other tasks and their specific parts that characterize SISR properties. To construct fair and meaningful comparisons among different models, we will illustrate the importance of the training dataset and attempt to compare models with the same training dataset.

A. Benchmark of Deep Architecture for SISR

We select the SRCNN architecture as the benchmark in this section. The overall architecture of SRCNN is shown in Fig. 2. As established in many traditional methods, for simplicity, SRCNN uses only the luminance component for training. SRCNN is a three-layer CNN, where the filter sizes of each layer are 64 × 1 × 9 × 9, 32 × 64 × 5 × 5 and 1 × 32 × 5 × 5. The functions of these three nonlinear transformations are patch extraction, nonlinear mapping and reconstruction. The loss function for optimizing SRCNN is the mean square error (MSE), which will be discussed in the next section.
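The three filter shapes quoted for SRCNN can be checked with a little arithmetic. The sketch below is an illustration only, not the SRCNN code: it assumes each shape is read as (output channels, input channels, kernel height, kernel width), that all convolutions use stride 1, and that bias terms are left out of the count. The names `SRCNN_FILTERS`, `weight_count`, `receptive_field`, `channels_chain` and `mse` are hypothetical helpers introduced for this example.

```python
# Filter shapes as listed in the text, read here as
# (out_channels, in_channels, kernel_h, kernel_w). This reading is an assumption.
SRCNN_FILTERS = [(64, 1, 9, 9), (32, 64, 5, 5), (1, 32, 5, 5)]


def weight_count(filters):
    """Number of convolution weights (bias terms are not counted)."""
    return sum(o * i * kh * kw for o, i, kh, kw in filters)


def channels_chain(filters):
    """Check that each layer's input channels match the previous output."""
    return all(prev[0] == nxt[1] for prev, nxt in zip(filters, filters[1:]))


def receptive_field(filters):
    """Receptive field of stacked stride-1 convolutions (square kernels)."""
    rf = 1
    for _, _, kh, _ in filters:
        rf += kh - 1   # each stride-1 layer widens the field by (kernel - 1)
    return rf


def mse(pred, target):
    """Mean square error between two equally sized 2-D lists."""
    flat_p = [v for row in pred for v in row]
    flat_t = [v for row in target for v in row]
    return sum((p - t) ** 2 for p, t in zip(flat_p, flat_t)) / len(flat_p)


if __name__ == "__main__":
    print(weight_count(SRCNN_FILTERS))     # 5184 + 51200 + 800 = 57184
    print(receptive_field(SRCNN_FILTERS))  # 1 + 8 + 4 + 4 = 17
    print(channels_chain(SRCNN_FILTERS))   # True
```

Under these assumptions the network holds roughly 57 thousand weights and each output pixel depends on a 17 × 17 input window, which gives a sense of how small the benchmark is compared with the deeper models discussed later.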
The formulation of SRCNN is relatively simple and can be envisioned as an ordinary CNN that approximates the complex mapping between the LR and HR spaces in an end-to-end manner. SRCNN reportedly demonstrated vast superiority over concurrent traditional methods, and we argue that its acclaim is owing to the CNN’s strong capability of learning valid representations from big data in an end-to-end manner.

Despite the success of SRCNN, the following problems have inspired more effective architectures:

1) The input of SRCNN is the bicubic LR, an approximation of HR. However, these interpolated inputs have three drawbacks: (a) detail-smoothing effects introduced by these inputs may lead to further wrong estimations of the image structure; (b) employing interpolated versions as input is very time-consuming; and (c) when the downsampling kernel is unknown, one specific interpolated input as a raw estimation is unreasonable. Therefore, the first question emerges: can we design CNN architectures that directly take LR as input to address these problems?¹

2) The SRCNN is just a three-layer architecture. Can more complex CNN architectures (with different depths, widths and topologies) achieve better results? If yes, then how can we design such models of greater complexity?

3) The prior terms in the loss function that reflect properties of HR images are trivial. Can we integrate any property of the SISR process into the design of the CNN frame or other parts in the algorithms for SISR? If yes, then can these deep architectures with SISR properties be more effective in addressing some challenging SISR problems, such as the large scale factor SISR and the unknown downsampling of SISR?

Based on some solutions to these three questions, recent studies on deep architectures for SISR will be discussed in Sections III-B1, III-B2 and III-B3.

¹ Generally, the first problem can be grouped into the third problem below. Because the solutions to this problem form the basis of many other models, it is necessary to introduce this problem separately as the first drawback.

B. State-of-the-Art Deep SISR Networks

1) Learning Effective Upsampling with CNN: One solution to the first question is to design a module in the CNN architecture that adaptively increases the resolution. Convolution with pooling and stride convolution are the common downsampling operators in the basic CNN architecture. Naturally, people can implement the upsampling operation, which is known as deconvolution [50] or transposed convolution [51]. Given the upsampling factor, the deconvolution layer is composed of an arbitrary interpolation operator (usually, we choose the nearest neighbor interpolation for simplicity) and a following convolution operator with a stride of 1, as shown in Fig. 3. Readers should be aware that such deconvolution may not completely recover the information missing from convolution with pooling or stride convolution. Such a deconvolution layer has been successfully adopted in the context of network visualization [52], semantic segmentation [53] and generative modeling [54]. For a more detailed illustration of the deconvolution layer, readers can refer to [55]. To the best of our knowledge, FSRCNN [56] is the first work using this normal deconvolution layer to reconstruct HR images from LR feature maps. As mentioned previously, the usage of the deconvolution layer has two main advantages: one is that a reduction in computation is achieved because we just need to increase resolution at the end of the network; the other is that when the downsampling kernel is unknown, many reports, e.g., [57], have shown that when an inaccurate estimation is input, there are side effects on the final performance.

Although the normal deconvolution layer, which has already been included in popular open source packages such as Caffe [58] and TensorFlow [59], offers a reasonably good solution to the first question, there is still an underlying problem: when we use the nearest neighbor interpolation, the points in the upsampled features are repeated several times in each direction. This configuration of the upsampled pixels is redundant. To circumvent this problem, Shi et al. proposed an efficient subpixel convolution layer in [49], known as ESPCN; the structure of ESPCN is shown in Fig. 4. Rather than increasing resolution by explicitly enlarging feature maps as the deconvolution layer does, ESPCN expands the channels of the output features for storing the extra points to increase resolution and then rearranges these points to obtain the HR output through a specific mapping criterion. As the expansion is carried out in the channel dimension, a smaller kernel size is sufficient. [55] further shows that when the ordinary but
redundant nearest neighbor interpolation is replaced with the interpolation that pads the subpixels with zeroes, the deconvolution layer can be simplified into the subpixel convolution in ESPCN. Compared with the nearest neighbor interpolation, this interpolation is more efficient, which also supports the effectiveness of ESPCN.

2) The Deeper, The Better: In DL research, there is theoretical work [60] showing that the solution space of a DNN can be expanded by increasing its depth or its width. In some situations, to attain more hierarchical representations more effectively, many works mainly focus on improvements acquired by increasing the depth. Recently, various DL-based applications have also demonstrated the great power of very deep neural networks despite many training difficulties. VDSR [61] is the first very deep model used in SISR. As shown in Fig. 5(a), VDSR is a 20-layer VGG-net [62]. The VGG architecture sets all kernel sizes as 3 × 3 (the kernel size is usually odd and takes the increase in the receptive field into account, and 3 × 3 is the smallest kernel size). To train this deep model, the authors used a relatively high initial learning rate to accelerate convergence and used gradient clipping to prevent the gradient explosion problem.

In addition to the innovative architecture, VDSR has made two more contributions. The first one is that a single model is used for multiple scales since the SISR processes with different scale factors have a strong relationship with each other. This fact is the basis of many traditional SISR methods. Similar to SRCNN, VDSR takes the bicubic interpolation of LR as input. During training, VDSR puts the bicubic interpolations of LR at different scale factors together for training. For larger scale factors (×3, ×4), the mapping for a smaller scale factor (×2) may also be informative. The second contribution is the residual learning. Unlike the direct mapping from the bicubic version to HR, VDSR uses a deep CNN to learn the mapping from the bicubic version to the residual between the bicubic version and HR. The authors argued that residual learning could improve performance and accelerate convergence.

The convolution kernels in the nonlinear mapping part of VDSR are very similar, and in order to reduce parameters, Kim et al. further proposed DRCN [63], which utilizes the same convolution kernel in the nonlinear mapping part 16 times, as shown in Fig. 5(b). To overcome the difficulties of training a deep recursive CNN, a multisupervised strategy is applied, and the final result can be regarded as the fusion of 16 intermediate results. The coefficients for fusion are a list of trainable positive scalars that sum to 1. As they showed, DRCN and VDSR have quite similar performance. Here, we believe that it is necessary to emphasize the importance of the multisupervised training in DRCN. This strategy not only creates short paths through which the gradients can flow more smoothly during backpropagation but also guides all the intermediate representations to reconstruct raw HR outputs. Finally, fusing all these raw HR outputs produces a strong result. However, for fusion, this strategy has two flaws: 1) once the weight scalars are determined in the training process, they will not change with different inputs; and 2) using a single scalar to weight HR outputs does not take pixelwise differences into consideration, that is, it would be better to weight different parts differently in an adaptive way.

It is hard to go deeper with a plain architecture such as VGG-net. Various deep models based on skip connections can be extremely deep and have achieved state-of-the-art performance in many tasks. Among them, ResNet [64], [65], proposed by He et al., is the most representative model. Readers can refer to [66], [67] for further discussions on why ResNet works well. In [68], the authors proposed SRResNet, which is composed of 16 residual units (a residual unit consists of two nonlinear convolutions with residual learning). In each unit, batch normalization (BN) [69] is used to stabilize the training process. The overall architecture of SRResNet is shown in Fig. 5(c). Based on the original residual unit in [65], Tai et al. proposed DRRN [70], in which basic residual units are rearranged in a recursive topology to form a recursive block, as shown in Fig. 5(d). Then, to reduce parameters, each block shares the same parameters and is reused recursively, as with the single recursive convolution kernel in DRCN.

EDSR [71] was proposed by Lim et al. and has currently achieved state-of-the-art performance. EDSR has mainly made three improvements on the overall frame: 1) Compared with the residual unit used in previous work, EDSR removes the usage of BN, as shown in Fig. 5(e). The original ResNet with BN was designed for classification, where inner representations are highly abstract, and these representations can be insensitive to the shift introduced by BN. Regarding image-to-image tasks such as SISR, since the input and output are strongly related, if the convergence of the network is not a problem, then such a shift may harm the final performance. 2) In addition to the regular increase in depth, EDSR also increases the number of output features of each layer on a large scale. To relieve the difficulties of training such a wide ResNet, the residual scaling trick proposed in [72] is employed. 3) Additionally, inspired by the fact that the SISR processes with different scale factors have strong relationships with each other, when training the models for ×3 and ×4 scales, the authors of [71] initialized the parameters with the pretrained ×2 network. This pretraining strategy accelerates the training and improves the final performance.

The effectiveness of the pretraining strategy in EDSR implies that models for different scales may share many intermediate representations. To explore this idea further, similar to building a multiscale architecture as VDSR does on the condition of bicubic input, the authors of EDSR proposed MDSR to achieve the multiscale architecture, as shown in Fig. 5(g). In MDSR, the convolution kernels for nonlinear mapping are shared across different scales, where only the front convolution kernels for extracting features and the final subpixel upsampling convolution are different. At each update during the training of MDSR, minibatches for ×2, ×3 and ×4 are randomly chosen, and only the corresponding parts of MDSR are updated.

In addition to ResNet, DenseNet [73] is another effective architecture based on skip connections. In DenseNet, each layer is connected with all the preceding representations, and the bottleneck layers are used in units and blocks to reduce the number of parameters. In [74], the authors pointed
out that ResNet enables feature reuse while DenseNet enables new feature exploration. Based on the basic DenseNet, SRDenseNet [75], as shown in Fig. 5(f), further concatenates all the features from different blocks before the deconvolution layer, which is shown to be effective in improving performance. MemNet [76], proposed by Tai et al., uses the residual unit recursively to replace the normal convolution in the block of the basic DenseNet and adds dense connections among different blocks, as shown in Fig. 5(h). The authors explained that the local connections in the same block resemble the short-term memory and the connections with previous blocks resemble the long-term memory [77]. Recently, RDN [78] was proposed by Zhang et al. and uses a similar structure. In an RDN block, basic convolution units are densely connected similar to DenseNet, and at the end of an RDN block, a bottleneck layer is used, followed by residual learning across the whole block. Before entering the reconstruction part, features from all previous blocks are fused by the dense connection and residual learning.

3) Combining Properties of the SISR Process with the Design of the CNN Frame: In this subsection, we discuss some deep frames whose architectures or procedures are inspired by some representative methods for SISR. Compared with the abovementioned NN-oriented methods, these methods can be better interpreted, and they sometimes are more sophisticated in addressing certain challenging cases for SISR.

Combining sparse coding with deep NN: The sparse prior in natural images and the relationships between the HR and LR spaces rooted in this prior were widely used for their great performance and theoretical support. SCN [79] was proposed by Wang et al. and uses the learned iterative shrinkage and thresholding algorithm (LISTA) [80], which produces an approximate estimation of sparse coding based on NN, to solve the time-consuming inference in traditional sparse coding SISR. They further introduced a cascaded version (CSCN) [81] that employs multiple SCNs. Previous works such as SRCNN tried to explain general CNN architectures with the sparse coding theory, which from today’s view may be somewhat unconvincing. SCN combines these two important concepts innovatively and gains both quantitative and qualitative improvements.

Learning to ensemble by NN: Different models specialize in different image patterns of SISR. From the perspective of ensemble learning, a better result can be acquired by adaptively fusing various models with different purposes at the pixel level. Motivated by this idea, MSCN was proposed by Liu et al. [82] by developing an extra module in the form of a CNN, taking the LR as input and outputting several tensors with the same shape as the HR. These tensors can be viewed as adaptive elementwise weights for each raw HR output. By selecting NNs as the raw SR inference modules, the raw estimating parts and the fusing part can be optimized jointly. However, in MSCN, the summation of coefficients at each pixel is not 1, which may be slightly incongruous.

Deep architectures with progressive methodology: Increasing SISR performance progressively has been extensively studied previously, and many recent DL-based approaches also exploit it from various perspectives. Here, we mainly discuss three novel works within this scope: DEGREE [83], combining the progressive property of ResNet with traditional subband reconstruction; LapSRN [84], generating SR of different scales progressively; and PixelSR [85], leveraging conditional autoregressive models to generate SR pixel-by-pixel.

Compared with other deep architectures, ResNet is intriguing for its progressive properties. Taking SRResNet for example, one can observe that directly sending the representations produced by intermediate residual blocks to the final reconstruction part will also yield a quite good raw HR estimator. The deeper these representations are, the better the results that can be obtained. A similar phenomenon of ResNet applied in recognition is reported in [66]. DEGREE, proposed by Yang et al., combines this progressive property of ResNet with the subband reconstruction of traditional SR methods [86]. The residues learned in each residual block can be used to reconstruct high-frequency details, resembling the signals from a certain high-frequency band. To simulate subband reconstruction, a recursive residual block is used. Compared with the traditional supervised subband recovery methods that need to obtain subband ground truth by diverse filters, this simulation with recursive ResNet avoids explicitly estimating intermediate subband components, benefiting from the end-to-end representation learning.

As mentioned above, models for small scale factors can be used as a raw estimator for large-scale SISR. In the SISR community, SISR under large scale factors (e.g., ×8) has been a challenging problem for a long time. In such situations, plausible priors are imposed to restrict the solution space. A straightforward way to address this is to gradually increase resolution by adding extra supervision on the auxiliary SISR process of the small scale. Based on this heuristic prior, LapSRN, proposed by Lai et al., uses the Laplacian pyramid structure to reconstruct HR outputs. LapSRN has two branches: the feature extraction branch and the image reconstruction branch, as shown in Fig. 6. At each scale, the image reconstruction branch estimates a raw HR output of the present stage, and the feature extraction branch outputs a residue between the raw estimator and the corresponding ground truth as well as extracts useful representations for the next stage.

When faced with large scale factors with a severe loss of necessary details, some researchers suggest that synthesizing reasonable details can achieve better results. In this situation, deep generative models, which will be discussed in the next sections, could be good choices. Compared with the traditional independent point estimation of the lost information, conditional autoregressive generative models using conditional maximum likelihood estimation in directed graphical models gradually generate high-resolution images based on the previously generated pixels. PixelRNN [87] and PixelCNN [88] are recent representative autoregressive generative models. The current pixel in PixelRNN and PixelCNN is explicitly dependent on the left and top pixels that have already been generated. To implement such operations, novel network architectures are elaborated. PixelSR was proposed by Dahl et al. and first applies conditional PixelCNN to SISR. The overall architecture is shown in Fig. 7. The conditioning CNN takes LR
as input, which provides LR-conditional information to the whole model, and the PixelCNN part is the autoregressive inference part. The current pixel is determined by these two parts together using the current softmax probability:

$P(y_i \mid x, y_{<i}) = \mathrm{softmax}(A_i(x) + B_i(y_{<i}))$,    (2)

where $x$ is the LR input, $y_i$ is the current HR pixel to be generated, $y_{<i}$ are the generated pixels, $A_i(\cdot)$ denotes the conditioning network predicting a vector of logit values corresponding to the possible values, and $B_i(\cdot)$ denotes the prior network predicting a vector of logit values of the $i$th output pixel. The value with the highest probability is taken as the final output pixel. Similarly, the whole network is optimized by minimizing cross-entropy loss (maximizing the corresponding log-likelihood) between the model’s prediction and the discrete ground-truth labels.

Figure 5: Sketch of several deep architectures for SISR. (a) VDSR; (b) DRCN; (c) SRResNet; (d) DRRN; (e) EDSR; (f) DenseSR; (g) MDSR; (h) MemNet.

Figure 6: LapSRN architecture. Red arrows indicate the convolutional layer; blue arrows indicate transposed convolutions (upsampling); green arrows denote elementwise addition operators.

Deep architectures with backprojection: Iterative backprojection [89] is an early SR algorithm that iteratively computes the reconstruction error and then feeds it back to tune the HR results. Recently, DBPN [90], proposed by Haris et al., uses deep architectures to simulate iterative backprojection and further improves performance with dense connections [73], which is shown to achieve strong performance in the ×8 scale. As shown in Fig. 8, the dense connection and 1 × 1 convolution for reducing the dimension is first applied across different up-projection (down-projection) units; next, in the tth up-projection unit, the current LR feature input Let−1 is first deconvoluted to obtain a raw HR feature $H_0^t$, and $H_0^t$ is

Figure 7: Sketch of the pixel recursive SR architecture.

Figure 8: Sketch of the DBPN architecture.

backprojected to the LR feature $L_0^t$. The residue between two LR features, $e_l^t = L^{t-1} - L_0^t$, is then deconvoluted and added to $H_0^t$ to obtain a finer HR feature $H^t$. The down-projection unit is defined very similarly in an inverse way.

Usage of additional information from LR: Although modern deep NNs are skillful in extracting various ranges of useful representations in end-to-end manners, in some cases, it is still helpful to select some information to process explicitly.

degradation level are obtained by grid search.

Reconstruction-based frameworks based on priors offered by deep NN: Sophisticated priors are key for efficient reconstruction-based SISR algorithms to address different cases flexibly. Recent works showed that deep NNs could provide well-performing priors mainly from two perspectives: priors that deep NNs learn from data in advance and that are then used in a plug-and-play approach, and direct reconstruction of the output that leverages the intriguing but still unclear priors of deep architectures themselves.

Given the degraded version $y$, the reconstruction-based algorithms aim to obtain the desired result $\hat{x}$ by solving

$$\hat{x} = \arg\min_x \|Hx - y\|_2^2 + R(x), \tag{3}$$

where $H$ is the degradation matrix and $R(x)$ is regularization, also called a prior from the Bayesian view. [94] split (3) into a data part and a prior part with variable splitting techniques and then replaced the prior part with efficient denoising algorithms. Regarding different degradation cases, one only needs to change denoising algorithms for the prior part, behaving in so-called plug-and-play manners. Recent works [95], [96], [97] use deep discriminatively trained NNs under different noise levels as denoisers in various inverse problems, and IRCNN [96] is the first one among them to address SISR. In IRCNN, they first trained a series of CNN-based denoisers with different noise levels, and took backprojection as the reconstruction part. The LR is first processed by several backprojection iterations and then denoised by CNN denoisers with decreasing noise levels along with backprojection. The iteration number is empirically set to 30. In IRCNN, the authors use deep networks to learn a set of image priors and then plug the priors into the reconstruction framework; the experimental results in these cases are better than the contemporary methods that only employ example-based training.
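The plug-and-play idea behind (3) can be illustrated on a one-dimensional toy problem. The following is a minimal sketch under strong assumptions, not the IRCNN implementation: the degradation $H$ is assumed to be block averaging, the trained CNN denoisers are replaced by a hypothetical hand-written smoothing filter (`toy_denoiser`), and all function names are illustrative only.

```python
def downsample(x, s=2):
    """Assumed degradation H: average every s neighbouring samples."""
    return [sum(x[i:i + s]) / s for i in range(0, len(x), s)]


def backproject(x, y, s=2):
    """Data step: add the upsampled LR residual so downsample(result) matches y."""
    residual = [yi - di for yi, di in zip(y, downsample(x, s))]
    return [xi + residual[i // s] for i, xi in enumerate(x)]


def toy_denoiser(x, strength):
    """Hypothetical stand-in for a CNN denoiser: blend x with a 3-tap mean filter."""
    n = len(x)
    blurred = [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0
               for i in range(n)]
    return [(1.0 - strength) * xi + strength * bi for xi, bi in zip(x, blurred)]


def plug_and_play_sr(y, s=2, iters=30):
    """Alternate a backprojection step with a denoising step of decreasing strength."""
    x = [yi for yi in y for _ in range(s)]  # initial HR estimate: sample repetition
    for k in range(iters):
        strength = 0.5 * (1.0 - (k + 1) / iters)  # decreases to exactly 0.0
        x = backproject(x, y, s)       # data part of (3)
        x = toy_denoiser(x, strength)  # prior part of (3)
    return x


y = [0.0, 1.0, 4.0, 9.0, 16.0]  # hypothetical LR signal
x_hat = plug_and_play_sr(y)
```

Because the last denoising strength is zero, the final estimate in this toy setting is consistent with the LR input, that is, `downsample(x_hat)` reproduces `y` up to rounding. With a learned denoiser, only approximate consistency would be expected.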
For example, the DEGREE [83] takes the edge map of LR as another input. Recent studies tend to use more complex information of LR directly, two examples of which are the following: SFT-GAN [91], with extra semantic information of LR for better perceptual quality, and SRMD [92], incorporating degradation into input for multiple degradations.

[93] reported that using a semantic prior helps improve the performance of many SISR algorithms. Leveraging powerful deep architectures recently designed for segmentation, Wang et al. [91] used semantic segmentation maps of interpreted LR as additional input and designed the spatial feature transformation (SFT) layer to handle them. With this extra information from high-level tasks, the proposed work is more skilled in generating textural details.

To take degradations of different LRs into account, SRMD first applied a parametric zero-mean anisotropic Gaussian kernel to stand for the blur kernel and the additive white Gaussian noise with hyperparameter $\rho^2$ to represent noise. Then, a simple regression is used to obtain its covariance matrix. These sufficient statistics are dimensionally stretched to concatenate with LR in the channel dimension, and with such input, a deep model is trained. Notably, when SRMD is tested with real images, the needed parameters on the

Recently, Ulyanov et al. showed in [98] that the structure of deep neural networks could capture a considerable amount of low-level image statistical priors. They reported that when neural networks are used to fit images of different statistical properties, the convergence speed for different kinds of images can also be different. As shown in Fig. 9, natural-looking images, whose different parts are highly relevant, will converge much faster. In contrast, images such as noises and shuffled images, which have little inner relationship, tend to converge more slowly. Many inverse problems such as denoising and super-resolution are modeled as the pixel-wise summation of the original image and the independent additive noises. Based on the observed prior, when used to fit these degraded images, the neural networks tend to fit the natural-looking images first, which can be used to retain the natural-looking parts as well as to filter the noisy ones. To illustrate the effectiveness of the proposed prior for SISR, only given the LR $x_0$, the authors took a fixed random vector $z$ as input to fit the HR $x$ with a randomly initialized DNN $f_\theta$ by optimizing

$$\min_\theta \|d(f_\theta(z)) - x_0\|_2^2, \tag{4}$$

where $d(\cdot)$ is a common differentiable downsampling operator. The optimization is terminated in advance for only filtering

Figure 9: Learning curves for the reconstruction of different kinds of images. We re-implement the experiment in [98] with the image 'butterfly' in Set5.

noisy parts. Although these totally unsupervised methods are outperformed by other supervised learning methods, they perform considerably better than some other naive methods.

Deep architectures with internal examples: Internal-example SISR algorithms are based on the recurrence of small pieces of information across different scales of a single image, which are shown to be better at addressing specific details rarely existing in other external images [99]. ZSSR [100], proposed by Shocher et al., is the first work combining deep architectures with internal-example learning. In ZSSR, other than the image for testing, no extra images are needed, and all the patches for training are taken from different degraded pairs of the test image. As demonstrated in [101], the visual entropy inside a single image is much smaller than that of a large training dataset collected from wide ranges, so unlike external-example SISR algorithms, a very small CNN is sufficient. As we mentioned previously for VDSR, the training data for a small-scale model can also be useful for training large-scale models. Additionally, based on this trick, ZSSR can be more robust by collecting more internal training pairs with small scale factors for training large-scale models. However, this approach will increase runtime immensely. Notably, when combined with the kernel estimation algorithms mentioned in [102], ZSSR performs quite well with the unknown degradation kernels.

Recently, Tirer et al. argued that degradation in LR decreases the performance of internal-example algorithms [103]. Therefore, they proposed to use the reconstruction-based deep framework IDBP [97] to obtain an initial SR result and then conduct internal-example-based network training similar to ZSSR. This method was believed to combine two successful techniques that address the mismatch between training and test, and it has achieved robust performance in these cases.

1) PSNR/SSIM [104] for measuring reconstruction quality: Given two images $I$ and $\hat{I}$, both with $N$ pixels, the MSE and peak signal-to-noise ratio (PSNR) are defined as

$$\mathrm{MSE} = \frac{1}{N}\|I - \hat{I}\|_F^2, \tag{5}$$

$$\mathrm{PSNR} = 10\log_{10}\left(\frac{L^2}{\mathrm{MSE}}\right), \tag{6}$$

where $\|\cdot\|_F$ is the Frobenius norm and $L$ is usually 255. The structural similarity index (SSIM) is defined as

$$\mathrm{SSIM}(I, \hat{I}) = \frac{2\mu_I\mu_{\hat{I}} + k_1}{\mu_I^2 + \mu_{\hat{I}}^2 + k_1} \cdot \frac{\sigma_{I\hat{I}} + k_2}{\sigma_I^2 + \sigma_{\hat{I}}^2 + k_2}, \tag{7}$$

where $\mu_I$ and $\sigma_I^2$ are the mean and variance of $I$, $\sigma_{I\hat{I}}$ is the covariance between $I$ and $\hat{I}$, and $k_1$ and $k_2$ are constant relaxation terms.

2) Number of parameters of NN for measuring storage efficiency (Params).

3) Number of composite multiply-accumulate operations for measuring computational efficiency (Mult&Adds): Since operations in NNs for SISR are mainly multiplications with additions, we use Mult&Adds in CARN [105] to measure computation, assuming that the desired SR is 720p.

Notably, it has been shown in [48] and [49] that the training datasets have a great influence on the final performance, and usually, more abundant training data will lead to better results. Generally, these models are trained via three main datasets: 1) 91 images from [19] and 200 images from [106], called the 291 dataset (some models only use 91 images); 2) images derived from ImageNet [107] randomly; and 3) the newly published DIV2K dataset [108]. In addition to the different number of images each dataset contains, the quality of images in each dataset is also different. Images in the 291 dataset are usually small (on average, 150×150), images in ImageNet are much larger, while images in DIV2K are of very high quality. Because of the restricted resolution of the images in the 291 dataset, models on this set have difficulties in obtaining large patches with large receptive fields. Therefore, models based on the 291 dataset usually take the bicubic of LR as input, which is quite time-consuming. Table I compares different models on the mentioned criteria.

From Table I, we can see that generally as the depth and the number of parameters grow, the performance improves. However, the growth rate of performance levels off. Recently, some works on designing light models [109], [105], [110] and learning sparse structural NN [111] were proposed to achieve relatively good performance with less storage and computation, which are very meaningful in practice.
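The definitions in (5) and (6) translate directly into code. The sketch below is only an illustration: it treats images as flat lists of pixel values, the pixel values are hypothetical, and the function names `mse` and `psnr` are chosen here rather than taken from any of the cited works.

```python
import math


def mse(img, ref):
    """Mean squared error over N pixels, following (5); images are flat lists."""
    if len(img) != len(ref) or not img:
        raise ValueError("images must be non-empty and of equal size")
    return sum((a - b) ** 2 for a, b in zip(img, ref)) / len(img)


def psnr(img, ref, peak=255.0):
    """PSNR in dB, following (6); returns inf when the images are identical."""
    err = mse(img, ref)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


hr = [52, 55, 61, 66]  # hypothetical ground-truth pixels
sr = [54, 55, 60, 70]  # hypothetical super-resolved pixels
print(f"MSE = {mse(sr, hr):.2f}, PSNR = {psnr(sr, hr):.2f} dB")
```

Since PSNR depends on the images only through the MSE, a model trained to minimize MSE is also being trained to maximize PSNR, which is why the two are usually reported together.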
C. Comparisons among Different Models and Discussion

In this section, we will summarize recent progress in deep architectures for SISR from two perspectives: quantitative comparisons for those trained by specific blurring, and comparisons on those models for handling nonspecific blurring. For the first part, quantitative criteria mainly include the following:

For the second part, we mainly show that the performance of the models for some specific degradation dropped drastically when the true degradation mismatches the one assumed for training. For example, we use four models, including EDSR trained with bicubic degradation [71], IRCNN [96], SRMD [92] and ZSSR [100], to address LRs generated by Gaussian kernel degradation (kernel size of 7 × 7 with bandwidth 1.6), as shown in Fig. 10, and the performance of EDSR dropped drastically with obvious blur, while other models for nonspecific degradation perform quite well. Therefore, to

Table I: Comparisons among some representative deep models.

| Models | PSNR/SSIM (×4) | Train data | Parameters | Mult&Adds |
|---|---|---|---|---|
| SRCNN EX [48] | 30.49/0.8628 | ImageNet subset | 57K | 52.5G |
| ESPCN [49] | 30.90/- | ImageNet subset | 20K | 1.43G |
| VDSR [61] | 31.35/0.8838 | G200+Yang91 | 665K | 612.6G |
| DRCN [63] | 31.53/0.8838 | Yang91 | 1.77M (recursive) | 17974.3G |
| DRRN [70] | 31.68/0.8888 | G200+Yang91 | 297K (recursive) | 6796.9G |
| LapSRN [84] | 31.54/0.8855 | G200+Yang91 | 812K | 29.9G |
| SRResNet [68] | 32.05/0.9019 | ImageNet subset | 1.5M | 127.8G |
| MemNet [76] | 31.74/0.8893 | G200+Yang91 | 677K (recursive) | 2265.0G |
| RDN [78] | 32.61/0.9003 | DIV2K | 22.6M | 1300.7G |
| EDSR [71] | 32.62/0.8984 | DIV2K | 43M | 2890.0G |
| MDSR [71] | 32.60/0.8982 | DIV2K | 8M | 407.5G |
| DBPN [90] | 32.47/0.898 | DIV2K+Flickr+ImageNet subset | 10M | 5715.4G |

address some longstanding problems in SISR, such as unknown degradation, the direct usage of general deep learning techniques may not be sufficient. More effective solutions can be achieved by combining the power of DL and the specific properties of the SISR scene.

IV. OPTIMIZATION OBJECTIVES FOR DL-BASED SISR

A. Benchmark of Optimization Objectives for DL-based SISR

We select the MSE loss used in SRCNN as the benchmark. It is known that using MSE favors a high PSNR, and PSNR is a widely used metric for quantitatively evaluating image restoration quality. Optimizing MSE can be viewed as a regression problem, leading to a point estimation of $\theta$ as

$$\min_\theta \sum_i \|F(x_i; \theta) - y_i\|^2, \tag{8}$$

where $(x_i, y_i)$ are the $i$th training examples and $F(x; \theta)$ is a CNN parameterized by $\theta$. Here, (8) can be interpreted in a probabilistic way by assuming Gaussian white noise ($\mathcal{N}(\epsilon; 0, \sigma^2 I)$) independent of the image in the regression model, and then, the conditional probability of $y$ given $x$ becomes a Gaussian distribution with mean $F(x; \theta)$ and the diagonal covariance matrix $\sigma^2 I$, where $I$ is the identity matrix:

$$p(y|x) = \mathcal{N}(y; F(x; \theta), \sigma^2 I). \tag{9}$$

determined by the training data regardless of the parameter $\theta$ of the model (or the model distribution $P_{\mathrm{model}}(x; \theta)$). Hence, when we use the training samples to estimate parameter $\theta$, minimizing this KLD is equivalent to MLE.

Here, we have demonstrated that MSE is a special case of MLE, and MLE is a special case of KLD. However, we may ask whether the assumptions underlying these specializations are violated. This consideration has led to some emerging objective functions from four perspectives:

1) Translating MLE into MSE can be achieved by assuming Gaussian white noise. Although the Gaussian model is the most widely used model for its simplicity and technical support, what if this independent Gaussian noise assumption is violated in a complicated scene such as SISR?

2) To use MLE, we need to assume the parametric form of the data distribution. What if the parametric form is misspecified?

3) Apart from KLD in (10), are there any other distances between probability measures that we can use as the optimization objectives for SISR?

4) Under specific circumstances, how can we choose the suitable objective functions according to their properties?

Based on some solutions to these four questions, recent work on optimization objectives for DL-based SISR will be discussed in Sections IV-B, IV-C, IV-D and IV-E, respectively.
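The link between the MSE objective (8) and MLE under the Gaussian model (9) can be written out in one step. As a worked illustration (the symbols $n$ for the number of training pairs and $d$ for the dimension of each $y_i$ are introduced here only for this derivation), the negative log-likelihood of the training set is

```latex
-\sum_{i=1}^{n} \log p(y_i \mid x_i)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \| F(x_i;\theta) - y_i \|^2
  + \frac{n d}{2} \log\left( 2\pi\sigma^2 \right).
```

The second term does not depend on $\theta$, so for any fixed $\sigma$ maximizing the likelihood over $\theta$ selects the same parameters as minimizing (8).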
Then, using maximum likelihood estimation (MLE) on the training examples with (9) will lead to (8).

The Kullback-Leibler divergence (KLD) between the conditional empirical distribution $P_{\mathrm{data}}$ and the conditional model distribution $P_{\mathrm{model}}$ is defined as

$$D_{KL}(P_{\mathrm{data}} \| P_{\mathrm{model}}) = E_{z \sim P_{\mathrm{data}}}\left[\log \frac{P_{\mathrm{data}}(z)}{P_{\mathrm{model}}(z)}\right]. \tag{10}$$

We call (10) the forward KLD, where $z = y|x$ denotes the HR (SR) conditioned on its LR counterpart, $P_{\mathrm{data}}$ and $P_{\mathrm{model}}$ are the conditional distributions of HR|LR and SR|LR, respectively, where $E_{z \sim P_{\mathrm{data}}}[\log P_{\mathrm{data}}(z)]$ is an intrinsic term

B. Objective Functions Based on non-Gaussian Additive Noises

The poor perceptual quality of the SISR images obtained by optimizing MSE directly demonstrates a fact: using Gaussian additive noise in the HR space is not good enough. To address this problem, solutions are proposed from two aspects: use other distributions for this additive noise, or transfer the HR space to some space where the Gaussian noise is reasonable.

1) Denote Additive Noise with Other Probability Distributions: In [112], Zhao et al. investigated the difference between

Figure 10: Comparisons of 'monarch' in Set14 for scale 2 with Gaussian kernel degradation: (a) HR, (b) EDSR (27.80dB/0.9012), (c) IRCNN (34.63dB/0.9548), (d) ZSSR (30.45dB/0.9384), (e) SRMD (37.71dB/0.9723). We can see that, given the degradation mismatch with that of training, the performance of EDSR decreases drastically.

mean absolute error (MAE) and MSE used to optimize NNs in image processing. Similar to (8), MAE can be written as

$$\min_\theta \sum_i \|F(x_i; \theta) - y_i\|_1. \tag{11}$$

From the perspective of probability, (11) can be interpreted as introducing Laplacian white noise, and similar to (9), the conditional probability becomes

$$p(y|x) = \mathrm{Laplace}(y; F(x; \theta), bI). \tag{12}$$

Compared with MSE in regression, MAE is believed to be more robust against outliers. As reported in [112], when MAE is used to optimize an NN, the NN tends to converge faster and produce better results. The authors argued that the reason might be that MAE could guide the NN to reach a better local minimum. Other similar loss functions in robust statistics can be viewed as modeling additive noises with other probability distributions.

trained by minimizing the Euclidean distance as

$$\min_\Phi \|\Phi(x) - \Psi(r)\|^2. \tag{14}$$

After $\Phi$ is obtained, the final result $r$ can be inferred with SGD by solving

$$\min_r \|\Phi(x) - \Psi(r)\|^2. \tag{15}$$

For further improvement, [113] also proposed a fine-tuning algorithm in which $\Phi$ and $\Psi$ can be fine-tuned to the data. Similar to the alternating updating in GAN, $\Phi$ and $\Psi$ are fine-tuned with SGD based on the current $r$. However, this fine-tuning will involve calculating the gradient of the partition function $Z$, which is a well-known difficult decomposition into the positive phase and the negative phase of learning. Hence, to avoid sampling within inner loops, a biased estimator of this gradient is chosen for simplicity.
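The robustness claim for MAE in (11) and (12) can be checked on a toy regression with a constant predictor: the squared error is minimized by the mean of the targets, while the absolute error is minimized by their median. The sketch below uses hypothetical target values with a single outlier and a brute-force grid search as a sanity check; it is an illustration of the two noise models, not an experiment from [112].

```python
from statistics import mean, median


def l2_cost(c, ys):
    """Sum of squared errors of the constant prediction c (Gaussian noise model)."""
    return sum((c - y) ** 2 for y in ys)


def l1_cost(c, ys):
    """Sum of absolute errors of the constant prediction c (Laplacian noise model)."""
    return sum(abs(c - y) for y in ys)


ys = [10.0, 11.0, 9.0, 10.0, 60.0]  # hypothetical targets; 60.0 is an outlier

best_l2 = mean(ys)    # closed-form minimizer of the squared error
best_l1 = median(ys)  # closed-form minimizer of the absolute error

# Brute-force check over candidate constants 0.00, 0.01, ..., 70.00.
grid = [i / 100.0 for i in range(7001)]
grid_l2 = min(grid, key=lambda c: l2_cost(c, ys))
grid_l1 = min(grid, key=lambda c: l1_cost(c, ys))

print("L2 minimizer:", best_l2, grid_l2)
print("L1 minimizer:", best_l1, grid_l1)
```

The single outlier pulls the squared-error solution far away from the bulk of the targets, whereas the absolute-error solution stays with the majority, which is the sense in which MAE is considered more robust.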
Although these specific distributions often cannot represent unknown additive noise very precisely, their corresponding robust statistical loss functions are used in many DL-based SISR works for their conciseness and advantages over MSE.

2) Using MSE in a Transformed Space: Alternatively, we can search for a mapping $\Psi$ to transform the HR space to some space where Gaussian white noise can be used reasonably. From this perspective, Bruna et al. [113] proposed so-called perceptual loss to leverage deep architectures. In [113], the conditional probability of the residual $r$ between HR and LR given the LR $x$ is modeled by the Gibbs energy model:

$$p(r|x) = \exp(-\|\Phi(x) - \Psi(r)\|^2 - \log Z), \tag{13}$$

where $\Phi$ and $\Psi$ are two mappings between the original spaces and the transformed ones, and $Z$ is the partition function. The features produced by sophisticated supervised deep architectures have been shown to be perceptually stable and discriminative, denoted by $\Psi(r)$ (see footnote 2). Then, $\Psi$ represents the corresponding deep architectures. In contrast, $\Phi$ is the mapping between the LR space and the manifold represented by $\Psi(r)$,

Footnote 2: Either the scattering network or VGG can be denoted by $\Psi$. When $\Psi$ is VGG, there is no residual learning and fine-tuning.

The inference algorithm in [113] is extremely time-consuming. To improve efficiency, Johnson et al. utilized this perceptual loss in an end-to-end training manner [114]. In [114], the SISR network is directly optimized with SGD by minimizing the MSE in the feature manifold produced by VGG-16 as follows:

$$\min_\theta \|\Psi(F(x; \theta)) - \Psi(y)\|^2, \tag{16}$$

where $\Psi$ is the mapping represented by VGG-16, $F(x; \theta)$ denotes the SISR network, and $y$ is the ground truth. Compared with [113], [114] replaces the nonlinear mapping $\Phi$ and the expensive inference with an end-to-end trained CNN, and their results show that this change does not affect the restoration quality but does accelerate the whole process.

Perceptual loss mitigates blurring and leads to more visually pleasing results compared with directly optimizing MSE in the HR space. However, there remains no theoretical analysis on why this approach works. In [113], the authors generally concluded that successful supervised networks used for high-level tasks could produce very compact and stable features. In these feature spaces, small pixel-level variation and much other trivial information can be omitted, making these feature maps mainly focus on pixels of human interest. At

the same time, with the deep architectures, the most specific and discriminative information of the input is shown to be retained in feature spaces because of the great performance of the models applied in various high-level tasks. From this perspective, using MSE in these feature spaces will focus more on the parts that are attractive to human observers with little loss of original contents, so perceptually pleasing results can be obtained.

C. Optimizing Forward KLD with Nonparametric Estimation

Parametric estimation methods such as MLE need to specify in advance the parametric form of the distribution of data, which suffers from model misspecification. Different from parametric estimation, nonparametric estimation methods such as kernel density estimation (KDE) fit the data without distributional assumptions, which are robust when the real distributional form is unknown. Based on nonparametric estimation, recently, the contextual loss [115], [116] was proposed by Mechrez et al. to maintain natural image statistics. In the contextual loss, a Gaussian kernel function is applied:

$$K(x, y) = \exp(-\mathrm{dist}(x, y)/h - \log Z), \tag{17}$$

where $\mathrm{dist}(x, y)$ can be any symmetric distance between $x$ and $y$, $h$ is the bandwidth, and the partition function $Z = \int \exp(-\mathrm{dist}(x, y)/h)\,dy$. Then, $P_{\mathrm{data}}$ and $P_{\mathrm{model}}$ are

$$P_{\mathrm{data}}(z) = \sum_{z_i \sim P_{\mathrm{data}}} K(z, z_i), \qquad P_{\mathrm{model}}(z) = \sum_{w_j \sim P_{\mathrm{model}}} K(z, w_j), \tag{18}$$

and (10) can be rewritten as

$$D_{KL}(P_{\mathrm{data}} \| P_{\mathrm{model}}) = \frac{1}{N} \sum_k \Big[\log \sum_{z_i \sim P_{\mathrm{data}}} K(z_k, z_i) - \log \sum_{w_j \sim P_{\mathrm{model}}} K(z_k, w_j)\Big]. \tag{19}$$

The first log term in (19) is a constant with respect to the model parameters. Let us denote the kernel $K(z_k, w_j)$ in the second log term by $A_{kj}$. Then, the optimization objective in (19) can be rewritten as

$$-\frac{1}{N} \sum_k \log \sum_j A_{kj}. \tag{20}$$

bound can be rewritten as

$$-\frac{1}{N} \sum_j \log \|A_j\|_1, \tag{22}$$

where $A_j = (A_{1j}, \cdots, A_{kj})^T$, and $\|\cdot\|_1$ is the $\ell_1$ norm. When the bandwidth $h \to 0$, the affinity $A_{kj}$ will degrade into the indicator function, which means if $x_k = y_j$, $A_{kj} \approx 1$; otherwise, $A_{kj} \approx 0$. In this case, the $\ell_1$ norm can be approximated well by the $\ell_\infty$ norm, which returns the maximum element of the vector. Thus, (22) can degenerate into the contextual loss in [115], [116]:

$$-\frac{1}{N} \sum_j \log \max_k A_{kj}. \tag{23}$$

Recently, implicit maximum likelihood estimation (IMLE) [117] was proposed and its conditional version was applied to SISR [118]. Here, we will briefly show that minimizing IMLE equals minimizing an upper bound of the forward KLD with KDE. Let us use a Gaussian kernel as

$$K(x, y) = \frac{1}{\sqrt{2\pi}h} \exp\left(-\frac{\|x - y\|_2^2}{2h^2}\right). \tag{24}$$

As with (20), the optimization objective can be rewritten as

$$-\frac{1}{N} \sum_j \log \sum_k e^{-\frac{\|z_k - w_j\|_2^2}{2h^2}}. \tag{25}$$

With $\{w_j\}_{j=1}^m$ and $\{z_k\}_{k=1}^N$, we can obtain a simple upper bound of (25) as

$$-\frac{1}{N} \sum_j \log\Big(m \min_k e^{-\frac{\|z_k - w_j\|_2^2}{2h^2}}\Big) = \frac{1}{N} \sum_j \Big(\min_k \frac{\|z_k - w_j\|_2^2}{2h^2} - \log m\Big). \tag{26}$$

Minimizing (26) equals minimizing

$$\sum_j \min_k \|z_k - w_j\|_2^2, \tag{27}$$

which is the core of the optimization objective of IMLE.

As above, the recently proposed contextual loss and IMLE are illustrated via nonparametric estimation and KLD. Visually pleasing results were reported using the contextual loss and IMLE. However, as KDE is generally very time-consuming, several reasonable approximations along with acceleration algorithms were applied.
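The step from (22) to (23) replaces a sum of affinities by their maximum. A small numerical sketch can make this plausible. It assumes the unnormalized kernel $\exp(-\mathrm{dist}/h)$ from (17) (the constant $\log Z$ shifts both quantities equally and is dropped), and the distance values and bandwidths are hypothetical.

```python
import math


def affinities(dists, h):
    """Unnormalized affinities exp(-dist / h) for one column A_j."""
    return [math.exp(-d / h) for d in dists]


def neg_log_sum(a):
    """-log ||A_j||_1, the summand in (22)."""
    return -math.log(sum(a))


def neg_log_max(a):
    """-log max_k A_kj, the summand in (23)."""
    return -math.log(max(a))


dists = [0.5, 2.0, 3.0]            # hypothetical distances to three samples
bandwidths = [2.0, 1.0, 0.5, 0.1]  # decreasing bandwidth h

gaps = []
for h in bandwidths:
    a = affinities(dists, h)
    gaps.append(neg_log_max(a) - neg_log_sum(a))
    print(f"h = {h}: gap between (23) and (22) summands = {gaps[-1]:.2e}")
```

The gap equals the logarithm of the ratio between the sum and the maximum, so it is always between 0 and the logarithm of the number of samples, and it shrinks toward 0 as the bandwidth decreases and the nearest sample dominates.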
With the Jensen inequality, we can obtain a lower bound of (20):

$$-\frac{1}{N} \sum_k \log \sum_j A_{kj} \geq -\log \frac{1}{N} \sum_k \sum_j A_{kj} \geq 0. \tag{21}$$

The first equality holds if and only if $\forall k, k'$, $\sum_j A_{kj} = \sum_j A_{k'j}$. Both equalities hold if and only if $\forall k$, $\sum_j A_{kj} = 1$. When (20) reaches 0, the given lower bound also reaches 0. Therefore, we can take this lower bound as the optimization objective alternatively.

We can further simplify the lower bound in (21). The lower

D. Other Distances between Probability Measures Used in SISR

As KLD is an asymmetric (pseudo) distance for measuring similarity between two distributions, in this subsection, we begin with the inverse form of forward KLD, namely, backward KLD. The backward KLD is defined as

$$D_{KL}(P_{\mathrm{model}} \| P_{\mathrm{data}}) = E_{z \sim P_{\mathrm{model}}}\left[\log \frac{P_{\mathrm{model}}(z)}{P_{\mathrm{data}}(z)}\right]. \tag{28}$$

When $P_{\mathrm{model}} = P_{\mathrm{data}}$, both KLDs reach the minimum of 0. However, when the solution is inadequate, these two KLDs will lead to quite different results. Here, we use a toy example

Figure 11: A toy example to illustrate the difference between the forward KLD and the backward KLD: (a) forward KLD, (b) backward KLD.

to illustrate a simple case of inadequate solutions, as shown in Fig. 11.

The unknown wanted distribution is a Gaussian mixture model (GMM) with two modes, denoted as $P(x)$, and we model it by a single Gaussian distribution. We can easily see that optimizing the forward KLD results in a solution located in the middle area between the two modes, while optimizing the backward KLD makes the result close to the most prominent mode.

From Fig. 11 we can see that, under inadequate solutions, optimizing the forward KLD will lead to the well-known regression-to-the-mean problem, while optimizing the backward KLD only concentrates on the main modality. The former is one of the reasons for blurring, and some researchers [119] argued that the latter improves the visual quality but makes the results collapse to some patterns.

Different distances may lead to different results under an inadequate solution. Readers can refer to [120] for further understanding. In most low-level computer vision tasks, $P_{\mathrm{data}}$ is an empirical distribution and $P_{\mathrm{model}}$ is an intractable distribution. For this reason, the backward KLD is impractical for optimizing deep architectures. To relieve optimizing difficulties, we replace the asymmetric KLD with the symmetric Jensen-Shannon divergence (JSD) as follows:

$$JS(P_{\mathrm{data}} \| P_{\mathrm{model}}) = \frac{1}{2} KL\Big[P_{\mathrm{data}} \,\Big\|\, \frac{P_{\mathrm{data}} + P_{\mathrm{model}}}{2}\Big] + \frac{1}{2} KL\Big[P_{\mathrm{model}} \,\Big\|\, \frac{P_{\mathrm{data}} + P_{\mathrm{model}}}{2}\Big]. \tag{29}$$

Optimizing (29) explicitly is also very difficult. Generative adversarial nets (GANs) proposed by Goodfellow et al. use the objective function below to implicitly address this problem in a game theory scenario, successfully avoiding the troubling approximate inference and approximation of the partition function gradient:

$$\min_G \max_D \big[E_{z \sim P_{\mathrm{data}}} \log D(z) + E_{z \sim P_{\mathrm{model}}} \log(1 - D(z))\big], \tag{30}$$

where $G$ is the main part called the generator supervised by an auxiliary part $D$ called the discriminator. The two parts update alternately, and when the discriminator cannot give useful information to the generator anymore, in other words, the outputs of the generator totally confuse the discriminator, the optimization procedure is completed. For the detailed discussion on GANs, readers can refer to [45]. Recent works have shown that sophisticated architectures and suitable hyperparameters can help GANs perform excellently. The representative works on GAN-based SISR are [68] and [121]. In [68], the generator of the GAN is the SRResNet mentioned previously, and the discriminator refers to the design criterion of DCGAN [54]. In the context of GANs, a recent work [121] follows a similar path except with a different architecture. Very recently, by leveraging the extension of the basic GAN framework [122], [123] was proposed as an unsupervised SR algorithm. Fig. 12 shows the results of the GAN and MSE with the same architecture; despite the lower PSNR due to artifacts, the visual quality improves by using the GAN for SISR.

Generally, GANs offer an implicit optimization strategy in an adversarial training way by using deep neural networks. Based on this, more rational but complicated measures such as Wasserstein distances [124], f-divergence [125] (see footnote 3) and maximum mean discrepancy (MMD) [126] are taken as alternatives to JSD for training GANs.

Footnote 3: Forward KLD, backward KLD and JSD can all be regarded as the special cases of f-divergence.

E. Characteristics of Different Objective Functions

Now, we can see that those losses mentioned in Section IV-B explicitly model the relation between LR and its HR counterpart. Here, we follow the methodology of [127] and call the losses that were based on measuring the dissimilarity between training pairs the distortion-aimed losses. When the training data are not sufficient, distortion losses usually ignore the particularity of data and appear ineffective to measure the similarity between the source and target distributions.

The losses mentioned in Sections IV-C and IV-D are rooted in measuring the similarity between distributions, which is thought to measure the perceptual quality. Here, we call them perception-aimed losses. Recently, Blau et al. [127] discussed the inherent trade-off between the two kinds of losses. Their discussion can be simplified into an optimization problem:

$$P(D) = \min_{P_{\hat{Y}|X}} d(P_Y, P_{\hat{Y}}) \quad \text{s.t.} \quad E[\Delta(Y, \hat{Y})] \leq D. \tag{31}$$

$\Delta(\cdot, \cdot)$ is the distortion-aimed loss, and $d(\cdot, \cdot)$ is the (pseudo) distance between distributions. Furthermore, the authors also proved that if $d(\cdot, \cdot)$ is convex in its second argument, then $P(D)$ is monotonically nonincreasing and convex. From this property, we can draw the curve of $P(D)$ and easily see this trade-off, as shown in Fig. 13(a), such that improving one must be at the expense of the other. However, as shown in Section IV-B, using MSE in the VGG feature space achieves a better quality, and choosing suitable $\Delta$ and $d$ may ease this trade-off.

For the perception-aimed losses mentioned in Sections IV-C and IV-D, up to now, there has been no rigorous analysis on their differences. Here, we apply the nonreference quality assessment proposed by Ma et al. [95] with RMSE to conduct quantitative comparisons, and the representative qualitative comparisons are depicted in Fig. 13(b). To summarize, we

Figure 12: Visual comparisons between the MSE, MSE + GAN and MAE + GAN + Contextual Loss: (a) HR, (b) bicubic (21.59dB/0.6423), (c) SRResNet (23.53dB/0.7832), (d) SRGAN (21.15dB/0.6868), (e) SRCX (20.88dB/0.6002). (The authors of [68] and [116] released their results.) We can see that the perceptual loss leads to a lower PSNR/SSIM but a better visual quality.

Figure 13: (a) Perception-distortion trade-off: the perception-distortion space is divided by the perception-distortion curve, where an area cannot be attained. (b) Perception-distortion evaluation: use of the nonreference metric proposed by [95] and RMSE to perform quantitative comparisons from the perception and distortion views; the included methods are [47], [84], [71], [61], [68], [121], [116], [114].

should be aware that there is no one-fits-all objective function, and we should choose one that is suitable to the context of an application.

V. TRENDS AND CHALLENGES

Along with the promising performance that DL algorithms have achieved in SISR, there remain several important challenges and inherent trends as follows.

1) Lighter Deep Architectures for Efficient SISR: Although the high accuracy of advanced deep models has been achieved for SISR, it is still difficult to deploy these models to real-world scenarios, which is mainly due to massive parameters and computation. To address this issue, we need to design light deep models or slim the existing deep models for SISR with fewer parameters and computation at the expense of little or no performance degradation. Hence, in the future, researchers are expected to focus more on reducing the size of NNs for

of traditional SISR tasks by a large margin. However, the large scale of SISR and the SISR with unknown corruption, the two major challenges in the SR community, are still lacking very effective remedies. DL algorithms are thought to be skilled at addressing many inferences or unsupervised problems, which is of key importance to address these two challenges. Therefore, by leveraging the great power of DL, more effective solutions to these two demanding problems are expected.

3) Theoretical Understanding of Deep Models for SISR: The success of deep learning is said to be attributed to learning powerful representations. However, to date, we still cannot understand these representations very well, and the deep architectures are treated as a black box. For DL-based SISR, the deep architectures are often viewed as a universal approximator, and the learned representations are often omitted for simplicity. This behavior is not beneficial for further exploration. Therefore, we should not only focus on whether a deep model works but also concentrate on why and how it works. That is, more theoretical explorations are needed.

4) More Rational Assessment Criteria for SISR in Different Applications: In many applications, we need to design the desired objective function for a specific application. However, in most cases, we cannot give an explicit and precise definition to assess the requirement for the application, which leads to the vagueness of the optimization objectives. Many works, although for different purposes, simply employ MSE as the assessment criterion, which has been shown to be a poor criterion in many cases. In the future, we think that it is of great necessity to make clear definitions for assessments in various applications. Based on these criteria, we can design better targeted optimization objectives and compare algorithms in the same context more rationally.

VI. CONCLUSION

This paper presents a brief review of recent deep learning
speeding up the SISR process. algorithms on SISR. It divides the recent works into two
2) More Effective DL Algorithms for Large-scale SISR and categories: the deep architectures for simulating the SISR
SISR with Unknown Corruption: Generally, DL algorithms process and the optimization objectives for optimizing the
proposed in recent years have improved the performance whole process. Despite the promising results reported so far,

there are still many underlying problems. We summarize the main challenges into three aspects: the acceleration of deep models, the extensive comprehension of deep models and the criteria for designing and evaluating the objective functions. Along with these challenges, several directions may be further explored in the future.

ACKNOWLEDGMENT

We are grateful to the authors of [47], [84], [71], [61], [68], [121], [116], [114], [96], [92], [100] for kindly releasing their experimental results or code, as well as to the three anonymous reviewers for their constructive criticism, which has significantly improved our manuscript. Moreover, we thank Qiqi Bao for helpful discussions.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[4] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the International Conference on Machine Learning, 2008, pp. 160–167.
[5] C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: A benchmark,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 372–386.
[6] R. Timofte, R. Rothe, and L. Van Gool, “Seven ways to improve example-based single image super resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1865–1873.
[7] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[8] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[9] S. C. Park, M. K. Park, and M. G. Kang, “Super-resolution image reconstruction: a technical overview,” IEEE Signal Processing Magazine, vol. 20, no. 3, pp. 21–36, 2003.
[10] R. Keys, “Cubic convolution interpolation for digital image processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 6, pp. 1153–1160, 1981.
[11] C. E. Duchon, “Lanczos filtering in one and two dimensions,” Journal of Applied Meteorology, vol. 18, no. 8, pp. 1016–1022, 1979.
[12] S. Dai, M. Han, W. Xu, Y. Wu, Y. Gong, and A. K. Katsaggelos, “Softcuts: a soft edge smoothness prior for color image super-resolution,” IEEE Transactions on Image Processing, vol. 18, no. 5, pp. 969–981, 2009.
[13] J. Sun, Z. Xu, and H.-Y. Shum, “Image super-resolution using gradient profile prior,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[14] Q. Yan, Y. Xu, X. Yang, and T. Q. Nguyen, “Single image superresolution based on gradient profile sharpness,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3187–3202, 2015.
[15] A. Marquina and S. J. Osher, “Image super-resolution by TV-regularization and Bregman iteration,” Journal of Scientific Computing, vol. 37, no. 3, pp. 367–382, 2008.
[16] W. T. Freeman, T. R. Jones, and E. C. Pasztor, “Example-based super-resolution,” IEEE Computer Graphics and Applications, vol. 22, no. 2, pp. 56–65, 2002.
[17] H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004, pp. 275–282.
[18] M. Aharon, M. Elad, A. Bruckstein et al., “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, p. 4311, 2006.
[19] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
[20] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in Proceedings of the International Conference on Curves and Surfaces, 2010, pp. 711–730.
[21] R. Timofte, V. De Smet, and L. Van Gool, “Anchored neighborhood regression for fast example-based super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1920–1927.
[22] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Proceedings of the Asian Conference on Computer Vision, 2014, pp. 111–126.
[23] F. Cao, M. Cai, Y. Tan, and J. Zhao, “Image super-resolution via adaptive ℓp (0 < p < 1) regularization and sparse representation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 7, pp. 1550–1561, 2016.
[24] J. Liu, W. Yang, X. Zhang, and Z. Guo, “Retrieval compensated group structured sparsity for image super-resolution,” IEEE Transactions on Multimedia, vol. 19, no. 2, pp. 302–316, 2017.
[25] S. Schulter, C. Leistner, and H. Bischof, “Fast and accurate image upscaling with super-resolution forests,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3791–3799.
[26] K. Zhang, D. Tao, X. Gao, X. Li, and J. Li, “Coarse-to-fine learning for single-image super-resolution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 5, pp. 1109–1122, 2017.
[27] J. Yu, X. Gao, D. Tao, X. Li, and K. Zhang, “A unified learning framework for single image super-resolution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 4, pp. 780–792, 2014.
[28] C. Deng, J. Xu, K. Zhang, D. Tao, X. Gao, and X. Li, “Similarity constraints-based structured output regression machine: An approach to image super-resolution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 12, pp. 2472–2485, 2016.
[29] W. Yang, Y. Tian, F. Zhou, Q. Liao, H. Chen, and C. Zheng, “Consistent coding scheme for single-image super-resolution via independent dictionaries,” IEEE Transactions on Multimedia, vol. 18, no. 3, pp. 313–325, 2016.
[30] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[31] H. A. Song and S.-Y. Lee, “Hierarchical representation using NMF,” in Proceedings of the International Conference on Neural Information Processing, 2013, pp. 466–473.
[32] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
[33] N. Rochester, J. Holland, L. Haibt, and W. Duda, “Tests on a cell assembly theory of the action of the brain, using a large digital computer,” IRE Transactions on Information Theory, vol. 2, no. 3, pp. 80–93, 1956.
[34] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, p. 533, 1986.
[35] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
[36] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[37] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[38] J. F. Kolen and S. C. Kremer, Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies. IEEE, 2001. [Online]. Available: https://ieeexplore.ieee.org/document/5264952
[39] G. E. Hinton, “Learning multiple layers of representation,” Trends in Cognitive Sciences, vol. 11, no. 10, pp. 428–434, 2007.
[40] D. C. Ciresan, U. Meier, J. Masci, L. Maria Gambardella, and J. Schmidhuber, “Flexible, high performance convolutional neural networks for image classification,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2011, pp. 1237–1242.

[41] D. Cireşan, U. Meier, J. Masci, and J. Schmidhuber, “Multi-column deep neural network for traffic sign classification,” Neural Networks, vol. 32, pp. 333–338, 2012.
[42] R. Salakhutdinov and H. Larochelle, “Efficient learning of deep Boltzmann machines,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010, pp. 693–700.
[43] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[44] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” arXiv preprint arXiv:1401.4082, 2014.
[45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[46] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, Cambridge, 2016, vol. 1.
[47] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 184–199.
[48] ——, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
[49] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[50] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 2018–2025.
[51] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016.
[52] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 818–833.
[53] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[54] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[55] W. Shi, J. Caballero, L. Theis, F. Huszár, A. Aitken, C. Ledig, and Z. Wang, “Is the deconvolution layer the same as a convolutional layer?” arXiv preprint arXiv:1609.07009, 2016.
[56] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 391–407.
[57] N. Efrat, D. Glasner, A. Apartsin, B. Nadler, and A. Levin, “Accurate blur models vs. image priors in single image super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2832–2839.
[58] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675–678.
[59] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: A system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
[60] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, “On the number of linear regions of deep neural networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 2924–2932.
[61] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654.
[62] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[63] J. Kim, J. Kwon Lee, and K. Mu Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1637–1645.
[64] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[65] ——, “Identity mappings in deep residual networks,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 630–645.
[66] A. Veit, M. J. Wilber, and S. Belongie, “Residual networks behave like ensembles of relatively shallow networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 550–558.
[67] D. Balduzzi, M. Frean, L. Leary, J. Lewis, K. W.-D. Ma, and B. McWilliams, “The shattered gradients problem: If resnets are the answer, then what is the question?” in Proceedings of the International Conference on Machine Learning, 2017, pp. 342–350.
[68] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[69] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the International Conference on Machine Learning, 2015, pp. 448–456.
[70] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3147–3155.
[71] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144.
[72] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Proceedings of the Association for the Advancement of Artificial Intelligence, 2017, pp. 4278–4284.
[73] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[74] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2017, pp. 4470–4478.
[75] T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4809–4817.
[76] Y. Tai, J. Yang, X. Liu, and C. Xu, “MemNet: A persistent memory network for image restoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4539–4547.
[77] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[78] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2472–2481.
[79] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, “Deep networks for image super-resolution with sparse prior,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 370–378.
[80] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the International Conference on Machine Learning, 2010, pp. 399–406.
[81] D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, and T. S. Huang, “Robust single image super-resolution via deep networks with sparse prior,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3194–3207, 2016.
[82] D. Liu, Z. Wang, N. Nasrabadi, and T. Huang, “Learning a mixture of deep networks for single image super-resolution,” in Proceedings of the Asian Conference on Computer Vision, 2016, pp. 145–156.
[83] W. Yang, J. Feng, J. Yang, F. Zhao, J. Liu, Z. Guo, and S. Yan, “Deep edge guided recurrent residual learning for image super-resolution,” IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5895–5907, 2017.
[84] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep Laplacian pyramid networks for fast and accurate super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 624–632.
[85] R. Dahl, M. Norouzi, and J. Shlens, “Pixel recursive super resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5439–5448.
[86] A. Singh and N. Ahuja, “Super-resolution using sub-band self-similarity,” in Proceedings of the Asian Conference on Computer Vision, 2014, pp. 552–568.

[87] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in Proceedings of the International Conference on Machine Learning, 2016, pp. 1747–1756.
[88] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with PixelCNN decoders,” in Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.
[89] M. Irani and S. Peleg, “Improving resolution by image registration,” CVGIP: Graphical Models and Image Processing, vol. 53, no. 3, pp. 231–239, 1991.
[90] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep backprojection networks for super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1664–1673.
[91] X. Wang, K. Yu, C. Dong, and C. Change Loy, “Recovering realistic texture in image super-resolution by deep spatial feature transform,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615.
[92] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3262–3271.
[93] R. Timofte, V. De Smet, and L. Van Gool, “Semantic super-resolution: When and where is it useful?” Computer Vision and Image Understanding, vol. 142, pp. 1–12, 2016.
[94] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plug-and-play priors for model based reconstruction,” in Proceedings of the IEEE Global Conference on Signal and Information Processing, 2013, pp. 945–948.
[95] T. Meinhardt, M. Moller, C. Hazirbas, and D. Cremers, “Learning proximal operators: Using denoising networks for regularizing inverse imaging problems,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1781–1790.
[96] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3929–3938.
[97] T. Tirer and R. Giryes, “Image restoration by iterative denoising and backward projections,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1220–1234, 2019.
[98] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9446–9454.
[99] K. Zhang, X. Gao, D. Tao, and X. Li, “Single image super-resolution with multiscale similarity learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1648–1659, 2013.
[100] A. Shocher, N. Cohen, and M. Irani, “Zero-shot super-resolution using deep internal learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3118–3126.
[101] M. Zontak and M. Irani, “Internal statistics of a single natural image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2011, pp. 977–984.
[102] T. Michaeli and M. Irani, “Nonparametric blind super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 945–952.
[103] T. Tirer and R. Giryes, “Super-resolution based on image-adapted CNN denoisers: Incorporating generalization of training data and internal learning in test time,” arXiv preprint arXiv:1811.12866, 2018.
[104] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[105] N. Ahn, B. Kang, and K.-A. Sohn, “Fast, accurate, and lightweight super-resolution with cascading residual network,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 252–268.
[106] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proceedings of the IEEE International Conference on Computer Vision, 2001, pp. 416–423.
[107] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE International Conference on Computer Vision, 2009, pp. 248–255.
[108] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 126–135.
[109] Z. Yang, K. Zhang, Y. Liang, and J. Wang, “Single image super-resolution with a parameter economic residual-like convolutional neural network,” in Proceedings of the International Conference on Multimedia Modeling, 2017, pp. 353–364.
[110] Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image super-resolution via information distillation network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 723–731.
[111] X. Fan, Y. Yang, C. Deng, J. Xu, and X. Gao, “Compressed multi-scale feature fusion network for single image super-resolution,” Signal Processing, vol. 146, pp. 50–60, 2018.
[112] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for neural networks for image processing,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47–51, 2017.
[113] J. Bruna, P. Sprechmann, and Y. LeCun, “Super-resolution with deep convolutional sufficient statistics,” arXiv preprint arXiv:1511.05666, 2015.
[114] J. Johnson, A. Alahi, and F.-F. Li, “Perceptual losses for real-time style transfer and super-resolution,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 694–711.
[115] R. Mechrez, I. Talmi, F. Shama, and L. Zelnik-Manor, “Learning to maintain natural image statistics,” arXiv preprint arXiv:1803.04626, 2018.
[116] R. Mechrez, I. Talmi, and L. Zelnik-Manor, “The contextual loss for image transformation with non-aligned data,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 768–783.
[117] K. Li and J. Malik, “Implicit maximum likelihood estimation,” arXiv preprint arXiv:1809.09087, 2018.
[118] K. Li, S. Peng, and J. Malik, “Super-resolution via conditional implicit maximum likelihood estimation,” arXiv preprint arXiv:1810.01406, 2018.
[119] F. Huszár, “How (not) to train your generative model: Scheduled sampling, likelihood, adversary?” arXiv preprint arXiv:1511.05101, 2015.
[120] L. Theis, A. v. d. Oord, and M. Bethge, “A note on the evaluation of generative models,” arXiv preprint arXiv:1511.01844, 2015.
[121] M. S. Sajjadi, B. Schölkopf, and M. Hirsch, “EnhanceNet: Single image super-resolution through automated texture synthesis,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4501–4510.
[122] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[123] Y. Yuan, S. Liu, J. Zhang, Y. Zhang, C. Dong, and L. Lin, “Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 814–823.
[124] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the International Conference on Machine Learning, 2017, pp. 214–223.
[125] S. Nowozin, B. Cseke, and R. Tomioka, “f-GAN: Training generative neural samplers using variational divergence minimization,” in Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 271–279.
[126] D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton, “Generative models and model criticism via optimized maximum mean discrepancy,” arXiv preprint arXiv:1611.04488, 2016.
[127] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6228–6237.
