Review DL2019
arXiv:1808.03344v3 [cs.CV] 12 Jul 2019

Abstract—Single image super-resolution (SISR) is a notoriously challenging ill-posed problem that aims to obtain a high-resolution (HR) output from one of its low-resolution (LR) versions. Recently, powerful deep learning algorithms have been applied to SISR and have achieved state-of-the-art performance. In this survey, we review representative deep learning-based SISR methods and group them into two categories according to their contributions to two essential aspects of SISR: the exploration of efficient neural network architectures for SISR and the development of effective optimization objectives for deep SISR learning. For each category, a baseline is first established, and several critical limitations of the baseline are summarized. Then, representative works on overcoming these limitations are presented based on their original content, as well as our critical exposition and analyses, and relevant comparisons are conducted from a variety of perspectives. Finally, we conclude this review with some current challenges and future trends in SISR that leverage deep learning algorithms.

Index Terms—Single image super-resolution, deep learning, neural networks, objective function

I. INTRODUCTION

of the mapping that we aim to develop between the LR space and the HR space, and the other is the inefficiency of establishing a complex high-dimensional mapping given massive raw data. Benefiting from the strong capacity of extracting effective high-level abstractions that bridge the LR and HR space, recent DL-based SISR methods have achieved significant improvements, both quantitatively and qualitatively.

In this survey, we attempt to give an overall review of recent DL-based SISR algorithms. We mainly focus on two areas: efficient neural network architectures designed for SISR and effective optimization objectives for DL-based SISR learning. The reason for this taxonomy is that when we apply DL algorithms to tackle a specified task, it is best for us to consider both the universal DL strategies and the specific domain knowledge. From the perspective of DL, although many other techniques such as data preprocessing [6] and model training techniques are also quite important [7], [8], the combination of DL and domain knowledge in SISR is usually the key to success and is often reflected in the innovations of neural network architectures and optimization objectives
redundant nearest neighbor interpolation is replaced with the interpolation that pads the subpixels with zeroes, the deconvolution layer can be simplified into the subpixel convolution in ESPCN. Compared with the nearest neighbor interpolation, this interpolation is clearly more efficient, which also supports the effectiveness of ESPCN.

2) The Deeper, The Better: In DL research, there is theoretical work [60] showing that the solution space of a DNN can be expanded by increasing its depth or its width. In some situations, to attain more hierarchical representations more effectively, many works mainly focus on improvements acquired by increasing the depth. Recently, various DL-based applications have also demonstrated the great power of very deep neural networks despite many training difficulties. VDSR [61] is the first very deep model used in SISR. As shown in Fig. 5(a), VDSR is a 20-layer VGG-net [62]. The VGG architecture sets all kernel sizes as 3 × 3 (the kernel size is usually odd and takes the increase in the receptive field into account, and 3 × 3 is the smallest kernel size). To train this deep model, the authors used a relatively high initial learning rate to accelerate convergence and used gradient clipping to prevent gradient explosion.

In addition to the innovative architecture, VDSR has made two more contributions. The first one is that a single model is used for multiple scales, since the SISR processes with different scale factors have a strong relationship with each other. This fact is the basis of many traditional SISR methods. Similar to SRCNN, VDSR takes the bicubically upsampled LR image as input. During training, VDSR pools the bicubically upsampled LR images of different scale factors into one training set. For larger scale factors (×3, ×4), the mapping for a smaller scale factor (×2) may also be informative. The second contribution is residual learning. Unlike the direct mapping from the bicubic version to HR, VDSR uses a deep CNN to learn the mapping from the bicubic version to the residual between the bicubic version and HR. The authors argued that residual learning could improve performance and accelerate convergence.

The convolution kernels in the nonlinear mapping part of VDSR are very similar, and in order to reduce parameters, Kim et al. further proposed DRCN [63], which utilizes the same convolution kernel in the nonlinear mapping part 16 times, as shown in Fig. 5(b). To overcome the difficulties of training a deep recursive CNN, a multisupervised strategy is applied, and the final result can be regarded as the fusion of 16 intermediate results. The coefficients for fusion are a list of trainable positive scalars that sum to 1. As they showed, DRCN and VDSR have quite similar performance.

Here, we believe that it is necessary to emphasize the importance of the multisupervised training in DRCN. This strategy not only creates short paths through which the gradients can flow more smoothly during backpropagation but also guides all the intermediate representations to reconstruct raw HR outputs. Finally, fusing all these raw HR outputs produces a good result. However, for fusion, this strategy has two flaws: 1) once the weight scalars are determined in the training process, they will not change with different inputs; and 2) using a single scalar to weight HR outputs does not take pixelwise differences into consideration, that is, it would be better to weight different parts differently in an adaptive way.

It is hard to go deeper with a plain architecture such as VGG-net. Various deep models based on skip connections can be extremely deep and have achieved state-of-the-art performance in many tasks. Among them, ResNet [64], [65], proposed by He et al., is the most representative model. Readers can refer to [66], [67] for further discussions on why ResNet works well. In [68], the authors proposed SRResNet, which is composed of 16 residual units (a residual unit consists of two nonlinear convolutions with residual learning). In each unit, batch normalization (BN) [69] is used to stabilize the training process. The overall architecture of SRResNet is shown in Fig. 5(c). Based on the original residual unit in [65], Tai et al. proposed DRRN [70], in which basic residual units are rearranged in a recursive topology to form a recursive block, as shown in Fig. 5(d). Then, to reduce the number of parameters, each block shares the same parameters and is reused recursively, as with the single recursive convolution kernel in DRCN.

EDSR [71] was proposed by Lim et al. and currently achieves state-of-the-art performance. EDSR has mainly made three improvements on the overall frame: 1) Compared with the residual unit used in previous work, EDSR removes the usage of BN, as shown in Fig. 5(e). The original ResNet with BN was designed for classification, where inner representations are highly abstract, and these representations can be insensitive to the shift introduced by BN. Regarding image-to-image tasks such as SISR, since the input and output are strongly related, if the convergence of the network is not a problem, then such a shift may harm the final performance. 2) Beyond the usual increase in depth, EDSR also greatly increases the number of output features of each layer. To relieve the difficulties of training such a wide ResNet, the residual scaling trick proposed in [72] is employed. 3) Additionally, inspired by the fact that the SISR processes with different scale factors have strong relationships with each other, when training the models for ×3 and ×4 scales, the authors of [71] initialized the parameters with the pretrained ×2 network. This pretraining strategy accelerates the training and improves the final performance.

The effectiveness of the pretraining strategy in EDSR implies that models for different scales may share many intermediate representations. To explore this idea further, similar to building a multiscale architecture as VDSR does on the condition of bicubic input, the authors of EDSR proposed MDSR to achieve the multiscale architecture, as shown in Fig. 5(g). In MDSR, the convolution kernels for nonlinear mapping are shared across different scales, where only the front convolution kernels for extracting features and the final subpixel upsampling convolution are different. At each update during the training of MDSR, minibatches for ×2, ×3 and ×4 are randomly chosen, and only the corresponding parts of MDSR are updated.

In addition to ResNet, DenseNet [73] is another effective architecture based on skip connections. In DenseNet, each layer is connected with all the preceding representations, and the bottleneck layers are used in units and blocks to reduce the number of parameters. In [74], the authors pointed
out that ResNet enables feature re-usage while DenseNet enables new feature exploration. Based on the basic DenseNet, SRDenseNet [75], as shown in Fig. 5(f), further concatenates all the features from different blocks before the deconvolution layer, which is shown to be effective in improving performance. MemNet [76], proposed by Tai et al., uses the residual unit recursively to replace the normal convolution in the block of the basic DenseNet and adds dense connections among different blocks, as shown in Fig. 5(h). The authors explained that the local connections in the same block resemble the short-term memory and the connections with previous blocks resemble the long-term memory [77]. Recently, RDN [78] was proposed by Zhang et al. and uses a similar structure. In an RDN block, basic convolution units are densely connected similar to DenseNet, and at the end of an RDN block, a bottleneck layer is used, followed by residual learning across the whole block. Before entering the reconstruction part, features from all previous blocks are fused by the dense connection and residual learning.

3) Combining Properties of the SISR Process with the Design of the CNN Frame: In this subsection, we discuss some deep frames whose architectures or procedures are inspired by some representative methods for SISR. Compared with the abovementioned NN-oriented methods, these methods can be better interpreted, and they sometimes are more sophisticated in addressing certain challenging cases for SISR.

Combining sparse coding with deep NN: The sparse prior in natural images and the relationships between the HR and LR spaces rooted in this prior were widely used for their great performance and theoretical support. SCN [79] was proposed by Wang et al. and uses the learned iterative shrinkage and thresholding algorithm (LISTA) [80], which produces an approximate estimation of sparse coding based on NN, to solve the time-consuming inference in traditional sparse coding SISR. They further introduced a cascaded version (CSCN) [81] that employs multiple SCNs. Previous works such as SRCNN tried to explain general CNN architectures with the sparse coding theory, which from today's view may be somewhat unconvincing. SCN combines these two important concepts innovatively and gains both quantitative and qualitative improvements.

Learning to ensemble by NN: Different models specialize in different image patterns of SISR. From the perspective of ensemble learning, a better result can be acquired by adaptively fusing various models with different purposes at the pixel level. Motivated by this idea, MSCN was proposed by Liu et al. [82] by developing an extra module in the form of a CNN, taking the LR as input and outputting several tensors with the same shape as the HR. These tensors can be viewed as adaptive elementwise weights for each raw HR output. By selecting NNs as the raw SR inference modules, the raw estimating parts and the fusing part can be optimized jointly. However, in MSCN, the summation of coefficients at each pixel is not 1, which may be slightly incongruous.

Deep architectures with progressive methodology: Increasing SISR performance progressively has been extensively studied previously, and many recent DL-based approaches also exploit it from various perspectives. Here, we mainly discuss three novel works within this scope: DEGREE [83], combining the progressive property of ResNet with traditional subband reconstruction; LapSRN [84], generating SR of different scales progressively; and PixelSR [85], leveraging conditional autoregressive models to generate SR pixel-by-pixel.

Compared with other deep architectures, ResNet is intriguing for its progressive properties. Taking SRResNet for example, one can observe that directly sending the representations produced by intermediate residual blocks to the final reconstruction part will also yield a quite good raw HR estimator. The deeper these representations are, the better the results that can be obtained. A similar phenomenon of ResNet applied in recognition is reported in [66]. DEGREE, proposed by Yang et al., combines this progressive property of ResNet with the subband reconstruction of traditional SR methods [86]. The residues learned in each residual block can be used to reconstruct high-frequency details, resembling the signals from a certain high-frequency band. To simulate subband reconstruction, a recursive residual block is used. Compared with the traditional supervised subband recovery methods that need to obtain subband ground truth by diverse filters, this simulation with recursive ResNet avoids explicitly estimating intermediate subband components, benefiting from the end-to-end representation learning.

As mentioned above, models for small scale factors can be used for a raw estimator of a large scale SISR. In the SISR community, SISR under large scale factors (e.g., ×8) has been a challenging problem for a long time. In such situations, plausible priors are imposed to restrict the solution space. A straightforward way to address this is to gradually increase resolution by adding extra supervision on the auxiliary SISR process of the small scale. Based on this heuristic prior, LapSRN, proposed by Lai et al., uses the Laplacian pyramid structure to reconstruct HR outputs. LapSRN has two branches: the feature extraction branch and the image reconstruction branch, as shown in Fig. 6. At each scale, the image reconstruction branch estimates a raw HR output of the present stage, and the feature extraction branch outputs a residue between the raw estimator and the corresponding ground truth as well as extracts useful representations for the next stage.

When faced with large scale factors with a severe loss of necessary details, some researchers suggest that synthesizing rational details can achieve better results. In this situation, deep generative models, which will be discussed in the next sections, could be good choices. Compared with the traditional independent point estimation of the lost information, conditional autoregressive generative models using conditional maximum likelihood estimation in directed graphical models gradually generate high-resolution images based on the previously generated pixels. PixelRNN [87] and PixelCNN [88] are recent representative autoregressive generative models. The current pixel in PixelRNN and PixelCNN is explicitly dependent on the left and top pixels that have already been generated. To implement such operations, novel network architectures are elaborated. PixelSR was proposed by Dahl et al. and first applies conditional PixelCNN to SISR. The overall architecture is shown in Fig. 7. The conditioning CNN takes LR
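The dependency described above for PixelRNN and PixelCNN, where the current pixel depends only on pixels above it and to its left, is commonly realized by masking a convolution kernel. The following is a minimal sketch in plain Python, assuming a raster-scan generation order (top to bottom, left to right). The names `causal_mask` and `masked_response` and the `include_center` flag are illustrative assumptions of this sketch and are not taken from [85], [87] or [88].

```python
def causal_mask(kernel_size, include_center=False):
    """Build a kernel_size x kernel_size 0/1 mask for raster-scan generation.

    Positions in rows above the center, and positions to the left of the
    center in the center row, are set to 1 (already generated pixels).
    All other positions are 0 (pixels not generated yet). The center itself
    is 1 only when include_center is True.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    c = kernel_size // 2
    mask = []
    for r in range(kernel_size):
        row = []
        for col in range(kernel_size):
            above = r < c
            left_in_center_row = (r == c and col < c)
            at_center = (r == c and col == c)
            visible = above or left_in_center_row or (include_center and at_center)
            row.append(1 if visible else 0)
        mask.append(row)
    return mask


def masked_response(patch, weights, mask):
    """Weighted sum over one patch that only uses positions where mask == 1."""
    total = 0.0
    for p_row, w_row, m_row in zip(patch, weights, mask):
        for p, w, m in zip(p_row, w_row, m_row):
            total += p * w * m
    return total


if __name__ == "__main__":
    for row in causal_mask(5):
        print(row)
```

With such a mask, changing any pixel at or after the current position leaves `masked_response` unchanged, which is the property an autoregressive model needs. In this sketch, `include_center=False` would suit a first layer that must not see the pixel being predicted, while `include_center=True` would suit later layers that may read their own position.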
Figure 9: Learning curves for the reconstruction of different kinds of images. We re-implement the experiment in [98] with the image 'butterfly' in Set5.

noisy parts. Although these totally unsupervised methods are outperformed by other supervised learning methods, they perform considerably better than some other naive methods.

Deep architectures with internal examples: Internal-example SISR algorithms are based on the recurrence of small pieces of information across different scales of a single image, which are shown to be better at addressing specific details rarely existing in other external images [99]. ZSSR [100], proposed by Shocher et al., is the first work combining deep architectures with internal-example learning. In ZSSR, other than the image for testing, no extra images are needed, and all the patches for training are taken from different degraded pairs of the test image. As demonstrated in [101], the visual entropy inside a single image is much smaller than that of a large training dataset collected from a wide range of sources, so unlike external-example SISR algorithms, a very small CNN is sufficient. As we mentioned previously for VDSR, the training data for a small-scale model can also be useful for training large-scale models. Additionally, based on this trick, ZSSR can be more robust by collecting more internal training pairs with small scale factors for training large-scale models. However, this approach will increase runtime immensely. Notably, when combined with the kernel estimation algorithms mentioned in [102], ZSSR performs quite well with unknown degradation kernels.

Recently, Tirer et al. argued that degradation in LR decreases the performance of internal-example algorithms [103]. Therefore, they proposed to use the reconstruction-based deep frame IDBP [97] to obtain an initial SR result and then conduct internal-example-based network training similar to ZSSR. This method was believed to combine two successful techniques that address the mismatch between training and test, and it has achieved robust performance in these cases.

C. Comparisons among Different Models and Discussion

In this section, we will summarize recent progress in deep architectures for SISR from two perspectives: quantitative comparisons for those trained by specific blurring, and comparisons on those models for handling nonspecific blurring. For the first part, quantitative criteria mainly include the following:

$$\mathrm{SSIM}(I, \hat{I}) = \frac{2\mu_I \mu_{\hat{I}} + k_1}{\mu_I^2 + \mu_{\hat{I}}^2 + k_1} \cdot \frac{\sigma_{I\hat{I}} + k_2}{\sigma_I^2 + \sigma_{\hat{I}}^2 + k_2}, \tag{7}$$

where $\mu_I$ and $\sigma_I^2$ are the mean and variance of $I$, $\sigma_{I\hat{I}}$ is the covariance between $I$ and $\hat{I}$, and $k_1$ and $k_2$ are constant relaxation terms.

2) Number of parameters of NN for measuring storage efficiency (Params).

3) Number of composite multiply-accumulate operations for measuring computational efficiency (Mult&Adds): Since operations in NNs for SISR are mainly multiplications with additions, we use Mult&Adds in CARN [105] to measure computation, assuming that the desired SR is 720p.

Notably, it has been shown in [48] and [49] that the training datasets have a great influence on the final performance, and usually, more abundant training data will lead to better results. Generally, these models are trained via three main datasets: 1) 91 images from [19] and 200 images from [106], called the 291 dataset (some models only use 91 images); 2) images derived from ImageNet [107] randomly; and 3) the newly published DIV2K dataset [108]. In addition to the different number of images each dataset contains, the quality of images in each dataset is also different. Images in the 291 dataset are usually small (on average, 150×150), images in ImageNet are much larger, while images in DIV2K are of very high quality. Because of the restricted resolution of the images in the 291 dataset, models on this set have difficulties in obtaining large patches with large receptive fields. Therefore, models based on the 291 dataset usually take the bicubically upsampled LR image as input, which is quite time-consuming. Table I compares different models on the mentioned criteria.

From Table I, we can see that generally as the depth and the number of parameters grow, the performance improves. However, the growth rate of performance levels off. Recently, some works on designing light models [109], [105], [110] and learning sparse structural NN [111] were proposed to achieve relatively good performance with less storage and computation, which are very meaningful in practice.

For the second part, we mainly show that the performance of the models for some specific degradation dropped drastically when the true degradation mismatches the one assumed for training. For example, we use four models, including EDSR trained with bicubic degradation [71], IRCNN [96], SRMD [92] and ZSSR [100], to address LRs generated by Gaussian kernel degradation (kernel size of 7 × 7 with bandwidth 1.6), as shown in Fig. 10, and the performance of EDSR dropped drastically with obvious blur, while other models for nonspecific degradation perform quite well. Therefore, to
address some longstanding problems in SISR, such as unknown degradation, the direct usage of general deep learning techniques may not be sufficient. More effective solutions can be achieved by combining the power of DL and the specific properties of the SISR scene.

IV. OPTIMIZATION OBJECTIVES FOR DL-BASED SISR

A. Benchmark of Optimization Objectives for DL-based SISR

We select the MSE loss used in SRCNN as the benchmark. It is known that using MSE favors a high PSNR, and PSNR is a widely used metric for quantitatively evaluating image restoration quality. Optimizing MSE can be viewed as a regression problem, leading to a point estimation of $\theta$ as

$$\min_{\theta} \sum_{i} \| F(x_i; \theta) - y_i \|_2^2, \tag{8}$$

where $(x_i, y_i)$ is the $i$th training example and $F(x; \theta)$ is a CNN parameterized by $\theta$. Here, (8) can be interpreted in a probabilistic way by assuming Gaussian white noise ($\mathcal{N}(\epsilon; 0, \sigma^2 I)$) independent of the image in the regression model, and then, the conditional probability of $y$ given $x$ becomes a Gaussian distribution with mean $F(x; \theta)$ and the diagonal covariance matrix $\sigma^2 I$, where $I$ is the identity matrix:

$$p(y|x) = \mathcal{N}(y; F(x; \theta), \sigma^2 I). \tag{9}$$

Then, using maximum likelihood estimation (MLE) on the training examples with (9) will lead to (8).

The Kullback-Leibler divergence (KLD) between the conditional empirical distribution $P_{\mathrm{data}}$ and the conditional model distribution $P_{\mathrm{model}}$ is defined as

$$D_{KL}(P_{\mathrm{data}} \| P_{\mathrm{model}}) = E_{z \sim P_{\mathrm{data}}}\left[\log \frac{P_{\mathrm{data}}(z)}{P_{\mathrm{model}}(z)}\right]. \tag{10}$$

We call (10) the forward KLD, where $z = y|x$ denotes the HR (SR) conditioned on its LR counterpart, $P_{\mathrm{data}}$ and $P_{\mathrm{model}}$ are the conditional distributions of HR|LR and SR|LR, respectively, where $E_{z \sim P_{\mathrm{data}}}[\log P_{\mathrm{data}}(z)]$ is an intrinsic term determined by the training data regardless of the parameter $\theta$ of the model (or the model distribution $P_{\mathrm{model}}(x; \theta)$). Hence, when we use the training samples to estimate parameter $\theta$, minimizing this KLD is equivalent to MLE.

Here, we have demonstrated that MSE is a special case of MLE, and MLE is a special case of KLD. However, we may ask whether the assumptions underlying these specializations actually hold. This consideration has led to some emerging objective functions from four perspectives:

1) Translating MLE into MSE can be achieved by assuming Gaussian white noise. Although the Gaussian model is the most widely used model for its simplicity and technical support, what if this independent Gaussian noise assumption is violated in a complicated scene such as SISR?

2) To use MLE, we need to assume the parametric form of the data distribution. What if the parametric form is misspecified?

3) Apart from KLD in (10), are there any other distances between probability measures that we can use as the optimization objectives for SISR?

4) Under specific circumstances, how can we choose the suitable objective functions according to their properties?

Based on some solutions to these four questions, recent work on optimization objectives for DL-based SISR will be discussed in Sections IV-B, IV-C, IV-D and IV-E, respectively.

B. Objective Functions Based on non-Gaussian Additive Noises

The poor perceptual quality of the SISR images obtained by optimizing MSE directly demonstrates a fact: using Gaussian additive noise in the HR space is not good enough. To address this problem, solutions are proposed from two aspects: use other distributions for this additive noise, or transfer the HR space to some space where the Gaussian noise is reasonable.

1) Denote Additive Noise with Other Probability Distributions: In [112], Zhao et al. investigated the difference between
mean absolute error (MAE) and MSE used to optimize NNs in image processing. Similar to (8), MAE can be written as

$$\min_{\theta} \sum_{i} \| F(x_i; \theta) - y_i \|_1. \tag{11}$$

From the perspective of probability, (11) can be interpreted as introducing Laplacian white noise, and similar to (9), the conditional probability becomes

$$p(y|x) = \mathrm{Laplace}(y; F(x; \theta), bI). \tag{12}$$

Compared with MSE in regression, MAE is believed to be more robust against outliers. As reported in [112], when MAE is used to optimize an NN, the NN tends to converge faster and produce better results. The authors argued that the reason might be that MAE could guide the NN to reach a better local minimum. Other similar loss functions in robust statistics can be viewed as modeling additive noises with other probability distributions.

Although these specific distributions often cannot represent unknown additive noise very precisely, their corresponding robust statistical loss functions are used in many DL-based SISR works for their conciseness and advantages over MSE.

Figure 10: Comparisons of 'monarch' in Set14 for scale 2 with Gaussian kernel degradation. We can see that, given the degradation mismatch with that of training, the performance of EDSR decreases drastically.

2) Using MSE in a Transformed Space: Alternatively, we can search for a mapping $\Psi$ to transform the HR space to some space where Gaussian white noise can be used reasonably. From this perspective, Bruna et al. [113] proposed the so-called perceptual loss to leverage deep architectures. In [113], the conditional probability of the residual $r$ between HR and LR given the LR $x$ is modeled by the Gibbs energy model:

$$p(r|x) = \exp(-\|\Phi(x) - \Psi(r)\|^2 - \log Z), \tag{13}$$

where $\Phi$ and $\Psi$ are two mappings between the original spaces and the transformed ones, and $Z$ is the partition function. The features produced by sophisticated supervised deep architectures have been shown to be perceptually stable and discriminative, denoted by $\Psi(r)$ (either the scattering network or VGG can be denoted by $\Psi$; when $\Psi$ is VGG, there is no residual learning and fine-tuning). Then, $\Psi$ represents the corresponding deep architectures. In contrast, $\Phi$ is the mapping between the LR space and the manifold represented by $\Psi(r)$, trained by minimizing the Euclidean distance as

$$\min_{\Phi} \|\Phi(x) - \Psi(r)\|^2. \tag{14}$$

After $\Phi$ is obtained, the final result $r$ can be inferred with SGD by solving

$$\min_{r} \|\Phi(x) - \Psi(r)\|^2. \tag{15}$$

For further improvement, [113] also proposed a fine-tuning algorithm in which $\Phi$ and $\Psi$ can be fine-tuned to the data. Similar to the alternating updating in GAN, $\Phi$ and $\Psi$ are fine-tuned with SGD based on the current $r$. However, this fine-tuning will involve calculating the gradient of the partition function $Z$, which leads to the well-known, difficult decomposition into the positive phase and the negative phase of learning. Hence, to avoid sampling within inner loops, a biased estimator of this gradient is chosen for simplicity.

The inference algorithm in [113] is extremely time-consuming. To improve efficiency, Johnson et al. utilized this perceptual loss in an end-to-end training manner [114]. In [114], the SISR network is directly optimized with SGD by minimizing the MSE in the feature manifold produced by VGG-16 as follows:

$$\min_{\theta} \|\Psi(F(x; \theta)) - \Psi(y)\|^2, \tag{16}$$

where $\Psi$ is the mapping represented by VGG-16, $F(x; \theta)$ denotes the SISR network, and $y$ is the ground truth. Compared with [113], [114] replaces the nonlinear mapping $\Phi$ and the expensive inference with an end-to-end trained CNN, and their results show that this change does not affect the restoration quality but does accelerate the whole process.

Perceptual loss mitigates blurring and leads to more visually pleasing results compared with directly optimizing MSE in the HR space. However, there remains no theoretical analysis on why this approach works. In [113], the authors generally concluded that successful supervised networks used for high-level tasks could produce very compact and stable features. In these feature spaces, small pixel-level variation and much other trivial information can be omitted, making these feature maps mainly focus on pixels of human interest. At
the same time, with the deep architectures, the most specific and discriminative information of the input is shown to be retained in feature spaces because of the great performance of the models applied in various high-level tasks. From this perspective, using MSE in these feature spaces will focus more on the parts that are attractive to human observers with little loss of original contents, so perceptually pleasing results can be obtained.

C. Optimizing Forward KLD with Nonparametric Estimation

Parametric estimation methods such as MLE need to specify in advance the parametric form of the distribution of data, which suffers from model misspecification. Different from parametric estimation, nonparametric estimation methods such as kernel density estimation (KDE) fit the data without distributional assumptions, which are robust when the real distributional form is unknown. Based on nonparametric estimation, recently, the contextual loss [115], [116] was proposed by Mechrez et al. to maintain natural image statistics. In the contextual loss, a Gaussian kernel function is applied:

$$K(x, y) = \exp(-\mathrm{dist}(x, y)/h - \log Z), \tag{17}$$

where $\mathrm{dist}(x, y)$ can be any symmetric distance between $x$ and $y$, $h$ is the bandwidth, and the partition function $Z = \int \exp(-\mathrm{dist}(x, y)/h)\,dy$. Then, $P_{\mathrm{data}}$ and $P_{\mathrm{model}}$ are

$$P_{\mathrm{data}}(z) = \sum_{z_i \sim P_{\mathrm{data}}} K(z, z_i), \qquad P_{\mathrm{model}}(z) = \sum_{w_j \sim P_{\mathrm{model}}} K(z, w_j), \tag{18}$$

and (10) can be rewritten as

bound can be rewritten as

$$-\frac{1}{N} \sum_{j} \log \|A_j\|_1, \tag{22}$$

where $A_j = (A_{1j}, \cdots, A_{kj})^T$, and $\|\cdot\|_1$ is the $\ell_1$ norm. When the bandwidth $h \to 0$, the affinity $A_{kj}$ will degrade into the indicator function, which means if $x_k = y_j$, $A_{kj} \approx 1$; otherwise, $A_{kj} \approx 0$. In this case, the $\ell_1$ norm can be approximated well by the $\ell_\infty$ norm, which returns the maximum element of the vector. Thus, (22) can degenerate into the contextual loss in [115], [116]:

$$-\frac{1}{N} \sum_{j} \log \max_{k} A_{kj}. \tag{23}$$

Recently, implicit maximum likelihood estimation (IMLE) [117] was proposed and its conditional version was applied to SISR [118]. Here, we will briefly show that minimizing IMLE equals minimizing an upper bound of the forward KLD with KDE. Let us use a Gaussian kernel as

$$K(x, y) = \frac{1}{\sqrt{2\pi}\,h} \exp\left(-\frac{\|x - y\|_2^2}{2h^2}\right). \tag{24}$$

As with (20), the optimization objective can be rewritten as

$$-\frac{1}{N} \sum_{j} \log \sum_{k} e^{-\frac{\|z_k - w_j\|_2^2}{2h^2}}. \tag{25}$$

With $\{w_j\}_{j=1}^{m}$ and $\{z_k\}_{k=1}^{N}$, we can obtain a simple upper bound of (25) as

$$-\frac{1}{N} \sum_{j} \log\left(m \min_{k} e^{-\frac{\|z_k - w_j\|_2^2}{2h^2}}\right) = \frac{1}{N} \sum_{j} \left(\min_{k} \frac{\|z_k - w_j\|_2^2}{2h^2} - \log m\right). \tag{26}$$
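As a numerical sanity check on how the inner log-sum in (25) relates to the nearest-sample distance $\min_k \|z_k - w_j\|_2^2 / (2h^2)$ appearing in (26), the sketch below compares the two for a single $w_j$. It is a minimal illustration in plain Python under stated assumptions: the squared distances are supplied directly as a list, the normalization constant of (24) is dropped, and the function names `soft_min_term` and `nearest_term` are hypothetical. For an inner sum with n terms, the two quantities differ by at most log n, and the gap shrinks as the bandwidth h decreases when the nearest sample is unique.

```python
import math


def soft_min_term(sq_dists, h):
    """Return -log(sum_k exp(-d_k / (2 h^2))) for squared distances d_k.

    The smallest scaled distance is factored out before exponentiating
    (the log-sum-exp trick) so that small bandwidths do not underflow.
    """
    if not sq_dists:
        raise ValueError("sq_dists must be non-empty")
    if h <= 0:
        raise ValueError("h must be positive")
    scaled = [d / (2.0 * h * h) for d in sq_dists]
    lo = min(scaled)
    return lo - math.log(sum(math.exp(-(s - lo)) for s in scaled))


def nearest_term(sq_dists, h):
    """Return min_k d_k / (2 h^2), the nearest-sample distance term."""
    if not sq_dists:
        raise ValueError("sq_dists must be non-empty")
    if h <= 0:
        raise ValueError("h must be positive")
    return min(sq_dists) / (2.0 * h * h)


if __name__ == "__main__":
    dists = [0.5, 2.0, 8.0]  # illustrative squared distances, not real data
    for h in (1.0, 0.5, 0.1):
        soft = soft_min_term(dists, h)
        near = nearest_term(dists, h)
        # Sandwich: near - log(n) <= soft <= near, with n = len(dists).
        print(h, soft, near, near - soft)
```

The printed gap `near - soft` always lies between 0 and log n, which is why replacing the sum over samples by the single nearest sample changes the objective by at most a constant for a fixed number of samples.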
Figure 12: Visual comparisons between the MSE, MSE + GAN and MAE +GAN + Contextual Loss (The authors of [68]
and [116] released their results.) We can see that the perceptual loss leads to a lower PSNR/SSIM but a better visual quality.
there are still many underlying problems. We summarize the main challenges into three aspects: the acceleration of deep models, the extensive comprehension of deep models and the criteria for designing and evaluating the objective functions. Along with these challenges, several directions may be further explored in the future.

ACKNOWLEDGMENT

We are grateful to the authors of [47], [84], [71], [61], [68], [121], [116], [114], [96], [92], [100] for kindly releasing their experimental results or code, as well as to the three anonymous reviewers for their constructive criticism, which has significantly improved our manuscript. Moreover, we thank Qiqi Bao for helpful discussions.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[4] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the International Conference on Machine Learning, 2008, pp. 160–167.
[5] C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: A benchmark,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 372–386.
[6] R. Timofte, R. Rothe, and L. Van Gool, “Seven ways to improve example-based single image super resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1865–1873.
[7] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[8] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[9] S. C. Park, M. K. Park, and M. G. Kang, “Super-resolution image reconstruction: a technical overview,” IEEE Signal Processing Magazine, vol. 20, no. 3, pp. 21–36, 2003.
[10] R. Keys, “Cubic convolution interpolation for digital image processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 6, pp. 1153–1160, 1981.
[11] C. E. Duchon, “Lanczos filtering in one and two dimensions,” Journal of Applied Meteorology, vol. 18, no. 8, pp. 1016–1022, 1979.
[12] S. Dai, M. Han, W. Xu, Y. Wu, Y. Gong, and A. K. Katsaggelos, “Softcuts: a soft edge smoothness prior for color image super-resolution,” IEEE Transactions on Image Processing, vol. 18, no. 5, pp. 969–981, 2009.
[13] J. Sun, Z. Xu, and H.-Y. Shum, “Image super-resolution using gradient profile prior,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[14] Q. Yan, Y. Xu, X. Yang, and T. Q. Nguyen, “Single image superresolution based on gradient profile sharpness,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3187–3202, 2015.
[15] A. Marquina and S. J. Osher, “Image super-resolution by TV-regularization and Bregman iteration,” Journal of Scientific Computing, vol. 37, no. 3, pp. 367–382, 2008.
[16] W. T. Freeman, T. R. Jones, and E. C. Pasztor, “Example-based super-resolution,” IEEE Computer Graphics and Applications, vol. 22, no. 2, pp. 56–65, 2002.
[17] H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004, pp. 275–282.
[18] M. Aharon, M. Elad, A. Bruckstein et al., “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, p. 4311, 2006.
[19] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
[20] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in Proceedings of the International Conference on Curves and Surfaces, 2010, pp. 711–730.
[21] R. Timofte, V. De Smet, and L. Van Gool, “Anchored neighborhood regression for fast example-based super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1920–1927.
[22] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Proceedings of the Asian Conference on Computer Vision, 2014, pp. 111–126.
[23] F. Cao, M. Cai, Y. Tan, and J. Zhao, “Image super-resolution via adaptive $\ell_p$ (0 < p < 1) regularization and sparse representation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 7, pp. 1550–1561, 2016.
[24] J. Liu, W. Yang, X. Zhang, and Z. Guo, “Retrieval compensated group structured sparsity for image super-resolution,” IEEE Transactions on Multimedia, vol. 19, no. 2, pp. 302–316, 2017.
[25] S. Schulter, C. Leistner, and H. Bischof, “Fast and accurate image upscaling with super-resolution forests,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3791–3799.
[26] K. Zhang, D. Tao, X. Gao, X. Li, and J. Li, “Coarse-to-fine learning for single-image super-resolution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 5, pp. 1109–1122, 2017.
[27] J. Yu, X. Gao, D. Tao, X. Li, and K. Zhang, “A unified learning framework for single image super-resolution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 4, pp. 780–792, 2014.
[28] C. Deng, J. Xu, K. Zhang, D. Tao, X. Gao, and X. Li, “Similarity constraints-based structured output regression machine: An approach to image super-resolution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 12, pp. 2472–2485, 2016.
[29] W. Yang, Y. Tian, F. Zhou, Q. Liao, H. Chen, and C. Zheng, “Consistent coding scheme for single-image super-resolution via independent dictionaries,” IEEE Transactions on Multimedia, vol. 18, no. 3, pp. 313–325, 2016.
[30] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[31] H. A. Song and S.-Y. Lee, “Hierarchical representation using NMF,” in Proceedings of the International Conference on Neural Information Processing, 2013, pp. 466–473.
[32] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
[33] N. Rochester, J. Holland, L. Haibt, and W. Duda, “Tests on a cell assembly theory of the action of the brain, using a large digital computer,” IRE Transactions on Information Theory, vol. 2, no. 3, pp. 80–93, 1956.
[34] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, p. 533, 1986.
[35] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
[36] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[37] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[38] J. F. Kolen and S. C. Kremer, Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies. IEEE, 2001. [Online]. Available: https://ieeexplore.ieee.org/document/5264952
[39] G. E. Hinton, “Learning multiple layers of representation,” Trends in Cognitive Sciences, vol. 11, no. 10, pp. 428–434, 2007.
[40] D. C. Ciresan, U. Meier, J. Masci, L. Maria Gambardella, and J. Schmidhuber, “Flexible, high performance convolutional neural networks for image classification,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2011, pp. 1237–1242.
[41] D. Cireşan, U. Meier, J. Masci, and J. Schmidhuber, “Multi-column deep neural network for traffic sign classification,” Neural Networks, vol. 32, pp. 333–338, 2012.
[42] R. Salakhutdinov and H. Larochelle, “Efficient learning of deep Boltzmann machines,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010, pp. 693–700.
[43] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
[44] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” arXiv preprint arXiv:1401.4082, 2014.
[45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[46] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press Cambridge, 2016, vol. 1.
[47] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 184–199.
[48] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
[49] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[50] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 2018–2025.
[51] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016.
[52] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 818–833.
[53] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[54] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[55] W. Shi, J. Caballero, L. Theis, F. Huszar, A. Aitken, C. Ledig, and Z. Wang, “Is the deconvolution layer the same as a convolutional layer?” arXiv preprint arXiv:1609.07009, 2016.
[56] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 391–407.
[57] N. Efrat, D. Glasner, A. Apartsin, B. Nadler, and A. Levin, “Accurate blur models vs. image priors in single image super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2832–2839.
[58] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675–678.
[59] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: A system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
[60] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, “On the number of linear regions of deep neural networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 2924–2932.
[61] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654.
[62] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[63] J. Kim, J. Kwon Lee, and K. Mu Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1637–1645.
[64] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[65] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 630–645.
[66] A. Veit, M. J. Wilber, and S. Belongie, “Residual networks behave like ensembles of relatively shallow networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 550–558.
[67] D. Balduzzi, M. Frean, L. Leary, J. Lewis, K. W.-D. Ma, and B. McWilliams, “The shattered gradients problem: If resnets are the answer, then what is the question?” in Proceedings of the International Conference on Machine Learning, 2017, pp. 342–350.
[68] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[69] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the International Conference on Machine Learning, 2015, pp. 448–456.
[70] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3147–3155.
[71] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144.
[72] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Proceedings of the Association for the Advancement of Artificial Intelligence, 2017, pp. 4278–4284.
[73] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[74] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2017, pp. 4470–4478.
[75] T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4809–4817.
[76] Y. Tai, J. Yang, X. Liu, and C. Xu, “MemNet: A persistent memory network for image restoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4539–4547.
[77] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[78] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2472–2481.
[79] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, “Deep networks for image super-resolution with sparse prior,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 370–378.
[80] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the International Conference on Machine Learning, 2010, pp. 399–406.
[81] D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, and T. S. Huang, “Robust single image super-resolution via deep networks with sparse prior,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3194–3207, 2016.
[82] D. Liu, Z. Wang, N. Nasrabadi, and T. Huang, “Learning a mixture of deep networks for single image super-resolution,” in Proceedings of the Asian Conference on Computer Vision, 2016, pp. 145–156.
[83] W. Yang, J. Feng, J. Yang, F. Zhao, J. Liu, Z. Guo, and S. Yan, “Deep edge guided recurrent residual learning for image super-resolution,” IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5895–5907, 2017.
[84] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep Laplacian pyramid networks for fast and accurate super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 624–632.
[85] R. Dahl, M. Norouzi, and J. Shlens, “Pixel recursive super resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5439–5448.
[86] A. Singh and N. Ahuja, “Super-resolution using sub-band self-similarity,” in Proceedings of the Asian Conference on Computer Vision, 2014, pp. 552–568.
[87] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in Proceedings of the International Conference on Machine Learning, 2016, pp. 1747–1756.
[88] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with PixelCNN decoders,” in Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.
[89] M. Irani and S. Peleg, “Improving resolution by image registration,” CVGIP: Graphical Models and Image Processing, vol. 53, no. 3, pp. 231–239, 1991.
[90] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep backprojection networks for super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1664–1673.
[91] X. Wang, K. Yu, C. Dong, and C. Change Loy, “Recovering realistic texture in image super-resolution by deep spatial feature transform,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615.
[92] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3262–3271.
[93] R. Timofte, V. De Smet, and L. Van Gool, “Semantic super-resolution: When and where is it useful?” Computer Vision and Image Understanding, vol. 142, pp. 1–12, 2016.
[94] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plug-and-play priors for model based reconstruction,” in Proceedings of the IEEE Global Conference on Signal and Information Processing, 2013, pp. 945–948.
[95] T. Meinhardt, M. Moller, C. Hazirbas, and D. Cremers, “Learning proximal operators: Using denoising networks for regularizing inverse imaging problems,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1781–1790.
[96] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3929–3938.
[97] T. Tirer and R. Giryes, “Image restoration by iterative denoising and backward projections,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1220–1234, 2019.
[98] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9446–9454.
[99] K. Zhang, X. Gao, D. Tao, and X. Li, “Single image super-resolution with multiscale similarity learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1648–1659, 2013.
[100] A. Shocher, N. Cohen, and M. Irani, “Zero-shot super-resolution using deep internal learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3118–3126.
[101] M. Zontak and M. Irani, “Internal statistics of a single natural image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2011, pp. 977–984.
[102] T. Michaeli and M. Irani, “Nonparametric blind super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 945–952.
[103] T. Tirer and R. Giryes, “Super-resolution based on image-adapted CNN denoisers: Incorporating generalization of training data and internal learning in test time,” arXiv preprint arXiv:1811.12866, 2018.
[104] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[105] N. Ahn, B. Kang, and K.-A. Sohn, “Fast, accurate, and lightweight super-resolution with cascading residual network,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 252–268.
[106] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proceedings of the IEEE International Conference on Computer Vision, 2001, pp. 416–423.
[107] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE International Conference on Computer Vision, 2009, pp. 248–255.
[108] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 126–135.
[109] Z. Yang, K. Zhang, Y. Liang, and J. Wang, “Single image super-resolution with a parameter economic residual-like convolutional neural network,” in Proceedings of the International Conference on Multimedia Modeling, 2017, pp. 353–364.
[110] Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image super-resolution via information distillation network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 723–731.
[111] X. Fan, Y. Yang, C. Deng, J. Xu, and X. Gao, “Compressed multi-scale feature fusion network for single image super-resolution,” Signal Processing, vol. 146, pp. 50–60, 2018.
[112] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for neural networks for image processing,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47–51, 2017.
[113] J. Bruna, P. Sprechmann, and Y. LeCun, “Super-resolution with deep convolutional sufficient statistics,” arXiv preprint arXiv:1511.05666, 2015.
[114] J. Johnson, A. Alahi, and F.-F. Li, “Perceptual losses for real-time style transfer and super-resolution,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 694–711.
[115] R. Mechrez, I. Talmi, F. Shama, and L. Zelnik-Manor, “Learning to maintain natural image statistics,” arXiv preprint arXiv:1803.04626, 2018.
[116] R. Mechrez, I. Talmi, and L. Zelnik-Manor, “The contextual loss for image transformation with non-aligned data,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 768–783.
[117] K. Li and J. Malik, “Implicit maximum likelihood estimation,” arXiv preprint arXiv:1809.09087, 2018.
[118] K. Li, S. Peng, and J. Malik, “Super-resolution via conditional implicit maximum likelihood estimation,” arXiv preprint arXiv:1810.01406, 2018.
[119] F. Huszár, “How (not) to train your generative model: Scheduled sampling, likelihood, adversary?” arXiv preprint arXiv:1511.05101, 2015.
[120] L. Theis, A. v. d. Oord, and M. Bethge, “A note on the evaluation of generative models,” arXiv preprint arXiv:1511.01844, 2015.
[121] M. S. Sajjadi, B. Schölkopf, and M. Hirsch, “EnhanceNet: Single image super-resolution through automated texture synthesis,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4501–4510.
[122] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[123] Y. Yuan, S. Liu, J. Zhang, Y. Zhang, C. Dong, and L. Lin, “Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 814–823.
[124] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the International Conference on Machine Learning, 2017, pp. 214–223.
[125] S. Nowozin, B. Cseke, and R. Tomioka, “f-GAN: Training generative neural samplers using variational divergence minimization,” in Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 271–279.
[126] D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton, “Generative models and model criticism via optimized maximum mean discrepancy,” arXiv preprint arXiv:1611.04488, 2016.
[127] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6228–6237.