

Learning Enriched Features for Fast Image Restoration and Enhancement

Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao

Abstract—Given a degraded input image, image restoration aims to recover the missing high-quality image content. Numerous applications demand effective image restoration, e.g., computational photography, surveillance, autonomous vehicles, and remote sensing. Significant advances in image restoration have been made in recent years, dominated by convolutional neural networks (CNNs). The widely-used CNN-based methods typically operate either on full-resolution or on progressively low-resolution representations. In the former case, spatial details are preserved but the contextual information cannot be precisely encoded. In the latter case, generated outputs are semantically reliable but spatially less accurate. This paper presents a new architecture with a holistic goal of maintaining spatially-precise high-resolution representations through the entire network, while receiving complementary contextual information from the low-resolution representations. The core of our approach is a multi-scale residual block containing the following key elements: (a) parallel multi-resolution convolution streams for extracting multi-scale features, (b) information exchange across the multi-resolution streams, (c) a non-local attention mechanism for capturing contextual information, and (d) attention-based multi-scale feature aggregation. Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details. Extensive experiments on six real image benchmark datasets demonstrate that our method, named MIRNet-v2, achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement. The source code and pre-trained models are available at https://github.com/swz30/MIRNetv2.

Index Terms—Multi-scale Feature Representation, Dual-pixel Defocus Deblurring, Image Denoising, Super-resolution, Low-light
Image Enhancement, and Contrast Enhancement

• S.W. Zamir and A. Arora are with the Inception Institute of Artificial Intelligence, UAE. E-mail: [email protected]
• S. Khan and F.S. Khan are with Mohammed Bin Zayed University of Artificial Intelligence, UAE.
• M. Hayat is with Monash University, Melbourne, Australia.
• M.-H. Yang is with the University of California at Merced, and Google, USA.
• L. Shao is with Terminus Group, China.

1 INTRODUCTION

Owing to the physical limitations of cameras or due to complicated lighting conditions, image degradations of varying severity are often introduced as part of image acquisition. For instance, smartphone cameras come with a narrow aperture and have small sensors with limited dynamic range. Consequently, they frequently generate noisy and low-contrast images. Similarly, images captured under unsuitable lighting are either too dark or too bright. Image restoration aims to recover the original clean image from its corrupted measurements. It is an ill-posed inverse problem, due to the existence of many possible solutions.

Recent advances in image restoration and enhancement have been led by deep learning models, as they can learn strong (generalizable) priors from large-scale datasets. Existing CNNs typically follow one of two architecture designs: 1) an encoder-decoder, or 2) high-resolution (single-scale) feature processing. The encoder-decoder models [1], [2], [3], [4] first progressively map the input to a low-resolution representation, and then apply a gradual reverse mapping to the original resolution. Although these approaches learn a broad context by spatial-resolution reduction, on the downside, the fine spatial details are lost, making it extremely hard to recover them in the later stages. On the other hand, the high-resolution (single-scale) networks [5], [6], [7], [8] do not employ any downsampling operation, and thereby recover better spatial details. However, these networks have a limited receptive field and are less effective in encoding contextual information.

Image restoration is a position-sensitive procedure, where pixel-to-pixel correspondence from the input image to the output image is needed. Therefore, it is important to remove only the undesired degraded image content, while carefully preserving the desired fine spatial details (such as true edges and texture). Such functionality for segregating the degraded content from the true signal can be better incorporated into CNNs with the help of large context, e.g., by enlarging the receptive field. Towards this goal, we develop a new multi-scale approach that maintains the original high-resolution features along the network hierarchy, thus minimizing the loss of precise spatial details. Simultaneously, our model encodes multi-scale context by using parallel convolution streams that process features at lower spatial resolutions. The multi-resolution parallel branches operate in a manner that is complementary to the main high-resolution branch, thereby providing us more precise and contextually enriched feature representations.

One main distinction between our method and the existing multi-scale image processing approaches is how we aggregate contextual information. The existing methods [11], [12], [13] process each scale in isolation. In contrast, we progressively exchange and fuse information from coarse-to-fine resolution levels. Furthermore, different from existing methods that employ a simple concatenation or averaging
of features coming from multi-resolution branches, we introduce a new selective kernel fusion approach that dynamically selects the useful set of kernels from each branch's representations using a self-attention mechanism. More importantly, the proposed fusion block combines features with varying receptive fields, while preserving their distinctive complementary characteristics.

The main contributions of this work include:

• A novel feature extraction model that obtains a complementary set of features across multiple spatial scales, while maintaining the original high-resolution features to preserve precise spatial details (Sec. 3).
• A regularly repeated mechanism for information exchange, where the features from coarse-to-fine resolution branches are progressively fused together for improved representation learning (Sec. 3.1).
• A new approach to fuse multi-scale features using a selective kernel network that dynamically combines variable receptive fields and faithfully preserves the original feature information at each spatial resolution (Sec. 3.1.1).

A preliminary version of this work has been published as a conference paper [9]. The MIRNet model [9] is expensive in terms of size and speed. In this work, we make several key modifications to MIRNet [9] that allow us to significantly reduce the computational cost while enhancing model performance (see Table 1). Specifically, in the proposed MIRNet-v2: (a) we demonstrate that feature fusion only in the direction from low- to high-resolution streams performs best, and that the information flow from high- to low-resolution branches can be removed to improve efficiency; (b) we replace the dual attention unit with a new residual contextual block (RCB), and further introduce group convolutions in RCB that are capable of learning unique representations in each filter group while being more resource efficient than standard convolutions; (c) we employ progressive learning to improve training speed: the network is trained on small image patches in the early epochs and on gradually larger patches in the later training epochs; and (d) we show the effectiveness of the proposed design on a new task of dual-pixel defocus deblurring [14] alongside the other image processing tasks of image denoising, super-resolution and image enhancement. Our MIRNet-v2 achieves state-of-the-art results on all six datasets. Furthermore, we extensively evaluate our approach on practical challenges, such as generalization ability across datasets (Sec. 4).

In Table 1, we compare MIRNet-v2 with MIRNet [9] under the same training and inference settings. The results show that MIRNet-v2 is more accurate (improving PSNR from 39.72 dB to 39.84 dB), while reducing the number of parameters and FLOPs by ∼81%, convolutions by 36%, and activations by 69%. Furthermore, the training and inference speed is increased by 2.2× and 3.6×, respectively.

TABLE 1: Comparison between MIRNet-v2 and MIRNet [9] under the same experimental settings for the image denoising task on the SIDD benchmark dataset [10]. FLOPs and inference times are computed on an image of size 256×256. When compared to MIRNet [9], MIRNet-v2 is more accurate, while being significantly lighter and faster.

Method            PSNR   Params (M)   FLOPs (B)    Convs        Activations (M)  Train Time (h)  Inference Time (ms)
MIRNet [9]        39.72  31.79        785          635          1270             139             142
MIRNet-v2 (Ours)  39.84  5.9 (81% ↓)  140 (82% ↓)  406 (36% ↓)  390 (69% ↓)      63 (55% ↓)      39 (72% ↓)

2 RELATED WORK

Rapidly growing image content necessitates the development of effective image restoration and enhancement algorithms. In this paper, we propose a new method capable of performing dual-pixel defocus deblurring, image denoising, super-resolution, and image enhancement. Unlike existing works for these problems, our approach processes features at the original resolution in order to preserve spatial details, while effectively fusing contextual information from multiple parallel branches. Next, we briefly describe the representative methods for each of the studied problems.

2.1 Dual-Pixel Defocus Deblurring

Images captured with a wide camera aperture have a shallow depth of field (DoF), where the scene regions that lie outside the DoF are out-of-focus. Given an image with defocus blur, the goal of defocus deblurring is to generate an all-in-focus image. Existing defocus deblurring approaches either directly deblur images [14], [15], [16], [17], or first estimate the defocus disparity map and then use it to guide the deblurring procedure [18], [19], [20]. Modern cameras are equipped with a dual-pixel sensor that has two photodiodes at each pixel location, thereby generating two sub-aperture views. The phase difference between these views is useful in measuring the amount of defocus blur at each scene point. Recently, Abuolaim et al. [14] presented a dual-pixel deblurring dataset (DPDD) and a new method based on an encoder-decoder design. In this paper our focus is also on deblurring images directly using the dual-pixel data as in [14], [16]. Previous defocus deblurring works [14], [16] employ an encoder-decoder that repeatedly uses the downsampling operation, thus causing significant loss of fine detail, whereas the architectural design of our approach enables preservation of the desired textural details in the restored image.

2.2 Image Denoising

Classic denoising methods are mainly based on modifying transform coefficients [21], [22] or averaging neighborhood pixels [23], [24], [25]. Although the classical approaches perform well, the self-similarity [26] based algorithms, e.g., NLM [27] and BM3D [28], demonstrate promising denoising performance. Numerous patch-based schemes that exploit redundancy (self-similarity) in images were later developed [29], [30], [31], [32]. Recently, deep learning models [6], [9], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44] have made significant advances in image denoising, yielding more favorable results than those of hand-crafted methods.
Fig. 1: Framework of the proposed MIRNet-v2 that learns enriched feature representations for image restoration and enhancement. MIRNet-v2 is based on a recursive residual design. At the core of MIRNet-v2 is the multi-scale residual block (MRB), whose main branch is dedicated to maintaining spatially-precise high-resolution representations through the entire network and whose complementary set of parallel branches provides better contextualized features.

2.3 Image Super-Resolution

Prior to the deep-learning era, numerous super-resolution (SR) algorithms were proposed based on sampling theory [45], [46], edge-guided interpolation [47], [48], natural image priors [49], [50], patch-exemplars [51], [52] and sparse representations [53], [54]. Currently, deep-learning techniques are being actively explored as they provide dramatically improved results over conventional algorithms. The data-driven SR approaches differ according to their architecture designs [55], [56], [57]. Early methods [5], [58] take a low-resolution (LR) image as input and learn to directly generate its high-resolution (HR) version. In contrast to directly producing a latent HR image, recent SR networks [59], [60], [61], [62] employ the residual learning framework [63] to learn the high-frequency image detail, which is later added to the input LR image to produce the final result. Other networks designed to perform SR include recursive learning [64], [65], [66], progressive reconstruction [67], [68], dense connections [7], [69], [70], attention mechanisms [71], [72], [73], multi-branch learning [68], [74], [75], [76], and generative adversarial networks (GANs) [70], [77], [78], [79].

2.4 Image Enhancement

Oftentimes, cameras generate images that lack vivid details or contrast. A number of factors contribute to the low quality of images, including unsuitable lighting conditions and physical limitations of camera devices. For image enhancement, histogram equalization is the most commonly used approach. However, it frequently produces under- or over-enhanced images. Motivated by the Retinex theory [80], several enhancement algorithms mimicking human vision have been proposed in the literature [81], [82], [83], [84]. Recently, CNNs have been successfully applied to general, as well as low-light, image enhancement problems [85]. Notable works employ Retinex-inspired networks [4], [86], [87], [88], encoder-decoder networks [89], [90], [91], [92], [93], and GANs [94], [95], [96].

3 PROPOSED METHOD

A schematic of the proposed MIRNet-v2 is shown in Fig. 1. We first present an overview of the proposed MIRNet-v2 for image restoration and enhancement. We then provide details of the multi-scale residual block, which is the fundamental building block of our method, containing several key elements: (a) parallel multi-resolution convolution streams for extracting (fine-to-coarse) semantically-richer and (coarse-to-fine) spatially-precise feature representations, (b) information exchange across multi-resolution streams, (c) attention-based aggregation of features arriving from different streams, and (d) residual contextual blocks to extract attention-based features.

Overall Pipeline. Given an image I ∈ R^(H×W×3), the proposed model first applies a convolutional layer to extract low-level features F0 ∈ R^(H×W×C). Next, the feature maps F0 pass through N recursive residual groups (RRGs), yielding deep features Fn ∈ R^(H×W×C). We note that each RRG contains several multi-scale residual blocks, which are described in Section 3.1. Next, we apply a convolution layer to the deep features Fn and obtain a residual image R ∈ R^(H×W×3). Finally, the restored image is obtained as Î = I + R. We optimize the proposed network using the Charbonnier loss [97]:

L(Î, I∗) = √(‖Î − I∗‖² + ε²),   (1)

where I∗ denotes the ground-truth image, and ε is a constant which we empirically set to 10⁻³ for all the experiments.
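For concreteness, the following PyTorch-style sketch wires up the overall pipeline and the Charbonnier objective described above. It is a minimal illustration rather than the authors' released implementation: the default channel width (80) and the number of RRGs (4) follow the implementation details given later in Sec. 4.2, the 3×3 kernel sizes and the `rrg_builder` placeholder are assumptions, and the loss follows Eq. (1) literally.

```python
import torch
import torch.nn as nn

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    # Eq. (1): sqrt(||I_hat - I*||^2 + eps^2), with eps = 1e-3; some released
    # implementations instead take the square root per pixel and average.
    return torch.sqrt(((pred - target) ** 2).sum() + eps ** 2)

class MIRNetV2Skeleton(nn.Module):
    """High-level wiring only: conv -> N recursive residual groups (RRGs) -> conv -> residual add.
    The RRG internals (MRBs built from SKFF and RCB) are sketched in the following sections."""

    def __init__(self, in_channels: int = 3, features: int = 80, num_rrg: int = 4, rrg_builder=None):
        super().__init__()
        make = rrg_builder if rrg_builder is not None else (lambda c: nn.Identity())
        self.head = nn.Conv2d(in_channels, features, 3, padding=1)            # low-level features F0
        self.body = nn.Sequential(*[make(features) for _ in range(num_rrg)])  # deep features Fn
        self.tail = nn.Conv2d(features, in_channels, 3, padding=1)            # residual image R

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return img + self.tail(self.body(self.head(img)))                     # I_hat = I + R

# Example usage: one forward pass and the loss against a ground-truth image.
net = MIRNetV2Skeleton()
x, gt = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
loss = charbonnier_loss(net(x), gt)
```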
3.1 Multi-Scale Residual Block

To encode context, existing CNNs [1], [98], [99], [100], [101], [102] typically employ the following architecture design: (a) the receptive field of neurons is fixed in each layer/stage, (b) the spatial size of feature maps is gradually reduced to generate a semantically strong low-resolution representation, and (c) a high-resolution representation is gradually recovered from the low-resolution representation. However, it is well understood in vision science that in the primate visual cortex, the sizes of the local receptive fields of neurons in the same region are different [103], [104], [105], [106]. Therefore, a similar mechanism of collecting multi-scale spatial information in the same layer is more effective when incorporated within CNNs [107], [108], [109], [110]. Motivated by this, we propose the multi-scale residual block (MRB), as shown in Fig. 1. It is capable of generating a spatially-precise output by maintaining high-resolution representations, while receiving rich contextual information from low resolutions. The MRB consists of multiple (three in this paper) fully-convolutional streams connected in parallel that operate on feature maps of varying resolution (ranging from low to high). It allows contextualized-information transfer from the low-resolution streams to consolidate the high-resolution features. Next, we describe the individual components of MRB.

Fig. 2: Schematic for selective kernel feature fusion (SKFF). It operates on features from different resolution streams, and performs aggregation based on self-attention.

3.1.1 Selective Kernel Feature Fusion

One fundamental property of neurons present in the visual cortex is their ability to change receptive fields according to the stimulus [111]. This mechanism of adaptively adjusting receptive fields can be incorporated in CNNs by using multi-scale feature generation (in the same layer) followed by feature aggregation and selection. The most commonly used approaches for feature aggregation include simple concatenation or summation. However, these choices provide limited expressive power to the network, as reported in [111]. In MRB, we introduce a nonlinear procedure for fusing features coming from different resolution streams using a self-attention mechanism. Motivated by [111], we call it selective kernel feature fusion (SKFF).

The SKFF module performs dynamic adjustment of receptive fields via two operations – Fuse and Select, as illustrated in Fig. 2. The Fuse operator generates global feature descriptors by combining the information from multi-resolution streams. The Select operator uses these descriptors to recalibrate the feature maps (of different streams) followed by their aggregation. Next, we provide details of both operators. (1) Fuse: SKFF receives inputs from two parallel convolution streams carrying different scales of information. We first combine these multi-scale features using an element-wise sum as L = L1 + L2. We then apply global average pooling (GAP) across the spatial dimensions of L ∈ R^(H×W×C) to compute channel-wise statistics s ∈ R^(1×1×C). Next, we apply a channel-downscaling convolution layer to generate a compact feature representation z ∈ R^(1×1×r), where r = C/8 for all our experiments. Finally, the feature vector z passes through two parallel channel-upscaling convolution layers (one for each resolution stream) and provides us with two feature descriptors v1 and v2, each with dimensions 1×1×C. (2) Select: This operator applies the softmax function to v1 and v2, yielding attention activations s1 and s2 that we use to adaptively recalibrate the multi-scale feature maps L1 and L2, respectively. The overall process of feature recalibration and aggregation is defined as U = s1 · L1 + s2 · L2. Note that the SKFF uses ∼5× fewer parameters than aggregation with concatenation but generates more favorable results (an ablation study is provided in the experiments section).
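Below is a hedged PyTorch-style sketch of a two-stream SKFF following the Fuse/Select description above (r = C/8, softmax over the stream dimension). The activation after the channel-downscaling convolution is an assumption, since the text does not specify it, and the released code may organize the layers differently.

```python
import torch
import torch.nn as nn

class SKFF(nn.Module):
    """Selective kernel feature fusion for streams of shape (B, C, H, W)."""

    def __init__(self, channels: int, reduction: int = 8, num_streams: int = 2):
        super().__init__()
        r = channels // reduction                                    # r = C/8 in the paper
        self.fuse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                 # GAP -> channel statistics s
            nn.Conv2d(channels, r, kernel_size=1),                   # channel-downscaling -> z
            nn.PReLU(),                                              # activation choice is an assumption
        )
        # one channel-upscaling convolution per stream (descriptors v1, v2)
        self.select = nn.ModuleList(
            [nn.Conv2d(r, channels, kernel_size=1) for _ in range(num_streams)])

    def forward(self, streams):
        z = self.fuse(sum(streams))                                  # Fuse: L = L1 + L2 -> z
        v = torch.stack([conv(z) for conv in self.select], dim=1)    # (B, S, C, 1, 1)
        attn = torch.softmax(v, dim=1)                               # Select: softmax across streams
        stacked = torch.stack(list(streams), dim=1)                  # (B, S, C, H, W)
        return (attn * stacked).sum(dim=1)                           # U = s1*L1 + s2*L2

# Example: fusing a high-resolution stream with an (already upsampled) low-resolution stream.
skff = SKFF(channels=80)
u = skff([torch.rand(1, 80, 64, 64), torch.rand(1, 80, 64, 64)])
```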
3.1.2 Residual Contextual Block

While the SKFF block fuses information across multi-resolution branches, we also need a distillation mechanism to extract useful information from within a feature tensor. Motivated by the advances of recent low-level vision methods [33], [71], [72], [73] which incorporate attention mechanisms [112], [113], [114], [115], we propose the residual contextual block (RCB) to extract features in the convolutional streams. The schematic of RCB is shown in Fig. 3. The RCB suppresses less useful features and only allows more informative ones to pass further. The overall process of RCB is summarized as:

F_RCB = Fa + W(CM(Fb)),   (2)

where Fb ∈ R^(H×W×C) represents feature maps that are obtained by applying two 3×3 group convolution layers to the input features Fa ∈ R^(H×W×C) at the beginning of the RCB. These group convolutions are more resource efficient than standard convolutions and capable of learning unique representations in each filter group. W denotes the last convolutional layer with filter size 1×1. CM stands for the contextual module, which is realized in three parts. (1) Context modeling: From the original feature maps Fb, we first generate new features Fc ∈ R^(1×1×HW) by applying a 1×1 convolution followed by reshaping and softmax operations. Next, we reshape Fb to R^(1×HW×C) and perform matrix multiplication with Fc to obtain the global feature descriptor Fd ∈ R^(1×1×C). (2) Feature transform: To capture the inter-channel dependencies, we pass the descriptor Fd through two 1×1 convolutions, resulting in new attention features Fe ∈ R^(1×1×C). (3) Feature fusion: We employ an element-wise addition operation to aggregate the contextual features Fe to each position of the original features Fb.

Fig. 3: Architecture of the residual contextual block (RCB). In the first two group convolution layers, g represents the number of groups. ⊗ denotes matrix multiplication.
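A corresponding sketch of the RCB and its contextual module (CM) is given below, following Eq. (2) and the three-part description above. The bottleneck width inside the two 1×1 transform convolutions and the activation functions are assumptions; only the overall structure (two grouped 3×3 convolutions, global context modelling via softmax attention and matrix multiplication, broadcast addition, a final 1×1 convolution, and the residual connection) is taken from the text.

```python
import torch
import torch.nn as nn

class ContextualModule(nn.Module):
    """CM in Eq. (2): context modelling, feature transform, and feature fusion."""

    def __init__(self, channels: int, squeeze: int = 8):
        super().__init__()
        self.mask = nn.Conv2d(channels, 1, kernel_size=1)            # 1x1 conv for the attention map
        self.transform = nn.Sequential(                              # two 1x1 convs (bottleneck width assumed)
            nn.Conv2d(channels, channels // squeeze, kernel_size=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels // squeeze, channels, kernel_size=1),
        )

    def forward(self, fb: torch.Tensor) -> torch.Tensor:
        b, c, h, w = fb.shape
        fc = torch.softmax(self.mask(fb).view(b, 1, h * w), dim=-1)  # (1) Fc: softmax over the H*W positions
        fd = torch.bmm(fb.view(b, c, h * w), fc.transpose(1, 2))     # global descriptor Fd, shape (B, C, 1)
        fe = self.transform(fd.view(b, c, 1, 1))                     # (2) attention features Fe
        return fb + fe                                               # (3) broadcast-add Fe to every position

class RCB(nn.Module):
    """Residual contextual block: F_RCB = Fa + W(CM(Fb)), Eq. (2)."""

    def __init__(self, channels: int, groups: int = 2):
        super().__init__()
        self.body = nn.Sequential(                                   # two 3x3 group convolutions -> Fb
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),
        )
        self.cm = ContextualModule(channels)
        self.last = nn.Conv2d(channels, channels, kernel_size=1)     # W: last 1x1 convolution

    def forward(self, fa: torch.Tensor) -> torch.Tensor:
        return fa + self.last(self.cm(self.body(fa)))

# Example usage on the highest-resolution stream (80 channels, g = 2).
rcb = RCB(channels=80, groups=2)
y = rcb(torch.rand(1, 80, 64, 64))
```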
3.2 Progressive Training Regime

When considering the image patch size for network training, there is a trade-off between training speed and test-time accuracy [116], [117]. On large patches, CNNs capture fine image details and provide improved results, but they are slower to train. Training on small image patches is faster, but comes at the cost of an accuracy drop. To strike the right balance between training speed and accuracy, we propose a progressive learning method where the network is trained on smaller image patches in the early epochs and on gradually larger patches in the later training epochs. This approach can also be understood as a curriculum learning process where the network sequentially moves from learning a simpler task to a more complex one (where modeling of fine details is required). The progressive learning strategy on mixed-size image patches not only improves the training speed but also enhances the model performance at test time, where the input images can be of different sizes (which is common in image restoration problems).
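As a small illustration of this regime, the helper below maps the training iteration to a crop size drawn from the patch sizes listed later in Sec. 4.2 (128, 144, 192, 224). The iteration milestones are placeholders, since the paper does not state when each size takes over.

```python
import random

# Placeholder milestones: only the patch sizes themselves come from the paper.
PATCH_SCHEDULE = [(0, 128), (100_000, 144), (175_000, 192), (250_000, 224)]

def patch_size_at(iteration: int) -> int:
    """Progressive learning (Sec. 3.2): small crops early, larger crops later."""
    size = PATCH_SCHEDULE[0][1]
    for start_iter, ps in PATCH_SCHEDULE:
        if iteration >= start_iter:
            size = ps
    return size

def random_crop_pair(inp, target, ps: int):
    """Crop spatially aligned ps x ps patches from an (H, W, C) input/target pair."""
    h, w = inp.shape[:2]
    top, left = random.randint(0, h - ps), random.randint(0, w - ps)
    return inp[top:top + ps, left:left + ps], target[top:top + ps, left:left + ps]
```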

4 EXPERIMENTS

In this section, we perform qualitative and quantitative assessments of the results produced by our MIRNet-v2 and compare it with the state-of-the-art methods. Next, we describe the datasets, and then provide the implementation details. Finally, we report results for (a) dual-pixel defocus deblurring, (b) image denoising, (c) image super-resolution and (d) image enhancement, on six real image datasets.

4.1 Real Image Datasets

Dual-pixel defocus deblurring. The DPDD dataset [14] contains 500 indoor/outdoor scenes captured with a DSLR camera. Each scene consists of two defocus-blurred sub-aperture views captured with a wide camera aperture, and the corresponding all-in-focus ground-truth image captured with a narrow aperture. The DPDD dataset is divided into 350 images for training, 74 images for validation and 76 images for testing.

Image denoising. (1) DND [118] consists of 50 images captured with four consumer cameras. Since the images are of very high resolution, the dataset providers extract 20 crops of size 512×512 from each image, yielding 1000 patches in total. All these patches are used for testing (as DND does not contain training or validation sets). The ground-truth noise-free images are not released publicly, therefore the image quality scores in terms of PSNR and SSIM can only be obtained through an online server [119]. (2) SIDD [10] is collected with smartphone cameras. Due to the small sensor and high resolution, the noise levels in smartphone images are much higher than those of DSLRs. SIDD contains 320 image pairs for training and 1280 for validation.

Super-resolution. RealSR [120] contains real-world LR-HR image pairs of the same scene captured by adjusting the focal length of the cameras. RealSR has both indoor and outdoor images taken with two cameras. The numbers of training image pairs for scale factors ×2, ×3 and ×4 are 183, 234 and 178, respectively. For each scale factor, 30 test images are also provided in RealSR.

Image enhancement. (1) LoL [87] is created for the low-light image enhancement problem. It provides 485 images for training and 15 for testing. Each image pair in LoL consists of a low-light input image and its corresponding well-exposed reference image. (2) MIT-Adobe FiveK [121] contains 5000 images of various indoor and outdoor scenes captured with DSLR cameras in different lighting conditions. The tonal attributes of all images are manually adjusted by five different trained photographers (labelled as experts A to E). Similar to [122], [123], [124], we also consider the enhanced images of expert C as the ground-truth. Moreover, the first 4500 images are used for training and the last 500 for testing.

4.2 Implementation Details

The proposed architecture is end-to-end trainable and requires no pre-training of sub-modules. We train four different networks for four different restoration tasks. For dual-pixel defocus deblurring, we concatenate the left and right sub-aperture images and feed them as input to the network. The training parameters, common to all experiments, are the following. We use 4 RRGs, each of which further contains 2 MRBs. The MRB has 3 parallel streams with channel dimensions of 80, 120 and 180 at resolutions 1, 1/2 and 1/4, respectively. Each stream in an MRB has 2 RCBs with shared parameters. The models are trained with the Adam optimizer (β1 = 0.9 and β2 = 0.999) for 3×10⁵ iterations. The initial learning rate is set to 2×10⁻⁴. We employ the cosine annealing strategy [125] to steadily decrease the learning rate from the initial value to 10⁻⁶ during training. For progressive training, we use image patch sizes of 128, 144, 192, and 224. The batch size is set to 64 and, for data augmentation, we perform horizontal and vertical flips.
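The optimization settings above translate into a few lines of PyTorch; the snippet below is a sketch under the assumption that the schedule is stepped once per iteration, with a plain convolution standing in for the full network.

```python
import torch
import torch.nn as nn

total_iters = 300_000                                           # 3 x 10^5 iterations
model = nn.Conv2d(3, 3, 3, padding=1)                           # stand-in for MIRNet-v2
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_iters, eta_min=1e-6)                 # cosine decay from 2e-4 to 1e-6
# Each iteration: sample a batch of 64 crops at the current progressive patch size,
# apply random horizontal/vertical flips, compute the Charbonnier loss, then
# optimizer.step() followed by scheduler.step().
```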
4.3 Dual-Pixel Defocus Deblurring

We compare the performance of the proposed MIRNet-v2 with the conventional defocus deblurring methods (EBDB [18] and JNB [19]) as well as the learning-based approaches (DMENet [20], DPDNet [14], and RDPD [16]). Table 2 shows that our method achieves state-of-the-art results for both the indoor and outdoor scene categories. In particular, our MIRNet-v2 achieves a 0.86 dB PSNR improvement over the previous best method RDPD [16] on indoor images and 0.77 dB on outdoor images. When both scene categories are combined, our method shows performance gains of 0.81 dB over RDPD [16] and 1.07 dB over the second best method DPDNet [14].

TABLE 2: Dual-pixel defocus deblurring comparisons on the DPDD dataset [14]. The test set of DPDD contains 37 indoor scenes and 39 outdoor scenes.

                   Indoor Scenes                      Outdoor Scenes                     Combined
Method             PSNR ↑  SSIM ↑  MAE ↓   LPIPS ↓    PSNR ↑  SSIM ↑  MAE ↓   LPIPS ↓    PSNR ↑  SSIM ↑  MAE ↓   LPIPS ↓
EBDB [18]          25.77   0.772   0.040   0.297      21.25   0.599   0.058   0.373      23.45   0.683   0.049   0.336
DMENet [20]        25.50   0.788   0.038   0.298      21.43   0.644   0.063   0.397      23.41   0.714   0.051   0.349
JNB [19]           26.73   0.828   0.031   0.273      21.10   0.608   0.064   0.355      23.84   0.715   0.048   0.315
DPDNet [14]        27.48   0.849   0.029   0.189      22.90   0.726   0.052   0.255      25.13   0.786   0.041   0.223
RDPD [16]          28.10   0.843   0.027   0.210      22.82   0.704   0.053   0.298      25.39   0.772   0.040   0.255
MIRNet-v2 (Ours)   28.96   0.881   0.024   0.154      23.59   0.753   0.049   0.205      26.20   0.816   0.037   0.180

In Fig. 4, we provide defocus-deblurred results produced by different methods for both indoor and outdoor scenes. It is noticeable that our method effectively removes the spatially varying defocus blur and produces images that are sharper and visually more faithful to the ground-truth than those of the compared approaches.

Fig. 4: Visual comparisons for dual-pixel defocus deblurring on the DPDD dataset [14]. Compared to the other approaches, our MIRNet-v2 more effectively removes blur while preserving the fine image details.
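For the dual-pixel task, Sec. 4.2 states that the two sub-aperture views are concatenated and fed to the network; a minimal sketch of that input handling is shown below (the 6-channel head convolution is an assumption that simply follows from the concatenation).

```python
import torch
import torch.nn as nn

left = torch.rand(1, 3, 256, 256)                    # left sub-aperture view
right = torch.rand(1, 3, 256, 256)                   # right sub-aperture view
dp_input = torch.cat([left, right], dim=1)           # (B, 6, H, W) dual-pixel input
head = nn.Conv2d(6, 80, 3, padding=1)                # first conv must accept 6 channels
features = head(dp_input)
```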

4.4 Image Denoising

In this section, we demonstrate the effectiveness of the proposed MIRNet-v2 for image denoising. We train our network only on the training set of SIDD [10] and directly evaluate it on the test images of both the SIDD and DND [118] datasets. Quantitative comparisons in terms of the PSNR and SSIM metrics are summarized in Table 3. Our MIRNet-v2 performs favourably against the data-driven, as well as conventional, denoising algorithms. Specifically, when compared to the recent best methods, our algorithm demonstrates a performance gain of 0.32 dB over CycleISP [38] on SIDD and 0.11 dB over DAGL [127] on DND. Furthermore, it is worth noting that CycleISP [38] uses additional training data, yet our method yields considerably better results.

TABLE 3: Denoising comparisons on the SIDD [10] and DND [118] datasets. ∗ indicates methods that use additional training data, whereas our MIRNet-v2 is only trained on the SIDD images and directly tested on DND.

                    SIDD [10]           DND [118]
Method              PSNR ↑   SSIM ↑     PSNR ↑   SSIM ↑
DnCNN [6]           23.66    0.583      32.43    0.790
MLP [126]           24.71    0.641      34.23    0.833
BM3D [28]           25.65    0.685      34.51    0.851
CBDNet* [35]        30.78    0.801      38.06    0.942
DAGL [127]          38.94    0.953      39.77    0.956
RIDNet* [33]        38.71    0.951      39.26    0.953
AINDNet* [42]       38.95    0.952      39.37    0.951
VDN [41]            39.28    0.956      39.38    0.952
DeamNet* [128]      39.47    0.957      39.63    0.953
SADNet* [39]        39.46    0.957      39.59    0.952
DANet+* [40]        39.47    0.957      39.58    0.955
CycleISP* [38]      39.52    0.957      39.56    0.956
MIRNet-v2 (Ours)    39.84    0.959      39.86    0.955

Fig. 5 shows a visual comparison of our results with those of other competing algorithms. The MIRNet-v2 is effective in removing real noise and produces perceptually pleasing and sharp images. Moreover, it can maintain the spatial smoothness of homogeneous regions without introducing artifacts. In contrast, most of the other methods either yield over-smooth images and thus sacrifice structural content and fine textural details, or produce images with chroma artifacts and blotchy texture.

Fig. 5: Image denoising comparisons. The first two examples are from SIDD [10] and the last is from DND [118]. The proposed MIRNet-v2 better preserves fine texture and structural patterns in the denoised images.

Generalization capability. The DND and SIDD datasets are acquired with different sets of cameras having different noise characteristics. Since the DND benchmark does not provide training data, setting a new state-of-the-art on DND with our SIDD-trained network indicates the good generalization capability of our approach.

4.5 Super-Resolution

We compare our MIRNet-v2 against the state-of-the-art SR algorithms (VDSR [59], SRResNet [79], RCAN [71], LP-KPN [120]) on the testing images of RealSR [120] for upscaling factors of ×2, ×3 and ×4. Note that all the benchmarked algorithms are trained on the RealSR [120] dataset for a fair comparison. In the experiments, we also include bicubic interpolation [45], which is the most commonly used method for generating super-resolved images. Here, we compute the PSNR and SSIM scores using the Y channel (in YCbCr color space), as is common practice in the SR literature [55], [56], [71], [120]. The results in Table 4 show that bicubic interpolation provides the least accurate results, thereby indicating its low suitability for dealing with real images. Moreover, the same table shows that the recent method LP-KPN [120] achieves a marginal improvement of only ∼0.04 dB over the previous best method RCAN [71]. In contrast, our method significantly advances the state-of-the-art and consistently achieves better image quality scores than other approaches for all three scaling factors. In particular, compared to LP-KPN [120], our method leads to performance gains of 0.48 dB, 0.73 dB, and 0.24 dB for scaling factors ×2, ×3 and ×4, respectively. The trend is similar for the SSIM metric as well.

TABLE 4: Super-resolution evaluation on the RealSR dataset [120]. Compared to the state-of-the-art, our method consistently yields significantly better image quality scores for all three scaling factors.

                    ×2                ×3                ×4
Method              PSNR    SSIM      PSNR    SSIM      PSNR    SSIM
Bicubic             32.61   0.907     29.34   0.841     27.99   0.806
VDSR [59]           33.64   0.917     30.14   0.856     28.63   0.821
SRResNet [79]       33.69   0.919     30.18   0.859     28.67   0.824
RCAN [71]           33.87   0.922     30.40   0.862     28.88   0.826
LP-KPN [120]        33.90   0.927     30.42   0.868     28.92   0.834
MIRNet-v2 (Ours)    34.38   0.934     31.15   0.883     29.16   0.845
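The evaluation protocol above computes PSNR/SSIM on the luma channel; a small sketch of the PSNR part is shown below. The exact RGB-to-Y conversion is not spelled out in the text, so the widely used BT.601 convention is assumed.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Y channel (BT.601) of an 8-bit RGB image, as commonly used in the SR literature."""
    img = img.astype(np.float64)
    return (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0 + 16.0

def psnr_y(pred: np.ndarray, gt: np.ndarray) -> float:
    """PSNR between two RGB images, measured on the Y channel only."""
    mse = np.mean((rgb_to_y(pred) - rgb_to_y(gt)) ** 2)
    return float(10.0 * np.log10(255.0 ** 2 / mse))
```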
Visual comparisons in Fig. 6 show that our MIRNet-v2 can effectively recover content structures. In contrast, VDSR [59], SRResNet [79] and RCAN [71] reproduce results with noticeable artifacts. Furthermore, LP-KPN [120] is not able to preserve structures (see near the right edge of the crop). Several more examples are provided in Fig. 7 to further compare the image reproduction quality of our method against the previous best method [120]. It can be seen that LP-KPN [120] has a tendency to over-enhance the contrast (cols. 1, 3, 4) and in turn causes loss of details near dark and highlight areas. In contrast, the proposed MIRNet-v2 successfully reconstructs structural patterns and edges (col. 2) and produces images that are natural (cols. 1, 4) and have better color reproduction (col. 5).

Fig. 6: Comparisons for ×4 super-resolution on the RealSR [120] dataset. The image produced by our MIRNet-v2 is more faithful to the ground-truth than other competing methods (see lines near the right edge of the crops).

Fig. 7: Additional visual examples for ×4 super-resolution, comparing our MIRNet-v2 against the state-of-the-art approach [120]. Note that all example crops are taken from different images.

4.6 Image Enhancement

In this section, we demonstrate the effectiveness of our algorithm by evaluating it for the image enhancement task. We report PSNR/SSIM values of our method and several other techniques in Table 5 and Table 6 for the LoL [87] and MIT-Adobe FiveK [121] datasets, respectively. It can be seen
that our MIRNet-v2 achieves significant improvements over previous approaches. Notably, when compared to the recent best methods, MIRNet-v2 obtains a 3.44 dB performance gain over KinD++ [137] on the LoL dataset and a 0.93 dB improvement over DeepUPE¹ [124] on the Adobe-FiveK dataset.

1. Note that the quantitative results reported in [124] are incorrect. The correct scores were later released by the original authors [link].

TABLE 5: Low-light image enhancement evaluation on the LoL dataset [87]. The proposed method significantly advances the state-of-the-art.

Method              PSNR    SSIM
BIMEF [129]         13.86   0.577
CRM [130]           17.20   0.644
Dong [131]          16.72   0.582
LIME [132]          16.76   0.564
MF [133]            18.79   0.642
RRM [134]           13.88   0.658
SRIE [133]          11.86   0.498
Retinex-Net [87]    16.77   0.559
MSR [83]            13.17   0.479
NPE [135]           16.97   0.589
GLAD [136]          19.72   0.703
KinD [4]            20.87   0.810
KinD++ [137]        21.30   0.822
MIRNet-v2 (Ours)    24.74   0.851

TABLE 6: Image enhancement comparisons on the MIT-Adobe FiveK dataset [121].

Method              PSNR    SSIM
HDRNet [138]        21.96   0.866
W-Box [122]         18.57   0.701
DR [123]            20.97   0.841
DPE [94]            22.15   0.850
DeepUPE [124]       23.04   0.893
MIRNet-v2 (Ours)    23.97   0.931

We show visual results in Fig. 8 and Fig. 9. Compared to other techniques, our method generates enhanced images that are natural and vivid in appearance and have better global and local contrast.

4.7 Ablation Studies

We study the impact of each of our architectural components and design choices on the final performance. All the ablation experiments are performed for the super-resolution task with the ×3 scale factor. The ablation models are trained on image patches of size 128×128 for 10⁵ iterations. Table 7 shows that removing skip connections causes the largest performance drop. Without skip connections, the network finds it difficult to converge and yields high training errors, and consequently low PSNR. Furthermore, the information exchange among parallel convolution streams via SKFF is helpful and leads to improved performance. Similarly, RCB contributes positively towards the final image quality.

Table 8 shows that the proposed RCB provides a favorable performance gain over the baseline Resblock from EDSR [74]. Moreover, removing the transform part from RCB causes a drop in accuracy. Table 8 also shows that replacing the group convolutions with regular convolutions in RCB increases the PSNR score, but at the cost of a significant increase in parameters and FLOPs. Therefore, we opt for RCB with group convolutions (g=2) as a balanced choice.

Next, we analyze the feature aggregation strategy in Table 9. It shows that the proposed SKFF generates favorable results compared to summation and concatenation. Note that our proposed SKFF module uses ∼5× fewer parameters than concatenation. Table 10 shows that the progressive learning strategy on mixed-size image patches yields PSNR similar to the model trained on large image patches (ps=224), but takes less time for training. Finally, in Table 11 we study how the number of convolutional streams and columns (RCB blocks) of MRB affect the image restoration quality. We note that increasing the number of streams provides significant improvements, thereby justifying the importance of multi-scale feature processing. Moreover, increasing the number of columns yields better scores, thus indicating the significance of information exchange among parallel streams for feature consolidation.
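As a rough cross-check of the feature-aggregation comparison in Table 9, the helper below counts only the 1×1-conv fusion weights (biases ignored) for two streams of c channels: summation needs none, concatenation needs a 2c→c projection, and SKFF needs one c→c/r squeeze plus two c/r→c expansions. The concrete channel width behind the reported counts is not stated in the text; they are consistent with, for example, c = 64 and r = 8, and the concat/SKFF ratio of 16/3 ≈ 5.3 matches the "∼5× fewer parameters" observation.

```python
def fusion_weight_counts(c: int, r: int = 8) -> dict:
    """Approximate fusion-layer weight counts for two streams with c channels each."""
    return {
        "sum": 0,                                   # element-wise addition has no weights
        "concat": (2 * c) * c,                      # 1x1 conv projecting 2c -> c channels
        "skff": c * (c // r) + 2 * (c // r) * c,    # squeeze (c -> c/r) + two expansions (c/r -> c)
    }

# fusion_weight_counts(64) -> {'sum': 0, 'concat': 8192, 'skff': 1536}
```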
TABLE 7: Impact of individual components of MRB.

Skip connections: X X X X
RCB: X X X
SKFF intermediate: X X X
SKFF final: X X X X X
PSNR (in dB): 28.21 30.79 30.85 30.68 30.97

TABLE 8: Effect of individual components of RCB. The Resblock from EDSR [74] is taken as the baseline. FLOPs are calculated on an image of size 256×256. 'g' represents the number of groups in the group convolutions.

Method                     PSNR    Params (M)   FLOPs (B)
Baseline [74], g=2         30.84   5.0          139.5
+ RCB, g=2                 30.97   5.9          139.8
RCB w/o transform, g=2     30.92   5.0          139.7
RCB, g=1                   31.05   9.7          253.2

TABLE 9: Feature aggregation. Our SKFF uses ∼5× fewer parameters than 'Concat', but generates better results.

                 Sum     Concat   SKFF
PSNR (in dB)     30.76   30.83    30.97
Parameters       0       8,192    1,536

TABLE 10: Effect of progressive learning. For progressive training, we gradually increase the image patch size from 128×128 to 224×224.

Patch size       128     144     192     224     Progressive
PSNR (in dB)     30.97   30.99   31.02   31.08   31.06
Train time (h)   14      17      25      33      22

TABLE 11: Ablation study on different layouts of MRB. Rows denote the number of parallel resolution streams, and Cols represent the number of columns containing RCBs.

PSNR        Cols = 1   Cols = 2   Cols = 3
Rows = 1    30.01      30.29      30.47
Rows = 2    30.65      30.79      30.85
Rows = 3    30.73      30.97      31.03

Fig. 8: Visual comparison of low-light enhancement approaches on the LoL dataset [87]. The image produced by our method is visually closer to the ground-truth in terms of brightness and global contrast.

Fig. 9: Visual results of image enhancement on the MIT-Adobe FiveK [121] dataset. Compared to the state-of-the-art, our MIRNet-v2 makes better color and contrast adjustments and produces images that appear vivid, natural and pleasant.

5 CONCLUDING REMARKS

Conventional image restoration and enhancement pipelines either stick to full-resolution features along the network hierarchy or use an encoder-decoder architecture. The first approach helps retain precise spatial details, while the latter provides better contextualized representations. However, these methods can satisfy only one of the above two requirements, although real-world image restoration tasks demand a combination of both, conditioned on the given input sample. In this work, we propose a novel architecture whose main branch is dedicated to full-resolution processing and whose complementary set of parallel branches provides better contextualized features. We propose novel mechanisms to learn relationships between features within each branch as well as across multi-scale branches. Our feature fusion strategy ensures that the receptive field can be dynamically adapted without sacrificing the original feature details. Consistent achievement of state-of-the-art results on six datasets for four image restoration and enhancement tasks corroborates the effectiveness of our approach.

ACKNOWLEDGEMENTS

Ming-Hsuan Yang is supported by NSF CAREER grant 1149783. Ling Shao is partially supported by the National Natural Science Foundation of China (grant no. 61929104). Munawar Hayat is supported by the ARC DECRA Fellowship.

REFERENCES

[1] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: convolutional networks for biomedical image segmentation. In MICCAI, 2015.

[2] Orest Kupyn, Tetiana Martyniuk, Junru Wu, and Zhangyang Wang. Deblurgan-v2: Deblurring (orders-of-magnitude) faster and better. In ICCV, 2019.
[3] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. In CVPR, 2018.
[4] Yonghua Zhang, Jiawan Zhang, and Xiaojie Guo. Kindling the darkness: A practical low-light image enhancer. In MM, 2019.
[5] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. TPAMI, 2015.
[6] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. TIP, 2017.
[7] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image restoration. TPAMI, 2020.
[8] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool. DSLR-quality photos on mobile devices with deep convolutional networks. In ICCV, 2017.
[9] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Learning enriched features for real image restoration and enhancement. In ECCV, 2020.
[10] Abdelrahman Abdelhamed, Stephen Lin, and Michael S Brown. A high-quality denoising dataset for smartphone cameras. In CVPR, 2018.
[11] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In CVPR, 2018.
[12] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, 2017.
[13] Shuhang Gu, Yawei Li, Luc Van Gool, and Radu Timofte. Self-guided network for fast image denoising. In ICCV, 2019.
[14] Abdullah Abuolaim and Michael S Brown. Defocus deblurring using dual-pixel data. In ECCV, 2020.
[15] Laurent DAndrès, Jordi Salvador, Axel Kochale, and Sabine Süsstrunk. Non-parametric blur map regression for depth of field extension. TIP, 2016.
[16] Abdullah Abuolaim, Mauricio Delbracio, Damien Kelly, Michael S. Brown, and Peyman Milanfar. Learning to reduce defocus blur by realistically modeling dual-pixel data. In ICCV, 2021.
[17] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In CVPR, 2022.
[18] Ali Karaali and Claudio Rosito Jung. Edge-based defocus blur estimation with adaptive scale selection. TIP, 2017.
[19] Jianping Shi, Li Xu, and Jiaya Jia. Just noticeable defocus blur detection and estimation. In CVPR, 2015.
[20] Junyong Lee, Sungkil Lee, Sunghyun Cho, and Seungyong Lee. Deep defocus map estimation using domain adaptation. In CVPR, 2019.
[21] Leonid P Yaroslavsky. Local adaptive image restoration and enhancement with the use of DFT and DCT in a running window. In Wavelet Applications in Signal and Image Processing IV, 1996.
[22] Eero P Simoncelli and Edward H Adelson. Noise removal via bayesian wavelet coring. In ICIP, 1996.
[23] Carlo Tomasi and Roberto Manduchi. Bilateral filtering for gray and color images. In ICCV, 1998.
[24] Pietro Perona and Jitendra Malik. Scale-space and edge detection using anisotropic diffusion. TPAMI, 1990.
[25] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 1992.
[26] Alexei A Efros and Thomas K Leung. Texture synthesis by non-parametric sampling. In ICCV, 1999.
[27] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In CVPR, 2005.
[28] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. TIP, 2007.
[29] Weisheng Dong, Guangming Shi, and Xin Li. Nonlocal image restoration with bilateral variance estimation: a low-rank approach. TIP, 2012.
[30] Shuhang Gu, Lei Zhang, Wangmeng Zuo, and Xiangchu Feng. Weighted nuclear norm minimization with application to image denoising. In CVPR, 2014.
[31] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Non-local sparse models for image restoration. In ICCV, 2009.
[32] Rachid Hedjam, Reza Farrahi Moghaddam, and Mohamed Cheriet. Markovian clustering for the non-local means image denoising. In ICIP, 2009.
[33] Saeed Anwar and Nick Barnes. Real image denoising with feature attention. ICCV, 2019.
[34] Tim Brooks, Ben Mildenhall, Tianfan Xue, Jiawen Chen, Dillon Sharlet, and Jonathan T Barron. Unprocessing images for learned raw denoising. In CVPR, 2019.
[35] Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, and Lei Zhang. Toward convolutional blind denoising of real photographs. In CVPR, 2019.
[36] Tobias Plötz and Stefan Roth. Neural nearest neighbors networks. In NeurIPS, 2018.
[37] Kai Zhang, Wangmeng Zuo, and Lei Zhang. FFDNet: Toward a fast and flexible solution for CNN-based image denoising. TIP, 2018.
[38] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. CycleISP: Real image restoration via improved data synthesis. In CVPR, 2020.
[39] Meng Chang, Qi Li, Huajun Feng, and Zhihai Xu. Spatial-adaptive network for single image denoising. In ECCV, 2020.
[40] Zongsheng Yue, Qian Zhao, Lei Zhang, and Deyu Meng. Dual adversarial network: Toward real-world noise removal and noise generation. In ECCV, 2020.
[41] Zongsheng Yue, Hongwei Yong, Qian Zhao, Deyu Meng, and Lei Zhang. Variational denoising network: Toward blind noise modeling and removal. In NeurIPS, 2019.
[42] Yoonsik Kim, Jae Woong Soh, Gu Yong Park, and Nam Ik Cho. Transfer learning from synthetic to real-noise denoising with adaptive instance normalization. In CVPR, 2020.
[43] Faming Fang, Juncheng Li, Yiting Yuan, Tieyong Zeng, and Guixu Zhang. Multilevel edge features guided network for image denoising. IEEE TNNLS, 2020.
[44] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In CVPR, 2021.
[45] Robert Keys. Cubic convolution interpolation for digital image processing. TASSP, 1981.
[46] Michal Irani and Shmuel Peleg. Improving resolution by image registration. CVGIP, 1991.
[47] Jan Allebach and Ping Wah Wong. Edge-directed interpolation. In ICIP, 1996.
[48] Lei Zhang and Xiaolin Wu. An edge-guided image interpolation algorithm via directional filtering and data fusion. TIP, 2006.
[49] Kwang In Kim and Younghee Kwon. Single-image super-resolution using sparse regression and natural image prior. TPAMI, 2010.
[50] Zhiwei Xiong, Xiaoyan Sun, and Feng Wu. Robust web image/video super-resolution. TIP, 2010.
[51] Hong Chang, Dit-Yan Yeung, and Yimin Xiong. Super-resolution through neighbor embedding. In CVPR, 2004.
[52] Gilad Freedman and Raanan Fattal. Image and video upscaling from local self-examples. TOG, 2011.
[53] Jianchao Yang, John Wright, Thomas S Huang, and Yi Ma. Image super-resolution via sparse representation. TIP, 2010.
[54] Jianchao Yang, John Wright, Thomas Huang, and Yi Ma. Image super-resolution as sparse representation of raw image patches. In CVPR, 2008.
[55] Zhihao Wang, Jian Chen, and Steven CH Hoi. Deep learning for image super-resolution: A survey. TPAMI, 2019.
[56] Saeed Anwar, Salman Khan, and Nick Barnes. A deep journey into super-resolution: A survey. arXiv, 2019.
[57] Jianrui Cai, Shuhang Gu, Radu Timofte, and Lei Zhang. Ntire 2019 challenge on real image super-resolution: Methods and results. In CVPRW, 2019.
[58] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In ECCV, 2014.
[59] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In ICCV, 2016.
[60] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. Memnet: A persistent memory network for image restoration. In ICCV, 2017.
[61] Ying Tai, Jian Yang, and Xiaoming Liu. Image super-resolution via deep recursive residual network. In CVPR, 2017.
[62] Zheng Hui, Xiumei Wang, and Xinbo Gao. Fast and accurate single image super-resolution via information distillation network. In CVPR, 2018.
[63] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[64] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In CVPR, 2016.
[65] Wei Han, Shiyu Chang, Ding Liu, Mo Yu, Michael Witbrock, and Thomas S Huang. Image super-resolution via dual-state recurrent networks. In CVPR, 2018.
[66] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In ECCV, 2018.
[67] Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, and Thomas Huang. Deep networks for image super-resolution with sparse prior. In ICCV, 2015.
[68] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, 2017.
[69] Tong Tong, Gen Li, Xiejie Liu, and Qinquan Gao. Image super-resolution using dense skip connections. In ICCV, 2017.
[70] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: enhanced super-resolution generative adversarial networks. In ECCVW, 2018.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 13

[71] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In ECCV, 2018.
[72] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In CVPR, 2019.
[73] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration. In ICLR, 2019.
[74] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In CVPRW, 2017.
[75] Ryan Dahl, Mohammad Norouzi, and Jonathon Shlens. Pixel recursive super resolution. In ICCV, 2017.
[76] Juncheng Li, Faming Fang, Kangfu Mei, and Guixu Zhang. Multi-scale residual network for image super-resolution. In ECCV, 2018.
[77] Seong-Jin Park, Hyeongseok Son, Sunghyun Cho, Ki-Sang Hong, and Seungyong Lee. SRFEAT: Single image super-resolution with feature discrimination. In ECCV, 2018.
[78] Mehdi SM Sajjadi, Bernhard Scholkopf, and Michael Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In ICCV, 2017.
[79] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[80] Edwin H Land. The retinex theory of color vision. Scientific American, 1977.
[81] Marcelo Bertalmío, Vicent Caselles, Edoardo Provenzi, and Alessandro Rizzi. Perceptual color correction through variational techniques. TIP, 2007.
[82] R. Palma-Amestoy, E. Provenzi, M. Bertalmío, and V. Caselles. A perceptually inspired variational framework for color enhancement. TPAMI, 2009.
[83] Daniel J Jobson, Zia-ur Rahman, and Glenn A Woodell. A multiscale retinex for bridging the gap between color images and the human observation of scenes. TIP, 1997.
[84] Alessandro Rizzi, Carlo Gatta, and Daniele Marini. From retinex to automatic color equalization: issues in developing a new algorithm for unsupervised color equalization. Journal of Electronic Imaging, 2004.
[85] Andrey Ignatov and Radu Timofte. NTIRE 2019 challenge on image enhancement: Methods and results. In CVPRW, 2019.
[86] Liang Shen, Zihan Yue, Fan Feng, Quan Chen, Shihao Liu, and Jie Ma. MSR-net: Low-light image enhancement using deep convolutional network. arXiv, 2017.
[87] Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu. Deep retinex decomposition for low-light enhancement. BMVC, 2018.
[88] Huibin Chang, Michael K Ng, Wei Wang, and Tieyong Zeng. Retinex image enhancement via a learned dictionary. Optical Engineering, 2015.
[89] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[90] Kin Gwn Lore, Adedotun Akintayo, and Soumik Sarkar. LLNet: a deep autoencoder approach to natural low-light image enhancement. Pattern Recognition, 2017.
[91] Wenqi Ren, Sifei Liu, Lin Ma, Qianqian Xu, Xiangyu Xu, Xiaochun Cao, Junping Du, and Ming-Hsuan Yang. Low-light image enhancement via a deep hybrid network. TIP, 2019.
[92] Kangfu Mei, Juncheng Li, Jiajie Zhang, Haoyu Wu, Jie Li, and Rui Huang. Higher-resolution network for image demosaicing and enhancing. In ICCVW, 2019.
[93] Jiaqian Li, Juncheng Li, Faming Fang, Fang Li, and Guixu Zhang. Luminance-aware pyramid network for low-light image enhancement. IEEE Transactions on Multimedia, 2020.
[94] Yu-Sheng Chen, Yu-Ching Wang, Man-Hsin Kao, and Yung-Yu Chuang. Deep photo enhancer: Unpaired learning for image enhancement from photographs with GANs. In CVPR, 2018.
[95] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool. WESPE: weakly supervised photo enhancer for digital cameras. In CVPRW, 2018.
[96] Yubin Deng, Chen Change Loy, and Xiaoou Tang. Aesthetic-driven image enhancement by adversarial learning. In ACM Multimedia, 2018.
[97] Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. In ICIP, 1994.
[98] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[99] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
[100] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
[101] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017.
[102] Xi Peng, Rogerio S Feris, Xiaoyu Wang, and Dimitris N Metaxas. A recurrent encoder-decoder network for sequential face alignment. In ECCV, 2016.
[103] David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 1962.
[104] Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 1999.
[105] Thomas Serre, Lior Wolf, Stanley Bileschi, Maximilian Riesenhuber, and Tomaso Poggio. Robust object recognition with cortex-like mechanisms. TPAMI, 2007.
[106] Chou P Hung, Gabriel Kreiman, Tomaso Poggio, and James J DiCarlo. Fast readout of object identity from macaque inferior temporal cortex. Science, 2005.
[107] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q Weinberger. Multi-scale dense networks for resource efficient image classification. In ICLR, 2018.
[108] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
[109] Damien Fourure, Rémi Emonet, Élisa Fromont, Damien Muselet, Alain Trémeau, and Christian Wolf. Residual conv-deconv grid network for semantic segmentation. In BMVC, 2017.
[110] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[111] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selective kernel networks. In CVPR, 2019.
[112] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[113] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[114] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. Global context networks. TPAMI, 2020.
[115] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM Computing Surveys, 2021.
[116] Elad Hoffer, Berry Weinstein, Itay Hubara, Tal Ben-Nun, Torsten Hoefler, and Daniel Soudry. Mix & match: training convnets with mixed image sizes for improved accuracy, speed and scale resiliency. arXiv:1908.08986, 2019.
[117] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[118] Tobias Plotz and Stefan Roth. Benchmarking denoising algorithms with real photographs. In CVPR, 2017.
[119] https://noise.visinf.tu-darmstadt.de/benchmark/, 2017. [Online; accessed 29-Feb-2020].
[120] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. In ICCV, 2019.
[121] Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Frédo Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In CVPR, 2011.
[122] Yuanming Hu, Hao He, Chenxi Xu, Baoyuan Wang, and Stephen Lin. Exposure: A white-box photo post-processing framework. TOG, 2018.
[123] Jongchan Park, Joon-Young Lee, Donggeun Yoo, and In So Kweon. Distort-and-recover: Color enhancement using deep reinforcement learning. In CVPR, 2018.
[124] Ruixing Wang, Qing Zhang, Chi-Wing Fu, Xiaoyong Shen, Wei-Shi Zheng, and Jiaya Jia. Underexposed photo enhancement using deep illumination estimation. In CVPR, 2019.
[125] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
[126] Harold C Burger, Christian J Schuler, and Stefan Harmeling. Image denoising: Can plain neural networks compete with BM3D? In CVPR, 2012.
[127] Chong Mou, Jian Zhang, and Zhuoyuan Wu. Dynamic attentive graph learning for image restoration. In ICCV, 2021.
[128] Chao Ren, Xiaohai He, Chuncheng Wang, and Zhibo Zhao. Adaptive consistency prior based deep network for image denoising. In CVPR, 2021.
[129] Zhenqiang Ying, Ge Li, and Wen Gao. A bio-inspired multi-exposure fusion framework for low-light image enhancement. arXiv preprint arXiv:1711.00591, 2017.
[130] Zhenqiang Ying, Ge Li, Yurui Ren, Ronggang Wang, and Wenmin Wang. A new image contrast enhancement algorithm using exposure fusion framework. In CAIP, 2017.
[131] Xuan Dong, Guan Wang, Yi Pang, Weixin Li, Jiangtao Wen, Wei Meng, and Yao Lu. Fast efficient algorithm for enhancement of low lighting video. In ICME, 2011.
[132] Xiaojie Guo, Yu Li, and Haibin Ling. LIME: Low-light image enhancement via illumination map estimation. TIP, 2016.
[133] Xueyang Fu, Delu Zeng, Yue Huang, Xiao-Ping Zhang, and Xinghao Ding. A weighted variational model for simultaneous reflectance and illumination estimation. In CVPR, 2016.
[134] Yong Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Structure inference net: Object detection using scene-level context and instance-level relationships. In CVPR, 2018.
[135] Shuhang Wang, Jin Zheng, Hai-Miao Hu, and Bo Li. Naturalness preserved enhancement algorithm for non-uniform illumination images. TIP, 2013.
[136] Wenjing Wang, Chen Wei, Wenhan Yang, and Jiaying Liu. GLADNet: Low-light enhancement network with global awareness. In FG, 2018.
[137] Yonghua Zhang, Xiaojie Guo, Jiayi Ma, Wei Liu, and Jiawan Zhang. Beyond brightening low-light images. IJCV, 2021.
[138] Michaël Gharbi, Jiawen Chen, Jonathan T Barron, Samuel W Hasinoff, and Frédo Durand. Deep bilateral learning for real-time image enhancement. TOG, 2017.

Syed Waqas Zamir received the Ph.D. degree from University Pompeu Fabra, Spain, in 2017. He is a Research Scientist at the Inception Institute of Artificial Intelligence in UAE. His research interests include low-level computer vision, computational imaging, image and video processing, color vision, and image restoration and enhancement.

Aditya Arora is a Research Engineer at the Inception Institute of Artificial Intelligence in UAE. His research interests include image and video processing, computational photography, and low-level vision.

Salman Khan is an Assistant Professor at MBZ University of Artificial Intelligence. He has been an Adjunct faculty member with Australian National University since 2016. He has been awarded the outstanding reviewer award at CVPR multiple times, won the best paper award at the 9th ICPRAM 2020, and took 2nd prize in the NTIRE Image Enhancement Competition at CVPR 2019. He has served as a program committee member for several premier conferences, including CVPR, ICCV, ICLR, ECCV, and NeurIPS. He received his Ph.D. degree from the University of Western Australia in 2016; his thesis received an honorable mention on the Dean's List Award. His research interests include computer vision and machine learning.

Munawar Hayat received his PhD from The University of Western Australia (UWA). His PhD thesis received multiple awards, including the Dean's List Honorable Mention Award and the Robert Street Prize. After his PhD, he joined IBM Research as a postdoc and then moved to the University of Canberra as an Assistant Professor. He is currently a Senior Scientist at the Inception Institute of Artificial Intelligence, UAE. Munawar has been granted two US patents and has published over 30 papers at leading venues in his field, including TPAMI, IJCV, CVPR, ECCV, and ICCV. His research interests are in computer vision and machine/deep learning.

Fahad Khan is a faculty member at MBZUAI, United Arab Emirates, and Linköping University, Sweden. From 2018 to 2020 he worked as a Lead Scientist at the Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates. He received the M.Sc. degree in Intelligent Systems Design from Chalmers University of Technology, Sweden, and a Ph.D. degree in Computer Vision from the Autonomous University of Barcelona, Spain. He has achieved top ranks in various international challenges (Visual Object Tracking VOT: 1st 2014 and 2018, 2nd 2015, 1st 2016; VOT-TIR: 1st 2015 and 2016; OpenCV Tracking: 1st 2015; 1st PASCAL VOC 2010). His research interests span a wide range of topics within computer vision and machine learning, such as object recognition, object detection, action recognition, and visual tracking. He has published articles in high-impact computer vision journals and conferences in these areas. He serves as a regular program committee member for leading computer vision conferences such as CVPR, ICCV, and ECCV.

Ming-Hsuan Yang is affiliated with Google, UC Merced, and Yonsei University. Yang served as a program co-chair of the IEEE International Conference on Computer Vision (ICCV) in 2019, program co-chair of the Asian Conference on Computer Vision (ACCV) in 2014, and general co-chair of ACCV 2016. Yang served as an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence, and is an associate editor of the International Journal of Computer Vision, Image and Vision Computing, and the Journal of Artificial Intelligence Research. He received the NSF CAREER Award and a Google Faculty Award. He is a Fellow of the IEEE.
Ling Shao is the Chief Scientist of Terminus Group and the President of Terminus International. He was the founding CEO and Chief Scientist of the Inception Institute of Artificial Intelligence, Abu Dhabi, UAE. His research interests include computer vision, deep learning, medical imaging, and vision and language. He is a Fellow of the IEEE, the IAPR, the BCS, and the IET.
