Efficient Image Deblurring Networks based on Diffusion Models
Abstract
This article presents a sliding window model for defocus deblurring, named Swintormer, which achieves the best performance to date with remarkably low memory
usage. This method utilizes a diffusion model to generate latent prior features,
aiding in the restoration of more detailed images. Additionally, by adapting the
sliding window strategy, it incorporates specialized Transformer blocks to enhance
inference efficiency. The adoption of this new approach has led to a substantial
reduction in Multiply-Accumulate Operations (MACs) per iteration, drastically
cutting down memory requirements. In comparison to the currently leading GRL method, our Swintormer model significantly reduces the memory-bound computational load from 140.35 GMACs to 8.02 GMACs, while improving the Peak Signal-to-Noise Ratio (PSNR) for defocus deblurring from 27.04 dB to 27.07 dB. This technique enables the processing of higher-resolution images on memory-limited devices, vastly broadening the potential application scenarios. The article concludes with an ablation study, offering a comprehensive examination of how each network module contributes to the final performance.
1 Introduction
Image deblurring is a classic task in low-level computer vision, which aims to restore the image
from a degraded input and has a wide range of application scenarios. Existing networks based on supervised deep learning regression methods, such as Restormer [81] and GRL [41], show strong
capabilities for image deblurring tasks. However, such supervised algorithms invariably demand a
considerable volume of labeled data to effectively train regression models. Data annotation is labor-
intensive and often necessitates domain expertise, thereby resulting in elevated costs. Conversely,
unsupervised learning methods [39, 59, 26] obviate the need for labeled data, rendering them
particularly well-suited for large-scale datasets. Nevertheless, acquiring extensive data can pose
challenges in certain tasks, and in such contexts, unsupervised methods may not prove to be as
efficacious as supervised algorithms. Another important issue is the generalization ability. The
current deblurring algorithms [1, 66, 38, 78] often suffer severe performance degradation when
confronted with varying data distributions across different scenarios.
In recent years, attention mechanisms [63, 1, 38, 66] have demonstrated their effectiveness in
enhancing the fitting capability of deep learning regression models. Convolution operations, which are
often used to implement attention mechanisms, provide local connectivity and translation invariance.
While convolution operations enhance efficiency and generalization, the convolution operator’s
receptive field is restricted, impeding its ability to model long-range pixel dependencies. As a result,
Transformer-based algorithms utilizing the self-attention (SA) mechanism [74, 77, 83, 17] were
introduced to address this challenge. Although SA is highly effective in capturing long-range pixel
interactions, its complexity grows quadratically with spatial resolution. This makes it impractical to
head transposed attention and the spatial attention is a shifted windows-Dconv attention
(SWDA).
• We present a more general approach by dividing the image into overlapping patches for
independent inference, which improves the model performance.
2 Related Works
2.1 Image deblurring
The traditional deblurring algorithm typically involves formulating and solving an optimization
problem based on the causes of image blurring [28, 64, 19, 55]. However, these approaches depend
on manually designed image features, leading to limited generalization capability and constrained
performance in intricate scenarios. Currently, deep learning-based on image deblurring focuses on
establishing a direct mapping between blurred images and sharp images from paired datasets:
$$I_b = \phi(I_s; \theta_i), \quad (2)$$
where $\phi$ is the image blur function, $\theta_i$ is a parameter vector, $I_s$ is the sharp image, and $I_b$ is the blurred image. With the powerful fitting capability of deep learning, it is possible to directly train
the model end-to-end to learn this mapping, thereby achieving deblurring [78, 41, 81]. Current
research primarily focuses on general algorithms that aim to improve model representation by using
advanced neural network architecture designs such as residual blocks, dense blocks, attention blocks,
and others [75, 46, 44, 45, 24, 8, 15, 21, 33, 20]. Notably, the Transformer architecture [74, 17]
has exhibited remarkable success. Numerous experiments have shown that the effectiveness of the
Transformer primarily lies in the design of the token mixer and the FFN (Feed-Forward Network) [79].
In particular, the self-attention mechanism in the token mixer is recognized as the key driver of its
superior performance. However, its complexity increases quadratically as the number of patches
grows, making it infeasible for high-resolution images. To address this issue, various revised token
mixers [9, 3, 80, 72, 36, 76, 71, 30, 11] were developed to reduce complexity in different image
processing applications. On the other hand, different FFN designs such as Mlp [4], GluMlp [14],
GatedMlp [48], ConvMlp [40] and SimpleGate [5] were proposed. While these designs have their
own advantages and disadvantages in various low-level visual tasks, the performance difference
remains consistent when the number of model parameters is nearly the same.
In addition to developing general algorithms, another focus is on creating specialized models op-
timized for specific blurry situations, including image super-resolution (SR), real-world image
deblurring, image denoising, and the reduction of JPEG compression artifacts [22, 84, 42, 7, 32].
Recently, Diffusion Models have emerged as leaders in unconditional image synthesis, leveraging
unsupervised learning algorithms to extract priors from datasets and achieve state-of-the-art results.
In contrast to previous models, such as feed-forward, GAN, and flow-based models, which directly learn a mapping $f$ from the input $x$ to the result $y$:
$$y = f(x), \quad (3)$$
The diffusion model adopts a distinct approach. It treats the generation process as an optimization
computation, expressed as:
$$y = \arg\min_{y} E_\theta(x, y), \quad (4)$$
where E is the expectation and θ represents the parameters. The information it directly learns is not
the joint distribution of pixels, but rather the gradient of the distribution. In other words, instead of
learning a map directly, DM builds a neural network to find a solution to the optimization problem
and then samples the solution to get an image.
Drawing upon its robust ability to grasp image priors from datasets, various impressive diffusion
frameworks have been utilized for low-level vision tasks [52, 57, 31, 60, 26, 68, 16, 59, 18]. Nonethe-
less, training a highly advanced diffusion model often demands expensive computing resources. We propose to circumvent this drawback with our Swintormer approach.
Table 1: Quantitative comparison by applying the diffusion model [59] on the DPDD dataset [1]. For a fair comparison with other existing methods, the input image is an 8-bit image instead of a 16-bit image.
3 Method
Our goal is to develop an image deblurring model that efficiently utilizes priors generated by the
diffusion model while maintaining the advantage of being memory-efficient. We propose the Sliding Window Image Restoration Model (Swintormer), a regression-based model designed with a sliding window strategy. An overview of the pipeline is presented in Fig. 1. In this section, we introduce the design of the Transformer module in the model and demonstrate how diffusion is employed to generate prior features.
While self-attention [74, 17] is highly effective, the time and memory complexity of its key-query dot-product interaction grows quadratically as the input resolution increases. For example, performing this calculation on a tensor of size 8 × 128 × 128 × 48 requires 64 GB of video memory. Similar to methods that handle long sequences in NLP [9, 3, 80, 72, 36, 76, 71, 30, 11], many methods [43, 81, 5] have been proposed for high-resolution images. However, these methods perform differently across low-level vision tasks. Therefore, we propose Swintormer, which has linear complexity, to improve generalization performance. The key innovation is to segment the feature tensor along the channel dimension and then calculate channel attention and spatial attention separately. Another crucial aspect is the use of depth-wise convolutions instead of linear layers to generate the query (Q), key (K) and value (V) projections, which highlights the local context and accelerates model convergence.
Given a layer-normalized tensor $Y \in \mathbb{R}^{\hat{H}\times\hat{W}\times\hat{C}}$, our SWDA first applies a shifted window partitioning approach to divide $Y$ into $M \times M$ patches (with a default window size of 16). One challenge with this routine is that some windows may end up smaller than $M \times M$; therefore, a cyclic shift toward the top-left direction is used to solve the problem [49]. The resulting patches are then used to generate Q, K and V projections through 1×1 convolutions to aggregate pixel-wise cross-channel context, followed by 3×3 depth-wise bias-free convolutions. As a result, the tensors Q, K and V are all of the same size, $\mathbb{R}^{M \times M \times \hat{C}}$. These tensors are then split into two parts along the channel dimension, each of size $\mathbb{R}^{M \times M \times \frac{\hat{C}}{2}}$. One part is used for the channel attention calculation by MDTA [81], where the dot-product interaction produces a transposed-attention map of size $\mathbb{R}^{\frac{\hat{C}}{2} \times \frac{\hat{C}}{2}}$, while the other part is used for the spatial attention calculation, resulting in an attention map of size $\mathbb{R}^{M^2 \times M^2}$.
Overall, the process is defined as:
$$\text{Channel Attention} = \mathrm{SoftMax}(Q_c K_c^T)\, V_c, \quad (5)$$
$$\text{Spatial Attention} = \mathrm{SoftMax}(Q_s K_s^T + B)\, V_s, \quad (6)$$
$$\text{Attention} = W_p^{(\cdot)}\, \mathrm{concat}(\text{Channel Attention}, \text{Spatial Attention}), \quad (7)$$
where $Q_c, K_c, V_c \in \mathbb{R}^{\frac{\hat{C}}{2} \times M^2}$; $Q_s, K_s, V_s \in \mathbb{R}^{M^2 \times \frac{\hat{C}}{2}}$; $B \in \mathbb{R}^{M^2 \times M^2}$ represents the relative position bias term for each head; $W_p^{(\cdot)}$ denotes the 1×1 point-wise convolution, and $M^2$ is the number of patches in a window. The relative position bias encodes the relative spatial configurations of visual elements.
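To make the split-attention computation concrete, the following is a minimal PyTorch sketch of Eqs. (5)-(7). It is not the released implementation: module and variable names (MixedAttention, qkv_dw, and so on) are our own illustrative choices, the multi-head structure and the relative position bias B are omitted, and the input is assumed to be a single already-partitioned M × M window.

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    """Sketch of the mixed attention in Eqs. (5)-(7): Q/K/V come from a 1x1
    convolution followed by a 3x3 depth-wise bias-free convolution, then the
    channels are split into a channel-attention half (MDTA-style, Eq. 5) and
    a window spatial-attention half (SWDA-style, Eq. 6)."""

    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        self.qkv_dw = nn.Conv2d(dim * 3, dim * 3, kernel_size=3, padding=1,
                                groups=dim * 3, bias=False)
        self.project_out = nn.Conv2d(dim, dim, kernel_size=1)  # W_p in Eq. (7)

    def forward(self, y):                       # y: (B, C, M, M), one window
        b, c, m, _ = y.shape
        q, k, v = self.qkv_dw(self.qkv(y)).chunk(3, dim=1)
        qc, qs = q.chunk(2, dim=1)
        kc, ks = k.chunk(2, dim=1)
        vc, vs = v.chunk(2, dim=1)

        # Eq. (5): channel attention, transposed-attention map of size (C/2, C/2)
        qc, kc, vc = qc.flatten(2), kc.flatten(2), vc.flatten(2)   # (B, C/2, M*M)
        attn_c = (qc @ kc.transpose(-2, -1)).softmax(dim=-1)
        out_c = (attn_c @ vc).view(b, c // 2, m, m)

        # Eq. (6): spatial attention over the M*M patches of the window
        # (the relative position bias B is omitted in this sketch)
        qs = qs.flatten(2).transpose(-2, -1)                       # (B, M*M, C/2)
        ks = ks.flatten(2).transpose(-2, -1)
        vs = vs.flatten(2).transpose(-2, -1)
        attn_s = (qs @ ks.transpose(-2, -1)).softmax(dim=-1)       # (B, M*M, M*M)
        out_s = (attn_s @ vs).transpose(-2, -1).view(b, c // 2, m, m)

        # Eq. (7): concatenate both branches and fuse with a 1x1 convolution
        return self.project_out(torch.cat([out_c, out_s], dim=1))
```

Running a tensor of shape (1, 48, 16, 16) through MixedAttention(48) returns the same shape; the full model applies this block to every shifted window.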
Many methods, such as Masked Autoencoders [25], improve the performance of the model by
destroying the feature and then rebuilding the feature. Based on this idea of the feature destruction
and reconstruction, we introduce the diffusion model to process some important features.

Figure 1: Overview of Swintormer. Training: the image is encoded by τθ into the latent feature z0, Gaussian noise is gradually added to obtain the noisy feature zt, and the denoising U-Net ϵθ(zt, t, c) is trained with the blurry image as the condition c. Inference: a reverse diffusion process of length T produces the prior feature, which is concatenated with the blurry image and processed by the Swintormer encoder-decoder, built from Transformer blocks (channel attention and window attention) with pixel-unshuffle/pixel-shuffle sampling and skip connections.

Specifically,
our diffusion model is based on latent conditional denoising diffusion probabilistic models [59, 26]. It consists of a forward diffusion process $q(z_{1:T}|z_0)$ and a reverse denoising process $p_\theta(z_{0:T})$, where $T$ is the length of a fixed Markov chain. Given a feature $x$, we use VQGAN [18] to obtain the latent representation $z = \mathcal{E}(x)$. In this latent space, we compute the forward diffusion process for training the diffusion model and the reverse denoising process for generating the processed feature.
Forward diffusion process. In the forward process, we input $z$ and gradually add Gaussian noise $\mathcal{N}$ to it according to a variance schedule $\beta_1, \ldots, \beta_T$:
$$q(z_{1:T}|z_0) := \prod_{t=1}^{T} q(z_t|z_{t-1}), \quad (8)$$
$$q(z_t|z_{t-1}) := \mathcal{N}(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t I). \quad (9)$$
By fixing the variances $\beta_t$ to constants and using the reparameterization [34] $z_t(z_0, \epsilon) = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ for $\epsilon \sim \mathcal{N}(0, I)$, Eq. (9) can be rewritten as:
$$q(z_t|z_0) = \mathcal{N}(z_t; \sqrt{\bar{\alpha}_t}\, z_0, (1-\bar{\alpha}_t) I), \quad (10)$$
where $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{i=1}^{t} \alpha_i$. Specifically, we train a denoising network $\epsilon_\theta$ by predicting $\epsilon$ from $z_t$.
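As a quick illustration of Eq. (10), the noisy latent $z_t$ can be sampled in closed form for any step $t$. The sketch below assumes a linear variance schedule and 4-D latent tensors; both are illustrative assumptions, not values taken from the paper.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed variance schedule beta_1..beta_T
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)    # bar{alpha}_t = prod_i alpha_i

def q_sample(z0, t, eps=None):
    """Draw z_t ~ q(z_t | z_0) using Eq. (10)."""
    if eps is None:
        eps = torch.randn_like(z0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
    return zt, eps
```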
Reverse denoising process. In the reverse process $p_\theta(z_{0:T})$:
$$p_\theta(z_{0:T}) := p(z_T) \prod_{t=1}^{T} p_\theta(z_{t-1}|z_t), \quad (11)$$
where
$$\tilde{\mu}_t(z_t, z_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, z_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, z_t, \quad (14)$$
and
$$\tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t. \quad (15)$$
Consequently, with the trained denoising network $\epsilon_\theta$, conditioned on $c$ (the condition $c$ is usually an image [56], but can also be text or a semantic map [27, 58]), predicting the noise $\epsilon$, we can iteratively sample $z_t$ as follows:
$$z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(z_t, t, c)\right) + \sqrt{1-\alpha_t}\,\epsilon_t, \quad (16)$$
where $\epsilon_t \sim \mathcal{N}(0, I)$. After $T$ iterations, we obtain the feature $z_0$, as illustrated in Fig. 1. We further explore the number of iterations $T$ in Section 5.
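The sampling rule of Eq. (16) translates directly into a loop over the timesteps. In the sketch below, eps_model is a stand-in for the trained conditional denoiser ϵθ, and alphas and alphas_bar are the schedule tensors from the forward-process sketch above; at inference time the paper samples with DDIM [67], whereas this sketch follows the ancestral rule of Eq. (16) for clarity.

```python
import torch

@torch.no_grad()
def p_sample_loop(eps_model, z_T, cond, alphas, alphas_bar):
    """Iteratively apply Eq. (16), turning noise z_T into the prior feature z_0."""
    z = z_T
    for t in reversed(range(len(alphas))):
        t_batch = torch.full((z.shape[0],), t, device=z.device, dtype=torch.long)
        eps_hat = eps_model(z, t_batch, cond)                 # predicted noise
        coef = (1.0 - alphas[t]) / (1.0 - alphas_bar[t]).sqrt()
        z = (z - coef * eps_hat) / alphas[t].sqrt()
        if t > 0:                                             # no noise at the final step
            z = z + (1.0 - alphas[t]).sqrt() * torch.randn_like(z)
    return z
```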
We introduce a novel strategy to ensure consistency between the input tensor sizes used in both
training and inference. This is achieved by incorporating a pre-processing operation that employs
shifted windows before the inference step. While this pre-processing operation may introduce
additional computational overhead as overlapping regions are redundantly processed by the entire
model, it also increases parallelization due to maintaining the same batch size, thereby expediting the
inference process. Moreover, this approach grants control over the size of the overlapping region
by adjusting the sliding pace, allowing for a more nuanced trade-off between deblurring quality and
inference speed in practical applications.
First, we establish a diffusion model for feature extraction, and then train Swintormer for deblurring. We utilize a super-resolution LDM [16] to implement the diffusion model. This specific
LDM is chosen because its generated features align more closely with the input image’s own feature
distribution, rather than integrating the overall feature distribution of other images in the training
dataset. The super-resolution LDM consists of a denoising autoencoder ϵθ and a VQGAN model τθ .
The VQGAN is a pre-trained model, and in our approach, regardless of the training or inference stage,
its parameters are frozen, and we only need to train the denoising autoencoder ϵθ in the LDM. ϵθ is a
time-conditioned U-Net denoising autoencoder [59, 60]. Specifically, we first use τθ (x) to encode
the input image and obtain the feature z0. Then, in the latent space, we iteratively add Gaussian noise to the input feature z0 to obtain the noisy feature zt. During this diffusion process, we train the denoising autoencoder ϵθ(zt, t, c), t = 1 . . . T, to make its estimated noise consistent with the
Gaussian noise we introduce:
$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\|\epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y))\right\|_2^2\right], \quad (17)$$
Here, the input y is the blurred image used to guide the diffusion process.
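A minimal sketch of one training step for Eq. (17) follows. Here encoder stands for the frozen VQGAN encoder, denoiser for ϵθ, and alphas_bar for the schedule from the forward-process sketch; all names are placeholders rather than the released code.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(denoiser, encoder, x, y, alphas_bar, T):
    """One step of Eq. (17): predict the noise injected into the latent of x,
    conditioned on the encoded blurry image y."""
    with torch.no_grad():                       # the VQGAN encoder stays frozen
        z0 = encoder(x)                         # z_0 = E(x)
        cond = encoder(y)                       # condition c = tau_theta(y)
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps      # Eq. (10)
    eps_hat = denoiser(zt, t, cond)
    return F.mse_loss(eps_hat, eps)             # ||eps - eps_theta(z_t, t, tau(y))||_2^2
```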
After that, we proceed to train the Swintormer deblurring model. We utilize the trained denoising
autoencoder model ϵθ to estimate the noise present in the blurred image. The estimated noise is
then used through Eq. (16) to sample and acquire the prior feature z0 . Subsequently, along with
the corresponding blurred image, it is employed in the training of the Swintormer model ϕθ . This
training process uses the L1 loss and the perceptual loss, as in Eq. (18) and Eq. (19):
$$L_{deblur} = \left\|I_s - \phi_\theta(I_b, z_0)\right\|_1. \quad (18)$$
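The deblurring objective itself is a plain reconstruction loss; the sketch below covers only the L1 term of Eq. (18), with swintormer as a placeholder for ϕθ and the perceptual term of Eq. (19) omitted.

```python
import torch.nn.functional as F

def deblur_loss(swintormer, blurry, sharp, prior_z0):
    """L1 objective of Eq. (18): the prior feature z_0 is fed alongside the blurry image."""
    restored = swintormer(blurry, prior_z0)
    return F.l1_loss(restored, sharp)
```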
3.5 Inference
Using the trained denoising autoencoder and Swintormer for deblurring involves two corresponding stages. First, prior feature extraction is performed. The blurred image $x \in \mathbb{R}^{H \times W \times 3}$ to be processed is fed into the denoising autoencoder $\epsilon_\theta$, and the resulting encoded feature $z_t$ is diffused through DDIM [67] to obtain the prior feature $z_0 \in \mathbb{R}^{H \times W \times 3}$. It is worth noting that the prior feature $z_0$ is not decoded by the VQGAN model $\tau_\theta$. Instead, $z_0$ is concatenated with $x$ along the channel dimension to form an extended feature tensor $x_f \in \mathbb{R}^{H \times W \times 6}$ for the deblurring computation. To reduce the distribution shift between training and inference, we partition the feature tensor into overlapping patches, resulting in the input tensor $x_{input} \in \mathbb{R}^{M \times M \times C \times B}$, where $M$ is the window size, $B$ is the training batch size, and $C$ is the channel dimension of the input tensor. After processing by Swintormer, deblurred image patches $x_{dp} \in \mathbb{R}^{M \times M \times 3 \times B}$ are obtained. Finally, the resulting patches are merged into a complete deblurred image $x_{db} \in \mathbb{R}^{H \times W \times 3}$, where overlapping regions are averaged to generate the corresponding values.
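A sketch of this overlapping-window inference under assumed names: model stands for the trained Swintormer applied to the 6-channel feature tensor, window corresponds to the training window size, and stride is the "sliding pace" that controls the overlap. Image sides are assumed to be at least as large as the window.

```python
import torch

def window_starts(size, window, stride):
    """Start positions covering [0, size); the last window is aligned to the edge."""
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)
    return starts

@torch.no_grad()
def sliding_window_inference(model, x, window=512, stride=220):
    """Run the model on overlapping patches and average the overlapping regions.
    x: (1, C, H, W) tensor (blurry image concatenated with the prior feature)."""
    _, _, H, W = x.shape
    out = torch.zeros(1, 3, H, W, device=x.device)     # accumulated restored image
    count = torch.zeros(1, 1, H, W, device=x.device)   # windows covering each pixel
    for top in window_starts(H, window, stride):
        for left in window_starts(W, window, stride):
            patch = x[:, :, top:top + window, left:left + window]
            out[:, :, top:top + window, left:left + window] += model(patch)
            count[:, :, top:top + window, left:left + window] += 1.0
    return out / count                                  # averaged overlaps
```

A smaller stride increases the overlap (and the redundant computation) but reduces seam artifacts, which is the trade-off between deblurring quality and inference speed described above.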
Table 2: Single-image defocus deblurring results on the RealDOF dataset [38]. Our method outper-
forms existing baselines without extra training data.
Method PSNR ↑ SSIM ↑ LPIPS ↓
DPDNet [1] 22.87 0.670 0.425
KPAC [66] 23.98 0.716 0.336
IFAN [38] 24.71 0.749 0.306
Restormer [81] 25.08 0.769 0.289
DRBNet [61] 25.75 0.771 0.257
Swintormer(ours) 25.83 0.772 0.257
4.1 Defocus Deblurring Results
As shown in Table 2, our Swintormer performs best on the RealDOF dataset [38]. Table 3 shows image fidelity scores compared with conventional methods. Both versions of our Swintormer, trained with the two different loss functions, have advantages over the state of the art. In particular, in the outdoor scene category, Swintormer yields a 0.12 dB improvement over the previous best method, GRL [41]. It is also worth noting that our method achieves the highest perceptual scores in terms of LPIPS on all scene categories.
Table 3: Deblurring results on the DPDD dataset (containing 37 indoor and 39 outdoor scenes). Swintormer sets the new state of the art in PSNR and LPIPS by using the L1 loss function and the perceptual loss function, respectively. S: single-image defocus deblurring. D: dual-pixel defocus deblurring.
Indoor Scenes Outdoor Scenes Combined
Method PSNR ↑ SSIM ↑ MAE ↓ LPIPS ↓ PSNR ↑ SSIM ↑ MAE ↓ LPIPS ↓ PSNR ↑ SSIM ↑ MAE ↓ LPIPS ↓
EBDBS [28] 25.77 0.772 0.040 0.297 21.25 0.599 0.058 0.373 23.45 0.683 0.049 0.336
DMENetS [37] 25.50 0.788 0.038 0.298 21.43 0.644 0.063 0.397 23.41 0.714 0.051 0.349
JNBS [64] 26.73 0.828 0.031 0.273 21.10 0.608 0.064 0.355 23.84 0.715 0.048 0.315
DPDNetS [1] 26.54 0.816 0.031 0.239 22.25 0.682 0.056 0.313 24.34 0.747 0.044 0.277
KPACS [66] 27.97 0.852 0.026 0.182 22.62 0.701 0.053 0.269 25.22 0.774 0.040 0.227
IFANS [38] 28.11 0.861 0.026 0.179 22.76 0.720 0.052 0.254 25.37 0.789 0.039 0.217
RestormerS [81] 28.87 0.882 0.025 0.145 23.24 0.743 0.050 0.209 25.98 0.811 0.038 0.178
GRLS [41] 29.06 0.886 0.024 0.139 23.45 0.761 0.049 0.196 26.18 0.822 0.037 0.168
SwintormerS -Perceptual 28.95 0.883 0.025 0.141 23.33 0.750 0.050 0.205 26.09 0.819 0.038 0.168
SwintormerS -L1 28.99 0.884 0.025 0.142 23.51 0.769 0.042 0.209 26.18 0.823 0.034 0.176
DPDNetD [1] 27.48 0.849 0.029 0.189 22.90 0.726 0.052 0.255 25.13 0.786 0.041 0.223
RDPDD [2] 28.10 0.843 0.027 0.210 22.82 0.704 0.053 0.298 25.39 0.772 0.040 0.255
UformerD [78] 28.23 0.860 0.026 0.199 23.10 0.728 0.051 0.285 25.65 0.795 0.039 0.243
IFAND [38] 28.66 0.868 0.025 0.172 23.46 0.743 0.049 0.240 25.99 0.804 0.037 0.207
RestormerD [81] 29.48 0.895 0.023 0.134 23.97 0.773 0.047 0.175 26.66 0.833 0.035 0.155
GRLD [41] 29.83 0.903 0.022 0.114 24.39 0.795 0.045 0.150 27.04 0.847 0.034 0.133
SwintormerD -Perceptual 29.55 0.897 0.023 0.107 24.40 0.796 0.045 0.147 26.91 0.845 0.034 0.128
SwintormerD -L1 29.74 0.899 0.022 0.127 24.52 0.798 0.045 0.167 27.07 0.847 0.034 0.148
Experimental results for motion deblurring are shown in Table 4. Our proposed method also achieved
the leading performance.
5 Ablation Studies
In this section, we investigate the effectiveness of the different designs of our proposed method. All
experiments are conducted on the DPDD dataset. Previous findings have shown that our contributions
have led to significant performance enhancements. Next, we will analyze the influence of each
component individually.
Table 5: Ablation experiments. We train and test models on the DPDD dataset. For the baseline, we apply Restormer [81], a Transformer architecture based on channel attention (MDTA). T represents the number of iterations in the diffusion model.
Network Component Params (M) MACs (G) PSNR (dB)
(a) Baseline: Transformer(MDTA) 25.05 1.93 26.66
(b) Transformer block: Transformer(Swin) 25.18 2.20 26.71
(b) Transformer block: Transformer(Swin+MDTA) 25.18 2.20 26.74
(c) Pre-processing inference: window size(256)+shift size(220)+Transformer(Swin+MDTA) 25.18 2.20 26.84
(c) Pre-processing inference: window size(512)+shift size(220)+Transformer(Swin+MDTA) 25.18 2.20 26.98
(c) Pre-processing inference: window size(512)+shift size(220)+Transformer(MDTA) 26.13 18.70 26.91
(c) Pre-processing inference: window size(512)+shift size(384)+Transformer(Swin+MDTA) 25.18 2.20 26.89
(d) Diffusion prior: T(5)+Transformer(MDTA) 138.8 8.02 26.67
(d) Diffusion prior: T(10)+Transformer(MDTA) 138.8 8.02 26.71
(d) Diffusion prior: T(20)+Transformer(MDTA) 138.8 8.02 26.75
(d) Diffusion prior: T(50)+Transformer(MDTA) 138.8 8.02 26.77
Overall: T(50)+window size(512)+shift size(220)+Transformer(Swin+MDTA) 154.89 8.02 27.07
Improvements in mixed attention. Table 5b shows that the Shifted Windows-Dconv Attention achieves comparable performance to MDTA. Furthermore, combining the Shifted Windows-Dconv Attention with MDTA brings better performance. Overall, our proposed Transformer block contributions lead to a gain of 0.08 dB over the baseline.
Improvements in pre-processing inference. Table 5c shows that the pre-processing inference achieves comparable performance to TLC [13]. Furthermore, it is noteworthy that a significant performance improvement of 0.25 dB over the baseline is achieved by simply adjusting the window size and shift size to align the input tensor size with the training tensor size, without requiring retraining or fine-tuning.
Impact of diffusion prior. We construct a baseline model without priors generated by the diffusion model. Table 5d demonstrates that the diffusion priors provide a favorable gain of 0.11 dB over the baseline. Furthermore, we explore the impact of the number of iterations T in the diffusion model. A larger number of iterations leads the diffusion model to generate more accurate features, which the corresponding deblurring model can then exploit. Based on the test results, once the number of iterations reaches 20, the improvement of the entire deblurring model gradually converges. Nevertheless, for better performance, we choose T = 50 for the final model.
6 Conclusion
We endeavor to extend the applicability of deep learning deblurring methods beyond laboratory
settings, aiming to achieve favorable results across a broader spectrum of real-world scenarios.
Our principal contributions are a new model and a new inference strategy that make it possible to deblur high-resolution images on a personal computer. Specifically, we built a memory-efficient model named Swintormer that integrates a diffusion model. It is an image deblurring Transformer model, designed to efficiently process high-resolution images with remarkably low MACs. The proposed Transformer block demonstrates improved performance by applying self-attention mechanisms across both channel and spatial dimensions, while maintaining linear complexity. Furthermore, we present a plug-and-play methodology that ensures consistency in input tensor sizes during training and
inference, thereby enhancing model performance. Importantly, this approach obviates the need for
retraining or fine-tuning, resulting in performance enhancements across various tasks.
References
[1] Abuolaim, A., Brown, M.S.: Defocus deblurring using dual-pixel data. In: Vedaldi, A., Bischof,
H., Brox, T., Frahm, J. (eds.) Computer Vision - ECCV 2020 - 16th European Conference,
Glasgow, UK, August 23-28, 2020, Proceedings, Part X. Lecture Notes in Computer Science, vol.
12355, pp. 111–126. Springer (2020). https://doi.org/10.1007/978-3-030-58607-2_7, https:
//doi.org/10.1007/978-3-030-58607-2_7
[2] Abuolaim, A., Delbracio, M., Kelly, D., Brown, M.S., Milanfar, P.: Learning to reduce defocus
blur by realistically modeling dual-pixel data. In: 2021 IEEE/CVF International Conference on
Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. pp. 2269–2278.
IEEE (2021). https://doi.org/10.1109/ICCV48922.2021.00229, https://doi.org/10.1109/
ICCV48922.2021.00229
[3] Beltagy, I., Peters, M.E., Cohan, A.: Longformer: The Long-Document Transformer. ArXiv
preprint abs/2004.05150 (2020), https://arxiv.org/abs/2004.05150
[4] Burger, H.C., Schuler, C.J., Harmeling, S.: Image denoising: Can plain neural networks
compete with bm3d? In: 2012 IEEE Conference on Computer Vision and Pattern Recog-
nition, Providence, RI, USA, June 16-21, 2012. pp. 2392–2399. IEEE Computer Soci-
ety (2012). https://doi.org/10.1109/CVPR.2012.6247952, https://doi.org/10.1109/CVPR.
2012.6247952
[5] Chen, L., Chu, X., Zhang, X., Sun, J.: Simple baselines for image restoration. In: Avidan, S.,
Brostow, G.J., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision - ECCV 2022 -
17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part VII. Lecture
Notes in Computer Science, vol. 13667, pp. 17–33. Springer (2022). https://doi.org/10.1007/978-
3-031-20071-7_2, https://doi.org/10.1007/978-3-031-20071-7_2
[6] Chen, L., Lu, X., Zhang, J., Chu, X., Chen, C.: Hinet: Half instance normalization network
for image restoration. In: IEEE Conference on Computer Vision and Pattern Recognition
Workshops, CVPR Workshops 2021, virtual, June 19-25, 2021. pp. 182–192. Computer
Vision Foundation / IEEE (2021). https://doi.org/10.1109/CVPRW53098.2021.00027,
https://openaccess.thecvf.com/content/CVPR2021W/NTIRE/html/Chen_HINet_
Half_Instance_Normalization_Network_for_Image_Restoration_CVPRW_2021_
paper.html
[7] Chen, X., Wang, X., Zhou, J., Qiao, Y., Dong, C.: Activating more pixels in image super-
resolution transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). pp. 22367–22377 (June 2023)
[8] Cheng, W., Zhao, M., Ye, Z., Gu, S.: Mfagan: A compression framework for memory-efficient
on-device super-resolution gan. ArXiv preprint abs/2107.12679 (2021), https://arxiv.org/
abs/2107.12679
[9] Child, R., Gray, S., Radford, A., Sutskever, I.: Generating Long Sequences with Sparse Trans-
formers. ArXiv preprint abs/1904.10509 (2019), https://arxiv.org/abs/1904.10509
[10] Cho, S., Ji, S., Hong, J., Jung, S., Ko, S.: Rethinking coarse-to-fine approach in sin-
gle image deblurring. In: 2021 IEEE/CVF International Conference on Computer Vision,
ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. pp. 4621–4630. IEEE (2021).
https://doi.org/10.1109/ICCV48922.2021.00460, https://doi.org/10.1109/ICCV48922.
2021.00460
[11] Choromanski, K.M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlós, T., Hawkins, P.,
Davis, J.Q., Mohiuddin, A., Kaiser, L., Belanger, D.B., Colwell, L.J., Weller, A.: Rethinking
attention with performers. In: 9th International Conference on Learning Representations, ICLR
2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net (2021), https://openreview.
net/forum?id=Ua6zuk0WRH
[12] Chu, X., Chen, L., Chen, C., Lu, X.: Improving image restoration by revisiting global
information aggregation. In: Avidan, S., Brostow, G.J., Cissé, M., Farinella, G.M., Has-
sner, T. (eds.) Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Is-
rael, October 23-27, 2022, Proceedings, Part VII. Lecture Notes in Computer Science, vol.
13667, pp. 53–71. Springer (2022). https://doi.org/10.1007/978-3-031-20071-7_4, https:
//doi.org/10.1007/978-3-031-20071-7_4
[13] Chu, X., Chen, L., Chen, C., Lu, X.: Improving image restoration by revisiting global
information aggregation. In: Avidan, S., Brostow, G.J., Cissé, M., Farinella, G.M., Has-
sner, T. (eds.) Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Is-
rael, October 23-27, 2022, Proceedings, Part VII. Lecture Notes in Computer Science, vol.
13667, pp. 53–71. Springer (2022). https://doi.org/10.1007/978-3-031-20071-7_4, https:
//doi.org/10.1007/978-3-031-20071-7_4
[14] Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated convolutional
networks. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference
on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. Proceedings
of Machine Learning Research, vol. 70, pp. 933–941. PMLR (2017), http://proceedings.
mlr.press/v70/dauphin17a.html
[15] Deng, X., Zhang, Y., Xu, M., Gu, S., Duan, Y.: Deep coupled feedback network for joint
exposure fusion and image super-resolution. IEEE Transactions on Image Processing 30, 3098–
3112 (2021)
[16] Dhariwal, P., Nichol, A.Q.: Diffusion models beat gans on image synthesis. In:
Ranzato, M., Beygelzimer, A., Dauphin, Y.N., Liang, P., Vaughan, J.W. (eds.) Ad-
vances in Neural Information Processing Systems 34: Annual Conference on Neu-
ral Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, vir-
tual. pp. 8780–8794 (2021), https://proceedings.neurips.cc/paper/2021/hash/
49ad23d1ec9fa4bd8d77d02681df5cfa-Abstract.html
[17] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani,
M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16
words: Transformers for image recognition at scale. In: 9th International Conference on
Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net
(2021), https://openreview.net/forum?id=YicbFdNTTy
[18] Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution im-
age synthesis. In: IEEE Conference on Computer Vision and Pattern Recogni-
tion, CVPR 2021, virtual, June 19-25, 2021. pp. 12873–12883. Computer Vi-
sion Foundation / IEEE (2021). https://doi.org/10.1109/CVPR46437.2021.01268,
https://openaccess.thecvf.com/content/CVPR2021/html/Esser_Taming_
Transformers_for_High-Resolution_Image_Synthesis_CVPR_2021_paper.html
[19] Fergus, R., Singh, B., Hertzmann, A., Roweis, S.T., Freeman, W.T.: Removing cam-
era shake from a single photograph. In: ACM SIGGRAPH 2006 Papers. p. 787–794.
SIGGRAPH ’06, Association for Computing Machinery, New York, NY, USA (2006).
https://doi.org/10.1145/1179352.1141956, https://doi.org/10.1145/1179352.1141956
[20] Fu, X., Wang, M., Cao, X., Ding, X., Zha, Z.J.: A model-driven deep unfolding method for jpeg
artifacts removal. IEEE Transactions on Neural Networks and Learning Systems (2021)
[21] Fu, X., Zha, Z., Wu, F., Ding, X., Paisley, J.W.: JPEG artifacts reduction via deep convolutional
sparse coding. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV
2019, Seoul, Korea (South), October 27 - November 2, 2019. pp. 2501–2510. IEEE (2019).
https://doi.org/10.1109/ICCV.2019.00259, https://doi.org/10.1109/ICCV.2019.00259
[22] Ghahremani, M., Khateri, M., Sierra, A., Tohka, J.: Adversarial distortion learning for medical
image denoising. ArXiv preprint abs/2204.14100 (2022), https://arxiv.org/abs/2204.
14100
[23] Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,
A., Bengio, Y.: Generative adversarial networks (2014)
[24] Guo, Y., Chen, J., Wang, J., Chen, Q., Cao, J., Deng, Z., Xu, Y., Tan, M.: Closed-
loop matters: Dual regression networks for single image super-resolution. In: 2020
IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seat-
tle, WA, USA, June 13-19, 2020. pp. 5406–5415. Computer Vision Foundation / IEEE
(2020). https://doi.org/10.1109/CVPR42600.2020.00545, https://openaccess.thecvf.
com/content_CVPR_2020/html/Guo_Closed-Loop_Matters_Dual_Regression_
Networks_for_Single_Image_Super-Resolution_CVPR_2020_paper.html
[25] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.B.: Masked autoencoders are scalable
vision learners. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition,
CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 15979–15988. IEEE (2022).
https://doi.org/10.1109/CVPR52688.2022.01553, https://doi.org/10.1109/CVPR52688.
2022.01553
[26] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Larochelle, H.,
Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing
Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS
2020, December 6-12, 2020, virtual (2020), https://proceedings.neurips.cc/paper/
2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html
[27] Isola, P., Zhu, J., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial
networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR
2017, Honolulu, HI, USA, July 21-26, 2017. pp. 5967–5976. IEEE Computer Society (2017).
https://doi.org/10.1109/CVPR.2017.632, https://doi.org/10.1109/CVPR.2017.632
[28] Karaali, A., Jung, C.R.: Edge-based defocus blur estimation with adaptive scale selection. TIP
(2017)
[29] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving
the image quality of stylegan. In: 2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. pp. 8107–8116.
Computer Vision Foundation / IEEE (2020). https://doi.org/10.1109/CVPR42600.2020.00813,
https://openaccess.thecvf.com/content_CVPR_2020/html/Karras_Analyzing_
and_Improving_the_Image_Quality_of_StyleGAN_CVPR_2020_paper.html
[30] Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are rnns: Fast autoregressive
transformers with linear attention. In: Proceedings of the 37th International Conference on
Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event. Proceedings of Machine
Learning Research, vol. 119, pp. 5156–5165. PMLR (2020), http://proceedings.mlr.
press/v119/katharopoulos20a.html
[31] Kawar, B., Elad, M., Ermon, S., Song, J.: Denoising diffusion restoration models. In: Advances
in Neural Information Processing Systems (2022)
[32] Kawar, B., Song, J., Ermon, S., Elad, M.: Jpeg artifact correction using denoising diffusion
restoration models. In: Neural Information Processing Systems (NeurIPS) Workshop on Score-
Based Methods (2022)
[33] Kim, Y., Soh, J.W., Park, J., Ahn, B., Lee, H.S., Moon, Y.S., Cho, N.I.: A pseudo-blind
convolutional neural network for the reduction of compression artifacts. IEEE Transactions on
Circuits and Systems for Video Technology 30(4), 1121–1135 (2019)
[34] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: Bengio, Y., LeCun, Y. (eds.)
2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada,
April 14-16, 2014, Conference Track Proceedings (2014), http://arxiv.org/abs/1312.
6114
[35] Kingma, D.P., Welling, M.: Auto-encoding variational bayes (2022)
[36] Kitaev, N., Kaiser, L., Levskaya, A.: Reformer: The efficient transformer. In: 8th International
Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30,
2020. OpenReview.net (2020), https://openreview.net/forum?id=rkgNKkHtvB
[37] Lee, J., Lee, S., Cho, S., Lee, S.: Deep defocus map estimation using domain
adaptation. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. pp. 12222–12230. Com-
puter Vision Foundation / IEEE (2019). https://doi.org/10.1109/CVPR.2019.01250,
http://openaccess.thecvf.com/content_CVPR_2019/html/Lee_Deep_Defocus_
Map_Estimation_Using_Domain_Adaptation_CVPR_2019_paper.html
[38] Lee, J., Son, H., Rim, J., Cho, S., Lee, S.: Iterative filter adaptive network for single image
defocus deblurring. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR
2021, virtual, June 19-25, 2021. pp. 2034–2042. Computer Vision Foundation / IEEE (2021).
https://doi.org/10.1109/CVPR46437.2021.00207, https://openaccess.thecvf.com/
content/CVPR2021/html/Lee_Iterative_Filter_Adaptive_Network_for_Single_
Image_Defocus_Deblurring_CVPR_2021_paper.html
[39] Lempitsky, V., Vedaldi, A., Ulyanov, D.: Deep image prior. In: 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 9446–9454 (2018).
https://doi.org/10.1109/CVPR.2018.00984
[40] Li, J., Hassani, A., Walton, S., Shi, H.: Convmlp: Hierarchical convolutional mlps for
vision. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR
2023 - Workshops, Vancouver, BC, Canada, June 17-24, 2023. pp. 6307–6316. IEEE
(2023). https://doi.org/10.1109/CVPRW59228.2023.00671, https://doi.org/10.1109/
CVPRW59228.2023.00671
[41] Li, Y., Fan, Y., Xiang, X., Demandolx, D., Ranjan, R., Timofte, R., Gool, L.V.: Efficient
and explicit modelling of image hierarchies for image restoration. In: IEEE/CVF Conference
on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June
17-24, 2023. pp. 18278–18289. IEEE (2023). https://doi.org/10.1109/CVPR52729.2023.01753,
https://doi.org/10.1109/CVPR52729.2023.01753
[42] Liang, J., Cao, J., Sun, G., Zhang, K., Gool, L.V., Timofte, R.: Swinir: Image restora-
tion using swin transformer. In: IEEE/CVF International Conference on Computer Vi-
sion Workshops, ICCVW 2021, Montreal, BC, Canada, October 11-17, 2021. pp. 1833–
1844. IEEE (2021). https://doi.org/10.1109/ICCVW54120.2021.00210, https://doi.org/
10.1109/ICCVW54120.2021.00210
[43] Liang, J., Cao, J., Sun, G., Zhang, K., Gool, L.V., Timofte, R.: Swinir: Image restora-
tion using swin transformer. In: IEEE/CVF International Conference on Computer Vi-
sion Workshops, ICCVW 2021, Montreal, BC, Canada, October 11-17, 2021. pp. 1833–
1844. IEEE (2021). https://doi.org/10.1109/ICCVW54120.2021.00210, https://doi.org/
10.1109/ICCVW54120.2021.00210
[44] Liang, J., Lugmayr, A., Zhang, K., Danelljan, M., Gool, L.V., Timofte, R.: Hier-
archical conditional flow: A unified framework for image super-resolution and image
rescaling. In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV
2021, Montreal, QC, Canada, October 10-17, 2021. pp. 4056–4065. IEEE (2021).
https://doi.org/10.1109/ICCV48922.2021.00404, https://doi.org/10.1109/ICCV48922.
2021.00404
[45] Liang, J., Sun, G., Zhang, K., Gool, L.V., Timofte, R.: Mutual affine network for spatially
variant kernel estimation in blind image super-resolution. In: 2021 IEEE/CVF International
Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021.
pp. 4076–4085. IEEE (2021). https://doi.org/10.1109/ICCV48922.2021.00406, https://doi.
org/10.1109/ICCV48922.2021.00406
[46] Liang, J., Zhang, K., Gu, S., Gool, L.V., Timofte, R.: Flow-based kernel prior with
application to blind super-resolution. In: IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021. pp. 10601–10610. Computer
Vision Foundation / IEEE (2021). https://doi.org/10.1109/CVPR46437.2021.01046, https:
//openaccess.thecvf.com/content/CVPR2021/html/Liang_Flow-Based_Kernel_
Prior_With_Application_to_Blind_Super-Resolution_CVPR_2021_paper.html
[47] Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D.,
Zitnick, C.L., Dollár, P.: Microsoft coco: Common objects in context (2015)
[48] Liu, H., Dai, Z., So, D.R., Le, Q.V.: Pay attention to mlps. In: Ranzato, M., Beygelzimer, A.,
Dauphin, Y.N., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Sys-
tems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021,
December 6-14, 2021, virtual. pp. 9204–9215 (2021), https://proceedings.neurips.cc/
paper/2021/hash/4cc05b35c2f937c5bd9e7d41d3686fff-Abstract.html
[49] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer:
Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International
Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. pp.
9992–10002. IEEE (2021). https://doi.org/10.1109/ICCV48922.2021.00986, https://doi.
org/10.1109/ICCV48922.2021.00986
[50] Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. In: 5th
International Conference on Learning Representations, ICLR 2017, Toulon, France, April
24-26, 2017, Conference Track Proceedings. OpenReview.net (2017), https://openreview.
net/forum?id=Skq89Scxx
[51] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference
on Learning Representations (2019), https://openreview.net/forum?id=Bkg6RiCqY7
[52] Luo, Z., Gustafsson, F.K., Zhao, Z., Sjölund, J., Schön, T.B.: Image restoration with mean-
reverting stochastic differential equations. International Conference on Machine Learning
(2023)
[53] Mao, X., Liu, Y., Shen, W., Li, Q., Wang, Y.: Deep residual fourier transformation for single
image deblurring. ArXiv preprint abs/2111.11745 (2021), https://arxiv.org/abs/2111.
11745
[54] Nah, S., Kim, T.H., Lee, K.M.: Deep multi-scale convolutional neural network for dy-
namic scene deblurring. In: 2017 IEEE Conference on Computer Vision and Pattern Recog-
nition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 257–265. IEEE Computer
Society (2017). https://doi.org/10.1109/CVPR.2017.35, https://doi.org/10.1109/CVPR.
2017.35
[55] Pan, J., Sun, D., Pfister, H., Yang, M.H.: Deblurring images via dark channel prior.
IEEE Transactions on Pattern Analysis and Machine Intelligence 40(10), 2315–2328 (2018).
https://doi.org/10.1109/TPAMI.2017.2753804
[56] Park, T., Liu, M., Wang, T., Zhu, J.: Semantic image synthesis with spatially-adaptive
normalization. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. pp. 2337–2346. Computer
Vision Foundation / IEEE (2019). https://doi.org/10.1109/CVPR.2019.00244, http:
//openaccess.thecvf.com/content_CVPR_2019/html/Park_Semantic_Image_
Synthesis_With_Spatially-Adaptive_Normalization_CVPR_2019_paper.html
[57] Peebles, W., Xie, S.: Scalable diffusion models with transformers (2023)
[58] Reed, S.E., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial
text to image synthesis. In: Balcan, M., Weinberger, K.Q. (eds.) Proceedings of the 33nd
International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June
19-24, 2016. JMLR Workshop and Conference Proceedings, vol. 48, pp. 1060–1069. JMLR.org
(2016), http://proceedings.mlr.press/v48/reed16.html
[59] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis
with latent diffusion models. In: IEEE/CVF Conference on Computer Vision and Pattern
Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 10674–10685.
IEEE (2022). https://doi.org/10.1109/CVPR52688.2022.01042, https://doi.org/10.1109/
CVPR52688.2022.01042
[60] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image
segmentation. In: MICCAI (3). Lecture Notes in Computer Science, vol. 9351, pp. 234–241.
Springer (2015)
[61] Ruan, L., Chen, B., Li, J., Lam, M.: Learning to deblur using light field generated and real
defocus images. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition,
CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 16283–16292. IEEE (2022).
https://doi.org/10.1109/CVPR52688.2022.01582, https://doi.org/10.1109/CVPR52688.
2022.01582
[62] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large scale visual recognition
challenge (2015)
[63] Schuler, C.J., Hirsch, M., Harmeling, S., Schölkopf, B.: Learning to deblur. IEEE
Transactions on Pattern Analysis and Machine Intelligence 38(7), 1439–1451 (2016).
https://doi.org/10.1109/TPAMI.2015.2481418
[64] Shi, J., Xu, L., Jia, J.: Just noticeable defocus blur detection and estimation. In: IEEE Conference
on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015.
pp. 657–665. IEEE Computer Society (2015). https://doi.org/10.1109/CVPR.2015.7298665,
https://doi.org/10.1109/CVPR.2015.7298665
[65] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recogni-
tion. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015),
http://arxiv.org/abs/1409.1556
[66] Son, H., Lee, J., Cho, S., Lee, S.: Single image defocus deblurring using kernel-sharing
parallel atrous convolutions. In: 2021 IEEE/CVF International Conference on Computer Vision,
ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. pp. 2622–2630. IEEE (2021).
https://doi.org/10.1109/ICCV48922.2021.00264, https://doi.org/10.1109/ICCV48922.
2021.00264
[67] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv:2010.02502 (October
2020), https://arxiv.org/abs/2010.02502
[68] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based gen-
erative modeling through stochastic differential equations. In: 9th International Conference on
Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net
(2021), https://openreview.net/forum?id=PxTIG12RRHS
[69] Swofford, M.: Image completion on cifar-10 (2018)
[70] Tao, X., Gao, H., Shen, X., Wang, J., Jia, J.: Scale-recurrent network for deep image deblurring.
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
[71] Tay, Y., Bahri, D., Metzler, D., Juan, D.C., Zhao, Z., Zheng, C.: Synthesizer: Rethinking Self-
Attention in Transformer Models. ArXiv preprint abs/2005.00743 (2020), https://arxiv.
org/abs/2005.00743
[72] Tay, Y., Bahri, D., Yang, L., Metzler, D., Juan, D.: Sparse sinkhorn attention. In: Proceedings of
the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual
Event. Proceedings of Machine Learning Research, vol. 119, pp. 9438–9447. PMLR (2020),
http://proceedings.mlr.press/v119/tay20a.html
[73] Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A.C., Li, Y.: MAXIM: multi-axis
MLP for image processing. In: IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 5759–5770. IEEE (2022).
https://doi.org/10.1109/CVPR52688.2022.00568, https://doi.org/10.1109/CVPR52688.
2022.00568
[74] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polo-
sukhin, I.: Attention is all you need. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M.,
Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Advances in Neural Information Processing
Systems 30: Annual Conference on Neural Information Processing Systems 2017, December
4-9, 2017, Long Beach, CA, USA. pp. 5998–6008 (2017), https://proceedings.neurips.
cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[75] Wang, L., Wang, Y., Lin, Z., Yang, J., An, W., Guo, Y.: Learning A single network for
scale-arbitrary super-resolution. In: 2021 IEEE/CVF International Conference on Computer
Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. pp. 4781–4790. IEEE (2021).
https://doi.org/10.1109/ICCV48922.2021.00476, https://doi.org/10.1109/ICCV48922.
2021.00476
[76] Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H.: Linformer: Self-Attention with Linear
Complexity. ArXiv preprint abs/2006.04768 (2020), https://arxiv.org/abs/2006.04768
[77] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: 2018 IEEE Confer-
ence on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June
18-22, 2018. pp. 7794–7803. Computer Vision Foundation / IEEE Computer Society (2018).
https://doi.org/10.1109/CVPR.2018.00813, http://openaccess.thecvf.com/content_
cvpr_2018/html/Wang_Non-Local_Neural_Networks_CVPR_2018_paper.html
[78] Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: A general u-shaped transformer
for image restoration. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition,
CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 17662–17672. IEEE (2022).
https://doi.org/10.1109/CVPR52688.2022.01716, https://doi.org/10.1109/CVPR52688.
2022.01716
[79] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is
actually what you need for vision. In: IEEE/CVF Conference on Computer Vision and Pattern
Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 10809–10819.
IEEE (2022). https://doi.org/10.1109/CVPR52688.2022.01055, https://doi.org/10.1109/
CVPR52688.2022.01055
[80] Zaheer, M., Guruganesh, G., Dubey, K.A., Ainslie, J., Alberti, C., Ontañón, S., Pham, P., Ravula,
A., Wang, Q., Yang, L., Ahmed, A.: Big bird: Transformers for longer sequences. In: Larochelle,
H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information
Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020,
NeurIPS 2020, December 6-12, 2020, virtual (2020), https://proceedings.neurips.cc/
paper/2020/hash/c8512d142a2d849725f31a9a7a361ab9-Abstract.html
[81] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.: Restormer: Efficient
transformer for high-resolution image restoration. In: IEEE/CVF Conference on Computer
Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp.
5718–5729. IEEE (2022). https://doi.org/10.1109/CVPR52688.2022.00564, https://doi.
org/10.1109/CVPR52688.2022.00564
[82] Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H., Shao, L.: Multi-stage
progressive image restoration. In: CVPR (2021)
[83] Zhang, H., Goodfellow, I.J., Metaxas, D.N., Odena, A.: Self-attention generative adversarial
networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International
Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California,
USA. Proceedings of Machine Learning Research, vol. 97, pp. 7354–7363. PMLR (2019),
http://proceedings.mlr.press/v97/zhang19d.html
[84] Zhang, Y., Li, D., Shi, X., He, D., Song, K., Wang, X., Qin, H., Li, H.: Kbnet: Kernel basis
network for image restoration. ArXiv preprint abs/2303.02881 (2023), https://arxiv.org/
abs/2303.02881