
Pre-Trained Image Processing Transformer

Hanting Chen 1,2, Yunhe Wang 2*, Tianyu Guo 1,2, Chang Xu 3, Yiping Deng 4, Zhenhua Liu 2,5,6, Siwei Ma 5,6, Chunjing Xu 2, Chao Xu 1, Wen Gao 5,6

1 Key Lab of Machine Perception (MOE), Dept. of Machine Intelligence, Peking University. 2 Noah's Ark Lab, Huawei Technologies.
3 School of Computer Science, Faculty of Engineering, The University of Sydney. 4 Central Software Institution, Huawei Technologies.
5 Institute of Digital Media, School of Electronic Engineering and Computer Science, Peking University. 6 Peng Cheng Laboratory.

* Corresponding author

arXiv:2012.00364v2 [cs.CV] 3 Dec 2020

{htchen, tianyuguo, liu-zh, swma, wgao}@pku.edu.cn, [email protected]
{yunhe.wang, yiping.deng, xuchunjing}@huawei.com, [email protected]

Abstract

As the computing power of modern hardware increases rapidly, pre-trained deep learning models (e.g., BERT, GPT-3) learned on large-scale datasets have shown their effectiveness over conventional methods. This progress is mainly attributed to the representation ability of the transformer and its variant architectures. In this paper, we study low-level computer vision tasks (e.g., denoising, super-resolution and deraining) and develop a new pre-trained model, namely, the image processing transformer (IPT). To maximally excavate the capability of the transformer, we propose to utilize the well-known ImageNet benchmark for generating a large amount of corrupted image pairs. The IPT model is trained on these images with multi-heads and multi-tails. In addition, contrastive learning is introduced to adapt well to different image processing tasks. The pre-trained model can therefore be efficiently employed on the desired task after fine-tuning. With only one pre-trained model, IPT outperforms the current state-of-the-art methods on various low-level benchmarks.

[Bar charts: IPT vs. HAN (ECCV 2020) on SISR ×2/×3/×4 (about 0.4dB higher PSNR on each scale), vs. RDN (CVPR 2018) on denoising with noise levels 30 and 50 (about 2.0dB and 1.8dB higher), and vs. RCDNet (CVPR 2020) on deraining (about 1.6dB higher).]
Figure 1. Comparison on the performance of the proposed IPT and the state-of-the-art image processing models on different tasks.

1. Introduction

Image processing is one component of the low-level part of a more global image analysis or computer vision system. Results from image processing can largely influence the subsequent high-level part that performs recognition and understanding of the image data. Recently, deep learning has been widely applied to solve low-level vision tasks, such as image super-resolution, inpainting, deraining and colorization. As many image processing tasks are related, it is natural to expect a model pre-trained on one dataset to be helpful for another. But few studies have generalized pre-training across image processing tasks.

Pre-training has the potential to provide an attractive solution to image processing tasks by addressing the following two challenges. First, task-specific data can be limited. This problem is exacerbated in image processing tasks that involve paid-for data or data privacy, such as medical images [7] and satellite images [67]. Various inconsistent factors (e.g., camera parameters, illumination and weather) can further perturb the distribution of the captured data for training. Second, it is unknown which type of image processing job will be requested until the test image is presented. We therefore have to prepare a series of image processing modules at hand. They have distinct aims, but some underlying operations could be shared.

It is now common to have pre-training in computer vision and natural language processing. For example, the backbones of object detection models are often pre-trained on ImageNet classification [16].
A number of well-trained networks can now be easily obtained from the Internet, including AlexNet [36], VGGNet [50] and ResNet [30]. The seminal Transformer [55] has been widely used in many natural language processing (NLP) tasks, such as translation [58] and question-answering [52]. The secret of its success is to pre-train transformer-based models on a large text corpus and fine-tune them on the task-specific dataset. Variants of the Transformer, like BERT [17] and GPT-3 [4], further enriched the training data and improved the pre-training skills. There have been interesting attempts to extend the success of Transformers to the computer vision field. For example, [56, 22] applied self-attention based models to capture global information on images. Carion et al. [6] proposed DETR, which uses transformer architectures for end-to-end object detection. Most recently, Dosovitskiy et al. [20] introduced the Vision Transformer (ViT), which treats input images as 16×16 words and attained excellent results on image recognition.

The aforementioned pre-training in computer vision and natural language mostly investigates a pretext classification task, but both the input and the output of an image processing task are images. A straightforward application of these existing pre-training strategies might therefore not be feasible. Further, how to effectively address different target image processing tasks in the pre-training stage remains a hard challenge. It is also instructive to note that the pre-training of image processing models enjoys the convenience of self-generating training instances from the original real images: the synthetically manipulated images are taken for training, while the original image itself is the ground-truth to be reconstructed.

In this paper, we develop a pre-trained model for image processing using the transformer architecture, namely, the Image Processing Transformer (IPT). As the pre-trained model needs to be compatible with different image processing tasks, including super-resolution, denoising, and deraining, the entire network is composed of multiple pairs of heads and tails corresponding to different tasks and a single shared body. Since the potential of the transformer needs to be excavated using a large-scale dataset, we should prepare a great number of images with considerable diversity for training the IPT model. To this end, we select the ImageNet benchmark, which contains various high-resolution images from 1,000 categories. For each image in ImageNet, we generate multiple corrupted counterparts using several carefully designed operations to serve different tasks. For example, training samples for the super-resolution task are generated by downsampling the original images. The entire dataset we use for training IPT contains over 10 million images.

The transformer architecture is then trained on this huge dataset as follows. The training images are input to the task-specific head, and the generated features are cropped into patches (i.e., "words") and flattened to sequences. The transformer body is employed to process the flattened features, in which position and task embeddings are utilized by the encoder and decoder, respectively. In addition, the tails are forced to predict the original images with different output sizes according to the specific task. Moreover, a contrastive loss on the relationship between patches of different inputs is introduced for adapting well to different image processing tasks. The proposed image processing transformer is learned in an end-to-end manner. Experimental results conducted on several benchmarks show that the pre-trained IPT model can surpass most existing methods on their own tasks by a significant margin after fine-tuning.

2. Related Works

2.1. Image Processing

Image processing consists of the manipulation of images, including super-resolution, denoising, dehazing, deraining, deblurring, etc. A variety of deep-learning-based methods have been proposed for one or several of these image processing tasks. For super-resolution, Dong et al. propose SRCNN [18, 19], a pioneering work that introduces an end-to-end model reconstructing HR images from their LR counterparts. Kim et al. [34] further explore the capacity of deep neural networks with a deeper convolutional network. Ahn et al. [2] and Lim et al. [41] propose to introduce residual blocks into the SR task. Zhang et al. [74] and Anwar and Barnes [3] utilize the power of attention to enhance the performance on the SR task. Various excellent works have also been proposed for the other tasks, such as denoising [54, 28, 33, 37], dehazing [5, 38, 68, 65], deraining [32, 63, 49, 26, 59], and deblurring [53, 44, 21, 9]. Different from the above methods, we dig into the capacity of both big models and huge volumes of data, and introduce a pre-trained model that handles several image processing tasks.

2.2. Transformer

The Transformer [55] and its variants have proven their success as powerful unsupervised or self-supervised pre-training frameworks in various natural language processing tasks. For example, GPTs [46, 47, 4] are pre-trained in an autoregressive way, predicting the next word on huge text datasets. BERT [17] learns from data without explicit supervision and predicts masked words based on context. Colin et al. [48] propose a universal pre-training framework for several downstream tasks. Yinhan et al. [43] propose a robust variant of the original BERT.

Due to the success of transformer-based models in the NLP field, there have been many attempts to explore the benefits of the Transformer in computer vision tasks. These attempts can be roughly divided into two types.
[Diagram: task-specific heads (denoising, deraining, ×2 up, ×4 up) produce features that are flattened and fed to a transformer encoder; a task embedding is added before the transformer decoder, whose output is reshaped and passed to the matching task-specific tail (denoising, deraining, ×2 up, ×4 up).]
Figure 2. The diagram of the proposed image processing transformer (IPT). The IPT model consists of multi-heads and multi-tails for different tasks and a shared transformer body including encoder and decoder. The input images are first converted to visual features and then divided into patches as visual words for subsequent processing. The resulting images with high visual quality are reconstructed by ensembling output patches.

The first type is to introduce self-attention into the traditional convolutional neural network. Yuan et al. [66] introduce spatial attention for image segmentation. Fu et al. [23] propose DANET, which utilizes context information by combining spatial and channel attention. Wang et al. [60], Chen et al. [13] and Zhang et al. [73] also augment features by self-attention to enhance model performance on several high-level vision tasks. The second type is to replace the convolutional neural network with self-attention blocks. For instance, Kolesnikov et al. [35] and Dosovitskiy et al. [20] conduct image classification with transformer blocks. Carion et al. [6] and Zhu et al. [80] implement transformer-based models for detection. Chen et al. [10] propose a pre-trained GPT model for generative and classification tasks. Wu et al. [62] and Zhao et al. [78] propose pre-training methods for transformer-based models for the image recognition task. However, few related works focus on low-level vision tasks. In this paper, we explore a universal pre-training approach for image processing tasks.

3. Image Processing Transformer

To excavate the potential of the transformer on image processing tasks and achieve better results, we present the image processing transformer, which is pre-trained on a large-scale dataset.

3.1. IPT architecture

The overall architecture of our IPT consists of four components: heads for extracting features from the input corrupted images (e.g., images with noise or low resolution), an encoder-decoder transformer for recovering the missing information in the input data, and tails for mapping the features into restored images. Here we briefly introduce our architecture; details can be found in the supplementary material.

Heads. To adjust to different image processing tasks, we use a multi-head architecture to deal with each task separately, where each head consists of three convolutional layers. Denote the input image as x ∈ R^{3×H×W} (3 means R, G, and B); the head generates a feature map f_H ∈ R^{C×H×W} with C channels and the same height and width (typically we use C = 64). The calculation can be formulated as f_H = H^i(x), where H^i (i = {1, ..., N_t}) denotes the head for the i-th task and N_t denotes the number of tasks.
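As a concrete illustration of the multi-head design, the sketch below (PyTorch) builds one small convolutional head per task and selects it by a task index. The paper only states that each head has three convolutional layers and C = 64 output channels; the kernel sizes, padding and activations here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TaskHead(nn.Module):
    """One head: three conv layers mapping a 3xHxW image to a CxHxW feature map."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

# One head per task, selected by task index at run time.
num_tasks = 6  # x2/x3/x4 SR, noise 30/50, deraining
heads = nn.ModuleList([TaskHead(64) for _ in range(num_tasks)])

x = torch.randn(1, 3, 48, 48)   # a 48x48 training patch
f_H = heads[0](x)               # features for task 0
print(f_H.shape)                # torch.Size([1, 64, 48, 48])
```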
Transformer encoder. Before inputting the features into the transformer body, we split the given features into patches, and each patch is regarded as a "word". Specifically, the features f_H ∈ R^{C×H×W} are reshaped into a sequence of patches, i.e., f_{p_i} ∈ R^{P²×C}, i = {1, ..., N}, where N = HW/P² is the number of patches (i.e., the length of the sequence) and P is the patch size. To maintain the position information of each patch, we add learnable position encodings E_{p_i} ∈ R^{P²×C} to each patch feature f_{p_i} following [20, 6], and E_{p_i} + f_{p_i} is directly input into the transformer encoder. The architecture of the encoder layer follows the original structure in [55], which has a multi-head self-attention module and a feed forward network. The output of the encoder, f_{E_i} ∈ R^{P²×C} for each patch, has the same size as that of the input patch f_{p_i}. The calculation can be formulated as:
y_0 = [E_{p_1} + f_{p_1}, E_{p_2} + f_{p_2}, \ldots, E_{p_N} + f_{p_N}],
q_i = k_i = v_i = \mathrm{LN}(y_{i-1}),
y_i' = \mathrm{MSA}(q_i, k_i, v_i) + y_{i-1},                                   (1)
y_i = \mathrm{FFN}(\mathrm{LN}(y_i')) + y_i',   i = 1, \ldots, l,
[f_{E_1}, f_{E_2}, \ldots, f_{E_N}] = y_l,

where l denotes the number of layers in the encoder, MSA denotes the multi-head self-attention module in the conventional transformer model [55] and FFN denotes the feed forward network, which contains two fully connected layers.
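A minimal sketch of the patch splitting, the learnable position encodings and the pre-norm encoder of Eq. 1, using PyTorch's built-in transformer encoder layer. The patch size P = 4, the attention-head count and the number of layers in this example are assumptions chosen only to keep the sketch light (the paper's body uses 12 encoder layers).

```python
import torch
import torch.nn as nn

C, H, W, P = 64, 48, 48, 4                    # feature channels, spatial size, patch size
N, dim = (H // P) * (W // P), P * P * C       # number of patches, per-patch dimension

# Split the CxHxW feature map into N flattened patches of dimension P*P*C.
f_H = torch.randn(1, C, H, W)
patches = f_H.unfold(2, P, P).unfold(3, P, P)              # 1 x C x H/P x W/P x P x P
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, N, dim)

# Learnable position encodings, one per patch, added before the encoder (Eq. 1).
pos_emb = nn.Parameter(torch.zeros(1, N, dim))
tokens = patches + pos_emb

# Pre-norm encoder layers: LN -> MSA -> residual, then LN -> FFN -> residual, as in Eq. 1.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=dim, nhead=8, dim_feedforward=4 * dim,
    batch_first=True, norm_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)  # the paper uses 12 layers

f_E = encoder(tokens)                          # same shape as the input sequence
print(f_E.shape)                               # torch.Size([1, 144, 1024])
```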
Transformer decoder. The decoder also follows the same architecture and takes the output of the encoder as input in the transformer body; it consists of two multi-head self-attention (MSA) layers and one feed forward network (FFN). The difference from the original transformer here is that we utilize a task-specific embedding as an additional input of the decoder. These task-specific embeddings E_t^i ∈ R^{P²×C}, i = {1, ..., N_t}, are learned to decode features for different tasks. The calculation of the decoder can be formulated as:

z_0 = [f_{E_1}, f_{E_2}, \ldots, f_{E_N}],
q_i = k_i = \mathrm{LN}(z_{i-1}) + E_t,   v_i = \mathrm{LN}(z_{i-1}),
z_i' = \mathrm{MSA}(q_i, k_i, v_i) + z_{i-1},
q_i' = \mathrm{LN}(z_i') + E_t,   k_i' = v_i' = \mathrm{LN}(z_0),               (2)
z_i'' = \mathrm{MSA}(q_i', k_i', v_i') + z_i',
z_i = \mathrm{FFN}(\mathrm{LN}(z_i'')) + z_i'',   i = 1, \ldots, l,
[f_{D_1}, f_{D_2}, \ldots, f_{D_N}] = z_l,

where f_{D_i} ∈ R^{P²×C} denotes the outputs of the decoder. The N decoded patch features of size P² × C are then reshaped into features f_D of size C × H × W.
shaped into the features fD with size C × H × W . for task i, respectively. In addition, Eq. 4 implies that the
Tails. The properties of tails are same as those of heads, proposed framework is trained with multiple image process
we use multi tails to deal with different tasks. The cal- tasks simultaneously. Specifically, for each batch, we ran-
culation can be formulated as fT = T i (fD ), where T i domly select one task from Nt supervised tasks for train-
(i = {1, . . . , Nt }) denote the head for the ith task and Nt ing and each task will be processed using the correspond-
denotes the number of tasks. The output fT is the resulted ing head, tail and task embedding, simultaneously. After
images size of 3 × H 0 × W 0 which is determined by the the pre-training the IPT model, it will capture the intrin-
specific task. For example, H 0 = 2H, W = 2W for a 2× sic features and transformations for a large variety of image
super-resolution task. processing tasks thus can be further fine-tuned to apply on
the desired task using the new provided dataset. Moreover,
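A sketch of the multi-tail idea: tasks that change the output size (super-resolution) use a sub-pixel upsampler, while denoising and deraining keep the input resolution. The exact layer shapes are assumptions (the supplementary material describes the layout actually used).

```python
import torch
import torch.nn as nn

def make_tail(task: str, channels: int = 64) -> nn.Module:
    """Map the decoder feature map f_D (C x H x W) back to a 3-channel image,
    upsampling only for super-resolution tasks."""
    if task == "sr_x2":
        return nn.Sequential(
            nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1),
            nn.PixelShuffle(2),                            # C*4 x H x W -> C x 2H x 2W
            nn.Conv2d(channels, 3, kernel_size=3, padding=1))
    # denoising / deraining tails keep the spatial size
    return nn.Conv2d(channels, 3, kernel_size=3, padding=1)

f_D = torch.randn(1, 64, 48, 48)
print(make_tail("sr_x2")(f_D).shape)    # torch.Size([1, 3, 96, 96])
print(make_tail("denoise")(f_D).shape)  # torch.Size([1, 3, 48, 48])
```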
3.2. Pre-training on ImageNet

Besides the architecture of the transformer itself, one of the key factors for successfully training an excellent transformer is making good use of large-scale datasets. Compared with image classification, the number of available images for image processing tasks is relatively small (e.g., only 2000 images in the DIV2K dataset for the image super-resolution task). We therefore propose to utilize the well-known ImageNet as the baseline dataset for pre-training our IPT model and generate the entire dataset for several tasks (e.g., super-resolution and denoising) as follows.

The images in the ImageNet benchmark are of high diversity: it contains over 1 million natural images from 1,000 different categories with abundant texture and color information. We first remove the semantic labels and manually synthesize a variety of corrupted images from these unlabeled images with a variety of degradation models for different tasks. Note that synthesized datasets are commonly used in these image processing tasks, and we use the same degradation methods as suggested in [27, 1]. For example, super-resolution tasks often take bicubic degradation to generate low-resolution images, and denoising tasks add Gaussian noise with different noise levels to clean images to generate the noisy images. These synthesized images can significantly improve the performance of learned deep networks, including both CNN and transformer architectures, which will be shown in the experiment part. Basically, the corrupted images are synthesized as:

I_{corrupted} = f(I_{clean}),                                                   (3)

where f denotes the degradation transformation, which depends on the specific task: for the super-resolution task, f^{sr} is exactly the bicubic interpolation; for image denoising, f^{noise}(I) = I + η, where η is additive Gaussian noise; for deraining, f^{rain}(I) = I + r, in which r is a hand-crafted rain streak.
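The three degradation transforms of Eq. 3 can be generated on the fly; a minimal sketch is shown below. The rain-streak tensor is left as an input because the paper follows the hand-crafted procedure of [64], which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def degrade_sr(clean: torch.Tensor, scale: int) -> torch.Tensor:
    """f_sr: bicubic downsampling of a clean image batch (B x 3 x H x W)."""
    return F.interpolate(clean, scale_factor=1.0 / scale, mode="bicubic", align_corners=False)

def degrade_noise(clean: torch.Tensor, sigma: float) -> torch.Tensor:
    """f_noise: add Gaussian noise with standard deviation sigma (pixel range 0-255)."""
    return clean + torch.randn_like(clean) * sigma

def degrade_rain(clean: torch.Tensor, streaks: torch.Tensor) -> torch.Tensor:
    """f_rain: add a (hand-crafted) rain-streak layer r to the clean image."""
    return clean + streaks

clean = torch.rand(4, 3, 48, 48) * 255.0
lr = degrade_sr(clean, scale=2)           # 4 x 3 x 24 x 24
noisy = degrade_noise(clean, sigma=30.0)
print(lr.shape, noisy.shape)
```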
The loss function for learning our IPT in the supervised fashion can be formulated as:

L_{supervised} = \sum_{i=1}^{N_t} L_1(\mathrm{IPT}(I^i_{corrupted}), I_{clean}),          (4)

where L_1 denotes the conventional L1 loss for reconstructing the desired images and I^i_{corrupted} denotes the corrupted image for task i. Eq. 4 implies that the proposed framework is trained with multiple image processing tasks simultaneously. Specifically, for each batch, we randomly select one task from the N_t supervised tasks for training, and each task is processed using the corresponding head, tail and task embedding. After pre-training, the IPT model captures the intrinsic features and transformations for a large variety of image processing tasks and can thus be further fine-tuned to the desired task using the newly provided dataset. Moreover, the other heads and tails are dropped to save computation costs, and the parameters in the remaining head, tail and body are updated according to back-propagation.
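The supervised objective of Eq. 4 combined with the per-batch task sampling described above reduces to a simple training step: draw one of the N_t tasks, run the matching head/body/tail, and apply an L1 loss against the clean image. The IPTStub wrapper below is a hypothetical stand-in for the full model, and the corruption line merely stands in for f(I_clean).

```python
import random
import torch
import torch.nn as nn

class IPTStub(nn.Module):
    """Hypothetical wrapper exposing heads/body/tails selected by a task index."""
    def __init__(self, num_tasks: int = 6):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv2d(3, 3, 3, padding=1) for _ in range(num_tasks))
    def forward(self, x, task: int):
        return self.convs[task](x)   # placeholder for head -> transformer body -> tail

tasks = ["sr_x2", "sr_x3", "sr_x4", "noise30", "noise50", "derain"]
model, l1 = IPTStub(len(tasks)), nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.999))

for step in range(3):                          # each iteration uses one randomly chosen task
    task = random.randrange(len(tasks))
    clean = torch.rand(8, 3, 48, 48)           # clean ImageNet patches
    corrupted = clean + 0.1 * torch.randn_like(clean)   # stands in for f(I_clean) of Eq. 3
    loss = l1(model(corrupted, task), clean)   # L1(IPT(I_corrupted), I_clean), Eq. 4
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```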
However, due to the variety of degradation models, we cannot synthesize images for all image processing tasks. For example, there is a wide range of possible noise levels in practice. Therefore, the generalization ability of the resulting IPT should be further enhanced. Similar to pre-training of natural language processing models, the relationship between patches of images is also informative. A patch in the image scenario can be considered as a word in natural language processing. For example, patches cropped from the same feature map are more likely to appear together and should be embedded into similar positions. Therefore, we introduce contrastive learning [11, 29] for learning universal features so that the pre-trained IPT model can be utilized for unseen tasks. In practice, denote the output patch features generated by the IPT decoder for a given input x_j as f^j_{D_i} ∈ R^{P²×C}, i = {1, ..., N}, where x_j is selected from a batch of training images X = {x_1, x_2, ..., x_B}. We aim to minimize the distance between patch features from the same image while maximizing the distance between patches from different images. The loss function for contrastive learning is formulated as:

l(f^j_{D_{i_1}}, f^j_{D_{i_2}}) = -\log \frac{\exp(d(f^j_{D_{i_1}}, f^j_{D_{i_2}}))}{\sum_{k=1}^{B} \mathbb{1}_{k \neq j}\, \exp(d(f^j_{D_{i_1}}, f^k_{D_{i_2}}))},

L_{contrastive} = \frac{1}{BN^2} \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} \sum_{j=1}^{B} l(f^j_{D_{i_1}}, f^j_{D_{i_2}}),          (5)

where d(a, b) = a^T b / (\|a\| \|b\|) denotes the cosine similarity. Moreover, to make full use of both supervised and self-supervised information, we reformulate the loss function as:

L_{IPT} = λ · L_{contrastive} + L_{supervised}.                                 (6)

Wherein, we combine the λ-balanced contrastive loss with the supervised loss as the final objective function of IPT. Thus, the proposed transformer network trained using Eq. 6 can be effectively exploited on various existing image processing tasks.
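Eq. 5 is an InfoNCE-style loss over the decoder's patch features: patches of the same image at positions i1 and i2 form the positive pair, and patches of the other images in the batch are the negatives. The sketch below implements it literally (the double loop over patch positions is kept for clarity rather than speed); the feature sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def patch_contrastive_loss(feats: torch.Tensor) -> torch.Tensor:
    """feats: B x N x D decoder patch features. Implements Eq. 5: pull together patches
    of the same image, push apart patches from different images in the batch."""
    B, N, D = feats.shape
    f = F.normalize(feats, dim=-1)                  # dot products become cosine similarities
    total = 0.0
    for i1 in range(N):
        for i2 in range(N):
            sim = f[:, i1] @ f[:, i2].T             # B x B matrix of d(f^j_{i1}, f^k_{i2})
            pos = sim.diag()                        # same image (k = j)
            neg = sim.masked_fill(torch.eye(B, dtype=torch.bool), float("-inf"))
            denom = torch.logsumexp(neg, dim=1)     # log of the sum over k != j
            total = total + (-(pos - denom)).mean()
    return total / (N * N)

feats = torch.randn(4, 8, 32)                       # small example: B=4 images, N=8 patches
print(patch_contrastive_loss(feats))
```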
4. Experiments

In this section, we evaluate the performance of the proposed IPT on various image processing tasks including super-resolution and image denoising. We show that the pre-trained IPT model can achieve state-of-the-art performance on these tasks. Moreover, extensive ablation experiments show that transformer-based models perform better than convolutional neural networks when a large-scale dataset is used for solving the image processing problem.

Datasets. To obtain better pre-trained results for the IPT model, we use the well-known ImageNet dataset, which consists of over 1M color images of high diversity. The training images are cropped into 48 × 48 patches with 3 channels, i.e., there are over 10M patches for training the IPT model. We then generate the corrupted images with 6 types of degradation: 2×, 3× and 4× bicubic interpolation, Gaussian noise with levels 30 and 50, and added rain streaks, respectively. For the rain-streak generation, we follow the method described in [64]. During testing, we crop the images in the test set into 48 × 48 patches with a 10-pixel overlap. Note that the same testing strategy is also adopted for the CNN-based models for a fair comparison, and the resulting PSNR values of the CNN models are the same as those of their baselines.
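At test time, the 48 × 48 crops with a 10-pixel overlap can be recombined by averaging predictions where crops overlap; a sketch for resolution-preserving tasks (e.g., denoising) is below. The stride of 48 − 10 = 38 and the uniform averaging are our reading of the description, and the identity `denoiser` stands in for the fine-tuned model.

```python
import torch

def tiled_inference(model, image: torch.Tensor, patch: int = 48, overlap: int = 10) -> torch.Tensor:
    """Run `model` on overlapping patch-sized crops of a 1 x 3 x H x W image (H, W >= patch)
    and average the outputs where crops overlap (resolution-preserving tasks)."""
    _, _, H, W = image.shape
    step = patch - overlap
    ys = sorted(set(list(range(0, H - patch + 1, step)) + [H - patch]))
    xs = sorted(set(list(range(0, W - patch + 1, step)) + [W - patch]))
    out = torch.zeros_like(image)
    weight = torch.zeros_like(image)
    for y in ys:
        for x in xs:
            out[:, :, y:y + patch, x:x + patch] += model(image[:, :, y:y + patch, x:x + patch])
            weight[:, :, y:y + patch, x:x + patch] += 1.0
    return out / weight

denoiser = lambda crop: crop                  # identity stands in for the fine-tuned IPT model
img = torch.rand(1, 3, 128, 160)
print(tiled_inference(denoiser, img).shape)   # torch.Size([1, 3, 128, 160])
```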
Training & Fine-tuning. We use 32 NVIDIA Tesla V100 cards to train our IPT model using the conventional Adam optimizer with β1 = 0.9, β2 = 0.999 for 300 epochs on the modified ImageNet dataset. The initial learning rate is set to 5e−5 and decayed to 2e−5 at epoch 200, with a batch size of 256. Since the training set consists of different tasks, we cannot input all of them in a single batch due to the expensive memory cost. Therefore, we stack a batch of images from a randomly selected task in each iteration. After pre-training on the entire synthesized dataset, we fine-tune the IPT model on the desired task (e.g., ×3 single image super-resolution) for 30 epochs with a learning rate of 2e−5. Note that SRCNN [18] also found that ImageNet training can improve the performance of the super-resolution task, while we propose a model fitting general low-level vision tasks.

Table 1. Quantitative results on image super-resolution. Best and second best results are highlighted and underlined.
Method          Scale  Set5   Set14  B100   Urban100
VDSR [34]       ×2     37.53  33.05  31.90  30.77
EDSR [42]       ×2     38.11  33.92  32.32  32.93
RCAN [74]       ×2     38.27  34.12  32.41  33.34
RDN [76]        ×2     38.24  34.01  32.34  32.89
OISR-RK3 [31]   ×2     38.21  33.94  32.36  33.03
RNAN [75]       ×2     38.17  33.87  32.32  32.73
SAN [15]        ×2     38.31  34.07  32.42  33.10
HAN [45]        ×2     38.27  34.16  32.41  33.35
IGNN [79]       ×2     38.24  34.07  32.41  33.23
IPT (ours)      ×2     38.37  34.43  32.48  33.76
VDSR [34]       ×3     33.67  29.78  28.83  27.14
EDSR [42]       ×3     34.65  30.52  29.25  28.80
RCAN [74]       ×3     34.74  30.65  29.32  29.09
RDN [76]        ×3     34.71  30.57  29.26  28.80
OISR-RK3 [31]   ×3     34.72  30.57  29.29  28.95
RNAN [75]       ×3     34.66  30.52  29.26  28.75
SAN [15]        ×3     34.75  30.59  29.33  28.93
HAN [45]        ×3     34.75  30.67  29.32  29.10
IGNN [79]       ×3     34.72  30.66  29.31  29.03
IPT (ours)      ×3     34.81  30.85  29.38  29.49
VDSR [34]       ×4     31.35  28.02  27.29  25.18
EDSR [42]       ×4     32.46  28.80  27.71  26.64
RCAN [74]       ×4     32.63  28.87  27.77  26.82
SAN [15]        ×4     32.64  28.92  27.78  26.79
RDN [76]        ×4     32.47  28.81  27.72  26.61
OISR-RK3 [31]   ×4     32.53  28.86  27.75  26.79
RNAN [75]       ×4     32.49  28.83  27.72  26.61
HAN [45]        ×4     32.64  28.90  27.80  26.85
IGNN [79]       ×4     32.57  28.85  27.77  26.84
IPT (ours)      ×4     32.64  29.01  27.82  27.26
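The optimizer and learning-rate schedule described above can be written down directly; the sketch below mirrors the stated values (Adam with β1 = 0.9, β2 = 0.999, learning rate 5e−5 decayed to 2e−5 at epoch 200 of 300, then 30 fine-tuning epochs at 2e−5), with the epoch body omitted.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # stands in for the full IPT model

# Pre-training: Adam(beta1=0.9, beta2=0.999), lr 5e-5 decayed to 2e-5 at epoch 200 of 300.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200], gamma=2e-5 / 5e-5)

for epoch in range(300):
    # ... one pre-training epoch over the synthesized ImageNet patches ...
    optimizer.step()        # gradient computation omitted in this sketch
    scheduler.step()

# Fine-tuning on the target task: 30 epochs at a fixed learning rate of 2e-5.
finetune_optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, betas=(0.9, 0.999))
```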
[Visual comparison on Urban100 (×4) images img 004, img 012 and img 044: HR, Bicubic, VDSR [34], EDSR [42], RDN [76], OISR [31], SAN [15], RNAN [75], IGNN [79] and IPT (ours).]
Figure 3. Visual results with bicubic downsampling (×4) from Urban100. The proposed method recovers more details. Compared images are derived from [79].

4.1. Super-resolution

We compare our model with several state-of-the-art CNN-based SR methods. As shown in Table 1, our pre-trained IPT outperforms all the other methods and achieves the best performance at the ×2, ×3 and ×4 scales on all datasets. It is worth highlighting that our model achieves 33.76dB PSNR on the ×2 scale Urban100 dataset, surpassing the other methods by more than ∼0.4dB, while previous SOTA methods improve over one another by less than 0.2dB, which indicates the superiority of the proposed model obtained by utilizing large-scale pre-training.

We further present visualization results of our model at the ×4 scale on the Urban100 dataset. As shown in Figure 3, it is difficult to recover the original high-resolution images since much information is lost due to the high scaling factor. Previous methods generate blurry images, while the super-resolution images produced by our model recover the details from the low-resolution images well.

Table 2. Quantitative results on color image denoising. Best and second best results are highlighted and underlined.
                 BSD68          Urban100
Method           30     50      30     50
CBM3D [14]       29.73  27.38   30.36  27.94
TNRD [12]        27.64  25.96   27.40  25.52
DnCNN [69]       30.40  28.01   30.28  28.16
MemNet [51]      28.39  26.33   28.93  26.53
IRCNN [70]       30.22  27.86   30.28  27.69
FFDNet [71]      30.31  27.96   30.53  28.05
SADNet [8]       30.64  28.32   N/A    N/A
RDN [77]         30.67  28.31   31.69  29.29
IPT (ours)       32.32  29.88   33.75  31.12

4.2. Denoising

Since our pre-trained model can be well adapted to many tasks, we then evaluate the performance of our model on the image denoising task. The training and testing data are generated by adding Gaussian noise with σ = 30, 50 to the clean images.

To verify the effectiveness of the proposed method, we compare our results with various state-of-the-art models. Table 2 reports the color image denoising results on the BSD68 and Urban100 datasets.
[Visual comparison on BSD68 image 163085 with noise level σ = 50: GT, Noisy, CBM3D [14], TNRD [12], RDN [76], DnCNN [69], MemNet [51], IRCNN [70], FFDNet [71] and IPT (ours).]
Figure 4. Color image denoising results with noise level σ = 50. Compared images are derived from [72].

[Visual comparison on Rain100L (PSNR / SSIM): Input/Groundtruth 27.37/0.8154, DSC 29.34/0.8479, GMM 32.38/0.9306, JCAS 31.45/0.9151, Clear 31.59/0.9380, SPANet 35.67/0.9700, RESCAN 41.26/0.9887, PReNet 37.27/0.9793, JORDER_E 41.11/0.9894, SIRR 36.99/0.9692, RCDNet 42.15/0.9912, IPT (ours) 43.91/0.9922.]
Figure 5. Image deraining results on the Rain100L dataset. Compared images are derived from [57].

Our IPT achieves the best results among all denoising methods at the different Gaussian noise levels. Moreover, we surprisingly found that our model improves the state-of-the-art performance by ∼2dB on the Urban100 dataset, which demonstrates the effectiveness of pre-training and the superiority of our transformer-based model.

Figure 4 shows a visualization of the resulting images. As shown in the figure, the noisy images are hard to recognize and it is difficult to recover the clean images. Existing methods therefore fail to reconstruct enough details and generate abnormal pixels. In contrast, our pre-trained model recovers several details in the hair of this cat well, and its visual quality clearly beats all the previous models.

4.3. Deraining

For the image deraining task, we evaluate our model on the synthesized Rain100L dataset [64], which consists of 100 rainy images. Quantitative results can be viewed in Table 3. Compared with the state-of-the-art methods, we achieve the best performance (41.62dB) with a 1.62dB improvement.

Figure 5 shows the visualization results. Previous methods fail to reconstruct the original clean images since they lack image priors. In contrast, our IPT model produces a result nearly identical to the ground-truth and surpasses all the previous algorithms in visual quality. This result substantiates the generality of the proposed model.

4.4. Generalization Ability

Although we can generate various corrupted images, natural images are of high complexity and we cannot synthesize all possible images for pre-training the transformer model. However, a good pre-trained model should have the capacity to adapt well to other tasks, as those in the field of NLP do. To this end, we conduct several experiments to verify the generalization ability of our model. In practice, we test corrupted images that were not included in our synthesized ImageNet dataset, i.e., image denoising with noise levels 10 and 70, respectively. We use the heads and tails for the image denoising tasks from the pre-trained model.
Table 3. Quantitative results of image deraining on the Rain100L dataset. Best and second best results are highlighted and underlined.
Method  Input   DSC [25]  GMM [40]  JCAS [27]  Clear [24]  DDN [25]
PSNR    26.90   27.34     29.05     28.54      30.24       32.38
SSIM    0.8384  0.8494    0.8717    0.8524     0.9344      0.9258
Method  RESCAN [39]  PReNet [49]  JORDER_E [64]  SPANet [59]  SSIR [61]  RCDNet [57]  IPT (ours)
PSNR    38.52        37.45        38.59          35.33        32.37      40.00        41.62
SSIM    0.9812       0.9790       0.9834         0.9694       0.9258     0.9860       0.9880

The detailed results are shown in Table 4, where we compare the performance of the pre-trained IPT model with the state-of-the-art methods for image denoising. Obviously, the IPT model outperforms the other conventional methods, which demonstrates that the pre-trained model can capture more useful information and features from the large-scale dataset.

Table 4. Generalization ability of our IPT model on color image denoising with different noise levels. Best and second best results are highlighted and underlined.
                 BSD68          Urban100
Method           10     70      10     70
CBM3D [14]       35.91  26.00   36.00  26.31
TNRD [12]        33.36  23.83   33.60  22.63
DnCNN [69]       36.31  26.56   36.21  26.17
MemNet [51]      N/A    25.08   N/A    24.96
IRCNN [70]       36.06  N/A     35.81  N/A
FFDNet [71]      36.14  26.53   35.77  26.39
RDN [77]         36.47  26.85   36.69  27.63
IPT (ours)       38.30  28.21   39.07  28.80

4.5. Ablation Study

Impact of data percentage. To evaluate the effectiveness of the transformer architecture, we conduct experiments to analyse the improvement brought by pre-training to a CNN-based model and a transformer-based model. We use the well-known EDSR model as the CNN baseline and pre-train both it and the proposed IPT model on the synthesized ImageNet dataset, using 20%, 40%, 60%, 80% and 100% of the synthesized ImageNet dataset to analyse the impact of the amount of data on the resulting performance. Figure 6 shows the results of the different pre-trained models. When the models are not pre-trained or are pre-trained with a small amount (< 60%) of the entire dataset, the CNN models achieve better performance. In contrast, when using large-scale data, the transformer-based model overwhelms the CNN models, which demonstrates the effectiveness of pre-training for our IPT model.

[Line plot: PSNR (dB), roughly 38.0 to 38.3, vs. percentage of used images on ImageNet (1.1M images), comparing IPT and CNN.]
Figure 6. The performance of CNN and IPT models using different percentages of data.

Table 5. Impact of λ for contrastive learning.
λ      0      0.05   0.1    0.2    0.5
PSNR   38.27  38.32  38.37  38.33  38.26

Impact of contrastive learning. As discussed above, to improve the representation ability of our pre-trained model, we embed the contrastive learning loss (Eq. 6) into the training procedure. We then evaluate its effectiveness on the ×2 scale super-resolution task using the Set5 dataset. Table 5 shows the impact of the hyper-parameter λ for balancing the two terms in Eq. 6. When λ = 0, the IPT model is trained using only the supervised learning approach, and the resulting PSNR value is 38.27dB. When employing the contrastive loss for self-supervised learning, the model achieves a 38.37dB PSNR value (λ = 0.1), which is about 0.1dB higher than that of the model trained with λ = 0. These results further demonstrate the effectiveness of contrastive learning for obtaining a better pre-trained IPT model.

5. Conclusions and Discussions

This paper addresses image processing problems using a pre-trained transformer model (IPT). The IPT model is designed with multi-heads, multi-tails and a shared transformer body for serving different image processing tasks such as image super-resolution and denoising. To maximally excavate the performance of the transformer architecture on various tasks, we explore a synthesized ImageNet dataset, wherein each original image is degraded to a series of counterparts serving as paired training data. The IPT model is then trained using supervised and self-supervised approaches, which shows a strong ability for capturing intrinsic features for low-level image processing. Experimental results demonstrate that our IPT can outperform the state-of-the-art methods using only one pre-trained model after quick fine-tuning. In future work, we will extend our IPT model to more tasks such as deblurring, dehazing, etc.
References

[1] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 126–135, 2017.
[2] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 252–268, 2018.
[3] Saeed Anwar and Nick Barnes. Densely residual laplacian super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[4] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[5] Bolun Cai, Xiangmin Xu, Kui Jia, Chunmei Qing, and Dacheng Tao. Dehazenet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing, 25(11):5187–5198, 2016.
[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872, 2020.
[7] Gabriella Castellano, Leonardo Bonilha, LM Li, and Fernando Cendes. Texture analysis of medical images. Clinical radiology, 59(12):1061–1069, 2004.
[8] Meng Chang, Qi Li, Huajun Feng, and Zhihai Xu. Spatial-adaptive network for single image denoising. arXiv preprint arXiv:2001.10291, 2020.
[9] Liang Chen, Faming Fang, Shen Lei, Fang Li, and Guixu Zhang. Enhanced sparse model for blind deblurring. 2020.
[10] Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, volume 1, 2020.
[11] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[12] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1256–1272, 2016.
[13] Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Yan Shuicheng, Jiashi Feng, and Yannis Kalantidis. Graph-based global reasoning networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 433–442, 2019.
[14] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Color image denoising via sparse 3d collaborative filtering with grouping constraint in luminance-chrominance space. In 2007 IEEE International Conference on Image Processing, volume 1, pages I–313. IEEE, 2007.
[15] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11065–11074, 2019.
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[18] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199. Springer, 2014.
[19] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2015.
[20] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[21] Thomas Eboli, Jian Sun, and Jean Ponce. End-to-end interpretable learning of non-blind image deblurring. arXiv preprint arXiv:2007.01769, 2020.
[22] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3146–3154, 2019.
[23] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3146–3154, 2019.
[24] Xueyang Fu, Jiabin Huang, Xinghao Ding, Yinghao Liao, and John Paisley. Clearing the skies: A deep network architecture for single-image rain removal. IEEE Transactions on Image Processing, 26(6):2944–2956, 2017.
[25] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, and John Paisley. Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3855–3863, 2017.
[26] Xueyang Fu, Borong Liang, Yue Huang, Xinghao Ding, and John Paisley. Lightweight pyramid networks for image deraining. IEEE Transactions on Neural Networks and Learning Systems, 2019.
[27] Shuhang Gu, Deyu Meng, Wangmeng Zuo, and Lei Zhang. Joint convolutional analysis and synthesis sparse representation for single image layer separation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1708–1716, 2017.
[28] Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, and Lei Zhang. Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1712–1722, 2019.
[29] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[31] Xiangyu He, Zitao Mo, Peisong Wang, Yang Liu, Mingyuan Yang, and Jian Cheng. Ode-inspired network design for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1732–1741, 2019.
[32] Xiaowei Hu, Chi-Wing Fu, Lei Zhu, and Pheng-Ann Heng. Depth-attentional features for single-image rain removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8022–8031, 2019.
[33] Xixi Jia, Sanyang Liu, Xiangchu Feng, and Lei Zhang. Focnet: A fractional optimal control network for image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6054–6063, 2019.
[34] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
[35] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. arXiv preprint arXiv:1912.11370, 2019.
[36] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
[37] Stamatios Lefkimmiatis. Non-local color image denoising with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3587–3596, 2017.
[38] Boyi Li, Xiulian Peng, Zhangyang Wang, Jizheng Xu, and Dan Feng. An all-in-one network for dehazing and beyond. arXiv preprint arXiv:1707.06543, 2017.
[39] Xia Li, Jianlong Wu, Zhouchen Lin, Hong Liu, and Hongbin Zha. Recurrent squeeze-and-excitation context aggregation net for single image deraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 254–269, 2018.
[40] Yu Li, Robby T Tan, Xiaojie Guo, Jiangbo Lu, and Michael S Brown. Rain streak removal using layer priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2736–2744, 2016.
[41] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 136–144, 2017.
[42] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 136–144, 2017.
[43] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[44] Boyu Lu, Jun-Cheng Chen, and Rama Chellappa. Unsupervised domain-specific deblurring via disentangled representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10225–10234, 2019.
[45] Ben Niu, Weilei Wen, Wenqi Ren, Xiangde Zhang, Lianping Yang, Shuzhen Wang, Kaihao Zhang, Xiaochun Cao, and Haifeng Shen. Single image super-resolution via a holistic attention network. In European Conference on Computer Vision, pages 191–207. Springer, 2020.
[46] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018.
[47] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[48] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
[49] Dongwei Ren, Wangmeng Zuo, Qinghua Hu, Pengfei Zhu, and Deyu Meng. Progressive image deraining networks: A better and simpler baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3937–3946, 2019.
[50] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[51] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. Memnet: A persistent memory network for image restoration. In Proceedings of the IEEE International Conference on Computer Vision, pages 4539–4547, 2017.
[52] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.
[53] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8174–8182, 2018.
[54] Chunwei Tian, Yong Xu, Zuoyong Li, Wangmeng Zuo, Lunke Fei, and Hong Liu. Attention-guided cnn for image denoising. Neural Networks, 124:117–129, 2020.
[55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[56] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2017.
[57] Hong Wang, Qi Xie, Qian Zhao, and Deyu Meng. A model-driven deep neural network for single image rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3103–3112, 2020.
[58] Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, and Lidia S Chao. Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787, 2019.
[59] Tianyu Wang, Xin Yang, Ke Xu, Shaozhe Chen, Qiang Zhang, and Rynson WH Lau. Spatial attentive single-image deraining with a high quality real rain dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12270–12279, 2019.
[60] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[61] Wei Wei, Deyu Meng, Qian Zhao, Zongben Xu, and Ying Wu. Semi-supervised transfer learning for image rain removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3877–3886, 2019.
[62] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677, 2020.
[63] Wenhan Yang, Jiaying Liu, Shuai Yang, and Zongming Guo. Scale-free single image deraining via visibility-enhanced recurrent wavelet learning. IEEE Transactions on Image Processing, 28(6):2948–2961, 2019.
[64] Wenhan Yang, Robby T Tan, Jiashi Feng, Zongming Guo, Shuicheng Yan, and Jiaying Liu. Joint rain detection and removal from a single image with contextualized deep networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(6):1377–1393, 2019.
[65] Xitong Yang, Zheng Xu, and Jiebo Luo. Towards perceptual image dehazing by physics-based disentanglement and adversarial training. In AAAI, pages 7485–7492, 2018.
[66] Yuhui Yuan and Jingdong Wang. Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916, 2018.
[67] Yongnian Zeng, Wei Huang, Maoguo Liu, Honghui Zhang, and Bin Zou. Fusion of satellite images in urban area: Assessing the quality of resulting images. In 2010 18th International Conference on Geoinformatics, pages 1–4. IEEE, 2010.
[68] He Zhang and Vishal M Patel. Densely connected pyramid dehazing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3194–3203, 2018.
[69] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
[70] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep cnn denoiser prior for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3929–3938, 2017.
[71] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, 2018.
[72] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, 2018.
[73] Songyang Zhang, Xuming He, and Shipeng Yan. Latentgnn: Learning efficient non-local relations for visual recognition. In International Conference on Machine Learning, pages 7374–7383, 2019.
[74] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018.
[75] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration. arXiv preprint arXiv:1903.10082, 2019.
[76] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018.
[77] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[78] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10076–10085, 2020.
[79] Shangchen Zhou, Jiawei Zhang, Wangmeng Zuo, and Chen Change Loy. Cross-scale internal graph neural network for image super-resolution. Advances in Neural Information Processing Systems, 33, 2020.
[80] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.

A. Visualization of Embeddings
[Figure panels: (a) ×2 super-resolution, (b) ×3 super-resolution, (c) ×4 super-resolution, (d) deraining, (e) denoising with noise level 30, (f) denoising with noise level 50.]
Figure 7. Visualization of six different task embeddings.

We visualize the task embeddings in Figure 7. We find that for the ×2 super-resolution task, the similarities between the embedding at each position and its neighbours are higher than for ×3 super-resolution, while those of ×4 super-resolution are the smallest. This result indicates that each patch in ×2 super-resolution can focus on other patches at a farther distance than in ×3 and ×4, since the downsampling scale is smaller and the relationship between different patches is closer. The similarity of the task embedding for deraining in Figure 7 (d) shows that the patches pay more attention to the vertical direction than to the horizontal direction, which is reasonable as rain falls vertically. The similarity of the task embedding for denoising behaves like Gaussian noise, and Figure 7 (f) with the higher (50) noise level shows higher similarity between neighbours than Figure 7 (e) with noise level 30. These visualization results suggest that our task embeddings indeed learn information specific to the different tasks. We also test not using task embeddings, which results in a significant accuracy drop (varying from 0.1dB to 0.5dB across tasks).

Moreover, we visualize the learned position embeddings of IPT. Figure 8 shows the visualization results of the position embeddings. We find that patches in similar columns or rows have similar embeddings, which indicates that they learn useful information for discovering the position in image processing. We also test using fixed embeddings or no embeddings at all, whose performance is lower than that of using learnable position embeddings (by 0.2dB to 0.3dB depending on the task).
Figure 8. Visualization of the cosine similarity of position embeddings.

B. Architecture of IPT

In the main paper, we propose the image processing transformer (IPT). Here we show the detailed architecture of IPT, which consists of heads, a body and tails. Each head has one convolutional layer (with 3 × 3 kernel size, 3 input channels and 64 output channels) and two ResBlocks. Each ResBlock consists of two convolutional layers (with 5 × 5 kernel size, 64 input channels and 64 output channels) wrapped by a single shortcut. The body has 12 encoder layers and 12 decoder layers. The tail for denoising or deraining is a convolutional layer with 3 × 3 kernel size, 64 input channels and 3 output channels. For super-resolution, the tail consists of one pixelshuffle layer with upsampling scale 2 or 3 for ×2 and ×3 SR, and two pixelshuffle layers with upsampling scale 2 for ×4 SR.

The whole IPT has 114M parameters and 33G FLOPs, i.e., more parameters but fewer FLOPs compared with traditional CNN models (e.g., EDSR has 43M parameters and 99G FLOPs).
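Putting the numbers above together, the head and tail construction can be sketched as follows; the transformer body would be the 12-layer encoder and 12-layer decoder sketched in Section 3. The ReLU activations and the per-stage convolutions before each pixelshuffle are assumptions, since the text specifies only kernel sizes and channel counts.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 5x5 conv layers (64 -> 64) wrapped by a single shortcut, as described above."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=5, padding=2))
    def forward(self, x):
        return x + self.body(x)

def build_head() -> nn.Module:
    # one 3x3 conv (3 -> 64) followed by two ResBlocks
    return nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), ResBlock(), ResBlock())

def build_tail(task: str) -> nn.Module:
    if task in ("denoise", "derain"):
        return nn.Conv2d(64, 3, 3, padding=1)             # 3x3 conv, 64 -> 3
    scales = {"sr_x2": [2], "sr_x3": [3], "sr_x4": [2, 2]}[task]
    layers = []
    for s in scales:                                      # one pixelshuffle stage per factor
        layers += [nn.Conv2d(64, 64 * s * s, 3, padding=1), nn.PixelShuffle(s)]
    layers.append(nn.Conv2d(64, 3, 3, padding=1))
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 48, 48)
feat = build_head()(x)                       # 1 x 64 x 48 x 48
print(build_tail("sr_x4")(feat).shape)       # torch.Size([1, 3, 192, 192])
```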

C. Impact of Multi-task Training


We train IPT in a multi-task manner and then fine-tune it on 6 different tasks: ×2, ×3 and ×4 super-resolution, denoising with noise levels 30 and 50, and deraining. We find that this training strategy does not harm the performance on these tasks, which have been pre-trained on the large-scale dataset (ImageNet). In other words, the performance of multi-task training and single-task training remains almost the same. However, when transferring to other tasks (e.g., Section 4.4 in the main paper), the pre-trained model obtained with multi-task training is better than that of single-task training by about 0.3dB, which suggests that multi-task training learns a universal representation of image processing tasks.
