TransWeather: Transformer-Based Restoration of Images Degraded by Adverse Weather Conditions (Valanarasu et al., CVPR 2022)
have also been explored for weather removal tasks, achieving better performance than CNNs [38, 48, 74]. Most of these methods focus on one task at hand or fine-tune the model separately for each task. Although they achieve excellent performance, they are not generic solutions for all adverse weather removal problems, as the networks have to be trained separately for each task. This makes them difficult to adopt in real-time systems, since maintaining multiple models is computationally complex. The system would also have to decide and switch between a series of weather removal algorithms (Figure 1 (a)), making the pipeline more complicated.

Recently, Li et al. [23] proposed an All-in-One bad weather removal network, the first work to propose an algorithm that takes in an image degraded by any weather condition as input and predicts the clean image. The All-in-One network was tested across three datasets of rain, fog, and snow removal and achieved better or comparable performance than the previous methods, which were tuned individually on separate datasets. The All-in-One network is CNN-based and uses multiple encoders. In particular, it uses separate encoders for the different weather degradations at hand and uses neural architecture search to find the best network to address the problem (Figure 1 (b)). This network is still computationally complex as there are multiple encoders. To the best of our knowledge, no other methods apart from the All-in-One network [23] have been proposed for generic adverse weather removal in the literature. Although recent methods like MPRNet [67], Uformer [55], and Swin-IR [27] have been proposed as generic restoration networks validated on multiple datasets, they are still fine-tuned on the individual datasets and do not use a single model for all the weather removal tasks.

In this work, we propose a single encoder-single decoder transformer network, called TransWeather, to tackle all adverse weather removal problems at once. Instead of using multiple encoders, we introduce weather type queries in the transformer decoder to learn the task (Figure 1 (c)). Here, the multi-head self-attention mechanisms take in weather type queries as input and match them with keys and values taken from features extracted from the transformer encoder. These weather type embeddings are learned along with the network to understand and adjust to the weather degradation type present in the image. The decoded features and the hierarchical features obtained from the encoder are fused and projected to the image space using a convolutional block. Thus, TransWeather has just one encoder and one decoder to learn the weather type as well as produce the clean image.

Transformers are good at extracting rich global information when compared to CNNs [9]. However, we argue that when the patches are large, as in ViT [9], we fail to attend much to the information within the patch. Weather degradations like rain streaks, rain drops and snow are usually small in size, and so multiple artifacts can occur within a single patch. To this end, we propose a novel transformer encoder with intra-patch transformer (Intra-PT) blocks. Intra-PT works on sub-patches created from the original patches and excavates features and details of smaller patches. Intra-PT thus focuses on attention inside the main patches to remove weather degradations effectively. We use efficient self-attention mechanisms to calculate the attention between sub-patches to keep the computational complexity low. From our experiments, we find that introducing Intra-PT blocks enhances the performance of the transformer and helps it adapt better to weather removal tasks. We train our network on a similar configuration as All-in-One and obtain superior performance across multiple test datasets for rain removal, snow removal, fog removal and even a combination of these weather degradations. We also outperform the methods designed specifically for these individual tasks and fine-tuned on those datasets. We also show that TransWeather is fast during inference. Finally, we test TransWeather on real-world weather degraded images, achieving excellent performance compared to the previous methods. TransWeather can act as an efficient backbone for generic weather removal frameworks in the future.

The key contributions of this work are as follows:
• We propose TransWeather - an efficient solution for all adverse weather removal problems with just a single encoder and a single decoder using transformers. We propose using weather type queries to efficiently handle the All-in-One problem.
• We propose a novel transformer encoder using intra-patch transformer (Intra-PT) blocks to cater to fine detail feature extraction for low-level vision tasks like weather removal.
• We achieve state-of-the-art performance on multiple datasets. We also validate the effectiveness of the proposed method on real-world images.

2. Related Works

Adverse weather removal problems like deraining [16, 21, 25, 51, 61, 65, 76], dehazing [1, 2, 10, 22, 42, 69], desnowing [29, 44, 73] and rain drop removal [37, 39, 40, 66] have been extensively explored in the literature.

Rain Streak Removal: Yang et al. [61] used a recurrent network to decompose rain layers into different layers of various streak types to remove the rain. Zhang et al. [70] proposed using a conditional GAN for image deraining. Yasarla et al. [64] explored using Gaussian processes to perform transfer learning from synthetic rain data to real-world rain data. Quan et al. [39] used a complementary cascaded network to remove rain streaks and raindrops in a unified framework. A more detailed survey of various rain removal methods can be found in [62].
Fog Removal: Li et al. [18] proposed a CNN network considering both the atmospheric light and the transmission map to perform dehazing. Ren et al. [43] proposed pre-processing a hazy image to generate multiple inputs, thus introducing color distortions, to perform dehazing. Zhang and Patel [68] proposed a pyramid CNN network for image dehazing. Zhang et al. [72] proposed a hierarchical density-aware network for image dehazing.

Rain Drop Removal: You et al. [66] proposed using temporal information to perform video-based raindrop removal. Qian et al. [37] used an attention GAN to remove raindrops and also introduced a new dataset. Quan et al. [40] used a dual attention mechanism to remove the effects of raindrops.

Snow Removal: DesnowNet [29] was one of the first CNN-based methods proposed to remove snow from an image. Li et al. [20] proposed a stacked dense network for snow removal. Chen et al. [6] proposed JSTASR, in which a size and transparency aware method was proposed to remove snow. Recently, DDMSNet [73] proposed a deep dense multi-scale network using semantic and geometric priors for snow removal.

All-in-One Weather Removal: The All-in-One Network [23] was proposed to handle multiple weather degradations using a single network. All-in-One uses a generator with multiple task-specific encoders and a common decoder. It uses a discriminator to classify the degradation type and only backpropagates the loss to specific encoders. It also uses neural architecture search to optimize the feature extraction from the encoder.

Transformers in low-level vision: Since the introduction of the Vision Transformer (ViT) [9] for visual recognition, transformers have been widely adopted for various computer vision tasks [12, 31, 49, 60, 75]. Especially for low-level vision, the image processing transformer [4] shows how pretraining a transformer on large-scale datasets can help in obtaining better performance for low-level applications. Uformer [55] proposed a U-Net based transformer architecture for restoration problems. Swin-IR [27] adopted the Swin Transformer [30] for image restoration. Zhao et al. [74] proposed a local-global transformer specifically for image dehazing. A multi-branch network [48] for deraining was also proposed based on the Swin Transformer. In ETDNet [38], an efficient transformer block to extract features in a coarse-to-fine way for image deraining was proposed.

Unlike the above methods, we propose a transformer-based single-encoder single-decoder network to solve all adverse weather removal tasks using a single model instance. Our transformer encoder is also modified to cater to low-level tasks with the introduction of the intra-patch transformer block. Our transformer decoder is trained with weather type queries to learn the task and uses that information to restore the clean image.

3. Proposed Method - TransWeather

In the literature, different weather phenomena have been modelled differently with regards to the underlying physics involved. A rain drop [37] is modelled as

    I = (1 − M) ⊙ B + R,    (1)

where I is the degraded image, M is the mask, B is the background and R is the raindrop residual. Heavy rain with rain streaks and a fog effect [21] is modelled as

    I = T ⊙ (B + Σ_i^n R_i) + (1 − T) ⊙ A,    (2)

where T is the transmission map produced by the scattering effect, R_i is the i-th of n rain streak layers, and A is the atmospheric light in the scene. According to [29], snow is generally modeled as

    I = z ⊙ S + B ⊙ (1 − z),    (3)

where z is a mask indicating snow and S corresponds to snow flakes. The All-in-One method [23] generalizes the adverse weather removal problem as

    B = D(E_p(I_p)),    (4)

where E corresponds to the encoder and D corresponds to the decoder. p represents the weather type present in the image. Note that for each weather type a different encoder is used. In this work, we follow a similar formulation of all adverse weather removal as

    B = T(I_p),    (5)

where T corresponds to TransWeather, which consists of a weather agnostic encoder and decoder network, unlike the All-in-One Network. The weather type queries are learnt along with the parameters of T(), thus making the problem setup more generic. We motivate this setup because a problem as generic as weather removal cannot be addressed by merely seeking perfection on individual tasks. This formulation not only makes the process computationally efficient, but also helps in using complementary information between the tasks to further improve the performance. Furthermore, it is also grounded in how human vision works, as our visual cortex can perform multiple tasks without any difficulty. This view is widely agreed upon in neurobiology, as the visual cortex does not have different modules for different perception tasks [24, 32].
Figure 2. Overview of the proposed TransWeather network. A degraded image is forwarded to the transformer encoder to extract hierarchical features. The encoder has intra-patch transformer blocks to extract features from smaller sub-patches created from the main patch. The transformer decoder has learnable weather type queries to obtain the task feature. Then, the hierarchical features from the encoder as well as the task feature from the decoder are forwarded to a convolutional projection block to obtain the clean image.
3.1. Network Architecture

Given a degraded image I of size H × W × 3, we first divide it into patches. We then feed forward the patches to a transformer encoder containing transformer blocks at different stages. Across each stage, the resolution is reduced to make sure the transformer learns both coarse and fine information. We then use a transformer decoder block that uses the encoded features as keys and values while using learnable weather type query embeddings as queries. The extracted features are then passed through a convolutional projection block to get the clean image of dimensions H × W × 3. An overview of the network architecture of TransWeather can be found in Figure 2. In the following sections, we describe these components in detail.

3.1.1 Transformer Encoder

We generate a hierarchical feature representation of the input image by extracting multi-level features in the transformer encoder. The features are extracted at different stages in the encoder, thus facilitating extraction of both high-level and low-level features. Across each stage, we perform overlapped patch merging [59]. Using this, we combine overlapping feature patches to get features of the same size as that of non-overlapped patches before passing the features to the next stage.

Transformer Block: In each transformer block, we use multi-head self-attention layers and feed forward networks to calculate the self-attention features. The computation can be summarized as:

    T_i(I_i) = FFN(MSA(I_i) + I_i),    (6)

where T() represents the transformer block, FFN() represents the feed forward network block, MSA() represents multi-head self-attention, I is the input and i represents the stage in the encoder. Similar to the original self-attention network, the heads of queries (Q), keys (K) and values (V) have the same dimensions and are calculated as:

    Attn(Q, K, V) = softmax(QK^T / √d) V,    (7)

where d represents the dimensionality. Note that we have multiple attention heads in each transformer block and that number is a hyper-parameter which we vary across each stage in the transformer encoder. More details regarding the hyper-parameter settings can be found in the supplementary document. We reduce the complexity of the original self-attention from O(N²) to O(N²/R) by introducing a reduction ratio R [54]. We reshape the keys from a dimension of (N, C) to a dimension of (N/R, C·R). We then use a linear layer to bring the second dimension back to C from C·R. Hence, the keys get a dimension of N/R × C, thus reducing the complexity while calculating the self-attention.

The self-attention features are then passed to an FFN block. The FFN block used here has a slight variation from ViT, as we introduce depth-wise convolution into the MLP, inspired by [26, 58, 59]. Using depth-wise convolution here helps bring locality information and provides positional information for transformers, as shown in [59]. The computation in the FFN block can be summarized as follows:

    FFN_i(X_i) = MLP(GELU(DWC(MLP(X_i)))) + X_i,

where X refers to the self-attention features, DWC is depth-wise convolution [7], GELU is the Gaussian error linear unit [14], MLP is a multi-layer perceptron, and i indicates the stage.

Intra-Patch Transformer Block: The intra-patch transformer blocks are present in between each stage in the transformer encoder. These blocks take in the sub-patches created from the original patches as the input. These sub-patches are fixed at half the height and width of the original patch. Intra-PT utilizes a similar transformer block as explained above. We use a high R value to make the computation very efficient in the Intra-PT block. The Intra-PT block helps in extracting fine details helpful in removing smaller degradations, as we operate on smaller patches. Note that the Intra-PT block creates patches at the feature level, except at the first stage where it is done at the image level. The output self-attention features from the Intra-PT block are added to the self-attention features from the main block at the same stage. The feed forward process in our transformer encoder can be summarized as follows:

    Y_i = MT_i(X_i) + IntraPT_i(P(X_i)),    (8)

where X_i is the input to the transformer at each stage, Y_i is the output at each stage, MT() is the main transformer block, IntraPT() is the intra-patch transformer block, P() corresponds to the process of creating sub-patches from the input patches and i denotes the stage.
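The efficient self-attention of Eq. (7) with reduction ratio R and the depth-wise-convolution FFN can be sketched in PyTorch as below. This is a minimal illustration in the style of PVT/SegFormer-like implementations [54, 59], not the authors' released code: the use of a strided convolution to realize the key/value reduction, the head count and the hidden dimension are all assumptions.

```python
import torch
import torch.nn as nn


class EfficientSelfAttention(nn.Module):
    """Multi-head self-attention whose keys/values are spatially reduced,
    cutting the attention cost by the reduction factor (cf. Eq. 7)."""
    def __init__(self, dim, num_heads, sr_ratio):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # One common realization of the key/value reduction: a strided
            # conv shrinks the token grid; the linear layer in self.kv maps
            # the channels back to C.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)
        if self.sr_ratio > 1:
            x_ = x.transpose(1, 2).reshape(B, C, H, W)       # tokens -> map
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
            x_ = self.norm(x_)
        else:
            x_ = x
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)                      # (B, heads, N', C/heads)
        attn = (q @ k.transpose(-2, -1)) * self.scale         # scaled dot product
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class DWConvFFN(nn.Module):
    """MLP -> depth-wise conv -> GELU -> MLP, per the FFN equation above."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):
        y = self.fc1(x)
        B, N, C = y.shape
        y = y.transpose(1, 2).reshape(B, C, H, W)             # tokens -> map
        y = self.dwconv(y).flatten(2).transpose(1, 2)         # map -> tokens
        return self.fc2(self.act(y))


class TransformerBlock(nn.Module):
    """T_i(I_i) = FFN(MSA(I_i) + I_i) from Eq. (6); the '+ X_i' residual of
    the FFN equation is applied on the last line of forward()."""
    def __init__(self, dim, num_heads=2, sr_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = EfficientSelfAttention(dim, num_heads, sr_ratio)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = DWConvFFN(dim, 4 * dim)

    def forward(self, x, H, W):
        x = x + self.attn(self.norm1(x), H, W)                # MSA(I_i) + I_i
        return x + self.ffn(self.norm2(x), H, W)              # FFN(...) + X_i


# Shape check: a 32x32 feature map with 64 channels, flattened to 1024 tokens.
x = torch.randn(2, 32 * 32, 64)
print(TransformerBlock(64)(x, 32, 32).shape)                  # torch.Size([2, 1024, 64])
```

The decoder block of Section 3.1.2 can reuse the same attention module with sr_ratio = 1, supplying the learnable weather type embeddings as the queries and the last-stage encoder features as the keys and values.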
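Continuing the sketch above, here is one plausible reading of P() in Eq. (8), which splits each patch (here, each feature map) into 2x2 sub-patches of half the height and width. How the authors create sub-patches exactly is not spelled out in this section, so this unfold-based helper is an assumption for illustration.

```python
import torch

def make_subpatches(x: torch.Tensor) -> torch.Tensor:
    """(B, C, H, W) -> (4B, C, H/2, W/2): every half-size window becomes its
    own sample, so the Intra-PT block attends within the smaller patches."""
    B, C, H, W = x.shape
    # carve the map into a 2x2 grid of non-overlapping half-size windows
    windows = x.unfold(2, H // 2, H // 2).unfold(3, W // 2, W // 2)
    # (B, C, 2, 2, H/2, W/2) -> (B, 2, 2, C, H/2, W/2) -> (4B, C, H/2, W/2)
    return windows.permute(0, 2, 3, 1, 4, 5).reshape(B * 4, C, H // 2, W // 2)

x = torch.randn(2, 64, 32, 32)
print(make_subpatches(x).shape)  # torch.Size([8, 64, 16, 16])
```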
3.1.2 Transformer Decoder

In the original transformer decoder [50], an autoregressive decoder is used to predict the output sequence one element at a time. The detection transformer (DETR) [3] uses object queries to decode the box coordinates and class labels to produce the final predictions. Inspired by them, we define weather type queries to decode the task, predict a task feature vector and use it to restore the clean image. These weather type queries are learnable embeddings which are learnt along with the other parameters of our network. These queries attend to the feature outputs from the transformer encoder. The transformer decoder here operates at a single stage but has multiple blocks. We illustrate the transformer decoder block in Figure 3. These transformer blocks are similar to encoder-decoder transformer blocks [50]. Unlike the self-attention transformer block, where Q, K and V are taken from the same input, here Q is the weather type learnable embedding while K and V are the features taken from the last stage of the transformer encoder. The output decoded features represent the task feature vector and are fused with the features extracted across the transformer encoder at each stage. All of these features are forwarded to the convolutional tail to reconstruct the clean image.

Figure 3. Configuration of the transformer block in the decoder. The queries here are learnable embeddings representing the weather type while the keys and values are features taken from the last stage of the transformer encoder.

3.1.3 Convolutional Projection Block

The set of hierarchical transformer encoder features and the task features from the transformer decoder are passed through a set of 4 convolutional layers to output the clean image. We use an upsampling layer before every convolutional layer to get back to the original image size. We also have skip connections across each stage in the convolutional tail from the transformer encoder and use a tanh activation function in the final layer.

3.2. Loss

Our network is trained in an end-to-end fashion using a smooth L1-loss between the prediction (Î) and the ground truth (G), defined as follows:

    L_smoothL1 = 0.5·E²  if |E| < 1;  |E| − 0.5  otherwise,    (9)

where E = Î − G. We also add a perceptual loss that measures the discrepancy between the features of the prediction and the ground truth. We extract these features using a VGG16 network [47] pretrained on ImageNet. We extract features from the 3rd, 8th and 15th layers of VGG16 to calculate the perceptual loss. The perceptual loss is formulated as follows:

    L_perceptual = L_MSE(VGG_{3,8,15}(Î), VGG_{3,8,15}(G)).

The total loss can be summarized as follows:

    L_total = L_smoothL1 + λ·L_perceptual,    (10)

where λ is a weight that controls the contribution of L_perceptual relative to the smooth L1-loss in the overall loss.

4. Experiments

We conduct extensive experiments to show the effectiveness of our proposed method. In what follows, we explain the datasets, implementation details, experimental settings, results and comparison with state-of-the-art methods.

4.1. Datasets

We train our network on a combination of images degraded by a variety of adverse weather conditions, similar to the All-in-One Network [23]. We follow the same training set distribution used in All-in-One for a fair comparison. The training data consists of 9,000 images sampled from Snow100K [29], 1,069 images from Raindrop [37] and 9,000 images from Outdoor-Rain [21]. Snow100K has synthetic images degraded by snow, Raindrop has real raindrop images and Outdoor-Rain has synthetic images degraded by both fog and rain streaks. We term this combination of training data "All-Weather" for better representation.

We test our method on both synthetic and real-world datasets. We use the Test1 dataset [21, 23], the RainDrop test dataset [37] and the Snow100K-L test set [29] for testing our method. In addition, we also evaluate on real-world images degraded by rain streaks and rain drops.

4.2. Implementation Details

We implement our method using the PyTorch framework [33] and train it using an NVIDIA RTX 8000 GPU. We use an Adam optimizer [17] and a learning rate of 0.0002. We use a learning rate scheduler that anneals the learning rate by a factor of 2 after 100 and 150 epochs. The network is trained for a total of 200 epochs with a batch size of 32. Other hyper-parameters regarding the TransWeather architecture can be found in the supplementary document.
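A sketch of the training objective from Section 3.2 together with this optimization setup: smooth L1 (Eq. 9), the VGG16 perceptual term, Adam at a learning rate of 0.0002, and annealing at epochs 100 and 150. Interpreting "anneals the learning rate by a factor of 2" as gamma = 0.5 is an assumption, as are the weight lambda = 0.04, the omission of ImageNet mean/std normalization before the VGG features, and the stand-in convolution used in place of TransWeather.

```python
import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    """MSE between VGG16 features of prediction and ground truth, taken
    after layers 3, 8 and 15 as stated in Section 3.2."""
    def __init__(self, layers=(3, 8, 15)):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg, self.layers = vgg, set(layers)
        self.mse = nn.MSELoss()

    def forward(self, pred, gt):
        loss = 0.0
        for i, layer in enumerate(self.vgg):
            pred, gt = layer(pred), layer(gt)
            if i in self.layers:
                loss = loss + self.mse(pred, gt)
            if i >= max(self.layers):      # no need to run deeper layers
                break
        return loss

model = nn.Conv2d(3, 3, 3, padding=1)      # placeholder for TransWeather
smooth_l1 = nn.SmoothL1Loss()              # exactly Eq. (9) at beta = 1
perceptual = PerceptualLoss()
lam = 0.04                                 # assumed weight lambda in Eq. (10)

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100, 150], gamma=0.5)

x, gt = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
pred = model(x)
loss = smooth_l1(pred, gt) + lam * perceptual(pred, gt)   # Eq. (10)
loss.backward()
optimizer.step()
scheduler.step()                           # once per epoch in a real loop
```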
Type           Method                          Venue      PSNR (↑)  SSIM (↑)
Task Specific  DetailsNet + Dehaze (DHF) [11]  CVPR 2017  13.36     0.5830
               DetailsNet + Dehaze (DRF) [11]  CVPR 2017  15.68     0.6400
               RESCAN + Dehaze (DHF) [25]      ECCV 2018  14.72     0.5870
               RESCAN + Dehaze (DRF) [25]      ECCV 2018  15.91     0.6150
               pix2pix [15]                    CVPR 2017  19.09     0.7100
               HRGAN [21]                      CVPR 2019  21.56     0.8550
               Swin-IR [27]                    CVPR 2021  23.23     0.8685
               MPRNet [67]                     CVPR 2021  21.90     0.8456
Multi Task     All-in-One [23]                 CVPR 2020  24.71     0.8980
               TransWeather                    -          31.05     0.9509

Table 1. Quantitative comparison on the Test1 (rain+fog) dataset based on PSNR and SSIM. DHF represents De-Hazing First and DRF represents De-Raining First. Red and blue correspond to the first and second best results. ↑ means higher is better.

4.3. Comparison with state-of-the-art methods

First, we compare our method with state-of-the-art methods which are designed specifically for each task: rain drop removal, snow removal and rain+haze removal. For rain drop removal, we compare the performance with state-of-the-art methods like Attention GAN [37], Quan et al. [40], and the complementary cascaded network (CCN) [39]. For snow removal, we compare with DesnowNet [29], JSTASR [6] and the deep dense multi-scale network (DDMSNet) [73]. For rain+fog removal, we compare with HRGAN [21], DetailsNet [11], the recurrent squeeze-and-excitation context aggregation net (RESCAN) [25], and the multi-stage progressive restoration network (MPRNet) [67]. We also compare with a recent transformer network, Swin-IR [27], on all datasets. Note that all these methods are single-task networks which are fine-tuned for specific datasets.

We also compare the performance of our method with the All-in-One network [23], which is trained to perform all the above tasks with a single model instance. Our method TransWeather is also trained to perform all these tasks using a single model instance.
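The PSNR values reported in Tables 1-3 follow the standard definition; a minimal sketch for images in [0, 1] is given below (SSIM is usually taken from a library such as scikit-image's structural_similarity). This mirrors common evaluation practice and is not the authors' released evaluation script.

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

print(psnr(np.full((8, 8), 0.5), np.full((8, 8), 0.6)))  # ~20.0
```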
Type           Method            Venue      PSNR (↑)  SSIM (↑)
Task Specific  DetailsNet [11]   CVPR 2017  19.18     0.7495
               DesnowNet [29]    TIP 2018   27.17     0.8983
               JSTASR [6]        ECCV 2020  25.32     0.8076
               Swin-IR [27]      CVPR 2021  28.18     0.8800
               DDMSNet [73]      TIP 2021   28.85     0.8772
Multi Task     All-in-One [23]   CVPR 2020  28.33     0.8820
               TransWeather      -          33.78     0.9287

Table 2. Quantitative comparison on the Snow100K-L test dataset based on PSNR and SSIM. Red and blue correspond to the first and second best results. ↑ means higher is better.

Type           Method            Venue      PSNR (↑)  SSIM (↑)
Task Specific  pix2pix [15]      CVPR 2017  28.02     0.8547
               Attn. GAN [37]    CVPR 2018  30.55     0.9023
               Quan et al. [40]  ICCV 2019  31.44     0.9263
               Swin-IR [27]      CVPR 2021  30.82     0.9035
               CCN [39]          CVPR 2021  31.34     0.9500
Multi Task     All-in-One [23]   CVPR 2020  31.12     0.9268
               TransWeather      -          34.55     0.9502

Table 3. Quantitative comparison on the RainDrop test dataset based on PSNR and SSIM. Red and blue correspond to the first and second best results. ↑ means higher is better.

4.3.1 Referenced Quality Metrics

We use PSNR and SSIM to evaluate the performance of different models. We tabulate the quantitative results in terms of PSNR and SSIM in Tables 1, 2, and 3 while evaluating on the Test1 (fog+rain removal), Snow100K-L (snow removal) and RainDrop (rain drop removal) test datasets, respectively. As Test1 has both fog and rain, we sequentially apply deraining and dehazing methods for a fair comparison on this dataset. For example, while using DetailsNet and RESCAN for deraining, we apply the multi-scale boosted dehazing network (MSBDN) [8] for dehazing. Note that from our experiments we found MSBDN to be the best performing network for dehazing. We compare the performance while applying deraining first, then dehazing, and also vice-versa. We train Swin-IR and MPRNet directly on "Outdoor-Rain" (the training split of Test1) and test them on Test1 for a fair comparison. Similarly, Swin-IR was trained on the Snow100K and RainDrop datasets and tested on the Snow100K-L and RainDrop test sets, respectively. It can be noted that some recent methods like CCN and DDMSNet, when fine-tuned on the individual datasets, outperform All-in-One. TransWeather outperforms All-in-One as well as all the task-specific methods by a significant margin, as we cater to low-level weather details as well as use weather queries to efficiently handle the All-in-One problem.

4.3.2 Visual Quality Comparison

Synthetic Images: We illustrate the predictions on synthetic test datasets like Test1 and Snow100K-L in Figures 4 and 5. It can be seen that TransWeather achieves visually pleasing results compared to the previous methods. It works very well in removing both fog and rain streaks, as can be seen in Figure 4, while other methods including All-in-One fail to remove at least one of the degradations. It can be seen from Figure 5 that our method removes even the snow particles which are very small in structure, while All-in-One has a hard time removing them.

Real-World Images: We illustrate the predictions on real test datasets like RainDrop and on real-world images in Figures 6 and 7. It can be seen in both figures that TransWeather removes even the finest rain streaks or drops when compared to the previous methods.
Figure 4. Sample qualitative results on the Test1 dataset (left to right: Input, RESCAN, MPRNet, All-in-One, TransWeather, Ground Truth). Red box corresponds to the zoomed-in patch for better comparison.

Figure 5. Sample qualitative results on the Snow100k-L dataset (left to right: Input, All-in-One, TransWeather, Ground Truth). Red box corresponds to the zoomed-in patches.

5. Discussions

Ablation Study: We conduct an ablation study to understand the contributions of the individual components proposed in the TransWeather architecture. We start with a base transformer encoder architecture and a convolutional tail; we call this configuration Transformer Base. We then convert the transformer encoder into a hierarchical transformer encoder (HE) to extract both high-level and low-level features, where we perform patch merging between each stage in the transformer encoder. We then add the intra-patch transformer block (Intra-PT) in the encoder. Finally, we add learnable weather type queries and a transformer decoder block to learn the task embeddings; this configuration corresponds to the TransWeather architecture. All of these models are trained on All-Weather and tested on the RainDrop test dataset. The results of the ablation study can be found in Table 4. It can be observed that each individual contribution of this work helps in improving the performance.

Method                              PSNR (↑)  SSIM (↑)
Transformer Base                    30.12     0.8512
+ HE                                31.62     0.8671
+ HE + Intra-PT                     32.37     0.9463
+ HE + Intra-PT + Weather Queries   34.55     0.9502

Table 4. Ablation study on the RainDrop test dataset. HE denotes converting to a hierarchical transformer encoder and Intra-PT represents intra-patch transformer blocks.

What do the weather queries learn? The weather queries are embeddings which learn what type of degradation is present in the image. These queries help in predicting the corresponding task vector, which is helpful to inject the task information to get a better prediction. To show this, we visualize the attention maps of eight random queries (out of 512) for three images corresponding to different weather degradations in Figure 8. It is interesting to observe that queries Q1, Q3, and Q6 activate highly for the foggy image. They attend throughout the image to all the places afflicted by the fog. Queries Q2, Q4 and Q8 are observed to activate highly for rainy images, and the attention maps are sparse, corresponding to the rain details. Similarly, queries Q5 and Q7 activate for snow images more when compared to images with rain and fog. This shows that different queries activate for different weather degradations, helping TransWeather learn the underlying weather condition and give better predictions. It can also be noted that when an image is degraded by multiple weather conditions, multiple task type queries activate to encode the specific tasks. This can be observed from the middle row of Figure 8, where queries that attend to both fog and rain activate, as the image is degraded by both of these conditions.

Inference Time: In Figure 1 (bottom row), we compare the inference speed in terms of seconds. The time reported corresponds to the time taken by each model to feed forward an image of dimensions 256 × 256 during the inference stage. We note that our method is faster (with just 0.14 seconds per image) during inference when compared to the previous weather removal methods. TransWeather has 31 M parameters, which is less than the All-in-One Network's 44 M parameters.

Differences from All-in-One: As the All-in-One network [23] is the first method to look into using a single model instance for all weather removal problems, we present clear differences of our method from All-in-One. First, All-in-One is a CNN-based method, while TransWeather uses a transformer backbone built specifically for low-level vision tasks with an extra focus on operating on smaller patches. All-in-One uses multiple encoders, while TransWeather utilizes a single encoder. All-in-One uses adversarial training and neural architecture search, while TransWeather just uses a combination of smooth L1 and perceptual losses, making the training simpler.
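Per-image inference times of the kind reported above can be measured with a standard GPU timing loop. This is a generic sketch (warm-up iterations and synchronization before reading the clock), not the authors' benchmarking code; the convolution is a stand-in for any restoration model.

```python
import time
import torch

@torch.no_grad()
def time_model(model, size=256, iters=50, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(1, 3, size, size, device=device)
    for _ in range(5):                       # warm-up to amortize startup costs
        model(x)
    torch.cuda.synchronize()                 # drain queued GPU work first
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()                 # wait for all iterations to finish
    return (time.perf_counter() - start) / iters

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # stand-in for a restoration model
if torch.cuda.is_available():
    print(f"{time_model(model):.4f} s per 256x256 image")
```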
Figure 6. Sample qualitative results on the RainDrop dataset (left to right: Input, Attention GAN, Quan et al., All-in-One, TransWeather, Ground Truth). Red box corresponds to the zoomed-in patches for better comparison.

Figure 8 (row labels, top to bottom): Rain, Fog+Rain, Snow.
References

[1] Dana Berman, Shai Avidan, et al. Non-local image dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1674–1682, 2016.
[2] Bolun Cai, Xiangmin Xu, Kui Jia, Chunmei Qing, and Dacheng Tao. DehazeNet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing, 25(11):5187–5198, 2016.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[4] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12299–12310, 2021.
[5] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
[6] Wei-Ting Chen, Hao-Yu Fang, Jian-Jiun Ding, Cheng-Che Tsai, and Sy-Yen Kuo. JSTASR: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal. In European Conference on Computer Vision, pages 754–770. Springer, 2020.
[7] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
[8] Hang Dong, Jinshan Pan, Lei Xiang, Zhe Hu, Xinyi Zhang, Fei Wang, and Ming-Hsuan Yang. Multi-scale boosted dehazing network with dense feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2157–2167, 2020.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[10] Raanan Fattal. Dehazing using color-lines. ACM Transactions on Graphics (TOG), 34(1):1–14, 2014.
[11] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, and John Paisley. Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3855–3863, 2017.
[12] Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi-Min Hu. PCT: Point cloud transformer. Computational Visual Media, 7(2):187–199, 2021.
[13] Kaiming He, Jian Sun, and Xiaoou Tang. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353, 2010.
[14] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[15] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[16] Li-Wei Kang, Chia-Wen Lin, and Yu-Hsiang Fu. Automatic single-image-based rain streaks removal via image decomposition. IEEE Transactions on Image Processing, 21(4):1742–1755, 2011.
[17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] Boyi Li, Xiulian Peng, Zhangyang Wang, Jizheng Xu, and Dan Feng. AOD-Net: All-in-one dehazing network. In Proceedings of the IEEE International Conference on Computer Vision, pages 4770–4778, 2017.
[19] Pengyue Li, Jiandong Tian, Yandong Tang, Guolin Wang, and Chengdong Wu. Deep retinex network for single image dehazing. IEEE Transactions on Image Processing, 30:1100–1115, 2020.
[20] Pengyue Li, Mengshen Yun, Jiandong Tian, Yandong Tang, Guolin Wang, and Chengdong Wu. Stacked dense networks for single-image snow removal. Neurocomputing, 367:152–163, 2019.
[21] Ruoteng Li, Loong-Fah Cheong, and Robby T Tan. Heavy rain image restoration: Integrating physics model and conditional adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1633–1642, 2019.
[22] Runde Li, Jinshan Pan, Zechao Li, and Jinhui Tang. Single image dehazing via conditional generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8202–8211, 2018.
[23] Ruoteng Li, Robby T Tan, and Loong-Fah Cheong. All in one bad weather removal using architectural search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3175–3185, 2020.
[24] Wu Li, Valentin Piëch, and Charles D Gilbert. Perceptual learning and top-down influences in primary visual cortex. Nature Neuroscience, 7(6):651–657, 2004.
[25] Xia Li, Jianlong Wu, Zhouchen Lin, Hong Liu, and Hongbin Zha. Recurrent squeeze-and-excitation context aggregation net for single image deraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 254–269, 2018.
[26] Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. LocalViT: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.
[27] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. SwinIR: Image restoration using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021.
[28] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 641–656, 2018.
[29] Yun-Fu Liu, Da-Wei Jaw, Shih-Chia Huang, and Jenq-Neng Hwang. DesnowNet: Context-aware deep network for snow removal. IEEE Transactions on Image Processing, 27(6):3064–3073, 2018.
[30] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[31] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video Swin Transformer. arXiv preprint arXiv:2106.13230, 2021.
[32] Justin NJ McManus, Wu Li, and Charles D Gilbert. Adaptive shape processing in primary visual cortex. Proceedings of the National Academy of Sciences, 108(24):9739–9746, 2011.
[33] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[34] Asanka G Perera, Yee Wei Law, and Javaan Chahl. UAV-GESTURE: A dataset for UAV control and gesture recognition. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[35] Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multi-modal fusion transformer for end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7077–7087, 2021.
[36] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum PointNets for 3D object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018.
[37] Rui Qian, Robby T Tan, Wenhan Yang, Jiajun Su, and Jiaying Liu. Attentive generative adversarial network for raindrop removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2482–2491, 2018.
[38] Qin Qin, Jingke Yan, Qin Wang, Xin Wang, Minyao Li, and Yuqing Wang. ETDNet: An efficient transformer deraining model. IEEE Access, 9:119881–119893, 2021.
[39] Ruijie Quan, Xin Yu, Yuanzhi Liang, and Yi Yang. Removing raindrops and rain streaks in one go. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9147–9156, 2021.
[40] Yuhui Quan, Shijie Deng, Yixin Chen, and Hui Ji. Deep learning for seeing through window with raindrops. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2463–2471, 2019.
[41] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28:91–99, 2015.
[42] Wenqi Ren, Si Liu, Hua Zhang, Jinshan Pan, Xiaochun Cao, and Ming-Hsuan Yang. Single image dehazing via multi-scale convolutional neural networks. In European Conference on Computer Vision, pages 154–169. Springer, 2016.
[43] Wenqi Ren, Lin Ma, Jiawei Zhang, Jinshan Pan, Xiaochun Cao, Wei Liu, and Ming-Hsuan Yang. Gated fusion network for single image dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3253–3261, 2018.
[44] Weihong Ren, Jiandong Tian, Zhi Han, Antoni Chan, and Yandong Tang. Video desnowing and deraining based on matrix decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4210–4219, 2017.
[45] Stefan Roth and Michael J Black. Fields of experts: A framework for learning image priors. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 860–867. IEEE, 2005.
[46] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
[47] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[48] Fuxiang Tan, YuTing Kong, Yingying Fan, Feng Liu, Daxin Zhou, Long Chen, Liang Gao, Yurong Qian, et al. SDNet: Multi-branch for single image deraining using Swin. arXiv preprint arXiv:2105.15077, 2021.
[49] Jeya Maria Jose Valanarasu, Poojan Oza, Ilker Hacihaliloglu, and Vishal M. Patel. Medical transformer: Gated axial-attention for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2021, pages 36–46, Cham, 2021. Springer International Publishing.
[50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[51] Hong Wang, Qi Xie, Qian Zhao, and Deyu Meng. A model-driven deep neural network for single image rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3103–3112, 2020.
[52] Lijun Wang, Jianming Zhang, Oliver Wang, Zhe Lin, and Huchuan Lu. SDC-Depth: Semantic divide-and-conquer network for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 541–550, 2020.
[53] Tianyu Wang, Xin Yang, Ke Xu, Shaozhe Chen, Qiang Zhang, and Rynson WH Lau. Spatial attentive single-image deraining with a high quality real rain dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12270–12279, 2019.
[54] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
[55] Zhendong Wang, Xiaodong Cun, Jianmin Bao, and Jianzhuang Liu. Uformer: A general U-shaped transformer for image restoration. arXiv preprint arXiv:2106.03106, 2021.
[56] Wei Wei, Deyu Meng, Qian Zhao, Zongben Xu, and Ying Wu. Semi-supervised transfer learning for image rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3877–3886, 2019.
[57] Haiyan Wu, Yanyun Qu, Shaohui Lin, Jian Zhou, Ruizhi Qiao, Zhizhong Zhang, Yuan Xie, and Lizhuang Ma. Contrastive learning for compact single image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10551–10560, 2021.
[58] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. CvT: Introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808, 2021.
[59] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021.
[60] Sen Yang, Zhibin Quan, Mu Nie, and Wankou Yang. TransPose: Keypoint localization via transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11802–11812, 2021.
[61] Wenhan Yang, Robby T Tan, Jiashi Feng, Zongming Guo, Shuicheng Yan, and Jiaying Liu. Joint rain detection and removal from a single image with contextualized deep networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(6):1377–1393, 2019.
[62] Wenhan Yang, Robby T. Tan, Shiqi Wang, Yuming Fang, and Jiaying Liu. Single image deraining: From model-based to data-driven and beyond, 2019.
[63] Rajeev Yasarla and Vishal M Patel. Uncertainty guided multi-scale residual learning using a cycle spinning CNN for single image de-raining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8405–8414, 2019.
[64] Rajeev Yasarla, Vishwanath A Sindagi, and Vishal M Patel. Syn2Real transfer learning for image deraining using Gaussian processes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2726–2736, 2020.
[65] Rajeev Yasarla, Jeya Maria Jose Valanarasu, and Vishal M Patel. Exploring overcomplete representations for single image deraining using CNNs. IEEE Journal of Selected Topics in Signal Processing, 15(2):229–239, 2020.
[66] Shaodi You, Robby T Tan, Rei Kawakami, Yasuhiro Mukaigawa, and Katsushi Ikeuchi. Adherent raindrop modeling, detection and removal in video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1721–1733, 2015.
[67] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14821–14831, 2021.
[68] He Zhang and Vishal M Patel. Densely connected pyramid dehazing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3194–3203, 2018.
[69] He Zhang and Vishal M Patel. Density-aware single image de-raining using a multi-stream dense network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 695–704, 2018.
[70] He Zhang, Vishwanath Sindagi, and Vishal M Patel. Image de-raining using a conditional generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology, 30(11):3943–3956, 2019.
[71] He Zhang, Vishwanath Sindagi, and Vishal M Patel. Joint transmission map estimation and dehazing using deep networks. IEEE Transactions on Circuits and Systems for Video Technology, 30(7):1975–1986, 2019.
[72] Jingang Zhang, Wenqi Ren, Shengdong Zhang, He Zhang, Yunfeng Nie, Zhe Xue, and Xiaochun Cao. Hierarchical density-aware dehazing network. IEEE Transactions on Cybernetics, 2021.
[73] Kaihao Zhang, Rongqing Li, Yanjiang Yu, Wenhan Luo, and Changsheng Li. Deep dense multi-scale network for snow removal using semantic and depth priors. IEEE Transactions on Image Processing, 30:7419–7431, 2021.
[74] Dong Zhao, Jia Li, Hongyu Li, and Long Xu. Hybrid local-global transformer for image dehazing. arXiv preprint arXiv:2109.07100, 2021.
[75] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6881–6890, 2021.
[76] Lei Zhu, Chi-Wing Fu, Dani Lischinski, and Pheng-Ann Heng. Joint bi-layer optimization for single-image rain streak removal. In Proceedings of the IEEE International Conference on Computer Vision, pages 2526–2534, 2017.