
Single Image Reflection Removal Exploiting Misaligned Training Data and Network Enhancements

Kaixuan Wei¹   Jiaolong Yang²   Ying Fu¹*   David Wipf²   Hua Huang¹
¹Beijing Institute of Technology   ²Microsoft Research
*Corresponding author: [email protected]

Abstract

Removing undesirable reflections from a single image captured through a glass window is of practical importance to visual computing systems. Although state-of-the-art methods can obtain decent results in certain situations, performance declines significantly when tackling more general real-world cases. These failures stem from the intrinsic difficulty of single image reflection removal – the fundamental ill-posedness of the problem, and the insufficiency of densely-labeled training data needed for resolving this ambiguity within learning-based neural network pipelines. In this paper, we address these issues by exploiting targeted network enhancements and the novel use of misaligned data. For the former, we augment a baseline network architecture by embedding context encoding modules that are capable of leveraging high-level contextual clues to reduce indeterminacy within areas containing strong reflections. For the latter, we introduce an alignment-invariant loss function that facilitates exploiting misaligned real-world training data that is much easier to collect. Experimental results collectively show that our method outperforms the state-of-the-art with aligned data, and that significant improvements are possible when using additional misaligned data.

1. Introduction

Reflection is a frequently-encountered source of image corruption that can arise when shooting through a glass surface. Such corruptions can be addressed via the process of single image reflection removal (SIRR), a challenging problem that has attracted considerable attention from the computer vision community [22, 25, 39, 2, 5, 47, 44, 38]. Traditional optimization-based methods often leverage manual intervention or strong prior assumptions to render the problem more tractable [22, 25]. Recently, alternative learning-based approaches rely on deep Convolutional Neural Networks (CNNs) in lieu of the costly optimization and hand-crafted priors [5, 47, 44, 38]. But promising results notwithstanding, SIRR remains a largely unsolved problem across disparate imaging conditions and varying scene content. For CNN-based reflection removal, our focus herein, the challenge originates from at least two sources: (i) the extraction of a background image layer devoid of reflection artifacts is fundamentally ill-posed, and (ii) training data from real-world scenes are exceedingly scarce because of the difficulty in obtaining ground-truth labels.

Mathematically speaking, it is typically assumed that a captured image I is formed as a linear combination of a background or transmitted layer T and a reflection layer R, i.e., I = T + R. Obviously, when given access only to I, there exists an infinite number of feasible decompositions. Further compounding the problem is the fact that both T and R involve content from real scenes that may have overlapping appearance distributions. This can make them difficult to distinguish even for human observers in some cases, and simple priors that might mitigate this ambiguity are not available except under specialized conditions.

On the other hand, although CNNs can perform a wide variety of visual tasks, at times exceeding human capabilities, they generally require a large volume of labeled training data. Unfortunately, real reflection images accompanied with densely-labeled, ground-truth transmitted layer intensities are scarce. Consequently, previous learning-based approaches have resorted to training with synthesized images [5, 38, 47] and/or small real-world data captured from specialized devices [47]. However, existing image synthesis procedures are heuristic and the domain gap may jeopardize accuracy on real images. On the other hand, collecting sufficient additional real data with precise ground-truth labels is tremendously labor-intensive.

This paper is devoted to addressing both of the aforementioned challenges. First, to better tackle the intrinsic ill-posedness and diminish ambiguity, we propose to leverage a network architecture that is sensitive to contextual information, which has proven useful for other vision tasks such as semantic segmentation [11, 48, 46, 13]. Note that at a high level, our objective is to efficiently convert prior information mined from labeled training data into network structures capable of resolving this ambiguity.
Within a traditional CNN model, especially in the early layers where the effective receptive field is small, the extracted features across all channels are inherently local. However, broader non-local context is necessary to differentiate those features that are descriptive of the desired transmitted image from those that can be discarded as reflection-based. For example, in image neighborhoods containing a particularly strong reflection component, accurate separation by any possible method (even one trained with arbitrarily rich training data) will likely require contextual information from regions without reflection. To address this issue, we utilize two complementary forms of context, namely, channel-wise context and multi-scale spatial context. Regarding the former, we apply a channel attention mechanism to the feature maps from convolutional layers such that different features are weighted differently according to global statistics of the activations. For the latter, we aggregate information across a pyramid of feature map scales within each channel to reach a global contextual consistency in the spatial domain. Our experiments demonstrate that significant improvement can be obtained by these enhancements, leading to state-of-the-art performance on two real-image datasets.

Figure 1: Comparison of the reflection image data collection methods in [46] and this paper.

Secondly, orthogonal to architectural considerations, we seek to expand the sources of viable training data by facilitating the use of misaligned training pairs, which are considerably easier to collect. Misalignment between an input image I and a ground-truth reflection-free version T can be caused by camera and/or object movements during the acquisition process. In the previous works [37, 46], data pairs (I, T) were obtained by taking an initial photo through a glass plane, followed by capturing a second one after the glass has been removed. This process requires that the camera, scene, and even lighting conditions remain static. Adhering to these requirements across a broad acquisition campaign can significantly reduce both the quantity and diversity of the collected data. Additionally, post-processing may also be necessary to accurately align I and T to compensate for spatial shifts caused by the refractive effect [37]. In contrast, capturing unaligned data is considerably less burdensome, as shown in Fig. 1. For example, there is no need for a tripod, table, or other special hardware; the camera can be hand-held and the pose can be freely adjusted; dynamic scenes in the presence of vehicles, humans, etc. can be incorporated; and finally no post-processing of any type is needed.

To handle such misaligned training data, we require a loss function that is, to the extent possible, invariant to the alignment, i.e., the measured image content discrepancy between the network prediction and its unaligned reference should be similar to what would have been observed if the reference were actually aligned. In the context of image style transfer [17] and others, certain perceptual loss functions have been shown to be relatively invariant to various transformations. Our study shows that using only the highest-level feature from a deep network (VGG-19 in our case) leads to satisfactory results for our reflection removal task. In both simulation tests and experiments using a newly collected dataset, we demonstrate for the first time that training/fine-tuning a CNN with unaligned data improves the reflection removal results by a large margin.

2. Related Work

This paper is concerned with reflection removal from a single image. Previous methods utilizing multiple input images, e.g., flash/non-flash pairs [1], different polarizations [20], or multi-view and video sequences [6, 35, 30, 7, 24, 34, 9, 43, 45], will not be considered here.

Traditional methods. Reflection removal from a single image is a massively ill-posed problem. Additional priors are needed to solve the otherwise prohibitively-difficult problem in traditional optimization-based methods [22, 25, 39, 2, 36]. In [22], user annotations are used to guide layer separation jointly with a gradient sparsity prior [23]. [25] introduces a relative smoothness prior where the reflections are assumed to be blurry, thus their large gradients are penalized. [39] explores a variant of the smoothness prior where a multi-scale Depth-of-Field (DoF) confidence map is utilized to perform edge classification. [31] exploits ghosting cues for layer separation. [2] proposes a simple optimization formulation with an l0 gradient penalty on the transmitted layer, inspired by image smoothing algorithms [42]. Although decent results can be obtained by these methods where their assumptions hold, the vastly-different imaging conditions and complex scene content of the real world render their generalization problematic.

Deep learning based methods. Recently, there is an emerging interest in applying deep convolutional neural networks to single image reflection removal such that handcrafted priors can be replaced by data-driven learning [5, 38, 47, 44]. The first CNN-based method is due to [5], where a network structure is proposed to first predict the background layer in the edge domain, followed by reconstructing it in the color domain.
Later, [38] proposes to predict the edge and image intensity concurrently by two cooperative sub-networks. The recent work of [44] presents a cascade network structure which predicts the background layer and reflection layer in an interleaved fashion. The earlier CNN-based methods typically use a raw image intensity discrepancy such as the mean squared error (MSE) to train the networks. Several recent works [47, 16, 3] adopt the perceptual loss [17], which uses the multi-stage features of a deep network pre-trained on ImageNet [29]. Adversarial losses are investigated in [47, 21] to improve the realism of the predicted background layers.

3. Approach

Given an input image I contaminated with reflections, our goal is to estimate a reflection-free transmitted image T̂. To achieve this, we train a feed-forward CNN GθG, parameterized by θG, to minimize a reflection removal loss function l. Given training image pairs {(In, Tn)}, n = 1, ..., N, this involves solving:

    θ̂G = arg min_{θG} (1/N) Σ_{n=1}^{N} l(GθG(In), Tn).    (1)

We will first introduce the details of the network architecture GθG, followed by the loss function l applied to both aligned data (the common case) and the newly proposed unaligned data extensions. The overall system is illustrated in Fig. 2.
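In implementation terms, Eq. (1) is simply empirical risk minimization over the training pairs. The following is a minimal PyTorch sketch of one stochastic gradient step on a batch; the generator and the loss used here are placeholder modules for illustration only, not the actual architecture and loss described in the remainder of this section.

```python
import torch
import torch.nn as nn

G = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # placeholder for G_theta_G
l = nn.L1Loss()                                  # placeholder reflection-removal loss
optimizer = torch.optim.Adam(G.parameters(), lr=1e-4)

def train_step(I_batch: torch.Tensor, T_batch: torch.Tensor) -> float:
    """One stochastic step of Eq. (1): minimize (1/N) * sum_n l(G(I_n), T_n)."""
    optimizer.zero_grad()
    loss = l(G(I_batch), T_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```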
calibrate feature maps using global summary statistics.
We will first introduce the details of network architecture Let U = [u1 , . . . , uc , . . . , uC ] denote original, uncali-
GθG followed by the loss function l applied to both aligned brated activations produced by a network block, with C
data (the common case) and newly proposed unaligned data feature maps of size of H × W . These activations gener-
extensions. The overall system is illustrated in Fig. 2. ally only reflect local information residing within the corre-
sponding receptive fields of each filter. We then form scalar,
3.1. Basic Image Reconstruction Network
channel-specific descriptors zc = fgp (uc ) by applying a
Our starting point can be viewed as the basic image re- global average pooling operator fgp to each feature map
construction neural network component from [5] but mod- uc ∈ RH×W . The vector z = [z1 , . . . , zC ] ∈ RC represents
ified in three aspects: (1) We simplify the basic residual a simple statistical summary of global, per-channel activa-
block [12] by removing the batch normalization (BN) layer tions and, when passed through a small network structure,
[14]; (2) we increase the capacity by widening the network can be used to adaptively predict the relative importance of
from 64 to 256 feature maps; and (3) for each input image each channel [13].
I, we extract hypercolumn features [10] from a pretrained More specifically, the channel attention module first
VGG-19 network [32], and concatenate these features with computes s = σ(WU δ(WD z)) where WD is a trainable
I as an augmented network input. As explained in [47], weight matrix that downsamples z to dimension R < C,
such an augmentation strategy can help enable the network δ is a ReLU non-linearity, WU represents a trainable up-
to learn semantic clues from the input image. sampling weight matrix, and σ is a sigmoidal activation.
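As a concrete illustration of modification (3), the sketch below builds an augmented network input by concatenating the image with upsampled VGG-19 feature maps in the spirit of hypercolumns [10]. The particular layers chosen (conv1_2, conv2_2, conv3_2) and the absence of further dimensionality reduction are assumptions for illustration only; the released implementation may differ, and the VGG input is assumed to be ImageNet-normalized.

```python
import torch
import torch.nn.functional as F
import torchvision

vgg = torchvision.models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

# torchvision indices of a few VGG-19 activations (conv1_2, conv2_2, conv3_2);
# the exact set of layers is an assumption for illustration.
HYPERCOL_IDS = {2, 7, 12}

def hypercolumn_input(I: torch.Tensor) -> torch.Tensor:
    """Concatenate the input image with upsampled VGG-19 feature maps."""
    h, w = I.shape[-2:]
    feats, x = [I], I
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in HYPERCOL_IDS:
            feats.append(F.interpolate(x, size=(h, w), mode='bilinear',
                                       align_corners=False))
        if i >= max(HYPERCOL_IDS):
            break
    return torch.cat(feats, dim=1)   # augmented network input
```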
At this point, we have constructed a useful base architecture upon which other, more targeted alterations will be applied shortly. This baseline, which we will henceforth refer to as BaseNet, performs quite well when trained and tested on synthetic data. However, when deployed on real-world reflection images we found that its performance degraded by an appreciable amount, especially on the 20 real images from [47]. Therefore, to better mitigate the transition from the make-believe world of synthetic images to real-life photographs, we describe two modifications for introducing broader contextual information into otherwise local convolutional filters.

3.2. Context Encoding Modules

As mentioned previously, we consider both context between channels and multi-scale context within channels.

Channel-wise context. The underlying design principle here is to introduce global contextual information across channels, and a richer overall structure within residual blocks, without dramatically increasing the parameter count. One way to accomplish this is by incorporating a channel attention module, originally developed in [13], to recalibrate feature maps using global summary statistics.

Let U = [u1, ..., uc, ..., uC] denote the original, uncalibrated activations produced by a network block, with C feature maps of size H × W. These activations generally only reflect local information residing within the corresponding receptive fields of each filter. We then form scalar, channel-specific descriptors zc = fgp(uc) by applying a global average pooling operator fgp to each feature map uc ∈ R^(H×W). The vector z = [z1, ..., zC] ∈ R^C represents a simple statistical summary of global, per-channel activations and, when passed through a small network structure, can be used to adaptively predict the relative importance of each channel [13].

More specifically, the channel attention module first computes s = σ(WU δ(WD z)), where WD is a trainable weight matrix that downsamples z to dimension R < C, δ is a ReLU non-linearity, WU represents a trainable upsampling weight matrix, and σ is a sigmoidal activation. Elements of the resulting output vector s ∈ R^C serve as channel-specific gates for calibrating feature maps via ûc = sc · uc.

Consequently, although each individual convolutional filter has a local receptive field, the determination of which channels are actually important in predicting the transmission layer and suppressing reflections is based on the processing of a global statistic (meaning the channel descriptors computed as activations pass through the network during inference). Additionally, the parameter overhead introduced by this process is exceedingly modest given that WD and WU are just small additional weight matrices associated with each block.
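A minimal PyTorch sketch of such a channel attention module is given below, following the squeeze-and-excitation design of [13]: global average pooling produces the descriptors z, a two-layer bottleneck computes the gates s, and the feature maps are rescaled channel-wise. Module and parameter names are illustrative, and the reduction ratio is an assumed default.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Recalibrate feature maps with gates computed from global statistics."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # f_gp: global average pooling -> z
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),  # W_D (down to R < C)
            nn.ReLU(inplace=True),                                      # delta
            nn.Conv2d(channels // reduction, channels, kernel_size=1),  # W_U (back up to C)
            nn.Sigmoid(),                                               # sigma
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        s = self.gate(self.pool(u))   # one gate per channel
        return u * s                  # û_c = s_c * u_c

# usage: same shape in and out, channels rescaled by global context
x = torch.randn(2, 256, 64, 64)
y = ChannelAttention(256)(x)
```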
Figure 2: Overview of our approach for single image reflection removal. (The generator G contains 13 residual blocks with channel attention followed by pyramid pooling, takes the input image augmented with VGG-19 features, and is trained with the pixel, feature, and adversarial losses for aligned data and the alignment-invariant and adversarial losses for unaligned data, using a discriminator D.)

Multi-scale spatial context. Although we have found that encoding the contextual information across channels already leads to significant empirical gains on real-world images, utilizing complementary multi-scale spatial information within each channel provides further benefit. To accomplish this, we apply a pyramid pooling module [11], which has proven to be an effective global-scene-level representation in semantic segmentation [48]. As shown in Fig. 2, we construct such a module using pooling operations at sizes 4, 8, 16, and 32, situated at the tail of our network before the final construction of T̂. Pooling in this way fuses features under four different pyramid scales. After harvesting the resulting sub-region representations, we perform a non-linear transformation (i.e. a Conv-ReLU pair) to reduce the channel dimension. The refined features are then upsampled via bilinear interpolation. Finally, the different levels of features are concatenated together as a final representation reflecting multi-scale spatial context within each channel; the increased parameter overhead is negligible.
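The following sketch shows one plausible PyTorch realization of the pyramid pooling module described above. It treats 4, 8, 16, and 32 as the pooled grid sizes of adaptive average pooling (whether these numbers denote grid sizes or downsampling factors is an implementation detail not fixed by the text), and the final fusion convolution is an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Fuse multi-scale spatial context: pool at several grid sizes,
    reduce channels with a Conv-ReLU pair, upsample, and concatenate."""
    def __init__(self, in_channels: int, sizes=(4, 8, 16, 32)):
        super().__init__()
        branch_channels = in_channels // len(sizes)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(s),
                nn.Conv2d(in_channels, branch_channels, kernel_size=1),
                nn.ReLU(inplace=True),
            )
            for s in sizes
        ])
        # fuse the original features with the four context branches (assumed detail)
        self.fuse = nn.Conv2d(in_channels + branch_channels * len(sizes),
                              in_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        pyramids = [x]
        for stage in self.stages:
            y = stage(x)
            pyramids.append(F.interpolate(y, size=(h, w),
                                          mode='bilinear', align_corners=False))
        return self.fuse(torch.cat(pyramids, dim=1))
```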
3.3. Training Loss for Aligned Data

In this section, we present our loss function for aligned training pairs (I, T), which consists of three terms similar to previous methods [47, 44].

Pixel loss. Following [5], we penalize the pixel-wise intensity difference between T and T̂ via lpixel = α‖T̂ − T‖₂² + β(‖∇x T̂ − ∇x T‖₁ + ‖∇y T̂ − ∇y T‖₁), where ∇x and ∇y are the gradient operators along the x- and y-directions, respectively. We set α = 0.2 and β = 0.4 in all our experiments.

Feature loss. We define the feature loss based on the activations of the 19-layer VGG network [33] pretrained on ImageNet [29]. Let φl be the feature from the l-th layer of VGG-19; we define the feature loss as lfeat = Σl λl ‖φl(T) − φl(T̂)‖₁, where {λl} are balancing weights. Similar to [47], we use the layers 'conv2_2', 'conv3_2', 'conv4_2', and 'conv5_2' of the VGG-19 network.

Adversarial loss. We further add an adversarial loss to improve the realism of the produced background images. We define an opponent discriminator network DθD and minimize the relativistic adversarial loss [18], defined as l^G_adv = −log(DθD(T, T̂)) − log(1 − DθD(T̂, T)) for GθG and l^D_adv = −log(1 − DθD(T, T̂)) − log(DθD(T̂, T)) for DθD, where DθD(T, T̂) = σ(C(T) − C(T̂)), with σ(·) being the sigmoid function and C(·) the non-transformed discriminator function (refer to [18] for details).

To summarize, our loss for aligned data is defined as:

    laligned = ω1 lpixel + ω2 lfeat + ω3 ladv    (2)

where we empirically set the weights as ω1 = 1, ω2 = 0.1, and ω3 = 0.01 throughout our experiments.
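Put together, the aligned-data objective of Eq. (2) can be sketched as follows. The helper vgg_features(img, layers), which should return the listed VGG-19 activations, and the discriminator logit function C are assumed to exist; the balancing weights λl of the feature loss are placeholders, while α, β, and ω1-ω3 follow the values stated above. The adversarial term mirrors the generator-side formula as written in the text.

```python
import torch
import torch.nn.functional as F

ALPHA, BETA = 0.2, 0.4                       # pixel-loss weights alpha, beta
W1, W2, W3 = 1.0, 0.1, 0.01                  # omega_1..omega_3 in Eq. (2)
FEAT_LAYERS = ['conv2_2', 'conv3_2', 'conv4_2', 'conv5_2']

def pixel_loss(T_hat, T):
    # intensity term (mean-reduced) plus x/y gradient terms
    grad = lambda img, dim: img.diff(dim=dim)
    loss = ALPHA * F.mse_loss(T_hat, T)
    loss = loss + BETA * (F.l1_loss(grad(T_hat, -1), grad(T, -1)) +
                          F.l1_loss(grad(T_hat, -2), grad(T, -2)))
    return loss

def feature_loss(T_hat, T, vgg_features, lambdas=None):
    # vgg_features(img, layers) is an assumed helper returning the listed activations
    lambdas = lambdas or [1.0] * len(FEAT_LAYERS)   # placeholder balancing weights
    feats_hat = vgg_features(T_hat, FEAT_LAYERS)
    feats_ref = vgg_features(T, FEAT_LAYERS)
    return sum(lam * F.l1_loss(a, b) for lam, a, b in zip(lambdas, feats_hat, feats_ref))

def adversarial_loss_G(C, T, T_hat):
    # generator-side relativistic loss as given above, D(a, b) = sigmoid(C(a) - C(b))
    D = lambda a, b: torch.sigmoid(C(a) - C(b))
    eps = 1e-8
    return -(torch.log(D(T, T_hat) + eps) + torch.log(1 - D(T_hat, T) + eps)).mean()

def aligned_loss(T_hat, T, vgg_features, C):
    # Eq. (2): l_aligned = w1 * l_pixel + w2 * l_feat + w3 * l_adv
    return (W1 * pixel_loss(T_hat, T)
            + W2 * feature_loss(T_hat, T, vgg_features)
            + W3 * adversarial_loss_G(C, T, T_hat))
```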
3.4. Training Loss for Unaligned Data

To use misaligned data pairs (I, T) for training, we need a loss function that is invariant to the alignment, such that the true similarity between T and the prediction T̂ can be reasonably measured. In this regard, we note that human observers can easily assess the similarity of two images even if they are not aligned. Consequently, designing a loss that measures image similarity at the perceptual level may serve our goal. This motivates us to directly use a deep feature loss for unaligned data.

Intuitively, the deeper the feature, the more likely it is to be insensitive to misalignment. To experimentally verify this and find a suitable feature layer for our purposes, we conducted tests using a pre-trained VGG-19 network as follows. Given an unaligned image pair (I, T), we use gradient descent to finetune the weights of our network GθG to minimize the feature difference between T and T̂, with features extracted at different layers of VGG-19. Figure 3 shows that using low-level or middle-level features from 'conv2_2' to 'conv4_2' leads to blurry results (similar to directly using a pixel-wise loss), although the reflection is more thoroughly removed. In contrast, using the highest-level feature from 'conv5_2' gives rise to a striking result: the predicted background image is sharp and almost reflection-free.
Figure 3: The effect of using different losses to handle misaligned real data. (a) Input and (b) unaligned reference form the image pair (I, T). (c) Reflection removal result of our network trained on synthetic data and a small number of aligned real data (see Section 4 for details); reflection can still be observed in the predicted background image. (d) Result finetuned on (I, T) with the pixel-wise intensity loss. (e)-(h) Results finetuned with features at the 'conv2_2', 'conv3_2', 'conv4_2', and 'conv5_2' layers of VGG-19; only the highest-level feature from 'conv5_2' yields a satisfactory result. (i) Result finetuned with the loss of [27]. (Best viewed on screen with zoom.)

Recently, [27] introduced a "contextual loss" which is also designed for training deep networks with unaligned data, for image-to-image translation tasks like image style transfer. In Fig. 3, we also present the finetuned result using this loss for our reflection removal task. Upon visual inspection, the results are similar to those of our highest-level VGG feature loss (a quantitative comparison can be found in the experiment section). However, our adopted loss (formally defined below) is much simpler and more computationally efficient than the loss from [27].

Alignment-invariant loss. Based on the above study, we now formally define our invariant loss component designed for unaligned data as linv = ‖φh(T) − φh(T̂)‖₁, where φh denotes the 'conv5_2' feature of the pretrained VGG-19 network. For unaligned data, we also apply an adversarial loss, which is not affected by misalignment. Therefore, our overall loss for unaligned data can be written as

    lunaligned = ω4 linv + ω5 ladv    (3)

where we set the weights as ω4 = 0.1 and ω5 = 0.01.
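The alignment-invariant loss and the unaligned objective of Eq. (3) can be sketched directly on top of torchvision's pretrained VGG-19, assuming the standard layer ordering in which 'conv5_2' is module index 30 of vgg19().features (taking the activation before or after its ReLU is a minor implementation choice), and assuming inputs already normalized with ImageNet statistics.

```python
import torch
import torch.nn.functional as F
import torchvision

# phi_h: 'conv5_2' activations of an ImageNet-pretrained VGG-19
# (module index 30 of torchvision's vgg19().features in the standard layout).
_phi_h = torchvision.models.vgg19(pretrained=True).features[:31].eval()
for p in _phi_h.parameters():
    p.requires_grad_(False)

def alignment_invariant_loss(T_hat: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    """l_inv = || phi_h(T) - phi_h(T_hat) ||_1 (mean-reduced)."""
    return F.l1_loss(_phi_h(T_hat), _phi_h(T))

def unaligned_loss(T_hat, T, adv_loss, w4=0.1, w5=0.01):
    """Eq. (3): l_unaligned = w4 * l_inv + w5 * l_adv."""
    return w4 * alignment_invariant_loss(T_hat, T) + w5 * adv_loss
```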
4. Experiments

4.1. Implementation Details

Training data. We adopt a fusion of synthetic and real data as our training dataset. The images from [5] are used as synthetic data, i.e. 7,643 cropped images of size 224 × 224 from the PASCAL VOC dataset [4]. The 90 real-world training images from [47] are adopted as real data. For image synthesis, we use the same data generation model as [5] to create our synthetic data. In the following, we always use the same dataset for training, unless specifically stated.

Training details. Our implementation is based on PyTorch; code is released at https://github.com/Vandermode/ERRNet. We train the model for 60 epochs using the Adam optimizer [19]. The base learning rate is set to 10⁻⁴ and halved at epoch 30, then reduced to 10⁻⁵ at epoch 50. The weights are initialized as in [26].
4.2. Ablation Study

In this section, we conduct an ablation study of our method on 100 synthetic testing images from [5] and 20 real testing images from [47] (denoted by 'Real20').

Component analysis. To verify the importance of our network design, we compare four model architectures as described in Section 3: (1) our basic image reconstruction network BaseNet; (2) BaseNet with the channel-wise context module (BaseNet + CWC); (3) BaseNet with the multi-scale spatial context module (BaseNet + MSC); and (4) our enhanced reflection removal network, denoted ERRNet, i.e., BaseNet + CWC + MSC. The result of CEILNet [5] fine-tuned on our training data (denoted by CEILNet-F) is also provided as an additional reference.

Table 1: Comparison of different settings. Our full model (i.e. ERRNet) leads to the best performance among all comparisons.

Model           | Synthetic (PSNR / SSIM) | Real20 (PSNR / SSIM)
CEILNet-F [5]   | 24.70 / 0.884           | 20.32 / 0.739
BaseNet only    | 25.71 / 0.926           | 21.51 / 0.780
BaseNet + CWC   | 27.64 / 0.940           | 22.61 / 0.796
BaseNet + MSC   | 26.03 / 0.928           | 21.75 / 0.783
ERRNet          | 27.88 / 0.941           | 22.89 / 0.803

As shown in Table 1, our BaseNet already achieves a much better result than CEILNet-F. The performance of BaseNet is further boosted by the channel-wise context and multi-scale spatial context modules, especially when they are used together, i.e. in ERRNet. Figure 4 visually compares the results of BaseNet and ERRNet. It can be observed that BaseNet struggles to discriminate the reflection regions and leaves some obvious residuals, while ERRNet removes the reflection and produces much cleaner transmitted images. These results suggest the effectiveness of our network design, especially the components tailored to encode contextual clues.
Figure 4: Comparison of the results with (ERRNet) and without (BaseNet) the context encoding modules. (Columns: Input, BaseNet, ERRNet.)

Efficacy of the training loss for unaligned data. In this experiment, we first train our ERRNet with only 'synthetic data', with 'synthetic + 50 aligned real data', and with 'synthetic + 90 aligned real data'. The loss function in Eq. (2) is used for aligned data. As shown in Table 2, the testing results improve as more real data is added.

Table 2: Simulation experiment to verify the efficacy of our alignment-invariant loss.

Training Scheme                             | PSNR  | SSIM
Synthetic only                              | 19.79 | 0.741
+ 50 aligned                                | 22.00 | 0.785
+ 90 aligned                                | 22.89 | 0.803
+ 50 aligned, + 40 unaligned trained with:  |       |
    lpixel                                  | 21.85 | 0.766
    linv                                    | 22.38 | 0.797
    lcx                                     | 22.47 | 0.796
    linv + lcx                              | 22.43 | 0.796

Then, we synthesize misalignment by performing random translations within [−10, 10] pixels on the real data, and train ERRNet with 'synthetic + 50 aligned real data + 40 unaligned data'. The pixel-wise loss lpixel and the alignment-invariant loss linv are used for the 40 unaligned images. Table 2 shows that employing the 40 unaligned images with the lpixel loss degrades performance, even below that of 50 aligned images without any additional unaligned data. (Our alignment-invariant loss linv can handle shifts of up to 20 pixels; see the supplementary material for more details.)
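For reference, the misalignment simulation described above can be implemented as a random integer translation of the reference image; since the text does not specify how the exposed border is handled, zero padding is assumed here.

```python
import random
import torch

def random_translate(img: torch.Tensor, max_shift: int = 10) -> torch.Tensor:
    """Simulate misalignment: shift a (C, H, W) reference image by a random
    integer offset in [-max_shift, max_shift] along each spatial axis."""
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    shifted = torch.roll(img, shifts=(dy, dx), dims=(-2, -1))
    # zero out the wrapped-around border so no content leaks across image edges
    if dy > 0:
        shifted[..., :dy, :] = 0
    elif dy < 0:
        shifted[..., dy:, :] = 0
    if dx > 0:
        shifted[..., :, :dx] = 0
    elif dx < 0:
        shifted[..., :, dx:] = 0
    return shifted
```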
In addition, we also investigate the contextual loss lcx of [27]. Results from both the contextual loss lcx and our alignment-invariant loss linv (or their combination linv + lcx) surpass analogous results obtained with only aligned images by appreciable margins, indicating that these losses provide useful supervision to networks given unaligned data. Note that although linv and lcx perform equally well, our linv is much simpler and more computationally efficient than lcx, suggesting that linv is a lightweight alternative to lcx for our reflection removal task.

4.3. Method Comparison on Benchmarks

In this section, we compare our ERRNet against state-of-the-art methods including the optimization-based method of [25] (LB14) and the learning-based approaches CEILNet [5], Zhang et al. [47], and BDN [44]. For a fair comparison, we finetune these models on our training dataset and report results of both the original pretrained model and the finetuned version (denoted with the suffix '-F').

The comparison is conducted on four real-world datasets, i.e. the 20 testing images of [47] and three sub-datasets from SIR² [37]. These three sub-datasets are captured under different conditions: (1) 20 controlled indoor scenes composed of solid objects; (2) 20 different controlled scenes on postcards; and (3) 55 wild scenes with ground truth provided (images indexed by 1, 2, and 74 are removed due to misalignment). In the following, we denote these datasets by 'Real20', 'Objects', 'Postcard', and 'Wild', respectively.

Table 3 summarizes the results of all competing methods on the four real-world datasets. The quality metrics include PSNR, SSIM [40], NCC [43, 37], and LMSE [8]. Larger values of PSNR, SSIM, and NCC indicate better performance, while a smaller value of LMSE implies a better result. Our ERRNet achieves state-of-the-art performance on the 'Real20' and 'Objects' datasets. Meanwhile, our result is comparable to the best-performing BDN-F on the 'Postcard' data. The quantitative results on the 'Wild' dataset reveal a frustrating fact, namely, that no method could outperform the naive baseline 'Input', suggesting that there is still large room for improvement.

Figure 5 displays visual results on real-world images. It can be seen that all compared methods fail to handle some strong reflections, but our network more accurately removes many undesirable artifacts, e.g. the tree branches reflected on the building window in the fourth photo of Fig. 5.

4.4. Training with Unaligned Data

To test our alignment-invariant loss on real-world unaligned data, we first collected a dataset of unaligned image pairs with cameras and a portable glass, as shown in Fig. 1. Both a DSLR camera and a smart phone were used to capture the images. We collected 450 image pairs in total, and some samples are shown in Fig. 6. These image pairs are randomly split into a training set of 400 samples and a testing set of 50 samples.

We conduct experiments on the BDN-F and ERRNet models, each of which is first trained on the aligned dataset (w/o unaligned) as in Section 4.3, and then finetuned with our alignment-invariant loss and the unaligned training data. The resulting pairs before and after finetuning are assembled for human assessment, as no existing numerical metric is available for evaluating unaligned data.
8183
Figure 5: Visual comparison on real-world images. The images are obtained from 'Real20' (rows 1-3) and our collected unaligned dataset (rows 4-5). More results can be found in the suppl. material. (Columns: Input, LB14 [25], CEILNet-F [5], Zhang et al. [47], BDN-F [44], ERRNet, Reference.)

Table 3: Quantitative results of different methods on four real-world benchmark datasets. The best results are indicated by red color and the second best results are denoted by blue color. The results of 'Average' are obtained by averaging the metric scores of all images from these four real-world datasets.

Dataset  | Index | Input | LB14 [25] | CEILNet [5] | CEILNet-F | Zhang et al. [47] | BDN [44] | BDN-F | ERRNet
Real20   | PSNR  | 19.05 | 18.29     | 18.45       | 20.32     | 21.89             | 18.41    | 20.06 | 22.89
Real20   | SSIM  | 0.733 | 0.683     | 0.690       | 0.739     | 0.787             | 0.726    | 0.738 | 0.803
Real20   | NCC   | 0.812 | 0.789     | 0.813       | 0.834     | 0.903             | 0.792    | 0.825 | 0.877
Real20   | LMSE  | 0.027 | 0.033     | 0.031       | 0.028     | 0.022             | 0.032    | 0.027 | 0.022
Objects  | PSNR  | 23.74 | 19.39     | 23.62       | 23.36     | 22.72             | 22.73    | 24.00 | 24.87
Objects  | SSIM  | 0.878 | 0.786     | 0.867       | 0.873     | 0.879             | 0.856    | 0.893 | 0.896
Objects  | NCC   | 0.981 | 0.971     | 0.972       | 0.974     | 0.964             | 0.978    | 0.978 | 0.982
Objects  | LMSE  | 0.004 | 0.007     | 0.005       | 0.005     | 0.005             | 0.005    | 0.004 | 0.003
Postcard | PSNR  | 21.30 | 14.88     | 21.24       | 19.17     | 16.85             | 20.71    | 22.19 | 22.04
Postcard | SSIM  | 0.878 | 0.795     | 0.834       | 0.793     | 0.799             | 0.859    | 0.881 | 0.876
Postcard | NCC   | 0.947 | 0.929     | 0.945       | 0.926     | 0.886             | 0.943    | 0.941 | 0.946
Postcard | LMSE  | 0.005 | 0.008     | 0.008       | 0.013     | 0.007             | 0.005    | 0.004 | 0.004
Wild     | PSNR  | 26.24 | 19.05     | 22.36       | 22.05     | 21.56             | 22.36    | 22.74 | 24.25
Wild     | SSIM  | 0.897 | 0.755     | 0.821       | 0.844     | 0.836             | 0.830    | 0.872 | 0.853
Wild     | NCC   | 0.941 | 0.894     | 0.918       | 0.924     | 0.919             | 0.932    | 0.922 | 0.917
Wild     | LMSE  | 0.005 | 0.027     | 0.013       | 0.009     | 0.010             | 0.009    | 0.008 | 0.011
Average  | PSNR  | 22.85 | 17.51     | 22.30       | 21.41     | 20.22             | 21.70    | 22.96 | 23.59
Average  | SSIM  | 0.874 | 0.781     | 0.841       | 0.832     | 0.838             | 0.848    | 0.879 | 0.879
Average  | NCC   | 0.955 | 0.937     | 0.948       | 0.943     | 0.925             | 0.951    | 0.950 | 0.956
Average  | LMSE  | 0.006 | 0.011     | 0.009       | 0.010     | 0.007             | 0.007    | 0.006 | 0.005

We asked 30 human observers to provide a preference score among {−2, −1, 0, 1, 2}, with 2 indicating that the finetuned result is significantly better and −2 the opposite. To avoid bias, we randomly switch the image positions of each pair. In total, 3,000 human judgments are collected (2 methods, 30 users, 50 image pairs). More details regarding this evaluation process can be found in the suppl. material.
Figure 6: Image samples in our unaligned image dataset. Our dataset covers a large variety of indoor and outdoor environments, including dynamic scenes with vehicles, humans, etc.

Table 4: Human preference scores of the self-comparison experiments. Left: results of BDN-F; right: results of ERRNet. (The accompanying sub-figures plot the per-image scores for the 50 testing images.)

Score Range    | BDN-F | ERRNet
(0.25, 2]      | 78%   | 54%
[−0.25, 0.25]  | 18%   | 36%
[−2, −0.25)    | 4%    | 10%
Average Score  | 0.62  | 0.51

Figure 7: Results of training with and without unaligned data. See suppl. material for more examples. (Columns: input, reference, BDN-F w/o and with unaligned data, ERRNet w/o and with unaligned data. Best viewed on screen with zoom.)

Table 4 shows the average human preference scores for the resulting pairs of each method. As can be seen, human observers clearly tend to prefer the results produced by the finetuned models over the raw ones, which demonstrates the benefit of leveraging unaligned data for training, independent of the network architecture. Figure 7 shows some typical results of the two methods; the results are significantly improved by training on unaligned data.

5. Conclusion

We have proposed an enhanced reflection removal network together with an alignment-invariant loss function to help resolve the difficulty of single image reflection removal. We investigated the possibility of directly utilizing misaligned training data, which can significantly alleviate the burden of capturing real-world training data. To efficiently extract the underlying knowledge from real training data, we introduce context encoding modules, which can be seamlessly embedded into our network to help discriminate and suppress the reflection component. Extensive experiments demonstrate that our approach sets a new state-of-the-art on real-world benchmarks of single image reflection removal, both quantitatively and visually.

Acknowledgments

We thank Yunhao Zou for his great help in collecting the reflection image dataset. This work was supported by the National Natural Science Foundation of China under Grants No. 61425013 and No. 61672096.
References

[1] A. Agrawal, R. Raskar, S. K. Nayar, and Y. Li. Removing photography artifacts using gradient projection and flash-exposure sampling. ACM Transactions on Graphics (TOG), 24(3):828–835, 2005.
[2] N. Arvanitopoulos, R. Achanta, and S. Süsstrunk. Single image reflection suppression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[3] Z. Chi, X. Wu, X. Shu, and J. Gu. Single image reflection removal using deep encoder-decoder network. arXiv preprint arXiv:1802.00094, 2018.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010.
[5] Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf. A generic deep architecture for single image reflection removal and image smoothing. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[6] H. Farid and E. H. Adelson. Separating reflections and lighting using independent components analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 1999.
[7] K. Gai, Z. Shi, and C. Zhang. Blind separation of superimposed moving images using image statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(1):19–32, 2012.
[8] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In IEEE International Conference on Computer Vision (ICCV), Oct 2009.
[9] X. Guo, X. Cao, and Y. Ma. Robust separation of reflection from multiple images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2014.
[10] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(9):1904–1916, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[13] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[15] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[16] M. Jin, S. Süsstrunk, and P. Favaro. Learning to see through reflections. In IEEE International Conference on Computational Photography (ICCP), May 2018.
[17] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pages 694–711, 2016.
[18] A. Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. In International Conference on Learning Representations (ICLR), 2019.
[19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[20] N. Kong, Y.-W. Tai, and J. S. Shin. A physically-based approach to reflection separation: from physical modeling to constrained optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(2):209–221, 2014.
[21] D. Lee, M.-H. Yang, and S. Oh. Generative single image reflection separation. arXiv preprint arXiv:1801.04102, 2018.
[22] A. Levin and Y. Weiss. User assisted separation of reflections from a single image using a sparsity prior. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(9):1647–1654, 2007.
[23] A. Levin, A. Zomet, and Y. Weiss. Learning to perceive transparency from the statistics of natural scenes. In Advances in Neural Information Processing Systems (NIPS), December 2002.
[24] Y. Li and M. S. Brown. Exploiting reflection change for automatic reflection removal. In The IEEE International Conference on Computer Vision (ICCV), December 2013.
[25] Y. Li and M. S. Brown. Single image layer separation using relative smoothness. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2752–2759, 2014.
[26] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[27] R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual loss for image transformation with non-aligned data. In The European Conference on Computer Vision (ECCV), September 2018.
[28] S. Nah, T. Hyun Kim, and K. Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[30] B. Sarel and M. Irani. Separating transparent layers through layer information exchange. In European Conference on Computer Vision (ECCV), September 2004.
[31] Y. Shih, D. Krishnan, F. Durand, and W. T. Freeman. Reflection removal using ghosting cues. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[34] S. N. Sinha, J. Kopf, M. Goesele, D. Scharstein, and R. Szeliski. Image-based rendering for scenes with reflections. ACM Transactions on Graphics (TOG), 31(4):100:1, 2012.
[35] R. Szeliski, S. Avidan, and P. Anandan. Layer extraction from multiple images containing reflections and transparency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2000.
[36] R. Wan, B. Shi, L. Duan, A. Tan, W. Gao, and A. C. Kot. Region-aware reflection removal with unified content and gradient priors. IEEE Transactions on Image Processing, 27(6):2927–2941, 2018.
[37] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot. Benchmarking single-image reflection removal algorithms. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[38] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot. CRRN: Multi-scale guided concurrent reflection removal network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[39] R. Wan, B. Shi, T. A. Hwee, and A. C. Kot. Depth of field guided reflection removal. In IEEE International Conference on Image Processing, September 2016.
[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[41] Y. Wu and K. He. Group normalization. In European Conference on Computer Vision (ECCV), September 2018.
[42] L. Xu, C. Lu, Y. Xu, and J. Jia. Image smoothing via L0 gradient minimization. ACM Transactions on Graphics (TOG), 30(6):174, 2011.
[43] T. Xue, M. Rubinstein, C. Liu, and W. T. Freeman. A computational approach for obstruction-free photography. ACM Transactions on Graphics (TOG), 34(4):79, 2015.
[44] J. Yang, D. Gong, L. Liu, and Q. Shi. Seeing deeply and bidirectionally: A deep learning approach for single image reflection removal. In The European Conference on Computer Vision (ECCV), September 2018.
[45] J. Yang, H. Li, Y. Dai, and R. T. Tan. Robust optical flow estimation of double-layer images under transparency or reflection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[46] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[47] X. Zhang, R. Ng, and Q. Chen. Single image reflection separation with perceptual losses. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[48] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[49] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
