
PEPSI++: Fast and Lightweight Network for Image Inpainting

Yong-Goo Shin, Min-Cheol Sagong, Yoon-Jae Yeo, Seung-Wook Kim, and Sung-Jea Ko, Fellow, IEEE

arXiv:1905.09010v5 [cs.CV] 6 Mar 2020

Abstract—Among the various generative adversarial network (GAN)-based image inpainting methods, the coarse-to-fine network with a contextual attention module (CAM) has shown remarkable performance. However, owing to its two stacked generative networks, the coarse-to-fine network needs numerous computational resources such as convolution operations and network parameters, which result in low speed. To address this problem, we propose a novel network architecture called PEPSI: parallel extended-decoder path for semantic inpainting network, which aims at reducing the hardware costs and improving the inpainting performance. PEPSI consists of a single shared encoding network and parallel decoding networks called coarse and inpainting paths. The coarse path produces a preliminary inpainting result to train the encoding network for the prediction of features for the CAM. Simultaneously, the inpainting path generates a higher-quality inpainting result using the refined features reconstructed via the CAM. In addition, we propose Diet-PEPSI, which significantly reduces the network parameters while maintaining the performance. In Diet-PEPSI, to capture the global contextual information with low hardware costs, we propose novel rate-adaptive dilated convolutional layers, which employ the common weights but produce dynamic features depending on the given dilation rates. Extensive experiments comparing the performance with state-of-the-art image inpainting methods demonstrate that both PEPSI and Diet-PEPSI improve the qualitative scores, i.e. the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), as well as significantly reduce hardware costs such as computational time and the number of network parameters.

Index Terms—Deep learning, generative adversarial network, image inpainting

Y.-G. Shin is with the School of Electrical Engineering, Korea University, Anam-dong, Sungbuk-gu, Seoul, 136-713, Rep. of Korea (e-mail: [email protected]).
M.-C. Sagong is with the School of Electrical Engineering, Korea University, Anam-dong, Sungbuk-gu, Seoul, 136-713, Rep. of Korea (e-mail: [email protected]).
Y.-J. Yeo is with the School of Electrical Engineering, Korea University, Anam-dong, Sungbuk-gu, Seoul, 136-713, Rep. of Korea (e-mail: [email protected]).
S.-W. Kim is with the School of Electrical Engineering, Korea University, Anam-dong, Sungbuk-gu, Seoul, 136-713, Rep. of Korea (e-mail: [email protected]).
S.-J. Ko is with the School of Electrical Engineering, Korea University, Anam-dong, Sungbuk-gu, Seoul, 136-713, Rep. of Korea (e-mail: [email protected]).

I. INTRODUCTION

IMAGE inpainting techniques, which attempt to remove an unwanted object or synthesize missing parts of an image, have attracted wide-spread interest in the computer vision and graphics communities [1]–[17]. Recent studies used the generative adversarial network (GAN) to produce appropriate structures for the missing regions, i.e. hole regions [8], [9], [18]. Among the recent state-of-the-art inpainting methods, the coarse-to-fine network with a contextual attention module (CAM) has shown remarkable performance [4], [19]. This network is composed of two stacked generative networks: the coarse network and the refinement network. The coarse network roughly fills the hole regions using a simple dilated convolutional network trained with a reconstruction loss. The refinement network improves the quality of the roughly completed image by using the CAM, which generates feature patches of the hole regions by borrowing information from distant spatial locations. Despite the promising results, the coarse-to-fine network requires high computational resources and consumes considerable memory.

In previous work [20], we introduced a novel network structure called PEPSI: parallel extended-decoder path for semantic inpainting, which aims at reducing the number of convolution operations as well as improving the inpainting performance. PEPSI is composed of a single encoding network and parallel decoding networks consisting of coarse and inpainting paths. The coarse path generates a preliminary inpainting result to train the encoding network for the prediction of features for the CAM. At the same time, the inpainting path produces a high-quality image using the refined features reconstructed via the CAM. To make a single encoding network handle two different tasks, i.e. feature extraction for both the roughly completed and high-quality results, we propose a joint learning technique that jointly optimizes the two different paths. This learning scheme facilitates the generation of high-quality inpainted images without the stacked generative networks, i.e. the coarse-to-fine network.

Although PEPSI exhibits faster operation speed compared with conventional methods, it still needs substantial memory owing to a series of dilated convolutional layers in the encoding network, which occupies nearly 67 percent of the network parameters. The intuitive way to save memory consumption is to prune channels in the dilated convolutional layers; however, this often results in inferior results. To address this challenge, this paper presents an extended version of PEPSI, called Diet-PEPSI, which significantly reduces the network parameters while retaining the inpainting performance. In Diet-PEPSI, we propose novel rate-adaptive dilated convolutional layers which require low hardware costs by sharing the weights in every layer but generate dynamic features according to the given dilation rates. More specifically, to produce the rate-specific features, the rate-adaptive dilated convolutional layers modulate the shared weights by scaling and shifting them differently according to the given dilation rates. Since the rate-adaptive dilated convolutional layers share the weights with each other, the number of network parameters can be significantly reduced compared with multiple standard dilated convolutional layers.

Fig. 1. Overview of the network architectures of the conventional and proposed methods, where D and G indicate the discriminator and generator, respectively.
(a) Architecture of traditional encoder-decoder network [7]. (b) Architecture of coarse-to-fine network [4], [19]. (c) Architecture of PEPSI.

In this paper, we apply the proposed rate-adaptive dilated convolutional layers to Diet-PEPSI using residual blocks [21] called Diet-PEPSI units (DPUs). By replacing the multiple dilated convolutional layers with DPUs, Diet-PEPSI covers the same size of receptive field with a smaller number of parameters than PEPSI.

Furthermore, we investigate an obstacle with the discriminator in traditional GAN-based image inpainting methods [14], [22]. In general, conventional methods employ global and local discriminators trained with a combined loss, consisting of the L2 pixel-wise reconstruction loss and an adversarial loss, which assists the networks in generating a more natural image by minimizing the difference between the reference and the inpainted images. More specifically, the global discriminator takes the whole image as input to recognize global consistency, whereas the local one only views a small region around the hole in order to judge the quality of the more detailed appearance. However, the local discriminator has a drawback in that it can only deal with a single rectangular hole region. In other words, since holes can appear with arbitrary shapes, sizes, and locations in real-world applications, the local discriminator is difficult to apply to the inpainting network for holes with irregular shapes. To solve this problem, we propose a region ensemble discriminator (RED) which integrates the global and local discriminators. Since each pixel in the last layer has a different receptive field in the image domain, the RED adopts individual fully connected layers on each pixel in the last convolutional layer. By individually computing an adversarial loss at each pixel, the RED can deal with various holes with arbitrary shapes.

In summary, this paper has three major contributions. (i) We propose a novel network architecture called PEPSI that achieves superior performance compared to conventional methods as well as significantly reduces the operation time. (ii) We propose Diet-PEPSI, which applies novel rate-adaptive convolution layers to further reduce the hardware costs while maintaining the overall quality of the results, which makes the proposed method compatible with hardware implementation. (iii) A novel discriminator, called RED, is proposed to handle both square and irregular hole regions for real applications. In the remainder of this paper, we introduce the related work and preliminaries in Section II and Section III, respectively. PEPSI and Diet-PEPSI are discussed in Section IV. In Section V, extensive experimental results are presented to demonstrate that the proposed method outperforms conventional methods on various datasets such as CelebA [23], [24], Place2 [25], and ImageNet [26]. Finally, the conclusion is provided in Section VI.

II. RELATED WORK

Existing image inpainting techniques can be divided into two groups [4]: traditional and deep learning-based methods. The traditional techniques include diffusion-based and patch-based methods. The diffusion-based method fills the hole regions by propagating the local image appearance around the holes [1], [2], [4], [5]. The diffusion-based method performs well on small and narrow holes, but often fails to fill complex hole regions such as faces and objects with non-repetitive structures. In contrast, the patch-based technique achieves better performance in filling complicated images with large hole regions [4], [27]. This method samples texture patches from the existing regions of the image, i.e. background regions, and pastes them into the hole region. Barnes et al. [3] introduced a fast approximate nearest neighbor patch search algorithm, called PatchMatch, which exhibited notable performance for image editing applications such as image inpainting.

Fig. 2. Illustration of the CAM. The conventional CAM reconstructs foreground patches by measuring the cosine similarities with background patches. In contrast, the modified CAM uses the Euclidean distance to compute similarity scores.

However, PatchMatch often fills the hole regions regardless of the visual semantics or the global structure of an image, which results in images with poor visual quality.

Fig. 3. Toy examples for the coarse network. (a) Masked input image. (b) Original image. (c) Result from the coarse-to-fine network. (d) Result without the coarse result. (e) Result with LR coarse path.

TABLE I
EXPERIMENTAL RESULTS WITH GATEDCONV (GC) [19] USING DIFFERENT COARSE PATHS. GC* INDICATES A MODEL TRAINED WITHOUT COARSE RESULTS AND GC† INDICATES A MODEL TRAINED WITH A SIMPLIFIED COARSE PATH.

Method | Square mask PSNR | Square mask SSIM | Free-form mask PSNR | Free-form mask SSIM | Time
GC | 24.67 | 0.8949 | 27.78 | 0.9252 | 21.39 ms
GC* | 23.50 | 0.8822 | 26.35 | 0.9098 | 14.28 ms
GC† | 23.71 | 0.8752 | 26.22 | 0.9026 | 13.32 ms

By using the convolutional neural network (CNN), the deep learning-based method learns how to extract semantic information for producing the structures of the hole regions [8], [9], [18]. The CNN-based image inpainting methods employing an encoder-decoder structure have shown superior performance in inpainting complex hole regions compared with the diffusion- or patch-based methods [8], [18]. However, these methods often generate an image with visual artifacts such as boundary artifacts and blurry texture inconsistent with the surrounding areas. To alleviate this problem, Pathak et al. [10] adopted the GAN [22] to enhance the coherence between the background and hole regions. They trained the entire network using a combined loss, consisting of the L2 pixel-wise reconstruction loss and an adversarial loss, which drives the networks to minimize the difference between the reference and inpainted images as well as to produce plausible new contents in highly structured images such as faces and scenes. However, this method has a limitation in that it can only fill a square hole located at the center of an image.

To inpaint images with a square hole in arbitrary locations, as shown in Fig. 1(a), Iizuka et al. [7] proposed an improved network structure which employs two sibling discriminators: global and local discriminators. More specifically, the local discriminator only considers the inpainted region to classify the local texture consistency, whereas the global discriminator inspects whether the resultant image is consistent across the whole image. Recently, Yu et al. [4] have extended this work by using the coarse-to-fine network and the CAM. In particular, by computing the cosine similarity between the background and foreground feature patches, the CAM learns where to borrow the background features for the hole region. In order to collect the background features involved with the missing region, the CAM requires the features at the missing region to be encoded from roughly completed images. Thus, as shown in Fig. 1(b), this method employs two stacked generative networks (coarse and refinement networks) to generate an intermediate result, i.e. the coarse result, and an inpainting result refined through the refinement network having the CAM. This method achieved remarkable performance compared with the recent state-of-the-art inpainting methods; however, it requires considerable computational resources owing to the two stacked generative networks.

III. PRELIMINARIES

A. Generative adversarial networks

The GAN was first introduced by Goodfellow et al. [22] for image generation. In general, a GAN consists of a generator G and a discriminator D which are trained with competing goals. The generator is trained to produce a new image, indistinguishable from real images, while the discriminator is optimized to differentiate between real and generated images. Formally, G (D) tries to minimize (maximize) the loss function, i.e. adversarial loss, as follows:

$$\min_G \max_D \; \mathbb{E}_{x\sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z\sim P_z(z)}[\log(1-D(G(z)))], \quad (1)$$

where z and x denote a random noise vector and a real image sampled from the noise distribution $P_z(z)$ and the real data distribution $P_{data}(x)$, respectively. Recently, the GAN has been applied to several semantic inpainting techniques [4], [7], [10] to fill the holes naturally.

Fig. 4. The architecture of PEPSI. The coarse and inpainting paths share their weights to improve each other. The coarse path is trained only with the L1 reconstruction loss, while the inpainting path is trained with both the L1 and adversarial losses.

TABLE II
DETAILED ARCHITECTURE OF THE ENCODING NETWORK.

Type | Kernel | Dilation | Stride | Outputs
Convolution | 5×5 | 1 | 1×1 | 32
Convolution | 3×3 | 1 | 2×2 | 64
Convolution | 3×3 | 1 | 1×1 | 64
Convolution | 3×3 | 1 | 2×2 | 128
Convolution | 3×3 | 1 | 1×1 | 128
Convolution | 3×3 | 1 | 2×2 | 256
Dilated convolution | 3×3 | 2 | 1×1 | 256
Dilated convolution | 3×3 | 4 | 1×1 | 256
Dilated convolution | 3×3 | 8 | 1×1 | 256
Dilated convolution | 3×3 | 16 | 1×1 | 256

TABLE III
DETAILED ARCHITECTURE OF THE DECODING NETWORK. THE OUTPUT LAYER CONSISTS OF A CONVOLUTION LAYER WITH VALUES CLIPPED TO [-1, 1].

Type | Kernel | Dilation | Stride | Outputs
Convolution ×2 | 3×3 | 1 | 1×1 | 128
Upsample (×2 ↑) | - | - | - | -
Convolution ×2 | 3×3 | 1 | 1×1 | 64
Upsample (×2 ↑) | - | - | - | -
Convolution ×2 | 3×3 | 1 | 1×1 | 32
Upsample (×2 ↑) | - | - | - | -
Convolution ×2 | 3×3 | 1 | 1×1 | 16
Convolution (Output) | 3×3 | 1 | 1×1 | 3

B. Coarse-to-fine network

In [4], [19], a two-stage network called the coarse-to-fine network, which separately conducts a couple of tasks, is proposed. The coarse-to-fine network first generates an initial coarse prediction, i.e. the coarse result, using the coarse network, and then refines the result by encoding features from the coarse result with the refinement network. To refine the coarse prediction effectively, they introduced the CAM, which generates patches of the hole region using features from distant background patches. As depicted in Fig. 2, the CAM divides the input feature maps into a target foreground and its surrounding background, and extracts 3 × 3 patches. Then, the similarity score $s_{(x,y),(x',y')}$ between the foreground patch $f_{x,y}$ at $(x, y)$ and the background patch $b_{x',y'}$ at $(x', y')$ is computed by using the normalized inner product (cosine similarity), which is expressed as follows:

$$s_{(x,y),(x',y')} = \left\langle \frac{f_{x,y}}{\|f_{x,y}\|}, \frac{b_{x',y'}}{\|b_{x',y'}\|} \right\rangle, \quad (2)$$

$$s^{*}_{(x,y),(x',y')} = \mathrm{softmax}\big(\lambda s_{(x,y),(x',y')}\big), \quad (3)$$

where λ is a hyper-parameter for the scaled softmax. By taking a weighted sum of background patches using $s^{*}_{(x,y),(x',y')}$ as weights, the CAM rebuilds the features of the foreground regions, i.e. the reconstructed features. The CAM effectively learns where to borrow or copy the feature information from the background region for the unknown foreground regions, but it requires the coarse result to explicitly attend to related features at distant spatial locations.
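As a rough illustration of Eqs. (2) and (3), the following NumPy sketch rebuilds foreground patches as a softmax-weighted sum of background patches. The flattened 3 × 3 patch representation, the value of λ, and the random inputs are simplifying assumptions for exposition, not the authors' exact implementation.

import numpy as np

def cam_reconstruct(fg, bg, lam=10.0):
    """Toy contextual-attention step on flattened 3x3 feature patches.

    fg: (Nf, P) foreground (hole) patches and bg: (Nb, P) background
    patches, with P = 3*3*C. lam is the scaled-softmax hyper-parameter.
    """
    # Eq. (2): cosine similarity between every fg/bg patch pair.
    fg_n = fg / (np.linalg.norm(fg, axis=1, keepdims=True) + 1e-8)
    bg_n = bg / (np.linalg.norm(bg, axis=1, keepdims=True) + 1e-8)
    s = fg_n @ bg_n.T                                  # (Nf, Nb)
    # Eq. (3): scaled softmax over background locations.
    e = np.exp(lam * (s - s.max(axis=1, keepdims=True)))
    w = e / e.sum(axis=1, keepdims=True)
    # Weighted sum of background patches rebuilds each foreground patch.
    return w @ bg                                      # (Nf, P)

# Example: 16 hole patches rebuilt from 64 background patches (C = 256).
rec = cam_reconstruct(np.random.randn(16, 9 * 256),
                      np.random.randn(64, 9 * 256))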

To justify this assumption, we conducted experiments that measure the performance of the coarse-to-fine network with and without the coarse path. In our experiments, we trained the refinement network using raw masked images as input. As shown in Table I and Fig. 3, the refinement network without the coarse result shows worse results than the full coarse-to-fine network. These results reveal that if the coarse features of the hole region are not encoded well, the CAM reconstructs features using unrelated feature patches, resulting in inferior results. For instance, as shown in Fig. 3(d), the refinement network trained without the coarse result produces artifacts such as a wrinkle on the cheek. In other words, the coarse-to-fine network must pass through a two-stage encoder-decoder network which requires massive computational resources. Furthermore, to reduce the operation time of the coarse-to-fine network in another way, we conducted an extra experiment by simplifying the coarse network. In this experiment, we generated the coarse result at low resolution (64 × 64) and fed it to the refinement network after resizing it to the original resolution. However, as depicted in Fig. 3(e) and Table I, the simplified coarse network exhibits worse performance. For instance, as shown in Fig. 3(e), the simplified coarse network results in asymmetric eyes with defects; the generated right eye has different colors compared to the left one. These observations indicate that the simplified coarse network can produce the roughly completed image quickly, but this completed image is not suitable for the refinement network.

IV. PROPOSED METHOD

A. Architecture of PEPSI

As shown in Fig. 4, PEPSI unifies the stacked networks of the coarse-to-fine network into a single generative network with a single shared encoding network and a parallel decoding network consisting of the coarse and inpainting paths. The encoding network aims at jointly learning to extract the features from background regions as well as to complete the features of hole regions without the coarse results. As listed in Table II, the encoding network consists of a series of 3 × 3 convolutional layers, except for the first layer which uses a 5 × 5 convolutional layer. To enlarge the receptive field of the encoding network, we utilize multiple dilated convolutional layers with different dilation rates in the last four convolutional layers.

The parallel decoding network consists of coarse and inpainting paths that share their weight parameters with each other. The detailed architecture of the decoding network is described in Table III. The coarse path produces a roughly completed result from the feature maps obtained via the encoding network, whereas the inpainting path first reconstructs the encoded feature map by using the CAM and then produces a higher-quality inpainting result by decoding the reconstructed features. Since the two different paths use the same encoded feature maps as their input, this joint learning strategy encourages the encoding network to produce valuable features for two different image generation tasks. To jointly train both paths, we explicitly apply the L1 reconstruction loss to the coarse path, whereas the inpainting path is trained by using both the L1 and adversarial losses. Additional information about the joint learning scheme will be described in Section IV-E. It should be noted that we employ only the inpainting path during testing, as depicted in Fig. 1(c), which substantially reduces the computational complexity.

In terms of layer implementations in the encoding and decoding networks, PEPSI employs reflection padding for all convolutional layers and uses the exponential linear unit (ELU) [28] as the activation function, except for the last convolutional layer. In addition, a [-1, 1] normalized image with 256 × 256 pixels is employed as the input of PEPSI, and PEPSI produces the output image at the same resolution by clipping the output values into [-1, 1] instead of using the tanh function.

B. Architecture of Diet-PEPSI

Although PEPSI effectively reduces the number of convolution operations, it still requires a similar number of network parameters as the coarse-to-fine network. As mentioned in Section IV-A, PEPSI aggregates the contextual information using a series of dilated convolutional layers, which requires numerous network parameters. The intuitive way to reduce this hardware cost is to prune the channels of these layers, but this often yields inferior results in practice. To cope with this problem, we propose novel rate-adaptive dilated convolutional layers that utilize shared weights but produce dynamic feature maps depending on the given dilation rates. More specifically, to produce rate-specific features, the rate-adaptive dilated convolutional layers alter the shared weights by scaling and shifting them differently according to the given dilation rates. Since the rate-adaptive dilated convolutional layers share the weights in every layer, the number of network parameters can be significantly reduced compared with multiple standard dilated convolutional layers. In this subsection, we first introduce how the rate-adaptive dilated convolutional layers produce different feature maps. Then, we explain how the rate-adaptive dilated convolutional layers are applied to PEPSI.

Fig. 5. Rate-adaptive scaling and shifting operations. βd and γd have different values depending on the given rate. Tensor broadcasting is included in the scaling and shifting operations.

In general, the weights of a convolutional layer are considered as a four-dimensional tensor $W \in \mathbb{R}^{k \times k \times C_{in} \times C_{out}}$, where k is the kernel size, while $C_{in}$ and $C_{out}$ are the numbers of input and output channels, respectively. In other words, as shown in Fig. 5, the weights in each convolutional layer can be represented as $C_{out}$ filters with $C_{in}$ channels, i.e. $w_i, \; i = 1, \dots, C_{out}$. To produce different features according to the given dilation rates, we modulate W using learned scale $\gamma_d \in \mathbb{R}^{1 \times 1 \times C_{in} \times C_{out}}$ and bias $\beta_d \in \mathbb{R}^{1 \times 1 \times C_{in} \times C_{out}}$ parameters, where d indicates the dilation rate; $\gamma_d$ and $\beta_d$ are learned separately for each dilation rate. This modulating process can be expressed as follows:

$$W_d = \gamma_d \cdot W + \beta_d, \quad (4)$$

where $W_d \in \mathbb{R}^{k \times k \times C_{in} \times C_{out}}$ represents the rate-adaptively modified weights. Note that tensor broadcasting is included in (4). Using these scaling and shifting processes, the common weights W can be specialized to the desired dilation rate using a small number of parameters.

Fig. 6. Architecture of Diet-PEPSI. We replace the multiple dilated convolutional layers with DPUs. In the DPUs, the rate-adaptive convolution layers share their weights, whereas the 1 × 1 standard convolutional layers do not.

To demonstrate how $W_d$ can generate different feature maps depending on the given dilation rate, we analyze the computational process in the rate-adaptive dilated convolutional layer. The output y of this convolutional layer is formulated as follows:

$$y = x \otimes (\gamma_d W + \beta_d) = x \otimes \gamma_d W + x \otimes \beta_d, \quad (5)$$

where x and ⊗ indicate the input and the convolution operation, respectively. The first term $x \otimes \gamma_d W$ represents a scaling process which produces features that are scaled differently according to the given dilation rate, whereas the second term $x \otimes \beta_d$ is a projection process which derives the rate-specific features by projecting x onto $\beta_d$. In other words, even though the same features are used as the input of the rate-adaptive convolutional layer, this layer can produce different features depending on the given dilation rates.

Using the rate-adaptive convolutional layers, in this study, we propose a novel lightweight model of PEPSI called Diet-PEPSI, which significantly reduces the network parameters while preserving the inpainting performance. In Diet-PEPSI, as shown in Fig. 6, we replace the standard dilated convolutional layers of PEPSI with residual blocks, i.e. DPUs, which consist of a 3 × 3 rate-adaptive dilated convolutional layer and a 1 × 1 standard convolutional layer. By increasing the dilation rate, the DPUs can cover the same size of receptive field as PEPSI. While the standard dilated convolutional layers need $3 \times 3 \times C_{in} \times C_{out} \times n$ network parameters, the DPUs require $(9 + 3n) \times C_{in} \times C_{out}$ network parameters, where n indicates the number of DPUs or dilated convolutional layers. Thus, when n is larger than one, the DPUs require fewer parameters than the multiple dilated convolutional layers. We will empirically demonstrate the validity of the DPUs in Section V-B.
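The parameter counts quoted above can be verified with a short calculation; the sketch below counts only kernel weights and the per-rate scale and shift terms, ignoring biases.

def dilated_stack_params(c_in, c_out, n, k=3):
    """n independent k x k dilated convolutions."""
    return k * k * c_in * c_out * n

def dpu_params(c_in, c_out, n, k=3):
    """n DPUs: one shared k x k kernel (k*k = 9 terms), plus per-DPU
    gamma/beta (2 each) and an unshared 1x1 convolution (1 each)."""
    return (k * k + 3 * n) * c_in * c_out

# With n = 4 blocks of 256 channels, the DPUs use about 1.7x fewer
# parameters than the standard dilated stack.
print(dilated_stack_params(256, 256, 4))  # 2359296
print(dpu_params(256, 256, 4))            # 1376256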
C. Region Ensemble Discriminator (RED)

Traditional image inpainting networks [4] utilized both global and local discriminators to determine whether an image has been completed consistently. However, the local discriminator can only handle a hole region with a fixed-size square shape. Thus, it is difficult to employ the local discriminator to train the inpainting network for irregular holes. To solve this problem, we propose the RED, inspired by the region ensemble network [29] which detects a target object appearing anywhere in the image by individually handling multiple feature regions.

Fig. 7. Overview of the RED. In the last layer, each pixel employs a fully connected layer with different weights. It aims to classify hole regions that may appear in any region of an image, with any size.

TABLE IV
DETAILED ARCHITECTURE OF THE RED. AFTER EACH CONVOLUTION LAYER, EXCEPT THE LAST ONE, THERE IS A LEAKY-RELU ACTIVATION FUNCTION. EVERY LAYER IS NORMALIZED BY SPECTRAL NORMALIZATION. FC* INDICATES THE FULLY-CONNECTED LAYER WHICH EMPLOYS PIXEL-WISE DIFFERENT WEIGHTS FOR THE CONVOLUTION OPERATION.

Type | Kernel | Stride | Outputs
Convolution | 5×5 | 2×2 | 64
Convolution | 5×5 | 2×2 | 128
Convolution | 5×5 | 2×2 | 256
Convolution | 5×5 | 2×2 | 256
Convolution | 5×5 | 2×2 | 256
Convolution | 5×5 | 2×2 | 512
FC* | 1×1 | 1×1 | 1

As described in Fig. 7 and Table IV, six strided convolutions with a kernel size of 5 × 5 and stride 2 are stacked to capture the features of the whole image.

Then, we adopt an individual fully-connected layer on each pixel of the last convolutional layer to individually determine whether each block is real or fake. In other words, we conduct a 1 × 1 convolution operation on the last layer using pixel-wise different weights. It is worth noting that the major difference between the RED and the existing discriminator [30], called the PatchGAN-discriminator, is the last convolutional layer. The PatchGAN-discriminator uses a single regressor in the last convolutional layer, whereas the RED employs an individual regressor at each pixel. This approach allows the RED to act as global and local discriminators simultaneously. The effectiveness of the RED will be revealed in Section V-B.
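The pixel-wise regressor can be written as one einsum over the last feature map. In the sketch below, the 4 × 4 × 512 feature size follows from a 256 × 256 input passing through the six stride-2 convolutions of Table IV; the random tensors stand in for learned weights.

import numpy as np

def red_head(feat, W_px, b_px):
    """Pixel-wise fully connected output layer of the RED (a sketch).

    feat: (B, H, W, C) last convolutional feature map.
    W_px: (H, W, C) separate weight vector per spatial position and
    b_px: (H, W) per-position bias. Unlike the PatchGAN-discriminator's
    single shared 1x1 regressor, every position gets its own regressor.
    """
    # Apply each position's own weights to its own feature vector.
    return np.einsum('bhwc,hwc->bhw', feat, W_px) + b_px

feat = np.random.randn(8, 4, 4, 512)
scores = red_head(feat, np.random.randn(4, 4, 512), np.random.randn(4, 4))
# scores: (8, 4, 4) -> 16 independent real/fake scores per image.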
Fig. 8. Examples of masked images. (a) Original images. (b) Images with a square mask. (c) Images with a free-form mask.

D. Modified CAM

As mentioned in Section III-B, the conventional CAM [4] uses the cosine similarity to measure the similarity scores between foreground and background feature patches. However, in Eq. 2, since the magnitudes of the foreground and background patches, i.e. $f_{x,y}$ and $b_{x',y'}$, are ignored, this approach can result in distortion of the semantic feature representation. To alleviate this problem, we propose a modified CAM which utilizes the Euclidean distance $d_{(x,y),(x',y')}$ to measure the similarity scores without the normalization procedure. Since the Euclidean distance considers the angle between the two feature patch vectors and their magnitudes simultaneously, it is more appropriate for reconstructing the feature patch. However, since the range of the Euclidean distance is [0, ∞), it is difficult to apply it directly to the softmax. To cope with this problem, we define the truncated distance similarity score $\tilde{d}_{(x,y),(x',y')}$ as follows:

$$\tilde{d}_{(x,y),(x',y')} = \tanh\left(-\frac{d_{(x,y),(x',y')} - m(d_{(x,y),(x',y')})}{\sigma(d_{(x,y),(x',y')})}\right), \quad (6)$$

where $d_{(x,y),(x',y')} = \|f_{x,y} - b_{x',y'}\|$. Since $\tilde{d}_{(x,y),(x',y')}$ has limited values within [−1, 1], it operates like a threshold which sorts out the distance scores less than the mean value. In other words, $\tilde{d}_{(x,y),(x',y')}$ helps divide the background patches into two groups that may or may not be related to the foreground patch. By using $\tilde{d}_{(x,y),(x',y')}$, the modified CAM weighs the patches via the scaled softmax and finally reconstructs the foreground patch using the weighted sum of background patches, like the conventional CAM. The superiority of the modified CAM will be explained in Section V-B.
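A NumPy sketch of the truncated distance similarity in Eq. (6) follows. Standardizing the distances per foreground patch over all background patches is an assumption about the population used for m(·) and σ(·).

import numpy as np

def truncated_distance_weights(fg, bg, lam=10.0):
    """Modified-CAM attention weights from Eq. (6), as a sketch.

    fg: (Nf, P) foreground patches, bg: (Nb, P) background patches.
    """
    # Pairwise Euclidean distances d between fg and bg patches.
    d = np.linalg.norm(fg[:, None, :] - bg[None, :, :], axis=2)  # (Nf, Nb)
    # Eq. (6): standardize, then squash with tanh so that scores lie in
    # [-1, 1] and near/far patches are split around the mean distance.
    d_t = np.tanh(-(d - d.mean(axis=1, keepdims=True))
                  / (d.std(axis=1, keepdims=True) + 1e-8))
    # The scaled softmax then turns the truncated scores into weights.
    e = np.exp(lam * (d_t - d_t.max(axis=1, keepdims=True)))
    return e / e.sum(axis=1, keepdims=True)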
E. Loss function

To train PEPSI and Diet-PEPSI, we jointly optimize two different paths: the inpainting path and the coarse path. For the inpainting path, we employ the GAN [22] optimization framework in Eq. 1, which is described in Section III-A. To avoid the gradient vanishing problem in the generator, inspired by [31], we employ the hinge version of the adversarial loss instead of Eq. 1, which is expressed as follows:

$$L_G = -\mathbb{E}_{x \sim P_{X_i}}[D(x)], \quad (7)$$

$$L_D = \mathbb{E}_{x \sim P_Y}[\max(0, 1 - D(x))] + \mathbb{E}_{x \sim P_{X_i}}[\max(0, 1 + D(x))], \quad (8)$$

where $P_{X_i}$ and $P_Y$ denote the data distributions of the inpainting results and the input images, respectively. It is worth noting that we apply spectral normalization [32] to all layers in the RED to further stabilize the training of GANs. Since the goal of the inpainting path is not only to fill the hole regions naturally but also to recover the missing part of the original image accurately, we add a strong constraint using the L1 norm to Eq. 7 as follows:

$$L_G = \frac{\lambda_i}{N} \sum_{n=1}^{N} \|X_i^{(n)} - Y^{(n)}\|_1 - \lambda_{adv}\,\mathbb{E}_{x \sim P_{X_i}}[D(x)], \quad (9)$$

where $X_i^{(n)}$ and $Y^{(n)}$ represent the n-th image pair of the generated image through the inpainting path and its corresponding original image in a mini-batch, respectively, N is the number of image pairs in a mini-batch, and $\lambda_i$ and $\lambda_{adv}$ are hyper-parameters which control the relative importance of each loss term.

On the other hand, the coarse path is designed to accurately restore the missing features for the CAM. Therefore, we simply optimize the coarse path using an L1 loss function which is defined as follows:

$$L_C = \frac{1}{N} \sum_{n=1}^{N} \|X_c^{(n)} - Y^{(n)}\|_1, \quad (10)$$

where $X_c^{(n)}$ is the n-th generated image from the coarse path in a mini-batch. Finally, we define the total loss function of the generative network of PEPSI and Diet-PEPSI as follows:

$$L_{total} = L_G + \lambda_c \left(1 - \frac{k}{k_{max}}\right) L_C, \quad (11)$$

where $\lambda_c$ is a hyper-parameter controlling the contributions from each loss term, and k and $k_{max}$ represent the iteration of the learning procedure and the maximum number of iterations, respectively.
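A compact NumPy rendering of the objective in Eqs. (7)–(11) is given below. The per-pixel mean replaces the image-level L1 norm (a constant rescaling absorbed into λ), and the default hyper-parameters follow Section V-A; this is a sketch of the loss shape, not the training code.

import numpy as np

def hinge_d_loss(d_real, d_fake):
    """Eq. (8): hinge loss for the discriminator (RED outputs)."""
    return (np.maximum(0.0, 1.0 - d_real).mean()
            + np.maximum(0.0, 1.0 + d_fake).mean())

def generator_loss(x_inpaint, x_coarse, y, d_fake, k, k_max,
                   lam_i=10.0, lam_adv=0.1, lam_c=5.0):
    """Eqs. (9)-(11): L1 plus adversarial term, with the coarse-path
    loss decayed linearly over the course of training."""
    l_g = lam_i * np.abs(x_inpaint - y).mean() - lam_adv * d_fake.mean()
    l_c = np.abs(x_coarse - y).mean()                  # Eq. (10)
    return l_g + lam_c * (1.0 - k / k_max) * l_c       # Eq. (11)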

Fig. 9. Comparison of the proposed and conventional methods on the randomly square-masked CelebA-HQ dataset. (a) Ground truth. (b) Input image of the network. (c) Results of Context Encoder [10]. (d) Results of Globally-Locally [7]. (e) Results of gated convolution [19]. (f) Results of PEPSI. (g) Results of Diet-PEPSI.

As the training progresses, we gradually decrease the contribution of $L_C$ so that the decoding network focuses on the inpainting path. More specifically, as the training progresses, $(1 - k/k_{max})$ approaches zero, which reduces the contribution of $L_C$.

V. EXPERIMENTS

A. Implementation details

Free-Form Mask As shown in Fig. 8(b), existing image inpainting methods [4], [7], [10] usually adopt a regular mask, e.g. a hole region with rectangular shape, to indicate the background regions during the training procedure. However, networks trained with the regular mask often exhibit weak performance on inpainting holes with irregular shapes and result in visual artifacts such as color discrepancy and blurriness. To address this problem, as depicted in Fig. 8(c), Yu et al. [19] adopted a free-form mask algorithm during the training procedure, which automatically generates multiple random free-form holes with variable numbers, sizes, shapes, and locations, randomly sampled at every iteration. More specifically, this algorithm first produces the free-form mask by drawing multiple different lines and erasing the pixels closer than an arbitrary distance from these lines, as sketched below. For a fair comparison, in our experiments, we employed the same free-form mask generation algorithm for training PEPSI and Diet-PEPSI.
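The following sketch implements the line-drawing idea in pure NumPy; the stroke counts, lengths, and thicknesses are arbitrary illustrative choices rather than the exact settings of Yu et al. [19].

import numpy as np

def free_form_mask(h=256, w=256, max_strokes=8, max_thickness=12, rng=None):
    """Rough free-form hole mask: draw random line strokes and erase all
    pixels within a random distance of each stroke (1 = hole pixel)."""
    rng = np.random.default_rng(rng)
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w), dtype=np.float32)
    for _ in range(rng.integers(1, max_strokes + 1)):
        x0, y0 = rng.integers(0, w), rng.integers(0, h)
        angle = rng.uniform(0, 2 * np.pi)
        length = rng.integers(20, 100)
        x1 = np.clip(x0 + length * np.cos(angle), 0, w - 1)
        y1 = np.clip(y0 + length * np.sin(angle), 0, h - 1)
        # Distance from every pixel to the segment (x0, y0)-(x1, y1).
        dx, dy = x1 - x0, y1 - y0
        t = ((xx - x0) * dx + (yy - y0) * dy) / (dx * dx + dy * dy + 1e-8)
        t = np.clip(t, 0.0, 1.0)
        dist = np.hypot(xx - (x0 + t * dx), yy - (y0 + t * dy))
        mask[dist < rng.integers(4, max_thickness + 1)] = 1.0
    return mask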
of the proposed methods with those of the conventional
Training Procedure PEPSI and Diet-PEPSI were trained for one million iterations using a batch size of eight in an end-to-end manner. Because the parameters in PEPSI and Diet-PEPSI are differentiable, we performed optimization using the Adam optimizer [33] and set the Adam parameters β1 and β2 to 0.5 and 0.9, respectively. Motivated by [34], we applied the two-timescale update rule (TTUR), where the learning rates of the discriminator and generator were 4 × 10−4 and 1 × 10−4, respectively. In addition, we reduced the learning rate to 1/10 after 0.9 million iterations. The hyper-parameters of the proposed method were set to λi = 10, λc = 5, and λadv = 0.1. Our experiments were conducted using an Intel(R) Xeon(R) E3-1245 v5 CPU and a TITAN X (Pascal) GPU, and implemented in TensorFlow v1.8.

For our experiments, we used the CelebA-HQ [23], [24], ImageNet [26], and Place2 [25] datasets. More specifically, in the CelebA-HQ dataset, we randomly sampled 27,000 images as a training set and 3,000 as a test set. We also trained the network with all images in the ImageNet dataset and tested it on the Place2 dataset to measure the performance of trained deep learning models on other datasets; these experiments were conducted to confirm the generalization ability of the proposed method. In addition, to demonstrate the superiority of PEPSI and Diet-PEPSI, we compared their qualitative and quantitative results, operation speeds, and numbers of network parameters with those of the conventional generative methods: context encoders (CE) [10], the globally and locally completion network (GL) [7], the generator with contextual attention (GCA) [4], and the generator with gated convolution (GatedConv) [19].

B. Performance Evaluation

Qualitative Comparison To reveal the superiority of PEPSI and Diet-PEPSI, we compared the qualitative performance of the proposed methods with those of the conventional generative methods using images with the square mask and the free-form mask. In our experiments, we implemented the conventional methods by following the training procedure in each study. The resultant images with the square mask and free-form mask are shown in Figs. 9 and 10. As shown in Figs. 9 and 10, CE [10] and GL [7] show obvious visual artifacts including blurred or distorted images in the masked region.

Fig. 10. Comparison of the proposed and conventional methods on the free-form masked CelebA-HQ dataset. (a) Ground truth. (b) Input image of the network. (c) Results of Context Encoder [10]. (d) Results of Globally-Locally [7]. (e) Results of gated convolution [19]. (f) Results of PEPSI. (g) Results of Diet-PEPSI.

Fig. 11. Comparison of the proposed and conventional methods on the Place2 dataset. (a) Ground truth. (b) Input image of the network. (c) Results of the non-generative method, PatchMatch [3]. (d) Results of GatedConv [19]. (e) Results of PEPSI. (f) Results of Diet-PEPSI.

In particular, these methods show inferior performance when inpainting the free-form mask, which indicates that CE [10] and GL [7] cannot be applied to real applications. GatedConv [19] exhibits fine performance compared to CE [10] and GL [7], but it still suffers from a lack of relevance between the hole and background regions, such as the symmetry of eyes. Compared to the existing methods, PEPSI shows visually appealing results and high relevance between hole and background regions. In addition, as shown in Fig. 9(g) and Fig. 10(g), the output images produced via Diet-PEPSI are comparable to those of PEPSI while saving a significant number of network parameters. From these results, we confirmed that the proposed methods outperform the conventional methods while significantly reducing the hardware costs.

TABLE V
RESULTS OF GLOBAL AND LOCAL PSNRS, SSIM, AND OPERATION TIME WITH SQUARE AND FREE-FORM MASKS ON THE CELEBA-HQ DATASET.

Method | Square mask PSNR (Local / Global) | Square mask SSIM | Free-form mask PSNR (Local / Global) | Free-form mask SSIM | Time (ms) | Number of Network Parameters
CE [10] | 17.7 / 23.7 | 0.872 | 9.7 / 16.3 | 0.794 | 5.8 | 5.1M
GL [7] | 19.4 / 25.0 | 0.896 | 15.1 / 21.5 | 0.843 | 39.4 | 5.8M
GCA [4] | 19.0 / 24.9 | 0.898 | 12.4 / 18.9 | 0.798 | 22.5 | 2.9M
GatedConv [19] | 18.7 / 24.7 | 0.895 | 21.2 / 27.8 | 0.925 | 21.4 | 4.1M
PEPSI | 19.5 / 25.6 | 0.901 | 22.0 / 28.6 | 0.929 | 9.2 | 3.5M
PEPSI w/o coarse path | 19.2 / 25.2 | 0.894 | 21.6 / 28.2 | 0.923 | 9.2 | 3.5M
Diet-PEPSI | 19.4 / 25.5 | 0.898 | 22.0 / 28.5 | 0.928 | 10.9 | 2.5M

TABLE VI
EXPERIMENTAL RESULTS THAT FURTHER REDUCE THE NETWORK PARAMETERS USING THE GROUP CONVOLUTION TECHNIQUE.

Method | Square mask PSNR | Square mask SSIM | Free-form mask PSNR | Free-form mask SSIM | Number of parameters
PEPSI | 25.6 | 0.901 | 28.6 | 0.929 | 3.5M
Diet-PEPSI | 25.5 | 0.898 | 28.5 | 0.928 | 2.5M
Diet-PEPSI (g = 2) | 25.4 | 0.896 | 28.5 | 0.928 | 1.8M
Diet-PEPSI (g = 4) | 25.2 | 0.894 | 28.4 | 0.926 | 1.5M

TABLE VII
RESULTS OF GLOBAL AND LOCAL PSNRS AND SSIM ON THE PLACE2 DATASET.

Mask | Method | PSNR (Local) | PSNR (Global) | SSIM
Square | GatedConv [19] | 14.2 | 20.3 | 0.818
Square | PEPSI | 15.2 | 21.2 | 0.832
Square | Diet-PEPSI | 15.5 | 21.5 | 0.840
Free-form | GatedConv [19] | 17.4 | 24.0 | 0.875
Free-form | PEPSI | 18.2 | 24.8 | 0.882
Free-form | Diet-PEPSI | 18.7 | 25.2 | 0.889

Furthermore, we trained and tested PEPSI and Diet-PEPSI on the more challenging datasets, i.e. the ImageNet and Place2 datasets, to demonstrate that the proposed methods can be applied to real applications. In this comparison, we evaluated the proposed methods against GatedConv and the non-generative method PatchMatch [3], which is widely applied in image editing applications. We set the image resolution to 256 × 256. The resultant images are depicted in Fig. 11. PatchMatch shows visually poor performance, especially at the edges of images, since it cannot consider the global context of the image when inpainting the hole region. GatedConv generates more realistic results without color discrepancy or edge distortion compared to the PatchMatch technique. However, it often produces images with wrong textures, as shown in the first and third rows of Fig. 11. In contrast to the conventional methods, PEPSI and Diet-PEPSI generate the most natural images without artifacts or distortion over various contents and complex scenes. Thus, we confirmed that the proposed methods can be applied to real image inpainting applications.

Quantitative Comparison In this study, we adopted two different metrics for quantitative assessment: the peak signal-to-noise ratio (PSNR) of the local and global regions, i.e. the PSNR of the hole region and of the whole image, and the structural similarity (SSIM) [35] of the whole image; a sketch of this evaluation follows below. Table V provides comprehensive performance benchmarks between the proposed methods and the conventional ones [4], [7], [10], [19] on the CelebA-HQ dataset [23]. As shown in Table V, compared with the proposed methods, CE [10] and GCA [4] show worse performance on both the square mask and the free-form mask. GL [7] exhibits performance comparable to the proposed methods only on the square mask, since it uses an image blending technique as post-processing. However, it needs additional computation time owing to the post-processing and still suffers from blurred images, as shown in Fig. 9. Also, like CE and GCA, GL shows poor performance on the free-form mask. Because these methods, i.e. CE, GCA, and GL, were designed for inpainting the rectangular mask, they could not generalize well to the free-form mask.

GatedConv [19] shows better performance on both square and free-form holes than the other existing methods, but it needs more computation time owing to its two stacked generative networks. Compared with the conventional methods, PEPSI and Diet-PEPSI show fine performance on both square and free-form masks. In particular, compared with GatedConv, PEPSI and Diet-PEPSI not only exhibit better PSNR and SSIM performance but also require less computational time and lower hardware costs. In addition, Diet-PEPSI achieves comparable performance with PEPSI while reducing the network parameters by almost 30 percent. Consequently, these observations indicate that the proposed methods can successfully generate high-quality inpainting results with lower hardware costs compared with the conventional inpainting techniques.

Moreover, to reveal the effectiveness of the coarse path, we conducted an extra experiment in which PEPSI was trained without the coarse path. The experimental results are described in Table V. PEPSI exhibits better performance than PEPSI trained without the coarse path in terms of all quantitative metrics. These results demonstrate that the coarse path drives the encoding network to properly produce the missing features for the CAM. In other words, the single-stage network structure of PEPSI can overcome the limitation of the two-stage coarse-to-fine network through the joint learning scheme.

Fig. 12. Illustration of techniques to aggregate the global contextual information while reducing the number of parameters. (a) Dilated convolutional layer with pruned channels. (b) Residual block consisting of group convolutional layers.

TABLE VIII
EXPERIMENTAL RESULTS USING DIFFERENT LIGHTWEIGHT UNITS.

Method | Square mask PSNR | Square mask SSIM | Free-form mask PSNR | Free-form mask SSIM
Pruning | 25.21 | 0.8961 | 28.28 | 0.9270
DGC | 25.28 | 0.8959 | 28.43 | 0.9270
DPU | 25.38 | 0.8960 | 28.53 | 0.9278

On the other hand, Diet-PEPSI retains the ability of PEPSI while significantly reducing the network parameters, as shown in Table V. These results reveal that the DPU with the rate-adaptive convolutional layer can replace the standard dilated convolutional layers with a small number of network parameters. To further reduce the hardware costs of Diet-PEPSI, we conducted additional experiments applying the group convolution technique [36] to the DPU. In our experiments, we trained Diet-PEPSI by employing the group convolution technique in both layers of the DPU. Note that we utilized the channel shuffling technique between the two convolutional layers of the DPU, as sketched below. As shown in Table VI, even though Diet-PEPSI then utilizes a significantly smaller number of network parameters, it achieves competitive performance with PEPSI as well as superior performance compared to the other conventional methods. These results confirm that Diet-PEPSI can generate high-quality images with low hardware costs.

To demonstrate the generalization ability of PEPSI and Diet-PEPSI, we conducted another experiment using the challenging datasets ImageNet [26] and Place2 [25]. As mentioned in Section V-A, in our experiments, we trained the network using the ImageNet dataset and tested the trained network on the Place2 dataset. Among the various conventional methods, we selected GatedConv [19], which exhibits superior performance compared to the other conventional methods on the CelebA-HQ dataset, as our comparison. As shown in Table VII, PEPSI achieves better performance than GatedConv on the Place2 dataset. Furthermore, Diet-PEPSI exhibits superior performance compared to GatedConv and PEPSI. These results indicate that the proposed methods can consistently generate high-quality results over various contents and complex images.
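For reference, the channel shuffling used between the two grouped DPU layers (Table VI) can be sketched as the standard ShuffleNet-style [36] reshape-transpose-reshape; the NHWC layout here is an assumption.

import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet-style shuffle [36]: interleave channels across groups
    so the next grouped convolution mixes information between groups.

    x: (B, H, W, C) feature map with C divisible by `groups`.
    """
    b, h, w, c = x.shape
    x = x.reshape(b, h, w, groups, c // groups)
    x = x.transpose(0, 1, 2, 4, 3)   # swap group and per-group axes
    return x.reshape(b, h, w, c)

# With g = 2, each grouped DPU layer stores roughly half the weights,
# and the shuffle keeps the two groups from becoming independent.
y = channel_shuffle(np.random.randn(1, 64, 64, 256), groups=2)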

DPU analysis To demonstrate the ability of the DPU, we conducted additional experiments that reduce the network parameters using different techniques. Fig. 12 shows the models used in our experiments. Figs. 12(a) and (b) illustrate the convolutional layer with pruned channels and the residual block with a dilated group convolutional layer (DGC), respectively, which are intuitive approaches to decreasing the number of parameters. Note that we employed the residual block with the same architecture as the DPU for a fair comparison. Additionally, we adjusted the pruned channels and the number of groups to make the models use an almost similar number of parameters. In our experiments, we set the channels of the pruned convolutional layers to 113 and the number of groups of the DGC residual block to four. The number of groups in the DPU was set to two. As shown in Table VIII, the pruning strategy shows inferior quantitative scores in terms of PSNR on both square and free-form masks. Although the residual block with the group dilated convolutional layer shows slightly better performance compared to the pruning strategy, it is still weak. Compared with these models, the DPU shows superior performance on both square and free-form masks. Therefore, these results confirm that the DPU is suitable for effectively aggregating the global contextual information with a small number of parameters.

Fig. 13. Comparison of RED and SNM-Disc [19] on the CelebA-HQ dataset. (a) Input image. (b) Results of PEPSI trained with RED. (c) Results of PEPSI trained with SNM-Disc.

TABLE IX
EXPERIMENTAL RESULTS USING DIFFERENT DISCRIMINATORS.

Method | Square mask PSNR | Square mask SSIM | Free-form mask PSNR | Free-form mask SSIM
SNM-Disc [19] | 25.68 | 0.901 | 28.71 | 0.932
RED | 25.57 | 0.901 | 28.59 | 0.929

RED analysis We demonstrated the superiority of the RED by comparing its performance to that of the SNM-discriminator [19] (SNM-Disc), which is an extended version of the PatchGAN-discriminator for image inpainting with the free-form mask. For a fair comparison, we employed each discriminator on the same generator as PEPSI. As shown in Table IX, the SNM-Disc exhibits slightly better performance in terms of PSNR and SSIM compared to the RED. However, Fig. 13 shows that the SNM-Disc could not generate a visually plausible image despite having a high PSNR value; PEPSI trained with the SNM-Disc produced results with visual artifacts such as blurred or distorted images in the masked region. These results indicate that the SNM-Disc cannot effectively compete with the generative network, which makes the generator mainly focus on minimizing the L1 loss in the objective function of PEPSI. Therefore, even though PEPSI trained with the SNM-Disc exhibits good quantitative performance, it is difficult to apply to image inpainting in practice.

On the other hand, we investigated the reason why the RED can effectively drive the generator to produce visually pleasing inpainting results. The RED follows the inspiration of the region ensemble network [29], which classifies objects in any region of the image. Thus, in adversarial learning, the generator attempts to make every region of the image indistinguishable from real images. This procedure further improves the performance of the generator on free-form masks including irregular holes. Thus, we expect that the RED can be applied to various image inpainting networks for generating visually plausible images.

TABLE X
PERFORMANCE COMPARISON BETWEEN COSINE SIMILARITY AND EUCLIDEAN DISTANCE APPLIED TO PEPSI.

Method | Square mask PSNR | Square mask SSIM | Free-form mask PSNR | Free-form mask SSIM
Cosine similarity | 25.16 | 0.8950 | 27.95 | 0.9218
Euclidean distance | 25.57 | 0.9007 | 28.59 | 0.9293

Fig. 14. Comparisons of image reconstruction between the cosine similarity and the truncated distance similarity. (a) Original images. (b) Masked images. (c) Images reconstructed by using the cosine similarity. (d) Images reconstructed by using the truncated distance similarity.

Modified CAM analysis To demonstrate the validity of the modified CAM, we performed toy examples comparing the cosine similarity and the truncated distance similarity. We reconstructed the hole region using the weighted sum of existing image patches, where the weights, i.e. similarity scores, were computed by using the cosine similarity or the truncated Euclidean distance. Fig. 14 shows comparisons of the reconstructed images. As depicted in Figs. 14(c) and (d), images reconstructed by applying the truncated distance similarity collect more similar patches than those using the cosine similarity; these results indicate that the Euclidean distance is more suitable for calculating the similarity score than the cosine similarity. Moreover, to confirm the improvement of the modified CAM, we compared the quantitative performance of PEPSI with the conventional and modified CAMs. As shown in Table X, the modified CAM enhances the performance as compared to the conventional CAM, implying that the modified CAM is more appropriate for learning the relationship between background and hole regions.

VI. CONCLUSION

In this study, we have introduced a novel image inpainting model called PEPSI which overcomes the limitation of the two-stage coarse-to-fine network via a joint learning scheme. We provided qualitative and quantitative comparisons on the CelebA-HQ and Place2 datasets. Experimental results revealed that PEPSI not only achieves superior performance compared with conventional techniques, but also significantly reduces the computational time via a parallel decoding path and an effective joint learning scheme. Furthermore, we have introduced Diet-PEPSI, which utilizes novel rate-adaptive convolutional layers to aggregate the global contextual information with low hardware costs. Experimental results show that Diet-PEPSI preserves the performance of PEPSI while significantly reducing the hardware costs, which facilitates hardware implementation. Both networks are trained with the proposed RED and show visually plausible results for square holes as well as holes with irregular shapes. Therefore, it is expected that the proposed methods can be widely employed in various applications including image generation, style transfer, and image editing.

REFERENCES

[1] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co., 2000, pp. 417–424.
[2] A. A. Efros and W. T. Freeman, "Image quilting for texture synthesis and transfer," in Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. ACM, 2001, pp. 341–346.
[3] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman, "PatchMatch: A randomized correspondence algorithm for structural image editing," ACM Transactions on Graphics (TOG), vol. 28, no. 3, p. 24, 2009.
[4] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Generative image inpainting with contextual attention," arXiv preprint, 2018.
[5] H. Noori, S. Saryazdi, and H. Nezamabadi-Pour, "A convolution based image inpainting," in 1st International Conference on Communication and Engineering, 2010.
[6] H. Li, G. Li, L. Lin, and Y. Yu, "Context-aware semantic inpainting," arXiv preprint arXiv:1712.07778, 2017.
[7] S. Iizuka, E. Simo-Serra, and H. Ishikawa, "Globally and locally consistent image completion," ACM Transactions on Graphics (TOG), vol. 36, no. 4, p. 107, 2017.
[8] R. Köhler, C. Schuler, B. Schölkopf, and S. Harmeling, "Mask-specific inpainting with deep neural networks," in German Conference on Pattern Recognition. Springer, 2014, pp. 523–534.
[9] C. Li and M. Wand, "Combining Markov random fields and convolutional neural networks for image synthesis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2479–2486.
[10] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
[11] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, "High-resolution image inpainting using multi-scale neural patch synthesis," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, 2017, p. 3.
[12] A. Fawzi, H. Samulowitz, D. Turaga, and P. Frossard, "Image inpainting through neural networks hallucinations," in Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), 2016 IEEE 12th. IEEE, 2016, pp. 1–5.
[13] N. Cai, Z. Su, Z. Lin, H. Wang, Z. Yang, and B. W.-K. Ling, "Blind inpainting using the fully convolutional neural network," The Visual Computer, vol. 33, no. 2, pp. 249–261, 2017.
[14] R. A. Yeh, C. Chen, T.-Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do, "Semantic image inpainting with deep generative models," in CVPR, vol. 2, no. 3, 2017, p. 4.
[15] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro, "Image inpainting for irregular holes using partial convolutions," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 85–100.
[16] Y. Song, C. Yang, Z. Lin, X. Liu, Q. Huang, H. Li, and C.-C. Jay Kuo, "Contextual-based image inpainting: Infer, match, and translate," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
[17] Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia, "Image inpainting via generative multi-column convolutional neural networks," in Advances in Neural Information Processing Systems, 2018, pp. 331–340.
[18] L. Xu, J. S. Ren, C. Liu, and J. Jia, "Deep convolutional neural network for image deconvolution," in Advances in Neural Information Processing Systems, 2014, pp. 1790–1798.
[19] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Free-form image inpainting with gated convolution," arXiv preprint arXiv:1806.03589, 2018.
[20] M.-C. Sagong, Y.-G. Shin, S.-W. Kim, S. Park, and S.-J. Ko, "PEPSI: Fast image inpainting with parallel decoding network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, to be published.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[23] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," arXiv preprint arXiv:1710.10196, 2017.
[24] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.
[25] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, 2018.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[27] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani, "Summarizing visual data using bidirectional similarity," in Computer Vision and Pattern Recognition (CVPR 2008). IEEE, 2008, pp. 1–8.
[28] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint arXiv:1511.07289, 2015.
[29] H. Guo, G. Wang, X. Chen, C. Zhang, F. Qiao, and H. Yang, "Region ensemble network: Improving convolutional network for hand pose estimation," in Image Processing (ICIP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4512–4516.
[30] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," arXiv preprint, 2017.
[31] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, "Self-attention generative adversarial networks," arXiv preprint arXiv:1805.08318, 2018.
[32] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," arXiv preprint arXiv:1802.05957, 2018.
[33] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[34] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
[35] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[36] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.

Yong-Goo Shin received the B.S. and Ph.D. degrees in Electronics Engineering from Korea University, Seoul, Rep. of Korea, in 2014 and 2020, respectively. He is currently a research professor in the Department of Electrical Engineering of Korea University. His research interests are in the areas of digital image processing, computer vision, and artificial intelligence.

Min-Cheol Sagong received his B.S. degree in Electrical Engineering from Korea University in 2018. He is currently pursuing his M.S. degree in Electrical Engineering at Korea University. His research interests are in the areas of digital signal processing, computer vision, and artificial intelligence.

Yoon-Jae Yeo received his B.S. degree in Electrical Engineering from Korea University in 2017. He is currently pursuing his Ph.D. degree in Electrical Engineering at Korea University. His research interests are in the areas of image processing, computer vision, and deep learning.

Seung-Wook Kim received the B.S. and Ph.D. degrees in Electronics Engineering from Korea University, Seoul, Rep. of Korea, in 2012 and 2019, respectively. He is currently a research professor in the Department of Electrical Engineering of Korea University. His research interests are in the areas of image processing and computer vision based on deep learning.

Sung-Jea Ko (M'88-SM'97-F'12) received his Ph.D. degree in 1988 and his M.S. degree in 1986, both in Electrical and Computer Engineering, from the State University of New York at Buffalo, and his B.S. degree in Electronic Engineering from Korea University in 1980. In 1992, he joined the Department of Electronic Engineering at Korea University, where he is currently a Professor. From 1988 to 1992, he was an Assistant Professor in the Department of Electrical and Computer Engineering at the University of Michigan-Dearborn. He has published over 210 international journal articles. He also holds over 60 registered patents in fields such as video signal processing and computer vision.

Prof. Ko received the best paper award from the IEEE Asia Pacific Conference on Circuits and Systems (1996), the LG Research Award (1999), and both the technical achievement award (2012) and the Chester Sall award from the IEEE Consumer Electronics Society (2017). He was the President of the IEIE in 2013 and the Vice-President of the IEEE CE Society from 2013 to 2016. He is a member of the National Academy of Engineering of Korea. He is a member of the editorial board of the IEEE Transactions on Consumer Electronics.
