PEPSI++: Fast and Lightweight Network For Image Inpainting
Abstract—Among the various generative adversarial network (GAN)-based image inpainting methods, the coarse-to-fine network with a contextual attention module (CAM) has shown remarkable performance. However, owing to its two stacked generative networks, the coarse-to-fine network needs numerous computational resources such as convolution operations and network parameters, which result in low speed. To address this problem, we propose a novel network architecture called PEPSI: parallel extended-decoder path for semantic inpainting network, which aims at reducing the hardware costs and improving the inpainting performance. PEPSI consists of a single shared encoding network and parallel decoding networks called coarse and inpainting paths. The coarse path produces a preliminary inpainting result to train the encoding network for the prediction of features for the CAM. Simultaneously, the inpainting path generates a higher-quality inpainting result using the refined features reconstructed via the CAM. In addition, we propose Diet-PEPSI, which significantly reduces the network parameters while maintaining the performance. In Diet-PEPSI, to capture the global contextual information with low hardware costs, we propose novel rate-adaptive dilated convolutional layers, which employ common weights but produce dynamic features depending on the given dilation rates. Extensive experiments comparing the performance with state-of-the-art image inpainting methods demonstrate that both PEPSI and Diet-PEPSI improve the qualitative scores, i.e. the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), as well as significantly reduce hardware costs such as computational time and the number of network parameters.

Index Terms—Deep learning, generative adversarial network, image inpainting

I. INTRODUCTION

The coarse-to-fine network with a contextual attention module (CAM) has shown remarkable performance [4], [19]. This network is composed of two stacked generative networks: the coarse network and the refinement network. The coarse network roughly fills the hole regions using a simple dilated convolutional network trained with a reconstruction loss. The refinement network improves the quality of the roughly completed image by using the CAM, which generates feature patches for the hole regions by borrowing information from distant spatial locations. Despite the promising results, the coarse-to-fine network requires high computational resources and consumes considerable memory.

In previous work [20], we introduced a novel network structure called PEPSI: parallel extended-decoder path for semantic inpainting, which aims at reducing the number of convolution operations as well as improving the inpainting performance. PEPSI is composed of a single encoding network and parallel decoding networks consisting of coarse and inpainting paths. The coarse path generates a preliminary inpainting result to train the encoding network for the prediction of features for the CAM. At the same time, the inpainting path produces a high-quality image using the refined features reconstructed via the CAM. To make a single encoding network handle two different tasks, i.e. feature extraction for both the roughly completed and the high-quality results, we propose a joint learning technique that jointly optimizes the two paths. This learning scheme facilitates the generation of high-quality inpainting images without the stacked generative networks, i.e. the coarse-to-fine network.
Fig. 1. Overview of the network architectures of the conventional and proposed methods, where D and G indicate the discriminator and generator, respectively.
(a) Architecture of traditional encoder-decoder network [7]. (b) Architecture of coarse-to-fine network [4], [19]. (c) Architecture of PEPSI.
... compared with multiple standard dilated convolutional layers. In this paper, we apply the proposed rate-adaptive dilated convolutional layers to Diet-PEPSI using residual blocks [21], called Diet-PEPSI units (DPUs). By replacing the multiple dilated convolutional layers with DPUs, Diet-PEPSI covers the same size of receptive field with a smaller number of parameters than PEPSI.
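The excerpt above does not specify how the rate-adaptive layers modulate the shared weights, so the following PyTorch sketch only illustrates the structural idea described here: one 3 × 3 weight tensor shared across all dilation rates inside residual blocks, with a per-rate 1 × 1 convolution that is not shared (cf. the caption of Fig. 6). The module name, channel width, and set of rates are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DietPEPSIUnit(nn.Module):
    """Sketch of a DPU: a residual block whose 3x3 dilated convolution reuses
    one shared weight tensor for every dilation rate, while the 1x1
    convolution is private to each rate."""

    def __init__(self, channels, shared_weight, shared_bias, rate):
        super().__init__()
        self.rate = rate
        # Shared across all DPUs / dilation rates (created once, passed in).
        self.shared_weight = shared_weight
        self.shared_bias = shared_bias
        # Not shared: each DPU keeps its own 1x1 convolution.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        y = F.conv2d(x, self.shared_weight, self.shared_bias,
                     padding=self.rate, dilation=self.rate)
        y = F.elu(y)
        y = self.pointwise(y)
        return x + y  # residual connection

channels = 64
shared_w = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.02)
shared_b = nn.Parameter(torch.zeros(channels))
dpus = nn.ModuleList(
    [DietPEPSIUnit(channels, shared_w, shared_b, rate=r) for r in (1, 2, 4, 8)]
)

x = torch.randn(1, channels, 64, 64)
for dpu in dpus:
    x = dpu(x)  # receptive field grows as with stacked dilated convolutions
```

Because the 3 × 3 weights are created once and reused at every rate, the parameter count grows only with the unshared 1 × 1 layers, which is the kind of saving the DPUs aim for.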
Furthermore, we investigate an obstacle with the discriminator in traditional GAN-based image inpainting methods [14], [22]. In general, conventional methods employ global and local discriminators trained with a combined loss, the L2 pixel-wise reconstruction loss and adversarial loss, which assists the networks in generating a more natural image by minimizing the difference between the reference and the inpainted images. More specifically, the global discriminator takes the whole image as input to recognize global consistency, whereas the local one only views a small region around the hole in order to judge the quality of more detailed appearance. However, the local discriminator has a drawback in that it can only deal with a single rectangular hole region. In other words, since holes can appear with arbitrary shapes, sizes, and locations in real-world applications, the local discriminator is difficult to apply to an inpainting network that must handle holes with irregular shapes. To solve this problem, we propose a region ensemble discriminator (RED) which integrates the global and local discriminators. Since each pixel in the last layer has a different receptive field in the image domain, the RED adopts individual fully connected layers on each pixel of the last convolutional layer. By individually computing an adversarial loss at each pixel, the RED can deal with holes of arbitrary shapes.
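A minimal PyTorch sketch of the per-pixel idea behind the RED follows: every spatial position of the last convolutional feature map gets its own fully connected layer, and the adversarial loss is computed position by position. The backbone and the hinge loss below are illustrative assumptions; the actual RED architecture is specified in Table IV.

```python
import torch
import torch.nn as nn

class RegionEnsembleHead(nn.Module):
    """Per-pixel fully connected layers: one weight vector and bias per
    spatial position of the final feature map (weights are not shared)."""

    def __init__(self, channels, height, width):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(height, width, channels) * 0.02)
        self.bias = nn.Parameter(torch.zeros(height, width))

    def forward(self, feat):                                # feat: (N, C, H, W)
        logits = torch.einsum('nchw,hwc->nhw', feat, self.weight) + self.bias
        return logits                                       # one score per pixel

# Illustrative backbone producing an 8x8 feature map from a 256x256 image.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
    nn.Conv2d(256, 256, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
    nn.Conv2d(256, 256, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
)
head = RegionEnsembleHead(channels=256, height=8, width=8)

real, fake = torch.randn(2, 3, 256, 256), torch.randn(2, 3, 256, 256)
d_real, d_fake = head(backbone(real)), head(backbone(fake))
# Hinge-style discriminator loss averaged over all per-pixel scores (assumed).
d_loss = torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()
```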
In summary, this paper has three major contributions. (i) We propose a novel network architecture called PEPSI that achieves superior performance compared to conventional methods as well as significantly reduces the operation time. (ii) We propose Diet-PEPSI, which applies novel rate-adaptive convolution layers to further reduce the hardware costs while maintaining the overall quality of the results, making the proposed method more compatible with the hardware. (iii) A novel discriminator, called RED, is proposed to handle both squared and irregular hole regions for real applications.

In the remainder of this paper, we introduce the related work and preliminaries in Section II and Section III, respectively. PEPSI and Diet-PEPSI are discussed in Section IV. In Section V, extensive experimental results are presented to demonstrate that the proposed method outperforms conventional methods on various datasets such as CelebA [23], [24], Place2 [25], and ImageNet [26]. Finally, the conclusion is provided in Section VI.

II. RELATED WORK

Existing image inpainting techniques can be divided into two groups [4]: traditional and deep learning-based methods. The traditional techniques include diffusion-based and patch-based methods. The diffusion-based method fills the hole regions by propagating the local image appearance around the holes [1], [2], [4], [5]. It performs well on small and narrow holes, but often fails to fill complex hole regions such as faces and objects with non-repetitive structures. In contrast, the patch-based technique results in better performance when filling complicated images with large hole regions [4], [27]. This method samples texture patches from the existing regions of the image, i.e. background regions, and pastes them into the hole region. Barnes et al. [3] introduced a fast approximate nearest neighbor patch search algorithm, called PatchMatch, which exhibited notable
performance for image editing applications such as image inpainting. However, PatchMatch often fills the hole regions regardless of the visual semantics or the global structure of an image, which results in images with poor visual quality.

Fig. 3. Toy examples for the coarse network. (a) Masked input image. (b) Original image. (c) Result from the coarse-to-fine network. (d) Result without the coarse result. (e) Result with LR coarse path.

TABLE I
EXPERIMENTAL RESULTS WITH GATEDCONV (GC) [19] USING DIFFERENT COARSE PATHS. GC* INDICATES A MODEL TRAINED WITHOUT COARSE RESULTS AND GC† INDICATES A MODEL TRAINED WITH A SIMPLIFIED COARSE PATH.

Method   Square mask         Free-form mask      Time
         PSNR     SSIM       PSNR     SSIM
GC       24.67    0.8949     27.78    0.9252     21.39 ms
GC*      23.50    0.8822     26.35    0.9098     14.28 ms
GC†      23.71    0.8752     26.22    0.9026     13.32 ms
By using the convolutional neural network (CNN), the deep learning-based method learns how to extract semantic information for producing the structures of the hole regions [8], [9], [18]. The CNN-based image inpainting methods employing an encoder-decoder structure have shown superior performance on inpainting complex hole regions compared with the diffusion- or patch-based methods [8], [18]. However, these methods often generate images with visual artifacts such as boundary artifacts and blurry textures inconsistent with the surrounding areas. To alleviate this problem, Pathak et al. [10] adopted the GAN [22] to enhance the coherence between the background and hole regions. They trained the entire network using a combined loss, the L2 pixel-wise reconstruction loss and adversarial loss, which drives the networks to minimize the difference between the reference and inpainted images as well as to produce plausible new contents in highly structured images such as faces and scenes. However, this method has a limitation in that it can only fill a square hole located at the center of an image.
To inpaint images with a square hole at arbitrary locations, as shown in Fig. 1(a), Iizuka et al. [7] proposed an improved network structure which employs two sibling discriminators: global and local discriminators. More specifically, the local discriminator only considers the inpainted region to classify the local texture consistency, whereas the global discriminator inspects whether the resultant image is consistent across the whole image. Recently, Yu et al. [4] have extended this work by using the coarse-to-fine network and the CAM. In particular, by computing the cosine similarity between the background and foreground feature patches, the CAM learns where to borrow the background features for the hole region. In order to collect the background features involved with the missing region, the CAM requires the features at the missing region to be encoded from roughly completed images. Thus, as shown in Fig. 1(b), this method employs two stacked generative networks (coarse and refinement networks) to generate an intermediate result, i.e. the coarse result, and an inpainting result refined through the refinement network having the CAM. This method achieved remarkable performance compared with the recent state-of-the-art inpainting methods; however, it requires considerable computational resources owing to the two stacked generative networks.
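As a rough illustration of the CAM mechanism described above, the following PyTorch sketch rebuilds hole features as a cosine-similarity-weighted sum of background features. It uses 1 × 1 feature patches and a fixed softmax temperature for brevity; the CAM of [4] operates on larger patches and includes additional processing, so this is only the core idea.

```python
import torch
import torch.nn.functional as F

def contextual_attention(feat, mask):
    """feat: (N, C, H, W) feature map; mask: (N, 1, H, W) with 1 inside the hole.
    Hole features are replaced by a softmax-weighted sum of background features,
    where the weights come from cosine similarity."""
    n, c, h, w = feat.shape
    flat = feat.flatten(2)                              # (N, C, P) with P = H*W
    normed = F.normalize(flat, dim=1)                   # unit-length feature vectors
    sim = torch.einsum('ncp,ncq->npq', normed, normed)  # cosine similarity (N, P, Q)
    background = mask.flatten(2) < 0.5                  # (N, 1, Q), True = background
    sim = sim.masked_fill(~background, float('-inf'))   # only borrow from background
    attn = torch.softmax(10.0 * sim, dim=2)             # normalize over source positions
    borrowed = torch.einsum('npq,ncq->ncp', attn, flat).view(n, c, h, w)
    return feat * (1 - mask) + borrowed * mask          # replace only hole features

feat = torch.randn(1, 32, 16, 16)
mask = torch.zeros(1, 1, 16, 16)
mask[:, :, 4:12, 4:12] = 1.0                            # square hole in the middle
refined = contextual_attention(feat, mask)
```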
III. PRELIMINARIES

A. Generative adversarial networks

The GAN was first introduced by Goodfellow et al. [22] for image generation. In general, a GAN consists of a generator G and a discriminator D which are trained with competing goals. The generator is trained to produce new images indistinguishable from real images, while the discriminator is optimized to differentiate between real and generated images. Formally, G (D) tries to minimize (maximize) the loss function, i.e. adversarial loss, as follows:

$$\min_{G}\max_{D}\; \mathbb{E}_{x\sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z\sim P_{z}(z)}[\log(1 - D(G(z)))], \qquad (1)$$

where z and x denote a random noise vector and a real image sampled from the noise distribution P_z(z) and the real data distribution P_data(x), respectively. Recently, the GAN has been applied to several semantic inpainting techniques [4], [7], [10] to fill the holes naturally.
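For reference, the value function in Eq. (1) maps directly to code. The toy fully connected G and D below are placeholders; in this paper, G is an inpainting generator and D an image discriminator.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator standing in for the inpainting networks.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(16, 100)           # z ~ P_z(z)
x = torch.rand(16, 784) * 2 - 1    # stand-in for x ~ P_data(x)

# Eq. (1): D maximizes E[log D(x)] + E[log(1 - D(G(z)))], so its loss is the
# negative of that value; G minimizes E[log(1 - D(G(z)))].
eps = 1e-8
d_loss = -(torch.log(D(x) + eps).mean() + torch.log(1 - D(G(z)) + eps).mean())
g_loss = torch.log(1 - D(G(z)) + eps).mean()
```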
Fig. 4. The architecture of PEPSI. The coarse path and the inpainting path share their weights to improve each other. The coarse path is trained only with the L1 reconstruction loss, while the inpainting path is trained with both the L1 and adversarial losses.
Fig. 6. Architecture of Diet-PEPSI. We replace the multiple dilated convolutional layers with DPUs. In the DPUs, rate-adaptive convolution layers share their
weights whereas the 1 × 1 standard convolutional layers do not share their weights.
TABLE IV
DETAILED ARCHITECTURE OF RED. AFTER EACH CONVOLUTION LAYER, EXCEPT THE LAST ONE, THERE IS A LEAKY-RELU AS THE ACTIVATION FUNCTION. EVERY LAYER IS NORMALIZED BY SPECTRAL NORMALIZATION. FC* INDICATES THE FULLY-CONNECTED LAYER WHICH EMPLOYS PIXEL-WISE DIFFERENT WEIGHTS FOR THE CONVOLUTION OPERATION.
Fig. 9. Comparison of the proposed and conventional methods on the randomly square-masked CelebA-HQ dataset. (a) Ground truth. (b) Input image of the network. (c) Results of Context Encoder [10]. (d) Results of Globally-Locally [7]. (e) Results of gated convolution [19]. (f) Results of PEPSI. (g) Results of Diet-PEPSI.
where k and k_max denote the current iteration of the learning procedure and the maximum number of iterations, respectively. As the training progresses, we gradually decrease the contribution of L_C so that the decoding network focuses on the inpainting path. More specifically, as the training progresses, (1 − k/k_max) becomes zero, which results in reducing the contribution of L_C.
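A one-line sketch of the decay described above, assuming the coarse term is weighted by λc · (1 − k/k_max); the excerpt only states the (1 − k/k_max) factor, so scaling it by λc (reported as 5 in Section V) is an assumption.

```python
def coarse_loss_weight(k, k_max, lambda_c=5.0):
    # Weight on the coarse-path loss L_C at iteration k; it decays linearly
    # to zero as training approaches k_max iterations.
    return lambda_c * (1.0 - k / k_max)

# Example: with k_max = 1e6, the weight falls from 5.0 at k = 0 to 0.5 at k = 900,000.
print(coarse_loss_weight(0, 1_000_000), coarse_loss_weight(900_000, 1_000_000))
```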
V. EXPERIMENTS

A. Implementation details

Free-Form Mask: As shown in Fig. 8(b), existing image inpainting methods [4], [7], [10] usually adopt the regular mask, e.g. a hole region with a rectangular shape, which indicates the background regions during the training procedure. However, the networks trained with the regular mask often exhibit weak performance on inpainting holes with irregular shapes and result in visual artifacts such as color discrepancy and blurriness. To address this problem, as depicted in Fig. 8(c), Yu et al. [19] adopted a free-form mask algorithm during the training procedure, which automatically generates multiple random free-form holes with variable numbers, sizes, shapes, and locations, randomly sampled at every iteration. More specifically, this algorithm first produces the free-form mask by drawing multiple different lines and erasing the pixels closer than an arbitrary distance from these lines. For a fair comparison, in our experiments, we employed the same free-form mask generation algorithm for training PEPSI and Diet-PEPSI.
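A simple sketch of this mask-generation idea, using OpenCV to rasterize random thick strokes (every pixel within the stroke thickness of a line becomes part of the hole). The stroke count, thickness range, and resolution are illustrative; the algorithm of [19] differs in its exact sampling details.

```python
import numpy as np
import cv2  # only used to rasterize the random strokes

def random_free_form_mask(height=256, width=256, max_strokes=8,
                          max_thickness=20, rng=None):
    """Return a (height, width) uint8 mask with 1 = hole, 0 = background."""
    rng = rng if rng is not None else np.random.default_rng()
    mask = np.zeros((height, width), dtype=np.uint8)
    for _ in range(int(rng.integers(1, max_strokes + 1))):
        x1, x2 = rng.integers(0, width, size=2)
        y1, y2 = rng.integers(0, height, size=2)
        thickness = int(rng.integers(5, max_thickness + 1))
        # A thick line marks all pixels within `thickness` of the segment.
        cv2.line(mask, (int(x1), int(y1)), (int(x2), int(y2)), 1, thickness)
    return mask

mask = random_free_form_mask()  # a new random hole pattern at every call
print(mask.shape, int(mask.sum()), 'hole pixels')
```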
Training Procedure: PEPSI and Diet-PEPSI were trained for one million iterations using a batch size of eight in an end-to-end manner. Because the parameters in PEPSI and Diet-PEPSI are differentiable, we performed the optimization using the Adam optimizer [33] and set the Adam parameters β1 and β2 to 0.5 and 0.9, respectively. Motivated by [34], we applied the two-timescale update rule (TTUR), where the learning rates of the discriminator and generator were 4 × 10^-4 and 1 × 10^-4, respectively. In addition, we reduced the learning rate to 1/10 after 0.9 million iterations. The hyper-parameters of the proposed method were set to λ_i = 10, λ_c = 5, and λ_adv = 0.1. Our experiments were conducted using an Intel(R) Xeon(R) E3-1245 v5 CPU and a TITAN X (Pascal) GPU, and implemented in TensorFlow v1.8.
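The models were implemented in TensorFlow v1.8; the following PyTorch snippet is only a sketch of the reported optimizer settings (TTUR with Adam, β1 = 0.5, β2 = 0.9, discriminator/generator learning rates of 4 × 10^-4 / 1 × 10^-4, and a tenfold decay after 0.9 million iterations).

```python
import torch

# Placeholder modules standing in for the generator and discriminator.
generator = torch.nn.Linear(10, 10)
discriminator = torch.nn.Linear(10, 1)

# Two-timescale update rule: separate Adam optimizers with different learning rates.
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.9))

def lr_scale(step, drop_at=900_000):
    # Learning rate is multiplied by 0.1 after 0.9 million iterations.
    return 0.1 if step >= drop_at else 1.0

sched_g = torch.optim.lr_scheduler.LambdaLR(opt_g, lr_lambda=lr_scale)
sched_d = torch.optim.lr_scheduler.LambdaLR(opt_d, lr_lambda=lr_scale)
# In the training loop (per iteration): opt_d.step(); opt_g.step();
# sched_d.step(); sched_g.step()
```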
For our experiments, we used the CelebA-HQ [23], [24], ImageNet [26], and Place2 [25] datasets. More specifically, from the CelebA-HQ dataset, we randomly sampled 27,000 images as a training set and 3,000 images as a test set. We also trained the network with all images in the ImageNet dataset and tested it on the Place2 dataset to measure the performance of trained deep learning models on other datasets; these experiments were conducted to confirm the generalization ability of the proposed method. In addition, to demonstrate the superiority of PEPSI and Diet-PEPSI, we compared their qualitative and quantitative results, operation speeds, and numbers of network parameters with those of the conventional generative methods: context encoder (CE) [10], globally and locally consistent image completion (GL) [7], generator with contextual attention (GCA) [4], and generator with gated convolution (GatedConv) [19].

B. Performance Evaluation

Qualitative Comparison: To reveal the superiority of PEPSI and Diet-PEPSI, we compared the qualitative performance of the proposed methods with that of the conventional generative methods using images with the squared mask and the free-form mask. In our experiments, we implemented the conventional methods by following the training procedure in each study.
Fig. 10. Comparison of the proposed and conventional methods on the free-form-masked CelebA-HQ dataset. (a) Ground truth. (b) Input image of the network. (c) Results of Context Encoder [10]. (d) Results of Globally-Locally [7]. (e) Results of gated convolution [19]. (f) Results of PEPSI. (g) Results of Diet-PEPSI.
Fig. 11. Comparison of the proposed and conventional methods on the Place2 dataset. (a) Ground truth. (b) Input image of the network. (c) Results of the non-generative method, PatchMatch [3]. (d) Results of GatedConv [19]. (e) Results of PEPSI. (f) Results of Diet-PEPSI.
The resultant images with the squared mask and the free-form mask are shown in Figs. 9 and 10, respectively. As shown in Figs. 9 and 10, CE [10] and GL [7] exhibit obvious visual artifacts, including blurred or distorted content in the masked region. In particular, these methods show inferior performance when inpainting the free-form mask, which indicates that CE [10] and GL [7] cannot be applied to real applications. GatedConv [19] exhibits fine performance compared to CE [10] and GL [7], but it still suffers from a lack of relevance between the hole and background regions, such as asymmetry of the eyes. Compared to the existing methods, PEPSI shows visually appealing results and high relevance between the hole and background regions. In addition, as shown in Fig. 9(g) and Fig. 10(g), the output images produced via Diet-PEPSI are comparable to those of PEPSI while saving a significant number of network parameters. From these results, we confirmed that the proposed methods outperform the conventional methods.
TABLE V
RESULTS OF GLOBAL AND LOCAL PSNRS, SSIM, AND OPERATION TIME WITH SQUARE AND FREE-FORM MASKS ON THE CELEBA-HQ DATASET.

Mask        Method            PSNR (Local)   PSNR (Global)   SSIM
Square      GatedConv [19]    14.2           20.3            0.818
            PEPSI             15.2           21.2            0.832
            Diet-PEPSI        15.5           21.5            0.840
Free-form   GatedConv [19]    17.4           24.0            0.875
            PEPSI             18.2           24.8            0.882
            Diet-PEPSI        18.7           25.2            0.889

Method               Square mask           Free-form mask        Number of
                     PSNR      SSIM        PSNR      SSIM        parameters
PEPSI                25.6      0.901       28.6      0.929       3.5M
Diet-PEPSI           25.5      0.898       28.5      0.928       2.5M
Diet-PEPSI (g = 2)   25.4      0.896       28.5      0.928       1.8M
Diet-PEPSI (g = 4)   25.2      0.894       28.4      0.926       1.5M
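This excerpt does not define the Local and Global PSNR columns; a common reading, assumed in the sketch below, is that the local score is computed only over the hole pixels while the global score uses the whole image.

```python
import numpy as np

def psnr(a, b, peak=255.0):
    # Peak signal-to-noise ratio between two arrays with the given peak value.
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def local_and_global_psnr(pred, target, mask):
    """pred, target: (H, W, 3) images; mask: (H, W) with 1 inside the hole.
    Assumed definitions: local = hole pixels only, global = whole image."""
    hole = mask.astype(bool)
    return psnr(pred[hole], target[hole]), psnr(pred, target)
```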
TABLE VIII
EXPERIMENTAL RESULTS USING DIFFERENT LIGHTWEIGHT UNITS.

TABLE X
PERFORMANCE COMPARISON BETWEEN COSINE SIMILARITY AND EUCLIDEAN DISTANCE APPLIED TO PEPSI.