Song-Hai Zhang^a, Xin Dong^a, Jia Li^b, Ruilong Li^a, Yong-Liang Yang^c
^a Tsinghua University, Beijing, China
^b Cisco Systems Networking Technology Co., Ltd., Hangzhou Branch Office, China
^c University of Bath, Claverton Down, Bath, UK
ABSTRACT
As learning precise segmentation boundaries is difficult for convolutional neural networks, we design an auxiliary boundary loss to help the network generate better portrait boundaries. Meanwhile, we take into account the complex illumination conditions in portrait images and utilize a consistency constraint loss to improve robustness. With the two auxiliary losses, we achieve an accuracy of 96.62% on the EG1800 dataset and 93.43% on the Supervise-Portrait dataset at 30 FPS on an iPhone 7 with an input image size of 224 × 224.

Fig. 1. Portrait segmentation applications on mobile devices. (a) Original image. (b) The corresponding segmentation. (c-d) Two important applications based on portrait image segmentation.

2. Related Work

PortraitNet is related to research in semantic segmentation and lightweight convolutional neural networks. This section reviews typical semantic segmentation methods based on deep convolutional networks, as well as classical lightweight architectures.

Semantic image segmentation is a fundamental research topic in computer vision. Many applications require highly accurate and efficient segmentation results as a basis for analyzing and understanding images. With the recent advances of deep learning, semantic segmentation methods based on deep convolutional neural networks [10, 11, 12, 9] have made great achievements, especially in improving segmentation precision. The fully convolutional network (FCN) [1] was the first essential work to propose an end-to-end network for pixel-wise segmentation; it also defined a skip architecture to produce accurate masks. SegNet [2] introduced a classical encoder-decoder architecture for segmentation, and UNet [3] follows a similar design. The main difference is that SegNet [2] transfers pooling indices from the encoder to the decoder to produce sparse feature maps, while UNet [3] transfers high-resolution features from the encoder to the up-sampled features in the decoder. The DeepLab series [4, 5, 6] presents some of the most accurate semantic segmentation methods to date. DeepLabv1 [4] used dilated convolution to maintain the size of the feature maps and CRFs to refine the segmentation results. DeepLabv2 [5] proposed the atrous spatial pyramid pooling (ASPP) module for further improvement. DeepLabv3 [6] removed the CRF module and modified the ASPP module to improve accuracy. Although these semantic segmentation methods achieve high precision, their efficiency is relatively low.

Compared with such large, complex models, some segmentation works pay more attention to efficiency. ENet [13] proposed a new deep and narrow network architecture; its speed is much higher, but the accuracy drop is obvious. ICNet [14] incorporated multi-resolution branches to improve model accuracy, but the model is still too large to run on mobile devices. BiSeNet [15] is the state-of-the-art real-time segmentation method on the Cityscapes dataset [16]. However, this method is not suitable for small input images because of its crude up-sampling modules.

Automatic portrait segmentation, as a specialized form of semantic segmentation, is important in the mobile computing era. [7] collected the first human portrait dataset, named EG1800, and designed a segmentation network to distinguish the portrait from the background. [8] designed a boundary-sensitive network that improves accuracy using soft boundary labels. [17] proposed a Border Network to improve segmentation accuracy. However, these existing works focus on accuracy rather than computational efficiency. With the growing demand for mobile applications, a number of works aiming at efficient models for mobile devices have been proposed [18, 19, 20, 21]. Depthwise separable convolutional layers are widely used in such lightweight networks. PortraitNet employs MobileNet-v2 [19] as the backbone to extract features in the encoder module, and uses depthwise separable convolutions to substitute traditional convolutions in the decoder module, yielding a lightweight network.

Portrait images usually exhibit complex illumination conditions, so improving the robustness of the model under varying lighting is very important. [22] proposed a stability training method to improve the robustness of deep neural networks, using the Euclidean distance to evaluate the results. The stability training process could also benefit segmentation networks. However, the Euclidean distance is not a good measure when most pixels in the prediction differ little from the ground truth. Inspired by model distillation [23], we instead employ soft labels and a KL divergence in a consistency constraint loss to assist training and improve robustness.
Fig. 2. Overview of PortraitNet. (a) The architecture of PortraitNet. The green blocks represent the encoder module; the numbers in brackets are the down-sampling rates, and each green block comprises several convolutional layers. The yellow and purple blocks represent the decoder module; each up-sampling operation up-samples the feature maps by 2×. (b) The architecture of the D-Block in the decoder module.
3. Method

In this section, we elaborate our method in detail. We first introduce the architecture of PortraitNet, which is specifically designed for mobile devices and includes two modules, the encoder module and the decoder module. Then, we describe two auxiliary losses used in PortraitNet to improve segmentation accuracy without causing extra cost at the testing stage.

Fig. 2(b) shows the architecture of the transition blocks (D-Blocks) in the decoder module. There are two branches in each block: one branch contains two depthwise separable convolutions, and the other contains a single 1×1 convolution to adjust the number of channels. In PortraitNet, we utilize MobileNet-v2 [19] as the backbone of the encoder module, and we use depthwise separable convolutions extensively to obtain a higher running speed, which makes the model suitable for mobile devices.
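To make the block structure concrete, below is a minimal PyTorch sketch of such a two-branch decoder block. The element-wise addition of the two branches and the exact placement of batch normalization and ReLU are our illustrative assumptions, not details specified in the text.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """A 3x3 depthwise convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class DBlock(nn.Module):
    """Two-branch transition block: the main branch stacks two depthwise
    separable convolutions; the shortcut branch is a single 1x1 convolution
    that adjusts the number of channels."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main = nn.Sequential(
            DepthwiseSeparableConv(in_ch, out_ch),
            DepthwiseSeparableConv(out_ch, out_ch),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Assumption: the two branches are fused by element-wise addition.
        return self.relu(self.main(x) + self.shortcut(x))


# Example: a decoder stage that up-samples by 2x and refines with a D-Block.
if __name__ == "__main__":
    feat = torch.randn(1, 96, 14, 14)
    up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
    block = DBlock(96, 32)
    print(block(up(feat)).shape)  # torch.Size([1, 32, 28, 28])
```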
Boundary loss. We attach a single extra convolutional layer to the decoder to predict the portrait boundary in parallel with the segmentation mask. Learning the boundary forces the network to capture the pivotal features of the portrait images, such that the learned features can be effectively used for inferring better segmentation.
We generate the boundary ground truth from the manually labeled mask ground truth using a traditional boundary detection algorithm such as Canny [25]. In order to reduce the difficulty of learning boundaries, we set the boundary width to 4 pixels for 224 × 224 input images (see Fig. 3). Since more than 90% of the pixels in the boundary ground truth images are negative, the representation of the boundary is difficult to learn. We therefore use the focal loss [26] to guide the learning of boundary masks. The overall loss L is:
$$L_m = -\sum_{i=1}^{n} \big( y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \big) \qquad (1)$$

$$L_e = -\sum_{i=1}^{n} \big( (1 - p_i)^{\gamma} \, y_i \log(p_i) + p_i^{\gamma} \, (1 - y_i)\log(1 - p_i) \big) \qquad (2)$$

$$L = L_m + \lambda \times L_e \qquad (3)$$
Fig. 3. The ground truth boundary generated by the Canny operator. (a) Original image. (b) The corresponding segmentation. (c) The ground truth boundary.

In the above, L_m is the cross-entropy loss, L_e is the focal loss, and λ is the weight of the boundary loss. y_i represents the ground truth label of pixel i, and p_i represents the predicted probability of pixel i. The predicted probability p_i in Eq. 1 and Eq. 2 is computed as:
$$p_i(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad (4)$$

where z is the original output of PortraitNet and K is the number of ground truth classes.

As only one convolutional layer is used to generate the boundary mask, the mask features and the boundary features could compete with each other in the shared feature representation. To avoid this, λ should be set to a small value. The boundary loss improves the sensitivity of the model to the portrait boundary, which in turn improves the segmentation accuracy.
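As a reference for the boundary supervision above, the following is a minimal sketch of boundary ground truth generation and the combined loss of Eqs. 1-3 in Python (OpenCV/PyTorch). For brevity it treats both heads as single-channel sigmoid outputs, whereas Eqs. 1-4 are written over a K-class softmax; the Canny thresholds, the dilation used to obtain a 4-pixel-wide boundary, and the γ value are illustrative choices rather than values from the paper.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F


def boundary_ground_truth(mask, width=4):
    """Derive a boundary map from a binary portrait mask with the Canny operator,
    then dilate it so the boundary is roughly `width` pixels wide (see Fig. 3)."""
    edges = cv2.Canny(mask.astype(np.uint8) * 255, 100, 200)
    kernel = np.ones((width, width), np.uint8)
    return (cv2.dilate(edges, kernel) > 0).astype(np.float32)


def focal_loss(pred_logits, target, gamma=2.0):
    """Binary focal loss for the boundary branch (Eq. 2); gamma = 2 is a common
    default from [26], not a value stated in the paper."""
    p = torch.sigmoid(pred_logits)
    loss = -((1 - p) ** gamma * target * torch.log(p + 1e-8)
             + p ** gamma * (1 - target) * torch.log(1 - p + 1e-8))
    return loss.sum()


def mask_and_boundary_loss(mask_logits, boundary_logits, mask_gt, boundary_gt, lam=0.3):
    """Overall loss of Eq. 3: cross-entropy mask loss plus a small-weighted
    boundary focal loss (0.3 matches the role of beta in Eq. 13)."""
    l_m = F.binary_cross_entropy_with_logits(mask_logits, mask_gt, reduction="sum")
    l_e = focal_loss(boundary_logits, boundary_gt)
    return l_m + lam * l_e
```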
Consistency constraint loss. It is straightforward to use the ground truth semantic segmentation as the supervision signal, where the portrait pixels in the image are manually labeled as 1 and the background pixels as 0. Such labels are usually called hard labels, because they only encode binary categories. However, it has been shown that soft labels carrying more information can further benefit model training. Several research works focus on using soft labels to improve the accuracy of small models through model distillation [23, 27]: for the input images, a well-trained large teacher model generates soft labels, which are then used to supervise the training of the small student model. However, model distillation requires a tedious training process, and the amount of data may not be sufficient to train a large teacher model. Instead of complicated model distillation, we propose a novel method that generates soft labels using the small network itself together with data augmentation, and we use a consistency constraint loss to assist model training.

Commonly used data augmentation falls into two main categories. One is deformation augmentation, such as random rotation, flipping, scaling, and cropping. The other is texture augmentation, such as changing the brightness, contrast, or sharpness of images, adding random noise, or applying Gaussian filtering. For an original image, we first use deformation augmentations to generate image A, and then apply texture augmentation on image A to generate A'. Texture augmentation does not change the shape of the content, so the segmentations of images A and A' are the same. Suppose the network output for image A is heatmap B and the output for image A' is heatmap B'; in theory, B and B' should be identical. However, due to the texture augmentation, the quality of image A' is worse than that of A, and as a result the generated B' is worse than B. Hence we use the higher-quality heatmap B as the soft label for heatmap B'. Specifically, we add a consistency constraint loss between heatmaps B and B', formulated as a KL divergence:

$$L_m' = -\sum_{i=1}^{n} \big( y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \big) - \sum_{i=1}^{n} \big( y_i \log(p_i') + (1 - y_i)\log(1 - p_i') \big) \qquad (5)$$

$$L_c = \frac{1}{n} \sum_{i=1}^{n} q_i \log\frac{q_i}{q_i'} \times T^2 \qquad (6)$$

$$L = L_m' + \alpha \times L_c \qquad (7)$$

Here α is used to balance the two losses, and T is a temperature used to smooth the outputs. p_i and p_i' in Eq. 5 are defined as follows:

$$p_i(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad p_i'(z_j') = \frac{e^{z_j'}}{\sum_{k=1}^{K} e^{z_k'}} \qquad (8)$$

and q_i and q_i' in Eq. 6 are defined similarly:

$$q_i(z_j) = \frac{e^{z_j / T}}{\sum_{k=1}^{K} e^{z_k / T}}, \qquad q_i'(z_j') = \frac{e^{z_j' / T}}{\sum_{k=1}^{K} e^{z_k' / T}} \qquad (9)$$

The consistency constraint loss further improves the accuracy of the model and enhances its robustness under different illumination conditions.
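A minimal PyTorch sketch of Eqs. 5-7 is given below. It assumes the network returns per-pixel class logits for the augmented pair (A, A'); detaching heatmap B so that it acts purely as a soft label for B' is our interpretation of the text.

```python
import torch
import torch.nn.functional as F


def consistency_training_loss(logits_a, logits_a_prime, mask_gt,
                              alpha=2.0, temperature=1.0):
    """Mask loss plus consistency constraint loss (Eqs. 5-7).

    logits_a:       (N, K, H, W) logits for image A (deformation augmentation only)
    logits_a_prime: (N, K, H, W) logits for image A' (A plus texture augmentation)
    mask_gt:        (N, H, W) integer class labels, shared by A and A'
    """
    # Eq. 5: cross-entropy of both predictions against the same hard labels
    # (equivalent to the binary form when K = 2).
    l_m = (F.cross_entropy(logits_a, mask_gt, reduction="sum")
           + F.cross_entropy(logits_a_prime, mask_gt, reduction="sum"))

    # Eq. 9: temperature-softened distributions; B (from A) is detached so it
    # only serves as the soft label for B'.
    q = F.softmax(logits_a.detach() / temperature, dim=1)
    log_q_prime = F.log_softmax(logits_a_prime / temperature, dim=1)

    # Eq. 6: per-pixel KL divergence, averaged over pixels and scaled by T^2.
    kl = F.kl_div(log_q_prime, q, reduction="none").sum(dim=1).mean()
    l_c = kl * temperature ** 2

    # Eq. 7: total loss.
    return l_m + alpha * l_c
```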
4. Experiments
4.1. Dataset
We train and test our method on two well-known portrait segmentation datasets: EG1800 [7] and Supervise-Portrait.

EG1800. EG1800 contains 1800 portrait images collected from Flickr, and each image is manually labeled at pixel level. The images are mainly self-portraits captured with the front camera of a mobile phone. The final images in EG1800 are scaled and cropped automatically to 800×600 according to the bounding box generated by a face detector run on each image. The 1800 images are divided into two groups: a training dataset with 1500 images and a validation/testing dataset with 300 images. Since several image URL links in the original EG1800 dataset are invalid, we finally use 1447 images for training and 289 images for validation. Some sample portrait images are shown in Fig. 5.

Supervise-Portrait. Supervise-Portrait is a portrait segmentation dataset collected from the public human segmentation dataset Supervise.ly [28] using the same data processing as EG1800. The Supervise.ly dataset contains high-quality annotated person instances, carefully labeled with person segmentation masks. We further run a face detector on the dataset and automatically crop the images according to the face bounding boxes. We discard the images on which the face detector fails and finally collect 2258 portrait images of different sizes. We randomly select 1858 images as the training dataset and 400 images as the validation/testing dataset. We name the resulting dataset Supervise-Portrait. Compared with EG1800, the portrait images in Supervise-Portrait have more complicated backgrounds and severe occlusion. Some sample portrait images are shown in Fig. 6.

Fig. 6. Sample portrait images in Supervise-Portrait.

4.2. Data Augmentation

To improve the generality of the trained model, we use several data augmentation methods to supplement the original training dataset, leading to better segmentation results. These data augmentation methods can be divided into two categories: deformation augmentation and texture augmentation. Deformation augmentation alters the position or size of the target but does not affect its texture; texture augmentation, on the other hand, changes the texture information of the target while keeping its position and size.

The deformation augmentation methods used in our experiments include:

• random horizontal flip
• random rotation {−45° ∼ 45°}
• random resizing {0.5 ∼ 1.5}
• random translation {−0.25 ∼ 0.25}
Fig. 7. Data augmentations used in PortraitNet. (a) Original images. (b) Result images after applying deformation augmentations to (a). (c) Result images after applying texture augmentations to (b). (d) The ground truth segmentation corresponding to (b-c).

The texture augmentation methods used in our experiments include:

• random noise {Gaussian noise, σ = 10}
• image blur {kernel size 3 or 5, chosen randomly}
• random color change {0.4 ∼ 1.7}
• random brightness change {0.4 ∼ 1.7}
• random contrast change {0.6 ∼ 1.5}
• random sharpness change {0.8 ∼ 1.3}

Every operation in deformation augmentation and texture augmentation is applied with a probability of 0.5 during training. After data augmentation, we normalize the input images before training using the image mean ([103.94, 116.78, 123.68], BGR order) and the image scale val (0.017). The normalization equation is (image − mean) × val. Fig. 7 shows the data augmentation methods used in the experiments.
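The following sketch illustrates how one training sample can be turned into the pair (A, A') used by the consistency constraint loss and then normalized as described above; the specific Pillow operations and parameters are stand-ins for the augmentation lists, not the authors' implementation.

```python
import random
import numpy as np
from PIL import ImageEnhance, ImageFilter, ImageOps

MEAN = np.array([103.94, 116.78, 123.68], dtype=np.float32)  # BGR order
VAL = 0.017


def deformation_augment(image, mask):
    """Deformation augmentation, applied jointly to image and mask (-> image A)."""
    if random.random() < 0.5:
        image, mask = ImageOps.mirror(image), ImageOps.mirror(mask)
    if random.random() < 0.5:
        angle = random.uniform(-45, 45)
        image, mask = image.rotate(angle), mask.rotate(angle)
    return image, mask


def texture_augment(image):
    """Texture augmentation, applied to the image only (-> image A')."""
    if random.random() < 0.5:
        image = ImageEnhance.Brightness(image).enhance(random.uniform(0.4, 1.7))
    if random.random() < 0.5:
        image = ImageEnhance.Contrast(image).enhance(random.uniform(0.6, 1.5))
    if random.random() < 0.5:
        image = ImageEnhance.Sharpness(image).enhance(random.uniform(0.8, 1.3))
    if random.random() < 0.5:
        image = image.filter(ImageFilter.GaussianBlur(radius=random.choice([1, 2])))
    return image


def normalize(image):
    """Normalization used before training: (image - mean) * val, in BGR order."""
    bgr = np.asarray(image, dtype=np.float32)[:, :, ::-1]
    return (bgr - MEAN) * VAL


def make_training_pair(image, mask):
    """A' shares A's geometry, so both images use the same augmented mask."""
    a, mask_a = deformation_augment(image, mask)
    a_prime = texture_augment(a)
    return normalize(a), normalize(a_prime), np.asarray(mask_a)
```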
4.3. Experimental Setup

We implement our model using the PyTorch framework [29]. All competing models are trained on a single NVIDIA 1080Ti graphics card. We use the Adam algorithm with a batch size of 64 and a weight decay of 5e-4 during training. The initial learning rate is 0.001, and we adjust the learning rate as lr × 0.95^(epoch/20) over 2000 epochs. In order to achieve a higher running speed, we train and test our model on 224 × 224 RGB images with three channels.
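A minimal PyTorch sketch of this training configuration follows; the one-layer model is only a stand-in for PortraitNet.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, kernel_size=3, padding=1)  # stand-in for PortraitNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)
# Learning rate schedule: lr * 0.95^(epoch / 20), applied once per epoch.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 0.95 ** (epoch / 20))

for epoch in range(2000):
    # ... one pass over the 224x224 training batches (batch size 64) ...
    scheduler.step()
```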
To evaluate segmentation accuracy, we measure the mean IOU over the test dataset, where maskPD_i and maskGT_i represent the segmentation result and the ground truth label of the i-th image of the test dataset, respectively. The quantitative comparison is shown in Table 1. It can be seen that the boundary loss improves the IOU accuracy by 0.22% (from 96.32% to 96.54%) on the EG1800 dataset, and by 0.41% (from 92.63% to 93.04%) on the Supervise-Portrait dataset.

We also propose a metric that evaluates the model performance on portrait boundaries better than mean IOU does. The new metric is similar to mean IOU, but places more weight on boundary pixels than on inner pixels:

$$\text{mean edge IOU} = \frac{1}{N} \sum_{i=1}^{N} \frac{w(x)_i \, (\text{maskPD}_i \cap \text{maskGT}_i)}{w(x)_i \, (\text{maskPD}_i \cup \text{maskGT}_i)}, \qquad (11)$$

where w(x)_i represents the weight of pixel x in the i-th image. More specifically, the weight w(x)_i declines continuously from the boundary toward the interior according to the following equation:

$$w(x)_i = \begin{cases} e^{-\frac{dis(x)^2}{2\sigma^2}}, & x \in \text{maskGT} \text{ and } y(x) = 1 \\ 0, & x \in \text{maskGT} \text{ and } y(x) = 0 \end{cases} \qquad (12)$$

where dis(x) represents the distance from pixel x to the portrait boundary, and σ indicates the decline rate. An illustration of the new metric with different σ is shown in Fig. 8. Based on the new metric, we compare the performance of the two networks with different σ on the EG1800 dataset; the results are shown in Fig. 9 and Table 2. It can be seen that the performance improvement is larger when σ is smaller, since the metric emphasizes boundary pixels more when σ is small. This demonstrates the effectiveness of the boundary loss in improving the precision of segmentation boundaries.
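As a reference implementation of Eqs. 11 and 12, the sketch below uses the Canny operator and a distance transform to obtain dis(x); zeroing the weights of pixels outside the ground truth portrait follows our literal reading of Eq. 12, and the Canny thresholds are illustrative.

```python
import cv2
import numpy as np


def edge_weight(mask_gt, sigma):
    """Per-pixel weight of Eq. 12: a Gaussian of the distance to the portrait
    boundary for ground-truth foreground pixels, zero elsewhere."""
    boundary = cv2.Canny(mask_gt.astype(np.uint8) * 255, 100, 200) > 0
    # dis(x): distance from every pixel to the nearest boundary pixel.
    dist = cv2.distanceTransform((~boundary).astype(np.uint8), cv2.DIST_L2, 3)
    weight = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    weight[mask_gt == 0] = 0.0
    return weight


def mean_edge_iou(masks_pd, masks_gt, sigma):
    """Boundary-weighted mean IOU of Eq. 11 over a test set of binary masks."""
    scores = []
    for pd, gt in zip(masks_pd, masks_gt):
        w = edge_weight(gt, sigma)
        inter = (w * np.logical_and(pd, gt)).sum()
        union = (w * np.logical_or(pd, gt)).sum()
        scores.append(inter / max(union, 1e-8))
    return float(np.mean(scores))
```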
Fig. 8. New metric for evaluating boundary precision. (a) Original images. (b) The corresponding segmentation. (c-e) The weight masks under the new
metric with different σ in Eq. 12.
Table 1. Accuracy comparison of PortraitNet with different losses.
Method                         EG1800    Supervise.ly
PortraitNet-M (ours, Exp. 1)   96.32%    92.63%
PortraitNet-B (ours, Exp. 2)   96.54%    93.04%
PortraitNet-C (ours, Exp. 3)   96.57%    93.17%
PortraitNet (ours, Exp. 4)     96.62%    93.43%

Table 2. Accuracy comparison of PortraitNet with the new mean IOU metric on the EG1800 test dataset.
Sigma   single model   edge model   increase
3       91.56%         92.34%       +0.78%
5       93.80%         94.44%       +0.64%
10      96.11%         96.55%       +0.44%
20      97.57%         97.86%       +0.29%
40      98.23%         98.44%       +0.21%
80      98.43%         98.63%       +0.20%
Fig. 11. Segmentation results of challenging portrait images generated by different methods. The first row shows images with strong illumination. The second and fourth rows show images whose background color is close to the foreground portrait. The third row shows a portrait image with a helmet. The last row shows a portrait from a side view.
4.4.3. Accuracy Analysis

PortraitNet is specifically designed for mobile devices, in contrast to other real-time segmentation networks. We choose PortraitFCN+ [7], ENet [13] and BiSeNet [15] as baselines, since PortraitFCN+ [7] is one of the iconic methods for portrait segmentation, while ENet [13] and BiSeNet [15] are the state-of-the-art. In our experiments, the backbone of BiSeNet is ResNet18.
For real-time inference on mobile devices, we use MobileNet-v2 [19] as our backbone to extract features from the original images, and we use a U-shaped architecture to generate sharp segmentation boundaries. Depthwise separable convolutions are used in PortraitNet to gain running speed. In the encoder module, the down-sampling rate is 32×; the resulting large receptive field exploits global information to help deduce the segmentation mask, which is necessary for portrait images. In the decoder module, we use skip connections from the encoder to reconstruct the spatial information for better segmentation details. To deploy segmentation networks on mobile devices, we set the input image size to 224×224 for real-time inference. We train the PortraitNet model with the mask loss and the two auxiliary losses as follows:

$$L = L_m' + \alpha \times L_c + \beta \times L_e, \qquad (13)$$

where L_m', L_c, and L_e are defined in Eq. 5, Eq. 6, and Eq. 2 respectively, and α = 2, β = 0.3, T = 1.

The performance on the EG1800 and Supervise-Portrait datasets is shown in Table 4. To further verify the effectiveness of the two auxiliary losses, we also test a new model called BiSeNet+, which is BiSeNet with our two auxiliary losses. The experiments show that the two auxiliary losses also improve the results of BiSeNet. Fig. 11 shows several challenging portrait segmentation examples.
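For completeness, the total objective of Eq. 13 can be assembled from the loss terms sketched in Section 3; consistency_training_loss and focal_loss below refer to those earlier illustrative sketches, not to the authors' code.

```python
def portraitnet_loss(logits_a, logits_a_prime, boundary_logits, mask_gt, boundary_gt,
                     alpha=2.0, beta=0.3, temperature=1.0):
    """Eq. 13: L = L'_m + alpha * L_c + beta * L_e, with alpha = 2, beta = 0.3, T = 1."""
    # L'_m + alpha * L_c: mask loss on (A, A') plus the consistency constraint.
    l_mask = consistency_training_loss(logits_a, logits_a_prime, mask_gt,
                                       alpha=alpha, temperature=temperature)
    # L_e: focal loss of the boundary branch (Eq. 2).
    l_edge = focal_loss(boundary_logits, boundary_gt)
    return l_mask + beta * l_edge
```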