Song-Hai Zhang^a, Xin Dong^a, Jia Li^b, Ruilong Li^a, Yong-Liang Yang^c
^a Tsinghua University, Beijing, China
^b Cisco Systems Networking Technology Co., Ltd., Hangzhou Branch Office, China
^c University of Bath, Claverton Down, Bath, UK
ABSTRACT
As learning precise segmentation boundaries is difficult for convolutional neural networks, we design an auxiliary boundary loss to help the network generate better portrait boundaries. Meanwhile, we take into account the complex illumination conditions in portrait images and utilize a consistency constraint loss to improve robustness. With the two auxiliary losses, we achieve an accuracy of 96.62% on the EG1800 dataset and 93.43% on the Supervise-Portrait dataset at 30 FPS on an iPhone 7 with an input image size of 224 × 224.

Fig. 1. Portrait segmentation applications on mobile devices. (a) Original image. (b) The corresponding segmentation. (c-d) Two important applications based on portrait image segmentation.

2. Related Work

PortraitNet is related to research in semantic segmentation and lightweight convolutional neural networks. This section reviews typical semantic segmentation methods based on deep convolutional networks, as well as classical lightweight architectures.

Semantic image segmentation is a fundamental research topic in computer vision. Many applications require highly accurate and efficient segmentation results as a basis for analyzing and understanding images. With the recent advances of deep learning, semantic segmentation methods based on deep convolutional neural networks [10, 11, 12, 9] have made great achievements, especially in improving segmentation precision. The fully convolutional network (FCN) [1] was the first essential work to propose an end-to-end network for pixel-wise segmentation; it also defined a skip architecture to produce accurate masks. SegNet [2] introduced a classical encoder-decoder architecture for segmentation, and UNet [3] follows a similar design. The main difference is that SegNet [2] transfers pooling indices from the encoder to the decoder to produce sparse feature maps, while UNet [3] transfers high-resolution features from the encoder to the up-sampled features in the decoder. The DeepLab series [4, 5, 6] presents some of the most accurate semantic segmentation methods to date. DeepLabv1 [4] used dilated convolution to maintain the size of the feature maps and CRFs to refine the segmentation results. DeepLabv2 [5] proposed the atrous spatial pyramid pooling (ASPP) module for further improvement. DeepLabv3 [6] removed the CRF module and modified the ASPP module to improve accuracy. Although these semantic segmentation methods achieve high precision, their efficiency is relatively low.

Compared with such large, complex models, some segmentation works pay more attention to efficiency. ENet [13] proposed a new deep and narrow network architecture; its speed is much higher, but the accuracy drop is obvious. ICNet [14] incorporated multi-resolution branches to improve model accuracy, but the model is still too large to run on mobile devices. BiSeNet [15] is the state-of-the-art real-time segmentation method on the Cityscapes dataset [16]. However, this method is not suitable for small input images because of its crude up-sampling modules.

Automatic portrait segmentation, as a specialized form of semantic segmentation, is important in the mobile computing era. [7] collected the first human portrait dataset, named EG1800, and designed a segmentation network to distinguish the portrait from the background. [8] designed a boundary-sensitive network that improves accuracy using soft boundary labels. [17] proposed a Border Network to improve segmentation accuracy. However, these existing works focus on accuracy rather than computational efficiency. With the growing demand for mobile applications, a number of works aiming at efficient models for mobile devices have been proposed [18, 19, 20, 21]. Depthwise separable convolutional layers are widely used in such lightweight networks. PortraitNet employs MobileNet-v2 [19] as the backbone to extract features in the encoder module, and uses depthwise separable convolutions to substitute traditional convolutions in the decoder module, yielding a lightweight network.

Portrait images usually exhibit complex illumination conditions, so improving the robustness of the model under varying lighting is very important. [22] proposed a stability training method to improve the robustness of deep neural networks, using the Euclidean distance to evaluate the results. The stability training process could also benefit segmentation networks. However, the Euclidean distance is not a good measure when most pixels in the prediction differ little from the ground truth. Inspired by model distillation [23], we instead employ soft labels and a KL divergence in a consistency constraint loss to assist training and improve robustness.
Fig. 2. Overview of PortraitNet. (a) The architecture of PortraitNet. The green blocks represent the encoder module; the numbers in brackets are the down-sampling rates, and each green block comprises several convolutional layers. The yellow and purple blocks represent the decoder module; each up-sampling operation up-samples the feature maps by 2×. (b) The architecture of the D-Block in the decoder module.
3. Method

In this section, we elaborate our method in detail. We first introduce the architecture of PortraitNet, which is specifically designed for mobile devices and includes two modules, the encoder module and the decoder module. Then, we describe two auxiliary losses used in PortraitNet to improve segmentation accuracy without causing extra cost at the testing stage.

Fig. 2(b) shows the architecture of the transition blocks (D-Blocks) in the decoder module. There are two branches in each block: one branch contains two depthwise separable convolutions, and the other contains a single 1×1 convolution to adjust the number of channels. In PortraitNet, we utilize MobileNet-v2 [19] as the backbone of the encoder module, and we use depthwise separable convolutions extensively to obtain a higher running speed, which makes the model suitable for mobile devices.
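To make the block structure concrete, below is a minimal PyTorch sketch of such a two-branch decoder block. The element-wise addition of the two branches and the exact placement of batch normalization and ReLU are our illustrative assumptions, not details specified in the text.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """A 3x3 depthwise convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class DBlock(nn.Module):
    """Two-branch transition block: the main branch stacks two depthwise
    separable convolutions; the shortcut branch is a single 1x1 convolution
    that adjusts the number of channels."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main = nn.Sequential(
            DepthwiseSeparableConv(in_ch, out_ch),
            DepthwiseSeparableConv(out_ch, out_ch),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Assumption: the two branches are fused by element-wise addition.
        return self.relu(self.main(x) + self.shortcut(x))


# Example: a decoder stage that up-samples by 2x and refines with a D-Block.
if __name__ == "__main__":
    feat = torch.randn(1, 96, 14, 14)
    up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
    block = DBlock(96, 32)
    print(block(up(feat)).shape)  # torch.Size([1, 32, 28, 28])
```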
Boundary loss. We attach a single extra convolutional layer to the decoder to predict the portrait boundary in parallel with the segmentation mask. Learning the boundary forces the network to capture the pivotal features of the portrait images, such that the learned features can be effectively used for inferring better segmentation.
We generate the boundary ground truth from the manually labeled mask ground truth using a traditional boundary detection algorithm such as Canny [25]. In order to reduce the difficulty of learning boundaries, we set the boundary width to 4 pixels for 224 × 224 input images (see Fig. 3). Since more than 90% of the pixels in the boundary ground truth images are negative, the representation of the boundary is difficult to learn. We therefore use the focal loss [26] to guide the learning of boundary masks. The overall loss L is:
$$L_m = -\sum_{i=1}^{n} \big( y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \big) \qquad (1)$$

$$L_e = -\sum_{i=1}^{n} \big( (1 - p_i)^{\gamma} \, y_i \log(p_i) + p_i^{\gamma} \, (1 - y_i)\log(1 - p_i) \big) \qquad (2)$$

$$L = L_m + \lambda \times L_e \qquad (3)$$
Fig. 3. The ground truth boundary generated by the Canny operator. (a) Original image. (b) The corresponding segmentation. (c) The ground truth boundary.

In the above, L_m is the cross-entropy loss, L_e is the focal loss, and λ is the weight of the boundary loss. y_i represents the ground truth label of pixel i, and p_i represents the predicted probability of pixel i. The predicted probability p_i in Eq. 1 and Eq. 2 is computed as:
$$p_i(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad (4)$$

where z is the original output of PortraitNet and K is the number of ground truth classes.

As only one convolutional layer is used to generate the boundary mask, the mask features and the boundary features could compete with each other in the shared feature representation. To avoid this, λ should be set to a small value. The boundary loss improves the sensitivity of the model to the portrait boundary, which in turn improves the segmentation accuracy.
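As a reference for the boundary supervision above, the following is a minimal sketch of boundary ground truth generation and the combined loss of Eqs. 1-3 in Python (OpenCV/PyTorch). For brevity it treats both heads as single-channel sigmoid outputs, whereas Eqs. 1-4 are written over a K-class softmax; the Canny thresholds, the dilation used to obtain a 4-pixel-wide boundary, and the γ value are illustrative choices rather than values from the paper.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F


def boundary_ground_truth(mask, width=4):
    """Derive a boundary map from a binary portrait mask with the Canny operator,
    then dilate it so the boundary is roughly `width` pixels wide (see Fig. 3)."""
    edges = cv2.Canny(mask.astype(np.uint8) * 255, 100, 200)
    kernel = np.ones((width, width), np.uint8)
    return (cv2.dilate(edges, kernel) > 0).astype(np.float32)


def focal_loss(pred_logits, target, gamma=2.0):
    """Binary focal loss for the boundary branch (Eq. 2); gamma = 2 is a common
    default from [26], not a value stated in the paper."""
    p = torch.sigmoid(pred_logits)
    loss = -((1 - p) ** gamma * target * torch.log(p + 1e-8)
             + p ** gamma * (1 - target) * torch.log(1 - p + 1e-8))
    return loss.sum()


def mask_and_boundary_loss(mask_logits, boundary_logits, mask_gt, boundary_gt, lam=0.3):
    """Overall loss of Eq. 3: cross-entropy mask loss plus a small-weighted
    boundary focal loss (0.3 matches the role of beta in Eq. 13)."""
    l_m = F.binary_cross_entropy_with_logits(mask_logits, mask_gt, reduction="sum")
    l_e = focal_loss(boundary_logits, boundary_gt)
    return l_m + lam * l_e
```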
Consistency constraint loss. It is straightforward to use the ground truth semantic segmentation as the supervision signal, where the portrait pixels in the image are manually labeled as 1 and the background pixels as 0. Such labels are usually called hard labels, because they only encode binary categories. However, it has been shown that soft labels carrying more information can further benefit model training. Several research works focus on using soft labels to improve the accuracy of small models through model distillation [23, 27]: for the input images, a well-trained large teacher model generates soft labels, which are then used to supervise the training of the small student model. However, model distillation requires a tedious training process, and the amount of data may not be sufficient to train a large teacher model. Instead of complicated model distillation, we propose a novel method that generates soft labels using the small network itself together with data augmentation, and we use a consistency constraint loss to assist model training.

Commonly used data augmentation falls into two main categories. One is deformation augmentation, such as random rotation, flipping, scaling, and cropping. The other is texture augmentation, such as changing the brightness, contrast, or sharpness of images, adding random noise, or applying Gaussian filtering. For an original image, we first use deformation augmentations to generate image A, and then apply texture augmentation on image A to generate A'. Texture augmentation does not change the shape of the content, so the segmentations of images A and A' are the same. Suppose the network output for image A is heatmap B and the output for image A' is heatmap B'; in theory, B and B' should be identical. However, due to the texture augmentation, the quality of image A' is worse than that of A, and as a result the generated B' is worse than B. Hence we use the higher-quality heatmap B as the soft label for heatmap B'. Specifically, we add a consistency constraint loss between heatmaps B and B', formulated as a KL divergence:

$$L_m' = -\sum_{i=1}^{n} \big( y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \big) - \sum_{i=1}^{n} \big( y_i \log(p_i') + (1 - y_i)\log(1 - p_i') \big) \qquad (5)$$

$$L_c = \frac{1}{n} \sum_{i=1}^{n} q_i \log\frac{q_i}{q_i'} \times T^2 \qquad (6)$$

$$L = L_m' + \alpha \times L_c \qquad (7)$$

Here α is used to balance the two losses, and T is a temperature used to smooth the outputs. p_i and p_i' in Eq. 5 are defined as follows:

$$p_i(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad p_i'(z_j') = \frac{e^{z_j'}}{\sum_{k=1}^{K} e^{z_k'}} \qquad (8)$$

and q_i and q_i' in Eq. 6 are defined similarly:

$$q_i(z_j) = \frac{e^{z_j / T}}{\sum_{k=1}^{K} e^{z_k / T}}, \qquad q_i'(z_j') = \frac{e^{z_j' / T}}{\sum_{k=1}^{K} e^{z_k' / T}} \qquad (9)$$

The consistency constraint loss further improves the accuracy of the model and enhances its robustness under different illumination conditions.
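A minimal PyTorch sketch of Eqs. 5-7 is given below. It assumes the network returns per-pixel class logits for the augmented pair (A, A'); detaching heatmap B so that it acts purely as a soft label for B' is our interpretation of the text.

```python
import torch
import torch.nn.functional as F


def consistency_training_loss(logits_a, logits_a_prime, mask_gt,
                              alpha=2.0, temperature=1.0):
    """Mask loss plus consistency constraint loss (Eqs. 5-7).

    logits_a:       (N, K, H, W) logits for image A (deformation augmentation only)
    logits_a_prime: (N, K, H, W) logits for image A' (A plus texture augmentation)
    mask_gt:        (N, H, W) integer class labels, shared by A and A'
    """
    # Eq. 5: cross-entropy of both predictions against the same hard labels
    # (equivalent to the binary form when K = 2).
    l_m = (F.cross_entropy(logits_a, mask_gt, reduction="sum")
           + F.cross_entropy(logits_a_prime, mask_gt, reduction="sum"))

    # Eq. 9: temperature-softened distributions; B (from A) is detached so it
    # only serves as the soft label for B'.
    q = F.softmax(logits_a.detach() / temperature, dim=1)
    log_q_prime = F.log_softmax(logits_a_prime / temperature, dim=1)

    # Eq. 6: per-pixel KL divergence, averaged over pixels and scaled by T^2.
    kl = F.kl_div(log_q_prime, q, reduction="none").sum(dim=1).mean()
    l_c = kl * temperature ** 2

    # Eq. 7: total loss.
    return l_m + alpha * l_c
```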
4. Experiments
4.1. Dataset
We train and test our method on two well-known portrait segmentation datasets: EG1800 [7] and Supervise-Portrait.

EG1800. EG1800 contains 1800 portrait images collected from Flickr, and each image is manually labeled at pixel level. The images are mainly self-portraits captured with the front camera of a mobile phone. The final images in EG1800 are scaled and cropped automatically to 800×600 according to the bounding box generated by a face detector run on each image. The 1800 images are divided into two groups: a training dataset with 1500 images and a validation/testing dataset with 300 images. Since several image URL links in the original EG1800 dataset are invalid, we finally use 1447 images for training and 289 images for validation. Some sample portrait images are shown in Fig. 5.

Supervise-Portrait. Supervise-Portrait is a portrait segmentation dataset collected from the public human segmentation dataset Supervise.ly [28] using the same data processing as EG1800. The Supervise.ly dataset contains high-quality annotated person instances, carefully labeled with person segmentation masks. We further run a face detector on the dataset and automatically crop the images according to the face bounding boxes. We discard the images on which the face detector fails and finally collect 2258 portrait images of different sizes. We randomly select 1858 images as the training dataset and 400 images as the validation/testing dataset. We name the resulting dataset Supervise-Portrait. Compared with EG1800, the portrait images in Supervise-Portrait have more complicated backgrounds and severe occlusion. Some sample portrait images are shown in Fig. 6.

Fig. 6. Sample portrait images in Supervise-Portrait.

4.2. Data Augmentation

To improve the generality of the trained model, we use several data augmentation methods to supplement the original training dataset, leading to better segmentation results. These data augmentation methods can be divided into two categories: deformation augmentation and texture augmentation. Deformation augmentation alters the position or size of the target but does not affect its texture; texture augmentation, on the other hand, changes the texture information of the target while keeping its position and size.

The deformation augmentation methods used in our experiments include:

• random horizontal flip
• random rotation {−45° ∼ 45°}
• random resizing {0.5 ∼ 1.5}
• random translation {−0.25 ∼ 0.25}
Fig. 7. Data augmentations used in PortraitNet. (a) Original images. (b) Result images after applying deformation augmentations to (a). (c) Result images after applying texture augmentations to (b). (d) The ground truth segmentation corresponding to (b-c).

The texture augmentation methods used in our experiments include:

• random noise {Gaussian noise, σ = 10}
• image blur {kernel size 3 or 5, chosen randomly}
• random color change {0.4 ∼ 1.7}
• random brightness change {0.4 ∼ 1.7}
• random contrast change {0.6 ∼ 1.5}
• random sharpness change {0.8 ∼ 1.3}

Every operation in deformation augmentation and texture augmentation is applied with a probability of 0.5 during training. After data augmentation, we normalize the input images before training using the image mean ([103.94, 116.78, 123.68], BGR order) and the image scale val (0.017). The normalization equation is (image − mean) × val. Fig. 7 shows the data augmentation methods used in the experiments.
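The following sketch illustrates how one training sample can be turned into the pair (A, A') used by the consistency constraint loss and then normalized as described above; the specific Pillow operations and parameters are stand-ins for the augmentation lists, not the authors' implementation.

```python
import random
import numpy as np
from PIL import ImageEnhance, ImageFilter, ImageOps

MEAN = np.array([103.94, 116.78, 123.68], dtype=np.float32)  # BGR order
VAL = 0.017


def deformation_augment(image, mask):
    """Deformation augmentation, applied jointly to image and mask (-> image A)."""
    if random.random() < 0.5:
        image, mask = ImageOps.mirror(image), ImageOps.mirror(mask)
    if random.random() < 0.5:
        angle = random.uniform(-45, 45)
        image, mask = image.rotate(angle), mask.rotate(angle)
    return image, mask


def texture_augment(image):
    """Texture augmentation, applied to the image only (-> image A')."""
    if random.random() < 0.5:
        image = ImageEnhance.Brightness(image).enhance(random.uniform(0.4, 1.7))
    if random.random() < 0.5:
        image = ImageEnhance.Contrast(image).enhance(random.uniform(0.6, 1.5))
    if random.random() < 0.5:
        image = ImageEnhance.Sharpness(image).enhance(random.uniform(0.8, 1.3))
    if random.random() < 0.5:
        image = image.filter(ImageFilter.GaussianBlur(radius=random.choice([1, 2])))
    return image


def normalize(image):
    """Normalization used before training: (image - mean) * val, in BGR order."""
    bgr = np.asarray(image, dtype=np.float32)[:, :, ::-1]
    return (bgr - MEAN) * VAL


def make_training_pair(image, mask):
    """A' shares A's geometry, so both images use the same augmented mask."""
    a, mask_a = deformation_augment(image, mask)
    a_prime = texture_augment(a)
    return normalize(a), normalize(a_prime), np.asarray(mask_a)
```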
4.3. Experimental Setup

We implement our model using the PyTorch framework [29]. All competing models are trained on a single NVIDIA 1080Ti graphics card. We use the Adam algorithm with a batch size of 64 and a weight decay of 5e-4 during training. The initial learning rate is 0.001, and we adjust the learning rate as lr × 0.95^(epoch/20) over 2000 epochs. In order to achieve a higher running speed, we train and test our model on 224 × 224 RGB images with three channels.
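A minimal PyTorch sketch of this training configuration follows; the one-layer model is only a stand-in for PortraitNet.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, kernel_size=3, padding=1)  # stand-in for PortraitNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)
# Learning rate schedule: lr * 0.95^(epoch / 20), applied once per epoch.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 0.95 ** (epoch / 20))

for epoch in range(2000):
    # ... one pass over the 224x224 training batches (batch size 64) ...
    scheduler.step()
```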
To evaluate segmentation accuracy, we measure the mean IOU over the test dataset, where maskPD_i and maskGT_i represent the segmentation result and the ground truth label of the i-th image of the test dataset, respectively. The quantitative comparison is shown in Table 1. It can be seen that the boundary loss improves the IOU accuracy by 0.22% (from 96.32% to 96.54%) on the EG1800 dataset, and by 0.41% (from 92.63% to 93.04%) on the Supervise-Portrait dataset.

We also propose a metric that evaluates the model performance on portrait boundaries better than mean IOU does. The new metric is similar to mean IOU, but places more weight on boundary pixels than on inner pixels:

$$\text{mean edge IOU} = \frac{1}{N} \sum_{i=1}^{N} \frac{w(x)_i \, (\text{maskPD}_i \cap \text{maskGT}_i)}{w(x)_i \, (\text{maskPD}_i \cup \text{maskGT}_i)}, \qquad (11)$$

where w(x)_i represents the weight of pixel x in the i-th image. More specifically, the weight w(x)_i declines continuously from the boundary toward the interior according to the following equation:

$$w(x)_i = \begin{cases} e^{-\frac{dis(x)^2}{2\sigma^2}}, & x \in \text{maskGT} \text{ and } y(x) = 1 \\ 0, & x \in \text{maskGT} \text{ and } y(x) = 0 \end{cases} \qquad (12)$$

where dis(x) represents the distance from pixel x to the portrait boundary, and σ indicates the decline rate. An illustration of the new metric with different σ is shown in Fig. 8. Based on the new metric, we compare the performance of the two networks with different σ on the EG1800 dataset; the results are shown in Fig. 9 and Table 2. It can be seen that the performance improvement is larger when σ is smaller, since the metric emphasizes boundary pixels more when σ is small. This demonstrates the effectiveness of the boundary loss in improving the precision of segmentation boundaries.
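As a reference implementation of Eqs. 11 and 12, the sketch below uses the Canny operator and a distance transform to obtain dis(x); zeroing the weights of pixels outside the ground truth portrait follows our literal reading of Eq. 12, and the Canny thresholds are illustrative.

```python
import cv2
import numpy as np


def edge_weight(mask_gt, sigma):
    """Per-pixel weight of Eq. 12: a Gaussian of the distance to the portrait
    boundary for ground-truth foreground pixels, zero elsewhere."""
    boundary = cv2.Canny(mask_gt.astype(np.uint8) * 255, 100, 200) > 0
    # dis(x): distance from every pixel to the nearest boundary pixel.
    dist = cv2.distanceTransform((~boundary).astype(np.uint8), cv2.DIST_L2, 3)
    weight = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    weight[mask_gt == 0] = 0.0
    return weight


def mean_edge_iou(masks_pd, masks_gt, sigma):
    """Boundary-weighted mean IOU of Eq. 11 over a test set of binary masks."""
    scores = []
    for pd, gt in zip(masks_pd, masks_gt):
        w = edge_weight(gt, sigma)
        inter = (w * np.logical_and(pd, gt)).sum()
        union = (w * np.logical_or(pd, gt)).sum()
        scores.append(inter / max(union, 1e-8))
    return float(np.mean(scores))
```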
Fig. 8. New metric for evaluating boundary precision. (a) Original images. (b) The corresponding segmentation. (c-e) The weight masks under the new
metric with different σ in Eq. 12.
Table 1. Accuracy comparison of PortraitNet with different losses.
Method                         EG1800    Supervise.ly
PortraitNet-M (ours, Exp. 1)   96.32%    92.63%
PortraitNet-B (ours, Exp. 2)   96.54%    93.04%
PortraitNet-C (ours, Exp. 3)   96.57%    93.17%
PortraitNet (ours, Exp. 4)     96.62%    93.43%

Table 2. Accuracy comparison of PortraitNet with the new mean IOU metric on the EG1800 test dataset.
Sigma   single model   edge model   increase
3       91.56%         92.34%       +0.78%
5       93.80%         94.44%       +0.64%
10      96.11%         96.55%       +0.44%
20      97.57%         97.86%       +0.29%
40      98.23%         98.44%       +0.21%
80      98.43%         98.63%       +0.20%
Fig. 11. Segmentation results of challenging portrait images generated by different methods. The first row shows images with strong illumination. The second and fourth rows show images whose background color is close to the foreground portrait. The third row shows a portrait image with a helmet. The last row shows a portrait from a side view.
4.4.3. Accuracy Analysis

PortraitNet is specifically designed for mobile devices, in contrast to other real-time segmentation networks. We choose PortraitFCN+ [7], ENet [13] and BiSeNet [15] as baselines, since PortraitFCN+ [7] is one of the iconic methods for portrait segmentation, while ENet [13] and BiSeNet [15] are the state-of-the-art. In our experiments, the backbone of BiSeNet is ResNet18.
For real-time inference on mobile devices, we use MobileNet-v2 [19] as our backbone to extract features from the original images, and we use a U-shaped architecture to generate sharp segmentation boundaries. Depthwise separable convolutions are used in PortraitNet to gain running speed. In the encoder module, the down-sampling rate is 32×; the resulting large receptive field exploits global information to help deduce the segmentation mask, which is necessary for portrait images. In the decoder module, we use skip connections from the encoder to reconstruct the spatial information for better segmentation details. To deploy segmentation networks on mobile devices, we set the input image size to 224×224 for real-time inference. We train the PortraitNet model with the mask loss and the two auxiliary losses as follows:

$$L = L_m' + \alpha \times L_c + \beta \times L_e, \qquad (13)$$

where L_m', L_c, and L_e are defined in Eq. 5, Eq. 6, and Eq. 2 respectively, and α = 2, β = 0.3, T = 1.

The performance on the EG1800 and Supervise-Portrait datasets is shown in Table 4. To further verify the effectiveness of the two auxiliary losses, we also test a new model called BiSeNet+, which is BiSeNet with our two auxiliary losses. The experiments show that the two auxiliary losses also improve the results of BiSeNet. Fig. 11 shows several challenging portrait segmentation examples.
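For completeness, the total objective of Eq. 13 can be assembled from the loss terms sketched in Section 3; consistency_training_loss and focal_loss below refer to those earlier illustrative sketches, not to the authors' code.

```python
def portraitnet_loss(logits_a, logits_a_prime, boundary_logits, mask_gt, boundary_gt,
                     alpha=2.0, beta=0.3, temperature=1.0):
    """Eq. 13: L = L'_m + alpha * L_c + beta * L_e, with alpha = 2, beta = 0.3, T = 1."""
    # L'_m + alpha * L_c: mask loss on (A, A') plus the consistency constraint.
    l_mask = consistency_training_loss(logits_a, logits_a_prime, mask_gt,
                                       alpha=alpha, temperature=temperature)
    # L_e: focal loss of the boundary branch (Eq. 2).
    l_edge = focal_loss(boundary_logits, boundary_gt)
    return l_mask + beta * l_edge
```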