DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation
Abstract
This paper introduces an extremely efficient CNN architecture named DFANet for semantic segmentation under resource constraints. Our proposed network starts from a single lightweight backbone and aggregates discriminative features through sub-network and sub-stage cascades respectively. Based on the multi-scale feature propagation, DFANet substantially reduces the number of parameters but still obtains a sufficient receptive field and enhances the model learning ability, which strikes a balance between speed and segmentation performance. Experiments on the Cityscapes and CamVid datasets demonstrate the superior performance of DFANet, with 8× less FLOPs and 2× faster inference than the existing state-of-the-art real-time semantic segmentation methods while providing comparable accuracy. Specifically, it achieves 70.3% mean IoU on the Cityscapes test dataset with only 1.7 GFLOPs at a speed of 160 FPS on one NVIDIA Titan X card, and 71.3% mean IoU with 3.4 GFLOPs while inferring on a higher-resolution image.

Figure 1. Inference speed, FLOPs and mIoU performance on the Cityscapes test set. The bigger the circle, the faster the speed. Results of existing real-time methods are shown, including ICNet[33], ENet[22], SQ[25], SegNet[1], FRRN[24], FCN-8S[19], TwoColumn[27] and BiSeNet[29]; two classical networks, DeepLab[7] and PSPNet[34], are also displayed. Our DFANet variants, based on two backbone networks and two input sizes, are compared as well.
1. Introduction
Semantic segmentation, which aims to assign dense labels to all pixels in an image, is a fundamental task in computer vision. It has a number of potential applications in the fields of autonomous driving, video surveillance, robot sensing and so on. For most such applications, how to keep efficient inference speed and high accuracy with high-resolution images is a critical problem.

Previous real-time semantic segmentation approaches [1][25][27][29][33][22] have already obtained promising performance on various benchmarks[10][9][18][36][2]. However, operations on high-resolution feature maps consume a significant amount of time in U-shape structures. Some works reduce the computation complexity by restricting the input image size[27] or by pruning redundant channels in the network to boost the inference speed[1][22]. Though these methods seem effective, they easily lose the spatial details around boundaries and small objects, and a shallow network weakens feature discriminative ability. To overcome these drawbacks, other methods [33][29] adopt a multi-branch framework to combine spatial details and context information. Nevertheless, the additional branches on the high-resolution image limit the speed, and the mutual independence between branches limits the model learning ability in these methods.

Commonly, the semantic segmentation task borrows a 'funnel' backbone pretrained on the image classification task, such as ResNet[11], Xception[8], DenseNet[13] and so on. For real-time inference, we adopt a lightweight backbone model and investigate how to improve the segmentation performance with limited computation. In mainstream semantic segmentation architectures, a pyramid-style feature combination step like Spatial Pyramid Pooling[34][5] is commonly applied to enhance scene interpretation, especially for various objects in multiple scales.

∗ The first two authors contributed equally to this work. This work was done when Hanchao Li was an intern at Megvii Technology.
Figure 2. Structure comparison. From left to right: (a) multi-branch, (b) spatial pyramid pooling, (c) feature reuse at the network level, and (d) feature reuse at the stage level. In comparison, the proposed feature reuse methods enrich features with high-level context from a different perspective.
These models have shown high-quality segmentation results on several benchmarks but usually need huge computing resources.

2. Related Work

Context Encoding: As SE-Net[12] explores channel information to learn channel-wise attention and has achieved state-of-the-art performance in image classification, the attention mechanism has become a powerful tool for deep neural networks[3]. It can be seen as a channel-wise selection that improves module feature representation. EncNet[32][20][6] introduces context encoding to enhance per-pixel prediction conditional on the encoded semantics. In this paper, we also propose a fully-connected module to enhance backbone performance, which has little impact on computation.

Feature Aggregation: Traditional approaches implement a single-path encoder-decoder network to solve pixel-to-pixel prediction. As the depth of the network increases, how to aggregate features between blocks deserves further attention. Instead of a simple skip-connection design, RefineNet[17] introduces a complicated refinement module in each upsampling stage between the encoder and decoder to extract multi-scale features. Another aggregation approach is to implement dense connections. The idea of dense connections was recently proposed for image classification in [13] and extended to semantic segmentation in [14][28]. DLA[31] extends this idea to develop deeper aggregation structures that enhance feature representation ability.

3. Deep Feature Aggregation Network

We start with our observation and analysis of the computation volume incurred when applying current semantic segmentation methods to the real-time task. This motivates our aggregation strategy, which combines detail and spatial information at different depth positions of the feature extraction network to achieve comparable performance. The whole architecture of the Deep Feature Aggregation Network (DFANet) is illustrated in Figure 3.

3.1. Observations

We take a brief overview of segmentation network structures, shown in Figure 2.

For real-time inference, [33][29] apply multiple branches to perform multi-scale extraction and preserve image spatial details. For example, BiSeNet[29] proposed a shallow network to process high-resolution images and a deep network with fast downsampling to strike a balance between classification ability and receptive field. This structure is displayed in Figure 2(a). Nevertheless, the drawback of these methods is obvious: the models fall short in dealing with high-level features combined from parallel branches, since they merely implement convolution layers to fuse features. Moreover, features lack communication between parallel branches. Also, the additional branches on high-resolution images limit the speed.

In the semantic segmentation task, the spatial pyramid pooling (SPP) module is a common approach to dealing with high-level features [5] (Figure 2(b)). The role of the spatial pyramid module is to extract high-level semantic context and increase the receptive field, as in [4][34][16]. However, implementing a spatial pyramid module is usually time-consuming.

Inspired by the above methods, we first replace the high-level operation by upsampling the output of one network and refining the feature map with another sub-network, as shown in Figure 2(c). Different from the SPP module, the feature maps are refined at a larger resolution and sub-pixel details are learned simultaneously. However, as the depth of the whole structure grows, high-dimension features and the receptive field usually suffer precision loss, since the feature flow is a single path.
Figure 3. Overview of our Deep Feature Aggregation Network: sub-network aggregation, sub-stage aggregation, and dual-path decoder for multi-level feature fusion. In the figure, "C" denotes concatenation and "×N" an N× up-sampling operation.
Pushing a bit further, we propose a stage-level method (Figure 2(d)) to deliver low-level features and spatial information to semantic understanding. Since all these sub-networks have a similar structure, stage-level refinement can be produced by concatenating layers of the same resolution to generate multi-stage context. Our proposed Deep Feature Aggregation Network aims to exploit features combined at both the network level and the stage level.

3.2. Deep Feature Aggregation

We focus on fusing features of different depths in the networks. Our aggregation strategy is composed of sub-network aggregation and sub-stage aggregation methods. The structure of DFANet is illustrated in Figure 3.

Sub-network Aggregation. Sub-network aggregation implements a combination of high-level features at the network level. Based on the above analysis, we implement our architecture as a stack of backbones by feeding the output of the previous backbone to the next. From another perspective, sub-network aggregation can be seen as a refinement process. A backbone process is defined as $y = \Phi(x)$; the output of encoder $\Phi_n$ is the input of encoder $\Phi_{n+1}$, so sub-network aggregation can be formulated as $Y = \Phi_n(\Phi_{n-1}(\cdots\Phi_1(X)))$.

A similar idea has been introduced in [21], whose structure is composed of a stack of encoder-decoder "hourglass" networks. Sub-network aggregation allows these high-level features to be processed again to further evaluate and reassess higher-order spatial relationships.

Sub-stage Aggregation. Sub-stage aggregation focuses on fusing semantic and spatial information at the stage level between multiple networks. As the depth of a network grows, spatial details suffer precision loss. Common approaches, like the U-shape, implement skip connections to recover image details in the decoder module. However, the deeper encoder blocks lack the low-level features and spatial information needed to make judgments on objects of largely varying scales and on precise structure edges. Parallel-branch designs use original and decreased resolutions as input, and the output is the fusion of large-scale-branch and small-scale-branch results, yet this kind of design lacks information communication between the parallel branches.

Our sub-stage aggregation is proposed to combine features throughout the encoding period. We fuse different stages at the same depth of the sub-networks. In detail, the output of a certain stage in the previous sub-network contributes to the input of the corresponding stage in the next sub-network.

For a single backbone $\Phi_n(x)$, a stage process can be defined as $\phi_n^i$, and the corresponding stage in the previous backbone network is $\phi_{n-1}^i$, where $i$ indexes the stage. The sub-stage aggregation method can be formulated as:

$$
x_n^i =
\begin{cases}
x_n^{i-1} + \phi_n^i(x_n^{i-1}) & \text{if } n = 1,\\
[x_n^{i-1},\, x_{n-1}^i] + \phi_n^i([x_n^{i-1},\, x_{n-1}^i]) & \text{otherwise,}
\end{cases}
\tag{1}
$$

where $x_{n-1}^i$ comes from:

$$
x_{n-1}^i = x_{n-1}^{i-1} + \phi_{n-1}^i(x_{n-1}^{i-1}).
\tag{2}
$$

Traditional approaches learn a mapping of $F(x) + x$ for $x_n^{i-1}$. In our proposed method, sub-stage aggregation instead learns a residual formulation of $[x_n^{i-1}, x_{n-1}^i]$ at the beginning of each stage.

For the $n > 1$ case, the input of the $i$th stage in the $n$th network is given by combining the $i$th stage output of the $(n-1)$th network with the $(i-1)$th stage output of the $n$th network; the $i$th stage then learns a residual representation of $[x_n^{i-1}, x_{n-1}^i]$. $x_n^{i-1}$ has the same resolution as $x_{n-1}^i$, and we implement a concatenation operation to fuse features.
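For concreteness, the following PyTorch-style sketch stacks several identical sub-networks and concatenates same-depth stage outputs, in the spirit of Eq. (1) and Eq. (2). It is a minimal illustration, not the authors' released code: plain strided convolutions stand in for the Xception stages, the per-stage residual terms of Eq. (1) are folded into the stage blocks, and the channel widths, stage count, and ×4 inter-network upsampling factor are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_stage(in_ch, out_ch):
    # Stand-in for one encoder stage; the paper uses depthwise-separable
    # Xception blocks. Stride 2 halves the spatial resolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class AggregatedEncoder(nn.Module):
    """Sub-network + sub-stage aggregation in the spirit of Eqs. (1)-(2).
    Channel widths, stage count, and the x4 upsampling between
    sub-networks are illustrative assumptions."""
    def __init__(self, chs=(16, 32, 64), n_nets=3):
        super().__init__()
        self.nets = nn.ModuleList()
        for n in range(n_nets):
            stages = nn.ModuleList()
            in_ch = 3 if n == 0 else chs[-1]
            for i, c in enumerate(chs):
                # For n > 1 a stage consumes [x_n^{i-1}, x_{n-1}^i],
                # so its input width grows by the previous net's chs[i].
                extra = chs[i] if n > 0 else 0
                stages.append(conv_stage(in_ch + extra, c))
                in_ch = c
            self.nets.append(stages)

    def forward(self, x):
        prev_feats = None
        for stages in self.nets:
            if prev_feats is not None:
                # Sub-network aggregation: upsample the previous
                # network's output and refine it with the next one.
                x = F.interpolate(x, scale_factor=4, mode='bilinear',
                                  align_corners=False)
            feats = []
            for i, stage in enumerate(stages):
                if prev_feats is not None:
                    # Sub-stage aggregation, Eq. (1) for n > 1: concat
                    # the same-depth feature of the previous sub-network.
                    x = torch.cat([x, prev_feats[i]], dim=1)
                x = stage(x)
                feats.append(x)
            prev_feats = feats
        return prev_feats  # stage features of the last sub-network

feats = AggregatedEncoder()(torch.randn(1, 3, 256, 256))
```

With stride-2 stages and a ×4 upsampling between sub-networks, $x_n^{i-1}$ and $x_{n-1}^i$ automatically share the same spatial resolution, which is what makes the channel-wise concatenation in Eq. (1) well defined.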
We keep the features always flowing from high resolution to low resolution. Our formulation not only learns a new mapping of the $n$th feature maps but also preserves the $(n-1)$th features and receptive field. Information flow can thus be transferred through multiple networks.

3.3. Network Architecture

The whole architecture is shown in Figure 3. In general, our semantic segmentation network can be seen as an encoder-decoder structure. As discussed above, the encoder is an aggregation of three Xception backbones, composed with the sub-network and sub-stage aggregation methods. For real-time inference, we do not put too much focus on the decoder: it is designed as an efficient feature upsampling module to fuse low-level and high-level features. For convenience in implementing our aggregation strategy, each sub-network is implemented as a backbone with a single bilinear upsampling as a naive decoder. All these backbones have the same structure and are initialized with the same pretrained weights.

Backbone. The basic backbone is a lightweight Xception model with little modification for the segmentation task; we discuss the network configuration in the next section. For semantic segmentation, beyond providing dense feature representation, how to gain semantic context effectively remains a problem. Therefore, we preserve the fully-connected layers from ImageNet pretraining to enhance semantic extraction. In the classification task, a fully-connected (FC) layer follows global pooling layers to produce the final probability vector. Since the classification dataset [15] provides far more categories than segmentation datasets [10][36], a fully-connected layer from ImageNet pretraining can be more powerful at extracting category information than one trained on segmentation datasets. We apply a 1 × 1 convolution layer after the FC layer to reduce channels so as to match the feature maps from the Xception backbone. The resulting N × C × 1 × 1 encoding vector is then multiplied with the original extracted features in a channel-wise manner.
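A minimal sketch of this FC attention module, under our reading of the text: global pooling feeds the fully-connected layer, a 1 × 1 convolution reduces the encoding to C channels, and the N × C × 1 × 1 vector re-weights the backbone features channel-wise. The channel sizes are illustrative, and in practice the linear layer's weights would be loaded from the ImageNet-pretrained classifier rather than initialized randomly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCAttention(nn.Module):
    """Sketch of the FC attention module: global pooling feeds a
    fully-connected layer (in practice kept from ImageNet pretraining;
    here randomly initialized), and a 1x1 convolution reduces the
    encoding back to C channels for channel-wise re-weighting."""
    def __init__(self, in_ch, fc_dim=1000):  # fc_dim: ImageNet classes
        super().__init__()
        self.fc = nn.Linear(in_ch, fc_dim)
        self.conv = nn.Conv2d(fc_dim, in_ch, 1)

    def forward(self, x):
        n, c = x.shape[:2]
        vec = F.adaptive_avg_pool2d(x, 1).view(n, c)  # global pooling
        vec = self.fc(vec).view(n, -1, 1, 1)          # N x fc_dim x 1 x 1
        att = self.conv(vec)                          # N x C x 1 x 1
        return x * att  # channel-wise multiplication with the features

y = FCAttention(192)(torch.randn(2, 192, 32, 32))  # same shape as input
```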
Decoder. Our proposed decoder module is illustrated in Figure 3. For real-time inference, we do not put too much focus on designing a complicated decoder module. Following DeepLabV3+[7], not all features of the stages need to contribute to the decoder module. We propose to fuse high-level and low-level features directly. Because our encoder is composed of three backbones, we first fuse the high-level representations from the bottom of the three backbones. These high-level features are then bilinearly upsampled by a factor of 4, and the low-level information from each backbone with the same spatial resolution is fused respectively. The high-level features and low-level details are then added together and upsampled by a factor of 4 to make the final prediction. In the decoder module, we only implement a few convolution calculations to reduce the number of channels.

4. Experiments

While our proposed network is effective for high-resolution images, we evaluate it on two challenging benchmarks: Cityscapes and CamVid. The image resolutions of these two datasets are up to 2048 × 1024 and 960 × 720 respectively, which makes them a big challenge for real-time semantic segmentation. In the following, we first investigate the effects of the proposed architecture, then report accuracy and speed results on Cityscapes and CamVid compared with the existing real-time segmentation algorithms.

All the networks mentioned below follow the same training strategy. They are trained using mini-batch stochastic gradient descent (SGD) with batch size 48, momentum 0.9 and weight decay 1e−5. As a common configuration, the "poly" learning rate policy is adopted, where the initial rate is multiplied by $(1 - \frac{iter}{max\_iter})^{power}$ with power 0.9; the base learning rate is set to 2e−1. The cross-entropy error at each pixel over the categories is applied as our loss function. Data augmentation contains mean subtraction, random horizontal flip, random resizing with scales in [0.75, 1.75], and random cropping into a fixed size for training.
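This training recipe translates into a few lines of PyTorch. The sketch below is our assembly of the stated settings (SGD, momentum 0.9, weight decay 1e−5, base learning rate 0.2, poly schedule with power 0.9, per-pixel cross-entropy); the model, the dummy data, and the void-label index of 255 are placeholders and assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Sketch of the training configuration described above. The model and
# data are stand-ins (the paper trains DFANet with batch size 48);
# only the optimizer, schedule, and loss follow the stated settings.
model = nn.Conv2d(3, 19, 1)  # placeholder for DFANet (19 classes)
optimizer = torch.optim.SGD(model.parameters(), lr=2e-1,
                            momentum=0.9, weight_decay=1e-5)
criterion = nn.CrossEntropyLoss(ignore_index=255)  # 255 = void (assumed)

base_lr, power, max_iter = 2e-1, 0.9, 40000  # 40K iters, per the paper
for it in range(max_iter):
    # "poly" policy: lr = base_lr * (1 - it / max_iter) ** power
    lr = base_lr * (1 - it / max_iter) ** power
    for group in optimizer.param_groups:
        group['lr'] = lr
    images = torch.randn(2, 3, 128, 128)          # stand-in batch
    labels = torch.randint(0, 19, (2, 128, 128))  # stand-in labels
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```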
4.1. Analysis of DFA Architecture

We first adopt Cityscapes to conduct quantitative and qualitative analysis. Cityscapes comprises a large, diverse set of stereo video sequences recorded in the streets of 50 different cities, containing 30 classes, 19 of which are considered for training and evaluation. The dataset contains 5,000 finely annotated images and 19,998 images with coarse annotation, all at a high resolution of 2048 × 1024. Following the standard setting of Cityscapes, the finely annotated images are split into training, validation and testing sets with 2,979, 500 and 1,525 images respectively. We only use the finely annotated images during training and stop the training process after 40K iterations.

Model performance is evaluated on the Cityscapes validation set. For fair comparison, we conduct the ablation study under a 1024 × 1024 crop size. In this process, we do not employ any testing augmentation, such as multi-scale or multi-crop testing, for best result quality. For quantitative evaluation, the mean of class-wise intersection over union (mIoU) and the number of floating-point operations (FLOPs) are used to measure accuracy and computation complexity respectively.

4.1.1 Lightweight Backbone Networks

As mentioned above, the backbone network is one of the major limitations on model acceleration. However, too small a backbone network leads to serious degradation of segmentation accuracy. Xception, designed with a lightweight architecture, is known to achieve a better speed-accuracy trade-off.
stage | Xception A                       | Xception B
conv1 | 3×3, 8, stride 2                 | 3×3, 8, stride 2
enc2  | [3×3, 12; 3×3, 12; 3×3, 48] × 4  | [3×3, 8; 3×3, 8; 3×3, 32] × 4
enc3  | [3×3, 24; 3×3, 24; 3×3, 96] × 6  | [3×3, 16; 3×3, 16; 3×3, 64] × 6
enc4  | [3×3, 48; 3×3, 48; 3×3, 192] × 4 | [3×3, 32; 3×3, 32; 3×3, 128] × 4

Table 1. Modified Xception architecture. Building blocks are shown in brackets with the number of blocks stacked. 3×3 denotes a depthwise separable convolution except in "conv1", where we only implement a plain 3 × 3 convolution layer.

Model             | Scale | FLOPs  | mIoU(%)
ResNet-50         | 0.25  | 9.3G   | 64.5
ResNet-50         | 1.0   | 149.2G | 68.3
ResNet-50 + ASPP  | 1.0   | 214.4G | 72.1
Xception A        | 1.0   | 1.6G   | 59.2
Xception A + ASPP | 1.0   | 6.9G   | 67.1
Xception B        | 1.0   | 0.83G  | 55.4
Xception B + ASPP | 1.0   | 4.4G   | 64.7
Backbone A        | 1.0   | 1.6G   | 65.4
Backbone B        | 1.0   | 0.83G  | 59.2

Table 2. Different structures with or without ASPP, evaluated on the Cityscapes val dataset. 'Backbone' means the Xception network followed by the FC attention module. 'Scale' means the scaling ratio of the input image.
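The bracketed blocks in Table 1 can be read as repeated groups of three depthwise-separable 3 × 3 convolutions. The sketch below builds the Xception A body with the channel numbers of Table 1; the stride placement is our assumption, and the residual shortcuts of the original Xception design are omitted for brevity.

```python
import torch
import torch.nn as nn

def sep_conv(in_ch, out_ch, stride=1):
    # 3x3 depthwise-separable convolution (depthwise + pointwise).
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def enc_stage(in_ch, mid_ch, out_ch, repeats, stride=2):
    # One enc stage of Table 1: `repeats` blocks of three separable
    # convolutions [3x3,mid; 3x3,mid; 3x3,out]. Putting the stride in
    # the first block is our assumption.
    blocks, c = [], in_ch
    for r in range(repeats):
        blocks += [sep_conv(c, mid_ch, stride if r == 0 else 1),
                   sep_conv(mid_ch, mid_ch),
                   sep_conv(mid_ch, out_ch)]
        c = out_ch
    return nn.Sequential(*blocks)

# Xception A body (channel numbers read directly from Table 1):
backbone_a = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1),  # conv1: plain 3x3
    enc_stage(8, 12, 48, repeats=4),          # enc2
    enc_stage(48, 24, 96, repeats=6),         # enc3
    enc_stage(96, 48, 192, repeats=4),        # enc4
)
out = backbone_a(torch.randn(1, 3, 224, 224))  # -> 1 x 192 x 14 x 14
```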
Model         | FLOPs | Params | mIoU(%)
Backbone A    | 1.6G  | 2.1M   | 65.4
Backbone A ×2 | 2.4G  | 4.9M   | 66.3
Backbone A ×3 | 2.6G  | 7.6M   | 65.1
Backbone A ×4 | 2.7G  | 10.2M  | 50.8
Backbone B    | 0.83G | 1.4M   | 59.2
Backbone B ×2 | 1.2G  | 3.1M   | 62.1
Backbone B ×3 | 1.4G  | 4.7M   | 58.2
Backbone B ×4 | 1.5G  | 6.3M   | 50.7

Table 3. Detailed performance comparison of our proposed aggregation strategy. '×N' means that we replicate N backbones to implement feature aggregation.

Model                   | FLOPs | Params | mIoU(%)
Backbone A ×2           | 2.4G  | 4.9M   | 66.3
Backbone A ×2 + HL      | 2.5G  | 5.0M   | 67.1
Backbone A ×2 + HL + LL | 3.2G  | 5.1M   | 69.4
Backbone A ×3           | 2.6G  | 7.6M   | 65.1
Backbone A ×3 + HL      | 2.7G  | 7.7M   | 69.6
Backbone A ×3 + HL + LL | 3.4G  | 7.8M   | 71.9
Backbone B ×3           | 1.4G  | 4.7M   | 58.2
Backbone B ×3 + HL      | 1.5G  | 4.9M   | 67.6
Backbone B ×3 + HL + LL | 2.1G  | 4.9M   | 68.4

Table 4. Detailed performance comparison of our proposed decoder module. 'HL' means fusing high-level features; 'LL' means fusing low-level features.
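The 'HL' and 'LL' variants of Table 4 correspond to the two fusion paths of the decoder described in Section 3.3. A minimal sketch follows, with assumed channel sizes and inputs assumed to be pre-aligned in spatial size:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightDecoder(nn.Module):
    """Sketch of the decoder of Section 3.3: fuse the high-level (HL)
    outputs of the three backbones, upsample x4, add the fused
    low-level (LL) features, then upsample x4 again for the final
    score map. All channel numbers are illustrative."""
    def __init__(self, hl_ch=192, ll_ch=48, n_classes=19):
        super().__init__()
        self.hl_conv = nn.Conv2d(hl_ch, n_classes, 1)  # channel reduce
        self.ll_conv = nn.Conv2d(ll_ch, n_classes, 1)

    def forward(self, hl_feats, ll_feats):
        hl = sum(self.hl_conv(f) for f in hl_feats)  # fuse HL path
        hl = F.interpolate(hl, scale_factor=4,
                           mode='bilinear', align_corners=False)
        ll = sum(self.ll_conv(f) for f in ll_feats)  # fuse LL path
        out = hl + ll                                # add HL and LL
        return F.interpolate(out, scale_factor=4,
                             mode='bilinear', align_corners=False)

# Assumed: three 1/32-resolution HL maps and three 1/8-resolution LL maps.
hl = [torch.randn(1, 192, 16, 16) for _ in range(3)]
ll = [torch.randn(1, 48, 64, 64) for _ in range(3)]
pred = LightDecoder()(hl, ll)  # -> 1 x 19 x 256 x 256
```

Reducing every path to n_classes channels first keeps the decoder to a few 1 × 1 convolutions, in line with the paper's emphasis on a cheap decoder.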
Model          | InputSize   | FLOPs  | Params | Time(ms) | Frame(fps) | mIoU(%)
PSPNet[34]     | 713 × 713   | 412.2G | 250.8M | 1288     | 0.78       | 81.2
DeepLab[4]     | 512 × 1024  | 457.8G | 262.1M | 4000     | 0.25       | 63.1
SegNet[1]      | 640 × 360   | 286G   | 29.5M  | 60       | 16.7       | 57.0
ENet[22]       | 640 × 360   | 3.8G   | 0.4M   | 7        | 135.4      | 57.0
SQ[25]         | 1024 × 2048 | 270G   | -      | 60       | 16.7       | 59.8
CRF-RNN[35]    | 512 × 1024  | -      | -      | 700      | 1.4        | 62.5
FCN-8S[19]     | 512 × 1024  | 136.2G | -      | 500      | 2.0        | 63.1
FRRN[24]       | 512 × 1024  | 235G   | -      | 469      | 2.1        | 71.8
ICNet[33]      | 1024 × 2048 | 28.3G  | 26.5M  | 33       | 30.3       | 69.5
TwoColumn[27]  | 512 × 1024  | 57.2G  | -      | 68       | 14.7       | 72.9
BiSeNet1[29]   | 768 × 1536  | 14.8G  | 5.8M   | 13       | 72.3       | 68.4
BiSeNet2[29]   | 768 × 1536  | 55.3G  | 49M    | 21       | 45.7       | 74.7
DFANet A       | 1024 × 1024 | 3.4G   | 7.8M   | 10       | 100        | 71.3
DFANet B       | 1024 × 1024 | 2.1G   | 4.8M   | 8        | 120        | 67.1
DFANet A'      | 512 × 1024  | 1.7G   | 7.8M   | 6        | 160        | 70.3

Table 5. Speed analysis on the Cityscapes test dataset. "-" indicates that the corresponding result is not provided by the method.
listed for comparison in the table. In this process, we do not employ any testing augmentation.

As can be observed, while the inference speed of the proposed method significantly outperforms state-of-the-art methods, its accuracy remains comparable, owing to the simple and efficient pipeline. The baseline of the proposed method achieves 71.3% mIoU on the Cityscapes test set at 100 FPS inference speed. We extend the proposed method in two aspects: the input size and the channel dimension. When the backbone model is decreased to a simplified one, the accuracy of DFANet drops to 67.1% while still running at 120 FPS, which is comparable with the previous state-of-the-art result of 68.4% by BiSeNet[29]. Meanwhile, when the height of the input image is downsampled by half, the FLOPs of DFANet A drop to 1.7G, yet the accuracy is still good enough to outperform several existing methods. The fastest setting of our method runs at 160 FPS with 70.3% mIoU, while the previous fastest result[22] is only 135.4 FPS at 57% mIoU. Compared with the previous state-of-the-art model[29], the proposed DFANet A, B and A' achieve 1.38×, 1.65× and 2.21× speed acceleration with only 1/4, 1/7 and 1/8 of the FLOPs, and even slightly better segmentation accuracy. Some visual results of the proposed DFANet A are shown in Figure 4. With the proposed feature aggregation structure, we produce decent prediction results on Cityscapes.

4.3. Comparison on Other Datasets

We also evaluate our DFANet on the CamVid dataset. CamVid contains images extracted from video sequences with resolution up to 960 × 720. It contains 701 images in total, including 367 for training, 101 for validation and 233 for testing. We adopt the same setting as [23]. The image resolution for training and evaluation is 960 × 720. The results are reported in Table 6. DFANet achieves much faster inference speeds of 120 FPS and 160 FPS than other methods at this high resolution, with accuracy only slightly worse than the state-of-the-art methods[33].

Model        | Time(ms) | Frame(fps) | mIoU(%)
SegNet[1]    | 217      | 4.6        | 46.4
DPN[30]      | 830      | 1.2        | 60.1
DeepLab[4]   | 203      | 4.9        | 61.6
ENet[22]     | -        | -          | 51.3
ICNet[33]    | 36       | 27.8       | 67.1
BiSeNet1[29] | -        | -          | 65.6
BiSeNet2[29] | -        | -          | 68.7
DFANet A     | 8        | 120        | 64.7
DFANet B     | 6        | 160        | 59.3

Table 6. Results on the CamVid test set.

5. Conclusion

In this paper, we propose deep feature aggregation to tackle real-time semantic segmentation on high-resolution images. Our aggregation strategy connects a set of convolution layers to effectively refine high-level and low-level features, without any specifically designed operation. Analysis and quantitative experimental results on the Cityscapes and CamVid datasets are presented to demonstrate the effectiveness of our method.

Acknowledgements. This research was supported by the National Key R&D Program of China (No. 2017YFA0700800) and the National Key Research and Development Program of China (2018YFC0831700).
References

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

[2] Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV (1), pages 44–57, 2008.

[3] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. arXiv preprint arXiv:1611.05594, 2016.

[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.

[5] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

[6] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3640–3649, 2016.

[7] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611, 2018.

[8] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2017.

[9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.

[10] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[12] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.

[13] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 3, 2017.

[14] Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, and Yoshua Bengio. The one hundred layers tiramisu: Fully convolutional DenseNets for semantic segmentation. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1175–1183. IEEE, 2017.

[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[16] Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid attention network for semantic segmentation. arXiv preprint arXiv:1805.10180, 2018.

[17] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[19] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[20] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.

[21] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.

[22] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.

[23] Paul Sturgess, Karteek Alahari, and Philip H. S. Torr. Combining appearance and structure from motion features for road scene understanding. In BMVC, 2009.

[24] Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe. Full-resolution residual networks for semantic segmentation in street scenes. arXiv preprint, 2017.

[25] Michael Treml, José Arjona-Medina, Thomas Unterthiner, Rupesh Durgesh, Felix Friedmann, Peter Schuberth, Andreas Mayr, Martin Heusel, Markus Hofmarcher, Michael Widrich, et al. Speeding up semantic segmentation for autonomous driving. In MLITS, NIPS Workshop, 2016.

[26] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. ESPnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015, 2018.

[27] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. Real-time semantic image segmentation via spatial sparsity. arXiv preprint arXiv:1712.00213, 2017.

[28] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. DenseASPP for semantic segmentation in street scenes.

[29] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. arXiv preprint arXiv:1808.00897, 2018.

[30] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. arXiv preprint arXiv:1804.09337, 2018.

[31] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. arXiv preprint arXiv:1707.06484, 2017.

[32] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. arXiv preprint arXiv:1803.08904, 2018.

[33] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. ICNet for real-time semantic segmentation on high-resolution images. arXiv preprint arXiv:1704.08545, 2017.

[34] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.

[35] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.

[36] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442, 2016.