
Semantic Flow for Fast and Accurate Scene

Parsing

Xiangtai Li1*, Ansheng You1*, Zhen Zhu2, Houlong Zhao3, Maoke Yang3,
Kuiyuan Yang3, Shaohua Tan1, and Yunhai Tong1

1 Key Laboratory of Machine Perception, MOE, School of EECS, Peking University
2 Huazhong University of Science and Technology
3 DeepMotion

arXiv:2002.10120v3 [cs.CV] 29 Mar 2021

Abstract. In this paper, we focus on designing an effective method for fast and accurate scene parsing. A common practice to improve performance is to attain high-resolution feature maps with strong semantic representation. Two widely used strategies, atrous convolutions and feature pyramid fusion, are either computationally intensive or ineffective. Inspired by optical flow for motion alignment between adjacent video frames, we propose a Flow Alignment Module (FAM) to learn Semantic Flow between feature maps of adjacent levels and to broadcast high-level features to high-resolution features effectively and efficiently. Furthermore, integrating our module into a common feature pyramid structure exhibits superior performance over other real-time methods, even on light-weight backbone networks such as ResNet-18. Extensive experiments are conducted on several challenging datasets, including Cityscapes, PASCAL Context, ADE20K and CamVid. In particular, our network is the first to achieve 80.4% mIoU on Cityscapes with a frame rate of 26 FPS. The code is available at https://github.com/lxtGH/SFSegNets.

Keywords: Scene Parsing, Semantic Flow, Flow Alignment Module

1 Introduction

Scene parsing or semantic segmentation is a fundamental vision task which aims


to classify each pixel in the images correctly. Two important factors that are
highly influential to the performance are: detailed information [48] and strong
semantic representation [6, 67]. The seminal work of Long et al. [35] built a deep Fully Convolutional Network (FCN), which is mainly composed of convolutional layers, in order to carve strong semantic representation. However,
detailed object boundary information, which is also crucial to the performance,
is usually missing due to the use of the down-sampling layers. To alleviate this
problem, state-of-the-art methods [16, 67, 68, 71] apply atrous convolutions [58]
at the last several stages of their networks to yield feature maps with strong
semantic representation while at the same time maintaining the high resolution.
* The first two authors contributed equally.

[Figure 1 plot: accuracy (mIoU %) on the y-axis versus inference speed (frames per second) on the x-axis for SFNet variants and previous models.]

Fig. 1. Inference speed versus mIoU performance on the Cityscapes test set. Previous models are marked as red points, and our models are shown as blue points, which achieve the best speed/accuracy trade-off. Note that our method with ResNet-18 as the backbone achieves accuracy comparable to all accurate models at a much faster speed.

Nevertheless, doing so inevitably requires intensive extra computation since


the feature maps in the last several layers can be up to 64 times larger than those in FCNs. Given that an FCN using ResNet-18 [20] as the backbone
network has a frame rate of 57.2 FPS for a 1024 × 2048 image, after applying
atrous convolutions [58] to the network as done in [67, 68], the modified network
only has a frame rate of 8.7 FPS. Moreover, under a single GTX 1080Ti GPU
with no other ongoing programs, the previous state-of-the-art model PSPNet [67]
has a frame rate of only 1.6 FPS for 1024×2048 input images. As a consequence, this is very problematic for many advanced real-world applications, such as self-driving cars and robot navigation, which demand real-time online data processing.
In order to not only maintain detailed resolution information but also get fea-
tures that exhibit strong semantic representation, another direction is to build
FPN-like [25, 34, 48] models which leverage the lateral path to fuse feature maps
in a top-down manner. In this way, the deep features of the last several layers
strengthen the shallow features with high resolution, and therefore the refined features can satisfy the above two factors and benefit accuracy. However, the accuracy of these methods [1, 48] is still unsatisfactory when compared to networks that hold large feature maps in
the last several stages. We suspect the low accuracy problem arises from the
ineffective propagation of semantics from deep layers to shallow layers.
To mitigate this issue, we propose to learn the Semantic Flow between
two network layers of different resolutions. The concept of Semantic Flow is
inspired by optical flow, which is widely used in video processing tasks [70] to represent the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by relative motion. We observe that the relationship between two feature maps of arbitrary resolutions from the same image can also be represented with the "motion" of every pixel from one feature
map to the other one. In this case, once precise Semantic Flow is obtained,
the network is able to propagate semantic features with minimal information
loss. It should be noted that Semantic Flow is apparently different from optical
flow, since Semantic Flow takes feature maps from different levels as input and
assesses the discrepancy within them to find a suitable flow field that dynamically indicates how to align these two feature maps effectively.
Based on the concept of Semantic Flow, we design a novel network module
called Flow Alignment Module (FAM) to utilize Semantic Flow in the scene
parsing task. Feature maps after FAM carry both rich semantics and abundant spatial information. Because FAM can effectively transmit the
semantic information from deep layers to shallow layers through very simple op-
erations, it shows superior efficacy in both improving accuracy and maintaining high efficiency. Moreover, FAM is end-to-end trainable and can be plugged
into any backbone networks to improve the results with a minor computational
overhead. For simplicity, we refer to the networks that incorporate FAM on different backbones as SFNet(backbone). As depicted in Figure 1, SFNet with
different backbone networks outperforms other competitors by a large margin
under the same speed. In particular, our method adopting ResNet-18 as back-
bone achieves 80.4% mIoU on the Cityscapes test server with a frame rate of 26
FPS. When adopting DF2 [31] as backbone, our method achieves 77.8% mIoU
with 61 FPS and 74.5% mIoU with 121 FPS when equipped with the DF1 back-
bone. Moreover, when using deeper backbone networks, such as ResNet-101,
SFNet achieves better results (81.8% mIoU) than the previous state-of-the-art model DANet [16] (81.5% mIoU), while only requiring 33% of the computation of DANet
during the inference. Besides, the consistent superior efficacy of SFNet across
various datasets also clearly demonstrates its broad applicability.
To conclude, our main contributions are three-fold:
– We introduce the concept of Semantic Flow into the field of scene parsing and propose a novel flow-based alignment module (FAM) to learn the Semantic Flow
between feature maps of adjacent levels and broadcast high-level features to
high resolution features more effectively and efficiently.
– We insert FAMs into the feature pyramid framework and build a feature
pyramid aligned network called SFNet for fast and accurate scene parsing.
– Detailed experiments and analysis indicate the efficacy of our proposed module in both improving accuracy and keeping the network light-weight. We achieve state-of-the-art results on the Cityscapes, Pascal Context and CamVid datasets, and a considerable gain on ADE20K.

2 Related Work
For scene parsing, there are mainly two paradigms for high-resolution semantic
map prediction. One paradigm tries to keep both spatial and semantic infor-
mation along the main network pathway, while the other paradigm distributes
spatial and semantic information to different parts in a network, then merges
them back via different strategies.
The first paradigm mostly relies on some network operations to retain high-
resolution feature maps in the latter network stages. Many state-of-the-art accurate methods [16, 67, 71] follow this paradigm to design sophisticated head
networks to capture contextual information. PSPNet [67] proposes to leverage
pyramid pooling module (PPM) to model multi-scale contexts, whilst the DeepLab series [5–7, 55] uses atrous spatial pyramid pooling (ASPP). In [16, 18, 19,
21, 29, 59, 72], non-local operator [52] and self-attention mechanism [51] are
adopted to harvest pixel-wise context from the whole image. Meanwhile, several
works [24, 28, 32, 62, 63] use graph convolutional neural networks to propagate
information over the image by projecting features into an interaction space.

The second paradigm contains several state-of-the-art fast methods, where


high-level semantics are represented by low-resolution feature maps. A common
strategy is to fuse multi-level feature maps for high-resolution spatiality and
strong semantics [1,30,35,48,54]. ICNet [66] uses multi-scale images as input and
a cascade network to be more efficient. DFANet [27] utilizes a light-weight back-
bone to speed up its network and proposes a cross-level feature aggregation to
boost accuracy, while SwiftNet [44] uses lateral connections as the cost-effective
solution to restore the prediction resolution while maintaining the speed. To
further speed up, low-resolution images are used as input for high-level seman-
tics [37, 66] which reduce features into low resolution and then upsample them
back by a large factor. The direct consequence of using a large upsample factor
is performance degradation, especially for small objects and object boundaries.
Guided upsampling [37] is related to our method, where the semantic map is
upsampled back to the input image size guided by the feature map from an
early layer. However, this guidance is still insufficient for some cases due to the
information gap between the semantics and resolution. In contrast, our method
aligns feature maps from adjacent levels and further enhances the feature maps
using a feature pyramid framework towards both high resolution and strong se-
mantics, consequently resulting in the state-of-the-art performance considering
the trade-off between high accuracy and fast speed.

There is another set of works focusing on designing light-weight backbone net-


works to achieve real-time performances. ESPNets [38, 39] save computation by
decomposing standard convolution into point-wise convolution and spatial pyra-
mid of dilated convolutions. BiSeNet [56] introduces spatial path and semantic
path to reduce computation. Recently, several methods [31, 41, 65] use AutoML
techniques to search efficient architectures for scene parsing. Our method is com-
plementary to some of these works and can further boost their accuracy. Since our proposed semantic flow is inspired by optical flow [13], which is used in video semantic segmentation, we also discuss several works in that area. For accurate results, temporal information is extensively exploited by using optical flow. Gadde et al. [17] warp internal feature maps and Nilsson et al. [43] warp final semantic maps from nearby frame predictions to the cur-
rent map. To pursue faster speed, optical flow is used to bypass the low-level
feature computation of some frames by warping features from their preceding
frames [33, 70]. Our work is different from theirs by propagating information
hierarchically in another dimension, which is orthogonal to the temporal propa-
gation for videos.

[Figure 2 panels: low-resolution feature map, semantic flow field (flow color coding), feature after warp, high-resolution feature map, with bilinear upsampling shown for comparison.]

Fig. 2. Visualization of feature maps and semantic flow field in FAM. Feature maps are
visualized by averaging along the channel dimension. Larger values are denoted by hot
colors and vice versa. We use the color code proposed in [2] to visualize the Semantic
Flow field. The orientation and magnitude of flow vectors are represented by hue and
saturation respectively.

3 Method
In this section, we will first give some preliminary knowledge about scene pars-
ing and introduce the misalignment problem therein. Then, we propose the Flow
Alignment Module (FAM) to resolve the misalignment issue by learning Seman-
tic Flow and warping top-layer feature maps accordingly. Finally, we present
the whole network architecture equipped with FAMs based on the FPN frame-
work [34] for fast and accurate scene parsing.

3.1 Preliminary
The task of scene parsing is to map an RGB image X ∈ R^{H×W×3} to a semantic map Y ∈ R^{H×W×C} with the same spatial resolution H × W, where C is the number of predefined semantic categories. Following the setting of FPN [34], the input image X is first mapped to a set of feature maps {F_l}_{l=2,...,5} from each network stage, where F_l ∈ R^{H_l×W_l×C_l} is a C_l-dimensional feature map defined on a spatial grid Ω_l of size H_l × W_l, with H_l = H/2^l and W_l = W/2^l. The coarsest feature map F5 comes from the deepest layer with the strongest semantics. FCN-32s
directly predicts upon F5 and achieves over-smoothed results without fine de-
tails. However, some improvements can be achieved by fusing predictions from
lower levels [35]. FPN takes a step further to gradually fuse high-level feature
maps with low-level feature maps in a top-down pathway through 2× bi-linear
upsampling, which was originally proposed for object detection [34] and recently
introduced for scene parsing [25, 54]. The whole FPN framework highly relies on
upsampling operator to upsample the spatially smaller but semantically stronger
feature map to be larger in spatial size. However, the bilinear upsampling re-
covers the resolution of downsampled feature maps by interpolating a set of
uniformly sampled positions (i.e., it can only handle one kind of fixed and pre-
defined misalignment), while the misalignment between feature maps, caused by residual connections and repeated downsampling and upsampling, is far more
complex. Therefore, position correspondence between feature maps needs to be
explicitly and dynamically established to resolve their actual misalignment.
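To make these preliminaries concrete, the following PyTorch sketch (our illustration, not the authors' released code; the ResNet-18-style channel widths are an assumption) builds the plain FPN top-down pathway described above, where every stage is compressed by a 1×1 lateral convolution and the coarser map is merged by fixed 2× bilinear upsampling, the step that FAM later replaces.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PlainFPNTopDown(nn.Module):
    """Standard FPN top-down pathway with fixed bilinear upsampling (baseline sketch)."""

    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions compress every stage to the same channel depth.
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, feats):
        # feats: [F2, F3, F4, F5] with spatial sizes H/4, H/8, H/16, H/32.
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        fused = [laterals[-1]]                     # start from the coarsest map F5
        for lateral in reversed(laterals[:-1]):    # F4 -> F3 -> F2
            up = F.interpolate(fused[-1], size=lateral.shape[-2:],
                               mode="bilinear", align_corners=False)
            fused.append(lateral + up)             # element-wise addition as in FPN
        return list(reversed(fused))               # [P2, P3, P4, P5], finest first

if __name__ == "__main__":
    # Quick shape check with a 512x1024 input downsampled by 4, 8, 16 and 32.
    feats = [torch.randn(1, c, 512 // s, 1024 // s)
             for c, s in zip((64, 128, 256, 512), (4, 8, 16, 32))]
    print([tuple(p.shape) for p in PlainFPNTopDown()(feats)])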

Fig. 3. (a) The details of Flow Alignment Module. We combine the transformed high-
resolution feature map and low-resolution feature map to generate the semantic flow
field, which is utilized to warp the low-resolution feature map to high-resolution feature
map. (b) Warp procedure of Flow Alignment Module. The value of the high-resolution
feature map is the bilinear interpolation of the neighboring pixels in low-resolution
feature map, where the neighborhoods are defined according learned semantic flow
field. (c) Overview of our proposed SFNet. ResNet-18 backbone with four stages is
used for exemplar illustration. FAM: Flow Alignment Module. PPM: Pyramid Pooling
Module [67]. Best view it in color and zoom in.

3.2 Flow Alignment Module

Design Motivation. For more flexible and dynamic alignment, we thoroughly


investigate the idea of optical flow, which is very effective and flexible to align
two adjacent video frame features in the video processing task [4, 70]. The idea
of optical flow motivates us to design a flow-based alignment module (FAM)
to align feature maps of two adjacent levels by predicting a flow field inside the
network. We define such flow field as Semantic Flow, which is generated between
different levels in a feature pyramid. For efficiency, when designing this module we follow the light-weight FlowNet-S [13] design rather than a heavier correlation-based one.
Module Details. FAM is built within the FPN framework, where feature map
of each level is compressed into the same channel depth through two 1×1 con-
volution layers before entering the next level. Given two adjacent feature maps
Fl and Fl−1 with the same channel number, we up-sample Fl to the same size
as Fl−1 via a bi-linear interpolation layer. Then, we concatenate them together
and take the concatenated feature map as input for a sub-network that con-
tains two convolutional layers with the kernel size of 3 × 3. The output of the
sub-network is the prediction of the semantic flow field ∆_{l−1} ∈ R^{H_{l−1}×W_{l−1}×2}.
Mathematically, the aforementioned steps can be written as:

∆_{l−1} = conv_l(cat(F_l, F_{l−1})),     (1)

where cat(·) represents the concatenation operation and conv_l(·) denotes the 3×3 convolutional layers. Since our network adopts strided convolutions, which lead to very low resolution, in most cases the receptive field of the 3×3 convolution conv_l is sufficient to cover most large objects in that feature map. Note that
we discard the correlation layer proposed in FlowNet-C [13], where positional
correspondence is calculated explicitly. Because there exists a huge semantic gap
between higher-level layer and lower-level layer, explicit correspondence calcula-


tion on such features is difficult and tends to fail for offset prediction. Moreover,
adopting such a correlation layer introduces heavy computation cost, which vi-
olates our goal for the network to be fast and accurate.
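As one concrete reading of Eq. 1, the sketch below (our illustration; the intermediate BatchNorm/ReLU and the channel width are assumptions not spelled out in the text) concatenates the bilinearly upsampled coarse feature F_l with the fine feature F_{l−1} and predicts the two-channel flow field with 3×3 convolutions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowHead(nn.Module):
    """Predicts the semantic flow field Delta_{l-1} from two adjacent feature maps (Eq. 1 sketch)."""

    def __init__(self, channels=256):
        super().__init__()
        # 3x3 convolutions on the concatenated features; the last one outputs
        # 2 channels, one offset per spatial axis (BN/ReLU are our assumptions).
        self.conv = nn.Sequential(
            nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=3, padding=1))

    def forward(self, f_low, f_high):
        # f_low:  F_{l-1}, the finer map of shape (N, C, H_{l-1}, W_{l-1})
        # f_high: F_l, the coarser map; upsample it to the finer resolution first.
        f_high_up = F.interpolate(f_high, size=f_low.shape[-2:],
                                  mode="bilinear", align_corners=False)
        # Delta_{l-1} = conv_l(cat(F_l, F_{l-1})), shape (N, 2, H_{l-1}, W_{l-1}).
        return self.conv(torch.cat([f_high_up, f_low], dim=1))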
After having computed ∆_{l−1}, each position p_{l−1} on the spatial grid Ω_{l−1} is then mapped to a point p_l on the upper level l via a simple addition operation. Since there exists a resolution gap between the features and the flow field, as shown in Fig. 3(b), the warped grid and its offset should be halved, as in Eq. 2:

p_l = (p_{l−1} + ∆_{l−1}(p_{l−1})) / 2.     (2)
We then use the differentiable bi-linear sampling mechanism proposed in the
spatial transformer networks [22], which linearly interpolates the values of the
4-neighbors (top-left, top-right, bottom-left, and bottom-right) of pl to approx-
imate the final output of the FAM, denoted by F̃_l(p_{l−1}). Mathematically,

F̃_l(p_{l−1}) = F_l(p_l) = Σ_{p ∈ N(p_l)} w_p F_l(p),     (3)

where N(p_l) represents the neighbors of the warped point p_l in F_l and w_p denotes the bilinear kernel weight estimated from the distance on the warped grid. This warp-
ing procedure may look similar to the convolution operation of the deformable
kernels in deformable convolution network (DCN) [10]. However, our method has
a lot of noticeable difference from DCN. First, our predicted offset field incor-
porates both higher-level and lower-level features to align the positions between
high-level and low-level feature maps, while the offset field of DCN moves the
positions of the kernels according to the predicted location offsets in order to possess larger and more adaptive receptive fields. Second, our module focuses
on aligning features while DCN works more like an attention mechanism that
attends to the salient parts of the objects. More detailed comparison can be
found in the experiment part.
On the whole, the proposed FAM module is light-weight and end-to-end
trainable because it only contains one 3×3 convolution layer and one parameter-
free warping operation in total. Besides these merits, it can be plugged into networks multiple times with only minor extra computational overhead.
Figure 3(a) gives the detailed settings of the proposed module while Figure 3(b)
shows the warping process. Figure 2 visualizes feature maps of two adjacent
levels, their learned semantic flow and the finally warped feature map. As shown
in Figure 2, the warped feature is more structurally neat than the bilinearly upsampled feature and leads to a more consistent representation of objects, such
as the bus and car.
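Putting Eqs. 2 and 3 into code, the warp can be written with PyTorch's differentiable bilinear sampler. The sketch below is our interpretation rather than the authors' released operator: it builds an identity grid in normalized coordinates, shifts it by the predicted flow (taken here to be in pixels of the finer grid, which is our convention), and samples the coarse map at the finer resolution; because the grid is normalized, the implicit rescaling plays the role of the halving in Eq. 2. Combined with the FlowHead sketch above and an element-wise addition, this gives one FAM.

import torch
import torch.nn.functional as F

def flow_warp(feature, flow):
    """Warp a coarse feature map onto the resolution of `flow` (sketch of Eqs. 2-3).

    feature: (N, C, h, w)  coarse map F_l
    flow:    (N, 2, H, W)  semantic flow at the finer resolution; channel 0 is the
                           x offset and channel 1 the y offset (our convention).
    returns: (N, C, H, W)  F_l warped onto the finer grid.
    """
    n, _, out_h, out_w = flow.shape
    # Identity sampling grid in normalized coordinates [-1, 1], laid out as (x, y).
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, out_h, device=flow.device),
        torch.linspace(-1.0, 1.0, out_w, device=flow.device),
        indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    # Convert the pixel-space flow into normalized offsets and shift the grid.
    norm = torch.tensor([out_w / 2.0, out_h / 2.0], device=flow.device)
    grid = grid + flow.permute(0, 2, 3, 1) / norm
    # Differentiable bilinear sampling over the 4 neighbors (spatial transformer style).
    return F.grid_sample(feature, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# One FAM step (illustrative): warped = flow_warp(F_l, flow_head(F_{l-1}, F_l)); fused = F_{l-1} + warped.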

3.3 Network Architectures


Figure 3(c) illustrates the whole network architecture, which contains a bottom-
up pathway as the encoder and a top-down pathway as the decoder. While
the encoder has a backbone network offering feature representations of different


levels, the decoder can be seen as a FPN equipped with several FAMs.
Encoder Part. We choose standard networks pre-trained on ImageNet [49] for
image classification as our backbone network by removing the last fully connected
layer. Specifically, ResNet series [20], ShuffleNet v2 [36] and DF series [31] are
used and compared in our experiments. All backbones have 4 stages with residual
blocks, and each stage begins with a stride-2 convolutional layer that downsamples the feature map for both computational efficiency and larger
receptive fields. We additionally adopt the Pyramid Pooling Module (PPM) [67]
for its superior power to capture contextual information. In our setting, the
output of PPM shares the same resolution as that of the last residual module.
In this situation, we treat PPM and the last residual module together as the last
stage for the upcoming FPN. Other modules like ASPP [6] can also be plugged
into our network, which are also experimentally ablated in Sec. 4.1.
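For reference, a minimal sketch of the PPM head in the spirit of PSPNet [67] is given below; the bin sizes (1, 2, 3, 6) and the reduction width follow the usual PSPNet configuration and are assumptions, since the paper does not restate them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid Pooling Module sketch (after PSPNet [67])."""

    def __init__(self, in_channels=512, branch_channels=128, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                                   # pool to b x b
                nn.Conv2d(in_channels, branch_channels, 1, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[-2:]
        # Each pooled branch is upsampled back to the input resolution and concatenated
        # with the input, so the output keeps the resolution of the last residual stage.
        pooled = [F.interpolate(branch(x), size=(h, w), mode="bilinear",
                                align_corners=False) for branch in self.branches]
        return torch.cat([x] + pooled, dim=1)   # (N, in_channels + 4*branch_channels, h, w)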
Aligned FPN Decoder takes feature maps from the encoder and uses the
aligned feature pyramid for final scene parsing. By replacing normal bi-linear
up-sampling with FAM in the top-down pathway of FPN [34], {F_l}_{l=2}^{4} is refined to {F̃_l}_{l=2}^{4}, where top-level feature maps are aligned and fused into their bottom levels via element-wise addition, and l indexes the feature pyramid levels. For scene parsing, {F̃_l}_{l=2}^{4} ∪ {F5} are up-sampled to the same resolution (i.e., 1/4 of the input image) and concatenated together for prediction. Considering
there are still misalignments during the previous step, we also replace these
up-sampling operations with the proposed FAM.
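Assuming the FlowHead and flow_warp sketches from Section 3.2 are in scope, the decoder can be wired roughly as follows. The channel width, the 19-class head and the final bilinear resizing to 1/4 resolution are our simplifications; as noted above, the paper also replaces these last up-sampling operations with FAMs.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignedFPNDecoder(nn.Module):
    """FPN decoder where each top-down upsampling is replaced by a flow-aligned warp (sketch)."""

    def __init__(self, in_channels=(64, 128, 256, 512), channels=128, num_classes=19):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, channels, kernel_size=1, bias=False) for c in in_channels)
        # One flow predictor per adjacent pair of pyramid levels.
        self.fams = nn.ModuleList(FlowHead(channels) for _ in in_channels[:-1])
        self.classifier = nn.Conv2d(channels * len(in_channels), num_classes, 1)

    def forward(self, feats):
        # feats: [F2, F3, F4, F5]; F5 is assumed to already carry the PPM context.
        lats = [lat(f) for lat, f in zip(self.laterals, feats)]
        outs = [lats[-1]]
        for i in range(len(lats) - 2, -1, -1):                # level 4 down to level 2
            flow = self.fams[i](lats[i], outs[-1])            # semantic flow at level i
            outs.append(lats[i] + flow_warp(outs[-1], flow))  # aligned top-down fusion
        outs = outs[::-1]                                     # finest level first
        size = outs[0].shape[-2:]                             # 1/4 of the input image
        fused = torch.cat([F.interpolate(o, size=size, mode="bilinear",
                                         align_corners=False) for o in outs], dim=1)
        return self.classifier(fused)                         # logits at 1/4 resolution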
Cascaded Deeply Supervised Learning. We use deeply supervised loss [67]
to supervise intermediate outputs of the decoder for easier optimization. In addi-
tion, following [56], online hard example mining [50] is also used by only training
on the 10% hardest pixels sorted by cross-entropy loss.
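A sketch of this training loss is given below: the per-pixel cross-entropy is computed, only the hardest 10% of the valid pixels are kept, and intermediate decoder outputs are supervised with a weighted sum (the auxiliary weight of 0.4 is an illustrative assumption, not a value stated in the paper).

import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, target, keep_ratio=0.1, ignore_index=255):
    """Cross-entropy over only the hardest pixels, following the 10% rule described above."""
    pixel_loss = F.cross_entropy(logits, target, ignore_index=ignore_index,
                                 reduction="none").flatten()
    valid = pixel_loss[target.flatten() != ignore_index]
    if valid.numel() == 0:
        return logits.sum() * 0.0                  # nothing to supervise in this crop
    k = max(1, int(keep_ratio * valid.numel()))
    hardest, _ = torch.topk(valid, k)              # largest losses are the hardest pixels
    return hardest.mean()

def deeply_supervised_loss(main_logits, aux_logits_list, target, aux_weight=0.4):
    """Main OHEM loss plus weighted auxiliary losses on intermediate decoder outputs."""
    loss = ohem_cross_entropy(main_logits, target)
    for aux in aux_logits_list:
        loss = loss + aux_weight * ohem_cross_entropy(aux, target)
    return loss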

4 Experiments
We first carry out experiments on the Cityscapes [9] dataset, which is comprised
of a large set of high-resolution (2048×1024) images in street scenes. This dataset
has 5,000 images with high quality pixel-wise annotations for 19 classes, which
is further divided into 2975, 500, and 1525 images for training, validation and
testing. Note that coarse data are not used in this work. Besides, more experiments on Pascal Context [14], ADE20K [69] and CamVid [3] are summarised to further prove the generality of our method.

4.1 Experiments on Cityscapes


Implementation details: We use the PyTorch [46] framework to carry out the following experiments. All networks are trained with the same setting, where stochastic gradient descent (SGD) with a batch size of 16 is used as the optimizer, with a momentum of 0.9 and a weight decay of 5e-4. All models are trained for 50K iterations with an initial learning rate of 0.01. As a common practice, the "poly"
learning rate policy is adopted to decay the initial learning rate by multiplying it by (1 − iter/total_iter)^0.9 during training. Data augmentation contains random horizontal flipping, random resizing with a scale range of [0.75, 2.0], and random cropping with a crop size of 1024 × 1024. During inference, we use the whole picture as input to report performance unless explicitly mentioned. For quantitative evaluation, the mean of class-wise intersection-over-union (mIoU) is used for accuracy comparison, and the number of floating-point operations (FLOPs) and frames per second (FPS) are adopted for speed comparison.

(a) Ablation study on the baseline model.
Method        Stride  mIoU (%)  ∆a (%)
FCN           32      71.5      -
Dilated FCN   8       72.6      1.1 ↑
+FPN          32      74.8      3.3 ↑
+FAM          32      77.2      5.7 ↑
+FPN + PPM    32      76.6      5.1 ↑
+FAM + PPM    32      78.7      7.2 ↑

(b) Ablation study on insertion position.
Method      F3  F4  F5  mIoU (%)  ∆a (%)
FPN + PPM               76.6      -
            ✓           76.9      0.3 ↑
                ✓       77.0      0.4 ↑
                    ✓   77.5      0.9 ↑
                ✓   ✓   77.8      1.2 ↑
            ✓   ✓   ✓   78.3      1.7 ↑

(c) Ablation study on the context module.
Method            mIoU (%)  ∆a (%)  #GFLOPs
FAM               76.4      -       -
+PPM [67]         78.3      1.9 ↑   123.5
+NL [52]          76.8      0.4 ↑   148.0
+ASPP [6]         77.6      1.2 ↑   138.6
+DenseASPP [55]   77.5      1.1 ↑   141.5

(d) Ablation study on various backbones.
Backbone            mIoU (%)  ∆a (%)  #GFLOPs
ResNet-50 [20]      76.8      -       332.6
  w/ FAM            79.2      2.4 ↑   337.1
ResNet-101 [20]     77.6      -       412.7
  w/ FAM            79.8      2.2 ↑   417.5
ShuffleNetv2 [36]   69.8      -       17.8
  w/ FAM            72.1      2.3 ↑   18.1
DF1 [31]            72.1      -       18.6
  w/ FAM            74.3      2.2 ↑   18.7
DF2 [31]            73.2      -       48.2
  w/ FAM            75.8      2.6 ↑   48.5

Table 1. Experimental results on network design using the Cityscapes validation set.
Comparison with baseline methods: Table 1(a) reports the comparison
results against baselines on the validation set of Cityscapes [9], where ResNet-
18 [20] serves as the backbone. Comparing with the naive FCN, dilated FCN
improves mIoU by 1.1%. By appending the FPN decoder to the naive FCN, we get 74.8% mIoU, an improvement of 3.3%. By replacing bilinear upsam-
pling with the proposed FAM, mIoU is boosted to 77.2%, which improves the
naive FCN and FPN decoder by 5.7% and 2.4% respectively. Finally, we append
PPM (Pyramid Pooling Module) [67] to capture global contextual information,
which achieves the best mIoU of 78.7% together with FAM. Meanwhile, FAM is complementary to PPM, as FAM improves the FPN+PPM baseline from 76.6% to 78.7%.
Positions to insert FAM: We insert FAM to different stage positions in the
FPN decoder and report the results in Table 1(b). From the first three rows, FAM improves all stages and gets the greatest improvement at the last stage, which demonstrates that misalignment exists in all stages of FPN and is more severe in coarse layers. This is consistent with the fact that coarse layers contain stronger semantics but lower resolution, and can greatly boost segmentation
performance when they are appropriately upsampled to high resolution. The best
result is achieved by adding FAM to all stages in the last row. Note that, for
fast speed, we adopt FAMs only in the adjacent feature pyramids.

(a) Ablation study on the upsampling operation in FAM.
Method               mIoU (%)
bilinear upsampling  78.3
deconvolution        77.9
nearest neighbor     78.2

(b) Ablation study on the kernel size k in FAM, where 3 FAMs are involved.
Method  mIoU (%)  #GFLOPs
k = 1   77.8      120.4
k = 3   78.3      123.5
k = 5   78.1      131.6
k = 7   78.0      140.5

(c) Ablation with the FlowNet-C correlation [13] in FAM.
Method            mIoU (%)  ∆a (%)
FPN + PPM         76.6      -
correlation [13]  77.2      0.6 ↑
Ours              77.5      0.9 ↑

(d) Comparison with DCN [10].
Method      F3  F4  F5  mIoU (%)  ∆a (%)
FPN + PPM               76.6      -
DCN                 ✓   76.9      0.3 ↑
Ours                ✓   77.5      0.9 ↑
DCN         ✓   ✓   ✓   77.2      0.6 ↑
Ours        ✓   ✓   ✓   78.3      1.7 ↑

Table 2. Experimental results on FAM design using the Cityscapes validation set.

Ablation study on network architecture design: Considering that current state-of-the-art contextual modules are used as heads on dilated backbone networks [6,16,55,61,67,68], we further try different contextual heads in our method, where the coarsest feature map is used for contextual modeling. Table 1(c) reports the
comparison results, where PPM [67] delivers the best result, while more recently
proposed methods such as Non-Local based heads [52] perform worse. There-
fore, we choose PPM as our contextual head considering its better performance
with lower computational cost. We further carry out experiments with differ-
ent backbone networks including both deep and light-weight networks, where
FPN decoder with PPM head is used as a strong baseline in Table 1(d). For
heavy networks, we choose ResNet-50 and ResNet-101 [20] as representation.
For light-weight networks, ShuffleNetv2 [36] and DF1/DF2 [31] are employed.
FAM significantly achieves better mIoU on all backbones with slightly extra
computational cost.
Ablation study on FAM design: We first explore the effect of upsampling in
FAM in Table 2(a). Replacing the bilinear upsampling with deconvolution and
nearest neighbor upsampling achieves 77.9 mIoU and 78.2 mIoU, respectively,
which are similar to the 78.3 mIoU achieved by bilinear upsampling. We also try
various kernel sizes in Table 2(b). A larger kernel size of 5×5 is also tried, which results in a similar mIoU but introduces more computation cost. In Table 2(c),
replacing FlowNet-S with correlation in FlowNet-C also leads to slightly worse
results (77.2) but increases the inference time. The results show that it is enough
to use the light-weight FlowNet-S design for aligning feature maps in FPN. In Table 2(d), we compare our results with DCN [10]. We apply DCN on the concatenation of the bilinearly upsampled feature map and the feature map of the next level. We first insert one DCN at the higher layer F5, where our FAM performs better. After applying DCN to all layers, the performance gap becomes much larger. This indicates that our method can also align low-level edges for better boundaries in lower layers, which will be shown in the visualization part.
Aligned feature representation: In this part, we give more visualization of the aligned feature representation, as shown in Figure 4. We visualize the upsampled feature in the final stage of ResNet-18. It shows that, compared with DCN [10], our FAM feature is more structured and has much more precise object boundaries, which is consistent with the results in Table 2(d). This indicates that FAM is not an attention effect on features similar to DCN, but actually aligns the features towards more precise shapes, as compared in the red boxes.

Fig. 4. Visualization of the aligned feature (input images, feature w/o FAM, feature w/ DCN, feature w/ FAM). Compared with DCN, our module outputs a more structured feature representation.

Fig. 5. Visualization of the learned semantic flow fields. Column (a) lists three exemplary images. Columns (b)-(d) show the semantic flow of the three FAMs in ascending order of resolution during the decoding process, following the same color coding as Figure 2. Column (e) is the arrowhead visualization of the flow fields in column (d). Column (f) contains the segmentation results.
Visualization of Semantic Flow: Figure 5 visualizes semantic flow from
FAM in different stages. Similar to optical flow, semantic flow is visualized
by color coding and is bilinearly interpolated to image size for quick overview.
Besides, vector fields are also visualized for detailed inspection. From the visual-
ization, we observe that semantic flow tends to diffuse out from some positions
inside objects, where these positions are generally near object centers and have
better receptive fields to activate top-level features with pure, strong seman-
tics. Top-level features at these positions are then propagated to appropriate
high-resolution positions following the guidance of semantic flow. In addition,
semantic flows also show coarse-to-fine trends from the top level to the bottom level, a phenomenon consistent with the fact that semantic flow gradually describes offsets between increasingly smaller patterns.
Visual Improvement analysis: Figure 6(a) visualizes the prediction errors of both methods, where FAM considerably resolves ambiguities inside large objects (e.g., truck) and produces more precise boundaries for small and thin objects (e.g., poles, edges of walls). Figure 6(b) shows that our model handles small objects with sharper boundaries better than dilated PSPNet, thanks to the alignment on lower layers.

Fig. 6. (a) Qualitative comparison in terms of errors in predictions, where correctly predicted pixels are shown as black background while wrongly predicted pixels are colored with their ground-truth label color codes. (b) Scene parsing results compared against PSPNet [67], where significantly improved regions are marked with red dashed boxes. Our method performs better on both small-scale and large-scale objects.

Method                     InputSize    mIoU (%)  #FPS    #Params
ENet [45]                  640 × 360    58.3      60      0.4M
ESPNet [38]                512 × 1024   60.3      132     0.4M
ESPNetv2 [39]              512 × 1024   62.1      80      0.8M
ERFNet [47]                512 × 1024   69.7      41.9    -
BiSeNet(ResNet-18) [56]    768 × 1536   74.6      43      12.9M
BiSeNet(Xception-39) [56]  768 × 1536   68.4      72      5.8M
ICNet [66]                 1024 × 2048  69.5      34      26.5M
DF1-Seg [31]               1024 × 2048  73.0      80      8.55M
DF2-Seg [31]               1024 × 2048  74.8      55      8.55M
SwiftNet [44]              1024 × 2048  75.5      39.9    11.80M
SwiftNet-ens [44]          1024 × 2048  76.5      18.4    24.7M
DFANet [27]                1024 × 1024  71.3      100     7.8M
CellNet [65]               768 × 1536   70.5      108     -
SFNet(DF1)                 1024 × 2048  74.5      74/121  9.03M
SFNet(DF2)                 1024 × 2048  77.8      53/61   10.53M
SFNet(ResNet-18)           1024 × 2048  78.9      18/26   12.87M
SFNet(ResNet-18)†          1024 × 2048  80.4      18/26   12.87M
† Mapillary dataset used for pretraining.

Table 3. Comparison on the Cityscapes test set with state-of-the-art real-time models. For fair comparison, input size is also considered, and all models use single-scale inference.
Comparison with real-time models: All compared methods are evaluated
by single-scale inference and input sizes are also listed for fair comparison. Our
speed is tested on one GTX 1080Ti GPU with full image resolution 1024 × 2048
as input, and we report speed of two versions, i.e., without and with TensorRT
acceleration. As shown in Table 3, our method based on DF1 achieves a more accurate result (74.5%) than all methods that are faster than it. With DF2, our method
outperforms all previous methods while running at 60 FPS. With ResNet-18 as
backbone, our method achieves 78.9% mIoU and even reaches performance of
accurate models which will be discussed in the next experiment. By additionally
using Mapillary [42] dataset for pretraining, our ResNet-18 based model achieves
26 FPS with 80.4% mIoU, which sets a new state-of-the-art record for the accuracy and speed trade-off on the Cityscapes benchmark. More detailed information is given in the supplementary file.
Comparison with accurate models: State-of-the-art accurate models [16, 55, 67, 71] perform multi-scale and horizontal-flip inference to achieve better results on the Cityscapes test server. For fair comparison, we also report multi-scale testing results with flipping, following previous methods [16, 67]. Model parameters and computation FLOPs are also listed for comparison. Table 4 summarizes the results, where our models achieve state-of-the-art accuracy while costing much less computation. In particular, our method based on ResNet-18 is 1.1% mIoU higher than PSPNet [67] while only requiring 11% of its computation. Our ResNet-101 based model achieves better results than DANet [16] by 0.3% mIoU and only requires 30% of its computation.

Method          Backbone      mIoU (%)  #Params  #GFLOPs
SAC [64]        ResNet-101    78.1      -        -
DepthSeg [26]   ResNet-101    78.2      -        -
PSPNet [67]     ResNet-101    78.4      65.7M    1065.4
BiSeNet [56]    ResNet-18     77.7      12.3M    82.2
BiSeNet [56]    ResNet-101    78.9      51.0M    219.1
DFN [57]        ResNet-101    79.3      90.7M    1121.0
PSANet [68]     ResNet-101    80.1      85.6M    1182.6
DenseASPP [55]  DenseNet-161  80.6      35.7M    632.9
SPGNet [8]      2×ResNet-50   81.1      -        -
ANNet [72]      ResNet-101    81.3      63.0M    1089.8
CCNet [21]      ResNet-101    81.4      66.5M    1153.9
DANet [16]      ResNet-101    81.5      66.6M    1298.8
SFNet           ResNet-18     79.5      12.87M   123.5
SFNet           ResNet-101    81.8      50.32M   417.5
#GFLOPs calculation adopts a 1024 × 1024 image as input.

Table 4. Comparison on the Cityscapes test set with state-of-the-art accurate models. For better accuracy, all models use multi-scale inference.

4.2 Experiments on More Datasets

We also perform experiments on three other datasets, including Pascal Context [40], ADE20K [69] and CamVid [3], to further prove the effectiveness of our method. More detailed settings can be found in the supplementary file.
PASCAL Context: The results are shown in Table 5(a); our method outperforms the corresponding baselines by 1.7% mIoU and 2.7% mIoU with ResNet-50 and ResNet-101 as backbones, respectively. In addition, our method on both
ResNet-50 and ResNet-101 outperforms their existing counterparts by large mar-
gins with significantly lower computational cost.
ADE20K: This is a challenging scene parsing dataset. Images in this dataset come from different scenes with more scale variation. Table 5(b) reports the performance comparison; our method improves the baselines by 1.69% mIoU and 1.59% mIoU, respectively, and outperforms previous state-of-the-art methods [67,
68] with much less computation.

(a) Results on Pascal Context (evaluated on 60 classes).
Method           Backbone    mIoU (%)       #GFLOPs
Ding et al. [12] ResNet-101  51.6           -
EncNet [60]      ResNet-50   49.2           -
EncNet [60]      ResNet-101  51.7           -
DANet [16]       ResNet-50   50.1           186.4
DANet [16]       ResNet-101  52.6           257.1
ANNet [72]       ResNet-101  52.8           243.8
BAFPNet [11]     ResNet-101  53.6           -
EMANet [29]      ResNet-101  53.1           209.3
w/o FAM          ResNet-50   49.0           74.5
SFNet            ResNet-50   50.7 (1.7 ↑)   75.4
w/o FAM          ResNet-101  51.1           92.7
SFNet            ResNet-101  53.8 (2.7 ↑)   93.6

(b) Results on ADE20K.
Method           Backbone    mIoU (%)        #GFLOPs
PSPNet [67]      ResNet-50   42.78           167.6
PSPNet [67]      ResNet-101  43.29           238.4
PSANet [68]      ResNet-101  43.77           264.9
EncNet [60]      ResNet-101  44.65           -
CFNet [61]       ResNet-101  44.82           -
w/o FAM          ResNet-50   41.12           74.8
SFNet            ResNet-50   42.81 (1.69 ↑)  75.7
w/o FAM          ResNet-101  43.08           93.1
SFNet            ResNet-101  44.67 (1.59 ↑)  94.0

Table 5. Experimental results on Pascal Context and ADE20K (multi-scale inference). #GFLOPs calculation adopts a 480 × 480 image as input.

Method Backbone mIoU (%) FPS


ICNet [66] ResNet-50 67.1 34.5
BiSeNet [56] Xception-39 65.6 -
BiSeNet [56] ResNet-18 68.7 -
DFANet A [27] - 64.7 120
DFANet B [27] - 59.3 160
w/o FAM DF2 67.2 139.8
SFNet DF2 70.4 (3.2 ↑) 134.1
SFNet ResNet-18 73.8 35.5

Table 6. Accuracy and speed comparison with previous state-of-the-art real-time models on the CamVid [3] test set, where the input size is 960 × 720 with single-scale inference.

CamVid: This is another road scene dataset, involving 367 training images, 101 validation images and 233 testing images with a resolution of 960 × 720.
We apply our method with different light-weight backbones on this dataset and
report comparison results in Table 6. With DF2 as backbone, FAM improves
its baseline by 3.2% mIoU. Our method based on ResNet-18 performs best with
73.8% mIoU while running at 35.5 FPS.

5 Conclusion

In this paper, we propose to use learned Semantic Flow to align the multi-level feature maps generated by a feature pyramid for the task of scene parsing.
With the proposed flow alignment module, high-level features are well fused into
low-level feature maps with high resolution. By discarding atrous convolutions
to reduce computation overhead and employing the flow alignment module to
enrich the semantic representation of low-level features, our network achieves
the best trade-off between semantic segmentation accuracy and running time
efficiency. Experiments on multiple challenging datasets illustrate the efficacy of
our method.

6 Supplemental Parts

Our supplemental material contains two parts. One gives more details on the Cityscapes dataset and the other gives the detailed settings for the other datasets. We will open-source our codebase.

7 Supplemental Experiments on Cityscapes

Detailed improvement over baseline models: Table 7 compares the detailed results for each category on the validation set, where ResNet-101 is used as the backbone and the FPN decoder with a PPM head serves as the baseline. Our method improves almost all categories, especially 'truck', with more than 18% mIoU improvement.
More structured feature visualization for FAM: We give more structured feature visualizations in Figure 7. We visualize more FAM outputs at two different locations: the last stage (below the blue line) and the stage next to the last (above the blue line). In both cases, our module aligns the features into more structured representations with clearer shapes and more accurate boundaries.
More visualization of the learned flow: We also give more visualizations of the learned semantic flow in Figure 8.
More training details using Mapillary Vistas [42]: Mapillary Vistas is a large-scale dataset captured in street scenes, which contains 18K/2K/5K images for training, validation and testing, respectively. The dataset is similar to Cityscapes. Due to the larger variance of image resolutions compared with Cityscapes, we resize the longer side to 2048 before data augmentation. To verify the performance improvement of SFNet from using more training data, we first pre-train SFNet on Mapillary Vistas for 50,000 iterations using both the train and val sets, then fine-tune the model on Cityscapes for 50,000 iterations using the Cityscapes fine-annotated data with the same setting as in the paper, before submitting to the test server.
Detailed settings for TensorRT: The testing environment is TensorRT 6.0.1 with CUDA 10.1 on a single GTX 1080Ti GPU. In addition, we re-implement the grid sampling operator in CUDA so that it can be used together with TensorRT; this operator is provided by PyTorch and used in the warping operation of the Flow Alignment Module. We also test our speed on the 1080Ti using the PyTorch library [46], where we report the average time of inferring 100 images.
Detailed results and settings on Cityscapes compared with accurate models: We give more detailed results in Table 8 for the state-of-the-art model comparison. For fair comparison, we adopt multi-scale inference with 7 scales {0.5, 0.75, 1, 1.25, 1.5, 1.75, 2.0} and the flip operation. As shown in Table 8, our SFNet achieves the best performance with fewer GFLOPs, as reported in the main paper.

Fig. 7. More visualization of the aligned feature representation (input, feature w/o FAM, feature after FAM, output). The figures below the blue line are the outputs of the last-stage FAM, while the figures above the blue line are the outputs of the FAM next to the last stage, with finer details. Best viewed on screen.
Method road swalk build wall fence pole tlight sign veg. terrain sky person rider car truck bus train mbike bike mIoU
BaseLine 98.1 84.9 92.6 54.8 62.2 66.0 72.8 80.8 92.4 60.6 94.8 83.1 66.0 94.9 65.9 83.9 70.5 66.0 78.9 77.6
w/ FAM 98.3 85.9 93.2 62.2 67.2 67.3 73.2 81.1 92.8 60.5 95.6 83.2 65.0 95.7 84.1 89.6 75.1 67.7 78.8 79.8

Table 7. Quantitative per-category comparison results on the Cityscapes validation set, where a ResNet-101 backbone with the FPN decoder and PPM head serves as the strong baseline. Sliding-window crop with horizontal flip is used for testing. FAM boosts the performance of almost all categories.

8 Detailed Experiment Settings on Other Datasets:

PASCAL Context: This dataset provides detailed semantic labels for whole scenes and contains 4998 images for training and 5105 images for validation. We train the network for 120 epochs with a batch size of 16, a crop size of 512 and an initial learning rate of 1e-3. For evaluation, we perform multi-scale testing with the horizontal flip operation.

Fig. 8. More visualization of the learned semantic flow fields. Column (a) lists input images. Columns (b)-(d) show the semantic flow of the three FAMs in ascending order of resolution during the decoding process. Best viewed on screen.

ADE20K: This is a more challenging scene parsing dataset annotated with 150 classes, containing 20K/2K images for training and validation and a large variety of objects in each scene. We train the network for 120 epochs with a batch size of 16, a crop size of 512 and an initial learning rate of 1e-2. For the final testing, we perform multi-scale testing with the horizontal flip operation.

Method road swalk build wall fence pole tlight sign veg. terrain sky person rider car truck bus train mbike bike mIoU
ResNet38 [53] 98.5 85.7 93.0 55.5 59.1 67.1 74.8 78.7 93.7 72.6 95.5 86.6 69.2 95.7 64.5 78.8 74.1 69.0 76.7 78.4
PSPNet [67] 98.6 86.2 92.9 50.8 58.8 64.0 75.6 79.0 93.4 72.3 95.4 86.5 71.3 95.9 68.2 79.5 73.8 69.5 77.2 78.4
AAF [23] 98.5 85.6 93.0 53.8 58.9 65.9 75.0 78.4 93.7 72.4 95.6 86.4 70.5 95.9 73.9 82.7 76.9 68.7 76.4 79.1
SegModel [15] 98.6 86.4 92.8 52.4 59.7 59.6 72.5 78.3 93.3 72.8 95.5 85.4 70.1 95.6 75.4 84.1 75.1 68.7 75.0 78.5
DFN [57] - - - - - - - - - - - - - - - - - - - 79.3
BiSeNet [56] - - - - - - - - - - - - - - - - - - - 78.9
DenseASPP [55] 98.7 87.1 93.4 60.7 62.7 65.6 74.6 78.5 93.6 72.5 95.4 86.2 71.9 96.0 78.0 90.3 80.7 69.7 76.8 80.6
BFPNet [11] 98.7 87.1 93.5 59.8 63.4 68.9 76.8 80.9 93.7 72.8 95.5 87.0 72.1 96.0 77.6 89.0 86.9 69.2 77.6 81.4
DANet [16] 98.6 87.1 93.5 56.1 63.3 69.7 77.3 81.3 93.9 72.9 95.7 87.3 72.9 96.2 76.8 89.4 86.5 72.2 78.2 81.5
SFNet 98.8 87.1 93.6 63.2 62.7 68.4 75.6 80.3 93.8 71.0 95.7 87.7 73.2 96.5 75.9 92.3 89.5 71.4 78.0 81.8

Table 8. Per-category results on the Cityscapes test set. Note that all models are trained with only the fine-annotated data. Our method achieves 81.8% mIoU with much fewer GFLOPs.

CamVid: This is a road scene image segmentation dataset, which provides pixel-wise annotations for 11 semantic categories. There are 367 training images, 101 validation images and 233 testing images. We train the model for 120 epochs; the crop size is set to 640, the learning rate to 1e-3 and the batch size to 16. For the final testing, we perform single-scale testing for fair comparison.

References
1. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional
encoder-decoder architecture for image segmentation. PAMI (2017)
2. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. International Journal of Computer Vision 92(1), 1–31 (Mar 2011). https://doi.org/10.1007/s11263-010-0390-2
3. Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic object classes in video: A high-
definition ground truth database. Pattern Recognition Letters xx(x), xx–xx (2008)
4. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow esti-
mation based on a theory for warping. ECCV (2004)
5. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab:
Semantic image segmentation with deep convolutional nets, atrous convolution,
and fully connected crfs. PAMI (2018)
6. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution
for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
7. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with
atrous separable convolution for semantic image segmentation. In: ECCV (2018)
8. Cheng, B., Chen, L.C., Wei, Y., Zhu, Y., Huang, Z., Xiong, J., Huang, T.S., Hwu,
W.M., Shi, H.: Spgnet: Semantic prediction guidance for scene parsing. In: ICCV
(October 2019)
9. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene
understanding. In: CVPR (2016)
10. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convo-
lutional networks. In: ICCV (2017)
11. Ding, H., Jiang, X., Liu, A.Q., Magnenat-Thalmann, N., Wang, G.: Boundary-
aware feature propagation for scene segmentation (2019)
12. Ding, H., Jiang, X., Shuai, B., Qun Liu, A., Wang, G.: Context contrasted feature
and gated multi-scale aggregation for scene segmentation. In: CVPR (2018)
13. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van
Der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convo-
lutional networks. In: CVPR (2015)
14. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal
visual object classes (voc) challenge. IJCV (2010)
15. Falong Shen, Gan Rui, S.Y., Zeng, G.: Semantic segmentation via structured patch
prediction, context crf and guidance crf. In: CVPR (2017)
16. Fu, J., Liu, J., Tian, H., Fang, Z., Lu, H.: Dual attention network for scene seg-
mentation. arXiv preprint arXiv:1809.02983 (2018)
17. Gadde, R., Jampani, V., Gehler, P.V.: Semantic video cnns through representation
warping. In: ICCV (Oct 2017)
18. He, J., Deng, Z., Qiao, Y.: Dynamic multi-scale filters for semantic segmentation.
In: ICCV (October 2019)
19. He, J., Deng, Z., Zhou, L., Wang, Y., Qiao, Y.: Adaptive pyramid context network
for semantic segmentation. In: CVPR (June 2019)
20. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR (2016)
21. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross
attention for semantic segmentation (2019)

22. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer
networks. ArXiv abs/1506.02025 (2015)
23. Ke, T.W., Hwang, J.J., Liu, Z., Yu, S.X.: Adaptive affinity fields for semantic
segmentation. In: ECCV (2018)
24. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. ArXiv abs/1609.02907 (2016)
25. Kirillov, A., Girshick, R., He, K., Dollar, P.: Panoptic feature pyramid networks.
In: CVPR (June 2019)
26. Kong, S., Fowlkes, C.C.: Recurrent scene parsing with perspective understanding
in the loop. In: CVPR (2018)
27. Li, H., Xiong, P., Fan, H., Sun, J.: Dfanet: Deep feature aggregation for real-time
semantic segmentation. In: CVPR (June 2019)
28. Li, X., Yang, Y., Zhao, Q., Shen, T., Lin, Z., Liu, H.: Spatial pyramid based graph
reasoning for semantic segmentation. In: CVPR (2020)
29. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization
attention networks for semantic segmentation. In: ICCV (2019)
30. Li, X., Houlong, Z., Lei, H., Yunhai, T., Kuiyuan, Y.: Gff: Gated fully fusion for
semantic segmentation. In: AAAI (2020)
31. Li, X., Zhou, Y., Pan, Z., Feng, J.: Partial order pruning: for best speed/accuracy
trade-off in neural architecture search. In: CVPR (2019)
32. Li, Y., Gupta, A.: Beyond grids: Learning graph representations for visual recog-
nition. In: NIPS (2018)
33. Li, Y., Shi, J., Lin, D.: Low-latency video semantic segmentation. In: CVPR (June
2018)
34. Lin, T.Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature
pyramid networks for object detection. In: CVPR (2017)
35. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: CVPR (2015)
36. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: Shufflenet v2: Practical guidelines for
efficient cnn architecture design. In: ECCV (September 2018)
37. Mazzini, D.: Guided upsampling network for real-time semantic segmentation. In:
BMVC (2018)
38. Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H.: Espnet: Effi-
cient spatial pyramid of dilated convolutions for semantic segmentation. In: ECCV
(September 2018)
39. Mehta, S., Rastegari, M., Shapiro, L., Hajishirzi, H.: Espnetv2: A light-weight,
power efficient, and general purpose convolutional neural network. In: CVPR (June
2019)
40. Mottaghi, R., Chen, X., Liu, X., Cho, N.G., Lee, S.W., Fidler, S., Urtasun, R.,
Yuille, A.: The role of context for object detection and semantic segmentation in
the wild. In: CVPR (2014)
41. Nekrasov, V., Chen, H., Shen, C., Reid, I.: Fast neural architecture search of com-
pact semantic segmentation models via auxiliary cells. In: CVPR (June 2019)
42. Neuhold, G., Ollmann, T., Rota Bulo, S., Kontschieder, P.: The mapillary vistas
dataset for semantic understanding of street scenes. In: ICCV (2017)
43. Nilsson, D., Sminchisescu, C.: Semantic video segmentation by gated recurrent flow
propagation. In: CVPR (June 2018)
44. Orsic, M., Kreso, I., Bevandic, P., Segvic, S.: In defense of pre-trained imagenet ar-
chitectures for real-time semantic segmentation of road-driving images. In: CVPR
(June 2019)

45. Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 (2016)
46. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch. In:
NIPS-W (2017)
47. Romera, E., Alvarez, J.M., Bergasa, L.M., Arroyo, R.: Erfnet: Efficient residual
factorized convnet for real-time semantic segmentation. IEEE Trans. Intelligent
Transportation Systems pp. 263–272 (2018)
48. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. MICCAI (2015)
49. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large
scale visual recognition challenge. IJCV (2015)
50. Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors
with online hard example mining. In: CVPR (2016)
51. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
52. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR
(June 2018)
53. Wu, Z., Shen, C., van den Hengel, A.: Wider or deeper: Revisiting the resnet model
for visual recognition. arXiv preprint arXiv:1611.10080 (2016)
54. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene
understanding. In: ECCV (2018)
55. Yang, M., Yu, K., Zhang, C., Li, Z., Yang, K.: Denseaspp for semantic segmentation
in street scenes. In: CVPR (2018)
56. Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Bisenet: Bilateral segmen-
tation network for real-time semantic segmentation. In: ECCV (2018)
57. Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Learning a discriminative
feature network for semantic segmentation. In: CVPR (2018)
58. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. ICLR
(2016)
59. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv
preprint arXiv:1809.00916 (2018)
60. Zhang, H., Dana, K., Shi, J., Zhang, Z., Wang, X., Tyagi, A., Agrawal, A.: Context
encoding for semantic segmentation. In: CVPR (2018)
61. Zhang, H., Zhang, H., Wang, C., Xie, J.: Co-occurrent features in semantic seg-
mentation. In: CVPR (June 2019)
62. Zhang, L., Li, X., Arnab, A., Yang, K., Tong, Y., Torr, P.H.: Dual graph convolu-
tional network for semantic segmentation. In: BMVC (2019)
63. Zhang, L., Xu, D., Arnab, A., Torr, P.H.: Dynamic graph message passing networks.
In: CVPR (2020)
64. Zhang, R., Tang, S., Zhang, Y., Li, J., Yan, S.: Scale-adaptive convolutions for
scene parsing. In: ICCV (2017)
65. Zhang, Y., Qiu, Z., Liu, J., Yao, T., Liu, D., Mei, T.: Customizable architecture
search for semantic segmentation. In: CVPR (June 2019)
66. Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: Icnet for real-time semantic segmentation
on high-resolution images. In: ECCV (September 2018)
67. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In:
CVPR (2017)

68. Zhao, H., Zhang, Y., Liu, S., Shi, J., Change Loy, C., Lin, D., Jia, J.: Psanet:
Point-wise spatial attention network for scene parsing. In: ECCV (2018)
69. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Semantic under-
standing of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442
(2016)
70. Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recog-
nition. In: CVPR (July 2017)
71. Zhu, Y., Sapra, K., Reda, F.A., Shih, K.J., Newsam, S., Tao, A., Catanzaro, B.:
Improving semantic segmentation via video propagation and label relaxation. In:
CVPR (June 2019)
72. Zhu, Z., Xu, M., Bai, S., Huang, T., Bai, X.: Asymmetric non-local neural networks
for semantic segmentation. In: ICCV (2019)
