Semantic Flow for Fast and Accurate Scene Parsing
Xiangtai Li1*, Ansheng You1*, Zhen Zhu2, Houlong Zhao3, Maoke Yang3, Kuiyuan Yang3, Shaohua Tan1, and Yunhai Tong1
1 Key Laboratory of Machine Perception, MOE, School of EECS, Peking University
2 Huazhong University of Science and Technology
3 DeepMotion
1 Introduction
[Figure 1: scatter plot of accuracy (mIoU%) versus inference speed (frames/second) on the Cityscapes test set, comparing SFNet and DF-Seg variants with prior models such as RefineNet, BiSeNet, ICNet, DFANet, Fast-SCNN, ESPNet, ENet, and SegNet.]
Fig. 1. Inference speed versus mIoU performance on the Cityscapes test set. Previous models are marked as red points, and our models are shown as blue points, which achieve the best speed/accuracy trade-off. Note that our method with ResNet-18 as backbone even achieves accuracy comparable to all accurate models at a much faster speed.
loss. It should be noted that Semantic Flow is fundamentally different from optical flow: Semantic Flow takes feature maps from different levels as input and assesses the discrepancy between them to find a suitable flow field that dynamically indicates how to align the two feature maps effectively.
Based on the concept of Semantic Flow, we design a novel network module called the Flow Alignment Module (FAM) to exploit Semantic Flow for scene parsing. Feature maps produced by FAM carry both rich semantics and abundant spatial information. Because FAM effectively transmits semantic information from deep layers to shallow layers through very simple operations, it improves accuracy while maintaining high efficiency. Moreover, FAM is end-to-end trainable and can be plugged into any backbone network to improve the results with minor computational overhead. For simplicity, we denote the networks that incorporate FAM but use different backbones as SFNet(backbone). As depicted in Figure 1, SFNet with different backbone networks outperforms other competitors by a large margin at the same speed. In particular, our method with ResNet-18 as backbone achieves 80.4% mIoU on the Cityscapes test server at a frame rate of 26 FPS. When adopting DF2 [31] as backbone, our method achieves 77.8% mIoU at 61 FPS, and 74.5% mIoU at 121 FPS when equipped with the DF1 backbone. Moreover, when using a deeper backbone such as ResNet-101, SFNet achieves better results (81.8% mIoU) than the previous state-of-the-art model DANet [16] (81.5% mIoU), while requiring only 33% of DANet's computation during inference. Besides, the consistently superior efficacy of SFNet across various datasets clearly demonstrates its broad applicability.
To conclude, our main contributions are three-fold:
– We introduce the concept of Semantic Flow into the field of scene parsing and propose a novel flow-based alignment module (FAM) to learn the Semantic Flow between feature maps of adjacent levels and broadcast high-level features to high-resolution features more effectively and efficiently.
– We insert FAMs into the feature pyramid framework and build a feature-pyramid-aligned network called SFNet for fast and accurate scene parsing.
– Detailed experiments and analysis indicate the efficacy of our proposed module in both improving accuracy and keeping the network light-weight. We achieve state-of-the-art results on the Cityscapes, Pascal Context, and CamVid datasets, and a considerable gain on ADE20K.
2 Related Work
For scene parsing, there are mainly two paradigms for high-resolution semantic
map prediction. One paradigm tries to keep both spatial and semantic infor-
mation along the main network pathway, while the other paradigm distributes
spatial and semantic information to different parts in a network, then merges
them back via different strategies.
The first paradigm mostly relies on some network operations to retain high-
resolution feature maps in the latter network stages. Many state-of-the-art ac-
curate methods [16, 67, 71] follow this paradigm to design sophisticated head
networks to capture contextual information. PSPNet [67] proposes to leverage
pyramid pooling module (PPM) to model multi-scale contexts, whilst DeepLab
series [5–7, 55] uses atrous spatial pyramid pooling (ASPP). In [16, 18, 19,
21, 29, 59, 72], non-local operator [52] and self-attention mechanism [51] are
adopted to harvest pixel-wise context from the whole image. Meanwhile, several
works [24, 28, 32, 62, 63] use graph convolutional neural networks to propagate
information over the image by projecting features into an interaction space.
[Figure 2 panels: High-Resolution feature, Low-Resolution feature, result of Bilinear Upsampling, learned Flow Field, and result After Warp.]
Fig. 2. Visualization of feature maps and semantic flow field in FAM. Feature maps are
visualized by averaging along the channel dimension. Larger values are denoted by hot
colors and vice versa. We use the color code proposed in [2] to visualize the Semantic
Flow field. The orientation and magnitude of flow vectors are represented by hue and
saturation respectively.
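As an illustrative rendering of such a flow field (this approximates the hue/saturation scheme described above rather than reproducing the exact color wheel of [2]), the sketch below maps flow orientation to hue and flow magnitude to saturation:

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def flow_to_rgb(flow: np.ndarray) -> np.ndarray:
    """flow: (H, W, 2) array of (dx, dy) offsets -> (H, W, 3) RGB image in [0, 1]."""
    dx, dy = flow[..., 0], flow[..., 1]
    angle = np.arctan2(dy, dx)                          # orientation in [-pi, pi]
    magnitude = np.hypot(dx, dy)
    hsv = np.zeros(flow.shape[:2] + (3,), dtype=np.float64)
    hsv[..., 0] = (angle + np.pi) / (2 * np.pi)         # hue encodes orientation
    hsv[..., 1] = magnitude / (magnitude.max() + 1e-8)  # saturation encodes magnitude
    hsv[..., 2] = 1.0                                   # constant brightness
    return hsv_to_rgb(hsv)

rgb = flow_to_rgb(np.random.randn(64, 128, 2))          # example usage on a random field
```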
3 Method
In this section, we will first give some preliminary knowledge about scene pars-
ing and introduce the misalignment problem therein. Then, we propose the Flow
Alignment Module (FAM) to resolve the misalignment issue by learning Seman-
tic Flow and warping top-layer feature maps accordingly. Finally, we present
the whole network architecture equipped with FAMs based on the FPN frame-
work [34] for fast and accurate scene parsing.
3.1 Preliminary
The task of scene parsing is to map an RGB image $X \in \mathbb{R}^{H \times W \times 3}$ to a semantic map $Y \in \mathbb{R}^{H \times W \times C}$ with the same spatial resolution $H \times W$, where $C$ is the number of predefined semantic categories. Following the setting of FPN [34], the input image $X$ is firstly mapped to a set of feature maps $\{F_l\}_{l=2,\dots,5}$ from each network stage, where $F_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ is a $C_l$-dimensional feature map defined on a spatial grid $\Omega_l$ of size $H_l \times W_l$, with $H_l = H/2^{l}$ and $W_l = W/2^{l}$. The coarsest feature map $F_5$ comes from the deepest layer with the strongest semantics. FCN-32s directly predicts upon $F_5$ and achieves over-smoothed results without fine details. However, some improvements can be achieved by fusing predictions from
lower levels [35]. FPN takes a step further to gradually fuse high-level feature
maps with low-level feature maps in a top-down pathway through 2× bi-linear
upsampling, which was originally proposed for object detection [34] and recently
introduced for scene parsing [25, 54]. The whole FPN framework relies heavily on the upsampling operator to enlarge the spatially smaller but semantically stronger feature maps. However, bilinear upsampling recovers the resolution of downsampled feature maps by interpolating a set of uniformly sampled positions (i.e., it can only handle one kind of fixed and predefined misalignment), while the misalignment between feature maps caused by residual connections and repeated downsampling and upsampling is far more complex. Therefore, the position correspondence between feature maps needs to be established explicitly and dynamically to resolve their actual misalignment.
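For reference, the fixed top-down fusion discussed above can be sketched as follows (tensor shapes and the element-wise addition are illustrative assumptions); FAM later replaces the fixed bilinear upsampling step with flow-guided warping:

```python
import torch
import torch.nn.functional as F

def fpn_topdown_fuse(f_fine: torch.Tensor, f_coarse: torch.Tensor) -> torch.Tensor:
    """Standard FPN-style fusion: f_fine is the lateral feature at level l-1,
    shape (N, C, H, W); f_coarse is the top-down feature at level l, shape (N, C, H/2, W/2)."""
    up = F.interpolate(f_coarse, size=f_fine.shape[-2:],
                       mode="bilinear", align_corners=False)
    # Element-wise addition assumes the two maps are already spatially aligned,
    # which the fixed, uniform interpolation above cannot guarantee.
    return f_fine + up
```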
[Figure 3: (a) Flow Alignment Module, (b) Warp Procedure, (c) Overview of SFNet; the four backbone stages produce features at 1/4, 1/8, 1/16, and 1/32 of the input resolution.]
Fig. 3. (a) Details of the Flow Alignment Module. We combine the transformed high-resolution feature map and the low-resolution feature map to generate the semantic flow field, which is utilized to warp the low-resolution feature map to high resolution. (b) Warp procedure of the Flow Alignment Module. Each value of the high-resolution feature map is the bilinear interpolation of the neighboring pixels in the low-resolution feature map, where the neighborhoods are defined according to the learned semantic flow field. (c) Overview of our proposed SFNet. A ResNet-18 backbone with four stages is used for illustration. FAM: Flow Alignment Module. PPM: Pyramid Pooling Module [67]. Best viewed in color and zoomed in.
where cat(·) represents the concatenation operation and conv_l(·) is a 3×3 convolutional layer. Since our network adopts strided convolutions, which lead to very low resolution, in most cases the receptive field of the 3×3 convolution conv_l is sufficient to cover most large objects in that feature map. Note that we discard the correlation layer proposed in FlowNet-C [13], where positional correspondence is calculated explicitly. Because there exists a huge semantic gap
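To make the alignment procedure concrete, the following is a minimal PyTorch sketch of a flow alignment module under assumed channel widths, 1×1 projection convolutions, and a particular flow normalization; it is a sketch consistent with the description above, not the released implementation. It predicts a 2-channel flow field with a 3×3 convolution over the concatenation of the low-level feature and the bilinearly upsampled high-level feature, then warps the high-level feature with flow-guided bilinear sampling as in Figure 3(b).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowAlignmentModule(nn.Module):
    """Sketch of a FAM: predict a semantic flow field and warp the coarse feature."""

    def __init__(self, low_channels: int, high_channels: int, mid_channels: int = 128):
        super().__init__()
        # 1x1 convs project both inputs to a common width (illustrative choice).
        self.down_low = nn.Conv2d(low_channels, mid_channels, kernel_size=1, bias=False)
        self.down_high = nn.Conv2d(high_channels, mid_channels, kernel_size=1, bias=False)
        # 3x3 conv on the concatenated features predicts the 2-channel flow field.
        self.flow_make = nn.Conv2d(2 * mid_channels, 2, kernel_size=3, padding=1, bias=False)

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        # f_low: (N, C_low, H, W) fine feature; f_high: (N, C_high, H/2, W/2) coarse feature.
        n, _, h, w = f_low.shape
        low = self.down_low(f_low)
        high = self.down_high(f_high)
        high_up = F.interpolate(high, size=(h, w), mode="bilinear", align_corners=True)
        flow = self.flow_make(torch.cat([low, high_up], dim=1))  # (N, 2, H, W)
        warped = self.flow_warp(high, flow)                      # aligned coarse feature
        return low + warped                                      # fused, aligned output

    @staticmethod
    def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # Build a normalized sampling grid at the fine resolution, shift it by the
        # predicted pixel offsets, and bilinearly sample the coarse feature map.
        n, _, h, w = flow.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, h, device=feat.device),
            torch.linspace(-1.0, 1.0, w, device=feat.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, h, w, 2)
        norm = torch.tensor([(w - 1) / 2.0, (h - 1) / 2.0], device=feat.device)
        grid = grid + flow.permute(0, 2, 3, 1) / norm            # offsets given in pixels
        return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)


# Example: align a 1/16-resolution feature to a 1/8-resolution feature.
fam = FlowAlignmentModule(low_channels=128, high_channels=256)
out = fam(torch.randn(1, 128, 64, 128), torch.randn(1, 256, 32, 64))  # -> (1, 128, 64, 128)
```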
4 Experiments
We first carry out experiments on the Cityscapes [9] dataset, which comprises a large set of high-resolution (2048×1024) street-scene images. The dataset contains 5,000 images with high-quality pixel-wise annotations for 19 classes, divided into 2,975, 500, and 1,525 images for training, validation, and testing respectively. Note that coarse data are not used in this work. In addition, further experiments on Pascal Context [14], ADE20K [69], and CamVid [3] are summarised to demonstrate the generality of our method.
learning rate policy is adopted to decay the initial learning rate by multiplying $(1 - \frac{\mathrm{iter}}{\mathrm{total\_iter}})^{0.9}$ during training. Data augmentation contains random horizontal
flip, random resizing with scale range of [0.75, 2.0], and random cropping with
crop size of 1024 × 1024. During inference, we use the whole picture as input
to report performance unless explicitly mentioned. For quantitative evaluation, the mean of class-wise intersection-over-union (mIoU) is used for accuracy comparison, and the number of floating-point operations (FLOPs) and frames per second (FPS) are adopted for speed comparison.
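As a minimal sketch of this "poly" schedule (the optimizer, base learning rate, loss, and iteration count below are placeholders rather than the paper's exact hyper-parameters), the decay can be implemented with PyTorch's LambdaLR:

```python
import torch

# Placeholder model/optimizer; only the schedule itself mirrors the formula above.
model = torch.nn.Conv2d(3, 19, kernel_size=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)

total_iter = 50_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda it: (1 - it / total_iter) ** 0.9,  # poly decay with power 0.9
)

for it in range(total_iter):
    optimizer.zero_grad()
    loss = model(torch.randn(2, 3, 64, 64)).mean()      # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # stepping per iteration makes lr follow (1 - iter/total_iter)^0.9
```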
Comparison with baseline methods: Table 1(a) reports the comparison results against baselines on the Cityscapes [9] validation set, where ResNet-18 [20] serves as the backbone. Compared with the naive FCN, dilated FCN improves mIoU by 1.1%. By appending the FPN decoder to the naive FCN, we obtain 74.8% mIoU, an improvement of 3.2%. By replacing bilinear upsampling with the proposed FAM, mIoU is boosted to 77.2%, which improves over the naive FCN and the FPN decoder by 5.7% and 2.4% respectively. Finally, we append the PPM (Pyramid Pooling Module) [67] to capture global contextual information, which together with FAM achieves the best mIoU of 78.7%. Meanwhile, FAM is complementary to PPM, as FAM improves PPM from 76.6% to 78.7%.
Positions to insert FAM: We insert FAM at different stage positions in the FPN decoder and report the results in Table 1(b). From the first three rows, FAM improves all stages and obtains the greatest improvement at the last stage, which demonstrates that misalignment exists in all stages of the FPN and is more severe in coarse layers. This is consistent with the fact that coarse layers contain stronger semantics but have lower resolution, and can greatly boost segmentation performance when they are appropriately upsampled to high resolution. The best result is achieved by adding FAM to all stages, as shown in the last row. Note that, for fast speed, we adopt FAMs only between adjacent levels of the feature pyramid.
Fig. 4. Visualization of the aligned feature. Compared with DCN, our module outputs a more structured feature representation.
Fig. 5. Visualization of the learned semantic flow fields. Column (a) lists three exemplary images. Columns (b)-(d) show the semantic flow of the three FAMs in ascending order of resolution during the decoding process, following the same color coding as Figure 2. Column (e) is the arrowhead visualization of the flow fields in column (d). Column (f) contains the segmentation results.
feature in the final stage of ResNet-18. It shows that, compared with DCN [10], our FAM feature is more structured and has much more precise object boundaries, which is consistent with the results in Table 2(d). This indicates that FAM is not merely an attention effect on features, as in DCN, but actually aligns features towards more precise shapes, as compared in the red boxes.
Visualization of Semantic Flow: Figure 5 visualizes the semantic flow from FAM at different stages. Similar to optical flow, the semantic flow is visualized by color coding and is bilinearly interpolated to the image size for a quick overview. Besides, vector fields are also visualized for detailed inspection. From the visualization, we observe that semantic flow tends to diffuse out from certain positions inside objects; these positions are generally near object centers and have better receptive fields to activate top-level features with pure, strong semantics. Top-level features at these positions are then propagated to the appropriate high-resolution positions following the guidance of the semantic flow. In addition, the semantic flows show a coarse-to-fine trend from the top level to the bottom level, a phenomenon consistent with the fact that semantic flows gradually describe offsets between progressively finer patterns.
Visual Improvement analysis: Figure 6(a) visualizes the prediction errors by
both methods, where FAM considerably resolves ambiguities inside large objects
(e.g., truck) and produces more precise boundaries for small and thin objects (e.g., poles, edges of walls). Figure 6(b) shows that our model handles small objects with sharper boundaries than dilated PSPNet, thanks to the alignment on lower layers.
Comparison with real-time models: All compared methods are evaluated with single-scale inference, and input sizes are also listed for fair comparison. Our speed is tested on one GTX 1080Ti GPU with a full-resolution 1024 × 2048 image as input, and we report the speed of two versions, i.e., without and with TensorRT acceleration. As shown in Table 3, our method based on DF1 achieves a more accurate result (74.5%) than all methods faster than it. With DF2, our method outperforms all previous methods while running at 60 FPS. With ResNet-18 as backbone, our method achieves 78.9% mIoU and even reaches the performance of accurate models, which will be discussed in the next experiment. By additionally using the Mapillary [42] dataset for pretraining, our ResNet-18 based model achieves
26 FPS with 80.4% mIoU, which sets a new state-of-the-art record for the accuracy/speed trade-off on the Cityscapes benchmark. More detailed information is provided in the supplementary file.
Comparison with accurate models: State-of-the-art accurate models [16, 55, 67, 71] perform multi-scale and horizontal-flip inference to achieve better results on the Cityscapes test server. For fair comparison, we also report multi-scale results with flip testing, following previous methods [16, 67]. Model parameters and computation (FLOPs) are also listed for comparison. Table 4 summarizes the results, where our models achieve state-of-the-art accuracy while costing much less computation. In particular, our method based on ResNet-18 is 1.1% mIoU higher than PSPNet [67] while requiring only 11% of its computation. Our ResNet-101 based model achieves better results than DANet [16] by 0.3% mIoU and requires only 30% of its computation.
Table 6. Accuracy and speed comparison with previous state-of-the-art real-time models on the CamVid [3] test set, where the input size is 960 × 720 with single-scale inference.
CamVid: is another road-scene dataset. It contains 367 training images, 101 validation images, and 233 testing images at a resolution of 960×720. We apply our method with different light-weight backbones on this dataset and report comparison results in Table 6. With DF2 as backbone, FAM improves its baseline by 3.2% mIoU. Our method based on ResNet-18 performs best with 73.8% mIoU while running at 35.5 FPS.
5 Conclusion
In this paper, we propose to use learned Semantic Flow to align the multi-level feature maps generated by a feature pyramid for the task of scene parsing. With the proposed Flow Alignment Module, high-level features are effectively fused into high-resolution low-level feature maps. By discarding atrous convolutions to reduce computation overhead and employing the Flow Alignment Module to enrich the semantic representation of low-level features, our network achieves the best trade-off between semantic segmentation accuracy and running-time efficiency. Experiments on multiple challenging datasets demonstrate the efficacy of our method.
6 Supplemental Parts
Our supplemental material contains two parts: more details on the Cityscapes dataset, and the detailed settings for the other datasets. We will open-source our codebase.
PASCAL Context: provides detailed semantic labels for whole scenes, and contains 4,998 images for training and 5,105 images for validation. We train the network for 120 epochs with batch size 16, crop size 512, and initial learning
rate 1e-3. For evaluation, we perform multi-scale testing with horizontal flip
operation.
ADE20K: is a more challenging scene parsing dataset annotated with 150 classes, containing 20K/2K images for training and validation and covering a wide variety of objects and scenes. We train the network for 120 epochs with batch size 16, crop size 512, and initial learning rate 1e-2. For final testing, we perform multi-scale testing with horizontal flip operation.
Method road swalk build wall fence pole tlight sign veg. terrain sky person rider car truck bus train mbike bike mIoU
ResNet38 [53] 98.5 85.7 93.0 55.5 59.1 67.1 74.8 78.7 93.7 72.6 95.5 86.6 69.2 95.7 64.5 78.8 74.1 69.0 76.7 78.4
PSPNet [67] 98.6 86.2 92.9 50.8 58.8 64.0 75.6 79.0 93.4 72.3 95.4 86.5 71.3 95.9 68.2 79.5 73.8 69.5 77.2 78.4
AAF [23] 98.5 85.6 93.0 53.8 58.9 65.9 75.0 78.4 93.7 72.4 95.6 86.4 70.5 95.9 73.9 82.7 76.9 68.7 76.4 79.1
SegModel [15] 98.6 86.4 92.8 52.4 59.7 59.6 72.5 78.3 93.3 72.8 95.5 85.4 70.1 95.6 75.4 84.1 75.1 68.7 75.0 78.5
DFN [57] - - - - - - - - - - - - - - - - - - - 79.3
BiSeNet [56] - - - - - - - - - - - - - - - - - - - 78.9
DenseASPP [55] 98.7 87.1 93.4 60.7 62.7 65.6 74.6 78.5 93.6 72.5 95.4 86.2 71.9 96.0 78.0 90.3 80.7 69.7 76.8 80.6
BFPNet [11] 98.7 87.1 93.5 59.8 63.4 68.9 76.8 80.9 93.7 72.8 95.5 87.0 72.1 96.0 77.6 89.0 86.9 69.2 77.6 81.4
DANet [16] 98.6 87.1 93.5 56.1 63.3 69.7 77.3 81.3 93.9 72.9 95.7 87.3 72.9 96.2 76.8 89.4 86.5 72.2 78.2 81.5
SFNet 98.8 87.1 93.6 63.2 62.7 68.4 75.6 80.3 93.8 71.0 95.7 87.7 73.2 96.5 75.9 92.3 89.5 71.4 78.0 81.8
Table 8. Per-category results on the Cityscapes test set. Note that all models are trained with only fine annotated data. Our method achieves 81.8% mIoU with much fewer GFLOPs.
References
1. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional
encoder-decoder architecture for image segmentation. PAMI (2017)
2. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. International Journal of Computer Vision 92(1), 1–31 (2011). https://doi.org/10.1007/s11263-010-0390-2
3. Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic object classes in video: A high-
definition ground truth database. Pattern Recognition Letters xx(x), xx–xx (2008)
4. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow esti-
mation based on a theory for warping. ECCV (2004)
5. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab:
Semantic image segmentation with deep convolutional nets, atrous convolution,
and fully connected crfs. PAMI (2018)
6. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution
for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
7. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with
atrous separable convolution for semantic image segmentation. In: ECCV (2018)
8. Cheng, B., Chen, L.C., Wei, Y., Zhu, Y., Huang, Z., Xiong, J., Huang, T.S., Hwu,
W.M., Shi, H.: Spgnet: Semantic prediction guidance for scene parsing. In: ICCV
(October 2019)
9. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene
understanding. In: CVPR (2016)
10. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convo-
lutional networks. In: ICCV (2017)
11. Ding, H., Jiang, X., Liu, A.Q., Magnenat-Thalmann, N., Wang, G.: Boundary-
aware feature propagation for scene segmentation (2019)
12. Ding, H., Jiang, X., Shuai, B., Qun Liu, A., Wang, G.: Context contrasted feature
and gated multi-scale aggregation for scene segmentation. In: CVPR (2018)
13. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van
Der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convo-
lutional networks. In: CVPR (2015)
14. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal
visual object classes (voc) challenge. IJCV (2010)
15. Shen, F., Gan, R., Yan, S., Zeng, G.: Semantic segmentation via structured patch prediction, context crf and guidance crf. In: CVPR (2017)
16. Fu, J., Liu, J., Tian, H., Fang, Z., Lu, H.: Dual attention network for scene seg-
mentation. arXiv preprint arXiv:1809.02983 (2018)
17. Gadde, R., Jampani, V., Gehler, P.V.: Semantic video cnns through representation
warping. In: ICCV (Oct 2017)
18. He, J., Deng, Z., Qiao, Y.: Dynamic multi-scale filters for semantic segmentation.
In: ICCV (October 2019)
19. He, J., Deng, Z., Zhou, L., Wang, Y., Qiao, Y.: Adaptive pyramid context network
for semantic segmentation. In: CVPR (June 2019)
20. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR (2016)
21. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross
attention for semantic segmentation (2019)
22. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer
networks. ArXiv abs/1506.02025 (2015)
23. Ke, T.W., Hwang, J.J., Liu, Z., Yu, S.X.: Adaptive affinity fields for semantic
segmentation. In: ECCV (2018)
24. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. ArXiv abs/1609.02907 (2016)
25. Kirillov, A., Girshick, R., He, K., Dollar, P.: Panoptic feature pyramid networks.
In: CVPR (June 2019)
26. Kong, S., Fowlkes, C.C.: Recurrent scene parsing with perspective understanding
in the loop. In: CVPR (2018)
27. Li, H., Xiong, P., Fan, H., Sun, J.: Dfanet: Deep feature aggregation for real-time
semantic segmentation. In: CVPR (June 2019)
28. Li, X., Yang, Y., Zhao, Q., Shen, T., Lin, Z., Liu, H.: Spatial pyramid based graph
reasoning for semantic segmentation. In: CVPR (2020)
29. Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization
attention networks for semantic segmentation. In: ICCV (2019)
30. Li, X., Zhao, H., Han, L., Tong, Y., Yang, K.: Gff: Gated fully fusion for semantic segmentation. In: AAAI (2020)
31. Li, X., Zhou, Y., Pan, Z., Feng, J.: Partial order pruning: for best speed/accuracy
trade-off in neural architecture search. In: CVPR (2019)
32. Li, Y., Gupta, A.: Beyond grids: Learning graph representations for visual recog-
nition. In: NIPS (2018)
33. Li, Y., Shi, J., Lin, D.: Low-latency video semantic segmentation. In: CVPR (June
2018)
34. Lin, T.Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature
pyramid networks for object detection. In: CVPR (2017)
35. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: CVPR (2015)
36. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: Shufflenet v2: Practical guidelines for
efficient cnn architecture design. In: ECCV (September 2018)
37. Mazzini, D.: Guided upsampling network for real-time semantic segmentation. In:
BMVC (2018)
38. Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H.: Espnet: Effi-
cient spatial pyramid of dilated convolutions for semantic segmentation. In: ECCV
(September 2018)
39. Mehta, S., Rastegari, M., Shapiro, L., Hajishirzi, H.: Espnetv2: A light-weight,
power efficient, and general purpose convolutional neural network. In: CVPR (June
2019)
40. Mottaghi, R., Chen, X., Liu, X., Cho, N.G., Lee, S.W., Fidler, S., Urtasun, R.,
Yuille, A.: The role of context for object detection and semantic segmentation in
the wild. In: CVPR (2014)
41. Nekrasov, V., Chen, H., Shen, C., Reid, I.: Fast neural architecture search of com-
pact semantic segmentation models via auxiliary cells. In: CVPR (June 2019)
42. Neuhold, G., Ollmann, T., Rota Bulo, S., Kontschieder, P.: The mapillary vistas
dataset for semantic understanding of street scenes. In: ICCV (2017)
43. Nilsson, D., Sminchisescu, C.: Semantic video segmentation by gated recurrent flow
propagation. In: CVPR (June 2018)
44. Orsic, M., Kreso, I., Bevandic, P., Segvic, S.: In defense of pre-trained imagenet ar-
chitectures for real-time semantic segmentation of road-driving images. In: CVPR
(June 2019)
45. Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: Enet: A deep neural network architecture for real-time semantic segmentation. http://arxiv.org/abs/1606.02147
46. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch. In:
NIPS-W (2017)
47. Romera, E., Alvarez, J.M., Bergasa, L.M., Arroyo, R.: Erfnet: Efficient residual
factorized convnet for real-time semantic segmentation. IEEE Trans. Intelligent
Transportation Systems pp. 263–272 (2018)
48. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. MICCAI (2015)
49. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large
scale visual recognition challenge. IJCV (2015)
50. Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors
with online hard example mining. In: CVPR (2016)
51. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
52. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR
(June 2018)
53. Wu, Z., Shen, C., van den Hengel, A.: Wider or deeper: Revisiting the resnet model
for visual recognition. arXiv preprint arXiv:1611.10080 (2016)
54. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene
understanding. In: ECCV (2018)
55. Yang, M., Yu, K., Zhang, C., Li, Z., Yang, K.: Denseaspp for semantic segmentation
in street scenes. In: CVPR (2018)
56. Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Bisenet: Bilateral segmen-
tation network for real-time semantic segmentation. In: ECCV (2018)
57. Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Learning a discriminative
feature network for semantic segmentation. In: CVPR (2018)
58. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. ICLR
(2016)
59. Yuan, Y., Wang, J.: Ocnet: Object context network for scene parsing. arXiv
preprint arXiv:1809.00916 (2018)
60. Zhang, H., Dana, K., Shi, J., Zhang, Z., Wang, X., Tyagi, A., Agrawal, A.: Context
encoding for semantic segmentation. In: CVPR (2018)
61. Zhang, H., Zhang, H., Wang, C., Xie, J.: Co-occurrent features in semantic seg-
mentation. In: CVPR (June 2019)
62. Zhang, L., Li, X., Arnab, A., Yang, K., Tong, Y., Torr, P.H.: Dual graph convolu-
tional network for semantic segmentation. In: BMVC (2019)
63. Zhang, L., Xu, D., Arnab, A., Torr, P.H.: Dynamic graph message passing networks.
In: CVPR (2020)
64. Zhang, R., Tang, S., Zhang, Y., Li, J., Yan, S.: Scale-adaptive convolutions for
scene parsing. In: ICCV (2017)
65. Zhang, Y., Qiu, Z., Liu, J., Yao, T., Liu, D., Mei, T.: Customizable architecture
search for semantic segmentation. In: CVPR (June 2019)
66. Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: Icnet for real-time semantic segmentation
on high-resolution images. In: ECCV (September 2018)
67. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In:
CVPR (2017)
68. Zhao, H., Zhang, Y., Liu, S., Shi, J., Change Loy, C., Lin, D., Jia, J.: Psanet:
Point-wise spatial attention network for scene parsing. In: ECCV (2018)
69. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Semantic under-
standing of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442
(2016)
70. Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recog-
nition. In: CVPR (July 2017)
71. Zhu, Y., Sapra, K., Reda, F.A., Shih, K.J., Newsam, S., Tao, A., Catanzaro, B.:
Improving semantic segmentation via video propagation and label relaxation. In:
CVPR (June 2019)
72. Zhu, Z., Xu, M., Bai, S., Huang, T., Bai, X.: Asymmetric non-local neural networks
for semantic segmentation. In: ICCV (2019)