
IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, VOL. 18, NO. 3, JULY 2021

FuseSeg: Semantic Segmentation of Urban Scenes Based on RGB and Thermal Data Fusion

Yuxiang Sun, Member, IEEE, Weixun Zuo, Peng Yun, Hengli Wang, and Ming Liu, Senior Member, IEEE

Abstract—Semantic segmentation of urban scenes is an essential component in various applications of autonomous driving. It has made great progress with the rise of deep learning technologies. Most of the current semantic segmentation networks use single-modal sensory data, which are usually the RGB images produced by visible cameras. However, the segmentation performance of these networks is prone to be degraded when lighting conditions are not satisfied, such as dim light or darkness. We find that thermal images produced by thermal imaging cameras are robust to challenging lighting conditions. Therefore, in this article, we propose a novel RGB and thermal data fusion network named FuseSeg to achieve superior performance of semantic segmentation in urban scenes. The experimental results demonstrate that our network outperforms the state-of-the-art networks.

Note to Practitioners—This article investigates the problem of semantic segmentation of urban scenes when lighting conditions are not satisfied. We provide a solution to this problem via information fusion with RGB and thermal data. We build an end-to-end deep neural network, which takes as input a pair of RGB and thermal images and outputs pixel-wise semantic labels. Our network could be used for urban scene understanding, which serves as a fundamental component of many autonomous driving tasks, such as environment modeling, obstacle avoidance, motion prediction, and planning. Moreover, the simple design of our network allows it to be easily implemented using various deep learning frameworks, which facilitates its application on different hardware or software platforms.

Index Terms—Autonomous driving, information fusion, semantic segmentation, thermal images, urban scenes.

Manuscript received January 9, 2020; revised April 1, 2020; accepted May 4, 2020. Date of publication June 4, 2020; date of current version July 2, 2021. This article was recommended for publication by Associate Editor C. Yang and Editor D. O. Popa upon evaluation of the reviewers' comments. This work was supported in part by the National Natural Science Foundation of China under Project U1713211, and in part by the Research Grant Council of Hong Kong under Project 11210017. (Corresponding author: Ming Liu.) Yuxiang Sun, Weixun Zuo, Hengli Wang, and Ming Liu are with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). Peng Yun is with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong (e-mail: [email protected]). Color versions of one or more of the figures in this article are available online at https://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASE.2020.2993143

I. INTRODUCTION

SEMANTIC image segmentation generally refers to densely labeling each pixel in an image with a category. Recent years have witnessed a great trend of semantic segmentation shifting from traditional computer vision algorithms to deep learning-based approaches, in which convolutional neural networks (CNNs) have been proven to be very effective in tackling the semantic segmentation problem. With the popularity of autonomous vehicles [1]–[5] and human-assistant driving [6]–[9], semantic segmentation of urban scenes has attracted great attention. It has become a fundamental component for autonomous driving. For example, it provides contributive information to improve point-cloud registration [10]–[12], which is the backbone of many localization and mapping algorithms [13]–[17]. Note that the type of urban scene we consider is the street scene, because we use the public data set released in [18] and that data set is recorded in urban street scenes.

Currently, most of the deep learning-based semantic segmentation networks are designed using single-modal sensory data, which are usually RGB images generated by visible cameras. However, RGB images could become less informative when lighting conditions are not satisfied, such as dim light or total darkness. We found that thermal images are robust to challenging lighting conditions. They are transformed from thermal radiation by thermal imaging cameras. Virtually any matter with a temperature above absolute zero can be observed with thermal imaging [21]. The spectrum of thermal radiation ranges from 0.1 to 100 μm, whereas visible light ranges from 0.4 to 0.76 μm. Most thermal radiation is invisible to human eyes or imaging sensors (e.g., CCD or CMOS) but visible to thermal imaging cameras. Therefore, thermal images could be helpful to detect and segment objects when lighting conditions are not satisfactory.

Note that Lidars can also work in unsatisfactory lighting conditions. The advantages of using thermal imaging cameras are fourfold. First, thermal imaging cameras are more expensive than visible cameras, but they are still much cheaper than Lidars. For price-sensitive applications, such as driver assistance systems, solutions based on thermal imaging cameras would be more attractive. Second, thermal images are grayscale visual images in nature. Therefore, technology advancements in computer vision could directly benefit thermal imaging applications. For example, successful CNNs could be directly used on thermal images to extract features without any modification. Lidar point clouds, in contrast, have different data structures from images; they are sparse point lists instead of dense arrays [22]–[24], so computer vision techniques might not be directly used on them [23]. Third, thermal imaging cameras can provide real-time dense images, like visible cameras.

For instance, the FLIR automotive thermal cameras¹ could stream thermal images with a resolution of 512 × 640 and run at 60 Hz. However, Lidar point clouds are much sparser than thermal images, and the frame rates are slow. For example, the Velodyne HDL-64E S3 can only rotate at up to 20 Hz [25]. As a semantic understanding device, the sparse measurements (64 lines) may overlook object details or far-distance small objects, and the slow frame rate may introduce artifacts or motion distortions that may hinder the perception. Finally, current spinning Lidars are mechanically complex, which mainly stems from the optical beam deflection unit. The mechanical parts, such as motors and gears, are subject to friction and abrasion, making Lidars less durable in long-term operation. In addition, autonomous vehicles usually require Lidars to be installed outside, which may directly expose them to adverse weather conditions and hence shorten their life expectancy, whereas thermal imaging cameras are only electronic devices and could be placed inside vehicles, like visible cameras. They could work in the long term without extra maintenance.

Many researchers resort to Lidar-camera fusion to overcome the limitations of solely using visible cameras. For example, Gao et al. [26] proposed a CNN-based method for object classification with Lidar-camera fusion. They convert the sparse Lidar point clouds to front-view depth images and upsample the depth images to dense ones. Then, the depth images and RGB images can be registered and processed by a CNN. Qi et al. [27] proposed a cascade Lidar-camera fusion pipeline, in which 2-D region proposals are extracted from front-view RGB images with a 2-D image-based object detector, and then the region proposals are projected to 3-D frustums in point clouds. The points in the frustums are processed by PointNet to get the instance segmentation results. Despite the success of Lidar-camera fusion methods, we still think that RGB-thermal fusion would be more suitable than Lidar-camera fusion for semantic reasoning in autonomous driving. Because vulnerable road users, such as pedestrians, normally have higher temperatures than the surrounding environment, they are more discernible in thermal images, which could provide strong signals for segmentation. In addition, thermal imaging cameras can work at 60 Hz or higher, which allows semantic reasoning to be performed at spatially dense intervals. Taking a 70-km/h vehicle speed as an example, the vehicle moving distance between two consecutive images from a 60-Hz camera is around (70 × 10³)/(60 × 3600) ≈ 0.3 m. Such a distance between two rounds of semantic reasoning would be sufficient for most cases.

In this article, we fuse both the RGB and thermal data in a novel deep neural network to achieve superior performance in urban scenes. In particular, from the probabilistic data fusion theory [28], we have to find P(Seg|x₁, x₂) = P(x₂|Seg, x₁)P(x₁|Seg)P(Seg)/P(x₁, x₂), where P(·) represents the probability functions, Seg represents the segmentation results, x₁ and x₂ represent the RGB and thermal data, respectively, and P(x₁, x₂) is usually a constant normalization term. The main novelty of this article lies in the network architecture, especially the data fusion strategy and the proposed decoder.
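For readability, that inline fusion rule can be restated as a display equation (a plain restatement using the same symbols as the text; the proportionality simply notes that the denominator is constant for fixed inputs):

```latex
P(\mathrm{Seg} \mid x_1, x_2)
  = \frac{P(x_2 \mid \mathrm{Seg}, x_1)\, P(x_1 \mid \mathrm{Seg})\, P(\mathrm{Seg})}{P(x_1, x_2)}
  \propto P(x_2 \mid \mathrm{Seg}, x_1)\, P(x_1 \mid \mathrm{Seg})\, P(\mathrm{Seg}).
```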
Fig. 1. Qualitative comparison with two state-of-the-art networks in an almost totally dark lighting condition. A person on a bike is almost invisible in the RGB image but can be clearly seen in the thermal image. We can see that both SegHRNet [19] and DFN [20] fail to correctly segment the objects, whereas our FuseSeg can give an acceptable result. The yellow and blue colors in the mask images represent person and bike, respectively. The other colors represent other classes. The figure is best viewed in color. (a) RGB image. (b) Thermal image. (c) Ground truth. (d) SegHRNet. (e) DFN. (f) Our FuseSeg.

The example in Fig. 1 shows that a person is clearly visible in the thermal image even though the environment is in almost total darkness. We can see that our FuseSeg provides an acceptable segmentation result for the person, whereas the other two networks fail to segment the person. The example demonstrates that networks relying only on RGB data could be degraded when lighting conditions are not satisfied, and our data fusion-based network could be a solution to address the problem. The contributions of this article are listed as follows.

1) We develop a novel RGB-thermal data fusion network for semantic segmentation in urban scenes. The network can be used to get accurate results when lighting conditions are not satisfied, for instance, dim light, total darkness, or on-coming headlights, which is an advantage over the single-modal networks.
2) We construct our Bayesian FuseSeg using the Monte Carlo (MC) dropout technique [29] to analyze the uncertainty of the semantic segmentation results. The performance with different dropout rates is compared.
3) We evaluate our network on a public data set released in [18]. The results demonstrate our superiority over the state of the arts. We also evaluate our network on the SUN-RGBD v1 data set [30]. The results demonstrate our generalization capability to RGB-D data.

The remainder of this article is organized as follows. In Section II, we review the related work. In Section III, we describe our network in detail. Sections IV–VI present the experimental results and discussions. Conclusions and future work are drawn in Section VII.

¹https://www.flir.com/products/adk

II. RELATED WORK

The related work to this article includes single-modal and data fusion semantic segmentation networks, as well as computer vision applications using thermal imaging. We review several representative works in each field.


A. Single-Modal Semantic Segmentation Networks

The first work addressing the semantic segmentation problem end-to-end was the fully convolutional networks (FCNs) proposed by Shelhamer et al. [31]. They modified image classification networks, such as VGG-16 [32], into the fully convolutional form to achieve pixel-wise image segmentation. Noh et al. [33] developed DeconvNet, which consists of a convolutional module for feature extraction and a deconvolutional module for resolution restoration. Badrinarayanan et al. [34] introduced the encoder–decoder concept in SegNet. The functionalities of the encoder and decoder are analogous to those of the convolutional and deconvolutional modules in DeconvNet. Ronneberger et al. [35] developed UNet by introducing skip connections between the encoder and the decoder. The skip connections were proven to be effective in keeping the spatial information. Although UNet was initially designed for biomedical imaging, it generalizes well to other domains. Paszke et al. [36] designed ENet for efficient semantic segmentation by speeding up the inference process of the initial block. They proposed a pooling operation in parallel with a convolutional operation with a stride of 2. Moreover, asymmetric convolutions were employed in its bottleneck module to reduce the redundancy of convolutional weights. Zhao et al. [37] observed that context information could be helpful to improve semantic segmentation performance. Based on this observation, they introduced the pyramid pooling module (PPM) in PSPNet to extract local and global context information at different scales. Wang et al. [38] designed the dense upsampling convolution (DUC) and the hybrid dilated convolution (HDC) for the decoder and encoder, respectively. Compared with bilinear upsampling and deconvolution networks, DUC is learnable and free of zero padding. HDC can alleviate the gridding issue during downsampling. Pohlen et al. [39] proposed FRRN for semantic segmentation, which consists of two processing streams. One stream maintains the feature map resolution at the input level. The other one performs pooling operations to increase the size of the receptive field. The two streams are coupled in the proposed FRRU block. Yu et al. [20] proposed DFN to address two common challenges in semantic segmentation: the intraclass inconsistency problem and the interclass indistinction problem. It mainly consists of a smooth network to capture the multiscale and global context information, as well as a border network to discriminate adjacent patches with similar appearances but different class labels. ERFNet was developed by Romera et al. [40] for efficient semantic segmentation. The core components of ERFNet are the proposed 1-D convolutional layers with kernel sizes of 3 × 1 and 1 × 3. The 1-D convolutional layers are combined with skip connections to form a residual block, which is integrated with an encoder–decoder architecture. Yu et al. [41] proposed BiSeNet, which mainly consists of a spatial path and a context path. The spatial path was designed to preserve the spatial information. It contains three sequential downsampling convolutional operations, which reduce the feature map resolution to 1/8 of the original input size. The context path was designed to provide a sizeable receptive field. An attention refinement module was developed in the context path for performance refinement. Sun et al. [42] developed HRNet, which is able to keep high-resolution representations through the whole encoding process. The network was designed for human pose estimation but can be utilized as a general CNN backbone for other computer vision tasks. They improved HRNet by upsampling low-resolution representations to high resolution [19], with which semantic segmentation maps could be estimated.

B. Data Fusion Semantic Segmentation Networks

Apart from using the single-modal RGB data, depth data from RGB-D cameras [43] have been exploited for semantic segmentation. Hazirbas et al. [44] proposed FuseNet by fusing RGB and depth data in an encoder–decoder structure. In FuseNet, two encoders using VGG-16 as the backbone were designed to take as inputs the RGB and depth data, respectively. The feature maps from the depth encoder were gradually fused into the RGB encoder. Wang and Neumann [45] fused RGB and depth information for semantic segmentation by introducing the depth-aware convolution and depth-aware average pooling operations, which incorporate geometry information into a conventional CNN. They computed the depth similarities between the center pixel and neighboring pixels. The neighboring pixels with close depth values were weighted to contribute more in the operations. For semantic segmentation of urban scenes, MFNet [18] and RTFNet [46] were both proposed to use RGB and thermal data. Ha et al. [18] designed MFNet by fusing RGB and thermal data in an encoder–decoder structure. Two identical encoders were employed to extract features from the RGB and thermal data, respectively. A mini-inception block was designed for the encoder. RTFNet [46] was also designed with two encoders and one decoder. In the decoder of RTFNet, two types of upception blocks were designed to extract features and gradually restore the resolution.

C. Computer Vision Applications Using Thermal Imaging

Apart from semantic segmentation, thermal imaging has been used in other computer vision applications, such as facial expression recognition [48]. Wang et al. [49] proposed a thermal-augmented facial expression recognition method. They designed a similarity constraint to jointly train the visible and thermal expression classifiers. During the testing stage, only visible images are used, which could reduce the cost of the system. Yoon et al. [50] utilized thermal images for drivable road detection at nighttime. A Gabor filter was applied to thermal images to find textureless areas, which were considered as the rough detection results for the drivable road. Superpixel algorithms were employed on thermal images to smooth the segmentation results. Knapik and Cyganek [51] developed a yawn detection-based fatigue recognition method using thermal imaging for driver assistance systems. The method consists of a face detection module, an eye-corner localization module, and a yawn detection module. The yawn detection is inferred from a thermal-anomaly detection model, which is based on the temperature change measurement from the thermal imaging camera.


Fig. 2. Overall architecture of our FuseSeg. It consists of an RGB encoder, a thermal encoder, and a decoder. We employ DenseNet [47] as the backbone
of the encoders. In the first stage of two-stage fusion (TSF), the thermal feature maps are hierarchically added with the RGB feature maps in the RGB
encoder. The fused feature maps are then concatenated with the corresponding decoder feature maps in the second fusion stage. The blue rectangles represent
the feature maps. The white rectangles represent the fused feature maps copied from the RGB encoder. The purple and green arrows represent the feature
extractor and the upsampler in the decoder, respectively. s represents the input resolution of the RGB and thermal images. s = 480 × 640 in this article. The
feature maps at the same level share the same resolution. cn represents the number of channels of the feature maps at different levels. Cat, Conv, Trans Conv,
and BN are short for concatenation, convolution, transposed convolution, and batch normalization. The figure is best viewed in color.

III. PROPOSED NETWORK

A. Overall Architecture

We propose a novel data fusion network named FuseSeg. It generally consists of two encoders to extract features from the input images and one decoder to restore the resolution. The two encoders take as input the three-channel RGB and one-channel thermal images, respectively. Fig. 2 shows the overall structure of our FuseSeg. We employ DenseNet [47] as the backbone of the encoders. We innovatively propose a TSF strategy in our network. As shown in Fig. 2, in the first stage, we hierarchically fuse the corresponding thermal and RGB feature maps through elementwise summation in the RGB encoder. Inspired by [35], the fused feature maps, except the bottom one, are then fused again in the second stage with the corresponding feature maps in the decoder through tensor concatenation. The bottom one is directly copied to the decoder instead of being concatenated. With our TSF strategy, the loss of spatial information caused by the intensive downsampling could be recovered.

B. Encoders

The RGB and thermal encoders are designed with the same structure except for the input dimension, because the input data have different numbers of channels. As aforementioned, we use DenseNet as the backbone. We first delete the classification layer in DenseNet to avoid excessive loss of spatial information. Then, we add a transition layer that is similar to the other transition layers after the fourth dense block. The dense blocks in DenseNet keep the feature map resolution unchanged, whereas the initial block, the max-pooling layer, and the transition layers reduce the feature map resolution by a factor of 2. Note that the feature map resolution has been reduced to 15 × 20 (given the input resolution of 480 × 640) before the final transition layer. Because we disable the ceiling mode of the average pooling operation in the final transition layer, the feature map resolution after the final transition layer is reduced to 7 × 10 (not 8 × 10). There are four architectures for DenseNet: DenseNet-121, DenseNet-169, DenseNet-201, and DenseNet-161. The complexity increases from 121 to 161. DenseNet-161 possesses the largest number of parameters because it is grown with the largest growth rate of 48, whereas the others share a growth rate of 32. We refer readers to [47] for the details of DenseNet. Our FuseSeg follows the same naming rule as DenseNet. The number of channels cn in Fig. 2 varies with different DenseNet architectures. The detailed numbers are listed in Table I.

C. Decoder

The decoder is designed to gradually restore the feature map resolution to the original. We design a decoder that mainly consists of three modules: a feature extractor that sequentially contains two convolutional layers, an upsampler, and an out block, the latter two of which each contain one transposed convolutional layer. Note that each convolutional and transposed convolutional layer in the feature extractor and the upsampler is followed by a batch normalization layer and a ReLU activation layer. The detailed configurations for the convolutional and transposed convolutional layers are displayed in Table II.
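To make the encoder construction and the first fusion stage concrete, the following PyTorch-style sketch builds the two DenseNet-161 backbones from torchvision, drops the classifier, appends an extra transition-like layer after the fourth dense block, and sums the thermal feature maps into the RGB stream level by level. It is a minimal sketch under stated assumptions, not the released implementation: the stage slicing, the single-channel thermal stem, and the channel handling of the extra transition layer are illustrative choices.

```python
# Minimal sketch of the FuseSeg encoders and the first fusion stage
# (elementwise summation of thermal features into the RGB stream).
# The stage slicing and the extra transition layer are illustrative
# assumptions that follow the description in this section.
import torch
import torch.nn as nn
import torchvision


def densenet161_stages(in_channels: int) -> nn.ModuleList:
    """DenseNet-161 backbone split into five stages; classifier removed."""
    features = torchvision.models.densenet161(weights="DEFAULT").features
    if in_channels != 3:
        # Thermal input has one channel; the pretrained RGB stem is replaced.
        features.conv0 = nn.Conv2d(in_channels, 96, kernel_size=7,
                                   stride=2, padding=3, bias=False)
    # Extra transition-like layer after the fourth dense block (assumed form;
    # the real layer may also reduce channels). ceil_mode stays False so
    # 15 x 20 becomes 7 x 10, as described in the text.
    extra_transition = nn.Sequential(
        nn.BatchNorm2d(2208), nn.ReLU(inplace=True),
        nn.Conv2d(2208, 2208, kernel_size=1, bias=False),
        nn.AvgPool2d(kernel_size=2, stride=2, ceil_mode=False))
    return nn.ModuleList([
        nn.Sequential(features.conv0, features.norm0, features.relu0),
        nn.Sequential(features.pool0, features.denseblock1, features.transition1),
        nn.Sequential(features.denseblock2, features.transition2),
        nn.Sequential(features.denseblock3, features.transition3),
        nn.Sequential(features.denseblock4, extra_transition),
    ])


class TwoStreamEncoder(nn.Module):
    """RGB and thermal encoders fused by hierarchical summation (stage one)."""

    def __init__(self):
        super().__init__()
        self.rgb = densenet161_stages(in_channels=3)
        self.thermal = densenet161_stages(in_channels=1)

    def forward(self, rgb, thermal):
        fused = []                     # kept for the second fusion stage
        x_r, x_t = rgb, thermal
        for rgb_stage, th_stage in zip(self.rgb, self.thermal):
            x_r = rgb_stage(x_r)
            x_t = th_stage(x_t)
            x_r = x_r + x_t            # elementwise summation in the RGB encoder
            fused.append(x_r)
        return fused                   # shallow-to-deep fused feature maps
```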


TABLE I
NUMBER OF FEATURE MAP CHANNELS cn AT DIFFERENT LEVELS ACCORDING TO DIFFERENT DENSENET ARCHITECTURES

TABLE II
CONFIGURATIONS FOR THE CONVOLUTIONAL (CONV) AND TRANSPOSED CONVOLUTIONAL (TRANS CONV) LAYERS IN THE INDIVIDUAL MODULES OF THE DECODER

The feature extractor is employed to extract features from the fused feature maps. It keeps the resolution of the feature maps unchanged. Both the upsampler and the out block increase the resolution by a factor of 2. The out block outputs the final prediction results with a channel number of 9, which is the number of classes. We add a softmax layer after the output to get the probability map for the segmentation results. As aforementioned, the feature map resolution is 7 × 10 at the end of the encoder. To restore it to 15 × 20, we employ a padding technique at this level of the upsampler. The upsampled feature map is concatenated with the one from the RGB encoder during the second stage of our TSF. The number of feature channels doubles after the concatenation.
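The decoder modules just described can be sketched as follows. This is a hedged illustration rather than the exact configuration of Table II; the kernel sizes and the use of output_padding to recover 15 × 20 from 7 × 10 are assumptions.

```python
# Sketch of the decoder building blocks: a feature extractor with two
# sequential Conv-BN-ReLU units (resolution unchanged) and an upsampler /
# out block with one TransConv-BN-ReLU unit (resolution doubled).
import torch
import torch.nn as nn


def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))


class FeatureExtractor(nn.Module):
    """Two sequential Conv-BN-ReLU blocks; keeps the feature map resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(conv_bn_relu(in_ch, out_ch),
                                  conv_bn_relu(out_ch, out_ch))

    def forward(self, x):
        return self.body(x)


class Upsampler(nn.Module):
    """One TransConv-BN-ReLU block; doubles the resolution.

    output_padding=(1, 0) is one assumed way to realize the padding
    technique mentioned above, mapping 7 x 10 to 15 x 20 at the deepest level.
    """
    def __init__(self, in_ch, out_ch, output_padding=(0, 0)):
        super().__init__()
        self.body = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2,
                               output_padding=output_padding, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)


# Second-stage fusion at one decoder level (hypothetical usage):
#   x = Upsampler(c_in, c_mid, output_padding=(1, 0))(bottom)  # 7x10 -> 15x20
#   x = torch.cat([x, fused_encoder_map], dim=1)               # channels double
#   x = FeatureExtractor(2 * c_mid, c_mid)(x)
# The final out block is a transposed convolution mapping to 9 channels,
# followed by softmax for the per-class probability map.
```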
IV. EXPERIMENTAL SETUP

A. Data Set

In this article, we use the public data set released by Ha et al. [18]. It was recorded in urban street scenes and contains common objects: car, person, bike, curve (road lanes), car stop, guardrail, color cone, and bump. The images are captured at the 480 × 640 resolution by an InfReC R500 camera, which can provide RGB and thermal images simultaneously. There are 1569 registered RGB and thermal images in the data set, among which 749 are taken at nighttime and 820 are taken at daytime. The data set is provided with hand-labeled pixel-wise ground truth, including the aforementioned eight classes of common objects and one unlabeled background class.

B. Training Details

We train the networks on a PC with an Intel i7 CPU and an NVIDIA 1080 Ti graphics card with 11 GB of graphics memory. We accordingly adjust the batch sizes of the networks to fit the graphics memory. We employ the data set splitting scheme used in [18]. The training set consists of 50% of the daytime images and 50% of the nighttime images, whereas the validation and test sets each consist of 25% of the daytime images and 25% of the nighttime images. The training set is augmented with the flip technique. Our FuseSeg is implemented with PyTorch. The convolutional and transposed convolutional layers in the decoder are initialized using the Xavier scheme [52]. The encoder layers are initialized using the pretrained weights provided by PyTorch. We use the stochastic gradient descent (SGD) optimization solver and the cross-entropy loss for training. The learning rate is decayed exponentially. The networks are trained until no further decrease in the loss is observed.
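The training recipe above can be expressed as a short PyTorch sketch; the learning rate, momentum, and decay factor shown here are illustrative placeholders, since the paper does not report the exact values.

```python
# Hedged sketch of the training setup: Xavier-initialized decoder, SGD with
# cross-entropy loss, and an exponentially decayed learning rate.
import torch
import torch.nn as nn


def init_decoder(module):
    # Xavier initialization for the decoder (transposed) convolutions;
    # the encoders keep the PyTorch-pretrained DenseNet weights.
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)


def build_training(model, decoder, lr=0.01, momentum=0.9, gamma=0.95):
    decoder.apply(init_decoder)
    criterion = nn.CrossEntropyLoss()                       # pixel-wise loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    return criterion, optimizer, scheduler


# One training step (shapes assume 480 x 640 inputs and 9 classes):
#   logits = model(rgb, thermal)          # N x 9 x 480 x 640
#   loss = criterion(logits, labels)      # labels: N x 480 x 640, dtype long
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
#   scheduler.step()                      # e.g., once per epoch
```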
C. Evaluation Metrics

For the quantitative evaluation, we use the same metrics as [46]: accuracy (Acc) and intersection over union (IoU). Let Acc_i and IoU_i denote the Acc and IoU for class i. They are computed as

Acc_i = \frac{\sum_{k=1}^{K} \theta_{ii}^{k}}{\sum_{k=1}^{K} \theta_{ii}^{k} + \sum_{k=1}^{K} \sum_{j=1, j \neq i}^{N} \theta_{ij}^{k}},   (1)

IoU_i = \frac{\sum_{k=1}^{K} \theta_{ii}^{k}}{\sum_{k=1}^{K} \theta_{ii}^{k} + \sum_{k=1}^{K} \sum_{j=1, j \neq i}^{N} \theta_{ji}^{k} + \sum_{k=1}^{K} \sum_{j=1, j \neq i}^{N} \theta_{ij}^{k}},   (2)

where \theta_{ii}^{k}, \theta_{ij}^{k}, and \theta_{ji}^{k} represent, in image k, the number of pixels of class i that are correctly classified as class i, the number of pixels of class i that are wrongly classified as class j, and the number of pixels of class j that are wrongly classified as class i, respectively. K and N represent the number of test images and the number of classes, respectively. N = 9 in this article. We use mAcc and mIoU to represent the arithmetic average values of Acc and IoU across the nine classes.
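As a concrete reading of (1) and (2), the following NumPy sketch accumulates a confusion matrix over the K test images and then derives the per-class Acc and IoU; it is an illustration of the formulas, not the authors' evaluation script.

```python
# Acc_i and IoU_i from an accumulated confusion matrix (Eqs. (1) and (2)).
import numpy as np

NUM_CLASSES = 9  # eight object classes plus the unlabeled class


def confusion_matrix(pred, label, num_classes=NUM_CLASSES):
    # conf[i, j] counts pixels of ground-truth class i predicted as class j.
    idx = label.reshape(-1).astype(np.int64) * num_classes + pred.reshape(-1)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                 num_classes)


def acc_and_iou(conf):
    tp = np.diag(conf).astype(np.float64)          # theta_ii summed over k
    fn = conf.sum(axis=1) - tp                     # theta_ij, j != i
    fp = conf.sum(axis=0) - tp                     # theta_ji, j != i
    acc = tp / np.maximum(tp + fn, 1)              # Eq. (1)
    iou = tp / np.maximum(tp + fn + fp, 1)         # Eq. (2)
    return acc, iou


# Usage over a test set of (prediction, label) integer maps:
#   conf = sum(confusion_matrix(p, l) for p, l in test_pairs)
#   acc, iou = acc_and_iou(conf)
#   mAcc, mIoU = acc.mean(), iou.mean()            # averages over the 9 classes
```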
V. ABLATION STUDY

A. Ablation for Encoders

1) Encoder Backbone: Since ResNet [53], WideResNet [54], ResNeXt [55], and HourglassNet [56] have similar structures to DenseNet, we replace DenseNet with these networks and compare their performance with ours. The quantitative results are listed in Table III. As we can see, using DenseNet-161 achieves the best performance, which confirms the effectiveness of our choice.

2) Single-Modal Performance: We delete the thermal encoder of FuseSeg to see the performance without using the thermal information. We name this variant no thermal encoder (NTE). Similarly, we delete the RGB encoder to see how the network performs given only the thermal information. This variant is termed no RGB encoder (NRE). In these two variants, the first-stage fusion in our TSF strategy is canceled since there is only one encoder in the network. We display the results with respect to different DenseNet architectures in Table IV. We can see that all the networks using DenseNet-161 gain more accuracy than the others. The superior performance is expected because DenseNet-161 presents the best image classification performance among the four DenseNet architectures.


TABLE III
RESULTS (%) OF ABLATION STUDY FOR ENCODERS USING DIFFERENT BACKBONES ON THE TEST SET. WE USE DENSENET-161 IN OUR NETWORK. BOLD FONT HIGHLIGHTS THE BEST RESULTS

TABLE IV
RESULTS (%) OF ABLATION STUDY FOR ENCODERS ON THE TEST SET. OURS DISPLAYS THE RESULTS OF OUR FUSESEG. BOLD FONT HIGHLIGHTS THE BEST RESULTS

Moreover, our FuseSeg outperforms NTE and NRE, proving that the data fusion is a benefit here. Comparing NTE and NRE, we find that all the NRE results are better than those of NTE. This indicates that thermal information plays a significant role in our network.

B. Ablation for Fusion Strategy

For the ablation of the fusion strategy, we compare the TSF proposed in our FuseSeg with seven variants. The first four variants modify the first stage of our TSF strategy. The next two modify the second stage. The last one modifies both the first stage and the second stage. The detailed descriptions of the variants are listed as follows.

1) OEF: This variant deletes all the fusion connections between the two encoders and keeps only one encoder. The encoder is fed with four-channel RGB-thermal data, so it is a version with only early fusion (OEF).
2) HEF: This variant keeps the network unchanged, except that the RGB encoder is fed with four-channel RGB-thermal data, so it has early fusion (HEF).
3) OLF: This variant deletes the fusion connections between the two encoders except the last one, so the encoder feature maps are fused with only late fusion (OLF). Since there is no fusion between RGB and thermal at the other levels, only the RGB feature maps are fused to the decoder at those levels.
4) RCF: The summation fusion between the encoders is replaced with concatenation fusion (RCF). To keep the number of channels unchanged, the concatenated feature maps are processed with a 1 × 1 convolution layer to reduce the number of channels.
5) NSC: This variant deletes all the fusion connections between the encoder and the decoder. Therefore, the variant has no skip connection (NSC) between the encoder and the decoder except at the bottom level.
6) RSF: The concatenation fusion between the encoder and the decoder is replaced with summation fusion (RSF). The input dimension of the feature extractor in the decoder is correspondingly modified to take as input the summed feature map.
7) CSF: This variant combines RSF and RCF. Therefore, it is performed with the concatenation and summation fusion (CSF) at the first and second stages, respectively.

TABLE V
RESULTS (%) OF ABLATION STUDY FOR FUSION STRATEGY ON THE TEST SET. ALL THE VARIANTS USE DENSENET-161 AS THE ENCODER BACKBONE. OURS DISPLAYS THE RESULTS OF OUR FUSESEG-161. BOLD FONT HIGHLIGHTS THE BEST RESULTS

The results are displayed in Table V. Our FuseSeg with the proposed TSF strategy presents the best performance, which confirms the effectiveness of TSF. We find that OEF and NSC both provide low performance. The reason for the OEF performance could be that the features are not well extracted with only one encoder, even though it is fed with four-channel data. The inferior performance of NSC proves that the second-stage fusion between the encoder and the decoder in our TSF strategy is critical to improving the performance. We find from the HEF results that having the early fusion at the input level could degrade the performance. The OLF results show that the fusions between the two encoders at different levels are necessary for our network. From the results of RCF, RSF, and CSF, we can find that using summation for the first stage and concatenation for the second stage is the superior choice here.

C. Ablation for Decoder

In our FuseSeg, the feature extractor in the decoder consists of two sequential Conv-BN-ReLU blocks, as shown in Fig. 2. We compare our FuseSeg with five variants that have different feature extractors in the decoder. We list the detailed information as follows.

1) TPC: The feature extractor mainly consists of two parallel convolutional (TPC) layers.
2) OC: The feature extractor consists of only one Conv-BN-ReLU (OC) block.
3) TSC: The feature extractor consists of three sequential Conv-BN-ReLU (TSC) blocks.
4) THC: The feature extractor mainly consists of three hybrid-organized convolutional (THC) layers.
5) FSC: The feature extractor consists of four sequential Conv-BN-ReLU (FSC) blocks.


TABLE VI
RESULTS (%) OF ABLATION STUDY FOR THE DECODER ON THE TEST SET. ALL THE VARIANTS USE DENSENET-161 AS THE ENCODER BACKBONE. OURS DISPLAYS THE RESULTS OF OUR FUSESEG-161. BOLD FONT HIGHLIGHTS THE BEST RESULTS

Fig. 3. Detailed structures for TPC and THC. The figure is best viewed in color.

The detailed structures for TPC and THC are shown in Fig. 3. All the convolutional layers in the different feature extractors share the same kernel size, stride, and padding as ours. We also build three variants that replace the TransConv-BN-ReLU block in the upsampler with different structures. The detailed descriptions are listed as follows.

1) OCOI: The upsampler sequentially consists of one Conv-BN-ReLU block and one interpolation function (OCOI). The stride of the convolutional layer in the Conv-BN-ReLU block is 1. The scale factor for the interpolation function is 2.
2) TPOI: The upsampler sequentially consists of two parallel convolutional layers and one interpolation function (TPOI). The two parallel convolutional layers are similar to those in TPC. The stride for the convolutional layers is 1. The scale factor for the interpolation function is 2.
3) OCOT: The upsampler sequentially consists of one Conv-BN-ReLU block and one TransConv-BN-ReLU block (OCOT). The strides of the convolutional layers in the Conv-BN-ReLU block and the TransConv-BN-ReLU block are 1 and 2, respectively.

Table VI displays the results. For the feature extractor, our FuseSeg with the simple two sequential Conv-BN-ReLU blocks presents the best performance. OC, TSC, and FSC also have a sequential structure. Their results show that the performance decreases with an increasing number of layers in the sequential structure. We find that the THC results are close to ours. The reason could be that the summation of the two parallel convolutional layers in THC actually resembles the single convolutional layer in ours. It can be imagined as breaking one convolutional layer into two convolutional layers and then adding them together. This increases the number of parameters, but the results show that it does not increase the performance. A similar case happens to TPC and OC. The two parallel convolutional layers in TPC resemble the single convolutional layer in OC, so they share a similar performance, but TPC is slightly worse than OC. For the upsampler, we find that using the interpolation function (i.e., OCOI and TPOI) to increase the feature map resolution presents inferior performance. Comparing the results of ours and OCOT, we find that using only one transposed convolutional layer to simultaneously change the feature map dimension and increase the feature map resolution is sufficient for our network.

VI. COMPARATIVE STUDY

We compare our FuseSeg with FRRN [39], BiSeNet [41], DFN [20], SegHRNet [19], MFNet [18], FuseNet [44], DepthAwareCNN [45], and RTFNet [46] in this section. The results of MFNet [18], FuseNet [44], and RTFNet [46] are directly imported from [46] to facilitate the comparison. We use RTFNet-152, FRRN model B, and HRNetV2-W48 here. The results of SegNet [34], UNet [35], ENet [36], PSPNet [37], DUC-HDC [38], and ERFNet [40] can be found in [46]. Our FuseSeg outperforms these networks. As FuseSeg uses four-channel RGB-thermal data, to make fair comparisons, we modify the input layers of the single-modal networks to take as input the four-channel data. We train and compare them using the three- and four-channel data, respectively.

A. Overall Results

Table VII displays the quantitative results for the comparison. We can see that our FuseSeg-161 outperforms the other networks in terms of mAcc and mIoU. Among the single-modal networks, both DFN and SegHRNet present relatively good results, which shows the generalization capabilities of these networks. Comparing the three- and four-channel results of the single-modal networks, we find that almost all the four-channel results are better than the three-channel ones. This demonstrates that using thermal information is beneficial to the overall performance.

B. Daytime and Nighttime Results

We evaluate the networks under the daytime and nighttime lighting conditions, respectively. The comparative results are displayed in Table VIII. We find that FuseSeg outperforms most of the other networks. For the daytime condition, some of the single-modal networks using the three-channel data are better than those using the four-channel data. We conjecture that the reason is that the registration errors [18] between the RGB and thermal images confuse the prediction. In the daytime, both RGB and thermal images encode strong features, so temporal or spatial misalignments between the two-modal data would give contradictory information and thus degrade the performance. For the nighttime condition, almost all the single-modal networks provide superior performance when using the four-channel data. This is expected because RGB images are less informative when lighting conditions are not well satisfied. Incorporating thermal images, in which objects remain visible, could help the segmentation.

C. Inference Speed

Table IX displays the approximate number of parameters and the inference speed for each network. The speed is evaluated on an NVIDIA GTX 1080 Ti and an NVIDIA Jetson TX2 (Tegra X2). For the single-modal networks, we only test with the four-channel data. We find that almost all the networks run in real time on the 1080 Ti (i.e., at greater than 30 Hz), but most of them cannot run in real time on the TX2.


TABLE VII
COMPARATIVE RESULTS (%) ON THE TEST SET. 3c AND 4c REPRESENT THAT THE NETWORKS ARE TESTED WITH THE THREE-CHANNEL RGB DATA AND FOUR-CHANNEL RGB-THERMAL DATA, RESPECTIVELY. NOTE THAT THE mAcc AND mIoU ARE CALCULATED WITH THE UNLABELED CLASS, BUT THE RESULTS FOR THE UNLABELED CLASS ARE NOT DISPLAYED. THE BOLD FONT HIGHLIGHTS THE BEST RESULT IN EACH COLUMN

TABLE VIII
COMPARATIVE RESULTS (%) IN DAYTIME AND NIGHTTIME. THE BOLD FONT HIGHLIGHTS THE BEST RESULT IN EACH COLUMN

TABLE IX
NUMBER OF PARAMETERS AND INFERENCE SPEED FOR EACH NETWORK. ms AND FPS REPRESENT MILLISECONDS AND FRAMES PER SECOND, RESPECTIVELY

Our FuseSeg reaches only 1.7 Hz on the TX2, making it impractical for real-time applications on such low-level computing devices. In addition, it would also be impractical to run our network on the low-cost NVIDIA Jetson Nano and Intel Movidius, which are weaker than the TX2 [57]. We think that the double processing (two-branch encoder) of images using complex backbones (e.g., in the table, ResNet-152 for RTFNet and DenseNet-161 for ours) might be the major factor leading to the low speed.

D. Qualitative Demonstrations

Fig. 4 shows sample qualitative results for the data fusion networks. We can see that our FuseSeg can provide superior results under various challenging lighting conditions. Specifically, in the second column, two persons behind the far bikes are almost invisible in the RGB image due to the limited dynamic range of the RGB camera, but they can be seen in the thermal image. Our FuseSeg could take advantage of the contributive thermal information to correctly segment the two persons. In the seventh column, the bikes are almost invisible in the thermal image, which may be caused by a temperature similar to that of the environment. They can be seen a little in the RGB image. Our FuseSeg could make use of the two-modal information and correctly find the three bikes.

We also find that the results of FuseSeg and RTFNet are very close to each other, but FuseSeg performs better because it provides sharper object boundaries, especially in the first column. By comparing FuseSeg and RTFNet, we conjecture that this may benefit from our connections between the encoder and the decoder.


Fig. 4. Qualitative demonstrations for the fusion networks in typical daytime and nighttime scenarios, which are shown in the left four and right four columns, respectively. We can see that our network can provide acceptable results in various lighting conditions. The comparative results demonstrate our superiority. The figure is best viewed in color.

The detailed spatial information could be retained through the short connections at each level. This can also explain the unsatisfactory performance of FuseNet, because both FuseNet and RTFNet have no such short connections. Note that although MFNet has such connections, it still presents inferior performance compared with FuseSeg. We think that the reason may stem from the tiny and weak decoder of MFNet, in which only one convolutional layer is contained at each stage. In addition, MFNet uses concatenation for the encoder fusion and summation for the encoder–decoder fusion, which we have proven inferior to our fusion strategy (see the CSF variant in the ablation study). DepthAwareCNN assumes that the pixels on the same object share similar depth (here replaced by temperature) values. However, this assumption is violated here, which may explain its inferiority. For example, the temperature of the car in the fifth column is not evenly distributed, so the car cannot be completely segmented.

E. Uncertainty Estimation

Estimating uncertainty for semantic segmentation helps to know how much the predictions can be trusted. It is an important capability to ensure safe decision-making for autonomous vehicles. MC dropout has been successfully employed to infer posterior distributions for the model parameters of Bayesian networks. This article adopts the MC dropout technique for uncertainty estimation [29]. We construct the Bayesian FuseSeg by inserting dropout layers after the initial blocks, max-pooling layers, and No. 1–4 transition layers of the RGB and thermal encoders. During runtime, we sample the model T times, and here, we set T = 50. The uncertainty ζ for each pixel is calculated by

ζ = -\frac{1}{N} \sum_{n=1}^{N} p(l_n \mid I, \theta) \log p(l_n \mid I, \theta),   (3)

where I, θ, and l_n represent the input image, the network parameters, and the label for the nth class, respectively, N is the number of classes (N = 9), and p(·) is the average softmax output of the network for each pixel over the T samples. The uncertainty ζ is actually the entropy that measures the disorder of the different class probabilities at a pixel [58]. Large entropy means large disorder and, hence, large uncertainty.
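A minimal PyTorch-style sketch of this MC dropout procedure follows: dropout layers are kept active at test time, the softmax output is averaged over T stochastic forward passes, and the per-pixel entropy of (3) is computed. The function name and the way dropout layers are re-enabled are illustrative assumptions, not the released code.

```python
# Monte Carlo dropout uncertainty for a segmentation network (cf. Eq. (3)).
import torch
import torch.nn.functional as F


@torch.no_grad()
def mc_dropout_uncertainty(model, rgb, thermal, T=50, num_classes=9):
    model.eval()
    # Keep only the dropout layers stochastic at test time.
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()

    prob_sum = 0.0
    for _ in range(T):
        logits = model(rgb, thermal)                 # N x C x H x W
        prob_sum = prob_sum + F.softmax(logits, dim=1)
    p = prob_sum / T                                 # averaged class probabilities

    eps = 1e-12                                      # avoids log(0)
    # Per-pixel entropy, normalized by the number of classes as in Eq. (3).
    zeta = -(p * torch.log(p + eps)).sum(dim=1) / num_classes
    prediction = p.argmax(dim=1)                     # N x H x W label map
    return prediction, zeta                          # zeta: N x H x W uncertainty
```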


Fig. 5. Uncertainty maps of the Bayesian FuseSeg-161 for the results shown in Fig. 4. The first and second rows are with dropout rates of 10⁻⁴ and 10⁻², respectively. Uncertainties increase from blue to red. The figure is best viewed in color.

Fig. 6. Performance of the Bayesian FuseSeg-161 according to different dropout rates. We find that the semantic segmentation performance severely degrades when the dropout rate is larger than 10⁻².

Fig. 6 plots the semantic segmentation performance of the Bayesian FuseSeg-161 according to different dropout rates. We find that the performance degrades severely when the dropout rate is larger than 0.01. The reason could be that a large dropout rate dramatically changes the structure of the network and hence severely influences the performance. Fig. 5 shows the uncertainty maps of our Bayesian FuseSeg-161 for different dropout rates. We observe that most of the large uncertainties concentrate on object boundaries. This indicates the ambiguities around the areas where the semantic labels change from one to another. We also find that when the model predicts wrong labels or objects are visually difficult to identify, the uncertainties at these pixels are larger, for example, the left person in the seventh column. Moreover, the uncertainties for the 10⁻² dropout rate are generally larger than those for the 10⁻⁴ dropout rate, indicating that the uncertainties increase when the segmentation accuracy decreases.

F. Generalization to RGB-D Data

TABLE X
QUANTITATIVE RESULTS (%) ON THE TEST SET OF THE SUN-RGBD V1 DATA SET. BEST RESULTS ARE HIGHLIGHTED WITH BOLD FONT

In order to validate the generalization capability of our FuseSeg, we train and test the networks using the SUN-RGBD v1 scene parsing benchmark data set [30]. We split the data set into training, validation, and test sets, which account for around 51.14%, 24.43%, and 24.43%, respectively. All the images are resized to 400 × 528 to increase training efficiency. The thermal images are replaced by the depth images in this experiment. Table X displays the results. We can see that our FuseSeg-161 also achieves better performance, indicating that FuseSeg generalizes well to RGB-D data.

VII. CONCLUSION AND FUTURE WORK

This article proposed a novel deep neural network for RGB and thermal data fusion. We aimed to achieve superior semantic segmentation performance under various lighting conditions, and the experimental results confirmed the superiority over the state of the arts. We performed intensive ablation studies, which showed that the data fusion was a benefit here. The ablation also proved the effectiveness of our network design, including the encoder, the decoder, and the fusion strategy. We also estimated the uncertainties of our network predictions using the MC dropout technique. As aforementioned, our inference speed on low-level computing platforms, such as the NVIDIA TX2, is slow. This may restrict the moving speed of autonomous vehicles that are equipped with such platforms. We consider it our major limitation. In the future, we would like to boost the runtime speed of the network using weight pruning techniques. We will also design encoder backbones that are more efficient and powerful than the general-purpose backbones for our data fusion network. In addition, the data set that we use is class imbalanced. We will tackle this problem using focal-loss techniques [23] to improve our results.

To enable further studies, we list three promising research directions. First, current fusion operations are not aware of the image quality. For the case in which one modality of data is more informative than the other, fusion should give more consideration to the data that are more informative. Thus, how to determine the image quality and smartly perform the fusion is an open question. Second, the data set that we use is not recorded as video sequences. We believe that previous frames in a video sequence could provide stronger signals to correct wrong segmentations and lower the uncertainties of the segmentation in the current frame because they are visually similar. Therefore, recording a new data set as video sequences and improving the overall performance of networks given more than one image as input is a research direction. Finally, current low-cost off-the-shelf RGB-D cameras, such as the Intel RealSense D435, can work in outdoor environments, so they can be used for autonomous vehicles. Different from thermal imaging cameras that discriminate objects by temperature, depth cameras differentiate objects by the measured pixelwise distances to the camera. They can provide a totally different modality of information.


Therefore, recording a new data set together with an RGB-D camera and a thermal imaging camera, and fusing RGB, thermal, as well as depth images in a network to improve the segmentation performance is also a research direction.

REFERENCES

[1] D. Li and H. Gao, "A hardware platform framework for an intelligent vehicle based on a driving brain," Engineering, vol. 4, no. 4, pp. 464–470, Aug. 2018.
[2] P. Cai, X. Mei, L. Tai, Y. Sun, and M. Liu, "High-speed autonomous drifting with deep reinforcement learning," IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 1247–1254, Apr. 2020.
[3] H. Wang, Y. Sun, and M. Liu, "Self-supervised drivable area and road anomaly segmentation using RGB-D data for robotic wheelchairs," IEEE Robot. Autom. Lett., vol. 4, no. 4, pp. 4386–4393, Oct. 2019.
[4] P. Cai, Y. Sun, Y. Chen, and M. Liu, "Vision-based trajectory planning via imitation learning for autonomous vehicles," in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 2736–2742.
[5] H. Chen, C. Xue, S. Liu, Y. Sun, and Y. Chen, "Multiple-object tracking based on monocular camera and 3-D lidar fusion for autonomous vehicles," in Proc. IEEE Int. Conf. Robot. Biomimetics (ROBIO), Dec. 2019, pp. 456–460.
[6] X. Wu, Z. Li, Z. Kan, and H. Gao, "Reference trajectory reshaping optimization and control of robotic exoskeletons for human-robot co-manipulation," IEEE Trans. Cybern., early access, Aug. 30, 2019, doi: 10.1109/TCYB.2019.2933019.
[7] Z. Li, B. Huang, A. Ajoudani, C. Yang, C.-Y. Su, and A. Bicchi, "Asymmetric bimanual control of dual-arm exoskeletons for human-cooperative manipulations," IEEE Trans. Robot., vol. 34, no. 1, pp. 264–271, Feb. 2018.
[8] Y. Sun, W. Zuo, and M. Liu, "See the future: A semantic segmentation network predicting ego-vehicle trajectory with a single monocular camera," IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 3066–3073, Apr. 2020.
[9] Y. Sun, L. Wang, Y. Chen, and M. Liu, "Accurate lane detection with atrous convolution and spatial pyramid pooling for autonomous driving," in Proc. IEEE Int. Conf. Robot. Biomimetics (ROBIO), Dec. 2019, pp. 642–647.
[10] A. Zaganidis, L. Sun, T. Duckett, and G. Cielniak, "Integrating deep semantic segmentation into 3-D point cloud registration," IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 2942–2949, Oct. 2018.
[11] Z. Min, H. Ren, and M. Q.-H. Meng, "Statistical model of total target registration error in image-guided surgery," IEEE Trans. Autom. Sci. Eng., vol. 17, no. 1, pp. 151–165, Jan. 2020.
[12] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robot. Auto. Syst., vol. 89, pp. 110–122, Mar. 2017.
[13] W. S. Grant, R. C. Voorhies, and L. Itti, "Efficient velodyne SLAM with point and plane features," Auto. Robots, vol. 43, no. 5, pp. 1207–1224, Jun. 2019.
[14] J. Cheng, Y. Sun, and M. Q.-H. Meng, "Robust semantic mapping in challenging environments," Robotica, vol. 38, no. 2, pp. 256–270, Feb. 2020.
[15] H. Huang, Y. Sun, H. Ye, and M. Liu, "Metric monocular localization using signed distance fields," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 1195–1201.
[16] J. Cheng, Y. Sun, and M. Q.-H. Meng, "Improving monocular visual SLAM in dynamic environments: An optical-flow-based approach," Adv. Robot., vol. 33, no. 12, pp. 576–589, Jun. 2019.
[17] Y. Sun, M. Liu, and M. Q.-H. Meng, "Motion removal for reliable RGB-D SLAM in dynamic environments," Robot. Auto. Syst., vol. 108, pp. 115–128, Oct. 2018.
[18] Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, and T. Harada, "MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2017, pp. 5108–5115.
[19] K. Sun et al., "High-resolution representations for labeling pixels and regions," 2019, arXiv:1904.04514. [Online]. Available: https://arxiv.org/abs/1904.04514
[20] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "Learning a discriminative feature network for semantic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1857–1866.
[21] M. Vollmer et al., Infrared Thermal Imaging: Fundamentals, Research and Applications. Berlin, Germany: Wiley, 2017.
[22] X. Sun, H. Ma, Y. Sun, and M. Liu, "A novel point cloud compression algorithm based on clustering," IEEE Robot. Autom. Lett., vol. 4, no. 2, pp. 2132–2139, Apr. 2019.
[23] P. Yun, L. Tai, Y. Wang, C. Liu, and M. Liu, "Focal loss in 3D object detection," IEEE Robot. Autom. Lett., vol. 4, no. 2, pp. 1263–1270, Apr. 2019.
[24] S. Wang, Y. Sun, C. Liu, and M. Liu, "PointTrackNet: An end-to-end network for 3-D object detection and tracking from point clouds," IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 3206–3212, Apr. 2020.
[25] F. Wu, B. He, L. Zhang, S. Chen, and J. Zhang, "Vision-and-Lidar based real-time outdoor localization for unmanned ground vehicles without GPS," in Proc. IEEE Int. Conf. Inf. Autom. (ICIA), Aug. 2018, pp. 232–237.
[26] H. Gao, B. Cheng, J. Wang, K. Li, J. Zhao, and D. Li, "Object classification using CNN-based fusion of vision and LIDAR in autonomous vehicle environment," IEEE Trans. Ind. Informat., vol. 14, no. 9, pp. 4224–4231, Sep. 2018.
[27] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum PointNets for 3D object detection from RGB-D data," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 918–927.
[28] I. Bloch, "Information combination operators for data fusion: A comparative review with classification," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 26, no. 1, pp. 52–67, 1996.
[29] A. Kendall, V. Badrinarayanan, and R. Cipolla, "Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding," 2015, arXiv:1511.02680. [Online]. Available: https://arxiv.org/abs/1511.02680
[30] S. Song, S. P. Lichtenberg, and J. Xiao, "SUN RGB-D: A RGB-D scene understanding benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 567–576.
[31] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, Apr. 2017.
[32] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[33] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1520–1528.
[34] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[35] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI). Cham, Switzerland: Springer, 2015, pp. 234–241.
[36] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "ENet: A deep neural network architecture for real-time semantic segmentation," 2017, arXiv:1606.02147. [Online]. Available: https://arxiv.org/abs/1606.02147
[37] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6230–6239.
[38] P. Wang et al., "Understanding convolution for semantic segmentation," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2018, pp. 1451–1460.
[39] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, "Full-resolution residual networks for semantic segmentation in street scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3309–3318.
[40] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, "ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1, pp. 263–272, Jan. 2018.
[41] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "BiSeNet: Bilateral segmentation network for real-time semantic segmentation," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 325–341.
[42] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5693–5703.
[43] Y. Sun, M. Liu, and M. Q.-H. Meng, "Active perception for foreground segmentation: An RGB-D data-based background modeling method," IEEE Trans. Autom. Sci. Eng., vol. 16, no. 4, pp. 1596–1609, Oct. 2019.

Authorized licensed use limited to: Indian Institute of Technology (Ropar). Downloaded on September 24,2024 at 20:48:00 UTC from IEEE Xplore. Restrictions apply.
SUN et al.: FuseSeg: SEMANTIC SEGMENTATION OF URBAN SCENES BASED ON RGB AND THERMAL DATA FUSION 1011

Weixun Zuo received the bachelor's degree from Anhui University, Hefei, China, in 2016, and the master's degree from The Hong Kong University of Science and Technology, Hong Kong, in 2017.

He is currently a Research Assistant with the Department of Electronic and Computer Engineering, Robotics Institute, The Hong Kong University of Science and Technology. His current research interests include mobile robots, semantic segmentation, deep learning, and autonomous vehicles.
Peng Yun received the B.Sc. degree from the Huazhong University of Science and Technology, Wuhan, China, in 2017. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong.

His current research interests include computer vision, machine learning, and autonomous driving.
Hengli Wang received the B.E. degree from Zhejiang University, Hangzhou, China, in 2018. He is currently pursuing the Ph.D. degree with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong.

His current research interests include robot navigation, autonomous driving, computer vision, and deep learning.

Yuxiang Sun (Member, IEEE) received the bachelor's degree from the Hefei University of Technology, Hefei, China, in 2009, the master's degree from the University of Science and Technology of China, Hefei, in 2012, and the Ph.D. degree from The Chinese University of Hong Kong, Hong Kong, in 2017.

He is currently a Research Associate with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong. His current research interests include autonomous driving, deep learning, robotics and autonomous systems, and semantic scene understanding.

Dr. Sun was a recipient of the Best Paper Award in Robotics at the IEEE ROBIO 2019 and the Best Student Paper Finalist Award at the IEEE ROBIO 2015.

Ming Liu (Senior Member, IEEE) received the B.A. degree from Tongji University, Shanghai, China, in 2005, and the Ph.D. degree from ETH Zürich, Zürich, Switzerland, in 2013.

He stayed one year at the University of Erlangen-Nuremberg, Erlangen, Germany, and the Fraunhofer Institute IISB, Erlangen, as a Visiting Scholar. He is involved in several NSF projects and National 863-Hi-Tech-Plan projects in China. He is a Principal Investigator of over 20 projects, including projects funded by RGC, NSFC, ITC, SZSTI, and so on. His current research interests include dynamic environment modeling, 3-D mapping, machine learning, and visual control.

Dr. Liu was the General Chair of ICVS 2017, the Program Chair of the IEEE RCAR 2016, and the Program Chair of the International Robotic Alliance Conference 2017.
