FuseSeg: Semantic Segmentation of Urban Scenes Based on RGB and Thermal Data Fusion

Abstract—Semantic segmentation of urban scenes is an essential component in various applications of autonomous driving. It has made great progress with the rise of deep learning technologies. Most current semantic segmentation networks use single-modal sensory data, which are usually the RGB images produced by visible cameras. However, the segmentation performance of these networks is prone to degradation when lighting conditions are not satisfied, such as dim light or darkness. We find that thermal images produced by thermal imaging cameras are robust to challenging lighting conditions. Therefore, in this article, we propose a novel RGB and thermal data fusion network named FuseSeg to achieve superior semantic segmentation performance in urban scenes. The experimental results demonstrate that our network outperforms the state-of-the-art networks.

Note to Practitioners—This article investigates the problem of semantic segmentation of urban scenes when lighting conditions are not satisfied. We provide a solution to this problem via information fusion with RGB and thermal data. We build an end-to-end deep neural network, which takes as input a pair of RGB and thermal images and outputs pixel-wise semantic labels. Our network could be used for urban scene understanding, which serves as a fundamental component of many autonomous driving tasks, such as environment modeling, obstacle avoidance, motion prediction, and planning. Moreover, the simple design of our network allows it to be easily implemented with various deep learning frameworks, which facilitates applications on different hardware or software platforms.

Index Terms—Autonomous driving, information fusion, semantic segmentation, thermal images, urban scenes.

I. INTRODUCTION

… to deep learning-based approaches, in which convolutional neural networks (CNNs) have been proven to be really effective in tackling the semantic segmentation problem. With the popularity of autonomous vehicles [1]–[5] and human-assistant driving [6]–[9], semantic segmentation of urban scenes has attracted great attention. It has become a fundamental component for autonomous driving. For example, it provides contributive information to improve point-cloud registration [10]–[12], which is the backbone of many localization and mapping algorithms [13]–[17]. Note that the type of urban scenes we consider is the street scene, because we use the public data set released in [18], which is recorded in urban street scenes.

Currently, most deep learning-based semantic segmentation networks are designed using single-modal sensory data, which are usually RGB images generated by visible cameras. However, RGB images could become less informative when lighting conditions are not satisfied, such as dim light or total darkness. We found that thermal images are robust to challenging lighting conditions. They are transformed from thermal radiation by thermal imaging cameras. Virtually any matter with a temperature above absolute zero can be seen with thermal imaging [21]. The spectrum of thermal radiation ranges from 0.1 to 100 μm, whereas visible light ranges from 0.4 to 0.76 μm. Most thermal radiation is invisible to human eyes or imaging sensors (e.g., CCD or CMOS) but visible to thermal imaging cameras. Therefore, thermal images could be helpful to detect and segment objects when lighting …
II. RELATED WORK

A. Single-Modal Semantic Segmentation Networks

The first work addressing the semantic segmentation problem end-to-end was the fully convolutional network (FCN) proposed by Shelhamer et al. [31]. They modified image classification networks, such as VGG-16 [32], into the fully convolutional form to achieve pixel-wise image segmentation. Noh et al. [33] developed DeconvNet, which consists of a convolutional module for feature extraction and a deconvolutional module for resolution restoration. Badrinarayanan et al. [34] introduced the encoder–decoder concept in SegNet. The functionalities of the encoder and decoder are analogous to those of the convolutional and deconvolutional modules in DeconvNet. Ronneberger et al. [35] developed UNet by introducing skip connections between the encoder and the decoder. The skip connections were proven to be effective in keeping the spatial information. Although UNet was initially designed for biomedical imaging, it generalizes well to other domains. Paszke et al. [36] designed ENet for efficient semantic segmentation. To speed up inference, its initial block performs a pooling operation in parallel with a convolutional operation with a stride of 2. Moreover, asymmetric convolutions are employed in its bottleneck module to reduce the redundancy of convolutional weights. Zhao et al. [37] observed that context information could be helpful to improve semantic segmentation performance. Based on this observation, they introduced the pyramid pooling module (PPM) in PSPNet to extract local and global context information at different scales. Wang et al. [38] designed the dense upsampling convolution (DUC) and the hybrid dilated convolution (HDC) for the decoder and encoder, respectively. Compared with bilinear upsampling and deconvolution, DUC is learnable and free of zero padding. HDC can alleviate the gridding issue during downsampling. Pohlen et al. [39] proposed FRRN for semantic segmentation, which consists of two processing streams. One stream maintains the feature map resolution at the input level. The other performs pooling operations to increase the size of the receptive field. The two streams are coupled in the proposed FRRU block. Yu et al. [20] proposed DFN to address two common challenges in semantic segmentation: the intraclass inconsistency problem and the interclass indistinction problem. It mainly consists of a smooth network to capture the multiscale and global context information, as well as a border network to discriminate adjacent patches with similar appearances but different class labels. ERFNet was developed by Romera et al. [40] for efficient semantic segmentation. The core components of ERFNet are the proposed 1-D convolutional layers with kernel sizes of 3 × 1 and 1 × 3. The 1-D convolutional layers are combined with skip connections to form a residual block, which is integrated with an encoder–decoder architecture (a sketch of such a block follows at the end of this subsection). Yu et al. [41] proposed BiSeNet, which mainly consists of a spatial path and a context path. The spatial path was designed to preserve the spatial information. It contains three sequential downsampling convolutional operations, which reduce the feature map resolution to 1/8 of the original input size. The context path was designed to provide a sizeable receptive field. An attention refinement module was developed in the context path for performance refinement. Sun et al. [42] developed HRNet, which is able to keep high-resolution representations through the whole encoding process. The network was designed for human pose estimation but can be utilized as a general CNN backbone for other computer vision tasks. They improved HRNet by upsampling low-resolution representations to high resolution [19], with which semantic segmentation maps could be estimated.
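To make the factorized 1-D convolutions of ERFNet concrete, the following is a minimal PyTorch sketch of such a residual block; the layer ordering, the absence of dilation and dropout, and the channel count are simplifications for illustration rather than the exact block from [40].

```python
import torch
import torch.nn as nn

class Factorized1DBlock(nn.Module):
    """Residual block built from 3x1 and 1x3 convolutions (ERFNet-style sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # The skip connection turns the factorized convolutions into a residual block.
        return torch.relu(x + self.body(x))

# Example: a 64-channel feature map passes through the block with its shape unchanged.
y = Factorized1DBlock(64)(torch.randn(1, 64, 60, 80))
```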
B. Data Fusion Semantic Segmentation Networks

Apart from using single-modal RGB data, depth data from RGB-D cameras [43] have been exploited for semantic segmentation. Hazirbas et al. [44] proposed FuseNet, which fuses RGB and depth data in an encoder–decoder structure. In FuseNet, two encoders using VGG-16 as the backbone take as inputs the RGB and depth data, respectively. The feature maps from the depth encoder are gradually fused into the RGB encoder. Wang and Neumann [45] fused RGB and depth information for semantic segmentation by introducing the depth-aware convolution and depth-aware average pooling operations, which incorporate geometric information into conventional CNNs. They computed the depth similarities between the center pixel and its neighboring pixels. The neighboring pixels with close depth values were weighted to contribute more in the operations (see the sketch after this subsection). For semantic segmentation of urban scenes, MFNet [18] and RTFNet [46] were both proposed to use RGB and thermal data. Ha et al. [18] designed MFNet by fusing RGB and thermal data in an encoder–decoder structure. Two identical encoders were employed to extract features from the RGB and thermal data, respectively. A mini-inception block was designed for the encoder. RTFNet [46] was also designed with two encoders and one decoder. In the decoder of RTFNet, two types of upception blocks were designed to extract features and gradually restore the resolution.
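To make the depth-similarity weighting idea concrete, here is a minimal sketch of a depth-aware convolution; the exponential similarity exp(−α|ΔD|) and the constant α are assumptions for illustration, and this is not the released implementation of [45].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAwareConv2d(nn.Module):
    """Sketch of a depth-aware convolution: neighbours whose depth (or, in the
    thermal analogy, temperature) is close to the centre pixel contribute more."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3, alpha: float = 8.3):
        super().__init__()
        self.k, self.alpha = k, alpha                    # alpha is an assumed constant
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 1e-2)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x, depth):
        # x: (B, C, H, W) features; depth: (B, 1, H, W) depth map.
        b, c, h, w = x.shape
        pad = self.k // 2
        cols = F.unfold(x, self.k, padding=pad)          # (B, C*k*k, H*W)
        dcols = F.unfold(depth, self.k, padding=pad)     # (B, k*k, H*W)
        centre = depth.reshape(b, 1, h * w)
        sim = torch.exp(-self.alpha * (dcols - centre).abs())
        sim = sim.repeat(1, c, 1)                        # same weighting for every channel
        out = self.weight.reshape(self.weight.size(0), -1) @ (cols * sim)
        return (out + self.bias[None, :, None]).reshape(b, -1, h, w)

# Example: 64-channel features with a matching single-channel depth map.
layer = DepthAwareConv2d(64, 64)
y = layer(torch.randn(2, 64, 30, 40), torch.rand(2, 1, 30, 40))
```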
C. Computer Vision Applications Using Thermal Imaging

Apart from semantic segmentation, thermal imaging has been used in other computer vision applications, such as facial expression recognition [48]. Wang et al. [49] proposed a thermal-augmented facial expression recognition method. They designed a similarity constraint to jointly train the visible and thermal expression classifiers. During the testing stage, only visible images are used, which could reduce the cost of the system. Yoon et al. [50] utilized thermal images for drivable road detection at nighttime. A Gabor filter was applied to the thermal images to find textureless areas, which were considered as the rough detection results for the drivable road. Superpixel algorithms were then employed on the thermal images to smooth the segmentation results. Knapik and Cyganek [51] developed a yawn detection-based fatigue recognition method using thermal imaging for driver assistance systems. The method consists of a face detection module, an eye-corner localization module, and a yawn detection module.
Fig. 2. Overall architecture of our FuseSeg. It consists of an RGB encoder, a thermal encoder, and a decoder. We employ DenseNet [47] as the backbone of the encoders. In the first stage of the two-stage fusion (TSF), the thermal feature maps are hierarchically added to the RGB feature maps in the RGB encoder. The fused feature maps are then concatenated with the corresponding decoder feature maps in the second fusion stage. The blue rectangles represent the feature maps. The white rectangles represent the fused feature maps copied from the RGB encoder. The purple and green arrows represent the feature extractor and the upsampler in the decoder, respectively. s represents the input resolution of the RGB and thermal images; s = 480 × 640 in this article. The feature maps at the same level share the same resolution. cn represents the number of channels of the feature maps at different levels. Cat, Conv, Trans Conv, and BN are short for concatenation, convolution, transposed convolution, and batch normalization, respectively. The figure is best viewed in color.
The yawn detection is inferred from a thermal-anomaly detection model, which is based on the temperature change measurement from the thermal imaging camera.

III. PROPOSED NETWORK

A. Overall Architecture

We propose a novel data fusion network named FuseSeg. It generally consists of two encoders to extract features from the input images and one decoder to restore the resolution. The two encoders take as input the three-channel RGB and one-channel thermal images, respectively. Fig. 2 shows the overall structure of our FuseSeg. We employ DenseNet [47] as the backbone of the encoders. We innovatively propose a TSF strategy in our network. As shown in Fig. 2, in the first stage, we hierarchically fuse the corresponding thermal and RGB feature maps through elementwise summation in the RGB encoder. Inspired by [35], the fused feature maps, except the bottom one, are then fused again in the second stage with the corresponding feature maps in the decoder through tensor concatenation. The bottom one is directly copied to the decoder instead of being concatenated. With our TSF strategy, the loss of spatial information caused by the intensive downsampling can be recovered.
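The data flow of the TSF strategy can be sketched as follows. For clarity, the sketch fuses feature maps that have already been computed by both encoders; in the actual network, the element-wise summation happens inside the RGB encoder so that the fused features feed the next RGB stage. The function and argument names are illustrative and not taken from the released code.

```python
import torch

def tsf_forward(rgb_feats, thermal_feats, decoder_stages, upsamplers):
    """Two-stage fusion sketch.
    rgb_feats, thermal_feats: lists of per-level encoder feature maps
        (highest resolution first, bottleneck last).
    decoder_stages: decoder feature extractors, bottleneck level first.
    upsamplers: decoder transposed-convolution upsamplers, bottleneck level first.
    """
    # Stage 1: hierarchical element-wise summation along the RGB encoder path.
    fused = [r + t for r, t in zip(rgb_feats, thermal_feats)]

    # The bottom (lowest-resolution) fused map is copied directly to the decoder.
    x = fused[-1]
    # Stage 2: after each upsampling step, concatenate with the fused map of the
    # matching resolution, then extract features from the concatenated tensor.
    for skip, extract, up in zip(reversed(fused[:-1]), decoder_stages, upsamplers):
        x = up(x)                        # restore the resolution by a factor of 2
        x = torch.cat([x, skip], dim=1)  # the channel count doubles here
        x = extract(x)                   # two convolutional layers in the paper's decoder
    return x
```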
B. Encoders

The RGB and thermal encoders are designed with the same structure except for the input dimension, because the input data have different numbers of channels. As aforementioned, we use DenseNet as the backbone. We first delete the classification layer in DenseNet to avoid excessive loss of spatial information. Then, we add a transition layer, similar to the other transition layers, after the fourth dense block. The dense blocks in DenseNet keep the feature map resolution unchanged, whereas the initial block, the max-pooling layer, and the transition layers reduce the feature map resolution by a factor of 2. Note that the feature map resolution has been reduced to 15 × 20 (given the input resolution of 480 × 640) before the final transition layer. Because we disable the ceiling mode of the average pooling operation in the final transition layer, the feature map resolution after the final transition layer is reduced to 7 × 10 (not 8 × 10). There are four architectures for DenseNet: DenseNet-121, DenseNet-169, DenseNet-201, and DenseNet-161. The complexity increases from DenseNet-121 to DenseNet-161. DenseNet-161 possesses the largest number of parameters because it uses the largest growth rate of 48, whereas the others share a growth rate of 32. We refer readers to [47] for the details of DenseNet. Our FuseSeg follows the same naming rule as DenseNet. The number of channels cn in Fig. 2 varies with different DenseNet architectures. The detailed numbers are listed in Table I.
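A minimal sketch of how such an encoder could be assembled from torchvision's DenseNet-161 is given below. The extra transition layer is built here from standard layers and the one-channel thermal stem simply replaces the first convolution; the exact layer configuration and channel numbers should be taken from the paper and Table I rather than from this code.

```python
import torch.nn as nn
import torchvision

def make_encoder(in_channels: int = 3, pretrained: bool = True) -> nn.Module:
    """Sketch of a FuseSeg-style encoder: DenseNet-161 features without the
    classifier, plus an extra transition layer after the fourth dense block."""
    densenet = torchvision.models.densenet161(pretrained=pretrained)  # newer torchvision: weights=...
    features = densenet.features  # conv0 ... denseblock4, norm5 (the classifier is dropped)

    if in_channels != 3:
        # The thermal input has a single channel, so the first convolution is replaced.
        features.conv0 = nn.Conv2d(in_channels, 96, kernel_size=7, stride=2,
                                   padding=3, bias=False)

    num_features = densenet.classifier.in_features  # 2208 for DenseNet-161
    extra_transition = nn.Sequential(
        nn.ReLU(inplace=True),
        nn.Conv2d(num_features, num_features // 2, kernel_size=1, bias=False),
        # ceil_mode=False so that 15 x 20 becomes 7 x 10, as described above.
        nn.AvgPool2d(kernel_size=2, stride=2, ceil_mode=False),
    )
    return nn.Sequential(features, extra_transition)

rgb_encoder = make_encoder(3)
thermal_encoder = make_encoder(1)  # conv0 is re-initialized for the single channel
```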
C. Decoder

The decoder is designed to gradually restore the feature map resolution to the original. We design a decoder that mainly consists of three modules: a feature extractor that sequentially contains two convolutional layers, an upsampler, and an out block, the latter two each containing one transposed convolutional layer. Note that the convolutional and transposed convolutional layers in the feature extractor and the upsampler are each followed by a batch normalization layer and a ReLU activation layer. The detailed configurations of the convolutional and transposed convolutional layers are displayed in Table II.
TABLE I
Number of Feature Map Channels cn at Different Levels According to Different DenseNet Architectures

TABLE II
Configurations for the Convolutional (Conv) and Transposed Convolutional (Trans Conv) Layers in the Individual Modules of the Decoder

The feature extractor is employed to extract features from the fused feature maps. It keeps the resolution of the feature maps unchanged. Both the upsampler and the out block increase the resolution by a factor of 2. The out block outputs the final prediction results with a channel number of 9, which is the number of classes. We add a softmax layer after the output to get the probability map for the segmentation results. As aforementioned, the feature map resolution is 7 × 10 at the end of the encoder. To restore it to 15 × 20, we employ a padding technique at this level of the upsampler. The upsampled feature map is concatenated with the one from the RGB encoder during the second stage of our TSF. The number of feature channels doubles after the concatenation.
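The three decoder modules described above can be sketched as follows; the kernel sizes, strides, and channel widths are placeholders for the values in Table II, which is not reproduced here.

```python
import torch.nn as nn

def feature_extractor(in_ch, out_ch, k=3):
    # Feature extractor: two convolutions, each followed by BN and ReLU;
    # the spatial resolution is kept unchanged.
    def conv(i, o):
        return nn.Sequential(nn.Conv2d(i, o, k, padding=k // 2, bias=False),
                             nn.BatchNorm2d(o), nn.ReLU(inplace=True))
    return nn.Sequential(conv(in_ch, out_ch), conv(out_ch, out_ch))

def upsampler(in_ch, out_ch, output_padding=0):
    # Upsampler: one transposed convolution (plus BN and ReLU) that doubles the
    # resolution. At the 7 x 10 level, output_padding=(1, 0) is one way to obtain
    # 15 x 20 instead of 14 x 20 (the "padding technique" mentioned above).
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2,
                           output_padding=output_padding, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

# Out block: one transposed convolution mapping to the nine classes; a softmax
# layer turns the output into the probability map. The 64 input channels are a placeholder.
out_block = nn.Sequential(nn.ConvTranspose2d(64, 9, kernel_size=2, stride=2),
                          nn.Softmax(dim=1))
```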
IV. EXPERIMENTAL SETUP

A. Data Set

In this article, we use the public data set released by Ha et al. [18]. It was recorded in urban street scenes and contains common objects: car, person, bike, curve (road lanes), car stop, guardrail, color cone, and bump. The images are captured at the 480 × 640 resolution by an InfReC R500 camera, which can provide RGB and thermal images simultaneously. There are 1569 registered RGB and thermal images in the data set, among which 749 are taken at nighttime and 820 are taken at daytime. The data set is provided with hand-labeled pixel-wise ground truth, including the aforementioned eight classes of common objects and one unlabeled background class.

B. Training Details

We train the networks on a PC with an Intel i7 CPU and an NVIDIA 1080 Ti graphics card with 11 GB of graphics memory. We accordingly adjust the batch sizes for the networks to fit the graphics memory. We employ the data set splitting scheme used in [18]. The training set consists of 50% of the daytime images and 50% of the nighttime images, whereas the validation and test sets each consist of 25% of the daytime images and 25% of the nighttime images. The training set is augmented with the flip technique. Our FuseSeg is implemented with PyTorch. The convolutional and transposed convolutional layers in the decoder are initialized using the Xavier scheme [52]. The encoder layers are initialized using the pretrained weights provided by PyTorch. We use the stochastic gradient descent (SGD) optimization solver and the cross-entropy loss for training. The learning rate is decayed exponentially. The networks are trained until no further decrease in the loss is observed.
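A minimal training-loop sketch matching the stated setup (Xavier-initialized decoder, pretrained encoders, SGD with cross-entropy loss, exponentially decayed learning rate) is given below; the batch handling, learning rate, momentum, decay factor, and the `model.decoder` attribute are assumptions, since the paper does not list these values here.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=100, lr=0.01, gamma=0.95, device="cuda"):
    # Xavier initialization for the (transposed) convolutions of the decoder;
    # the encoders keep their ImageNet-pretrained weights.
    for m in model.decoder.modules():
        if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
            nn.init.xavier_uniform_(m.weight)

    model.to(device)
    criterion = nn.CrossEntropyLoss()  # expects raw logits and integer class labels
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

    for epoch in range(epochs):
        for rgb, thermal, label in train_loader:
            rgb, thermal, label = rgb.to(device), thermal.to(device), label.to(device)
            logits = model(rgb, thermal)        # (B, 9, 480, 640)
            loss = criterion(logits, label)     # label: (B, 480, 640) class indices
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                        # exponential learning-rate decay
```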
C. Evaluation Metrics

For the quantitative evaluation, we use the same metrics as [46]: Accuracy (Acc) and intersection over union (IoU). Let Acc_i and IoU_i denote the Acc and IoU for class i. They are computed as

\[
\mathrm{Acc}_i = \frac{\sum_{k=1}^{K}\theta_{ii}^{k}}{\sum_{k=1}^{K}\theta_{ii}^{k} + \sum_{k=1}^{K}\sum_{j=1,\,j\neq i}^{N}\theta_{ij}^{k}} \qquad (1)
\]

\[
\mathrm{IoU}_i = \frac{\sum_{k=1}^{K}\theta_{ii}^{k}}{\sum_{k=1}^{K}\theta_{ii}^{k} + \sum_{k=1}^{K}\sum_{j=1,\,j\neq i}^{N}\theta_{ji}^{k} + \sum_{k=1}^{K}\sum_{j=1,\,j\neq i}^{N}\theta_{ij}^{k}} \qquad (2)
\]

where θ_ii^k, θ_ij^k, and θ_ji^k represent, in image k, the number of pixels of class i that are correctly classified as class i, the number of pixels of class i that are wrongly classified as class j, and the number of pixels of class j that are wrongly classified as class i, respectively. K and N represent the number of test images and the number of classes, respectively. N = 9 in this article. We use mAcc and mIoU to represent the arithmetic average values of Acc and IoU across the nine classes.
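The two metrics can be computed by accumulating a single confusion matrix over all test images, which is a direct transcription of (1) and (2):

```python
import numpy as np

N_CLASSES = 9

def accumulate_confusion(conf, pred, gt):
    """Add one image's pixel counts to the N x N confusion matrix.
    conf[i, j] accumulates pixels of ground-truth class i predicted as class j."""
    idx = gt.astype(np.int64) * N_CLASSES + pred.astype(np.int64)
    conf += np.bincount(idx.ravel(), minlength=N_CLASSES ** 2).reshape(N_CLASSES, N_CLASSES)
    return conf

def acc_iou(conf):
    tp = np.diag(conf).astype(np.float64)       # sum_k theta_ii^k
    false_neg = conf.sum(axis=1) - tp           # sum_k sum_{j != i} theta_ij^k
    false_pos = conf.sum(axis=0) - tp           # sum_k sum_{j != i} theta_ji^k
    acc = tp / (tp + false_neg)                 # Eq. (1), per class
    iou = tp / (tp + false_pos + false_neg)     # Eq. (2), per class
    return acc, iou, acc.mean(), iou.mean()     # per-class values, mAcc, mIoU
```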
V. ABLATION STUDY

A. Ablation for Encoders

1) Encoder Backbone: Since ResNet [53], WideResNet [54], ResNeXt [55], and HourglassNet [56] have similar structures to DenseNet, we replace DenseNet with these networks and compare their performance with ours. The quantitative results are listed in Table III. As we can see, using DenseNet-161 achieves the best performance, which confirms the effectiveness of our choice.

2) Single-Modal Performance: We delete the thermal encoder of FuseSeg to see the performance without using the thermal information. We name this variant no thermal encoder (NTE). Similarly, we delete the RGB encoder to see how the network performs given only the thermal information. This variant is termed no RGB encoder (NRE). In these two variants, the first-stage fusion of our TSF strategy is canceled, since there is only one encoder in the networks. We display the results with respect to different DenseNet architectures in Table IV. We can see that all the networks using DenseNet-161 gain more accuracy than the others. The superior performance is expected because DenseNet-161 presents the best image classification performance among the four DenseNet architectures.
TABLE III
Results (%) of Ablation Study for Encoders Using Different Backbones on the Test Set. We Use DenseNet-161 in Our Network. Bold Font Highlights the Best Results

TABLE IV
Results (%) of Ablation Study for Encoders on the Test Set. Ours Displays the Results of Our FuseSeg. Bold Font Highlights the Best Results

TABLE V
Results (%) of Ablation Study for Fusion Strategy on the Test Set. All the Variants Use the DenseNet-161 as the Encoder Backbone. Ours Displays the Results of Our FuseSeg-161. Bold Font Highlights the Best Results
TABLE VI
Results (%) of Ablation Study for the Decoder on the Test Set. All the Variants Use the DenseNet-161 as the Encoder Backbone. Ours Displays the Results of Our FuseSeg-161. Bold Font Highlights the Best Results

Fig. 3. Detailed structures for TPC and THC. The figure is best viewed in color.
TABLE VII
Comparative Results (%) on the Test Set. 3c and 4c Represent That the Networks Are Tested With the Three-Channel RGB Data and Four-Channel RGB-Thermal Data, Respectively. Note That the mAcc and mIoU Are Calculated With the Unlabeled Class, but the Results for the Unlabeled Class Are Not Displayed. The Bold Font Highlights the Best Result in Each Column
Fig. 4. Qualitative demonstrations for the fusion networks in typical daytime and nighttime scenarios, which are shown in the left four and right four
columns, respectively. We can see that our network can provide acceptable results in various lighting conditions. The comparative results demonstrate our
superiority. The figure is best viewed in color.
… the encoder and the decoder. The detailed spatial information can be retained through the short connections at each level. This can also explain the unsatisfactory performance of FuseNet, because both FuseNet and RTFNet have no such short connections. Note that although MFNet has such connections, it still presents inferior performance compared with FuseSeg. We think that the reason may stem from the tiny and weak decoder of MFNet, in which only one convolutional layer is contained at each stage. In addition, they use concatenation for the encoder fusion and summation for the encoder–decoder fusion, which we have proven inferior to our fusion strategy (see the CSF variant in the ablation study). DepthAwareCNN assumes that the pixels on the same object share similar depth (replaced by temperature) values. However, this assumption is violated here, which may explain its inferiority. For example, the temperature of the car in the fifth column is not distributed evenly, so the car cannot be completely segmented.

E. Uncertainty Estimation

Estimating uncertainty for semantic segmentation can help to know how much the predictions could be trusted. It is an important capability to ensure safe decision-making for autonomous vehicles. MC dropout has been successfully employed to infer posterior distributions for the model parameters of Bayesian networks. This article adopts the MC dropout technique for uncertainty estimation [29]. We construct the Bayesian FuseSeg by inserting dropout layers after the initial blocks, max-pooling layers, and No. 1–4 transition layers of the RGB and thermal encoders. During runtime, we sample the model T times, and here, we set T = 50. The uncertainty ζ for each pixel is calculated by

\[
\zeta = -\frac{1}{N}\sum_{n=1}^{N} p(l_n \mid I, \theta)\,\log p(l_n \mid I, \theta) \qquad (3)
\]

where I, θ, and l_n represent the input image, the network parameters, and the label for the nth class, respectively, N is the number of classes (N = 9), and p(·) is the average softmax output of the network for each pixel over the T samples. The uncertainty ζ is the entropy that measures the disorder of the class probabilities at a pixel [58]. Large entropy means large disorder and, hence, large uncertainty.
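A sketch of the MC-dropout uncertainty computation in (3) is given below, assuming the Bayesian variant keeps its dropout layers active at test time and that its output has already passed through the softmax layer; the model name and call signature are placeholders.

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(bayesian_model, rgb, thermal, T=50, eps=1e-12):
    """Per-pixel predictive entropy, Eq. (3), from T stochastic forward passes."""
    bayesian_model.eval()
    # MC dropout: keep the dropout layers stochastic at inference time.
    for m in bayesian_model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()

    # p(.) in Eq. (3): the class-probability map averaged over the T sampled models.
    probs = sum(bayesian_model(rgb, thermal) for _ in range(T)) / T

    n_classes = probs.size(1)            # N = 9
    zeta = -(probs * torch.log(probs + eps)).sum(dim=1) / n_classes
    return zeta                          # (B, H, W) uncertainty map
```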
Fig. 5. Uncertainty maps of the Bayesian FuseSeg-161 for the results shown in Fig. 4. The first and second rows are with dropout rates of 10^{-4} and 10^{-2}, respectively. Uncertainties increase from blue to red. The figure is best viewed in color.
… modality of information. Therefore, recording a new data set together with an RGB-D camera and a thermal imaging camera, and fusing RGB, thermal, as well as depth images in a network to improve the segmentation performance is also a research direction.

REFERENCES

[1] D. Li and H. Gao, "A hardware platform framework for an intelligent vehicle based on a driving brain," Engineering, vol. 4, no. 4, pp. 464–470, Aug. 2018.
[2] P. Cai, X. Mei, L. Tai, Y. Sun, and M. Liu, "High-speed autonomous drifting with deep reinforcement learning," IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 1247–1254, Apr. 2020.
[3] H. Wang, Y. Sun, and M. Liu, "Self-supervised drivable area and road anomaly segmentation using RGB-D data for robotic wheelchairs," IEEE Robot. Autom. Lett., vol. 4, no. 4, pp. 4386–4393, Oct. 2019.
[4] P. Cai, Y. Sun, Y. Chen, and M. Liu, "Vision-based trajectory planning via imitation learning for autonomous vehicles," in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 2736–2742.
[5] H. Chen, C. Xue, S. Liu, Y. Sun, and Y. Chen, "Multiple-object tracking based on monocular camera and 3-D lidar fusion for autonomous vehicles," in Proc. IEEE Int. Conf. Robot. Biomimetics (ROBIO), Dec. 2019, pp. 456–460.
[6] X. Wu, Z. Li, Z. Kan, and H. Gao, "Reference trajectory reshaping optimization and control of robotic exoskeletons for human-robot co-manipulation," IEEE Trans. Cybern., early access, Aug. 30, 2019, doi: 10.1109/TCYB.2019.2933019.
[7] Z. Li, B. Huang, A. Ajoudani, C. Yang, C.-Y. Su, and A. Bicchi, "Asymmetric bimanual control of dual-arm exoskeletons for human-cooperative manipulations," IEEE Trans. Robot., vol. 34, no. 1, pp. 264–271, Feb. 2018.
[8] Y. Sun, W. Zuo, and M. Liu, "See the future: A semantic segmentation network predicting ego-vehicle trajectory with a single monocular camera," IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 3066–3073, Apr. 2020.
[9] Y. Sun, L. Wang, Y. Chen, and M. Liu, "Accurate lane detection with atrous convolution and spatial pyramid pooling for autonomous driving," in Proc. IEEE Int. Conf. Robot. Biomimetics (ROBIO), Dec. 2019, pp. 642–647.
[10] A. Zaganidis, L. Sun, T. Duckett, and G. Cielniak, "Integrating deep semantic segmentation into 3-D point cloud registration," IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 2942–2949, Oct. 2018.
[11] Z. Min, H. Ren, and M. Q.-H. Meng, "Statistical model of total target registration error in image-guided surgery," IEEE Trans. Autom. Sci. Eng., vol. 17, no. 1, pp. 151–165, Jan. 2020.
[12] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robot. Auto. Syst., vol. 89, pp. 110–122, Mar. 2017.
[13] W. S. Grant, R. C. Voorhies, and L. Itti, "Efficient velodyne SLAM with point and plane features," Auto. Robots, vol. 43, no. 5, pp. 1207–1224, Jun. 2019.
[14] J. Cheng, Y. Sun, and M. Q.-H. Meng, "Robust semantic mapping in challenging environments," Robotica, vol. 38, no. 2, pp. 256–270, Feb. 2020.
[15] H. Huang, Y. Sun, H. Ye, and M. Liu, "Metric monocular localization using signed distance fields," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 1195–1201.
[16] J. Cheng, Y. Sun, and M. Q.-H. Meng, "Improving monocular visual SLAM in dynamic environments: An optical-flow-based approach," Adv. Robot., vol. 33, no. 12, pp. 576–589, Jun. 2019.
[17] Y. Sun, M. Liu, and M. Q.-H. Meng, "Motion removal for reliable RGB-D SLAM in dynamic environments," Robot. Auto. Syst., vol. 108, pp. 115–128, Oct. 2018.
[18] Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, and T. Harada, "MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2017, pp. 5108–5115.
[19] K. Sun et al., "High-resolution representations for labeling pixels and regions," 2019, arXiv:1904.04514. [Online]. Available: https://arxiv.org/abs/1904.04514
[20] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "Learning a discriminative feature network for semantic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1857–1866.
[21] M. Vollmer et al., Infrared Thermal Imaging: Fundamentals, Research and Applications. Berlin, Germany: Wiley, 2017.
[22] X. Sun, H. Ma, Y. Sun, and M. Liu, "A novel point cloud compression algorithm based on clustering," IEEE Robot. Autom. Lett., vol. 4, no. 2, pp. 2132–2139, Apr. 2019.
[23] P. Yun, L. Tai, Y. Wang, C. Liu, and M. Liu, "Focal loss in 3D object detection," IEEE Robot. Autom. Lett., vol. 4, no. 2, pp. 1263–1270, Apr. 2019.
[24] S. Wang, Y. Sun, C. Liu, and M. Liu, "PointTrackNet: An end-to-end network for 3-D object detection and tracking from point clouds," IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 3206–3212, Apr. 2020.
[25] F. Wu, B. He, L. Zhang, S. Chen, and J. Zhang, "Vision-and-Lidar based real-time outdoor localization for unmanned ground vehicles without GPS," in Proc. IEEE Int. Conf. Inf. Autom. (ICIA), Aug. 2018, pp. 232–237.
[26] H. Gao, B. Cheng, J. Wang, K. Li, J. Zhao, and D. Li, "Object classification using CNN-based fusion of vision and LIDAR in autonomous vehicle environment," IEEE Trans. Ind. Informat., vol. 14, no. 9, pp. 4224–4231, Sep. 2018.
[27] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum PointNets for 3D object detection from RGB-D data," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 918–927.
[28] I. Bloch, "Information combination operators for data fusion: A comparative review with classification," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 26, no. 1, pp. 52–67, 1996.
[29] A. Kendall, V. Badrinarayanan, and R. Cipolla, "Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding," 2015, arXiv:1511.02680. [Online]. Available: https://arxiv.org/abs/1511.02680
[30] S. Song, S. P. Lichtenberg, and J. Xiao, "SUN RGB-D: A RGB-D scene understanding benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 567–576.
[31] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, Apr. 2017.
[32] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[33] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1520–1528.
[34] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[35] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI). Cham, Switzerland: Springer, 2015, pp. 234–241.
[36] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "ENet: A deep neural network architecture for real-time semantic segmentation," 2017, arXiv:1606.02147. [Online]. Available: https://arxiv.org/abs/1606.02147
[37] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6230–6239.
[38] P. Wang et al., "Understanding convolution for semantic segmentation," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2018, pp. 1451–1460.
[39] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, "Full-resolution residual networks for semantic segmentation in street scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3309–3318.
[40] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, "ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1, pp. 263–272, Jan. 2018.
[41] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "BiSeNet: Bilateral segmentation network for real-time semantic segmentation," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 325–341.
[42] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5693–5703.
[43] Y. Sun, M. Liu, and M. Q.-H. Meng, "Active perception for foreground segmentation: An RGB-D data-based background modeling method," IEEE Trans. Autom. Sci. Eng., vol. 16, no. 4, pp. 1596–1609, Oct. 2019.
[44] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, "FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture," in Computer Vision—ACCV. Cham, Switzerland: Springer, 2017, pp. 213–228.
[45] W. Wang and U. Neumann, "Depth-aware CNN for RGB-D segmentation," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 135–150.
[46] Y. Sun, W. Zuo, and M. Liu, "RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes," IEEE Robot. Autom. Lett., vol. 4, no. 3, pp. 2576–2583, Jul. 2019.
[47] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[48] T. T. D. Pham, S. Kim, Y. Lu, S.-W. Jung, and C.-S. Won, "Facial action units-based image retrieval for facial expression recognition," IEEE Access, vol. 7, pp. 5200–5207, 2019.
[49] S. Wang, B. Pan, H. Chen, and Q. Ji, "Thermal augmented expression recognition," IEEE Trans. Cybern., vol. 48, no. 7, pp. 2203–2214, Jul. 2018.
[50] J. S. Yoon et al., "Thermal-infrared based drivable region detection," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2016, pp. 978–985.
[51] M. Knapik and B. Cyganek, "Driver's fatigue recognition based on yawn detection in thermal images," Neurocomputing, vol. 338, pp. 274–292, Apr. 2019.
[52] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. 13th Int. Conf. Artif. Intell. Statist., Mar. 2010, pp. 249–256.
[53] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[54] S. Zagoruyko and N. Komodakis, "Wide residual networks," 2016, arXiv:1605.07146. [Online]. Available: https://arxiv.org/abs/1605.07146
[55] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1492–1500.
[56] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in Proc. Eur. Conf. Comput. Vis. Springer, 2016, pp. 483–499.
[57] M. Modasshir, A. Quattrini Li, and I. Rekleitis, "Deep neural networks: A comparison on different computing platforms," in Proc. 15th Conf. Comput. Robot Vis. (CRV), May 2018, pp. 383–389.
[58] P.-Y. Huang, W.-T. Hsu, C.-Y. Chiu, T.-F. Wu, and M. Sun, "Efficient uncertainty estimation for semantic segmentation in videos," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 520–535.

Yuxiang Sun (Member, IEEE) received the bachelor's degree from the Hefei University of Technology, Hefei, China, in 2009, the master's degree from the University of Science and Technology of China, Hefei, in 2012, and the Ph.D. degree from The Chinese University of Hong Kong, Hong Kong, in 2017. He is currently a Research Associate with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong. His current research interests include autonomous driving, deep learning, robotics and autonomous systems, and semantic scene understanding. Dr. Sun was a recipient of the Best Paper Award in Robotics at the IEEE ROBIO 2019 and the Best Student Paper Finalist Award at the IEEE ROBIO 2015.

Weixun Zuo received the bachelor's degree from Anhui University, Hefei, China, in 2016, and the master's degree from The Hong Kong University of Science and Technology, Hong Kong, in 2017. He is currently a Research Assistant with the Department of Electronic and Computer Engineering, Robotics Institute, The Hong Kong University of Science and Technology. His current research interests include mobile robots, semantic segmentation, deep learning, and autonomous vehicles.

Peng Yun received the B.Sc. degree from the Huazhong University of Science and Technology, Wuhan, China, in 2017. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong. His current research interests include computer vision, machine learning, and autonomous driving.

Hengli Wang received the B.E. degree from Zhejiang University, Hangzhou, China, in 2018. He is currently pursuing the Ph.D. degree with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong. His current research interests include robot navigation, autonomous driving, computer vision, and deep learning.

Ming Liu (Senior Member, IEEE) received the B.A. degree from Tongji University, Shanghai, China, in 2005, and the Ph.D. degree from ETH Zürich, Zürich, Switzerland, in 2013. He stayed one year at the University of Erlangen-Nuremberg, Erlangen, Germany, and the Fraunhofer Institute IISB, Erlangen, as a Visiting Scholar. He is involved in several NSF projects and National 863-Hi-Tech-Plan projects in China. He is a Principal Investigator of over 20 projects, including projects funded by RGC, NSFC, ITC, SZSTI, and so on. His current research interests include dynamic environment modeling, 3-D mapping, machine learning, and visual control. Dr. Liu was the General Chair of ICVS 2017, the Program Chair of the IEEE RCAR 2016, and the Program Chair of the International Robotic Alliance Conference 2017.