
Pattern Recognition 154 (2024) 110601

Contents lists available at ScienceDirect

Pattern Recognition
journal homepage: www.elsevier.com/locate/pr

Efficient multi-task progressive learning for semantic segmentation and disparity estimation

Hanz Cuevas-Velasquez a, Alejandro Galán-Cuenca b, Robert B. Fisher c, Antonio Javier Gallego b,∗

a Max Planck Institute for Intelligent Systems, 72076, Max Planck Ring 4, Tuebingen, Germany
b University Institute for Computing Research, University of Alicante, E-03690, San Vicente del Raspeig, Alicante, Spain
c School of Informatics, The University of Edinburgh, Edinburgh, EH8 9AB, UK

ARTICLE INFO

Keywords: Computer vision; Stereo vision; Semantic segmentation; Joint learning; 3D modeling; Multi-task; Disparity estimation

ABSTRACT

Scene understanding is an important area in robotics and autonomous driving. To accomplish these tasks, the 3D structures in the scene have to be inferred to know what the objects and their locations are. To this end, semantic segmentation and disparity estimation networks are typically used, but running them individually is inefficient since they require high-performance resources. A possible solution is to learn both tasks together using a multi-task approach. Some current methods address this problem by learning semantic segmentation and monocular depth together. However, monocular depth estimation from single images is an ill-posed problem. A better solution is to estimate the disparity between two stereo images and take advantage of this additional information to improve the segmentation. This work proposes an efficient multi-task method that jointly learns disparity and semantic segmentation. Employing a Siamese backbone architecture for multi-scale feature extraction, the method integrates specialized branches for disparity estimation and coarse and refined segmentations, leveraging progressive task-specific feature sharing and attention mechanisms to enhance accuracy for solving both tasks concurrently. The proposal achieves state-of-the-art results for joint segmentation and disparity estimation on three distinct datasets: Cityscapes, TrimBot2020 Garden, and S-ROSeS, using only 1∕3 of the parameters of previous approaches.

1. Introduction

Disparity estimation and semantic segmentation are fundamental problems in computer vision. The goal of disparity estimation is to find the pixel correspondences from a rectified pair of stereo images. On the other hand, semantic segmentation assigns class labels to each pixel in the image. Both tasks, individually, have been intensively investigated [1,2]. Currently, solutions based on Deep Learning and, more specifically, on Convolutional Neural Networks (CNN), are the predominant approaches for both tasks due to their good performance [3,4].

Both disparity estimation and semantic segmentation are heavily used in scene understanding, autonomous driving, and robotics [5,6]. However, running state-of-the-art methods from both tasks independently [7,8] essentially doubles the computational load, which may be prohibitive for some applications [9,10]. Therefore, a method that combines both tasks into a single model can potentially reduce the computational burden, and thus allow it to be embedded in portable devices or executed in real-time. This type of approach is called multi-task learning [11], an area that is almost unexplored in the field of joint segmentation and disparity estimation [12,13], unlike segmentation and monocular depth estimation [13–15].

In the literature, many methods for disparity estimation and segmentation can be found [2]. However, compared to the extensive literature that addresses these tasks separately (which will be reviewed in the next section), there are few works about solving them simultaneously using data-driven techniques [12,13,15]. This is mainly due to two reasons: segmenting an image and finding the disparity map are complex tasks per se that require many training examples to learn meaningful representations. This problem is worsened by the scarcity of datasets composed of stereo pairs that have both depth and semantic ground truths, with which to train multi-task proposals.

This paper presents a multi-task approach for disparity estimation and semantic segmentation that takes advantage of the information extracted from the different tasks and combines them progressively at the feature level using self-attention in order to improve the final results and make the training process more efficient. As will be seen in the Experiments section (Section 5), learning both tasks simultaneously

∗ Corresponding author.
E-mail addresses: [email protected] (H. Cuevas-Velasquez), [email protected] (A. Galán-Cuenca), [email protected] (R.B. Fisher),
[email protected] (A.J. Gallego).

https://doi.org/10.1016/j.patcog.2024.110601
Received 24 August 2023; Received in revised form 27 March 2024; Accepted 13 May 2024
Available online 17 May 2024
0031-3203/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Fig. 1. Overview of the proposed architecture. Stereo pair features are extracted from a Siamese backbone structure and used to feed three specialized branches for a multi-scale
and multi-task prediction. The first branch generates a coarse segmentation using features from the deepest backbone layer, enriched by high-level image features to enhance
context. The second branch computes the disparity map by correlating features from an earlier backbone layer, incorporating coarse segmentation features (𝐹𝑠𝑒𝑔 ) and Spatial Pyramid
Pooling for broader feature correlation. The third branch refines the segmentation outcome by leveraging both disparity and segmentation features (𝐹𝑠𝑒𝑔 and 𝐹𝑑𝑖𝑠𝑝 ), employing
attention mechanisms to focus processing on relevant features.

allows using additional information from the stereo pair to improve the segmentation and, at the same time, using the segmentation to improve the calculation of disparity.

The proposed method uses a Siamese architecture [16] of backbones to extract feature information at different scales from the left and right input images (see Fig. 1). From these features, three specialized branches are connected to carry out the calculation of disparity and segmentation. The first branch extracts a coarse segmentation using low-level features from the backbone, optimized by a 1 × 1 convolution to reduce parameter depth and hourglass blocks [17] for enhanced context. The second branch calculates the disparity map through feature correlation between stereo images, task-specific features from the coarse segmentation, and the use of Spatial Pyramid Pooling [18] for comprehensive context. The third branch obtains the refined final segmentation by combining high-level features from the backbone and the output of the two previous branches. This last branch employs a self-attention mechanism [19] to focus on the most relevant features from the coarse segmentation and the disparity map. As will be shown, this progressive multi-scale learning allows the segmentation and the disparity estimation to refine each other.

In summary, the main contributions of this paper are:

• An end-to-end novel architecture that successfully learns the semantic segmentation and disparity map together from an input stereo pair. To achieve this, it extracts task-specific features that are progressively shared between the two tasks.
• State-of-the-art results for multi-task joint learning of disparity and segmentation, and with competitive results against other methods that solve each task individually.
• A network capable of solving both tasks simultaneously using less than 1∕3 of the parameters that previous works use to solve only one of these tasks. This reduced number of parameters is useful for systems that have limited resources like robots and autonomous vehicles.

The rest of the paper is structured as follows: Section 2 reviews the related literature to contextualize the work; Section 3 presents the proposed methodology; Section 4 details the experimental setup considered; Section 5 shows the results obtained with the three datasets considered and makes a comparison with the state of the art; Section 6 analyzes in detail the results obtained, and, finally, Section 7 concludes the work by summarizing the main insights obtained and proposing future research lines to address.

2. Related work

As previously discussed, many methods for disparity estimation and image segmentation can be found in the literature [1,2]. The general approach—in both cases—is the use of an encoder–decoder CNN architecture, where the encoder part extracts feature information by reducing (encoding) the dimension of the input image, and the decoder part uses these intermediate encoded representations to predict the disparity map or the semantic segmentation.

For disparity estimation, two main approaches are currently followed in the state of the art. The first one treats disparity estimation as a regression problem, using solely 2D convolutions [20]. The second approach obtains a cost volume by extracting 3D features and discretizing the disparity [21,22]. A complete review on this topic can be found in Zhou et al. [23].

For semantic segmentation, the standard approach is to use a pre-trained network as a backbone (encoder) and add specialized layers (decoder) to obtain the segmentation [24]. Long et al. [25] was one of the first successful works using this approach. Later, Deeplab networks introduced Atrous convolutions and greatly improved the results with different datasets like Cityscapes [26–28]. Currently, the state of the art usually employs Deeplab or similar architectures for the encoder part and follows two possible approaches to perform segmentation in the decoder part. One of these is the use of multi-scale images as input in order to optimize the network so it can extract information


at different scales, which improves the generality and precision of the results [8,29,30]. The other common solution is the use of attention layers, which have been popularized by large language and text-to-image models [31,32], to capture more context with the relation between the pixels and the classes of the objects to be segmented. This information is usually concatenated to the network or used to weight certain layers [33].

Compared to the number of proposals that address these topics separately, there are only a few works focused on their simultaneous solution [12,13,15]. However, as previously argued, for certain applications it can be prohibitive to run two separate methods of this kind in parallel, and assuming unlimited resources is not a realistic or practical approach. Therefore, in these cases, a joint solution that optimizes resources may be of great interest.

One of the main reasons why these tasks are usually approached individually is because of their complexity, which makes it very difficult to overcome the state of the art by following a joint approach. Also, training an end-to-end architecture of this kind requires many ground-truth samples to learn meaningful and generalizable representations. This last requirement worsens the problem since there are very few stereo pair datasets that have both ground truths (i.e., both semantic segmentation and disparity values). Among the few existing datasets with these characteristics, we can find Cityscapes [34], KITTI [35], TrimBot2020 Garden [36], and S-ROSeS [9].

This scarcity of data means that most of the multi-task proposals for this purpose are based on monocular depth estimation [13–15,37], which is an ill-posed problem from the geometry perspective [38]. A better solution is to calculate the disparity between two stereo images and take advantage of this information to improve the segmentation [12]. The approaches based on these two paradigms are described in detail below.

Kendall et al. [13] proposed a multi-task likelihood to improve the joint learning of monocular disparity, semantic segmentation, and instance segmentation. They used Deeplab v3 [28] as the backbone and three separate branches as decoders, one for each task. Zhang et al. [14] also combines feature information from a backbone to estimate monocular depth and segmentation. In this case, they performed a sequence of task-level interactions that evolve along a coarse-to-fine scale space so that the required features can be progressively reconstructed. The approach by Nekrasov et al. [15] follows a similar structure for the prediction of monocular depth and segmentation. They introduced a teacher–student approach to be able to learn both tasks when one of them does not have ground truth. Xu et al. [37] proposed PAD-Net, a multi-task guided prediction-and-distillation network that leverages a series of intermediate auxiliary tasks, ranging from low to high-level, whose predictions are then distilled to enhance the performance of the final tasks, demonstrating effectiveness across challenging datasets. Yang et al. [12] is the only approach that exploits the information of a stereo pair. It first uses a segmentation network as a backbone to extract features from the left and right images of the stereo input and then takes advantage of this information to calculate the disparity and also to refine the segmentation.

All of these approaches demonstrate that combining features from different domains can help to improve similar tasks. However, they do not fully exploit the additional information that a task-specific decoder can learn to improve the other tasks (and vice-versa, that is, combine the information obtained from those other tasks to improve the task solved by the decoder). For example, Yang et al. [12] first trains one task and then freezes part of the network and builds on these learned features to solve the other task. According to Baxter [39], learning multiple tasks simultaneously within an environment of related tasks can potentially give much better generalization than learning a single task. This is because models can take advantage of commonalities and differences between tasks.

Our proposal pursues this motivation to learn the segmentation and disparity jointly. Similar to Zhang et al. [14], our method uses a refined progressive learning approach rather than concatenating the disparity and segmentation features only once. While they focus on multi-scale refinement for both tasks at the same time, which yields good predictions but is computationally expensive, our method predicts one task per scale. Another main difference with these methods is that our approach employs stereo pairs to obtain the disparity map instead of using a single image, thus avoiding the ill-posed problem and giving more context to the network which can help not only with the calculation of depth but also with segmentation.

Our approach also takes inspiration from the single-task network RefineNet [40], where the output is refined using different feature scales. Similar to this method, our method separates the backbone into blocks. Each block outputs features at different scales which are further processed by specialized convolutional layers. RefineNet focuses only on segmenting an image, therefore, they use these convolutional layers to learn residual information at different scales that is subsequently added. In our case, each scale outputs either a disparity map or a segmentation, so each convolutional layer provides feature information for a specific task. To share this information with other scales (organized in branches in our implementation) and guarantee feature propagation, these features are concatenated rather than being used like a residual connection, similar to a DenseNet block in [41].

One last differential aspect to highlight is that the entire architecture is trained to solve both tasks at the same time, following an end-to-end fashion and propagating the information from the obtained disparity to calculate the segmentation (and vice-versa). It allows the architecture to combine the specific information calculated for each task, searching for commonalities and differences, as proposed by Baxter [39], in order to collaboratively improve the final result.

3. Proposed method

This section describes the proposed architecture and elaborates on how to use feature information from one task to improve other tasks using a coarse-to-fine structure with a binocular stereo input.

3.1. Network architecture

Before describing the network in detail, a brief overview of its main parts is given, as well as the general details of the different layers used. The proposed network consists of a Siamese architecture of backbones and a decoder with three specialized branches, as shown in Fig. 1. The first branch obtains a coarse segmentation and outputs task-specific features (𝐹𝑠𝑒𝑔) that are used as input for subsequent branches. The second branch outputs the disparity of the stereo pair and its own task-specific features (𝐹𝑑𝑖𝑠𝑝). The third branch uses the previous features (𝐹𝑠𝑒𝑔 and 𝐹𝑑𝑖𝑠𝑝) to calculate the refined segmentation. As is common in stereo vision, disparity is calculated with respect to the left image and, for consistency, also segmentation. This is why these high-level features are extracted only from the left image, although information from the right image is also reinforced in early stages of the architecture using the 𝐹1𝑅, 𝐹2𝑅, and 𝐹4𝑅 features of the backbone.

Unless otherwise stated, each convolution layer used in the network has a kernel size of 3 × 3 and is followed by a ReLU activation function [42] and a batch normalization operation [43]. Following [22,44], after concatenating any 2 or more features, they are processed with a series of top-down/bottom-up convolutions known as hourglass (encoder–decoder) to learn more context information. Each hourglass block is composed of 3 convolutions and 3 deconvolutions with residual connections [17].

It should also be noted that in order to obtain a prediction with almost the same dimension as the input without this implying an increase in the necessary resources, convolutions with a large field of view are used to extract high-level features that are concatenated with each of the branches before generating the prediction.
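To make these layer conventions concrete, the following PyTorch sketch shows the default convolution unit (a 3 × 3 convolution followed by batch normalization and ReLU) and an hourglass block built from 3 strided convolutions and 3 deconvolutions with residual connections. It is an illustrative sketch written for this text, not the authors' released implementation: the channel widths, strides, and the exact ordering of normalization and activation are assumptions.

import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, stride=1):
    # Default unit used throughout the network: 3x3 convolution + BatchNorm + ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Hourglass(nn.Module):
    # Encoder-decoder block: 3 downsampling convolutions and 3 transposed
    # convolutions, linked by residual (skip) connections, as described above.
    def __init__(self, ch):
        super().__init__()
        self.down1 = conv_bn_relu(ch, ch * 2, stride=2)
        self.down2 = conv_bn_relu(ch * 2, ch * 2, stride=2)
        self.down3 = conv_bn_relu(ch * 2, ch * 2, stride=2)
        self.up3 = nn.ConvTranspose2d(ch * 2, ch * 2, kernel_size=4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(ch * 2, ch * 2, kernel_size=4, stride=2, padding=1)
        self.up1 = nn.ConvTranspose2d(ch * 2, ch, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        d1 = self.down1(x)      # 1/2 of the input resolution
        d2 = self.down2(d1)     # 1/4
        d3 = self.down3(d2)     # 1/8
        u3 = self.up3(d3) + d2  # residual connection from the encoder path
        u2 = self.up2(u3) + d1
        return self.up1(u2) + x

A block of this kind is applied after every concatenation of two or more feature maps, both in the segmentation and in the disparity branches.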


Fig. 2. Shared weights backbones structure used to process the input stereo pair. See
more details in Appendix.

Fig. 3. Coarse segmentation branch. It receives as input 𝐹4𝐿 and 𝐹4𝑅 from Block 4 of the backbone. Before predicting the coarse segmentation, the left image is convolved and concatenated to add low-level information. See more details in Appendix.

The following sections describe the different parts of the proposed architecture in detail. Appendix provides an in-depth illustration of the complete architectural framework, showcasing the detailed composition of its layers. Furthermore, it includes tables that precisely specify the characteristics of each component.

3.2. Backbone feature extraction

As previously indicated, the network initially uses a Siamese structure [16] for feature extraction. To determine this architecture as well as the backbone to be used, a series of preliminary experiments were carried out (including networks such as DenseNet-121 [41], EfficientNet-B2 [45], MobileNet v3 [46], ResNet-101, and ResNet-50 [47]; some of these results will be analyzed in the experimentation section), eventually selecting DenseNet-121 to be used as the backbone.

Therefore, two instances of DenseNet-121 [41] with shared weights are used to obtain low and high-level features from the left and right input images. The shared weights approach reduces the number of parameters needed by the network [48] and facilitates obtaining consistent features for both inputs.

The backbones are divided into 5 blocks (see Fig. 2) to extract features at different scales. Each block reduces the dimension of its input in half. The dimensions of blocks 0, 1, 2, 3, and 4 respectively are 1∕2, 1∕4, 1∕8, 1∕16, and 1∕32, with reference to the size of the input image. The features extracted from these blocks will be denoted as 𝐹𝑖𝐿 and 𝐹𝑖𝑅 in the rest of the paper, where 𝑖 is the block and 𝐿 and 𝑅 indicate if the feature belongs to the left or right image. Specifically, the features 𝐹1𝐿∕𝑅, 𝐹2𝐿∕𝑅, and 𝐹4𝐿∕𝑅 (highlighted in Fig. 2) are the ones that will be used in the specialized branches described in the next sections.

3.3. Coarse segmentation branch

The first branch (see Fig. 3) extracts a coarse segmentation that is used as an auxiliary loss in order to optimize the quality of the final segmentation and also to aid in the disparity calculation [49].

The input of this branch is the concatenated feature maps 𝐹4𝐿 and 𝐹4𝑅, from Block 4 of the backbone. These features undergo an upscaling by a factor of two, followed by a convolution with a 1 × 1 kernel to reduce the depth of this concatenation, and thus the number of parameters needed. This information is then processed through an hourglass block in order to learn more context information, whose intermediate output is propagated to the next layer as well as used as a task-specific feature (𝐹𝑠𝑒𝑔).

Block 4 returns features with 1∕32 of the original image size, causing a loss of detailed information. Following [50], to reduce this problem, high-level features are obtained by processing the left image with a 5 × 5 convolution (i.e., a larger receptive field). This output is concatenated with the rest of the features in the branch (which are rescaled to match the size of the high-level features) and convolved one more time through the last hourglass block (see Fig. 3) to output the final segmentation. The Softmax activation function is used at the pixel level to calculate the probability of each class.

The segmentation branch extracts task-specific features (denoted as 𝐹𝑠𝑒𝑔 in Fig. 3), which are used to compute both the disparity and the final refined segmentation. In this way, the network explicitly passes the features learned by this branch for a progressive learning and refinement.

3.4. Disparity branch

The second branch calculates the disparity map from the stereo images (see Fig. 4). The inputs of this branch are the features 𝐹2𝐿 and 𝐹2𝑅 obtained from Block 2 of the corresponding left and right backbones. The correlation between these two input features is computed using a correlation layer [20], which takes a block of 𝐹2𝐿 and convolves it in a neighborhood window around 𝐹2𝑅. However, it does not capture correlations between distant features because the window size is limited. For this reason, Spatial Pyramid Pooling (SPP) [18] is used on the two feature planes, before extracting their correlations, to obtain multi-scale information, and thus, allow to find distant correlations. The depth of the correlation block is then reduced using a 1 × 1 convolution and concatenated with the features 𝐹𝑠𝑒𝑔 computed by the coarse segmentation branch (which have the same size thanks to the upscaling applied in the first branch). In this way, the features learned by the segmentation are included to add more information for the disparity estimation.

As before, and following [22,44], an hourglass block is used after this concatenation to extract more context information. Then, high-level features are added to the network by processing the left input image with a 5 × 5 convolution, concatenated with the rest of the features in the branch (which are rescaled to match the size), and processed through another hourglass block to calculate the final disparity. Similar to the previous decoder, this branch also outputs an intermediate task-specific feature (𝐹𝑑𝑖𝑠𝑝), which is used by the refined segmentation branch.
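The correlation step of the disparity branch can be illustrated with the following simplified PyTorch sketch, which builds a 1D cost volume between the left and right feature maps (e.g., 𝐹2𝐿 and 𝐹2𝑅) in the spirit of the correlation layer of [20]. The maximum disparity, the channel-mean normalization, and the omission of the SPP step are simplifications assumed here for brevity.

import torch

def correlation_volume(feat_l, feat_r, max_disp=24):
    # feat_l, feat_r: (B, C, H, W) feature maps produced by the shared (Siamese)
    # backbone for the left and right images. Channel d of the output holds the
    # similarity between each left-image position and the right-image position
    # shifted d pixels to the left.
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_l * feat_r).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_l[:, :, :, d:] * feat_r[:, :, :, :-d]).mean(dim=1)
    return volume

In the full branch, this volume would then be compressed with a 1 × 1 convolution and concatenated with the task-specific features 𝐹𝑠𝑒𝑔 before the hourglass blocks that regress the final disparity.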


Fig. 4. Scheme of the disparity branch. It extracts multi-scale information from backbone blocks 𝐹2𝐿 and 𝐹2𝑅 using Spatial Pyramid Pooling (SPP). This information is processed further and concatenated with the features from the coarse segmentation. See more details in Appendix.

Fig. 5. Refined segmentation branch. It concatenates the multi-scale feature information from 𝐹1𝐿 and 𝐹1𝑅. The features from the previous coarse segmentation branch (𝐹𝑠𝑒𝑔) and from the disparity branch (𝐹𝑑𝑖𝑠𝑝) are also added to the network using an attention layer. See more details in Appendix.

3.5. Refined segmentation branch

The third decoder computes the refined segmentation (see Fig. 5). In this branch, Spatial Pyramid Pooling is first used on the backbone features 𝐹1𝐿 and 𝐹1𝑅. The resulting features are concatenated and processed by an hourglass block. From this result, two subbranches are created: one part is combined with 𝐹𝑠𝑒𝑔 and the other with 𝐹𝑑𝑖𝑠𝑝 in order to add task-specific and high-level information from previous branches (note that these features are resized to match the dimensions of the branch's features). A 1 × 1 convolution with sigmoid activation is used on each part to compute an attention map. This operation focuses on selecting important features from each task (an example can be seen in Fig. 6). Each of these subbranches is then multiplied by the features of the previous hourglass block and concatenated with high-level features obtained by processing the left input image with a 5 × 5 convolution (note that scaling is also applied to adjust the dimensions). Finally, an encoder–decoder block with an output dimension equal to the number of labels and followed by a Softmax activation function is used to obtain the refined segmentation.

3.6. Model optimization

Rather than training each branch separately or a part of the network, freezing the weights and then adjusting the rest (as in the work of Yang et al. [12]), the optimization of the complete model is end-to-end. Each branch optimizes an objective function, either to reduce the segmentation error or the disparity error. For the loss of the two segmentation branches (ℒ𝑠𝑒𝑔𝑐𝑟 and ℒ𝑠𝑒𝑔𝑟𝑒𝑓), the Cross-entropy and the Lovasz loss are combined [51] to optimize the intersection-over-union between the prediction and the classes labeled in the ground truth of the left input image. For the disparity branch (ℒ𝑑𝑖𝑠𝑝) the L1 loss is used between the disparity prediction and the labeling of the left image. The final loss consists of the sum of the losses of the 3 branches: ℒ = ℒ𝑠𝑒𝑔𝑐𝑟 + ℒ𝑑𝑖𝑠𝑝 + ℒ𝑠𝑒𝑔𝑟𝑒𝑓.

The gradients of each branch loss are propagated across the whole network, for which the standard back-propagation algorithm can be used [52]. Simultaneously optimizing all three losses for the two target tasks allows the network to take advantage of the additional information provided for the other task. For example, knowing the depth can help to differentiate between two objects and, in the same way, knowing the segmentation and class of an object can help to find their correspondences for the calculation of the disparity.

4. Experimental setup

This section describes the datasets used for the experiments as well as the details of the model implementation and the training process. The implementation code is publicly available at: https://github.com/cuevhv/PMT_learning_for_semantic_segmentation_and_disparity.

All experiments were carried out using the Python programming language (v. 3.7) with the public library PyTorch (v. 1.9). The machine used consists of an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz with 16 GB RAM and an NVIDIA GeForce RTX 2070 with 6 GB GDDR6 Graphics Processing Unit (GPU).

4.1. Datasets

For the training and evaluation of the proposed method, three datasets were used: Cityscapes [34], TrimBot2020 Garden [36], and S-ROSeS [9]. These datasets are described in detail below. Table 1 includes a summary of their characteristics and some random image examples can be seen in Fig. 7.
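As a reference for the optimization described in Section 3.6, the joint objective can also be written as a small PyTorch function. This is an illustrative sketch: the cross-entropy and L1 terms follow the text directly, while the Lovasz term [51] is replaced by a simple differentiable soft-IoU surrogate, and the handling of unlabeled pixels and of prediction resolutions are assumptions.

import torch
import torch.nn.functional as F

def soft_iou_loss(logits, target, num_classes):
    # Differentiable IoU surrogate; the paper combines cross-entropy with the
    # Lovasz loss [51], for which this soft-Jaccard term is only a stand-in.
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target.clamp(0, num_classes - 1), num_classes)
    one_hot = one_hot.permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(2, 3))
    union = (probs + one_hot - probs * one_hot).sum(dim=(2, 3))
    return (1.0 - inter / union.clamp(min=1e-6)).mean()

def multitask_loss(seg_coarse, seg_refined, disp_pred, seg_gt, disp_gt, num_classes):
    # Total objective: L = L_seg_cr + L_disp + L_seg_ref, computed w.r.t. the
    # left image. Predictions are assumed to be upsampled to the GT resolution.
    loss_seg_cr = F.cross_entropy(seg_coarse, seg_gt) + soft_iou_loss(seg_coarse, seg_gt, num_classes)
    loss_seg_ref = F.cross_entropy(seg_refined, seg_gt) + soft_iou_loss(seg_refined, seg_gt, num_classes)
    loss_disp = F.l1_loss(disp_pred, disp_gt)
    return loss_seg_cr + loss_disp + loss_seg_ref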


Fig. 6. Example output for an attention map. The intensity of the attention is represented by a color gradient, with red indicating the areas with the highest attention. The map
obtained for the coarse segmentation (b) focuses on the inner part of the element to be segmented, where the network is more confident. On the other hand, the attention for
the disparity (c) marks textured areas of the background on which the network relies to calculate disparity. (For interpretation of the references to color in this figure legend, the
reader is referred to the web version of this article.)

Fig. 7. Random examples of the three datasets considered: Cityscapes, TrimBot2020 Garden, and S-ROSeS.

Table 1
Summary of the characteristics of the datasets considered, including the type of images, the number of samples, their resolution, the number
of segmented classes, and the type of disparity ground-truth (GT) provided.
Dataset              Type        # samples   Resolution (px.)   Segmented classes   Type of disparity GT
Cityscapes           Real        5000        2048 × 1024        19                  SGM stereo sparse map
TrimBot2020 Garden   Synthetic   12,500      640 × 480          9                   Synthetic dense 3D mesh
S-ROSeS              Synthetic   5760        720 × 480          2                   Synthetic dense disparity map

Cityscapes [34]: It is an urban scene understanding dataset for pixel-wise semantic segmentation that contains 19 classes plus 1 undetermined class, which—as in the original work—is not considered for evaluation. The 19 classes are grouped into 7 categories: flat (including classes such as road and sidewalk), human, vehicles, constructions, objects (with poles, traffic signs, traffic lights, etc.), nature (vegetation and terrain), sky, and void. The disparity ground truth is a sparse map obtained by SGM stereo. This corpus contains 5000 high-resolution 2048 × 1024 px images, from which 2975 (60%) were used for training, 500 (10%) for validation, and 1525 (30%) for testing (these partitions are those proposed by the authors and have been maintained for experimentation with all the methods considered to make a fair comparison).

TrimBot2020 Garden [36]: It consists of images from a synthetic garden under 5 weather conditions. The dataset contains 12,500 stereo pairs with a spatial resolution of 640 × 480 px, from which 10,000 (80%) were used for training and 2500 for testing (20%). In this case, 10% of the training set was used for validation, and, as before, the same partitions were kept for all compared methods. The ground truth has 9 pixel-wise segmented classes (grass, ground, pavement, hedge, topiary, rose, obstacle, tree, and background) and their corresponding disparity maps.

S-ROSeS (Synthetic dataset of Roses for Object Segmentation and Skeletonization) [9]: It comprises 5760 synthetic stereo images of rose bushes, divided into 80% for training and 20% for testing. Each stereo pair corresponds to a different view of 36 models of synthetic rose bushes. These plants were generated by imitating the morphology and appearance of real rose bushes with a mean height of 0.6 ± 0.2 m. Four backgrounds of exterior environments were created, which vary in background, light conditions, and the position of the sun. Both the


left and right input images, as well as the ground truths with the segmentation and the disparity map, were obtained with a resolution of 720 × 480 px. The depth map was calculated from the simulation of a stereo camera with parallel stereoscopic configuration, sensor size of 32 mm, focal length of 0.032 m, and baseline of 0.03 m.

For the assessment of each dataset, we employed the metrics proposed in the original papers to ensure that our results are directly comparable using the same benchmarks as others in the field. This approach allows for a consistent reference point across different studies. These metrics will be detailed in the results section as they are employed.

Table 2
Transformations considered during the data augmentation process. The dimensions for random cropping vary by dataset†: 512 × 512 for Cityscapes, and 480 × 480 for both TrimBot2020 and S-ROSeS. Additionally, class uniform sampling is tailored to balance the representation of each class within the dataset, without a predefined parameterization range.

Transformation           Range
Brightness               [0.5, 1.5]
Contrast                 [0.8, 1.2]
Saturation               [0.5, 1.5]
Gaussian blur            [0.25, 1.15]
Random cropping          {512 × 512, 480 × 480}†
Zooming in and out       [0.7, 1.5]
Class uniform sampling   –

4.2. Implementation details

The backbones (DenseNet-121 [41]) were first initialized using the pre-trained weights obtained with the ILSVRC dataset [53,54]. A fine-tuning process was then applied to the entire network using the specific corpus to be evaluated.

The training lasted a maximum of 70K iterations with a mini-batch size of 16 samples and early stopping when the loss did not decrease during 15 epochs. The training of the network parameters was made using Stochastic Gradient Descent (SGD) [52] with Adam [55] as optimizer (learning rate 1e−3, 𝛽1 = 0.9, 𝛽2 = 0.999, and 𝜖 = 1e−8).

Data augmentation was used to artificially increase the size of the training set by randomly applying different types of transformations to the original training samples [53,56]. Specifically, the transformations applied include changes in brightness, contrast, and saturation, the addition of gaussian blur (with a standard deviation in the range [0.25, 1.15]), random cropping (512 × 512 px for Cityscapes and 480 × 480 px for TrimBot2020 and S-ROSeS), zooming in and out (in the range [0.7, 1.5] with respect to the original image size), and class uniform sampling to handle the class imbalance, as proposed by Choi et al. [57]. These augmentations are detailed in Table 2. In each iteration of the training process, and for every image being processed, a set of transformations is randomly selected from the permitted range and applied to the image. This ensures that the data augmentation process introduces a diverse and varied set of modifications, enhancing the robustness and generalizability of the model.

Note that when optimizing two simultaneous tasks, the transformations have to be applied to both ground truths. In this sense, the operations considered do not alter the segmentation or disparity labeling, with the exception of the zoom, which does modify the disparity. For this reason, and to the best of our knowledge, it is not usually used when training disparity networks (as opposed to segmentation networks, where it is always used). In this work, we propose the use of this augmentation criterion by resorting to simple geometry: the disparity values were re-scaled based on the zooming scale factor, with the new disparity being 𝑑𝑖𝑠𝑝𝑛𝑒𝑤 = 𝑠𝑐𝑎𝑙𝑒 ∗ 𝑑𝑖𝑠𝑝𝑜𝑙𝑑.

5. Results

This section presents the results obtained with the proposed method following the experimental setup described in Section 4. The first part is devoted to the optimization of the different hyper-parameters of the scheme. After that, the individual results obtained for the three datasets are analyzed and compared with those of the state of the art.

5.1. Hyper-parameter tuning

For simplicity, the initial analysis is presented with Cityscapes as it is the most challenging dataset of those selected, although these experiments were also carried out with the other two with similar results. Table 3 shows a summary of the most relevant results obtained regarding the input and batch size, the data augmentation techniques considered, and the different topologies for the network's backbone. Results are reported in terms of mean class-wise Intersection over Union (mIoU) for the evaluation of semantic segmentation and D-1 error for disparity estimation. The latter calculates the percentage of pixel-wise disparity errors that exceed a threshold, which—as in the related literature—is set to 3 pixels of difference between the ground truth and the estimated disparity.

Similar to other works that use backbones with batch normalization layers [2,26], our experiments show that small training batches lead to unstable batch-norm statistics and to poor learning performance. Results show a progressive improvement in the disparity and segmentation when the batch size is increased up to 16, and get worse after that.

The results also show that larger input sizes provide better overall performance, which may be due to the following two reasons. First, the input size is obtained by randomly cropping the original image. This means that the bigger the extracted chunk, the more context the network sees, which translates into learning better relationships between objects. The second reason is related to the reduction in dimension due to the backbone topology. In the case of ResNet and DenseNet, the backbones are divided into blocks, and each block reduces by half its input size. This means that the feature dimensions of blocks 1 to 5 are 1∕2 to 1∕32 of the input image size. Therefore, if we use an input size of 256 × 256 px, the output feature from block 5 will be only 8 × 8.

Regarding data augmentation, it is observed how the different techniques progressively improve the segmentation result by up to 4.5% for the IoU and reduce the disparity error by a total of 0.042. Specifically, the different augmentation processes provide a similar improvement for segmentation. However, for disparity (see Table 3), zooming seems to help the most (reducing the error by 0.018), then class uniform sampling (with a decrease of 0.016), and finally brightness changes (lowering the error by 0.008).

The proposal was also evaluated considering six different backbones: EfficientNet-B2, MobileNet v3, ResNet-101, ResNet-50, ResNet-18, and DenseNet-121. As can be seen, the latter is the one that reports the best results both for segmentation and for calculating the disparity. For this reason, to carry out the rest of the experiments, we will use this backbone together with the best hyper-parameters previously determined: a batch size of 16, an input size of 512 × 512 px, and the use of all the types of data augmentation proposed.

5.2. Cityscapes

Once the best configuration of hyper-parameters has been determined, we proceed to evaluate the results obtained with the considered datasets, starting with Cityscapes.

Table 4 compares the proposal with the state of the art in the Cityscapes ranking. For comparison, the highest-scored methods for segmentation were considered, including Panoptic-DeepLab [30], RCNet [3], PSPNet [49], as well as the multi-task approach with the highest score for both tasks [13], which is called MTU. As our approach


Table 3
Comparison of the results obtained using different hyperparameters when using the Cityscapes validation set. Coarse segmentation (Seg cr.) and refined segmentation (Seg ref.) are evaluated considering the IoU figure of merit, and the disparity using the D1-error. The best result per metric is highlighted in bold.

                     Input size (px)   Backbone          Batch size   Random cropping   Zoom in/out   Brightness variation   Class sampling   Seg cr. IoU   Seg ref. IoU   Disp. D-1 error
Batch & input size   256 × 256         DenseNet-121      4            ✓                                                                       63.2          66.0           0.353
                     256 × 256         DenseNet-121      8            ✓                                                                       64.7          68.3           0.345
                     256 × 256         DenseNet-121      16           ✓                                                                       65.2          69.6           0.311
                     512 × 512         DenseNet-121      8            ✓                                                                       67.3          71.3           0.185
                     512 × 512         DenseNet-121      16           ✓                                                                       70.1          73.2           0.082
                     512 × 512         DenseNet-121      32           ✓                                                                       68.3          71.1           0.092
Data aug.            512 × 512         DenseNet-121      16           ✓                 ✓                                                     71.7          74.6           0.064
                     512 × 512         DenseNet-121      16           ✓                 ✓             ✓                                       73.0          76.1           0.056
                     512 × 512         DenseNet-121      16           ✓                 ✓             ✓                      ✓                74.6          77.6           0.040
Backbones            512 × 512         EfficientNet-B2   16           ✓                 ✓             ✓                      ✓                72.2          74.7           0.090
                     512 × 512         MobileNet v3      16           ✓                 ✓             ✓                      ✓                69.1          72.0           0.124
                     512 × 512         ResNet-101        16           ✓                 ✓             ✓                      ✓                60.4          64.0           0.453
                     512 × 512         ResNet-50         16           ✓                 ✓             ✓                      ✓                66.6          70.0           0.173
                     512 × 512         ResNet-18         16           ✓                 ✓             ✓                      ✓                62.1          66.7           0.326

Table 4
Comparison of the results obtained with the Cityscapes dataset in terms of class and category segmentation IoU, and disparity error. Categories group the 19 classes into 7 groups (see Section 4.1). Results are included for both the test and validation partitions, the latter being the most used in the literature. The best result per metric is marked in bold type and the second-best result is underlined.

Method              Backbone       Train input size (px)   # Params.   Seg. mIoU class (Val.)   Seg. mIoU class (Test)   Seg. mIoU category (Test)   Disp. RMSE (Test)
Multi-task methods
Ours                DenseNet-121   512 × 512               18.0M       77.6                     76.1                     91.0                        3.28
MTU [13]            Deeplab v3     512 × 512               64.2M       78.1                     78.5                     89.9                        5.88
Semantic segmentation methods
Panoptic [30]       Deeplab v3     2049 × 1025             46.7M       81.5                     84.5                     92.9                        N/A
RCNet [3]           ResNet-50      512 × 1024              14.1M       70.9                     –                        –                           N/A
PSPNet [49]         ResNet-101     2048 × 1024             54.7M       73.9                     –                        –                           N/A
Ladder-style [58]   DenseNet-121   2048 × 1024             8.2M        62.3                     –                        –                           N/A
Ladder-style [58]   DenseNet-169   2048 × 1024             15.6M       75.7                     74.3                     89.7                        N/A
Deeplab v3 [27]     Xception-71    513 × 513               46.7M       79.5                     82.1                     –                           N/A
Deeplab v3 [57]     ResNet-101     768 × 768               64.2M       79.2                     –                        –                           N/A
Deeplab v3 [57]     ResNet-50      768 × 768               45.1M       77.8                     –                        –                           N/A

is based on DenseNet, the work with the highest score that uses this backbone is also included, which is Ladder-style DenseNet [58]. In addition, given that the best-ranked methods are based on Deeplab v3, three versions of this architecture using different backbones (Xception-71, ResNet-101, and ResNet-50) were also included in the comparison. Please note that the column referring to input size denotes the size used during training, where crops of this size are extracted from the original training images. For evaluation, images are processed at their original size without cropping or scaling, ensuring a fair comparison.

When analyzing the results of this table, it is important to keep in mind the column with the number of parameters, since our proposal represents an efficient solution that addresses both tasks simultaneously. Most of the proposed approaches that yield better results are based on Deeplab v3, which ranges from 41M to 64.2M parameters, depending on the implementation [57]. In our case, the DenseNet-121 backbone [41] has only 8M parameters, and with the addition of specialized decoders for disparity and semantic segmentation, it reaches 18M parameters. This is between 2.6 to 3.6 times fewer than the top-performing state-of-the-art solutions (i.e., Panoptic-DeepLab and MTU).

If we compare the proposal with the best single-task solution (Panoptic), we can observe that it is only 3.9 points lower on the class-level validation set and 1.9 points lower on the category-level test set. As mentioned before, we must highlight that it is a much more efficient network (with 2.6 times fewer parameters) and that it also addresses two tasks simultaneously. When compared to other more efficient single-task solutions, such as Ladder-style or RCNet, with a similar number of parameters, our proposal achieves notably better results.

If we compare it with the multi-task alternative that obtains the best result in the Cityscapes ranking—i.e., MTU [13]—our network demonstrates a reduced disparity error and superior category-wise IoU, while being only 2.4% less effective in class-wise IoU on the test set and 0.5% on the validation set. Moreover, it is noteworthy that this level of performance is achieved with fewer than a third of the parameters used by MTU. To determine the source of performance differences—whether due to the backbone, training process, or loss function—we conducted two additional experiments (refer to Table 5): (1) We applied their loss function within our architecture (see the second row in the table) and (2) We replaced the MTU backbone with our method's DenseNet-121, while retaining its decoder and uncertainty loss (see the third row). The outcomes indicate that our approach remarkably surpasses MTU in both segmentation and disparity metrics under identical evaluation and training conditions. The integration of its loss function negatively impacts the performance of our model, and the use of our backbone in MTU leads to a significant deterioration in their results. Hence, the superior class-wise IoU of MTU presented in Table 4 should be directly attributed to its backbone, which has a considerably higher parameter count.

For a qualitative analysis of the results, Fig. 8 shows four examples (columns) of the segmentation and disparity estimation obtained with our network. The first two rows show the input images, followed by the disparity and segmentation results compared with the GT. In these images, it can be seen that the errors made occur mainly at edges, isolated pixels, or in the lower area that corresponds to the hood of the car, for which the sensor does not seem to provide adequate information.
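For completeness, the two figures of merit reported in Tables 3 and 4 can be sketched as follows. The mIoU follows the usual class-wise definition; for the D-1 error we assume the common convention of counting the fraction of valid pixels whose absolute disparity error exceeds the 3-pixel threshold of Section 5.1 (lower is better), and we assume that a value of 0 in the sparse ground truth marks pixels without disparity.

import torch

def mean_iou(pred, target, num_classes):
    # Mean class-wise Intersection over Union between predicted and GT label maps.
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (target == c)).sum().item()
        union = ((pred == c) | (target == c)).sum().item()
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / max(len(ious), 1)

def d1_error(disp_pred, disp_gt, threshold=3.0):
    # Fraction of valid pixels whose absolute disparity error exceeds the threshold.
    valid = disp_gt > 0
    wrong = (disp_pred - disp_gt).abs() > threshold
    return (wrong & valid).float().sum() / valid.float().sum().clamp(min=1)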


Fig. 8. Four examples of results with Cityscapes. The input stereo pair is shown in the first two rows. The ground truth and predicted disparities are shown in rows 3 and 4. The
error of the predicted disparity (row 5) encodes the error 𝑒 in 3 colors: blue (𝑒 < 3), green (𝑒 < 6), and red (𝑒 ≥ 6). Rows 6 and 7 show the ground truth and the prediction obtained
for the segmentation task. The black color in the disparity and segmentation images means that the ground truth has no disparity or class assigned to that pixel, respectively.
(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 5
Comparison of our proposal and MTU [13] on the Cityscapes validation set when using the same backbone (DenseNet-121) and changing the decoder and the loss function. The changes applied are underlined. The original results of our approach and MTU are included for reference (1st and 4th rows).

Method (decoder)   Backbone       Loss   Seg. mIoU class   Disp. D-1 error
Ours               DenseNet-121   Ours   77.6              0.040
Ours               DenseNet-121   MTU    72.2              0.067
MTU                DenseNet-121   MTU    59.6              0.347
MTU                Deeplab v3     MTU    78.1              0.102

Table 6
TrimBot2020 dataset comparison. The segmentation accuracy (Seg acc.) and completeness (comp.) are in percentages. The 3D reconstruction accuracy (Acc. (m)) reports the error in meters, therefore, the lower the value the better. The best result per metric is highlighted in bold.

Method       # Params.   Seg. acc.   3D reconstruction Acc. (m)   Comp.
Ours         18.0M       91.9        0.061                        77.8
DTIS [59]    207.8M      91.9        0.122                        66.2
HAB [36]     64.2M a     79.0        0.069                        74.0
LAPSI [60]   N/A         –           0.164                        23.9

a The number of parameters refers only to the segmentation algorithm, an iterative classical vision algorithm is used for the disparity.

5.3. TrimBot2020 Garden

In this section, the proposed method is evaluated with the TrimBot2020 Garden dataset and compared with the best-performing state-of-the-art methods for semantic segmentation and depth estimation: DTIS [59], HAB [36], and LAPSI360 [60]. DTIS uses a network similar to FuseNet [61], which has as input an RGB image and a depth map, and outputs both the semantic segmentation and a refined version of the depth map. HAB uses ELAS stereo [62] to produce a dense point cloud and DeepLab v3 [28] to obtain the segmentation. The resulting point cloud is denoised with class-specific filters based on the 3D geometry. LAPSI360 focuses on the 3D reconstruction; for this it generates the geometric mesh of the garden using the information from the 10 cameras provided in the dataset.

Following the original HAB paper [36], pixel-wise accuracy was used to evaluate the semantic segmentation. Using the TrimBot2020 Garden dataset allows evaluation of the quality of the 3D reconstruction instead of finding the error of the 3D depth or disparity. To evaluate the goodness of these reconstructions, 2 metrics were used: completeness and 3D accuracy. Completeness is calculated by considering a predicted point as correct if it differs from the ground truth by ≤0.05 m, similar to the D-1 error. The 3D accuracy is the distance 𝑑 in meters, such that 90% of the reconstruction is within 𝑑 meters of the ground truth mesh. In our case, disparity predictions were transformed into depth maps and back-projected into the 3D space using the camera parameters (provided in the original paper).

Table 6 shows the results of this comparison. As can be seen, our method obtains a segmentation accuracy similar to DTIS—using


11.5 times fewer parameters—and outperforms HAB. Regarding the 3D reconstruction, our network achieves the best score in both metrics considered. It is 3.8% more complete than the best result (i.e., HAB), and the reconstruction accuracy is almost 0.01 m better. It is important to highlight that, unlike HAB, our method obtains both results at the same time without applying any heavy point cloud filtering or using 3D features.

Fig. 9 shows some results of our network for different views of the TrimBot2020 Garden. The proposed approach captures details like thin branches not only in the segmentation but also in the disparity. In addition, and for a better qualitative analysis, Fig. 10 shows the 3D reconstruction obtained for the entire garden. Here you can see how most errors are made on the edges, corners, or in areas where many fine branches appear. In the latter case, it is mainly due to the error accumulated by combining multiple views in which the completeness of the reconstruction of the branches is not exact.

5.4. S-ROSeS dataset

This section focuses on the evaluation of the proposed method using the third dataset—i.e., the S-ROSeS dataset—and on the comparison with different methods from the state of the art for this task: U-Net [63], SegNet [64], and DeepLab v3 [28], as well-known algorithms for segmentation, and the methods FCSN and FCSN-Combined [9], which also compute the disparity. These last two methods are based on a custom architecture for segmentation and on the Block Matching algorithm [65] for disparity. The FCSN-Combined variant performs an iterative process that combines the segmentation and the disparity obtained to improve each other.

The results of this comparison are shown in Table 7. As in the original paper [9], we resort to Precision, Recall, F1, and IoU to assess segmentation, and to Root Mean Squared Error (RMSE) and Squared Relative difference (Sq-Rel) to evaluate disparity. As observed in Table 7, the proposed method obtains the best results in all the metrics considered, remarkably surpassing the rest of the proposals, especially in Precision. The only exception is the number of parameters. However, it is important to note that in the case of U-Net (as well as for SegNet and DeepLab), only the segmentation is calculated, and in FCSN the number of parameters refers only to the segmentation architecture, since it requires the use of another iterative and slow method for calculating the disparity.

Table 7
Comparison of the results obtained for the calculation of the disparity and segmentation with the S-ROSeS dataset. The best result per metric is highlighted in bold.

Method            # Params.   Precision   Recall   F1      mIoU    RMSE     Sq-Rel
Multi-task methods
Ours              18.0M       98.49       93.28    95.80   97.52   0.05     0.0134
FCSN [9]          2.9M a      90.47       93.18    91.70   94.16   0.5648   0.0917
FCSN-Comb. [9]    2.9M a      92.86       92.59    92.96   95.07   0.5082   0.0630
Semantic segmentation methods
U-Net [63]        1.9M        91.08       88.50    89.77   90.21   N/A      N/A
SegNet [64]       5.5M        88.84       77.07    81.81   83.60   N/A      N/A
DeepLab v3 [28]   11.8M       76.94       84.63    80.59   79.64   N/A      N/A

(Precision, Recall, F1, and mIoU assess the segmentation; RMSE and Sq-Rel assess the disparity.)
a The number of parameters refers only to the segmentation algorithm. An iterative classical vision algorithm is used for the disparity.

As before, for a qualitative analysis of the results, Fig. 11 shows four examples (columns) of the segmentation and disparity estimation obtained with our network. The first two rows show the input images, followed by the disparity and segmentation results compared with the GT. Regarding the disparity, it is observed that errors are only made at the edges of the fine branches. This is better understood if we also analyze the segmentation result, for which both coarse and refined segmentations are shown. The auxiliary loss detects the main parts of the bush, which are later refined to obtain the final segmentation where only a few mistakes are made at the edges of the finer branches. The reported improvement, especially in the detection of thin branches, is very important for the original task of this dataset, since it will allow a better reconstruction of the rose bush and, in this way, avoid possible errors in the pruning process.

6. Discussion

This section presents a more in-depth analysis of the selected architecture and the results obtained. First, we evaluate the improvement provided by the use of progressive feature learning with auxiliary loss functions. We also study the relationship between the efficiency and precision of the method and compare it with the other state-of-the-art methods.

6.1. Effect of task-specific features

A series of experiments were conducted to test the effectiveness of sharing task-specific features among the branches of the proposed network. Recall that it concatenates feature information from the coarse segmentation branch (𝐹𝑠𝑒𝑔) with the disparity, and then uses the segmentation and disparity features (𝐹𝑠𝑒𝑔 and 𝐹𝑑𝑖𝑠𝑝) to refine the final segmentation (see Fig. 1).

To test if this progressive sharing of feature information helps the network to improve the calculation of disparity and/or semantic segmentation, three variations of the proposed method (generated by removing the different connections between the task branches) were trained and tested on the validation set of Cityscapes¹: (1) All the connections between the task-specific features and branches were removed, (2) The connection between the features from the coarse segmentation (𝐹𝑠𝑒𝑔) and the disparity decoder was kept, but the connections of the disparity and the coarse segmentation features (𝐹𝑠𝑒𝑔 and 𝐹𝑑𝑖𝑠𝑝) with the refined segmentation branch were removed, and (3) The connection 𝐹𝑠𝑒𝑔 from the disparity branch was removed and the connections 𝐹𝑠𝑒𝑔 and 𝐹𝑑𝑖𝑠𝑝 with the refined segmentation branch were kept. Table 8 shows the results of this experiment, where the fourth row corresponds to the original model with all the connections.

The experiments show that the connection between the coarse segmentation features 𝐹𝑠𝑒𝑔 and the disparity branch helps to reduce the disparity error from 0.064 to 0.051 (results from the first and second row of the table). The lowest disparity error, 0.040, is obtained when 𝐹𝑠𝑒𝑔 and 𝐹𝑑𝑖𝑠𝑝 are also shared with the refined segmentation branch (fourth row). This result points out that, even though this last connection does not affect the disparity prediction at inference time, the progressive connection between task-specific features does help in the training process to learn better descriptors.

¹ We show the results for this dataset as it is the most challenging of the three, although similar experiments were performed with the others with similar results.
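The progressive sharing recalled at the beginning of this section can be summarized with the following high-level PyTorch sketch. The three branch modules are placeholders whose internals follow Sections 3.2 to 3.5; the dictionary keys and module interfaces are assumptions made purely for illustration.

import torch.nn as nn

class ProgressiveMultiTaskNet(nn.Module):
    # High-level data flow of the architecture: a shared (Siamese) backbone feeds
    # three branches, and each branch forwards its task-specific features.
    def __init__(self, backbone, coarse_seg_branch, disparity_branch, refined_seg_branch):
        super().__init__()
        self.backbone = backbone
        self.coarse_seg = coarse_seg_branch
        self.disparity = disparity_branch
        self.refined_seg = refined_seg_branch

    def forward(self, left, right):
        f_l = self.backbone(left)    # e.g. {'F1': ..., 'F2': ..., 'F4': ...}
        f_r = self.backbone(right)   # same weights applied to the right image
        seg_coarse, f_seg = self.coarse_seg(f_l['F4'], f_r['F4'], left)
        disp, f_disp = self.disparity(f_l['F2'], f_r['F2'], f_seg, left)
        seg_refined = self.refined_seg(f_l['F1'], f_r['F1'], f_seg, f_disp, left)
        return seg_coarse, disp, seg_refined

Removing the f_seg or f_disp arguments from this flow corresponds to the ablated variants evaluated below.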


Fig. 9. Four examples of the results obtained for the TrimBot2020 Garden. The input stereo pair is included in the first two rows. The ground truth and predicted disparities are
shown in rows 3 and 4, respectively. Row 5 shows the error 𝑒 of the predicted disparity using 3 colors: blue (𝑒 < 3), green (𝑒 < 6), and red (𝑒 ≥ 6). Rows 6 and 7 show the ground
truth and the prediction obtained for the segmentation task. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this
article.)

Table 8
Task-specific feature sharing effect. The first two columns show if the features 𝐹𝑑𝑖𝑠𝑝 and 𝐹𝑠𝑒𝑔 were concatenated to the disparity or refined segmentation branches. Seg cr. and Seg ref. IoU show the results for coarse and refined segmentation. The best result per metric is highlighted in bold.

𝐹𝑠𝑒𝑔 ↓ Br. disp | 𝐹𝑠𝑒𝑔 ∧ 𝐹𝑑𝑖𝑠𝑝 ↓ Br. seg ref. | Seg cr. IoU | Seg ref. IoU | Disp D-1 error
✗ | ✗ | 73.3 | 64.8 | 0.064
✓ | ✗ | 73.3 | 64.7 | 0.051
✗ | ✓ | 72.5 | 75.7 | 0.060
✓ | ✓ | 74.6 | 77.6 | 0.040

These experiments also demonstrate that the refined segmentation is improved when it receives the task-specific features 𝐹𝑠𝑒𝑔 and 𝐹𝑑𝑖𝑠𝑝. The improvement ranges from 64.8% to 75.7% when the disparity branch does not receive 𝐹𝑠𝑒𝑔, and to 77.6% when it does. This also shows that if the disparity branch receives information from the coarse segmentation task, it can learn more meaningful features that can be passed to the refined segmentation branch.

Finally, the experiments show that the refined segmentation performs worse than the coarse segmentation (first two rows of Table 8) when it does not receive the features 𝐹𝑠𝑒𝑔 and 𝐹𝑑𝑖𝑠𝑝. This is probably because it can only use the features at the beginning of the backbone (low-level features), which are not enough for such a complex task as semantic segmentation.
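To make the connections ablated in Table 8 explicit, the following PyTorch-style sketch shows how the task-specific features 𝐹𝑠𝑒𝑔 and 𝐹𝑑𝑖𝑠𝑝 can be routed between branches. The module and argument names (backbone, coarse_seg_branch, etc.) are placeholders, and the sketch is not the exact implementation; it only illustrates which branch receives which features in each configuration of the table.

```python
import torch.nn as nn

class ProgressiveSharing(nn.Module):
    """Illustrative wiring of the feature-sharing connections studied in Table 8."""
    def __init__(self, backbone, coarse_seg_branch, disparity_branch, refined_seg_branch,
                 seg_to_disp=True, seg_and_disp_to_refined=True):
        super().__init__()
        self.backbone = backbone
        self.coarse_seg_branch = coarse_seg_branch
        self.disparity_branch = disparity_branch
        self.refined_seg_branch = refined_seg_branch
        self.seg_to_disp = seg_to_disp                          # column 1 of Table 8
        self.seg_and_disp_to_refined = seg_and_disp_to_refined  # column 2 of Table 8

    def forward(self, left, right):
        feats_l = self.backbone(left)      # Siamese backbone: the same weights
        feats_r = self.backbone(right)     # are applied to both views
        coarse_seg, f_seg = self.coarse_seg_branch(feats_l)
        extra_disp = [f_seg] if self.seg_to_disp else []
        disparity, f_disp = self.disparity_branch(feats_l, feats_r, *extra_disp)
        extra_ref = [f_seg, f_disp] if self.seg_and_disp_to_refined else []
        refined_seg = self.refined_seg_branch(feats_l, *extra_ref)
        return coarse_seg, refined_seg, disparity
```

Switching the two flags on and off reproduces the four configurations evaluated in the rows of Table 8.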


Fig. 10. 3D reconstruction obtained for the entire TrimBot2020 Garden dataset. Cold colors indicate well-reconstructed segments. Hot colors indicate missing parts (completeness).
(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

6.2. Effectiveness vs. efficiency

Another important aspect to discuss is the pair of objectives of high efficiency (measured in the number of parameters) and precision of the proposed method, because they are, quite often, opposing objectives: attempting to improve one of them usually implies a deterioration in the other. From this point of view, the task can be seen as a Multi-objective Optimization Problem (MOP) in which two functions are optimized simultaneously.

The means commonly employed to evaluate this type of problem is the concept of non-dominance: one solution is said to dominate another if, and only if, it is better or equal in each objective function and, at least, strictly better in one of them. The Pareto frontier stands for the set of all non-dominated elements and represents the different optimal solutions to the MOP. The strategies within this set can be considered the best, without any particular priority or order among them.

Fig. 12 compares the semantic segmentation precision and the efficiency of the algorithms evaluated assuming a MOP scenario. In addition, the Pareto frontier is marked with the non-dominated results, in which four combinations are found to be non-dominated for the Cityscapes dataset (our proposal, Ladder D121, Ladder D169, and Panoptic), two for the Garden dataset (our proposal and DTIS), and three for S-ROSeS (our proposal, U-Net, and FCSN-Comb.). Therefore, for all datasets, our solution is the multi-task approach (marked with triangles in the graph) with the best combination of efficiency and precision. The rest of the non-dominated results are either optimized for a single task (Ladder, Panoptic, and U-Net) or are multi-task solutions that are much less efficient (DTIS and FCSN-Comb.). In the case of FCSN-Comb., it is important to remember the considerable slowdown produced by the combination of segmentation and disparity in an iterative algorithm.
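As an illustration of the non-dominance criterion used here, the short sketch below computes the Pareto frontier for a set of methods described by (number of parameters, accuracy) pairs. The values in the usage comment are placeholders; the actual figures are those plotted in Fig. 12.

```python
def dominates(a, b):
    """a dominates b if it has no more parameters, no lower score, and is strictly better in one."""
    return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])

def pareto_frontier(methods):
    """Return the non-dominated subset of a {name: (params_in_millions, score)} dict."""
    return {name: v for name, v in methods.items()
            if not any(dominates(w, v) for other, w in methods.items() if other != name)}

# Usage with placeholder values (see Fig. 12 for the real measurements):
# pareto_frontier({"method_a": (15.0, 0.74), "method_b": (60.0, 0.76), "method_c": (20.0, 0.70)})
```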
6.3. Computational complexity

To assess the efficiency of the compared architectures, we have focused on the number of parameters, disregarding execution time due to its dependence on factors not related to the method itself, such as the programming language, code optimization, external libraries, or hardware configuration, among others. Therefore, we consider the number of parameters a fairer measure, as it is directly related to the time neural architectures will take to perform a forward pass.

However, in this section, we also analyze the execution time to give an intuitive idea of the efficiency of the proposal according to the available resources. For this, an experiment was carried out on three different machines: one with very high resources (equipped with a DGX A100 GPU with 40 GB of RAM), one more modest (an RTX 4090 with 24 GB), and one with low resources (a GTX 1650 Super with 4 GB). Table 9 shows the average training time per epoch and the inference time, calculated on the S-ROSeS dataset by averaging over 100 training epochs and 100 forward passes for inference. These times were obtained for a non-optimized implementation in Python using the PyTorch library. Nonetheless, acceptable times of about 2 FPS are observed for the low-resource machine, which could be substantially improved by optimizing the code.

Comparing these results with some state-of-the-art methods, we can confirm the argument regarding the relationship between efficiency and the number of parameters. For example, the compared architecture with the most similar number of parameters, Ladder-169, which has 15.6M (vs. 18M in our case), has an average inference time of 0.56 s in the low-resource scenario. Heavier architectures like DTIS or FCSN-Comb. report much slower inference times of 4.23 and 5.67 s, respectively. Meanwhile, more efficient architectures, such as SegNet or U-Net, run in average times of 0.1 s. However, in return, they achieve worse results and only predict segmentation, so the execution time of the disparity calculation, a more complex and demanding task than segmentation, would still have to be added.
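For reproducibility, the sketch below shows one way to measure the average forward-pass time of a stereo model in PyTorch, similar in spirit to how the inference times in Table 9 were obtained (averaging over repeated forward passes). It is an illustrative example: the model interface and the number of warm-up passes are assumptions.

```python
import time
import torch

@torch.no_grad()
def mean_inference_time(model, stereo_pairs, device="cuda", warmup=10):
    """Average forward-pass time in seconds over a list of (left, right) tensors."""
    model.eval().to(device)
    for left, right in stereo_pairs[:warmup]:        # warm-up passes are not timed
        model(left.to(device), right.to(device))
    total = 0.0
    for left, right in stereo_pairs:
        if device.startswith("cuda"):
            torch.cuda.synchronize()                 # wait for pending GPU work
        start = time.perf_counter()
        model(left.to(device), right.to(device))
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        total += time.perf_counter() - start
    return total / len(stereo_pairs)
```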
6.4. Limitations

Having discussed the strengths of the proposal, this final section addresses its limitations, which are mainly related to the training data, its type, and the types of errors, as detailed below.


Fig. 11. Four examples of the results obtained for S-ROSeS. The input stereo pair is included in the first two rows. The GT and predicted disparities are shown in rows 3 and
4. Row 5 shows the error 𝑒 of the predicted disparity using 3 colors: blue (𝑒 < 3), green (𝑒 < 6), and red (𝑒 ≥ 6). Rows 6 and 7 respectively show the coarse and the refined
segmentation. In this case, the differences with the GT have been marked with two colors: blue for false positives (predicted branch pixels that do not appear in the GT) and red
for false negatives (branch pixels in the GT that have not been detected). (For interpretation of the references to color in this figure legend, the reader is referred to the web
version of this article.)

Table 9
Comparison of execution time (in seconds) for the proposed model across three computer architectures with varying resource levels, presenting the mean time plus standard deviation for both training over 100 epochs and inference on 100 images.

GPU model | Epoch training time (s) | Inference time (s)
DGX A100 | 133.74 ± 13.81 | 0.15 ± 0.01
RTX 4090 | 386.32 ± 48.73 | 0.49 ± 0.06
GTX 1650 S | 601.46 ± 37.27 | 0.60 ± 0.02

Although the method is proposed as efficient in terms of parameters and computational resources, it still requires a significant volume of annotated data with information on both disparity and semantic segmentation for training, which can be challenging in domains where data annotation is costly or difficult to obtain. As a solution to this limitation, future work could explore the integration of Self-Supervised Learning [66] to improve weight initialization and reduce the amount of annotated data required. Alternatively, the integration of Domain Adaptation techniques [67] could be considered to adapt models learned in one domain to new application domains. This latter technique would also allow the method to generalize to new domains or types of scenes, which is especially relevant in robotics and autonomous driving applications where environmental conditions can vary significantly.

Another potential limitation of the proposal is its dependency on a very specialized type of data for disparity estimation, namely stereoscopic images. This dependency could limit its applicability in situations where only monocular images are available or where capturing stereoscopic images is not feasible. In these cases, adapting the architecture to implement depth estimation from a single image, which can leverage monocular depth cues such as the relative size of objects, perspective, and shadows to infer depth without the need for a stereo pair, could be studied [15]. Most of the architecture already operates with only the left image, so it would only be necessary to integrate these techniques in place of the depth calculation that uses the right image.


Fig. 12. Analysis of efficiency (measured in the number of parameters) and effectiveness (considering IoU for the Cityscapes and S-ROSeS datasets and Pixel Accuracy for
TrimBot2020 Garden) as a Multi-objective Optimization Problem (MOP). Non-dominated elements and multi-task approaches are highlighted.

On the other hand, the method benefits from the use of attention mechanisms and the progressive sharing of task-specific features across its branches. However, this strategy might not be optimal for all object classes or objects at different scales, which could affect the precision of the segmentation and disparity estimation in complex scenes. This limitation has been observed in the qualitative analysis conducted for the coarse segmentation in Fig. 11, where the main errors appear at the edges of thin elements. However, these errors seem to be resolved when refining the segmentation by leveraging shared information from the different branches and the details of higher-level features. Further research is intended on how to address this limitation by including other types of techniques, such as a loss function that penalizes errors at the edges, the use of specific data augmentation techniques that generate variations at the edges of objects, or the inclusion of other techniques in the architecture, like Atrous or dilated convolutions [26].

Fig. A.13. Comprehensive visual overview of the proposed architecture including the detail of each of its parts. Refer to Tables A.10, A.11, A.12, and A.13 for specific information on the backbones and branches related to coarse segmentation, disparity, and refined segmentation, respectively.

7. Conclusions
This work presents an efficient end-to-end method based on multi-task learning that successfully learns the semantic segmentation and the disparity map together from an input stereo pair. To increase efficiency without sacrificing accuracy, a combination of techniques is proposed, such as the use of a Siamese architecture, decoders that perform progressive multi-scale learning and share task-specific features, the inclusion of attention layers to focus on the relevant parts, or Spatial Pyramid Pooling to capture correlations between distant features, among others.

The proposal is evaluated on 3 datasets (Cityscapes, TrimBot2020 Garden, and S-ROSeS) and compared with 21 state-of-the-art methods, outperforming the multi-task learning solutions and obtaining competitive results against the methods that solve each task individually, while using only 1∕3 of their parameters.

It is experimentally demonstrated that using separate decoders for each task is not sufficient to achieve good performance, and that sharing the knowledge of the segmentation and disparity tasks between them, in a progressive fashion, improves the results of both tasks. Since the features of each task are obtained at different resolutions, these task-specific features also provide multi-scale information for coarse-to-fine learning.


In addition, sharing the learned features also helps to reduce the complexity of the network and thus to increase its performance. This reduced number of parameters may allow its application on systems that have limited resources, such as low-budget autonomous driving systems or the development of affordable robots for garden care.

As future work, it is intended to extend the proposal to process other types of data, such as point clouds, to directly calculate their segmentation and depth. We also intend to further improve the efficiency of the method by studying more advanced network architectures, using more efficient backbones, and optimizing the implementation. Another line of work is exploring proposals to reduce the amount of training data needed, including techniques such as self-supervised learning, few-shot learning, or domain adaptation.

CRediT authorship contribution statement

Hanz Cuevas-Velasquez: Conceptualization, Formal analysis, Investigation, Methodology, Writing – original draft. Alejandro Galán-Cuenca: Formal analysis, Investigation, Software, Validation, Writing – original draft, Writing – review & editing. Robert B. Fisher: Conceptualization, Formal analysis, Investigation, Supervision, Writing – review & editing. Antonio Javier Gallego: Writing – original draft, Writing – review & editing, Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Data availability

Data will be made available on request.

Acknowledgments

This work was supported by the I+D+i project TED2021-132103A-I00 (DOREMI), funded by MCIN/AEI/10.13039/501100011033.

Appendix. Network details

This section complements the network details described in the main body of the paper. Fig. A.13 shows a comprehensive overview of the proposed architecture with detailed information on each of its parts. Table A.10 shows the split of the backbone and the dimension of each output feature 𝐹𝑖. Tables A.11, A.12, and A.13 show the layer settings, inputs, and outputs of the coarse segmentation, disparity estimation, and refined segmentation branches, respectively. Each layer has a stride of 1 except for the Spatial Pyramid Pooling (SPP), where the stride is the same size as the pooling window.

Table A.10
Backbone settings. It shows the input image (Left and Right) for each backbone and the features that are extracted. The size of each feature is shown with respect to the input size (height 𝐻 and width 𝑊). The last dimension of the output represents the number of channels.

Input | Layer settings | Output (Name: Dim)
Left | Densenet121 | 𝐹1𝐿: 𝐻∕4 × 𝑊∕4 × 128; 𝐹2𝐿: 𝐻∕8 × 𝑊∕8 × 256; 𝐹4𝐿: 𝐻∕32 × 𝑊∕32 × 1024
Right | Densenet121 | 𝐹1𝑅: 𝐻∕4 × 𝑊∕4 × 128; 𝐹2𝑅: 𝐻∕8 × 𝑊∕8 × 256; 𝐹4𝑅: 𝐻∕32 × 𝑊∕32 × 1024

Table A.11
Coarse segmentation branch modules. For the convolutional layers conv_1d and conv_2d, the layer settings 𝑘 × 𝑘, 𝑓 represent the kernel size 𝑘 and the number of filters 𝑓.

Input | Layer settings | Output
Left | conv2D(5 × 5, 1) | L_conv_0
𝐹4𝐿, 𝐹4𝑅 | concat | cat_0
cat_0 | upscale ×2, conv1D(1 × 1, 64) | conv1d_0
conv1d_0 | hourglass(3 × 3, 32) | 𝐹𝑠𝑒𝑔
𝐹𝑠𝑒𝑔 | resize to L_conv_0 dim | convdec_0
L_conv_0, convdec_0 | concat | cat_1
cat_1 | conv1D(1 × 1, 32) | conv1d_1
conv1d_1 | hourglass(3 × 3, 32) | convdec_1
convdec_1 | conv2D(3 × 3, 𝑛_𝑙𝑎𝑏𝑒𝑙𝑠) | conv2d_0
conv2d_0 | resize to Left dim | coarse_seg

Table A.12
Disparity branch modules. The brackets in the 𝑆𝑃𝑃 module indicate that the input was processed in 4 different ways and the result of each process was concatenated.

Input | Layer settings | Output
Left | conv2D(5 × 5, 1) | L_conv_1
𝐹2𝐿 | [avgPool(32, 32), conv2D(3 × 3, 32); avgPool(16, 16), conv2D(3 × 3, 32); avgPool(8, 8), conv2D(3 × 3, 32); 𝐹2𝐿] | 𝑆𝑃𝑃2𝐿
𝐹2𝑅 | [avgPool(32, 32), conv2D(3 × 3, 32); avgPool(16, 16), conv2D(3 × 3, 32); avgPool(8, 8), conv2D(3 × 3, 32); 𝐹2𝑅] | 𝑆𝑃𝑃2𝑅
𝑆𝑃𝑃2𝐿, 𝑆𝑃𝑃2𝑅 | correlation_layer | corr
corr | conv1D(1 × 1, 128) | conv1d_2
𝐹𝑠𝑒𝑔 | hourglass(3 × 3, 128), resize to conv1d_2 dim | convdec_2
conv1d_2, convdec_2 | concat | cat_3
cat_3 | hourglass(3 × 3, 64) | 𝐹𝑑𝑖𝑠𝑝
𝐹𝑑𝑖𝑠𝑝 | resize to L_conv_1 dim | convdec_3
L_conv_1, convdec_3 | concat | cat_4
cat_4 | conv1D(1 × 1, 64) | conv1d_3
conv1d_3 | hourglass(3 × 3, 64) | convdec_4
convdec_4 | deconv2D(5 × 5, 1) | deconv2d_0
deconv2d_0 | resize to Left dim | disp

Table A.13
Refined segmentation branch modules. The brackets in the 𝑆𝑃𝑃 module follow the same notation as in Table A.12.

Input | Layer settings | Output
Left | conv2D(5 × 5, 1) | L_conv_2
𝐹1𝐿 | [avgPool(64, 64), conv2D(3 × 3, 32); avgPool(32, 32), conv2D(3 × 3, 32); avgPool(16, 16), conv2D(3 × 3, 32); avgPool(8, 8), conv2D(3 × 3, 32); 𝐹1𝐿] | 𝑆𝑃𝑃1𝐿
𝐹1𝑅 | [avgPool(64, 64), conv2D(3 × 3, 32); avgPool(32, 32), conv2D(3 × 3, 32); avgPool(16, 16), conv2D(3 × 3, 32); avgPool(8, 8), conv2D(3 × 3, 32); 𝐹1𝑅] | 𝑆𝑃𝑃1𝑅
𝑆𝑃𝑃1𝐿, 𝑆𝑃𝑃1𝑅 | concat | cat_5
cat_5 | conv1D(1 × 1, 128) | conv1d_4
conv1d_4 | hourglass(3 × 3, 64) | convdec_5
𝐹𝑠𝑒𝑔 | resize to convdec_5 dim | 𝐹𝑠𝑒𝑔
𝐹𝑑𝑖𝑠𝑝 | resize to convdec_5 dim | 𝐹𝑑𝑖𝑠𝑝
𝐹𝑠𝑒𝑔, convdec_5 | concat | cat_6
𝐹𝑑𝑖𝑠𝑝, convdec_5 | concat | cat_7
cat_6 | conv1D(1 × 1, 1), sigmoid() | seg_att
cat_7 | conv1D(1 × 1, 1), sigmoid() | disp_att
seg_att | seg_att ∗ convdec_5, resize to L_conv_2 dim | seg_att
disp_att | disp_att ∗ convdec_5, resize to L_conv_2 dim | disp_att
seg_att, disp_att, L_conv_2 | concat | cat_8
cat_8 | hourglass(3 × 3, 32) | convdec_6
convdec_6 | conv2D(3 × 3, 𝑛_𝑙𝑎𝑏𝑒𝑙𝑠) | conv2d_1
conv2d_1 | resize to L_conv_2 dim | ref_seg
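As a reading aid for Tables A.12 and A.13, the following PyTorch-style sketch shows the structure of the SPP block: the input feature map is average-pooled at several window sizes (with stride equal to the window), each pooled branch is passed through a 3 × 3 convolution with 32 filters, and the results are concatenated with the unpooled input. Upsampling the pooled branches back to the input resolution before concatenation, as well as the ReLU activations, are assumptions of this sketch rather than details stated in the tables.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPP(nn.Module):
    """Spatial Pyramid Pooling block in the spirit of Tables A.12 and A.13 (illustrative)."""
    def __init__(self, in_channels, pool_sizes=(32, 16, 8), branch_channels=32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AvgPool2d(kernel_size=p, stride=p),   # stride equals the pooling window
                nn.Conv2d(in_channels, branch_channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )
            for p in pool_sizes
        ])

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [F.interpolate(branch(x), size=(h, w), mode="bilinear", align_corners=False)
                for branch in self.branches]
        return torch.cat(outs + [x], dim=1)              # concatenate with the unpooled input
```

For the refined segmentation branch, pool_sizes=(64, 32, 16, 8) would match the bracket shown in Table A.13.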
References

[1] E.K. Stathopoulou, F. Remondino, A survey on conventional and learning-based methods for multi-view stereo, Photogramm. Rec. 38 (183) (2023) 374–407, http://dx.doi.org/10.1111/phor.12456.
[2] S. Hao, Y. Zhou, Y. Guo, A brief survey on semantic segmentation with deep learning, Neurocomputing 406 (2020) 302–321, http://dx.doi.org/10.1016/j.neucom.2019.11.118.
[3] J. Jiang, J. Liu, J. Fu, W. Wang, H. Lu, Super-resolution semantic segmentation with relation calibrating network, Pattern Recognit. 124 (2022) 108501, http://dx.doi.org/10.1016/j.patcog.2021.108501.
[4] C. Wang, X. Wang, J. Zhang, L. Zhang, X. Bai, X. Ning, J. Zhou, E. Hancock, Uncertainty estimation for stereo matching based on evidential deep learning, Pattern Recognit. 124 (2022) 108498, http://dx.doi.org/10.1016/j.patcog.2021.108498.
[5] N. Strisciuglio, R. Tylecek, M. Blaich, N. Petkov, P. Biber, J. Hemming, E. van Henten, T. Sattler, M. Pollefeys, T. Gevers, et al., Trimbot2020: an outdoor robot for automatic gardening, in: ISR 2018; 50th International Symposium on Robotics, VDE, 2018, pp. 1–6, http://dx.doi.org/10.3030/688007.
[6] Y. Nie, S. Guo, J. Chang, X. Han, J. Huang, S.-M. Hu, J.J. Zhang, Shallow2Deep: Indoor scene modeling by single image understanding, Pattern Recognit. 103 (2020) 107271, http://dx.doi.org/10.1016/j.patcog.2020.107271.
[7] Z. Shen, Y. Dai, Z. Rao, MSMD-Net: Deep stereo matching with multi-scale and multi-dimension cost volume, 2020, arXiv preprint arXiv:2006.12797, http://dx.doi.org/10.48550/arXiv.2006.12797.
[8] A. Tao, K. Sapra, B. Catanzaro, Hierarchical multi-scale attention for semantic segmentation, 2020, arXiv preprint arXiv:2005.10821.
[9] H. Cuevas-Velasquez, A.-J. Gallego, R.B. Fisher, Segmentation and 3D reconstruction of rose plants from stereoscopic images, Comput. Electron. Agric. 171 (2020) 105296, http://dx.doi.org/10.1016/j.compag.2020.105296.
[10] H. Cuevas-Velasquez, A.-J. Gallego, R. Tylecek, J. Hemming, B. van Tuijl, A. Mencarelli, R.B. Fisher, Real-time stereo visual servoing for rose pruning with robotic arm, in: 2020 IEEE International Conference on Robotics and Automation, ICRA, 2020, pp. 7050–7056, http://dx.doi.org/10.1109/ICRA40945.2020.9197272.
[11] Y. Zhang, Q. Yang, A survey on multi-task learning, IEEE Trans. Knowl. Data Eng. (2021) 1, http://dx.doi.org/10.1109/TKDE.2021.3070203.
[12] G. Yang, H. Zhao, J. Shi, Z. Deng, J. Jia, Segstereo: Exploiting semantic information for disparity estimation, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 636–651, http://dx.doi.org/10.1007/978-3-030-01234-2_39.
[13] A. Kendall, Y. Gal, R. Cipolla, Multi-task learning using uncertainty to weigh losses for scene geometry and semantics, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2018, pp. 7482–7491, http://dx.doi.org/10.1109/cvpr.2018.00781.
[14] Z. Zhang, Z. Cui, C. Xu, Z. Jie, X. Li, J. Yang, Joint task-recursive learning for semantic segmentation and depth estimation, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 235–251, http://dx.doi.org/10.1007/978-3-030-01249-6_15.
[15] V. Nekrasov, T. Dharmasiri, A. Spek, T. Drummond, C. Shen, I. Reid, Real-time joint semantic segmentation and depth estimation using asymmetric annotations, in: 2019 International Conference on Robotics and Automation, ICRA, IEEE, 2019, pp. 7101–7107, http://dx.doi.org/10.1109/icra.2019.8794220.
[16] I. Melekhov, J. Kannala, E. Rahtu, Siamese network features for image matching, in: 2016 23rd International Conference on Pattern Recognition, ICPR, 2016, pp. 378–383, http://dx.doi.org/10.1109/ICPR.2016.7899663.
[17] A. Newell, K. Yang, J. Deng, Stacked hourglass networks for human pose estimation, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision – ECCV 2016, Springer International Publishing, Cham, 2016, pp. 483–499.
[18] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1904–1916.
[19] H. Zheng, J. Fu, T. Mei, J. Luo, Learning multi-attention convolutional neural network for fine-grained image recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5209–5217, http://dx.doi.org/10.1109/iccv.2017.557.
[20] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, T. Brox, A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016, pp. 4040–4048, http://dx.doi.org/10.1109/cvpr.2016.438.
[21] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, A. Bry, End-to-end learning of geometry and context for deep stereo regression, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 66–75, http://dx.doi.org/10.1109/iccv.2017.17.
[22] J.-R. Chang, Y.-S. Chen, Pyramid stereo matching network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2018, pp. 5410–5418, http://dx.doi.org/10.1109/cvpr.2018.00567.
[23] K. Zhou, X. Meng, B. Cheng, Review of stereo matching algorithms based on deep learning, Comput. Intell. Neurosci. 2020 (2020) 12, http://dx.doi.org/10.1155/2020/8562323.
[24] X. Yuan, J. Shi, L. Gu, A review of deep learning methods for semantic segmentation of remote sensing imagery, Expert Syst. Appl. 169 (2021) 114417, http://dx.doi.org/10.1016/j.eswa.2020.114417.
[25] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015, pp. 3431–3440, http://dx.doi.org/10.1109/cvpr.2015.7298965.
[26] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs, IEEE Trans. Pattern Anal. Mach. Intell. 40 (4) (2017) 834–848, http://dx.doi.org/10.1109/tpami.2017.2699184.
[27] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 801–818, http://dx.doi.org/10.1007/978-3-030-01234-2_49.
[28] L.-C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic image segmentation, 2017, CoRR abs/1706.05587. URL http://arxiv.org/abs/1706.05587.
[29] L.-C. Chen, Y. Yang, J. Wang, W. Xu, A.L. Yuille, Attention to scale: Scale-aware semantic image segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016, pp. 3640–3649, http://dx.doi.org/10.1109/cvpr.2016.396.
[30] B. Cheng, M.D. Collins, Y. Zhu, T. Liu, T.S. Huang, H. Adam, L.-C. Chen, Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2020, pp. 12475–12485, http://dx.doi.org/10.1109/cvpr42600.2020.01249.
[31] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695, http://dx.doi.org/10.1109/cvpr52688.2022.01042.
[32] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, 2023, arXiv preprint arXiv:2302.13971. URL https://arxiv.org/abs/2302.13971.
[33] Y. Yuan, X. Chen, J. Wang, Object-contextual representations for semantic segmentation, in: A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm (Eds.), Computer Vision – ECCV 2020, Springer International Publishing, Cham, 2020, pp. 173–190.
[34] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016, pp. 3213–3223.
[35] M. Menze, A. Geiger, Object scene flow for autonomous vehicles, in: Conference on Computer Vision and Pattern Recognition, CVPR, 2015.
[36] R. Tylecek, T. Sattler, H.-A. Le, T. Brox, M. Pollefeys, R.B. Fisher, T. Gevers, The second workshop on 3D reconstruction meets semantics: Challenge results discussion, in: L. Leal-Taixé, S. Roth (Eds.), ECCV 2018 Workshops, Springer International Publishing, Cham, 2019, pp. 631–644.
[37] D. Xu, W. Ouyang, X. Wang, N. Sebe, PAD-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 675–684, http://dx.doi.org/10.1109/CVPR.2018.00077.
[38] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2003.
[39] J. Baxter, A model of inductive bias learning, J. Artificial Intelligence Res. 12 (1) (2000) 149–198.
[40] G. Lin, A. Milan, C. Shen, I. Reid, Refinenet: Multi-path refinement networks for high-resolution semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 1925–1934.
[41] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 4700–4708.
[42] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, J. Mach. Learn. Res. (JMLR) W&CP 15 (2011) 315–323.
[43] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, JMLR W&CP 37 (2015).
[44] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 2117–2125.
[45] M. Tan, Q. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, in: Proceedings of Machine Learning Research, vol. 97, PMLR, 2019, pp. 6105–6114.
[46] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q.V. Le, H. Adam, Searching for MobileNetV3, 2019, arXiv:1905.02244.
[47] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[48] G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural networks for one-shot image recognition, in: 32nd International Conference on Machine Learning, ICML, Vol. 37, Lille, France, 2015.
[49] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 2881–2890.
[50] R. Atienza, Fast disparity estimation using dense networks, in: 2018 IEEE International Conference on Robotics and Automation, ICRA, IEEE, 2018, pp. 3207–3212.
[51] M. Berman, A. Rannen Triki, M.B. Blaschko, The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2018, pp. 4413–4421.
[52] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT'2010, Springer, 2010, pp. 177–186.
[53] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, Vol. 25, Curran Associates, Inc., 2012.
[54] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks? in: Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, NIPS, 2014, pp. 3320–3328.
[55] D. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations, ICLR, 2015.
[56] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets, in: British Machine Vision Conference, 2014, pp. 1–11, http://dx.doi.org/10.5244/C.28.6.
[57] S. Choi, J.T. Kim, J. Choo, Cars can't fly up in the sky: Improving urban-scene segmentation via height-driven attention networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, 2020, pp. 9373–9383.
[58] I. Kreso, S. Segvic, J. Krapac, Ladder-style densenets for semantic segmentation of large natural images, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 238–245.
[59] M. Carvalho, M. Ferrera, A. Boulch, J. Moras, B.L. Saux, P. Trouvé-Peloux, Technical Report: Co-learning of geometry and semantics for online 3D mapping, 2019, arXiv preprint arXiv:1911.01082.
[60] G. Ilha, T. Waszak, F. Pereira, A. Susin, Lapsi-360, in: 3DRMS Workshop Challenge, ECCV, 2018.
[61] C. Hazirbas, L. Ma, C. Domokos, D. Cremers, Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture, in: Asian Conference on Computer Vision, Springer, 2016, pp. 213–228.
[62] A. Geiger, M. Roser, R. Urtasun, Efficient large-scale stereo matching, in: Asian Conference on Computer Vision, Springer, 2010, pp. 25–38.
[63] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention, MICCAI, 2015.
[64] V. Badrinarayanan, A. Kendall, R. Cipolla, SegNet: A deep convolutional encoder-decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 39 (12) (2017) 2481–2495, http://dx.doi.org/10.1109/TPAMI.2016.2644615.
[65] K. Konolige, Small vision systems: Hardware and implementation, in: Y. Shirai, S. Hirose (Eds.), Robotics Research, Springer London, London, 1998, pp. 203–212.
[66] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, J. Tang, Self-supervised learning: Generative or contrastive, IEEE Trans. Knowl. Data Eng. 35 (1) (2023) 857–876, http://dx.doi.org/10.1109/TKDE.2021.3090866.
[67] A.-J. Gallego, J. Calvo-Zaragoza, R.B. Fisher, Incremental unsupervised domain-adversarial training of neural networks, IEEE Trans. Neural Netw. Learn. Syst. 32 (11) (2021) 4864–4878, http://dx.doi.org/10.1109/TNNLS.2020.3025954.

Hanz Cuevas-Velasquez earned his Master's degree in Artificial Intelligence from the University of Edinburgh in 2017, followed by a Ph.D. in Computer Vision from the same institution in 2022. During his time at the University of Edinburgh, he was supervised by Professor Bob Fisher. Subsequently, Hanz worked as a Researcher Intern at the Microsoft Mixed Reality Group in 2022, under the guidance of Tadas Baltrusaitis. Currently, he serves as a Research Engineer at the Max Planck Institute for Intelligent Systems, where he is focused on digital human research, under the supervision of Michael Black.

Alejandro Galán-Cuenca holds a B.Sc. in Computer Science and an M.Sc. in Data Science, both from the University of Alicante, Spain. He is currently a first-year Ph.D. student in Computer Science at the same university. His research interests include unsupervised learning, few-shot learning, computer vision, and data science.

Prof. Robert B. Fisher FIAPR, FBMVA received a BS (Mathematics, California Institute of Technology, 1974), MS (Computer Science, Stanford, 1978) and a Ph.D. (Edinburgh, 1987). Since then, Bob has been an academic at Edinburgh University, including being College Dean of Research. He has been the Education Committee chair and is currently the Industrial Liaison Committee chair for the Int. Association for Pattern Recognition. His research covers topics mainly in high level computer vision and 3D video analysis, focusing on reconstructing geometric models from existing examples, which contributed to a spin-off company, Dimensional Imaging. The research has led to 5 authored books and 300 peer-reviewed scientific articles or book chapters (Google H-index: 59). He has developed several online computer vision resources, with over 1 million hits. Most recently, he has been the coordinator of EC projects (1) acquiring and analyzing video data of 1.4 billion fish from about 20 camera-years of undersea video of tropical coral reefs and (2) developing a gardening robot (hedge-trimming and rose pruning). He is a Fellow of the Int. Association for Pattern Recognition (2008) and the British Machine Vision Association (2010).

Antonio Javier Gallego is an associate professor in the Department of Software and Computing Systems at the University of Alicante, Spain. He received B.Sc. & M.Sc. degrees in Computer Science from the University of Alicante in 2004, and a Ph.D. in Computer Science and Artificial Intelligence from the same university in 2012. He has been principal investigator and collaborator in 18 research projects financed by the Spanish Government, the Generalitat Valenciana, and private companies. He has authored more than 80 works published in international journals, conferences and books. His research interests include Deep Learning, Pattern Recognition, Computer Vision, and Remote Sensing.
