
Evolving Systems (2024) 15:1991–2003

https://doi.org/10.1007/s12530-024-09604-6

ORIGINAL PAPER

An evaluation of CNN models and data augmentation techniques in hierarchical localization of mobile robots

Juan José Cabrera1 · Orlando José Céspedes1 · Sergio Cebollada1 · Oscar Reinoso1,2 · Luis Payá1

1 Institute for Engineering Research (I3E), Miguel Hernandez University, Elche, Spain
2 Valencian Graduate School and Research Network for Artificial Intelligence (valgrAI), Valencia, Spain

Received: 31 January 2024 / Accepted: 25 June 2024 / Published online: 8 July 2024
© The Author(s) 2024

Abstract
This work presents an evaluation of CNN models and data augmentation to carry out the hierarchical localization of a mobile robot by using omnidirectional images. In this sense, an ablation study of different state-of-the-art CNN models used as backbone is presented and a variety of data augmentation visual effects are proposed for addressing the visual localization of the robot. The proposed method is based on the adaptation and re-training of a CNN with a dual purpose: (1) to perform a rough localization step, in which the model is used to predict the room from which an image was captured, and (2) to address the fine localization step, which consists in retrieving the most similar image of the visual map among those contained in the previously predicted room by means of a pairwise comparison between descriptors obtained from an intermediate layer of the CNN. In this sense, we evaluate the impact of different state-of-the-art CNN models, such as ConvNeXt, for addressing the proposed localization. Finally, a variety of data augmentation visual effects are separately employed for training the model and their impact is assessed. The performance of the resulting CNNs is evaluated under real operation conditions, including changes in the lighting conditions. Our code is publicly available on the project website https://github.com/juanjo-cabrera/IndoorLocalizationSingleCNN.git.

Keywords Mobile robotics · Omnidirectional vision · Hierarchical localization · Deep learning · Data augmentation

1 Introduction

In the ever-evolving landscape of Artificial Intelligence (AI), Convolutional Neural Networks (CNNs) have become a fundamental pillar of the technology, with disruptive problem-solving capabilities. This kind of neural network was originally conceived for image recognition tasks, but has quickly transcended its initial boundaries, establishing itself as a versatile and powerful tool for tackling a wide range of challenges in a variety of domains (LeCun and Bengio 1995).

The increasing use of CNNs can be attributed to their high ability to recognise patterns from different sources of information. This ability has made them essential in a wide variety of applications, from image recognition (Krizhevsky et al. 2012; Simonyan and Zisserman 2014) and object detection (Redmon et al. 2016; Ren et al. 2015) to semantic segmentation (Ronneberger et al. 2015) and even natural language processing (Kim 2014). The success of such architectures is based on their ability to extract features from data, which allows solving high-level problems such as visual localization.

In this sense, some researchers have addressed visual localization by means of 360° vision sensors due to their relatively low cost and the wide range of information they provide. When capturing images in real-world scenarios, especially in robotics applications, the environmental conditions can vary significantly. Consequently, addressing the visual localization could be particularly challenging due to different phenomena such as changes in illumination conditions. For this reason, understanding and addressing the effects of illumination changes are crucial for developing robust CNN models.
Related to the above information, the main objective of this work is to analyze the influence of different visual effects applied to the training data in order to carry out the mapping and localization of a mobile robot, which moves in an indoor environment under real operation conditions. For this purpose, the omnidirectional images captured by a catadioptric vision sensor are used to train a CNN. Both the raw images and some sets of images obtained after introducing visual effects to the original images in a data augmentation process are considered during the training. In this paper, we have also evaluated the performance of state-of-the-art CNN models when addressing localization through a hierarchical approach. In this sense, the CNN will be adapted and re-trained with a dual purpose: (1) to perform a rough localization step, in which the model is used to predict the room from which a test image was captured, and (2) to address the fine localization step, which consists in retrieving the most similar image of the visual map among those contained in the previously predicted room by means of a pairwise comparison between descriptors obtained from an intermediate layer of the CNN. The main contributions of this paper can be summarized as follows.

• A CNN is adapted and re-trained to predict the room from which the robot captured an omnidirectional image, which is transformed into panoramic format. This approach enhances robotic localization by initially performing room recognition.
• We use the re-trained CNN to embed panoramic images into holistic descriptors by extracting the activation of an intermediate layer. These descriptors are compared to the visual model of the retrieved room via nearest neighbour search, providing an efficient method for scene recognition and position retrieval.
• We conduct a thorough study of the individual influence of different data augmentation visual effects when training a model to perform hierarchical localization. This analysis contributes to improving the robustness of the model and its generalization ability in localization tasks.
• We evaluate the performance of different state-of-the-art CNN models that are used as the backbone for the proposed localization task. This comparative evaluation provides valuable insights for selecting the most suitable CNN architecture for real-world localization applications.

This work is an extension of the initial developments presented in Céspedes et al. (2023). In this previous work, we used a basic CNN model (Places, Zhou et al. 2014) to perform the rough localization. However, our present proposal addresses both rough and fine localization steps and studies more exhaustively different state-of-the-art models such as AlexNet (Krizhevsky et al. 2012), ResNet-152 (He et al. 2016), ResNeXt-101 64x4d (Xie et al. 2017), MobileNetV3 (Howard et al. 2019), EfficientNetV2 (Tan and Le 2021) and ConvNeXt Large (Liu et al. 2022). Also, an ablation study of a variety of data augmentation visual effects is carried out with the aim of analysing the performance of the proposed tools under real operation conditions.

The following sections are structured as follows. First, in Sect. 2 we present a review of the state of the art on visual place-recognition and localization by means of artificial intelligence techniques. Second, in Sect. 3 we describe the proposed hierarchical localization method, the different CNN architectures which are evaluated and the proposed data augmentation visual effects. After that, we present in Sect. 4 the dataset used and the experiments carried out to test and validate the proposed method. Finally, conclusions and future works are outlined in Sect. 5.

2 State of the art

Artificial intelligence (AI) techniques are commonly proposed to address computer vision and robotics problems. Recent works, such as Aguilar et al. (2017), propose a pedestrian detector for Unmanned Aerial Vehicles (UAVs) based on Haar-LBP features combined with Adaboost and cascade classifiers with Meanshift. Another example is Wang et al. (2018), which utilizes an autoencoder for the fusion and extraction of multiple visual features from different sensors with the aim of carrying out motion planning based on deep reinforcement learning.

CNNs have proven to be successful in many practical applications. Well-known architectures, such as GoogLeNet (Szegedy et al. 2015), AlexNet (Krizhevsky et al. 2012) and VGG16 (Simonyan and Zisserman 2014), have been used as starting points to address new computer vision tasks. Regarding place-recognition, CNN models were firstly proposed to address this problem in Chen et al. (2014), where a pre-trained model called Overfeat (Sermanet et al. 2013) is used to extract features from images. Sünderhauf et al. (2015) provided a thorough investigation on the performance of extracted features for place recognition. In fact, they found out that the features extracted from convolutional layers were more robust against different lighting conditions than those extracted from fully connected layers, which performed better towards viewpoint changes. Bai et al. (2018) propose the SeqCNNSLAM method, which consists in using the pre-trained AlexNet (Krizhevsky et al. 2012) to extract features and feed the SeqSLAM algorithm (Milford and Wyeth 2012). Also, Naseer et al. (2015) proposed a similar approach, but using GoogLeNet (Szegedy et al. 2015). Some of the works have not only used images as a source of information, but also point clouds (Uy and Lee 2018) and both combined (Komorowski et al. 2021).
In the context of robot localization, Kopitkov and Indelman (2018) propose using CNN holistic descriptors to estimate the robot position by learning a generative viewpoint-dependent model of CNN features with a spatially-varying Gaussian distribution. Sarlin et al. (2019) carry out a hierarchical modeling using a CNN, which extracts local features and holistic descriptors for 6-DOF localization. In that paper, a coarse localization is solved by using global descriptors, while a fine localization is solved by matching local features. Recent works (Cebollada et al. 2022) have proposed hierarchical visual models for efficient localization. This method involves arranging visual information hierarchically in different layers so that localization can be solved in two main steps. The first step involves coarse localization to roughly determine the area where the robot is located, and the second step involves fine localization within this pre-selected area.

Regarding the training of CNNs, a large and varied dataset is essential. Since a lack of a large enough dataset is quite common, Data Augmentation (DA) can be used to increase the training instances to avoid overfitting. As for the DA for a mobile robot localization task, it is essential to apply visual effects that may occur in real operation conditions to make the model robust against those effects. Considering as many effects as possible would increase the effectiveness of the CNN, but this would imply more processing power and memory. Numerous researchers have leveraged the data augmentation technique as a valuable tool to enhance the efficacy of their models. For example, Ding et al. (2016) train a CNN with three distinct types of data augmentation operations. Their investigation aims to enhance the performance of Synthetic Aperture Radar target recognition by achieving invariance against pose variations. Similarly, Salamon and Bello (2017) present a CNN designed for environmental sound classification, accompanied by an audio data augmentation strategy. This augmentation approach is useful to mitigate the scarcity of data in this domain, contributing to improved model performance. Furthermore, Perez and Wang (2017) present a study about the effectiveness of data augmentation to solve the classification task. Shorten and Khoshgoftaar (2019) present a survey about the existing methods for data augmentation and related developments. Nonetheless, the previously proposed data augmentation methods do not exactly analyze the visual phenomena that can occur when the mobile robot moves through the target environment under real-operation conditions. Therefore, the present work performs a data augmentation analysis that focuses on a wide range of those specific visual effects.

In light of the above information, the aim of this work is to analyze the influence of some visual effects to carry out data augmentation for CNN training to address a hierarchical localization (Cebollada et al. 2022). Hence, the efficiency of each visual effect will be assessed through the ability of the CNN model to robustly estimate the position where the image was captured. In addition, this work focuses on evaluating the performance of different well-known CNN models for both the coarse and fine localization steps. The first one consists in estimating the room where the image was taken by means of a final classification layer. The second one is addressed by extracting a global descriptor from an intermediate layer of the CNN, which is used to retrieve the most similar image that conforms the visual map. To address the proposed evaluation, the unique source of information is the set of images obtained by an omnidirectional vision sensor installed on the mobile robot, which moves in an indoor environment under real operation conditions.

3 Methodology

3.1 Hierarchical localization approach

This study aims to tackle visual localization through a hierarchical methodology by means of deep learning. The proposed approach (Fig. 1) consists of two main steps: an initial stage for rough localization, which consists in identifying the room from which the test image has been captured, and a subsequent phase for fine localization, where the position of the robot is obtained by a pairwise comparison between the test image and the visual model that conforms the pre-selected room.

Fig. 1 Diagram of the proposed hierarchical localization. The test image im_test is the input of the CNN, which predicts the most likely room c_i and embeds the image into a global descriptor d_test by flattening the last activation map. This descriptor is compared with the descriptors from the training dataset included in the retrieved room by means of a nearest neighbour search. Consequently, the capture point of the image that corresponds to the most similar descriptor (im_{c_i,k}) is considered an estimation of the position where im_test was captured.

The initial step of rough localization is performed using the output of a CNN. The output layer of that CNN is composed of R neurons, each one corresponding to a room (R is the number of rooms or relevant areas in the target environment). Then, a SoftMax activation function is applied and the room prediction is obtained. However, before training the CNN, a dataset of labelled images captured along the target environment is needed. In this case, each image is labelled with the corresponding room information. The CNN is then trained to address the room retrieval task. Once the CNN is appropriately trained for the room classification task, the coarse localization step is performed: a test image im_test is fed into the CNN and the output indicates the room c_i in which the image was captured.

Simultaneously, a holistic descriptor is extracted by flattening the activation map from the last convolutional layer. This descriptor d_test is compared with the descriptors D_{c_i} = {d_{c_i,1}, d_{c_i,2}, ..., d_{c_i,N_i}} from the visual map of the predicted room c_i, where N_i is the number of images in the room c_i.
Note that the visual map descriptors are also obtained by flattening the last activation map of the same CNN. Then, the distance between the test descriptor d_test and each descriptor d_{c_i,j} ∈ D_{c_i} in the room c_i is calculated (Eq. 1):

$$q_{t_j} = \mathrm{dist}(d_{test}, d_{c_i,j}), \quad j = 1, \ldots, N_i \quad (1)$$

where N_i is the number of descriptors in room c_i and dist is the Euclidean distance (Eq. 2):

$$\mathrm{dist}(d_{test}, d_{c_i,j}) = \sqrt{\sum_{i=1}^{m} (d_{test,i} - d_{c_i,j,i})^2} \quad (2)$$

where d_test = (d_{test,1}, d_{test,2}, ..., d_{test,m}) and d_{c_i,j} = (d_{c_i,j,1}, d_{c_i,j,2}, ..., d_{c_i,j,m}) are the descriptors of size m, and d_{test,i} and d_{c_i,j,i} are the i-th components of the vectors d_test and d_{c_i,j}, respectively.

After that, a set q_t = {q_{t_1}, ..., q_{t_{N_i}}} is constructed with the calculated distances. The index k which minimizes the distance in the set q_t is found with Eq. 3:

$$k = \arg\min(q_t) \quad (3)$$

Subsequently, the estimated position (x_est, y_est) corresponds to the position (x_{c_i,k}, y_{c_i,k}) from which the image im_{c_i,k} of the visual map (i.e., the image whose descriptor is the retrieved one, d_{c_i,k}) was captured (Eq. 4):

$$x_{est} = x_{c_i,k}, \quad y_{est} = y_{c_i,k} \quad (4)$$

This hierarchical approach ensures both a broad understanding of the scene and precise localization within the identified room, contributing to an effective visual localization strategy. Figure 1 outlines the whole localization process.
3.2 CNN selection and adaptation

Designing a Convolutional Neural Network to address a specific task supposes a big challenge. In the present work, the CNN must be able to predict the room in which an image was captured and embed the input image into a global descriptor to retrieve the exact position within the predicted room. Crafting a CNN from scratch demands both a profound understanding of the specificities involved and access to a sufficiently varied dataset for effective training. Furthermore, as previously demonstrated in Ballesta et al. (2021), in general terms, re-training networks that have been designed for a different objective yields more precise and reliable outcomes in the new task than training from scratch.

In light of this information, this research work incorporates several widely recognised and tested CNN models, each of which serves as the backbone for our hierarchical localization task. These models cover a diverse range, addressing different architectural complexities and capabilities. All of the architectures employed were originally designed for visual object recognition. In this work, the CNN is first used to address the room retrieval problem, which is a similar task:

• AlexNet (Krizhevsky et al. 2012): AlexNet is a pioneering CNN architecture known for its success in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012. Comprising multiple convolutional and fully connected layers, AlexNet laid the foundation for subsequent CNN designs. This network and the following ones were trained to classify 1.2 million high-resolution images into 1000 different classes. The weights and biases obtained by training with this database have been taken as the starting point for our own task.
• ResNet-152 (He et al. 2016): ResNet, or Residual Network, introduced the concept of residual learning. This approach is based on skip connections and allows the CNN to learn an identity function. ResNet-152 is a specific variant featuring 152 layers, enabling the model to effectively capture intricate hierarchical features. Although it is computationally costly due to its depth, its accuracy and robustness compensate for this cost.
• ResNeXt-101 64x4d (Xie et al. 2017): ResNeXt is an extension of the ResNet architecture, emphasizing a cardinality parameter to enhance model capacity. The cardinality is simply the number of parallel blocks, which allows the network to learn various input representations. In this sense, ResNeXt-101 64x4d has a cardinality of 64. By increasing the cardinality, the network can capture a greater diversity of features, enhancing its potential ability in image recognition.
• MobileNetV3 (Howard et al. 2019): MobileNetV3 is designed for efficient mobile and edge computing applications. It uses depth-wise separable convolutions to build lightweight deep neural networks. This fact makes them specially suitable for scenarios with resource constraints, such as performing the localization in real time on the robot's on-board computer.
• EfficientNetV2 (Tan and Le 2021): EfficientNetV2 is based on the EfficientNet architecture, and uses a technique called compound coefficient to scale up models in a simple but effective manner. It prioritizes model efficiency, achieving remarkable accuracy with fewer parameters compared to traditional CNNs. This makes EfficientNetV2 an attractive choice for applications requiring high accuracy with limited computational resources.
• ConvNeXt Large (Liu et al. 2022): ConvNeXt Large represents a recent advancement in CNN architectures. It leverages a combination of depth-wise separable convolutions, an inverted bottleneck and spatial factorization ("patchify"), contributing to improved efficiency and effectiveness in capturing features. Thus, it outperforms the previous models in terms of accuracy.

By evaluating these diverse CNN models, we aim to comprehensively understand their strengths and weaknesses in the context of the scene recognition and localization task. Regarding the room recognition, the final layer of all the architectures needs to be adapted for classifying the images into N categories corresponding to the N possible rooms in the target environment (N = 9 in the dataset used in the present work, as described in Sect. 4.1). As for the fine localization, the global descriptor has been extracted by flattening the output of the Average Pooling Layer of each CNN model. Finally, Table 1 shows a summary with the evaluated models and their corresponding number of Floating Point Operations (FLOPs) and number of parameters.

Table 1  FLOPs and parameters of the evaluated and adapted models when the size of the input image is 512 × 128 × 3 pixels

Backbone model     | FLOPs (G) | Parameters (M)
AlexNet            | 0.9       | 57.0
ResNet-152         | 15.2      | 58.2
ResNeXt-101 64x4d  | 20.4      | 81.4
MobileNetV3        | 0.3       | 4.2
EfficientNetV2     | 16.2      | 117.2
ConvNeXt Large     | 44.9      | 196.2
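As an illustration of this adaptation, the sketch below follows the torchvision implementation of ConvNeXt Large: the original 1000-class output layer is replaced by one neuron per room, and the flattened output of the average-pooling layer is captured with a forward hook to be used as holistic descriptor. It is a plausible reconstruction of the procedure described above, not the authors' actual code, and the layer indices assume torchvision's layout of the ConvNeXt classifier.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_ROOMS = 9  # R rooms in the target environment (Sect. 4.1)

# Start from ImageNet pre-trained weights and adapt the classification head.
model = models.convnext_large(weights=models.ConvNeXt_Large_Weights.IMAGENET1K_V1)
in_features = model.classifier[2].in_features            # 1536 for ConvNeXt Large
model.classifier[2] = nn.Linear(in_features, NUM_ROOMS)  # room-retrieval output layer

# Holistic descriptor: flattened activation of the average-pooling layer.
descriptor = {}
model.avgpool.register_forward_hook(
    lambda module, inputs, output: descriptor.update(feat=torch.flatten(output, 1)))

model.eval()
with torch.no_grad():
    panoramic = torch.randn(1, 3, 128, 512)        # 512 x 128 x 3 panoramic image
    room_scores = model(panoramic)                  # coarse localization logits, shape (1, 9)
    predicted_room = room_scores.softmax(dim=1).argmax(dim=1).item()
    d_test = descriptor["feat"]                     # global descriptor, shape (1, 1536)
```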
3.3 Data augmentation

Training a model involves setting up its parameters to perform a specific task. When a model has many parameters, it requires a sufficiently large number of examples for effective training. However, in practice, the training dataset is often limited. In such cases, data augmentation is a useful solution, as it is able to generate new instances by applying various visual effects. This not only helps the model avoid overfitting but also makes it more robust against challenging real-operation dynamic conditions.

In previous studies focused on training models for visual localization, various effects like changes in orientation, reflections, alterations in illumination, noise, and occlusions were applied (Cabrera et al. 2022). The use of data augmentation has been shown to improve model performance. These effects are applied individually or together to each image in the original dataset, and all the generated images are combined into a new augmented training dataset. However, the specific impact of each type of effect on the resulting CNN's performance is not well understood. This study aims to apply different data augmentation effects individually to evaluate their influence on the resulting CNN.

The focus of this work is on two categories of visual effects: changes in illumination conditions and changes in orientation. For changes in illumination conditions, the following effects are considered:

• Spotlights and shadows: Circular light sources, like bulbs, are common indoors. The proposed approach involves increasing pixel values to simulate higher light intensity (spotlights) and decreasing pixel values to simulate shadows (shadow spots). Spotlights and shadow spots are applied separately for different data augmentation options. In our experiments, these bulbs are created with diameters ranging from 15 to 40 pixels. Five kinds of intensity variations are applied: in the first type the intensity is modified by ± 160 and in the fifth by ± 100.
• General brightness and darkness: Low intensity values of the original images are increased to create brighter images, simulating higher overall illumination (e.g., a sunny day). Conversely, high intensity values are decreased to create darker images, simulating lower light supply (e.g., capturing images at night). Brightness and darkness are applied separately but used for the same data augmentation.
• Contrast: Image contrast plays a vital role in distinguishing objects in a scene. Images with low contrast tend to have a smoother appearance with fewer shadows and reflections. The contrast is modified following Eq. 5:

$$I_s = 64 + c \cdot (I - 64) \quad (5)$$

where I_s is the resulting image, I is the original image and c is the contrast factor. For c > 1 the contrast increases and for c < 1 the contrast decreases.
• Saturation: Color saturation, indicating the color intensity given by pixels, is considered. Lower saturation results in less colorful images, potentially resembling grayscale images for very low saturation. This phenomenon may occur in real environments and is incorporated into data augmentation. The color saturation can be adjusted by first converting the RGB image to HSV. Then, the saturation channel can be directly modified by multiplying it by a constant factor s. If the saturation is multiplied by s > 1, the colors become more saturated, whereas if multiplied by s < 1, the saturation decreases.

Regarding changes in orientation, these can occur during image capture when the robot captures images from the same position but with a different orientation. For this data augmentation option, new images are generated for each original image by applying rotations of n degrees, where n = i × 10°, i ∈ [1, 35]. Thus, for each original image in the training set, 35 additional images are generated.

Figure 2 shows an example of the effects applied to a sample omnidirectional image converted to panoramic format. The first image corresponds to the original one and the rest of the images include the different effects presented above (they have been separately applied). An illustrative sketch of these operations is included below.

Fig. 2 Example of data augmentation where only one effect is applied over each image: (a) original image, (b) spotlight effect, (c) shadow effect, (d) general brightness, (e) general darkness, (f) contrast, (g) saturation and (h) rotation. The images contained in this dataset can be downloaded from the web site https://www.cas.kth.se/COLD/
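The following sketch reproduces the kind of operations described above with NumPy and OpenCV. The exact parameter values, the spot-placement strategy and the assumption that a change of robot orientation corresponds to a circular column shift of the panoramic image are illustrative choices, not the authors' exact implementation.

```python
import numpy as np
import cv2

def adjust_brightness(img, delta):
    """General brightness (delta > 0) or darkness (delta < 0)."""
    return np.clip(img.astype(np.int16) + delta, 0, 255).astype(np.uint8)

def adjust_contrast(img, c):
    """Contrast change following Eq. (5): I_s = 64 + c * (I - 64)."""
    return np.clip(64 + c * (img.astype(np.float32) - 64), 0, 255).astype(np.uint8)

def adjust_saturation(img, s):
    """Scale the S channel of the HSV representation by a factor s."""
    hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV).astype(np.float32)
    hsv[..., 1] = np.clip(hsv[..., 1] * s, 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)

def add_spot(img, center, diameter, delta):
    """Circular spotlight (delta > 0) or shadow spot (delta < 0) at center = (x, y)."""
    yy, xx = np.ogrid[:img.shape[0], :img.shape[1]]
    mask = (yy - center[1]) ** 2 + (xx - center[0]) ** 2 <= (diameter / 2) ** 2
    out = img.astype(np.int16)
    out[mask] = np.clip(out[mask] + delta, 0, 255)
    return out.astype(np.uint8)

def rotate_panoramic(img, degrees):
    """A rotation of the robot is modelled as a circular column shift of the
    panoramic image (the full image width corresponds to 360 degrees)."""
    shift = int(round(img.shape[1] * degrees / 360.0))
    return np.roll(img, shift, axis=1)

# Example: 35 rotated copies (10, 20, ..., 350 degrees) of a panoramic image.
panoramic = np.zeros((128, 512, 3), dtype=np.uint8)   # placeholder image
rotated = [rotate_panoramic(panoramic, 10 * i) for i in range(1, 36)]
```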
4 Results

4.1 COLD Freiburg database

The current study utilizes images sourced from the Freiburg dataset, a subset of the COsy Localization Database (COLD) (Pronobis and Caputo 2009). This dataset contains omnidirectional images captured by a robot which follows various paths within a building at Freiburg University. The robot explores diverse spaces such as kitchens, corridors, printer areas, bathrooms, personal offices, and more. Image capture occurs under realistic operational conditions, including changes in furniture arrangement, the dynamic presence of individuals in scenes, and fluctuations in illumination conditions, including cloudy days, sunny days, and nights.

To assess the impact of these variations on the localization task, we propose incorporating images taken exclusively on cloudy days as part of the training data. Additionally, a separate dataset comprising cloudy images (distinct from the aforementioned one) is employed as a test set to evaluate localization performance without illumination changes. Furthermore, to appraise localization under varying illumination conditions, datasets captured on sunny days and at night are utilized as test sets. Beyond the images, the dataset offers ground truth data (obtained via a laser sensor), which is exclusively employed in this study to quantify localization errors. The ground truth over the path of the robot has been generated using the laser sensor in a grid-based SLAM technique, in particular, the one described in Grisetti et al. (2005, 2007). This solution, based on these two papers, can have an error up to 5 cm or 10 cm depending on the grid resolution.

Concerning the image capture process, the robot acquires images while it moves, introducing potential blur effects or dynamic alterations. Moreover, the chosen environment has the longest trajectory within the available database and is characterized by extensive windows and glass walls, making visual localization a particularly challenging problem. Consequently, this environment provides ideal conditions for evaluating the proposed localization methods under real operation conditions and real scenarios.

The selected dataset contains images from nine distinct rooms: a kitchen, a bathroom, a printer area, a stairwell, a long corridor and four offices. The cloudy dataset is downsampled to achieve an average distance of 20 cm between consecutive image capture points, resulting in the Baseline Training Dataset comprising 556 images. This dataset serves the dual purpose of training the CNNs and providing a visual map. In addition, a Validation Dataset is used during training and keeps the same proportion of images as the Baseline Training set. The Validation Dataset is also sampled at 20-cm intervals, but in this case in an interleaved manner with respect to the Baseline Training Dataset, in such a way that the images in the baseline and validation datasets are different. In this regard, the validation covers uniformly the whole environment, which is expected to be a robust approach for validation, considering that the retrained CNN must be able to solve the localization problem considering the whole environment. Furthermore, the Baseline Training Dataset undergoes a data augmentation as described in Sect. 3.3, resulting in six additional training datasets. These datasets will be individually employed to train the CNNs, allowing an exploration of the impact of each visual effect on network performance. Table 2 shows a summary with the number of images per room of each training and validation dataset.

In terms of the test data, various datasets are considered: the Cloudy Test Dataset, comprising images captured in cloudy conditions along a route distinct from the training and validation sets (2595 images); the Sunny Test Dataset, including all images captured in sunny conditions (2114 images); and the Night Test Dataset, containing all images captured at night (2707 images). Table 3 shows a summary with the number of images per room of each test set. Consequently, network training and validation, in all instances, employ images captured exclusively in cloudy conditions, while testing occurs under three distinct lighting conditions: cloudy, sunny, or night. This methodology enables the assessment of the network's robustness against variations in lighting conditions.

Table 2  Number of images in each training dataset (number of images per room)

Training dataset | 1P0-A | 2P01-A | 2P02-A | CR-A | KT-A | LO-A | PA-A | ST-A | TL-A
Baseline         | 44    | 46     | 31     | 238  | 46   | 26   | 57   | 30   | 38
Validation       | 43    | 47     | 32     | 236  | 46   | 26   | 57   | 31   | 38
Augmented 1      | 264   | 276    | 186    | 1428 | 276  | 156  | 342  | 180  | 228
Augmented 2      | 264   | 276    | 186    | 1428 | 276  | 156  | 342  | 180  | 228
Augmented 3      | 308   | 322    | 217    | 1666 | 322  | 182  | 399  | 210  | 266
Augmented 4      | 264   | 276    | 186    | 1428 | 276  | 156  | 342  | 180  | 228
Augmented 5      | 264   | 276    | 186    | 1428 | 276  | 156  | 342  | 180  | 228
Augmented 6      | 1364  | 1426   | 961    | 7378 | 1426 | 806  | 1767 | 930  | 1178

Table 3  Number of images in each test dataset (number of images per room)

Test dataset | 1P0-A | 2P01-A | 2P02-A | CR-A | KT-A | LO-A | PA-A | ST-A | TL-A
Cloudy       | 155   | 230    | 135    | 1040 | 254  | 177  | 222  | 133  | 249
Night        | 168   | 215    | 168    | 1114 | 270  | 121  | 241  | 198  | 212
Sunny        | 123   | 187    | 109    | 793  | 213  | 102  | 191  | 180  | 216
4.2 Implementation details

In this work, the CNNs are trained to address the coarse localization or room retrieval stage. As this is a classification task, these networks have been retrained employing a Cross Entropy loss function (Eq. 6):

$$L(y, \hat{y}) = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{R} y_{ij} \log(\hat{y}_{ij}) \quad (6)$$

where y is the matrix of actual labels and ŷ is the matrix of model predictions. Both matrices have size B × R, in which B is the number of samples (batch size) and R is the number of classes (rooms); y_{ij} is 1 if sample i belongs to class j and 0 otherwise, and ŷ_{ij} is the probability predicted by the model that sample i belongs to class j.

In addition, Stochastic Gradient Descent (SGD) with Momentum 0.9 and Learning Rate 0.001 has been used as the optimization algorithm. Furthermore, the training batch size (B) was 16 and the total number of epochs was 30. For every architecture, the network that presents the best validation accuracy for room retrieval during the training is preserved for testing. Table 4 summarizes all the values of the parameters that have been described above.

Table 4  Training parameters for room retrieval

Parameter            | Value
Batch size (B)       | 16
Number of epochs     | 30
Learning rate        | 1 × 10−3
Momentum             | 0.9
Number of rooms (R)  | 9

All experiments are carried out with an NVIDIA GeForce RTX 3090 GPU with 24 GB. Our code is publicly available on the project website https://github.com/juanjo-cabrera/IndoorLocalizationSingleCNN.git.
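As an illustration, the following sketch sets up the training configuration summarized in Table 4 (Cross Entropy loss, SGD with momentum 0.9, learning rate 10⁻³, batch size 16, 30 epochs, keeping the weights with the best validation accuracy). The placeholder model and datasets are assumptions used only to make the snippet runnable; in the actual pipeline they would be the adapted backbone of Sect. 3.2 and the COLD-Freiburg datasets.

```python
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

BATCH_SIZE, EPOCHS, LR, MOMENTUM, NUM_ROOMS = 16, 30, 1e-3, 0.9, 9   # Table 4

# Placeholder model and data (stand-ins for the adapted CNN and the real datasets).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 512, NUM_ROOMS))
train_set = TensorDataset(torch.randn(64, 3, 128, 512), torch.randint(0, NUM_ROOMS, (64,)))
val_set = TensorDataset(torch.randn(32, 3, 128, 512), torch.randint(0, NUM_ROOMS, (32,)))

criterion = nn.CrossEntropyLoss()                                     # Eq. (6)
optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=MOMENTUM)
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_set, batch_size=BATCH_SIZE)

best_acc, best_state = 0.0, None
for epoch in range(EPOCHS):
    model.train()
    for images, rooms in train_loader:            # rooms: class indices 0..R-1
        optimizer.zero_grad()
        loss = criterion(model(images), rooms)
        loss.backward()
        optimizer.step()

    # Room-retrieval accuracy on the validation set; keep the best weights.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, rooms in val_loader:
            correct += (model(images).argmax(dim=1) == rooms).sum().item()
            total += rooms.numel()
    if correct / total > best_acc:
        best_acc, best_state = correct / total, copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)
```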
4.3 CNN backbone ablation study

In this section, we present an experimental evaluation of the different CNN models used as backbone, presented in Sect. 3.2, for both rough and fine localization. As previously stated, the hierarchical localization proposed in this study comprises two distinct steps. The initial stage, the rough localization step, involves retraining a model to execute the room retrieval task. Subsequently, the fine localization step utilizes the previously trained CNN to generate holistic descriptors, employing a nearest neighbor search method to estimate the precise position where an image was captured.

4.3.1 Coarse localization: room retrieval

This section presents the results derived from the use of different CNNs for the execution of the coarse localization or room retrieval stage. As described in Sect. 3.2, the CNN models evaluated in this article are AlexNet (Krizhevsky et al. 2012), ResNet-152 (He et al. 2016), ResNeXt-101 64x4d (Xie et al. 2017), MobileNetV3 (Howard et al. 2019), EfficientNetV2 (Tan and Le 2021) and ConvNeXt Large (Liu et al. 2022). The reason why we have selected these models is to cover a wide range of architectures proposed for image classification in the last 10 years.

The results in Table 5 showcase the performance of six different models used as backbone in the context of room retrieval across varied environmental conditions. In fact, each model was subjected to evaluation under cloudy, night, and sunny conditions, providing a comprehensive understanding of their robustness and adaptability to changes in environment illumination.

Table 5  Room retrieval ablation study for different top-level classification architectures tested under three different illumination conditions: cloudy, night, sunny and all together

Backbone model     | Cloudy (%) | Night (%) | Sunny (%) | Global (%)
AlexNet            | 97.61      | 97.60     | 70.67     | 89.93
ResNet-152         | 96.76      | 96.64     | 64.95     | 87.63
ResNeXt-101 64x4d  | 98.11      | 95.16     | 72.47     | 89.71
MobileNetV3        | 98.50      | 96.93     | 77.29     | 91.88
EfficientNetV2     | 98.81      | 97.16     | 75.73     | 91.63
ConvNeXt Large     | 98.77      | 97.64     | 86.28     | 94.80

Bold values represent the best accuracy for every lighting condition.

AlexNet exhibits an excellent overall performance, particularly in cloudy conditions, with an accuracy of 97.61%. In contrast, ResNet demonstrates robust performance but slightly lower accuracy compared to AlexNet. Notably, its accuracy decreases in sunny conditions, which is the most demanding illumination environment. The ResNeXt model excels in cloudy conditions. However, it shows a comparatively lower accuracy in night scenarios. On the one hand, MobileNet stands out for its consistency, achieving high accuracy across all conditions. Its notable performance in sunny conditions, with an accuracy of 77.29%, highlights its generalisation capability. On the other hand, EfficientNet emerges as a top-performing model, outperforming others in terms of accuracy in cloudy and night scenarios, which are the most similar to the training conditions. Finally, the most striking result comes from ConvNeXt, which consistently achieves the highest accuracy in all scenarios, making it the top-performing model. Particularly noteworthy is its exceptional accuracy of 86.28% in sunny conditions, indicating its robustness and generalization capabilities.

4.3.2 Fine localization

Once the CNN model is trained for the room retrieval step, it can be used to embed the input image into a global descriptor. This facilitates the resolution of the fine localization step through an image retrieval process, in which the descriptor of the test image is compared with the descriptors of the visual map of the previously retrieved room. As in the previous subsection, we evaluate the performance of the different CNN backbones to address the fine localization step.

Fig. 3 shows the hierarchical localization error for different backbone models (AlexNet, ResNet-152, ResNeXt-101, MobileNetV3, EfficientNetV2 and ConvNeXt Large) under various lighting conditions (cloudy, night, sunny) and considering jointly the three conditions (global). The errors are measured in meters and are represented by box plots with whiskers, indicating the distribution of the errors. Furthermore, the Mean Absolute Error (Eq. 7) is represented by the black dot and the text displaying the error value. In addition, Table 6 shows the computation time required to execute the whole hierarchical localization process for all the evaluated models.

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| (x_i, y_i) - (\hat{x}_i, \hat{y}_i) \right| \quad (7)$$

where (x_i, y_i) is the actual position, (x̂_i, ŷ_i) is the position of the visual map retrieved after the complete localization process, and N is the number of images in the test dataset.
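A minimal sketch of this metric, assuming the ground-truth and estimated positions are expressed in meters, is the following:

```python
import numpy as np

def mean_absolute_error(true_xy, est_xy):
    """Eq. (7): mean Euclidean distance between the ground-truth capture points
    and the positions retrieved by the hierarchical localization process."""
    true_xy = np.asarray(true_xy, dtype=float)
    est_xy = np.asarray(est_xy, dtype=float)
    return float(np.mean(np.linalg.norm(true_xy - est_xy, axis=1)))

# Example with three test images (positions in meters).
ground_truth = [(1.00, 2.00), (3.50, 0.80), (5.20, 4.10)]
estimated    = [(1.10, 2.05), (3.30, 0.90), (5.00, 4.00)]
print(mean_absolute_error(ground_truth, estimated))   # ~0.19 m
```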
Fig. 3 Hierarchical localization errors in meters for different CNN architectures. The box plots represent the distribution of errors, with whiskers indicating variability. The Mean Absolute Error for each model and condition is marked by a black dot and annotated with the specific error value. Results are obtained under different lighting conditions: cloudy (red), night (orange), sunny (yellow) and considering jointly the three conditions (green).

Table 6  Computation time required to execute the whole hierarchical localization process for all the evaluated models

Backbone model     | Mean time (ms)
AlexNet            | 3.4
ResNet-152         | 6.9
ResNeXt-101 64x4d  | 9.5
MobileNetV3        | 4.6
EfficientNetV2     | 10.7
ConvNeXt Large     | 12.5

Each backbone model exhibited similar characteristics in hierarchical localization compared to room retrieval, since both tasks are correlated. As Fig. 3 shows, AlexNet demonstrated a consistent localization error and low dispersion for cloudy and night conditions. However, its performance degraded in sunny conditions. ResNet-152 displayed higher errors across all conditions compared to AlexNet, with a notable increase of both the mean absolute error and the dispersion in sunny conditions. ResNeXt-101 demonstrated a better performance than ResNet-152 for cloudy and sunny conditions, but the error slightly increases for night scenarios. MobileNet consistently maintained low errors across all conditions, signifying its adaptability to diverse lighting environments. EfficientNet showcased a worse performance than MobileNet in each scenario. Finally, ConvNeXt emerged as the top-performing model, consistently outperforming others with the lowest errors across all conditions. Its remarkable accuracy in sunny conditions implies a robust capability to handle scenarios with substantial changes of the lighting conditions. In terms of computation time, Table 6 illustrates that the hierarchical localization process with the shortest average computation time occurs when employing AlexNet, which requires only 3.4 ms. In contrast, the hierarchical localization process employing ConvNeXt Large requires the longest computation time, with a mean of 12.5 ms. However, despite the need for more time to estimate the position, this time is sufficiently short to enable real-time localization.

4.4 Data augmentation ablation study

In this comprehensive experiment, the investigation is extended to evaluate the influence of both data augmentation effects (illumination and orientation changes) on the performance of the CNN. Due to the high probability of variations in robot orientation during operation under real operation conditions with respect to the images captured in the visual map, a model should demonstrate robustness to orientation changes. To this end, a data augmentation technique is employed that consists in applying 35 different orientation changes to each training image, as described in Sect. 3.3. This augmentation is essential to improve the adaptability of the model to the various orientations encountered in practice.

Simultaneously, the illumination effects that occur under real operating conditions, a critical aspect for robust visual perception, have been explored in detail. Five specific lighting effects are considered (Sect. 3.3): spotlights, shadow spots, general brightness/darkness, contrast, and saturation. Each effect is systematically applied individually on the training dataset, leading to the creation of distinct augmented training datasets. Using the different effects separately allows a detailed understanding of their individual contributions, which sheds light on the importance of each effect in performance.

In particular, for each image, the experiment incorporates a detailed approach by applying different levels of spotlights, contrast and saturation (five levels for each), ensuring a thorough assessment of the impact of these factors on the ability of the CNN to adapt to various lighting conditions. In addition, the effect of brightness is meticulously explored, with three levels of brightness and three levels of darkness applied to each image. This dual investigation of orientation changes and illumination effects is intended to provide a comprehensive understanding of the robustness of the CNN to cope with real-world challenges, encompassing variations in both spatial orientation and illumination conditions. As a result of applying these effects, six additional training datasets have been obtained: Augmented Training Dataset 1 (spotlights), Augmented Training Dataset 2 (shadows), Augmented Training Dataset 3 (general brightness/darkness), Augmented Training Dataset 4 (contrast), Augmented Training Dataset 5 (saturation) and Augmented Training Dataset 6 (rotations). Augmented Training Datasets 1, 2, 4 and 5 consist of 3336 images each, whereas Augmented Training Datasets 3 and 6 include 3892 and 17,236 images respectively.

In conclusion, in this ablation study the model is retrained using separately each of the Augmented Training Datasets 1, 2, 3, 4, 5 and 6. As in previous experiments, the Baseline Training Dataset serves as a visual map and the Validation Dataset is employed to validate the performance of the CNN. Furthermore, for the model evaluation, three different test datasets are considered: the Cloudy Test Dataset, the Night Test Dataset and the Sunny Test Dataset.

4.4.1 Coarse localization: room retrieval

In this subsection we use the best CNN architecture obtained in Sect. 4.3.1, which is ConvNeXt Large. In a similar approach, we have departed from the pre-trained weights for the ImageNet Large Scale Visual Recognition Challenge and re-trained the model with the different datasets obtained by the proposed data augmentation.

Table 7 presents the room retrieval accuracy when the model has been trained with each of the augmented training datasets previously described. The performance of the CNN is evaluated under the three different lighting conditions: cloudy, night, sunny and all together.

Table 7  Room retrieval accuracy for the ConvNeXt Large architecture with different augmented training datasets

Training dataset                   | Cloudy (%) | Night (%) | Sunny (%) | Global (%)
Baseline                           | 98.77      | 97.64     | 86.28     | 94.80
Augmented 1 (spotlights)           | 98.84      | 97.45     | 86.14     | 94.71
Augmented 2 (shadows)              | 98.96      | 97.56     | 86.52     | 94.90
Augmented 3 (brightness/darkness)  | 98.81      | 97.41     | 91.11     | 96.10
Augmented 4 (contrast)             | 99.08      | 97.27     | 93.57     | 96.84
Augmented 5 (saturation)           | 98.88      | 97.60     | 83.07     | 93.91
Augmented 6 (rotations)            | 99.15      | 97.52     | 91.39     | 96.34

Bold values represent the best accuracy for every lighting condition.
Training with the baseline dataset shows a remarkable accuracy, especially in cloudy and night conditions. However, a significant decrease is observed in sunny conditions, which differ more from the training set. This evaluation provides a reference to analyse the impact of the different effects that have been applied to the training data.

The spotlight augmentation (Augmented 1) shows insignificant improvements or even small decreases under night and sunny conditions. In contrast, data augmentation with shadows (Augmented 2) produces slight improvements, especially in sunny conditions.

Alterations to the overall brightness and darkness of the image (Augmented 3) are effective and show substantial improvements, especially in sunny conditions. In addition, contrast-based effects (Augmented 4) are very effective, with substantial improvements in all lighting conditions and especially in sunny circumstances, thus achieving improved results in this challenging environment.

Surprisingly, augmentation with changes in saturation (Augmented 5) shows a negative impact on accuracy, especially in sunny conditions. Finally, augmenting the dataset with rotations (Augmented 6) shows substantial improvements, especially in cloudy conditions.

4.4.2 Fine localization

Once the ConvNeXt Large model is trained for the room retrieval step, it can be used to embed the input image into a global descriptor. This facilitates the resolution of the fine localization step through an image retrieval process, wherein the descriptor of the test image is compared with the descriptors of the visual map. As in the previous subsection, we evaluate the performance of the different data augmentation effects to address the fine localization step.

As shown in Fig. 4, training with every augmented dataset results in similar network performance under cloudy illumination conditions for the fine localization task, achieving a mean absolute error around 0.22 m. The same happens under the night condition, in which the mean absolute error is around 0.27 m. In this case, the minimum error is obtained by training the network without data augmentation.

In contrast, under sunny lighting conditions the mean localization error has a higher variability, similarly to the coarse localization (Table 7). This demonstrates the correlation between the two tasks. Under this condition, the best fine localization result is obtained by training the model with the contrast effect (DA 4) and the worst with saturation (DA 5).

Fig. 4 Hierarchical localization errors in meters when training the ConvNeXt Large architecture with different data augmentation effects. The box plots represent the distribution of errors, with whiskers indicating variability. The Mean Absolute Error for each model and condition is marked by a black dot and annotated with the specific error value. Results are obtained under different lighting conditions: cloudy (red), night (orange), sunny (yellow) and considering jointly the three conditions (green).

4.4.3 General comparison with other methods

Finally, the proposed method is compared with other previous global appearance techniques, including the use of single CNN structures (Cabrera et al. 2022; Rostkowska and Skrzypczynski 2023), triplet structures (Alfaro et al. 2024) and two classical analytical descriptors, HOG and gist, as described in Cebollada et al. (2022). Both HOG and gist are only taken into consideration when testing with night and sunny conditions, since the conditions of the cloudy test experiment in Cebollada et al. (2022) are different to the conditions in the present work. Table 8 compares all the methods in a global localization task, using in all cases the COLD-Freiburg dataset, which is the same dataset used in the previous subsections.

Table 8  Comparison with other methods

Global-appearance descriptor technique            | Cloudy error (m) | Night error (m) | Sunny error (m)
AlexNet + DA (Cabrera et al. 2022)                | 0.29             | 0.29            | 0.69
EfficientNet (Rostkowska and Skrzypczynski 2023)  | 0.24             | 0.33            | 0.44
Triplet VGG16 (Alfaro et al. 2024)                | 0.25             | 0.28            | 0.40
ConvNeXt Large (ours)                             | 0.22             | 0.26            | 0.83
ConvNeXt Large + DA (ours)                        | 0.22             | 0.27            | 0.57
HOG (Cebollada et al. 2022)                       | –                | 0.45            | 0.82
gist (Cebollada et al. 2022)                      | –                | 1.07            | 0.88

Bold values represent the minimum error for every lighting condition.
This table shows that ConvNeXt Large without data augmentation provides the best results in terms of localization error for cloudy and night conditions. Training with data augmentation does not improve the performance in cloudy conditions. However, it favours the results under sunny conditions. In this illumination condition, the best result is obtained with the triplet VGG16 proposed in Alfaro et al. (2024).

5 Conclusion

This study assesses the application of a deep learning technique in addressing hierarchical localization using omnidirectional imaging. The technique involves training a CNN to perform room retrieval, addressed as an image classification problem. Additionally, the CNN is employed to embed the input image into a holistic descriptor from intermediate layers, aggregating relevant information that characterizes the input image. Furthermore, we evaluate the influence of two main components on the localization performance: the CNN architecture and the effects applied in the data augmentation.

As for the CNN backbone, AlexNet shows excellent overall performance, especially when tested under the same lighting conditions as the training images. In contrast, ResNet performance decreases in sunny conditions, which are the most challenging test conditions. This fact shows its low capability of generalization. The ResNeXt model surpasses both in cloudy and sunny conditions, showcasing versatility across different lighting environments. However, EfficientNet exhibits a slight advantage over the ResNeXt model in terms of accuracy, although it requires more computational time. Furthermore, MobileNet consistently produces accurate results with a competitive computational time, demonstrating high performance across all conditions. Finally, the most striking result comes from ConvNeXt, which consistently achieves the highest accuracy in all scenarios, making it the top-performing model. Particularly noteworthy is its exceptional accuracy in sunny conditions, indicating its robustness and generalization capabilities.

Regarding the proposed data augmentation, training with the baseline dataset yields a remarkable accuracy, especially in cloudy and night conditions. However, a significant decrease is observed in sunny conditions, which diverge more from the training dataset. The spotlight effect shows marginal improvements, indicating that spotlight-based enhancement does not contribute to improving the generalization ability of the network. In contrast, data augmentation with shadows produces moderate improvements, especially in sunny conditions. Changing the overall brightness and darkness of the image produces substantial improvements, especially in sunny conditions. In addition, contrast-based effects are very effective, with significant improvements in all lighting conditions and especially in sunny conditions, improving results in this tough environment. Surprisingly, augmenting the dataset with changes in saturation shows a negative impact, especially in sunny conditions. Finally, increasing the dataset with rotations results in significant improvements in cloudy conditions. As for sunny conditions, the contrast effect yields the most optimal results, thereby enhancing the model's generalization capabilities and preventing overfitting.

In future works, studying more advanced techniques for generating more realistic visual effects with Generative Adversarial Networks (GANs) is a priority. Furthermore, we will evaluate other deep learning schemas such as Siamese and Triplet Neural Networks, and Feature Pyramid Networks (FPNs). Finally, we will approach the localization problem in outdoor environments by using CNNs, considering the specificities of such scenarios.

Acknowledgements The Ministry of Science, Innovation and Universities (Spain) has supported this work through "Ayudas para la Formación de Profesorado Universitario" (FPU21/04969). This work is also part of the project TED2021-130901B-I00, funded by MCIN/AEI/10.13039/501100011033 and the European Union "NextGenerationEU"/PRTR, and of the project PROMETEO/2021/075, funded by Generalitat Valenciana.

Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

Data availability Data is available in the GitHub repository provided in Code availability.

Code availability Our code is publicly available on the project website https://github.com/juanjo-cabrera/IndoorLocalizationSingleCNN.git.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Aguilar WG, Luna MA, Moya JF, Abad V, Parra H, Ruiz H (2017) Pedestrian detection for UAVs using cascade classifiers with meanshift. In: 2017 IEEE 11th international conference on semantic computing (ICSC). IEEE, pp 509–514
Alfaro M, Cabrera JJ, Jiménez LM, Reinoso O, Payá L (2024) Hierarchical localization with panoramic views and triplet loss functions. arXiv preprint. arXiv:2404.14117
Bai D, Wang C, Zhang B, Yi X, Yang X (2018) CNN feature boosted SeqSLAM for real-time loop closure detection. Chin J Electron 27(3):488–499
Ballesta M, Payá L, Cebollada S, Reinoso O, Murcia F (2021) A CNN regression approach to mobile robot localization using omnidirectional images. Appl Sci 11(16):7521
Cabrera JJ, Cebollada S, Flores M, Reinoso Ó, Payá L (2022) Training, optimization and validation of a CNN for room retrieval and description of omnidirectional images. SN Comput Sci 3(4):1–13
Cebollada S, Payá L, Jiang X, Reinoso O (2022) Development and use of a convolutional neural network for hierarchical appearance-based localization. Artif Intell Rev 55(4):2847–2874
Céspedes OJ, Cebollada S, Cabrera JJ, Reinoso O, Payá L (2023) Analysis of data augmentation techniques for mobile robots localization by means of convolutional neural networks. In: IFIP international conference on artificial intelligence applications and innovations. Springer, pp 503–514
Chen Z, Lam O, Jacobson A, Milford M (2014) Convolutional neural network-based place recognition. arXiv preprint. arXiv:1411.1509
Ding J, Chen B, Liu H, Huang M (2016) Convolutional neural network with data augmentation for SAR target recognition. IEEE Geosci Remote Sens Lett 13(3):364–368
Grisetti G, Stachniss C, Burgard W (2005) Improving grid-based SLAM with Rao-Blackwellized particle filters by adaptive proposals and selective resampling. In: Proceedings of the 2005 IEEE international conference on robotics and automation. IEEE, pp 2432–2437
Grisetti G, Stachniss C, Burgard W (2007) Improved techniques for grid mapping with Rao-Blackwellized particle filters. IEEE Trans Robot 23(1):34–46
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Howard A, Sandler M, Chu G, Chen L-C, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V (2019) Searching for MobileNetV3. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1314–1324
Kim Y (2014) Convolutional neural networks for sentence classification. arXiv preprint. arXiv:1408.5882
Komorowski J, Wysoczańska M, Trzcinski T (2021) MinkLoc++: lidar and monocular image fusion for place recognition. In: 2021 international joint conference on neural networks (IJCNN). IEEE, pp 1–8
Kopitkov D, Indelman V (2018) Bayesian information recovery from CNN for probabilistic inference. In: 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 7795–7802. https://doi.org/10.1109/IROS.2018.8594506
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems, p 25
LeCun Y, Bengio Y (1995) Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks. MIT Press, Cambridge
Liu Z, Mao H, Wu C-Y, Feichtenhofer C, Darrell T, Xie S (2022) A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11976–11986
Milford MJ, Wyeth GF (2012) SeqSLAM: visual route-based navigation for sunny summer days and stormy winter nights. In: 2012 IEEE international conference on robotics and automation. IEEE, pp 1643–1649
Naseer T, Ruhnke M, Stachniss C, Spinello L, Burgard W (2015) Robust visual SLAM across seasons. In: 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp 2529–2535
Perez L, Wang J (2017) The effectiveness of data augmentation in image classification using deep learning. arXiv preprint. arXiv:1712.04621
Pronobis A, Caputo B (2009) COLD: COsy localization database. Int J Robot Res 28(5):588–594. https://doi.org/10.1177/0278364909103912
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, p 28
Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention—MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, pp 234–241
Rostkowska M, Skrzypczynski P (2023) Optimizing appearance-based localization with catadioptric cameras: small-footprint models for real-time inference on edge devices. Sensors 23(14):6485. https://doi.org/10.3390/s23146485
Salamon J, Bello JP (2017) Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process Lett 24(3):279–283. https://doi.org/10.1109/LSP.2017.2657381
Sarlin P, Cadena C, Siegwart R, Dymczyk M (2019) From coarse to fine: robust hierarchical localization at large scale. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 12708–12717. https://doi.org/10.1109/CVPR.2019.01300
Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y (2013) OverFeat: integrated recognition, localization and detection using convolutional networks. arXiv preprint. arXiv:1312.6229
Shorten C, Khoshgoftaar TM (2019) A survey on image data augmentation for deep learning. J Big Data 6(1):60
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint. arXiv:1409.1556
Sünderhauf N, Shirazi S, Dayoub F, Upcroft B, Milford M (2015) On the performance of ConvNet features for place recognition. In: 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp 4297–4304
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
Tan M, Le Q (2021) EfficientNetV2: smaller models and faster training. In: International conference on machine learning. PMLR, pp 10096–10106
Uy MA, Lee GH (2018) PointNetVLAD: deep point cloud based retrieval for large-scale place recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4470–4479
Wang H, Yang W, Huang W, Lin Z, Tang Y (2018) Multi-feature fusion for deep reinforcement learning: sequential control of mobile robots. In: International conference on neural information processing. Springer, pp 303–315
Xie S, Girshick R, Dollár P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1492–1500
Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A (2014) Learning deep features for scene recognition using places database. In: Advances in neural information processing systems, pp 487–495

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
