2022 - A Novel Method For Keyframe Selection
XXIV ISPRS Congress (2022 edition), 6–11 June 2022, Nice, France
1 Department of Photogrammetry and Remote Sensing, Faculty of Geodesy and Geomatics Engineering, K. N. Toosi University of
Technology, Tehran, Iran - [email protected], [email protected]
2 3D Optical Metrology (3DOM) unit, Bruno Kessler Foundation (FBK), Trento, Italy - [email protected]
KEY WORDS: Visual Odometry, Visual SLAM, Visual-Inertial Systems, IMU, Geometric Key-Frame Selection
ABSTRACT
Given the importance of key-frame selection in determining the positioning accuracy of Simultaneous Localization And Mapping
(SLAM) and Odometry algorithms, and the urgent need in this field for a flexible key-frame selection algorithm, this paper proposes a
novel geometric method for key-frame selection built on top of ORB-SLAM3. It selects key-frames in a robust and
flexible way regardless of the environment, data and scene conditions, according to the physics and geometry of the environment.
In the proposed method, the camera sensor and IMU take key-frames simultaneously and in parallel. While selecting a key-frame, an
adaptive threshold first decides whether the geometric condition of the frame is appropriate based on the degree of change in the
orientation of the point visibility vector from the last key-frame to the current frame. Then the quality of the frame is evaluated by
examining the distribution of points inside the frame by a balance criterion. A new key-frame will be created if both conditions provide
a positive answer. In addition, if the IMU sensor detects large changes in acceleration, a key-frame is chosen independently. The proposed
method is evaluated qualitatively and quantitatively on the EuRoC dataset by comparing the algorithm trajectory to a reference
trajectory and using the Absolute Trajectory Error (ATE) and the processing time as metrics. The evaluation results indicate a 26%
improvement in the positioning of the algorithm although it has a 9% increase in the processing time due to its geometric key-frame
selection process.
* Corresponding author
key-frames based on the notion of greatest frame separation, after
categorizing frames using a graph. Wolf (1996) proposed a key-frame
selection method based on motion analysis for identifying
key-frames in shots from video programs. They use optical flow
computations to identify local minima of motion in a shot. This
technique allows identifying both gestures which are emphasized
by momentary pauses and camera motion which links together
several distinct images in a single shot. Their results show that
this method can extract key-frames well from a complex shot.
Photogrammetric and 3D reconstruction applications are one of
the most significant study areas in the field of key-frame
selection. Hosseininaveh and Remondino (2021) and Ahmadabadian et al.
(2013) proposed the image network designer (IND) approach for
extracting ideal subsets of images from a sequence of images
acquired from an object. In this method, the angle between the
normal to the surface at each point and the viewing vector of each
point, in each image, is classified into four different areas. The
camera that covers the most areas of all the points is then selected
as the best camera. The findings demonstrate that this technique
produces a full and accurate point cloud, as well as a final
reconstructed model, with excellent outcomes. The only problem
with this method is its inability to run in real-time applications
(Hosseininaveh et al. 2014; Hosseininaveh and Remondino
2021). Dong et al. (2014) developed an offline key-frame
selection technique that consisted of two parts: an offline module
for selecting features from a set of reference pictures and an
online module for matching them to the input live video for
estimating the camera pose rapidly (Dong et al., 2014).
Computer vision applications and VO/VSLAM algorithms are one
of the most common uses of key-frame selection methods. The
biggest difference between these methods and the methods of the
previous categories is the ability to run in real-time. These
approaches involve visual information such as scene optical flow,
pixel gray levels, and so on, as well as positional information such
as the distance between frames and the positioning of map points in
the key-frame selection process. Engel et al. (2017) first select
many key-frames and quickly sparsify them by marginalizing
redundant key-frames. To select a key-frame, they introduce a
combination of three criteria, focusing on drastic changes in light
and scene brightness and the gray-scale value of the pixels.
Experiments on numerous datasets have shown that this approach
of picking key-frames provides improved outcomes in poor
illumination circumstances. Position-based methods in computer
vision applications and VO/VSLAM algorithms can be divided
into sub-categories based on 1) specified time or
place intervals, 2) image overlap, 3) parallax, and 4) others (Lin
et al. 2019). Key-frame selection in parallel tracking and mapping
(PTAM) (Klein and Murray, 2007), Semi-direct monocular visual
odometry (SVO) (Forster et al. 2014), and Large-scale direct
monocular SLAM (LSD-SLAM) (Engel et al. 2014) is based on
the first category: key-frames are selected without considering any
specific criteria, only with the passage of a particular time or
distance interval. OKVIS (Leutenegger et al. 2015) and
SLAM in dynamic environments (RD-SLAM) (Tan et al. 2013)
exploit the image overlap methods, the second category, and have
more flexibility and more power than the methods using the
previous category criterion. VINS-mono (Qin et al. 2018), as an
instance of the third category, has two criteria for selecting
key-frames: average parallax and tracking quality. An
example of the last category is Kerl et al. (2013), who presented
a key-frame selection method based on the differential entropy of
a multivariate normal distribution that had excellent results in
texture-less environments but is computationally complex.
Lin et al. (2019) select key-frames based on the
relative variations of the roll, pitch and yaw angles. If the camera
attitude changes sharply, the key-frame selection rate increases,
and if the camera attitude shifts slightly, key-frames are taken at a
lower rate. The results show that this method increases the
algorithm's speed by removing 40%-60% of the redundant frames
and, at the same time, does not reduce the positioning accuracy.
Finally, the ORB-SLAM algorithm (Mur-Artal et al., 2015), by
adopting the best key-frame selection strategy, first takes a lot of
key-frames and then marginalizes their redundancies to maintain
the algorithm's performance. ORB-SLAM3 selects a key-frame if
four requirements are met: 1) It must have been more than 20
frames since the last global re-localization. 2) Local mapping is
inactive, or there have been more than 20 frames since the previous
key-frame insertion. 3) At least 50 points are tracked in the
current frame. 4) The current frame hasn't tracked more than 90%
of the points from the previous key-frame. Experiments have
shown that the ORB-SLAM3 method is strong and reliable, and
that it has delivered good results in difficult circumstances.
3. THE PROPOSED METHOD
3.1 Method overview
The proposed method is based on ORB-SLAM3 (Campos et al.
2020) and aims to improve the accuracy and robustness of the
visual SLAM algorithm. The suggested method uses a key-frame
selection methodology based on geometric and photogrammetric
concepts, as well as adaptive (rather than static) thresholds to the
greatest extent possible. Furthermore, this technique selects
key-frames using both the synchronized IMU sensor and the camera.
The camera sensor uses geometric and
photogrammetric principles inspired by the IND approach to
determine whether or not to pick a key-frame (Ahmadabadian et
al. 2013; Hosseininaveh et al. 2012). Simultaneously, the IMU
sensor picks the current frame as the key-frame if it detects
considerable acceleration changes. Figure 1 shows the method's
schematic diagram.
3.2 Camera key-frame selection
Two geometric criteria participate in the selection of the
key-frame by the camera: 1) the adaptive threshold and 2) the
balance criterion. The adaptive threshold is explained first. For
each new input frame, there are a number of points that are also
present in the last key-frame. The angle between the visibility
vector of each of these points and the normal vector on the surface
is categorized into four 10-degree zones, and the points whose
visibility vector zone has changed since the last key-frame are
counted.
Figure 1. The flowchart of the proposed key-frame selection method.
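The zone-change count described in Section 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the inputs (3D points, surface normals and camera centers in a common frame), and the choice to clamp angles above 40 degrees into the last zone are all hypothetical.

```python
import numpy as np

ZONE_WIDTH_DEG = 10.0  # zone width stated in the text
NUM_ZONES = 4          # number of zones stated in the text


def visibility_zones(points, normals, camera_center):
    """Return the zone index (0..NUM_ZONES-1) of each point's visibility vector."""
    points = np.asarray(points, dtype=float)
    normals = np.asarray(normals, dtype=float)
    # Visibility vector: unit vector from each point towards the camera center.
    view = np.asarray(camera_center, dtype=float) - points
    view /= np.linalg.norm(view, axis=1, keepdims=True)
    unit_n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    # Angle between the visibility vector and the surface normal, in degrees.
    cos_a = np.clip(np.sum(view * unit_n, axis=1), -1.0, 1.0)
    angles = np.degrees(np.arccos(cos_a))
    # Assumption: angles beyond the last zone are clamped into it.
    return np.minimum((angles // ZONE_WIDTH_DEG).astype(int), NUM_ZONES - 1)


def count_zone_changes(points, normals, keyframe_center, current_center):
    """Count shared points whose zone differs between key-frame and current frame."""
    zones_kf = visibility_zones(points, normals, keyframe_center)
    zones_cur = visibility_zones(points, normals, current_center)
    return int(np.count_nonzero(zones_kf != zones_cur))
```

The returned count is the quantity that the adaptive threshold is then compared against; how the threshold itself is computed is not reproduced here.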
The role of the adaptive threshold is to control the number of these
changes: if these changes exceed the adaptive threshold, a
key-frame is allowed. The 10-degree angle of the conical zones
was selected following the
investigations of Hosseininaveh et al. (2012).
To define this adaptive threshold, we need the frame most similar
to the last key-frame, called the reference frame. The frame that
enters immediately after the key-frame is considered the
reference frame.
First, an initial threshold is estimated to calculate the adaptive
threshold. This initial value, assuming that the current frame and
the reference frame are similar, is selected in such a way that the
ratio of the points whose zone has changed to the total
corresponding points is equal in the reference frame and the
current frame.
This initial threshold is simplistic and has to be modified because
the current frame is not identical to the reference frame and has
been moved and changed. A coefficient based on the ratio by which
the number of matched points decreases from the reference frame to
the current frame is applied to make the initial threshold stricter.
With only this coefficient added, the adaptive threshold is strict
and does not allow key-frames, because it attributes the change in
the zone of the points' visibility vectors only to the decrease in the
number of corresponding points, which results from the
displacement of the frame, whereas these changes may also be due to
poor lighting conditions and so on. Therefore, a second coefficient
is introduced, which relaxes the threshold by decreasing
the ratio of points whose zone has changed to all corresponding
matched points.
After applying these coefficients, the initial threshold is
fully adapted and can adjust to any situation. However, this
threshold does not take into account whether the quality of the
current frame is good enough for it to become a key-frame.
As a result, to check the quality of the frame, the balance
criterion of the points inside the frame is activated and the
distribution of points inside the frame is examined.
To calculate the balance criterion, a 3-by-3 grid is first created
inside each frame, and inside each cell of this grid, the number of
points whose zone has changed is counted. This process creates
a 3-by-3 matrix for each image. By calculating the center of
gravity of this matrix (Johnson, 2013) for all frames that satisfy
the condition of the adaptive threshold, the frame whose center of
gravity is closer to the center of the matrix is selected as the
key-frame. Satisfying these two criteria ensures
that the selected key-frame, in addition to having a good
geometric condition, is of suitable quality for matching and
pairing with the previous key-frame.
The EuRoC Micro Aerial Vehicles (MAV) dataset (Burri et al.,
2016) was used to test the performance of the key-frame selection
method proposed in this work. Stereo images, synchronized IMU
measurements, and precise motion and structural ground-truth are
all included in the datasets. The proposed method and ORB-SLAM3
were then evaluated quantitatively and qualitatively by
comparing the algorithm's trajectory to the ground truth
trajectory, as well as the Absolute Trajectory Error (ATE) (Sturm
et al., 2012) for two image sequences from the EuRoC dataset. The
processing time of the algorithms was also compared. All
experiments were performed with an Intel(R) Core i7-4510U (4
cores @ 2 GHz) and 8 GB of RAM. Each dataset was run 10 times
and the average was used to reduce unpredictability in
the findings.
4.1 Data and material
The EuRoC dataset is one of the most widely used datasets for
evaluating computer vision algorithms in automated navigation
scenarios (Burri et al., 2016). There are 11 image sequences
including simple, medium and difficult level flights in this
dataset, which includes accurate ground truth position
information measured by laser scanners and IMU information
synced with frames. ORB-SLAM3 and the algorithm
presented in this paper were evaluated in mono-inertial and
stereo-inertial modes for two image sequences from this dataset
(MH01, MH02), and their trajectories were compared with the
reference trajectories. The Absolute Trajectory Error
(ATE) and the processing time of the algorithms were also obtained
for them. The calculated trajectory and the ground truth are aligned
using a similarity transformation (Zhang and Scaramuzza, 2018)
to determine the ATE.
4.2 Comparison of the trajectories
Experiments were carried out on two sequences of the EuRoC
dataset (MH01, MH02) in stereo-inertial and mono-inertial
modes to qualitatively validate the key-frame selection method
proposed in this study. Figures 2 and 3 illustrate the trajectories
compared to the reference trajectories in stereo-inertial and
mono-inertial modes, respectively.
Figures 2 and 3 indicate that the proposed method surpassed
ORB-SLAM3 in terms of performance and trajectory deviation. The
divergence between the ORB-SLAM3 trajectory and the reference
trajectory grows as the route turns. The method proposed in this
study, on the other hand, stays close to the reference
trajectory.
4.3 Comparison of the Absolute Trajectory Error - ATE
The ATE is used to determine positioning accuracy in order to
quantitatively assess the suggested approach. This criterion was
computed ten times in two image sequences for stereo-inertial and
mono-inertial modes of both algorithms, and the average of the
results can be seen in Table 1. Figures 4 and 5 illustrate the
cumulative ATE values of the ten runs for stereo-inertial and
mono-inertial modes, respectively.
Figure 4. Cumulative ATE results in stereo-inertial mode (x-axis: run number; series: ORB-SLAM3, Proposed Method).
number of runs is evident from the image above.
Figure 6 also displays the average value of each algorithm's ATE
outputs in these two sequences.
The results of Table 1 and Figures 6 and 7 show an improvement
of 13.7% in stereo-inertial mode and 38.9% in mono-inertial
mode for the algorithm presented in this paper. Figure 7, which is
the average ATE of both sequences for each algorithm, also shows the
reduction of the ATE difference between mono-inertial and
stereo-inertial modes for the proposed algorithm. This smaller
difference between the ATE in mono-inertial and stereo-inertial
modes indicates that the proposed algorithm is more
stable and less affected by the type of
system used.
Figure 7. Total average of ATE values in stereo-inertial and mono-inertial modes (series: ORB-SLAM3, Proposed Method).
4.4 Comparison of processing time
The processing time of the method proposed in this paper is
expected to rise compared to ORB-SLAM3 due to its geometric
key-frame selection process. As a result, the processing time of
each method is measured in this section. To determine the
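The ATE evaluation described in Sections 4.1 and 4.3 can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: it assumes the estimated and ground-truth positions are already associated by timestamp as two N-by-3 arrays, and it uses the closed-form Umeyama least-squares solution as one common way to obtain a similarity alignment. The function names are hypothetical.

```python
import numpy as np


def align_similarity(est, gt):
    """Least-squares similarity transform (s, R, t) such that gt ~ s * R @ est + t."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e_c, g_c = est - mu_e, gt - mu_g
    # Cross-covariance between the centered ground truth and estimate.
    cov = g_c.T @ e_c / est.shape[0]
    U, D, Vt = np.linalg.svd(cov)
    # Reflection guard so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_e = (e_c ** 2).sum() / est.shape[0]
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * R @ mu_e
    return s, R, t


def absolute_trajectory_error(est, gt):
    """RMSE of position differences after similarity alignment of est onto gt."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    s, R, t = align_similarity(est, gt)
    aligned = s * (R @ est.T).T + t
    err = np.linalg.norm(aligned - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```

Averaging the value returned by `absolute_trajectory_error` over repeated runs of a sequence would correspond to the per-sequence averages discussed above.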
5. DISCUSSIONS
As the selection of key-frames is the foundation of positioning in
SLAM and Odometry algorithms, the precision and manner of
choosing these frames will have a substantial influence on the
algorithm's accuracy and on subsequent 3D reconstruction tasks.
Due to the use of predefined thresholds specifically fine-tuned for
standard data, existing key-frame selection algorithms do not
work effectively in diverse settings and on non-standard data,
despite their high accuracy on standard data (such as the EuRoC
dataset). As a result, an attempt has been made in this article to
introduce a flexible geometric method for picking key-frames that
is efficient in all scenarios. There are a few key points to
consider regarding this method. As mentioned in Section 3.2, the
angle of the cone zones is set to 10 degrees (Hosseininaveh et al.
2012; Ahmadabadian et al., 2013). When this angle is reduced, small
camera motions cause point zones to change and key-frames to be
picked faster. This increases
computing time while also weakening the intersecting triangle's
geometry. Increasing this angle, on the other hand, decreases the
key-frame selection rate and hence the key-frame network's
stability. Optimal selection of this parameter will help to improve
the positioning accuracy of the algorithm. Another issue worth
mentioning is the acceleration change threshold used by the IMU
to choose a key-frame. This threshold is set based on the camera's
mounting platform. It will be larger on faster-moving flying
platforms and smaller on slower-moving ground platforms. In this
study, the value of this threshold is set to 1 m/s^2 by
experimentation.
6. CONCLUSIONS
This paper proposed a novel geometric key-frame selection
method for visual-inertial SLAM and Odometry systems built on
the ORB-SLAM3 framework. Extensive tests with two sequences
from the EuRoC dataset in mono-inertial and stereo-inertial
modes were conducted to assess the proposed method. The results
demonstrated that we were able to create a completely geometric
key-frame selection procedure that worked reliably and
consistently in a variety of settings without the need for heuristic
thresholds. By comparing the algorithm trajectory to the reference
trajectory and computing the ATE, our approach was assessed
quantitatively and qualitatively. The proposed algorithm shows a 25-30%
improvement in accuracy, although the processing time is slightly
longer, requiring some further optimizations for real-time
processing operations.
In future research, the proposed algorithm might be modified to
choose key-frames in such a manner that a dense and coherent
point cloud is produced, in addition to further enhancing positioning
accuracy. Our method may be used as a basic algorithm in the
generation of training data for deep learning networks, and its
speed can be enhanced with the help of deep learning networks.
REFERENCES
Besiris, D., Laskaris, N., Fotopoulou, F., Economou, G., 2007.
Key frame extraction in video sequences: a vantage points
approach. Proc. IEEE 9th Workshop on Multimedia Signal
Processing.
Burri, M., Nikolic, J., Gohl, P., Schneider, T., Rehder, J., Omari,
S., Achtelik, M. W., Siegwart, R., 2016. The EuRoC micro aerial
vehicle datasets. International Journal of Robotics Research,
35(10): 1157-1163.
Campos, C., Elvira, R., Rodríguez, J. J. G., Montiel, J. M.,
Tardós, J. D., 2020. ORB-SLAM3: An accurate open-source
library for visual, visual-inertial and multi-map SLAM. arXiv
preprint arXiv:2007.11898.
Dong, Z., Zhang, G., Jia, J., Bao, H., 2014. Efficient
keyframe-based real-time camera tracking. Computer Vision and Image
Understanding, 118: 97-110.
Engel, J., Koltun, V., Cremers, D., 2017. Direct sparse odometry.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
40(3): 611-625.
Engel, J., Schöps, T., Cremers, D., 2014. LSD-SLAM: Large-scale
direct monocular SLAM. Proc. ECCV.
Forster, C., Pizzoli, M., Scaramuzza, D., 2014. SVO: Fast
semi-direct monocular visual odometry. Proc. IEEE ICRA.
Fuentes-Pacheco, J., Ruiz-Ascencio, J., Rendón-Mancha, J.M.,
2015. Visual simultaneous localization and mapping: a survey.
Artificial Intelligence Review, 43(1): 55-81.
Hosseininaveh, A. and Remondino, F., 2021. An Imaging
Network design for UGV-based 3D reconstruction of buildings.
Remote Sensing, 13(10): 1923.
Hosseininaveh, A., Sargeant, B., Erfani, T., Robson, S., Shortis,
M., Hess, M., Boehm, J., 2014. Towards fully automatic reliable
3D acquisition: from designing imaging network to a complete
and accurate point cloud. Robotics and Autonomous Systems,
62(8): 1197-1207.
Hosseininaveh, A., Serpico, M., Robson, S., Hess, M., Boehm, J.,
Pridden, I., Amati, G., 2012. Automatic image selection in
photogrammetric multi-view stereo methods. Proc. 13th VAST
Symposium, Eurographics Association.
Johnson, R. A., 2013. Advanced euclidean geometry. Courier
Corporation.
Kerl, C., Sturm, J., Cremers, D., 2013. Dense visual SLAM for
RGB-D cameras. Proc. IEEE/RSJ International Conference on
Intelligent Robots and Systems.
Klein, G. and Murray, D., 2007. Parallel tracking and mapping
for small AR workspaces. Proc. 6th IEEE ISMAR.
Lin, X., Wang, F., Guo, L., Zhang, W., 2019. An automatic key-
frame selection method for monocular visual odometry of ground
vehicle. IEEE Access, 7: 70742-70754.
Nistér, D., Naroditsky, O., Bergen, J., 2006. Visual odometry for
ground vehicle applications. Journal of Field Robotics, 23(1): 3-20.
Qin, T., Li, P., Shen, S., 2018. VINS-Mono: A robust and versatile
monocular visual-inertial state estimator. IEEE Transactions on
Robotics, 34(4): 1004-1020.
Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.,
2012. A benchmark for the evaluation of RGB-D SLAM systems.
IEEE/RSJ International Conference on Intelligent Robots and
Systems.
Tan, W., Liu, H., Dong, Z., Zhang, G., Bao, H., 2013. Robust
monocular SLAM in dynamic environments. Proc. IEEE ISMAR.
Yan, X., Gong, H., Jiang, Y., Xia, S.-T., Zheng, F., You, X., Shao,
L., 2020. Video scene parsing: an overview of deep learning
methods and datasets. Computer Vision and Image
Understanding, 201: 103077.
Zhuang, Y., Rui, Y., Huang, T. S., Mehrotra, S., 1998. Adaptive
key frame extraction using unsupervised clustering. Proc. ICIP.