Pedestrian Tracking Algorithm For Video Surveillance Based On Lightweight Convolutional Neural Network
ABSTRACT The Efficient Convolution Operators for Tracking (ECO) algorithm has garnered considerable
attention in both academic research and practical applications due to its remarkable tracking efficacy, yielding
exceptional accuracy and success rates in various challenging contexts. However, the ECO algorithm heavily
relies on the deep learning Visual Geometry Group (VGG) network model, which entails complexity and
substantial computational resources. Moreover, its performance tends to deteriorate in scenarios involving
target occlusion, background clutter, and similar challenges. To tackle these issues, this study introduces
a novel enhancement to the pedestrian tracking algorithm. Specifically, the VGG network is substituted
with a lightweight MobileNet v2 model, thereby reducing computational demands. Additionally, a Double
Attention Networks (A2-Net) module is incorporated to augment the extraction of crucial information, while
pre-training techniques are integrated to expedite model convergence. Experimental results demonstrate that
the resulting algorithm, C-ECO, achieves accuracy and success rates comparable to the conventional ECO algorithm
while reducing the model size by 27.96% and increasing the tracking frame rate by 46.11%. Compared
with other prevalent tracking algorithms, C-ECO achieves an accuracy of 82.20% and
a success rate of 64.72%. These findings indicate that the C-ECO algorithm adapts better to
complex environments, offering a more lightweight model while maintaining strong tracking capabilities.
INDEX TERMS Machine vision, target tracking, deep learning, efficient convolution operator, pedestrian
tracking.
algorithms using correlation filtering for target tracking have appeared one after another, and tracking results have steadily improved. Moridvaisi et al. [6] overcame the limitations of KCF by building on the Tracking-Learning-Detection (TLD) framework and devised an algorithm that trains two classifiers concurrently using a semi-supervised co-training learning algorithm. They then evaluated the proposed method against its counterparts on the TB-100 dataset. Yang et al. [7] used a KCF-based single-object tracker (SOT) to learn a discriminative target appearance that relied on hand-crafted deep features, used the prediction results to refine detection errors in new ways, and eliminated tracking errors caused by uncorrelated algorithms. Sanagavarapu and Pullakandam [8] proposed a method that uses the Kernelized Correlation Filter (KCF) object tracking technique. The segmented region is encoded by the complexity-efficient Scalable HEVC (SHVC) to meet the resolution of an end-user device. The complexity of SHVC is decreased by using a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to predict the Coding Tree Unit (CTU) structure. The results show that the proposed method decreases the bitrate significantly for video sequences without degradation in Peak Signal-to-Noise Ratio (PSNR). Mbelwa et al. [9] proposed a tracking method that integrates an objectness-bounding box regression (O-BBR) model with a scheme based on the kernelized correlation filter (KCF). The KCF-based scheme is used to improve tracking performance under fast motion (FM) and motion blur (MB). To handle drift caused by occlusion (OCC) and illumination variation (IV), they use objectness proposals trained with bounding box regression as prior knowledge to provide candidates and background suppression. Finally, the KCF base tracker and O-BBR are fused to obtain the state of the target object. Khan et al. [10] proposed a new criterion based on the hybridization of multiple cues, i.e., average peak correlation energy (APCE) and confidence of squared response map (CSRM), to enhance tracking efficiency. They updated the occlusion detection module, the adaptive learning rate adjustment module, and drift handling using an adaptive learning rate model based on the hybridized criterion, and integrated all these modules into a new tracking scheme. Degli-Esposti et al. [11] proposed a new algorithm for object tracking in SWIR imaging, using a kernelized correlation filter (KCF) as the basic tracker. To overcome occlusions, they proposed the use of the Kalman filter as a predictor and a method to expand the object search area. To cope with outliers, Huber's M-robust approach is applied: the Kalman filter is robustified by introducing a nonlinear Huber influence function in the Kalman filter estimation step. To balance the desired estimator efficiency against resistance to outliers, a new adaptive M-robustified Kalman filter is proposed. This is achieved by adjusting the saturation threshold of the influence function using the detection confidence information from the basic KCF tracker. Liang et al. [12] proposed Spatio-Temporal adaptive and Channel selective Correlation Filters (STCCF) for robust tracking. By selecting a set of target-specific features from high-dimensional features, STCCF can not only alleviate the over-fitting problem and reduce the computational cost, but also enhance the discriminability and interpretability of the learned filters.
With the rapid development of deep learning in recent years, many scholars use deep learning networks to extract image features and fuse them with correlation filters to perform target tracking.
Zdarsky et al. [13] introduced a deep learning-based approach that uses the video frames of low-cost web cameras. Using DeepLabCut (DLC), an open-source toolbox for extracting points of interest from videos, they obtained facial landmarks critical to gaze location and estimated the point of gaze on a computer screen via a shallow neural network. Tested for three extreme poses, this architecture reached a median error of about one degree of visual angle. Abdelali et al. [14] introduced a fully automated methodology for Multiple Hypothesis Detection and Tracking (MHDT) in video traffic surveillance. The framework integrates the Kalman filter with data association-based tracking techniques, employing the YOLO detection method, to monitor vehicles in intricate traffic surveillance scenarios. Empirical findings show that the approach is resilient in detecting and tracing vehicle trajectories under diverse circumstances, including scale variations, stationary vehicles, rotations, fluctuating lighting conditions, and occlusion. Zhang et al. [15] introduced an approach known as Harris Hawks Optimization with deep learning-enhanced automated face detection and tracking (HHODL-AFDT). The HHODL-AFDT model incorporates a Faster Region-Based Convolutional Neural Network (RCNN) for face detection and uses Harris Hawks Optimization (HHO) for hyperparameter optimization. The optimized Faster RCNN model detects facial features and feeds this information into the face-tracking model through a regression network (REGN). Face tracking with the REGN model uses features extracted from adjacent frames to predict the location of the facial target in subsequent frames. Almuqren et al. [16] presented an effective method to track an object based on a combination of CNN feature hierarchies. They combined several feature hierarchies to compute a more discriminative map for tracking the object, and adopted a novel method of integrating feature hierarchies based on Kullback–Leibler (KL) divergence. Ahmed et al. [17] presented a multi-person tracking framework integrated with 5G infrastructure. Employing a top-view perspective, this framework yields an expansive view of the observed scene. Person tracking is performed within a deep learning-driven tracking-by-detection framework, wherein detection is executed by the YOLOv3
model, and the subsequent tracking operations are carried out by the Deep SORT algorithm. To further improve the precision of the detection model, a transfer learning approach is employed: the detection model builds on a pre-trained foundation, enriched with an additional layer fine-tuned using a top-view dataset. Zhang et al. [18] proposed a robust adaptive learning visual tracking algorithm in which HOG features, CN features, and deep convolution features are extracted from the template frame and the search region frame, respectively. They analyzed the merits of each feature and performed adaptive feature fusion to improve the validity of the feature representation.
Although target tracking algorithms have been developed over many years, the current algorithms still face challenges in accurately tracking targets that experience occlusion, background clutter, or leave the field of view. Additionally, the deep learning network models, while highly effective, are complex and have a large number of parameters. Consequently, they require substantial computational resources and place higher demands on computer hardware.
To address the above problems, this study proposes the C-ECO tracking algorithm [19], which is built on a lightweight convolutional neural network and derived from the ECO tracking algorithm, itself based on deep learning and correlation filtering. The ECO algorithm has two implementation forms: ECO, based on convolutional features, and ECO_HC, based on hand-crafted features [20]. Considering both accuracy and speed, this study chooses the ECO algorithm based on convolutional features for optimization and improvement. The main contributions of this study are as follows:
(1) In response to the complex deep learning VGG network model in the ECO algorithm, which occupies large computational resources, a lightweight MobileNet v2 is used instead to perform feature extraction, which effectively reduces resource consumption and improves tracking speed.
(2) To improve the target feature extraction ability of the convolutional network, the A2-Net module is added to MobileNet v2. With a small increase in parameters, it improves the network's extraction of important information and significantly improves training efficiency and tracking accuracy.
(3) Introducing a pre-trained model in the training stage accelerates model convergence, significantly shortens the training time, and improves the accuracy and success rate of the tracking algorithm.
The remaining chapters of this paper are organized as follows: Section I introduces the basics, Section II introduces the construction of the C-ECO algorithm, Section III subjects the model to ablation experiments and comparison tests, Section IV summarizes the whole paper, and Section V gives an outlook on future work.
II. FOUNDATIONAL THEORIES AND PROPOSED METHOD
A. THE OBJECT TRACKING ECO ALGORITHM
The ECO target tracking algorithm is improved from the continuous convolutional tracking algorithm C-COT [21]. The algorithm achieves target localization and filter update by applying the convolutional features of the target in the input video, the histogram of oriented gradients (HOG) feature, and the color names (CN) feature [22], [23].
The ECO algorithm mainly includes the processes of feature extraction, continuous convolution operation, factorized convolution operation, generation of the sample space model, and correlation filtering operation [24]. First, the interpolation operation is performed on the features x in the search region of the target to be detected, as shown in Equation (1) [25].
$$J_d\{x^d\}(t) = \sum_{n=0}^{N_d-1} x^d[n]\, b_d\left(t - \frac{T}{N_d} n\right) \qquad (1)$$
where: $x^d$ denotes the d-th feature channel of $x$; $J_d\{x^d\}(t)$ is a function on $t \in [0, T)$ that represents the result of the interpolation operation on $x^d$; $x^d \in \mathbb{R}^{N_d}$ is indexed by $n \in \{0, \dots, N_d - 1\}$; $N_d$ indicates the resolution; and $b_d\left(t - \frac{T}{N_d} n\right)$ denotes the d-channel interpolation function. The interpolation results for channels 1 to D are denoted by $J\{x\}(t) \in \mathbb{R}^D$, abbreviated as $J\{x\}$. After that, the filter is simplified using principal component analysis, and the response score $S_{Pf}\{x\}$ obtained by convolving with $J\{x\}$ is shown in Equation (2) [26].
$$S_{Pf}\{x\} = Pf * J\{x\} = f * P^{T} J\{x\} \qquad (2)$$
where: $f$ denotes the filter with C channels, $*$ denotes the convolution calculation, $P$ is the projection matrix of D rows and C columns, and $P^{T}$ denotes its transpose. The position of the score maximum, i.e., the new position of the target, is obtained by optimizing $S_{Pf}\{x\}$ using the Gauss-Newton algorithm. Finally, the data set is compressed using a Gaussian mixture model. The error between the convolutional response score $S_{Pf}\{x\}$ of the training sample under the current filter $f$ and the Gaussian label $y_0$ of the training sample is measured in the $L^2$ norm, and a penalty term is added to obtain the loss function shown in Equation (3).
$$E(f) = \sum_{m=1}^{M} \pi_m \left\| S_{Pf}\{\mu_m\} - y_0 \right\|_{L^2}^{2} + \sum_{c=1}^{C} \left\| \omega f^{c} \right\|_{L^2}^{2} \qquad (3)$$
where: $\mu_m$ and $\pi_m$ are the mean and weight of the training samples, respectively; $M$ is the total number of training samples; and $\omega$ is the penalty weight on $f$. $P$ is only calculated in the first frame and is kept constant when $f$ is updated using the conjugate gradient algorithm to re-solve (3) every 6 frames thereafter [27].
In summary, ECO improves tracking speed in three ways: reducing the filter, optimizing the training set, and reducing the filter update frequency.
B. LIGHTWEIGHT NETWORK MobileNet v2
The ECO algorithm uses convolutional networks of VGG19 and ResNet50, which have deeper networks, better feature
extraction, and higher tracking accuracy, but the overly complex networks and the huge number of parameters take up a lot of computational resources and demand more capable hardware, which leads to an increase in computational cost [28]. The target tracking task requires high speed, so it is necessary to build a lightweight convolutional network model to reduce the model size and improve the detection speed while guaranteeing accuracy. MobileNet has a simple streamlined structure with the advantages of a small number of parameters and low latency. The MobileNet network structure is shown in Table 1, where t is the expansion factor, c is the number of channels, n is the block number, and s is the stride [29], [30].
1) DEPTHWISE SEPARABLE CONVOLUTION
MobileNet v2 is mainly composed of depthwise separable convolution (DSC), in which the standard convolution operation is split into a depthwise convolution (DW) and a pointwise convolution (PW) [31]. The comparison of the convolutions is shown in Figure 1. For the feature map obtained by depthwise convolution, a 1×1 convolution kernel is used in the pointwise convolution, and the final output feature layer after pointwise convolution has the same dimension as that of the standard convolution [32].
2) CONTRAST BETWEEN DEPTHWISE SEPARABLE CONVOLUTION AND TRADITIONAL CONVOLUTIONAL NETWORK
The number of parameters and the amount of computation of depthwise separable convolution are compared with those of the standard convolution to get the ratio of the number of parameters (4) and the ratio of the amount of computation (5).
$$\frac{D_K \times D_K \times M + M \times N}{D_K \times D_K \times M \times N} = \frac{1}{N} + \frac{1}{D_K^2} \qquad (4)$$
$$\frac{D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F}{D_K \times D_K \times M \times N \times D_F \times D_F} = \frac{1}{N} + \frac{1}{D_K^2} \qquad (5)$$
Here M and N denote the numbers of input and output channels, D_K indicates the size of the convolution kernel, and D_F the spatial size of the feature map. N is generally large, so 1/N is negligible. The number of parameters and the computation of depthwise separable convolution are therefore reduced to about 1/D_K^2 of the original, and if the common 3 × 3 convolution kernel is used, they are reduced to about 1/9 of the original [33], [34]. It can be seen that depthwise separable convolution significantly reduces the number of operations and parameters, which can effectively reduce the complexity of the network and improve the speed of target tracking.
3) A2-NET ATTENTION MODULE
During network training, as the volume of information to be acquired grows, the complexity of the model also tends to rise. This heightened complexity in turn requires more computational capacity from the hardware on which the model is deployed. The attention mechanism addresses this by selecting a small fraction of significant information from a large volume of data. By concentrating on this essential information, the attention mechanism effectively disregards the majority of relatively unimportant data [35].
The fundamental concept of A2-Net is to gather the pivotal features of the entire space into a concise set and then distribute them adaptively to each location. This enables subsequent convolutional layers to sense the features of the entire space even without an extensive receptive field. The A2-Net module introduces a dual attention block specifically designed to efficiently capture and distribute long-distance features. This design shows potential for enhancing image and video recognition performance, as it effectively models quadratic feature statistics and adapts feature assignments.
The central premise of the A2-Net module involves two primary steps. Initially, it gathers crucial features from the complete space and condenses them into a concise set. Subsequently, these pivotal features are adaptively distributed
FIGURE 1. Comparison of standard convolution and Depthwise Separable Convolution. Figure 1(a) shows the standard convolution,
Figure 1(b) shows the Depthwise convolution and Figure 1(c) shows the pointwise convolution.
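The ratios in Equations (4) and (5) can be checked numerically. The following is a minimal sketch in Python, assuming the usual MobileNet notation (D_K kernel size, M input channels, N output channels, D_F feature map size); the function names and the layer sizes in the example are hypothetical and are not taken from the paper's network.

```python
def standard_conv_cost(dk, m, n, df):
    """Parameter count and multiply count of a standard dk x dk convolution."""
    params = dk * dk * m * n
    mults = params * df * df
    return params, mults


def depthwise_separable_cost(dk, m, n, df):
    """Cost of a depthwise dk x dk convolution followed by a 1x1 pointwise one."""
    params = dk * dk * m + m * n
    mults = params * df * df
    return params, mults


def reduction_ratios(dk, m, n, df):
    """Return (parameter ratio, computation ratio) as in Equations (4) and (5)."""
    std_params, std_mults = standard_conv_cost(dk, m, n, df)
    dsc_params, dsc_mults = depthwise_separable_cost(dk, m, n, df)
    return dsc_params / std_params, dsc_mults / std_mults


# Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels, 56x56 map.
param_ratio, mult_ratio = reduction_ratios(dk=3, m=32, n=64, df=56)
closed_form = 1 / 64 + 1 / 3 ** 2  # 1/N + 1/D_K^2

print(f"parameter ratio:   {param_ratio:.4f}")
print(f"computation ratio: {mult_ratio:.4f}")
print(f"1/N + 1/D_K^2:     {closed_form:.4f}")
```

For this layer all three values agree at roughly 0.127, slightly above 1/9 because the 1/N term is small but not zero.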
TABLE 4. Comparison experiment with and without pre-training.
pre-training (C-ECO without pre-training, C-ECO-N), from which it can be concluded that the model with pre-training is more effective in tracking the target, and therefore the model with pre-training is used in all subsequent sections.
FIGURE 5. Comparison of target detection effects before and after algorithm improvement. Figure 5(a1-a4) is the detection result of algorithm 2,
Figure 5(b1-b4) is the detection result of algorithm 3, and Figure 5(c1-c4) is the detection result of algorithm 4.
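The accuracy and success rate values discussed in this section are evaluated against a position error threshold and an IoU threshold. As a minimal sketch, assuming the common OTB-style definitions (accuracy as the fraction of frames whose center error is within a pixel threshold, success as the fraction of frames whose IoU exceeds a threshold), the two measures could be computed as below. The box format (x, y, w, h), the function names, and the sample boxes are illustrative assumptions; the paper's exact protocol, for example whether the success rate is an area under the curve, is not stated here.

```python
import math


def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
    return math.hypot(ax - bx, ay - by)


def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose center error is at most `threshold` pixels."""
    hits = sum(center_error(p, g) <= threshold for p, g in zip(pred, gt))
    return hits / len(gt)


def success_at(pred, gt, threshold=0.5):
    """Fraction of frames whose IoU is strictly greater than `threshold`."""
    hits = sum(iou(p, g) > threshold for p, g in zip(pred, gt))
    return hits / len(gt)


# Made-up three-frame sequence: a close prediction, a 20-pixel shift, a lost target.
gt = [(10, 10, 40, 80), (10, 10, 40, 80), (10, 10, 40, 80)]
pred = [(12, 11, 40, 80), (30, 10, 40, 80), (100, 100, 40, 80)]

precision = precision_at(pred, gt, threshold=20.0)
success = success_at(pred, gt, threshold=0.5)
print(f"precision@20px = {precision:.3f}, success@0.5 = {success:.3f}")
```

Sweeping `threshold` over a range of values and plotting the results gives curves of the kind described for Figure 6.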
Furthermore, it outperforms mainstream correlation filtering algorithms, exhibiting an average accuracy rate of 82.04% and a success rate of 64.72%. In complex scenarios such as FM, OCC, and MB, the C-ECO algorithm demonstrates a superior performance, surpassing the ECO algorithm by 0.51% in terms of accuracy and 0.23% in terms of success rate. These improvements highlight the enhanced accuracy and success rate achieved by the C-ECO algorithm over the ECO algorithm.
The accuracy and success rate curves of the C-ECO algorithm and mainstream tracking algorithms are shown in Figure 6.
From Figure 6(a), it can be seen that when the position error threshold is less than 20, the accuracy value of the C-ECO algorithm in this paper is slightly lower than the ECO algorithm, and at the same time, it has a large lead compared with other mainstream tracking algorithms, and after the position error threshold is greater than 20, C-ECO
FIGURE 6. Accuracy and success curves of 7 tracking algorithms on the OTB-100 dataset.
algorithm overtakes ECO algorithm and continues to lead. From Figure 6(b), it can be seen that when the IoU threshold is less than 0.5, the gap between the ECO and C-ECO success rates is not obvious, and when the IoU threshold is greater than 0.6, C-ECO is in the leading position and significantly better than the other tracking algorithms.
The main challenges of the video in Figure 7(a1-a4) are scale variation (SV), occlusion (OCC), and low resolution (LR). From Figure 7(a1), it can be seen that KCF, ECO, and C-ECO are able to track the target more accurately when the target is moving without disturbance, and C-ECO successfully tracked the small target. When a new target appeared, as shown in Figure 7(a2), KCF incorrectly tracked the shadow of the target, while ECO and C-ECO were able to track it correctly. When the initial transformation shown in Figure 7(a3-a4) occurred and the moving target was shaded and separated, both KCF and
FIGURE 7. Comparison of tracking effects. For clarity of display, only KCF, ECO, and the C-ECO algorithm of this paper are shown. Black boxes are KCF, yellow boxes are ECO, and blue boxes are C-ECO.
ECO lost the original target, while C-ECO was able to track it accurately.
The main challenges of the video in Figure 7(b1-b4) are IV, DEF, and MB. From Figure 7(b1-b2), it can be seen that KCF shows tracking drift when the target is occluded. As shown in Figure 7(b3-b4), when the video has motion blur, a large angle change, and a complex background, the KCF prediction box drifts over a large range and ECO is disturbed by a similarly colored background, while the C-ECO algorithm is still able to track more accurately.
The principal challenges in the video sequence depicted in Figure 7(c1)-(c4) are Scale Variation (SV), Object Occlusion (OCC), Deformation (DEF), and Object Perspective Changes (OPR). Frame c2 shows that as the occlusion scenario becomes more complex, with rapid movements of pedestrians, interplay of street lamps, and mutual obstruction among pedestrians walking normally, the tracking efficacy of the KCF and ECO algorithms is notably compromised. In contrast, the proposed C-ECO framework maintains a good level of tracking performance under these demanding conditions.
Frame c3 shows that, in the case of swiftly moving pedestrians, the KCF algorithm has lost track of the target entirely, and ECO still struggles to maintain effective tracking. C-ECO manages to uphold a higher degree of tracking accuracy even in such dynamic scenarios.
In frame c4, when confronted with the challenge of tracking small targets within complex scenes with many occlusions, both KCF and ECO have lost the ability to track the target. In contrast, C-ECO exhibits resilience and
maintains the ability to accurately track the target, thus showcasing its remarkable robustness in these intricate situations.
The primary challenges encountered in the video sequence depicted in Figure 7(d1)-(d4) encompass Illumination Variation (IV), Scale Variation (SV), Object Occlusion (OCC), and Deformation (DEF). Notably, from the scrutiny of frame d2, it becomes evident that when the tracked target shares a color proximity with the environmental background, both the KCF and ECO algorithms are susceptible to tracking drift.
Further insights from frames d3 and d4 reveal that as the tracked target progressively recedes from the camera, diminishing in size and encountering occlusions, both KCF and ECO manifest varying degrees of tracking drift. In contrast, owing to its augmented feature extraction capabilities, C-ECO demonstrates a heightened resistance to tracking failures in these challenging scenarios.
The experimental results presented above emphasize the superior performance of the C-ECO algorithm proposed in this paper compared to other classical correlation filter tracking algorithms. The C-ECO algorithm not only achieves higher accuracy and success rates in tracking, but it also exhibits a significantly smaller feature extraction model size compared to the ECO algorithm prior to optimization. Additionally, the algorithm effectively improves the frame rate for video tracking. The comparison with other classical correlation filter tracking algorithms highlights the strength and competitiveness of the C-ECO algorithm. Its enhanced accuracy and success rates solidify its position as a reliable and efficient tracking solution. Furthermore, the reduced model size contributes to its practicality and resource efficiency, while the improved frame rate enriches the user experience during video tracking tasks. These findings demonstrate the significance of the C-ECO algorithm in advancing correlation filter tracking methods and establishing it as a promising choice for various tracking applications.
IV. CONCLUSION AND EXPECTATIONS
In this study, we propose a novel C-ECO tracking algorithm that leverages a lightweight convolutional neural network. The algorithm employs MobileNet v2 for efficient feature extraction, integrates the A2-Net module to enhance feature representation, and incorporates a pre-trained model to expedite training. The primary objective is to improve tracking accuracy and success rates. Experimental results demonstrate that the C-ECO algorithm outperforms the previous ECO algorithm employing VGG Net, exhibiting a 27.96% reduction in model size and a 46.11% improvement in frame rate. Importantly, these improvements are achieved without compromising the accuracy and success rate achieved before the enhancements. When compared with six other mainstream tracking algorithms, including ECO, the C-ECO algorithm consistently ranks at the top, boasting an average accuracy rate of 82.04% and a success rate of 64.72%. The lightweight pedestrian tracking algorithm proposed in this paper showcases its ability to effectively detect and track pedestrians in various complex scenarios. This research provides a new perspective for intelligent monitoring applications, contributing to advancements in the field.
V. FUTURE WORK AND PROSPECTS
While the pedestrian tracking algorithm proposed in this paper, based on a lightweight convolutional neural network, has demonstrated favorable results in terms of tracking speed and accuracy, it is important to acknowledge the existing limitations and room for improvement.
Firstly, the experiments conducted in this study primarily encompassed scenarios with good lighting conditions, favorable weather conditions, and indoor monitoring scenes. To provide a more comprehensive evaluation of the algorithm's performance, future research should include pedestrian monitoring videos captured under poor lighting conditions at night, as well as videos recorded in adverse weather conditions such as rain, snow, and fog. By expanding the dataset to encompass these challenging scenarios, the algorithm's robustness and generalizability can be further scrutinized.
Despite efforts to simplify the algorithm's architecture in this study, the inference operation still requires a substantial amount of computing resources due to the inherent complexity of the model and its network. As part of our future work, we aim to optimize and refine the model to minimize computational demands, enabling it to be executed efficiently on mobile devices while maintaining or even improving its performance.
It is also essential to note that this algorithm has its limitations and areas for further refinement. Additional experiments can be conducted to address these deficiencies and improve the overall effectiveness of the proposed pedestrian tracking algorithm. By embracing these future endeavors, we aspire to achieve superior results and make significant advancements in the field of pedestrian monitoring and tracking.
REFERENCES
[1] M. Kumar and S. Mondal, "Recent developments on target tracking problems: A review," Ocean Eng., vol. 236, Sep. 2021, Art. no. 109558.
[2] N. Mahmoudi, S. M. Ahadi, and M. Rahmati, "Multi-target tracking using CNN-based features: CNNMTT," Multimedia Tools Appl., vol. 78, no. 6, pp. 7077–7096, Mar. 2019.
[3] M. A. Khan, H. Menouar, and R. Hamila, "Visual crowd analysis: Open research problems," 2023, arXiv:2308.10677.
[4] M. A. Khan, H. Menouar, and R. Hamila, "Revisiting crowd counting: State-of-the-art, trends, and future perspectives," 2022, arXiv:2209.07271.
[5] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, "Visual object tracking using adaptive correlation filters," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2544–2550.
[6] H. Moridvaisi, F. Razzazi, M. A. Pourmina, and M. Dousti, "An extended KCF tracking algorithm based on TLD structure in low frame rate videos," Multimedia Tools Appl., vol. 79, nos. 29–30, pp. 20995–21012, Apr. 2020.
[7] H. Yang, S. Gao, X. Wu, and Y. Zhang, "Online multi-object tracking using KCF-based single-object tracker with occlusion analysis," Multimedia Syst., vol. 26, no. 6, pp. 655–669, Dec. 2020.
[8] K. S. Sanagavarapu and M. Pullakandam, "Object tracking based surgical incision region encoding using scalable high efficiency video coding for surgical telementoring applications," Radioengineering, vol. 31, no. 2, pp. 231–242, May 2022.
[9] J. T. Mbelwa, Q. Zhao, Y. Lu, F. Wang, and M. E. Mbise, "Visual tracking using objectness-bounding box regression and correlation filters," J. Electron. Imag., vol. 27, no. 2, p. 1, Mar. 2018.
[10] B. Khan, A. Jalil, A. Ali, K. Alkhaledi, K. Mehmood, K. M. Cheema, M. Murad, H. Tariq, and A. M. El-Sherbeeny, "Multiple cues-based robust visual object tracking method," Electronics, vol. 11, no. 3, p. 345, Jan. 2022.
[11] V. Degli-Esposti, F. Fuschini, H. L. Bertoni, R. S. Thomä, T. Kürner, X. Yin, and K. Guan, "IEEE access special section editorial: Millimeter-wave and terahertz propagation, channel modeling, and applications," IEEE Access, vol. 9, pp. 67660–67666, 2021.
[12] Y. Liang, Y. Liu, Y. Yan, L. Zhang, and H. Wang, "Robust visual tracking via spatio-temporal adaptive and channel selective correlation filters," Pattern Recognit., vol. 112, Apr. 2021, Art. no. 107738.
[13] N. Zdarsky, S. Treue, and M. Esghaei, "A deep learning-based approach to video-based eye tracking for human psychophysics," Frontiers Hum. Neurosci., vol. 15, pp. 1–10, Jul. 2021.
[14] H. A. I. T. Abdelali, H. Derrouz, Y. Zennayi, R. O. H. Thami, and F. Bourzeix, "Multiple hypothesis detection and tracking using deep learning for video traffic surveillance," IEEE Access, vol. 9, pp. 164282–164291, 2021, doi: 10.1109/ACCESS.2021.3133529.
[15] J. Zhang, Y. Liu, H. Liu, J. Wang, and Y. Zhang, "Distractor-aware visual tracking using hierarchical correlation filters adaptive selection," Appl. Intell., vol. 52, no. 6, pp. 6129–6147, Sep. 2021.
[16] L. Almuqren, M. A. Hamza, A. Mohamed, and A. A. Abdelmageed, "Automated video-based face detection using Harris hawks optimization with deep learning," Comput., Mater. Continua, vol. 75, no. 3, pp. 4917–4933, 2023.
[17] I. Ahmed, M. Ahmad, A. Ahmad, and G. Jeon, "Top view multiple people tracking by detection using deep SORT and YOLOv3 with transfer learning: Within 5G infrastructure," Int. J. Mach. Learn. Cybern., vol. 12, no. 11, pp. 3053–3067, Oct. 2020.
[18] W. Zhang, Y. Du, Z. Chen, J. Deng, and P. Liu, "Robust adaptive learning with Siamese network architecture for visual tracking," Vis. Comput., vol. 37, no. 5, pp. 881–894, Apr. 2020.
[19] Y. Wang, H. Huang, X. Huang, and Y. Tian, "ECO-HC based tracking for ground moving target using single UAV," Neural Comput. Appl., vol. 32, no. 10, pp. 1–12, Jul. 2020.
[20] P. Wang, M. Sun, H. Wang, X. Li, and Y. Yang, "Convolution operators for visual tracking based on spatial–temporal regularization," Neural Comput. Appl., vol. 32, no. 10, pp. 5339–5351, Jan. 2020.
[21] S. Shen, S. Tian, L. Wang, A. Shen, and X. Liu, "Improved C-COT based on feature channels confidence for visual tracking," J. Adv. Mech. Design, Syst., Manuf., vol. 13, no. 5, 2019, Art. no. JAMDSM0096.
[22] R. Zhang, Y. Zheng, C. C. Y. Poon, D. Shen, and J. Y. W. Lau, "Polyp detection during colonoscopy using a regression-based convolutional neural network with a tracker," Pattern Recognit., vol. 83, pp. 209–219, Nov. 2018.
[23] P. M. Raju, D. Mishra, and R. K. S. S. Gorthi, "Detection based long term tracking in correlation filter trackers," Pattern Recognit. Lett., vol. 122, pp. 79–85, May 2019.
[24] D. Yuan, X. Chang, P. Huang, Q. Liu, and Z. He, "Self-supervised deep correlation tracking," IEEE Trans. Image Process., vol. 30, pp. 976–985, 2021.
[25] C. Li, X. Liang, Y. Lu, N. Zhao, and J. Tang, "RGB-T object tracking: Benchmark and baseline," Pattern Recognit., vol. 96, Dec. 2019, Art. no. 106977.
[26] T. Xu, Z.-H. Feng, X.-J. Wu, and J. Kittler, "Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking," IEEE Trans. Image Process., vol. 28, no. 11, pp. 5596–5609, Nov. 2019.
[27] Z. Liang and J. Shen, "Local semantic Siamese networks for fast tracking," IEEE Trans. Image Process., vol. 29, pp. 3351–3364, 2020.
[28] J. Zhang, J. Sun, J. Wang, and X.-G. Yue, "Visual object tracking based on residual network and cascaded correlation filters," J. Ambient Intell. Humanized Comput., vol. 12, no. 8, pp. 8427–8440, Sep. 2020.
[29] J. Chen, D. Zhang, M. Suzauddola, and A. Zeb, "Identifying crop diseases using attention embedded MobileNet-V2 model," Appl. Soft Comput., vol. 113, Dec. 2021, Art. no. 107901.
[30] B. Singh, D. Toshniwal, and S. K. Allur, "Shunt connection: An intelligent skipping of contiguous blocks for optimizing MobileNet-V2," Neural Netw., vol. 118, pp. 192–203, Oct. 2019.
[31] A. Michele, V. Colin, and D. D. Santika, "MobileNet convolutional neural networks and support vector machines for palmprint recognition," Proc. Comput. Sci., vol. 157, pp. 110–117, Jan. 2019.
[32] U. Kulkarni, M. S. Meena, S. V. Gurlahosur, and G. Bhogar, "Quantization friendly MobileNet (QF-MobileNet) architecture for vision based applications on embedded platforms," Neural Netw., vol. 136, pp. 28–39, Apr. 2021, doi: 10.1016/j.neunet.2020.12.022.
[33] X. Zhai, H. Wei, Y. He, Y. Shang, and C. Liu, "Underwater sea cucumber identification based on improved YOLOv5," Appl. Sci., vol. 12, no. 18, p. 9105, Sep. 2022.
[34] H. Fu, G. Song, and Y. Wang, "Improved YOLOv4 marine target detection combined with CBAM," Symmetry, vol. 13, no. 4, p. 623, Apr. 2021.
[35] K. Xu, Z. Wang, J. Shi, H. Li, and Q. C. Zhang, "A2-Net: Molecular structure estimation from cryo-EM density volumes," in Proc. AAAI Conf. Artif. Intell., vol. 33, Jul. 2019, pp. 1230–1237.
[36] Y. Chen, X. Zhang, W. Chen, Y. Li, and J. Wang, "Research on recognition of fly species based on improved RetinaNet and CBAM," IEEE Access, vol. 8, pp. 102907–102919, 2020.

HONGLEI WEI was born in 1973. He received the Ph.D. degree from the Dalian University of Technology. He is currently an Associate Professor with Dalian Polytechnic University. His research interests include machine vision, deep learning, and object tracking.

XIANYI ZHAI was born in 1998. He received the bachelor's degree from Dalian Maritime University. He is currently pursuing the master's degree with Dalian Polytechnic University. His research interests include machine vision, object detection, and object tracking.

HONGDA WU was born in 1998. He received the bachelor's degree from Dalian Polytechnic University, where he is currently pursuing the master's degree. His research interests include automated control and intelligent algorithms.