
Pedestrian Tracking Algorithm For Video Surveillance Based On Lightweight Convolutional Neural Network

The document proposes a new pedestrian tracking algorithm called C-ECO that uses a lightweight MobileNet v2 model instead of the VGG model used in the ECO algorithm. It also incorporates a Double Attention Networks module to improve feature extraction and uses pre-training to speed up model convergence. Experimental results show the C-ECO algorithm achieves comparable accuracy to ECO while reducing model size and increasing tracking frame rate.


Received 15 December 2023, accepted 2 February 2024, date of publication 13 February 2024, date of current version 21 February 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3365501

Pedestrian Tracking Algorithm for Video Surveillance Based on Lightweight Convolutional Neural Network
HONGLEI WEI, XIANYI ZHAI, AND HONGDA WU
School of Mechanical Engineering and Automation, Dalian Polytechnic University, Dalian, Liaoning 116034, China
Corresponding author: Honglei Wei ([email protected])
This work was supported in part by the Liaoning Provincial Department of Education 2021 Annual Scientific Research Funding Program
under Grant LJKZ0535 and Grant LJKZ0526; in part by the 2021 Annual Comprehensive Reform of Undergraduate Education Teaching,
Dalian Polytechnic University, under Grant JGLX2021020 and Grant JCLX2021008; and in part by the Graduate Innovation Fund, Dalian
Polytechnic University, under Grant 2023CXYJ13.

ABSTRACT The Efficient Convolution Operators for Tracking (ECO) algorithm has garnered considerable
attention in both academic research and practical applications due to its remarkable tracking efficacy, yielding
exceptional accuracy and success rates in various challenging contexts. However, the ECO algorithm heavily
relies on the deep learning Visual Geometry Group (VGG) network model, which entails complexity and
substantial computational resources. Moreover, its performance tends to deteriorate in scenarios involving
target occlusion, background clutter, and similar challenges. To tackle these issues, this study introduces
an enhanced pedestrian tracking algorithm, C-ECO. Specifically, the VGG network is substituted
with a lightweight MobileNet v2 model, thereby reducing computational demands. Additionally, a Double
Attention Networks (A2-Net) module is incorporated to augment the extraction of crucial information, while
pre-training techniques are integrated to expedite model convergence. Experimental results demonstrate that
the C-ECO algorithm achieves accuracy and success rates comparable to the conventional ECO algorithm
while reducing the model size by 27.96% and increasing the tracking frame rate by 46.11%. Notably, when
compared to other prevalent tracking algorithms, the C-ECO algorithm exhibits an accuracy of 82.20% and
a success rate of 64.72%. These findings underscore the enhanced adaptability of the C-ECO algorithm in
complex environments, offering a more lightweight model while delivering superior tracking capabilities.

INDEX TERMS Machine vision, target tracking, deep learning, efficient convolution operator, pedestrian
tracking.

The associate editor coordinating the review of this manuscript and approving it for publication was Xuebo Zhang.
2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 12, 2024
H. Wei et al.: Pedestrian Tracking Algorithm for Video Surveillance Based on Lightweight CNN

I. INTRODUCTION
Target tracking is an important branch of machine vision and is the technical basis for intelligent visual surveillance, visual behavior analysis, and human-computer interaction [1]. Target tracking refers to the technique of predicting the location of a target in each frame of a subsequent video sequence based on the given target location in the input video sequence [2]. Pedestrian detection and tracking is a research hotspot in the field of computer vision, which can be applied to traffic monitoring, video surveillance, security, and other fields, and has both application value and open challenges. There has been a remarkable upsurge of interest in automated crowd monitoring within the computer vision community. Modern deep learning techniques have enabled the development of fully automated crowd monitoring applications based on visual analysis. Even with the magnitude of the issue, the substantial technological progress, and the unwavering interest from the research community, there are still several challenges that demand attention [3], [4].
At CVPR 2010, Bolme et al. [5] first applied correlation filtering to the field of tracking, and based on this idea,
algorithms using correlation filtering for target tracking have appeared one after another, and the tracking results have steadily improved. Moridvaisi et al. [6] addressed KCF's limitations through the Tracking-Learning-Detection (TLD) framework and devised an algorithm that concurrently trains two classifiers, employing a semi-supervised co-training learning algorithm. They then evaluated the proposed method against its counterparts on the TB-100 dataset. Yang et al. [7] used KCF-based SOT to learn discriminative target appearance that relied on hand-crafted deep features, used the prediction results to refine detection errors in new ways, and eliminated tracking errors caused by uncorrelated algorithms. Sanagavarapu and Pullakandam [8] proposed a method using the Kernelized Correlation Filter (KCF) object tracking technique. The segmented region is encoded by the complexity-efficient Scalable HEVC (SHVC) to meet the resolution of an end-user device. The complexity of SHVC is decreased by using a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to predict the Coding Tree Unit (CTU) structure. The results show that the proposed method decreases the bitrate significantly for video sequences without degradation in Peak Signal-to-Noise Ratio (PSNR). Mbelwa et al. [9] proposed a tracking method that integrates an objectness-bounding box regression (O-BBR) model and a scheme based on the kernelized correlation filter (KCF). The KCF-based scheme is used to improve tracking performance under fast motion (FM) and motion blur (MB). To handle drift caused by occlusion (OCC) and illumination variation (IV), they use objectness proposals trained in bounding box regression as prior knowledge to provide candidates and background suppression. Finally, the KCF scheme, as a base tracker, and O-BBR are fused to obtain the state of the target object. Khan et al. [10] proposed a new criterion based on the hybridization of multiple cues, i.e., average peak correlation energy (APCE) and confidence of squared response map (CSRM), to enhance tracking efficiency. They updated the occlusion detection module, the adaptive learning rate adjustment module, and drift handling using an adaptive learning rate model based on the hybridized criterion, and integrated all these modules into a new tracking scheme. Degli-Esposti et al. [11] proposed a new algorithm for object tracking in SWIR imaging, using a kernelized correlation filter (KCF) as the basic tracker. To overcome occlusions, they proposed the use of the Kalman filter as a predictor and a method to expand the object search area. To cope with outliers, Huber's M-robust approach is applied: the Kalman filter is robustified by introducing a nonlinear Huber's influence function in the Kalman filter estimation step. To balance estimator efficiency against resistance to outliers, a new adaptive M-robustified Kalman filter is proposed. This is achieved by adjusting the saturation threshold of the influence function using the detection confidence information from the basic KCF tracker. Liang et al. [12] proposed Spatio-Temporal adaptive and Channel selective Correlation Filters (STCCF) for robust tracking. By selecting a set of target-specific features from high-dimensional features, STCCF can not only alleviate the over-fitting problem and reduce the computational cost, but also enhance the discriminability and interpretability of the learned filters.
With the rapid development of deep learning in recent years, many scholars use deep learning networks to extract image features and fuse them with correlation filters to perform target tracking.
Zdarsky et al. [13] introduced a deep learning-based approach that uses the video frames of low-cost web cameras. Using DeepLabCut (DLC), an open-source toolbox for extracting points of interest from videos, they obtained facial landmarks critical to gaze location and estimated the point of gaze on a computer screen via a shallow neural network. Tested for three extreme poses, this architecture reached a median error of about one degree of visual angle. Abdelali et al. [14] introduced a fully automated methodology for Multiple Hypothesis Detection and Tracking (MHDT) in video traffic surveillance. The framework integrates the Kalman filter with data association-based tracking techniques, employing the YOLO detection method, to monitor vehicles in complex traffic surveillance scenarios. Empirical findings indicate that the approach is resilient in detecting and tracing vehicle trajectories under diverse circumstances, including scale variations, stationary vehicles, rotations, fluctuating lighting conditions, and occlusion. Zhang et al. [15] introduced an approach known as Harris Hawks Optimization with deep learning-enhanced automated face detection and tracking (HHODL-AFDT). The HHODL-AFDT model incorporates a Faster Region-Based Convolutional Neural Network (RCNN) for face detection and leverages Harris Hawks Optimization (HHO) for hyperparameter optimization. The optimized Faster RCNN model detects facial features and feeds this information into the face-tracking model through a regression network (REGN). Face tracking, facilitated by the REGN model, uses features extracted from adjacent frames to anticipate the facial target's location in subsequent frames. Almuqren et al. [16] presented a method to track an object based on a combination of feature hierarchies of CNNs: they combined several feature hierarchies to compute a more discriminative map for tracking, using a novel integration method based on Kullback–Leibler (KL) divergence. Ahmed et al. [17] presented a multi-person tracking framework integrated with 5G infrastructure. Employing a top-view perspective, this framework yields an expansive scope of the observed scene or field of vision. Person tracking is built on a deep learning-driven tracking-by-detection framework, wherein detection duties are executed by the YOLOv3
model, and the subsequent tracking operations are handled by the Deep SORT algorithm. To further raise the precision of the detection model, a transfer learning approach is employed: the detection model builds on a pre-trained foundation, extended with an additional layer fine-tuned using a top-view dataset. Zhang et al. [18] proposed a robust adaptive learning visual tracking algorithm in which HOG features, CN features, and deep convolution features are extracted from the template frame and the search region frame, respectively; the merits of each feature are analyzed and adaptive feature fusion is performed to improve the validity of the feature representation.
Although target tracking algorithms have been developed over many years, the current algorithms still face challenges in accurately tracking targets that experience occlusion, background clutter, or leave the field of view. Additionally, the deep learning network models, while highly effective, are complex and have a large number of parameters. Consequently, they require substantial computational resources and place higher demands on computer hardware.
To address the above problems, this study proposes C-ECO, a tracking algorithm based on a lightweight convolutional neural network [19] and built on the ECO tracking algorithm, which combines deep learning and correlation filtering. The ECO algorithm has two implementation forms, ECO based on convolutional features and ECO_HC based on hand-crafted features [20]. Weighing accuracy and speed, this study chooses the ECO algorithm based on convolutional features for optimization and improvement. The main contributions of this study are as follows:
(1) Because the deep VGG network model in the ECO algorithm is complex and consumes large computational resources, a lightweight MobileNet v2 is used instead to extract feature information, which effectively reduces resource consumption and improves tracking speed.
(2) To improve the target feature extraction ability of the convolutional network, the A2-Net module is added to MobileNet v2, which improves the network's extraction of important information with a small increase in computing parameters, and significantly improves training efficiency and tracking accuracy.
(3) Introducing a pre-trained model in the training stage accelerates model convergence, significantly shortens the training time, and improves the accuracy and success rate of the tracking algorithm.
The remainder of this paper is organized as follows: Section I introduces the basics, Section II introduces the construction of the C-ECO algorithm, Section III subjects the model to ablation experiments and comparison tests, Section IV summarizes the whole paper, and Section V gives an outlook on future work.

II. FOUNDATIONAL THEORIES AND PROPOSED METHOD
A. THE OBJECT TRACKING ECO ALGORITHM
The ECO target tracking algorithm is improved from the continuous convolutional tracking algorithm C-COT [21]. The algorithm achieves target localization and filter update by applying the convolutional features of the target in the input video, the histogram of oriented gradients (HOG) feature, and the color names (CN) feature [22], [23].
The ECO algorithm mainly includes the processes of feature extraction, continuous convolution, factorized convolution, generation of the sample space model, and correlation filtering [24]. First, the interpolation operation is performed for the features x in the search region of the target to be detected, as shown in Equation (1) [25].

$$J_d\{x^d\}(t) = \sum_{n=0}^{N_d-1} x^d[n]\, b_d\left(t - \frac{T}{N_d}n\right) \tag{1}$$

where: $x^d$ denotes the d-channel feature of x, $J_d\{x^d\}(t)$ is a function on $t \in [0, T)$ that represents the result of the interpolation operation of $x^d$, $x^d[n] \in \mathbb{R}^{N_d}$ is a function with respect to $n \in \{0, \cdots, N_d-1\}$, $N_d$ indicates the resolution, and $b_d\left(t - \frac{T}{N_d}n\right)$ denotes the d-channel interpolation function. The interpolation results for channels 1 to D are denoted by $J\{x\}(t) \in \mathbb{R}^D$, abbreviated as $J\{x\}$. After that, the filter is simplified using principal component analysis, and the response score $S_{Pf}\{x\}$ obtained by convolving with $J\{x\}$ is shown in Equation (2) [26].

$$S_{Pf}\{x\} = Pf * J\{x\} = f * P^{T} J\{x\} \tag{2}$$

where: f denotes the filter with channel number D, $*$ denotes the convolution operation, P is the projection matrix of D rows and C columns, and $P^{T}$ denotes its transpose. The position of the score maximum, i.e., the new position of the target, is obtained by optimizing $S_{Pf}\{x\}$ using the Gauss-Newton algorithm. Finally, the training set is compressed using a Gaussian mixture model; the error between the convolutional response score $S_{Pf}\{x\}$ of the training sample under the current filter f and the Gaussian label $y_0$ of the training sample is measured with the L2 norm, and a penalty term is added to obtain the loss function shown in Equation (3).

$$E(f) = \sum_{m=1}^{M} \pi_m \left\| S_{Pf}\{\mu_m\} - y_0 \right\|_{L^2}^{2} + \sum_{c=1}^{C} \left\| \omega f^{c} \right\|_{L^2}^{2} \tag{3}$$

where: $\mu_m$ and $\pi_m$ are the mean and weight of the training samples, respectively; M is the total number of training samples; $\omega$ is the penalty term of f. P is only calculated in the first frame and is then kept constant, while f is updated every 6 frames thereafter by re-solving (3) with the conjugate gradient algorithm [27].
In summary, ECO improves in three ways: simplifying the filter, optimizing the training set, and reducing the filter update frequency, which effectively improves the tracking speed.

B. LIGHTWEIGHT NETWORK MobileNet v2
The ECO algorithm uses convolutional networks of VGG19 and ResNet50, which have deeper networks, better feature
extraction, and higher tracking accuracy, but the overly complex networks and huge number of parameters take up a lot of computational resources and require better hardware, which increases the computational cost [28]. The target tracking task requires high speed, so it is necessary to build a lightweight convolutional network model to reduce the model size and improve the detection speed while guaranteeing accuracy. MobileNet has a simple streamlined structure with the advantages of a small number of parameters and low latency. The MobileNet network structure is shown in Table 1, where t is the expansion factor, c is the number of channels, n is the block number, and s is the stride [29], [30].

TABLE 1. MobileNet v2 network structure.

1) DEPTHWISE SEPARABLE CONVOLUTION
MobileNet v2 is mainly composed of depthwise separable convolution (DSC), in which the standard convolution operation is split into a depthwise convolution (DW) and a pointwise convolution (PW) [31]. The comparison of the convolutions is shown in Figure 1. For the feature map obtained by depthwise convolution, the pointwise convolution applies a 1×1 convolution kernel, and the final output feature layer after pointwise convolution has the same dimension as that of the standard convolution [32].

2) CONTRAST BETWEEN DEPTHWISE SEPARABLE CONVOLUTION AND TRADITIONAL CONVOLUTIONAL NETWORK
The number of parameters and the amount of computation of depthwise separable convolution are compared with those of the standard convolution to get the ratio of the number of parameters (4) and the ratio of the amount of computation (5).

$$\frac{D_K \times D_K \times M + M \times N}{D_K \times D_K \times M \times N} = \frac{1}{N} + \frac{1}{D_K^{2}} \tag{4}$$

$$\frac{D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F}{D_K \times D_K \times M \times N \times D_F \times D_F} = \frac{1}{N} + \frac{1}{D_K^{2}} \tag{5}$$

Here $D_K$ indicates the size of the convolution kernel; following the usual MobileNet notation, M and N are the numbers of input and output channels and $D_F$ is the spatial size of the feature map. Generally speaking, N is large, so 1/N is negligible. The number of parameters and the amount of computation of depthwise separable convolution are therefore reduced to about $1/D_K^{2}$ of the original, and if the common 3 × 3 convolution kernel is used, they are reduced to about 1/9 of the original [33], [34]. It can be seen that depthwise separable convolution significantly reduces the number of operations and parameters, which can effectively reduce the complexity of the network and improve the speed of target tracking.

3) A2-NET ATTENTION MODULE
During network training, as the volume of information to be acquired grows, the complexity of the model also tends to rise. Consequently, this heightened complexity necessitates increased computational capacity from the hardware on which the model is deployed. The attention mechanism plays a pivotal role in this context by sieving and selecting a small fraction of significant information from a substantial volume of data. By concentrating predominantly on this essential information, the attention mechanism effectively disregards the majority of relatively unimportant data [35].
The fundamental concept of A2-Net revolves around gathering the pivotal features of the entire space into a concise set, followed by an adaptive distribution to each location. This enables subsequent convolutional layers to sense the features of the entire space even without an extensive receptive field. The A2-Net module introduces a dual attention block specifically designed to efficiently capture and distribute long-distance features. This architectural design showcases its potential for enhancing image and video recognition performance, as it effectively models second-order feature statistics and adapts feature assignments.
The central premise of the A2-Net module involves two primary steps. Initially, it gathers crucial features from the complete space and condenses them into a concise set. Subsequently, these pivotal features are adaptively distributed
to each location, allowing subsequent convolutional layers to perceive the features of the overall space without necessitating an extensive receptive field. The first-level attention operation within the A2-Net selectively gathers crucial features from the complete space, ensuring the integration of vital information. Meanwhile, the second-level attention operation employs an additional attention mechanism to dynamically allocate subsets of pivotal features to complement each specific spatio-temporal location within the higher-level task [36]. For a visual representation of the A2-Net module's structure, refer to Figure 2, which depicts the module's components and architecture.

FIGURE 1. Comparison of standard convolution and depthwise separable convolution. Figure 1(a) shows the standard convolution, Figure 1(b) shows the depthwise convolution, and Figure 1(c) shows the pointwise convolution.

FIGURE 2. A2-Net module structure.

A2-Net shares some similarities with SENet, covariance pooling, Non-local, and Transformer. However, it sets itself apart through its first attention operation, which implicitly calculates second-order statistics of pooled features. This approach allows A2-Net to capture intricate appearance and motion correlations that elude global average pooling, a technique employed in SENet. Additionally, the second attention operation in A2-Net dynamically allocates features from a concise collection, providing a more efficient alternative to the exhaustive feature correlation utilized by Non-local and Transformer, which examine correlations between features from all locations and specific positions.
Table 2 shows the performance comparison between A2-Net and the well-known attention network SENet using ImageNet-1K as the dataset. The commonly used metrics Top1-acc and Top5-acc are selected for performance comparison. Top-1 accuracy is the fraction of samples whose top-ranked predicted category matches the actual label, and Top-5 accuracy is the fraction of samples whose five top-ranked categories contain the actual label.

TABLE 2. Performance comparison of A2-Net on the ImageNet-1K dataset.

It can be seen from Table 2 that, in terms of the two indicators Top1-acc and Top5-acc, the prediction performance with A2-Net is improved compared with that of SENet and ResNet.

4) CONSTRUCTION OF C-ECO TRACKING ALGORITHM
The C-ECO pedestrian tracking algorithm proposed in this study is based on the ECO tracking algorithm. In order to increase the ability of the network to extract feature information and improve the accuracy of target detection, the A2-Net module is introduced into MobileNet v2. By adding an A2-Net module after the second, fourth and sixth bottleneck layers, the final C-MobileNet is obtained by replacing the bottleneck in the original network, and the improved
C-MobileNet network structure is shown in Figure 3. The addition of three A2-Net modules adds only a few training parameters and operations, but brings a great improvement in the ability to extract important information from the feature map.

FIGURE 3. C-MobileNet network structure.

The ECO algorithm is enhanced by incorporating C-MobileNet, which replaces the VGG network utilized in the original algorithm. This modification leads to the creation of the C-ECO algorithm, which is based on a lightweight convolutional neural network. In this paper, the C-ECO algorithm is designed to overcome the limitations of the VGG19 network, characterized by its complex structure and redundant parameters. Instead, the lightweight MobileNet v2 is employed for efficient feature extraction. Additionally, the A2-Net attention module is introduced to enhance recognition performance. By balancing feature extraction capability and network complexity, the C-ECO algorithm achieves a lightweight architecture. Ultimately, the improved C-ECO algorithm is utilized for precise target tracking. Figure 4 shows the flow of the improved C-ECO algorithm.

III. EXPERIMENT AND ANALYSIS
1) EXPERIMENTAL ENVIRONMENT AND DATASET
The GPU used in this experiment is an NVIDIA GeForce RTX 3060 with 6 GB of memory; the CPU is an Intel Core i7-12700H at 2.70 GHz with 32 GB of RAM; the OS is Windows 11, the programming environment is Python 3.9, the programming software is PyCharm 2021.3.1, and the CUDA version is 11.6. In terms of experimental parameter settings, the initial learning rate was set to 0.001, and the number of epochs was set to 200.
In order to evaluate the tracking algorithm performance and show the tracking effect, two data sets, OTB-50 and OTB-100, containing 50 and 100 videos respectively, are selected for testing.

2) ABLATION EXPERIMENTS
In order to verify the feasibility of applying the A2-Net module in MobileNet, ablation experiments are conducted. The ECO algorithm was compared with the ECO algorithm combined with MobileNet and with the C-ECO algorithm, respectively, using the dataset OTB-100, and four metrics were used to measure model strengths and weaknesses, namely model size, accuracy, success rate, and FPS.
(1) Accuracy rate: the percentage of video frames in which the distance between the center point of the target location (bounding box) estimated by the tracking algorithm and the center point of the manually labeled (ground-truth) target is less than a given threshold.
(2) Success rate: let a denote the bounding box obtained by the tracking algorithm and b the ground-truth box. The overlap score (OS) is defined as OS = |a ∩ b| / |a ∪ b|, where | · | indicates the number of pixels in the region. When the OS of a frame is greater than the set threshold, the frame is considered a success, and the percentage of successful frames among all frames is the success rate.
(3) FPS: the number of frames per second that the tracking algorithm processes.
The experimental results are shown in Table 3.
According to the data presented in Table 3, when MobileNet v2 is used for feature extraction, the algorithm model size is reduced by 30.29% compared to the VGG network. Although there is a slight decrease in accuracy and success rate, the FPS (frames per second) shows some improvement. Furthermore, with the inclusion of the A2-Net module, the model size increases slightly but remains 27.96% smaller than the VGG network model. The accuracy and success rates are also comparable to the VGG network, differing by less than 1%. Notably, the FPS improves significantly, increasing from 16.57 FPS to 24.21 FPS, a 46.11% improvement. These results demonstrate the superior performance of the C-ECO algorithm, which incorporates the lightweight MobileNet network and the A2-Net module.
Algorithms No. 2, No. 3 and No. 4 are used as the basic feature extraction network to conduct target detection tests on pedestrians in the video to verify their detection performance. In order to more clearly display the differences between algorithms, the detection results are displayed, as shown in Figure 5.
Figure 5 compares the target detection results of algorithms 2, 3, and 4. The detection results of the ResNet-50 algorithm are shown in Figure 5(a1-a4). There are false detections in pedestrian detection: the umbrella in the upper part of the figure is wrongly detected as a pedestrian, and the pedestrian detection effect on the road is poor. The detection results of the MobileNet v2 algorithm are shown in Figure 5(b1-b4). The pedestrian detection is relatively accurate, with almost no false detections. However, due to its simple network structure, the detection of small pedestrian targets on the right edge of the image in Figure 5(b4) is poor. In general, the pedestrian detection effect of algorithm 3 is significantly better than that of algorithm 2. The detection effect after adding the A2-Net attention module to MobileNet v2 is shown in Figure 5(c1-c4). Due to the addition of the attention mechanism, the algorithm
FIGURE 4. C-ECO algorithm flow.

TABLE 3. Ablation experiments.

has a stronger ability to extract features and a better ability to detect small targets than Algorithm 3 and Algorithm 2. It can be concluded from Figure 5 and Table 3 that introducing MobileNet v2 and the A2-Net module significantly improves the target detection ability, reduces the size of the model, and increases the speed of the pedestrian tracking algorithm.

3) MODEL PRE-TRAINING
Pre-training refers to the process of initially training a model on a large-scale dataset and subsequently fine-tuning it on specific downstream task data. This approach can accelerate model convergence and significantly reduce training time. To investigate how pre-training the C-MobileNet feature extraction network affects the target tracking performance of the proposed C-ECO algorithm, experiments were conducted on the OTB-100 dataset with and without pre-training. For the pre-trained variant, a MobileNet v2 model pre-trained on the ImageNet dataset was selected and its parameters were fine-tuned. The experimental results are summarized in Table 4.

TABLE 4. Comparison experiment with and without pre-training.

From Table 4, it can be seen that the accuracy rate of the model with pre-training is 2.62% higher and the success rate is 2.47% higher than those of the model without pre-training (C-ECO without pre-training, C-ECO-N). The model with pre-training is therefore more effective in tracking the target, and it is used in all subsequent sections.

4) TRACKING ALGORITHM PERFORMANCE COMPARISON EXPERIMENTS
To validate the effectiveness of the C-ECO algorithm, it is compared against several mainstream correlation filter tracking algorithms: Kernelized Correlation Filter (KCF), Discriminative Correlation Filter (DCF), Discriminative Scale Space Tracker (DSST), Spatially Regularized Discriminative Correlation Filters (SRDCF), and Background-Aware Correlation Filters (BACF). This comparison shows how the C-ECO algorithm performs relative to these established tracking algorithms.

The comparison experiment uses the OTB-50 dataset, which consists of 50 video sequences manually labeled with the true position of the target. The sequences are classified according to interference factors: occlusion (OCC), motion blur (MB), background clutter (BC), illumination variation (IV), low resolution (LR), scale variation (SV), deformation (DEF), out of view (OV), in-plane rotation (IPR), fast motion (FM), and out-of-plane rotation (OPR). The tracking accuracy and tracking success rate of C-ECO and the other tracking algorithms for the different types of videos are shown in Tables 5 and 6, respectively, with the best values in bold.

Based on the data presented in Tables 5 and 6, it is evident that the C-ECO algorithm proposed in this paper achieves a high accuracy rate when evaluated on the OTB-50 dataset.
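The accuracy and success rate used in this comparison can be illustrated with a minimal sketch. It assumes the conventional OTB-style definitions, which are not spelled out in this section: accuracy is taken as the fraction of frames whose center position error is within a threshold, and success rate as the fraction of frames whose intersection over union (IoU) with the labeled box exceeds a threshold. The (x, y, w, h) box format, the function names, and the toy boxes are hypothetical and chosen only for illustration; this is not the evaluation code used in the experiments.

```python
import math

# Boxes are (x, y, w, h) tuples, with (x, y) the top-left corner.
# This box format is an assumption made for illustration only.

def center_error(pred, truth):
    """Euclidean distance between the centers of two boxes."""
    px, py = pred[0] + pred[2] / 2, pred[1] + pred[3] / 2
    tx, ty = truth[0] + truth[2] / 2, truth[1] + truth[3] / 2
    return math.hypot(px - tx, py - ty)

def iou(pred, truth):
    """Intersection over union of two axis-aligned boxes."""
    left = max(pred[0], truth[0])
    top = max(pred[1], truth[1])
    right = min(pred[0] + pred[2], truth[0] + truth[2])
    bottom = min(pred[1] + pred[3], truth[1] + truth[3])
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    union = pred[2] * pred[3] + truth[2] * truth[3] - inter
    return inter / union if union > 0 else 0.0

def precision(preds, truths, threshold=20.0):
    """Fraction of frames whose center error is within the threshold."""
    hits = sum(center_error(p, t) <= threshold for p, t in zip(preds, truths))
    return hits / len(preds)

def success_rate(preds, truths, threshold=0.5):
    """Fraction of frames whose IoU exceeds the threshold."""
    hits = sum(iou(p, t) > threshold for p, t in zip(preds, truths))
    return hits / len(preds)

# Toy three-frame sequence (made-up boxes, not data from the paper).
truths = [(10, 10, 40, 80), (12, 10, 40, 80), (14, 10, 40, 80)]
preds = [(10, 10, 40, 80), (20, 14, 40, 80), (60, 10, 40, 80)]

print("precision at threshold 20:", precision(preds, truths))
print("success rate at IoU 0.5:", success_rate(preds, truths))
```

Sweeping the two thresholds over a range of values, rather than fixing them, would give curves of the kind usually reported as accuracy and success plots.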


FIGURE 5. Comparison of target detection effects before and after algorithm improvement. Figure 5(a1-a4) shows the detection results of Algorithm 2, Figure 5(b1-b4) those of Algorithm 3, and Figure 5(c1-c4) those of Algorithm 4.

Furthermore, it outperforms the mainstream correlation filtering algorithms, with an average accuracy rate of 82.04% and a success rate of 64.72%. In complex scenarios such as FM, OCC, and MB, the C-ECO algorithm surpasses the ECO algorithm by 0.51% in accuracy and 0.23% in success rate.

TABLE 5. Accuracy of each tracking algorithm on OTB-50 (%).

TABLE 6. Success rate of each tracking algorithm on OTB-50 (%).

The accuracy and success rate curves of the C-ECO algorithm and the mainstream tracking algorithms are shown in Figure 6.

FIGURE 6. Accuracy and success curves of 7 tracking algorithms on the OTB-100 dataset.

From Figure 6(a), it can be seen that when the position error threshold is less than 20, the accuracy of the C-ECO algorithm is slightly lower than that of the ECO algorithm while still holding a large lead over the other mainstream tracking algorithms. Once the position error threshold exceeds 20, the C-ECO algorithm overtakes the ECO algorithm and continues to lead. From Figure 6(b), it can be seen that when the IoU threshold is less than 0.5, the gap between the success rates of ECO and C-ECO is not obvious, and when the IoU threshold is greater than 0.6, C-ECO is in the leading position and significantly better than the other tracking algorithms.

The main challenges of the video in Figure 7(a1-a4) are SV, OCC, and LR. From Figure 7(a1), it can be seen that KCF, ECO, and C-ECO were all able to track the target accurately when the target was moving without disturbance, and that C-ECO successfully tracked the small target. When a new target appeared, as shown in Figure 7(a2), KCF incorrectly tracked the shadow of the target, whereas ECO and C-ECO tracked it correctly. When the initial transformation shown in Figure 7(a3-a4) occurred and the moving target was shaded and separated, both KCF and
ECO lost the original target, whereas C-ECO was able to track it accurately.

FIGURE 7. Comparison of tracking effects. For clarity of display, only KCF, ECO, and the C-ECO algorithm proposed in this paper are shown. Black boxes are KCF, yellow boxes are ECO, and blue boxes are C-ECO.

The main challenges of the video in Figure 7(b1-b4) are IV, DEF, and MB. From Figure 7(b1-b2), it can be seen that KCF shows tracking drift when the target is occluded. As shown in Figure 7(b3-b4), when the video has motion blur, a large angle change, and a complex background, the KCF prediction box drifts over a large range and ECO is affected by the similarly colored background, whereas the C-ECO algorithm is still able to track the target accurately.

The principal challenges in the video sequence in Figure 7(c1-c4) are scale variation (SV), occlusion (OCC), deformation (DEF), and out-of-plane rotation (OPR). Frame c2 shows that as the occlusion scenario becomes more complex, with rapidly moving pedestrians, interference from street lamps, and mutual occlusion among walking pedestrians, the tracking performance of the KCF and ECO algorithms is notably degraded. In contrast, the proposed C-ECO algorithm maintains good tracking performance under these demanding conditions.

Frame c3 shows that, for fast-moving pedestrians, the KCF algorithm loses the target entirely and ECO struggles to maintain effective tracking, whereas C-ECO maintains a higher degree of tracking accuracy.

In frame c4, when tracking small targets in complex scenes with heavy occlusion, both KCF and ECO lose the target. In contrast, C-ECO exhibits resilience and
maintains the ability to accurately track the target, demonstrating its robustness in these difficult situations.

The primary challenges in the video sequence in Figure 7(d1-d4) are illumination variation (IV), scale variation (SV), occlusion (OCC), and deformation (DEF). Frame d2 shows that when the tracked target is similar in color to the background, both the KCF and ECO algorithms are susceptible to tracking drift. Frames d3 and d4 show that as the tracked target moves away from the camera, becoming smaller and encountering occlusion, both KCF and ECO exhibit varying degrees of tracking drift. In contrast, owing to its stronger feature extraction capability, C-ECO is more resistant to tracking failure in these challenging scenarios.

The experimental results presented above show that the C-ECO algorithm proposed in this paper outperforms the other classical correlation filter tracking algorithms. It achieves higher tracking accuracy and success rates, uses a significantly smaller feature extraction model than the ECO algorithm prior to optimization, and improves the frame rate for video tracking. The reduced model size makes the algorithm more practical and resource-efficient, and the improved frame rate benefits video tracking tasks. These findings indicate that the C-ECO algorithm is a promising choice for tracking applications.

IV. CONCLUSION AND EXPECTATIONS
In this study, we propose a novel C-ECO tracking algorithm that leverages a lightweight convolutional neural network. The algorithm employs MobileNet v2 for efficient feature extraction, integrates the A2-Net module to enhance feature representation, and incorporates a pre-trained model to expedite training. The primary objective is to improve tracking accuracy and success rates. Experimental results demonstrate that the C-ECO algorithm outperforms the previous ECO algorithm employing VGG Net, with a 27.96% reduction in model size and a 46.11% improvement in frame rate. Importantly, these improvements are achieved without compromising the accuracy and success rate achieved before the enhancements. When compared with six other mainstream tracking algorithms, including ECO, the C-ECO algorithm consistently ranks at the top, with an average accuracy rate of 82.04% and a success rate of 64.72%. The lightweight pedestrian tracking algorithm proposed in this paper is able to effectively detect and track pedestrians in various complex scenarios. This research provides a new perspective for intelligent monitoring applications, contributing to advancements in the field.

V. FUTURE WORK AND PROSPECTS
While the pedestrian tracking algorithm proposed in this paper, based on a lightweight convolutional neural network, has demonstrated favorable results in terms of tracking speed and accuracy, it is important to acknowledge the existing limitations and room for improvement.

First, the experiments conducted in this study primarily covered scenarios with good lighting conditions, favorable weather conditions, and indoor monitoring scenes. To provide a more comprehensive evaluation of the algorithm's performance, future research should include pedestrian monitoring videos captured under poor lighting conditions at night, as well as videos recorded in adverse weather conditions such as rain, snow, and fog. Expanding the dataset to cover these challenging scenarios will allow the algorithm's robustness and generalizability to be examined further.

Second, despite efforts to simplify the algorithm's architecture in this study, the inference operation still requires a substantial amount of computing resources due to the inherent complexity of the model and its network. As part of our future work, we aim to optimize and refine the model to minimize computational demands, enabling it to be executed efficiently on mobile devices while maintaining or even improving its performance.

Additional experiments can be conducted to address these deficiencies and improve the overall effectiveness of the proposed pedestrian tracking algorithm. Through this future work, we aim to achieve better results and contribute to advances in the field of pedestrian monitoring and tracking.

HONGLEI WEI was born in 1973. He received the Ph.D. degree from the Dalian University of Technology. He is currently an Associate Professor with Dalian Polytechnic University. His research interests include machine vision, deep learning, and object tracking.

XIANYI ZHAI was born in 1998. He received the bachelor's degree from Dalian Maritime University. He is currently pursuing the master's degree with Dalian Polytechnic University. His research interests include machine vision, object detection, and object tracking.

HONGDA WU was born in 1998. He received the bachelor's degree from Dalian Polytechnic University, where he is currently pursuing the master's degree. His research interests include automated control and intelligent algorithms.
