
Received 29 April 2023, accepted 23 June 2023, date of publication 3 July 2023, date of current version 10 July 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3291395

Object-Based Hybrid Deep Learning Technique for Recognition of Sequential Actions

YO-PING HUANG 1,2,3,4, (Fellow, IEEE), SATCHIDANAND KSHETRIMAYUM 1, AND CHUN-TING CHIANG 1


1 Department of Electrical Engineering, National Taipei University of Technology, Taipei 10608, Taiwan
2 Department of Electrical Engineering, National Penghu University of Science and Technology, Penghu 88046, Taiwan
3 Department of Computer Science and Information Engineering, National Taipei University, New Taipei City 23741, Taiwan
4 Department of Information and Communication Engineering, Chaoyang University of Technology, Taichung 41349, Taiwan

Corresponding author: Yo-Ping Huang ([email protected])


This work was supported in part by the National Science and Technology Council, Taiwan, under Grant MOST108-2221-E-346-006-MY3
and Grant MOST111-2221-E-346-002-MY3; and in part by the AU Optronics Corporation Research Projects under Grant 209A221
and Grant 210A212.

ABSTRACT Using different objects or tools to perform activities in a step-by-step manner is common practice in many settings, including workplaces, households, and recreational activities. However, this practice can pose challenges and potential hazards if the correct sequence of actions is not followed or the appropriate object or tool is not used, so it must be addressed to ensure safety and efficiency. These issues have garnered significant attention in recent years. Previous research has relied on body keypoints to detect actions, but not the objects or tools used during the activity. The lack of a system that identifies the target objects or tools being used while tasks are performed increases the risk of accidents and mishaps during the process. This study addresses the aforementioned issue by introducing a model that is both efficient and robust. The model uses video data to monitor and identify daily activities, as well as the objects involved in the process, enabling real-time feedback and alerts that enhance safety and productivity. The proposed model separates the overall recognition process into two components. First, it uses the BlazePose architecture for pose estimation and interpolates any undetected and wrong-detected landmarks to enhance the precision of the posture estimation; the features are then forwarded to a long short-term memory network to identify the actions performed during the activity. Second, the model employs an enhanced YOLOv4 algorithm for object detection to accurately identify the objects used in the course of the activity. The resulting activity recognition model achieves a 95.91% accuracy rate in identifying actions, a mean average precision of 97.68% in detecting objects, and an overall processing rate of 10.47 frames per second.

INDEX TERMS Human activity recognition, long short-term memory (LSTM), object detection, pose
estimation, standard operating procedures (SOPs).

I. INTRODUCTION

Performing activities that involve different human actions and objects requires careful attention to safety and efficiency. If the appropriate action sequence and the correct object or tool are not used, it can pose significant challenges and potential hazards. For example, using power tools without following the proper sequence can lead to accidents, such as injuries from blades or bits, or damage to the workpiece. Mishandling hot surfaces, not allowing appliances to cool down, or improper use of heat sources can result in burns or scalds. These challenges and hazards must be addressed to ensure that the activity is carried out safely and efficiently.

The associate editor coordinating the review of this manuscript and approving it for publication was Kostas Kolomvatsos.

To address these concerns, human activity recognition (HAR) can be used to monitor the activity and ensure that it follows standard operating procedures (SOPs) that outline step-by-step processes to complete the task. Human pose estimation (HPE) is a popular research field in computer vision that plays a significant role in activity recognition [1], [2], [3]. The majority of these techniques rely on optical sensors that capture RGB images in order to determine body landmarks and the overall position. HPE can also be combined with other computer vision technologies for 3D animation, fitness, virtual and augmented reality, and rehabilitation [4], [5], [6]. HAR, on the other hand, is a crucial computer vision task that enables machines to examine the body landmarks identified by HPE models and comprehend various human activities [7], [8], [9]. Many researchers have been driven to advance HAR systems in real-world settings by the rapid growth of artificial intelligence, smartphones, and CCTV systems. This drive has been motivated by the role of HAR systems in health, security, and behavioral studies. Some of their applications include patient monitoring systems [10], [11], ambient assisted living (AAL) [12], [13], surveillance systems [14], [15], gesture recognition [16], [17], behavior analysis [18], and a range of healthcare systems [19], [20].

In particular, vision-based human activity recognition systems, which evaluate input in the form of video or images to identify performed activities, are quite complicated. This is because the appearance of the body changes dynamically due to various types of clothing, occlusions caused by viewing angles, background context, etc. [21], and performance degrades when occlusion is severe. It is also interesting to note that the majority of current studies only address the recognition of an action, and none really gives insight into the object used during the activity. Fig. 1 shows example pictures of confusing cases, where a person performs an action with and without an object, together with the skeleton representation generated from body landmarks. The physical differences between some actions are very small or even identical, making it difficult to identify activities that are identical yet interact with different objects, such as household, recreational, and workplace activities involving machine operation, material movement, maintenance, assembly, product and process design, etc.

Therefore, with the growing popularity of HAR and object detection in the computer vision field, a system that can accurately recognize the action sequence in an activity as well as detect the objects used during the activity would be of profound benefit. This would aid in analyzing and monitoring a person's activity to determine whether they are adhering to the SOPs with the appropriate objects.

The goal of this research is to create an activity recognition model that, from video information, can detect a person's action sequence as well as the objects being used while they are performing an activity. To achieve this, the person's pose is estimated using BlazePose [22], undetected or wrong-detected landmarks are interpolated using a linear interpolation method, and the information is then processed by a recurrent neural network that can learn sequential order dependency, known as long short-term memory (LSTM). Object detection is carried out in the second part using an enhanced YOLOv4 algorithm to recognize the object in the person's hand while they are performing the activity. Finally, a lightweight and robust system for recognizing a person's activities is created by combining the two models. Fig. 2 depicts the suggested architecture. Three challenges are considered to be resolved in this study: (1) human pose estimation-based action detection using LSTM, (2) an object detection model to detect objects being used in an activity, and (3) an activity recognition model to classify the overall activity.

FIGURE 1. Example of a confusing case for action detection. (a) RGB images of a person performing an action with (left) and without (right) object, and (b) corresponding skeletal representations.

FIGURE 2. Proposed activity recognition framework.

The following are the key contributions of this study:


1) An action recognition technique is proposed that utilizes body landmark information from a sequence of frames. We further detect the object being used in order to make the recognition system more informative.




2) We propose a technique to improve the accuracy of a person's pose estimation by interpolating the undetected and wrong-detected landmarks.
3) The object detection algorithm is further enhanced by introducing an extra YOLO head to detect the various objects of different shapes and sizes used by the person while performing the activity.
4) An activity recognition model is developed that can recognize different actions performed within the activity in chronological order, in accordance with the predefined SOPs as well as the object being used.

This paper is organized as follows. A comprehensive literature review of existing related work is provided in Section II. The proposed methodology is described in Section III. Section IV presents the training dataset, experimental results, and discussions. Conclusions and future research are given in Section V.

II. RELATED WORK

Artificial intelligence (AI) models that estimate body keypoints to characterize body position have become a potentially effective tool for assessing human actions. More specifically, convolutional neural networks (CNNs) are frequently used in human pose estimation to forecast a person's position by performing inference on input videos or images [1], [2]. Due to the numerous conceivable human positions, the high degree of freedom, appearance changes such as illumination and clothing, environmental changes, and occlusions, determining precise pixel coordinates of body keypoints is a challenging process [3]. Despite these challenges, a number of reliable models have been developed that function admirably in applications including sports training, rehabilitation, and fall detection [4], [5], [6]. While pose estimation models have been successful in other applications, accurately identifying keypoints is still needed in order to track a person's activity, because engaging in the wrong activity might have side effects on production lines.

For body joint coordinate-based action recognition, the human pose estimation problem is formulated as a CNN-based regression problem toward body joints by the holistic model DeepPose [23]. Additionally, it employs a cascade of these regressors to improve the pose estimation. However, regression to XY locations is challenging and raises learning complexity, which inhibits generalization and results in subpar performance in some regions. A real-time multi-person posture estimation architecture made for desktop settings, called OpenPose [24], was proposed as a solution and is commonly used in the pose estimation community. It generates a feature representation by first analyzing the image using the first 10 layers of the VGG-19 architecture. The captured feature representation is then fed into a two-branch multi-level CNN to generate part confidence maps and vector fields of part affinities. One branch forecasts a collection of 2D body part confidence maps. The other branch indicates the relationship of parts through 2D vector fields of part affinities. These two branches are used to carry out K-partite graph matching for multi-person pose estimation. The primary drawback of this system is that, processing at only 0.4 frames per second, it demands a lot of computational power and is difficult to use on real-time videos. A two-step detector-tracker inference pipeline is used by Google's (Mountain View, CA) BlazePose model [22], where the detector is employed on the initial frames until the person is discovered, and the tracker is then used to follow the person in consecutive frames. To predict heatmaps for each joint, this model employs an encoder-decoder network design followed by another encoder that regresses directly to the coordinates of all joints. It is ideal for estimating human pose for activity recognition due to its lightweight design and real-time inference capability. However, it may fail to detect body landmarks under large changes in appearance, clothing, and occlusions.

Recent advances in effective motion capture technologies and posture assessment algorithms have made it easier to obtain information about human joint coordinates. As a result, joint coordinate-based action recognition using deep learning methods has significantly outperformed previous methods in recent years and has become the standard approach. The recurrent neural network (RNN) [25] is now one of the most used frameworks in joint coordinate-based action recognition because of its ability to analyze sequential data. A hierarchical RNN network [26] was proposed to classify activities based on skeleton data. An advanced LSTM network [27] that is fully coupled and includes a regularization strategy was developed to acquire the high-level temporal aspects of skeleton information. All these approaches rely on the RNN architecture and aim to improve action recognition while failing to recognize the object being used. Thus, many significant recognition errors occur among physically similar classes of person activity. The primary cause of these recognition errors is that these activities differ by tiny or similar body movements yet interact with different objects.

Our work belongs to activity recognition, but it focuses on both the body movement of the person and the interacted objects, which has not been considered in the above methods. In this study, we modified YOLO (you only look once) [28] to enhance its ability to detect various objects of different shapes and sizes that are used by individuals while performing activities. The proposed method is a single convolutional network that predicts multiple bounding boxes and class probabilities from a single image frame in a single evaluation. By improving the accuracy of object detection, our model can provide a more comprehensive understanding of the actions being performed. This makes the proposed model suitable for a wide range of applications, including human activity recognition and surveillance.




III. PROPOSED METHODOLOGY

The proposed approach aims to develop a framework that is both lightweight and robust for classifying sequential actions in an activity. This framework focuses on capturing not only the body movements but also the objects that the person interacts with during the activity. The proposed architecture consists of five components: pose estimation, feature extraction, action detection, object detection, and activity recognition, as shown in Fig. 3. Initially, the video data is split into individual frames, followed by pose estimation using the BlazePose architecture, which returns 33 landmarks of a person (Fig. 4). Then, any undetected and wrong-detected landmarks are interpolated to enhance the precision of the posture estimation. The landmark values are then saved as frame values to represent a sequence of events for an activity. For the purpose of understanding the temporal components, the transformed landmark values are subsequently fed into the LSTM layers and finally to a SoftMax layer to return a probability for each action. Furthermore, an enhanced YOLOv4 algorithm, with an additional prediction head to improve the detection of small objects and handle variations in object sizes, is used to detect the objects used during the activity. Finally, an algorithm for activity recognition is developed by utilizing the chronological sequence of actions in accordance with the predefined SOPs. This approach ensures that the algorithm can identify the correct sequence of actions and compare it with the established procedures to determine the accuracy and efficiency of the performed activity. The following sections give detailed explanations of the steps stated above.

FIGURE 3. Block diagram of the activity recognition system.

FIGURE 4. A 2D skeletal topology with 33 landmarks.

A. POSE ESTIMATION FRAMEWORK

Human pose estimation and tracking are crucial in a wide range of fields, including health monitoring, surveillance systems, and gestural control. However, in computer vision, it faces challenges like detecting, associating, and tracking semantic keypoints, such as "right shoulder," "left knee," or "left elbow." These problems can be solved by using deep learning models to recognize and track human body language through posture detection and tracking. Furthermore, CNN-based models are the most efficient image processing methods available today [29]. Therefore, the most advanced methods often rely on the development of a CNN architecture specifically designed for human posture detection. Pose estimation methods can be classified into top-down and bottom-up approaches. In a bottom-up approach, each joint of the body is evaluated individually before combining them into a distinct pose. DeepCut [30] was the first to use a bottom-up approach. In contrast, a top-down approach begins with a person detector and estimates body joints within the detected bounding boxes. Although pose estimation has huge practical ramifications, it is challenging to estimate strong articulations and small, hardly perceptible joints, and to cope with occlusions, clothing, and lighting changes. However, significant progress has been made in predicting human pose, which provides strong support for numerous practical applications.

In this study, a powerful, robust, and lightweight CNN-optimized top-down human pose estimation architecture is implemented for real-time detection. To achieve this, the heatmaps and offsets from earlier frames of the person performing actions are used. We utilize a two-step machine learning pipeline: a detector and a tracker for the person who is performing the actions. Since the face provides the greatest information regarding the position of the torso, the detector network of the pose estimation runs from the first frame until the person's face is detected. The tracker is then used to track the person while performing the actions, as shown in Fig. 5.
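As a concrete illustration of this two-step pipeline, the sketch below runs the publicly available MediaPipe implementation of BlazePose over a video and collects the 33 landmarks per frame, flattening them into the 132-value frame vector used later in this section. The helper name extract_landmark_sequence and the fixed 60-frame window are illustrative assumptions, not the authors' released code.

# Sketch: per-frame BlazePose landmark extraction with the MediaPipe Python API
# (assumed packages: mediapipe, opencv-python, numpy).
import cv2
import mediapipe as mp
import numpy as np

mp_pose = mp.solutions.pose

def extract_landmark_sequence(video_path, max_frames=60):
    """Return an array of shape (n_frames, 132): x, y, z, visibility for 33 landmarks."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    # static_image_mode=False enables the detector-then-tracker behavior described above.
    with mp_pose.Pose(static_image_mode=False, model_complexity=1,
                      min_detection_confidence=0.5, min_tracking_confidence=0.5) as pose:
        while cap.isOpened() and len(frames) < max_frames:
            ok, frame = cap.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks is None:
                # No person found: store zeros so the frame can be interpolated later.
                frames.append(np.zeros(33 * 4, dtype=np.float32))
                continue
            row = []
            for lm in result.pose_landmarks.landmark:  # 33 landmarks
                row.extend([lm.x, lm.y, lm.z, lm.visibility])
            frames.append(np.asarray(row, dtype=np.float32))
    cap.release()
    return np.stack(frames)  # (n_frames, 132)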




FIGURE 5. Overview of the pose detector pipeline.

For the person pose tracking, inspired by [31], in the process of obtaining the landmarks of the entire human body, we utilize two more virtual keypoints to accurately define the human body's center, rotation, and scale as a circle. The network is consequently capable of predicting the midpoint of the person's hips, the radius of a circle that encloses the entire body, and the angle of inclination of the line joining the midpoints of the shoulders and hips [31]. This also helps in tracking extremely complex situations in any kind of person's activity.

The model uses an encoder-decoder network architecture to predict heatmaps for every joint of the person, followed by a second encoder that regresses directly to every landmark (joint coordinate). Then, to make this model lightweight enough to run on a low-end computer, the heatmap output is removed during inference, as shown in Fig. 6. A list of 33 landmarks is returned by the architecture. Each landmark is represented as x, y, z, and v, the visibility. The coordinates (x and y) show where a particular joint of the person is located, normalized to the range between 0 and 1 of the image's width and height. z stands for the depth of the landmark, with the depth at the center of the hips as origin. The term v describes whether or not a landmark can be seen in the frame. The scale and position of the person affect the landmarks that the pose estimation network generates, so they are transformed to become independent of the position and scale in the frame. As a result, the same person performing the same action may produce different landmark values in different frames depending on where they are in the frame. We grab these landmark values and save them as frame values to represent a sequence of events for an activity. For an activity video, V^m = [F_1, F_2, ..., F_n] is a matrix of pose vectors with K landmarks, where V^m contains n frames of the person conducting the actions. Each frame consists of:

F_i = [l_i^1, l_i^2, ..., l_i^K],  i ∈ [1, n]    (1)

Since our model generates 33 landmarks (K = 33), the resulting vector has a length of 132 landmark values and the format:

F_i = [x_i^1, y_i^1, z_i^1, v_i^1, x_i^2, y_i^2, z_i^2, v_i^2, ..., x_i^33, y_i^33, z_i^33, v_i^33]    (2)

FIGURE 6. Architecture of the landmark detector network.

Depending on the photography settings and conditions, landmarks may be undetected or wrong-detected when CNN-based pose estimation models are applied to a video taken by a general camera. Action detection and analysis are negatively impacted by this kind of inaccurate landmark detection. To overcome this issue, in conjunction with the BlazePose architecture, we have incorporated interpolation techniques. These techniques play a crucial role in enhancing the accuracy of posture estimation by effectively addressing any undetected or wrong-detected landmarks. Through interpolation, we fill in the gaps and correct any inaccuracies, ultimately boosting the overall precision of the posture estimation process. To do this, we use the time-series correlations between identical body joints across several frames, because the estimated human pose is a collection of time-series data.

When landmarks in BlazePose cannot be detected, their x and y coordinate values are always 0. In this study, for person w's landmark l_w^f in frame f, if l_w^(f-1) and l_w^(f+1) are detected but l_w^f is not, we represent frame f as an "undetected landmark frame" f':

f' = f,  where l_w^f = (0, 0), l_w^(f-1) ≠ (0, 0), and l_w^(f+1) ≠ (0, 0)    (3)

Similarly, for person w's landmark l_w^f in frame f, if l_w^(f-1) and l_w^(f+1) are detected but l_w^f is wrong-detected, we represent frame f as a "wrong-detected landmark frame" f''. We rely on the difference δ^f, defined as the spatial distance of landmark l_w between two consecutive frames f-1 and f. The difference δ^f is given as a number of pixels. Because resolutions and frame rates vary with the input video, we do not wish to specify an absolute threshold for δ^f. Instead, we set a threshold θ on the ratio of the differences δ^f and δ^(f-1):

f'' = f,  where δ^f > θ · δ^(f-1), l_w^(f-1) ≠ (0, 0), and l_w^(f+1) ≠ (0, 0)    (4)

The percentage of frames marked as wrong-detected that were in fact not wrong-detected was lower when the threshold was set to θ = 3. As a result, we use θ = 3 as the threshold in this study so that only frames that are clearly wrong-detected are interpolated. In this manner, we identify wrong-detected landmark frames according to the relative change of every landmark. Both undetected and wrong-detected landmark frames are then interpolated using the landmark coordinate information of the previous and following frames.
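The rules in (3)-(4) and the midpoint interpolation in (5)-(6) of the next paragraphs can be written compactly as the following sketch. The array layout (n_frames x 33 x 2 for the x, y coordinates) and the function name repair_landmarks are assumptions for illustration, not the authors' implementation.

# Sketch: flag undetected / wrong-detected landmark frames and repair them by
# linear (midpoint) interpolation from the neighboring frames, per Eqs. (3)-(6).
import numpy as np

THETA = 3.0  # ratio threshold used in Eq. (4)

def repair_landmarks(xy):
    """xy: float array of shape (n_frames, 33, 2) holding normalized x, y per landmark."""
    xy = xy.copy()
    n = xy.shape[0]
    for f in range(1, n - 1):
        prev_ok = np.any(xy[f - 1] != 0.0, axis=1)   # landmark detected in frame f-1
        next_ok = np.any(xy[f + 1] != 0.0, axis=1)   # landmark detected in frame f+1
        undetected = ~np.any(xy[f] != 0.0, axis=1)   # Eq. (3): x = y = 0 in frame f
        # Eq. (4): displacement between f-1 and f much larger than between f-2 and f-1.
        delta_f = np.linalg.norm(xy[f] - xy[f - 1], axis=1)
        delta_prev = (np.linalg.norm(xy[f - 1] - xy[f - 2], axis=1)
                      if f >= 2 else np.full(xy.shape[1], np.inf))
        wrong = delta_f > THETA * delta_prev
        fix = (undetected | wrong) & prev_ok & next_ok
        # Eqs. (5)-(6): replace the flagged landmark with the midpoint of its neighbors.
        xy[f, fix] = 0.5 * (xy[f - 1, fix] + xy[f + 1, fix])
    return xy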




It is crucial to extract the person's coordinate values from the surrounding frames in order to interpolate the missing ones. We use linear interpolation to interpolate the landmarks of undetected and wrong-detected landmark frames. This is based on the observation that a person's action does not change significantly over a short period of time. In most cases, the undetected or wrong-detected landmark l_w^f will be located close to the midpoint of the landmarks l_w^(f-1) and l_w^(f+1).

For an undetected frame f', let the landmarks of person w_f' in frames f'-1 and f'+1 be l_w^(f'-1) and l_w^(f'+1), respectively. We apply linear interpolation to the landmark l_w^f' whose x and y coordinate values are both 0:

l_w^f' = (l_w^(f'-1) + l_w^(f'+1)) / 2    (5)

For a wrong-detected frame f'', let the landmarks of person w_f'' in frames f''-1 and f''+1 be l_w^(f''-1) and l_w^(f''+1), respectively. We apply the interpolation to the landmark l_w^f'' whose difference δ_{w,l}^{f''} is larger than θ · δ_{w,l}^{f''-1}:

l_w^f'' = (l_w^(f''-1) + l_w^(f''+1)) / 2    (6)

This combination of the BlazePose architecture and the proposed interpolation technique results in a model that is not only capable of providing more reliable estimations of human posture but also exhibits enhanced robustness across diverse scenarios. By successfully handling challenging scenarios and adapting to various body types, clothing variations, and environmental conditions, our model ensures consistent and accurate posture estimations.

B. ACTION DETECTION USING LSTM

Initially, the interpolated landmark values are normalized, and a label map representing the individual actions, which is a categorical variable, is converted into numerical data by creating a new column and assigning a 1 or 0 value to the column before being fed to an RNN to improve predictions. RNNs are employed in the processing of sequential data, including speech recognition, time-series data, machine translation, etc. An RNN recognizes the sequential characteristics of the data and employs the learned patterns to forecast the next likely scenario. However, one drawback of RNNs is that processing longer sequences of actions can be extremely time consuming. As a result, we employ LSTM, a specific kind of RNN that successfully handles this issue [32].

LSTM networks are a subset of RNNs designed specifically for this purpose. The fundamental principle of LSTM is the cell state, which provides an extra information flow over the traditional RNN.

To begin, the forget and input gates decide which parts of the information are to be forgotten and which are to be input for the recognition of the action. The forget and input gates of the person's action recognition are defined as below:

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (7)

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (8)

where x_t denotes the input data; f_t and i_t denote the forget and input gate outputs, respectively; h_{t-1} denotes the previous hidden state; and σ indicates the sigmoid function. Then, the intermediate cell state is calculated by:

c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)    (9)

The cell state c_{t-1} and c̃_t are then used to update the state of the cell c_t:

c_t = f_t · c_{t-1} + i_t · c̃_t    (10)

where · represents the element-wise product. The output gate o_t is derived by:

o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (11)

The output h_t is obtained as:

h_t = o_t · tanh(c_t)    (12)

To ensure the classification of actions, the input video is processed in the form (V^m, F_n, F_i), where V^m is the action video, F_n is the number of frames in the video, and F_i holds the coordinate values of the 33 landmarks. The sequence is fed into a first LSTM layer with 64 LSTM units, a second layer with 128 units, and a third layer with 64 units. After passing the output of the LSTM layers through two dense layers with 64 and 32 neurons, respectively, for additional encoding, it is passed to a SoftMax layer, which returns the probability that the input video belongs to a particular action, as shown in Fig. 7. The prediction with the highest probability is then considered to be the class of that person's action.
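A minimal sketch of the layer stack just described (three LSTM layers with 64, 128, and 64 units, two dense layers with 64 and 32 neurons, and a SoftMax output over the 27 action classes) is given below, assuming a TensorFlow/Keras implementation and 60-frame inputs of 132 landmark values; this is an illustrative reconstruction, not the authors' released code.

# Sketch: action detection network of Fig. 7, assuming TensorFlow/Keras.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES = 60      # frames per action clip (F_n)
NUM_FEATURES = 132   # 33 landmarks x (x, y, z, v) per frame (F_i)
NUM_ACTIONS = 27     # action classes reported in the paper

def build_action_model():
    model = models.Sequential([
        layers.Input(shape=(NUM_FRAMES, NUM_FEATURES)),
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),                      # last LSTM layer returns a single vector
        layers.Dense(64, activation="relu"),  # additional encoding
        layers.Dense(32, activation="relu"),
        layers.Dense(NUM_ACTIONS, activation="softmax"),  # per-action probabilities
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["categorical_accuracy"])
    return model

model = build_action_model()
model.summary()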




FIGURE 7. Proposed architecture of the action detection model. Landmark values are the input features of the action detection network.

C. OBJECT DETECTION USED IN ACTIONS

To accurately detect the objects used during actions, our approach involves implementing a modified end-to-end neural network. Unlike the standard YOLOv4 [33], our modified model incorporates an additional prediction head that specifically enhances the detection of small objects and effectively handles variations in object sizes. This modification allows the network to extract features using convolutional layers, enabling precise computation of bounding boxes and class probabilities for each region with a high average precision (AP). In this activity recognition model, a person may utilize different objects in different actions. Thus, the main objective of the study is to determine whether a person is using the appropriate objects when performing an activity, because using the wrong objects or tools for different actions may create challenges and pose potential hazards or risks, which could affect safety and efficiency in different settings. To address this issue, we adopt a method that considers the object and the person's hand as a single entity, while disregarding any similar objects of the same class in the same frame that are not being used during the activity. This approach allows us to focus on the relevant objects and movements, and to eliminate any unnecessary or confusing information that may lead to inaccurate or misleading results. By considering the object and hand as a single entity, our method achieves a more comprehensive understanding of the activity being performed and improves the accuracy of the analysis.

Now, the input image frame is divided into S × S grids in order to detect the object. If the object's center falls within a grid cell, that grid cell is used to forecast a bounding box:

CS_g^b = P_{g,b} × IoU_pred^truth    (13)

where CS_g^b is the confidence score of the bth bounding box in the gth grid, P_{g,b} represents the class probability value of the bth bounding box in the gth grid, and IoU_pred^truth denotes the intersection over union (IoU) between the ground-truth and predicted bounding boxes of the object.

The detection model structure consists of four main parts: input terminal, backbone, neck, and head, which help to clearly describe each stage of the suggested method. To ensure the detection of moving and stationary objects, the input image is processed at a resolution of 416 × 416 pixels. DarkNet53 was created as a result of YOLOv3 [34] incorporating the residual module and the properties of the ResNet structure. Based on this, YOLOv4 created CSPDarkNet53, which consists of 5 cross-stage partial (CSP) modules and 72 convolutional layers, taking advantage of the superior learning capabilities of the CSP network (CSPNet) [35]. By incorporating gradient changes into the feature maps, it minimizes computational bottlenecks and enables the CNN to achieve greater accuracy. Additionally, the initial CSP stages are transformed into the residual layer of the original DarkNet in order to increase accuracy as well as speed. Two convolutional layers and one skip connection are included in each residual module. A batch normalization layer and a Mish activation function accompany each convolutional layer. Five CSP modules are present in the CSPDarkNet53 backbone, with residual units distributed as (1-2-8-8-4) across its stages. SPPNet and PANet are the components of the neck portion. The input feature layer in SPPNet is first convolved three times, and maximum pooling operations are performed using max-pooling kernels of different sizes. The pooled outputs are first concatenated and then convolved three times, which enlarges the receptive field of the network. Following the operations of the backbone and SPPNet, PANet convolves the feature layers and up-samples them, doubling the height and width of the original feature layers.

The feature layer obtained after convolution and up-sampling is concatenated with the feature layer obtained from CSPDarkNet53 to achieve feature fusion, followed by down-sampling. It is then compressed in height and width, and stacked with previous feature layers for even more feature fusion. In contrast to the three detection heads in YOLOv4, the proposed model includes an additional prediction head that enhances the ability to detect extremely small objects, improves the stability of the detection, and mitigates the negative effects of object size variance. The introduced extra head enhances the object detection algorithm by effectively handling scale variations, improving localization accuracy, providing contextual understanding, and enabling accurate classification of objects. These benefits collectively contribute to the algorithm's enhanced performance and accuracy in detecting objects of different shapes and sizes used during activities. Although this additional head incurs higher computational and memory costs, it results in better detection performance due to the utilization of low-level yet high-resolution feature maps. The model structure is shown in Fig. 8.
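To make the grid-cell confidence score in (13) concrete, the short sketch below computes the IoU between a predicted and a ground-truth box and multiplies it by the predicted class probability. The box format and function names are illustrative assumptions rather than part of the authors' code.

# Sketch: confidence score of Eq. (13) for one predicted box,
# using boxes in (x_min, y_min, x_max, y_max) format.
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence_score(class_prob, pred_box, truth_box):
    """Eq. (13): CS = P_{g,b} * IoU between prediction and ground truth."""
    return class_prob * iou(pred_box, truth_box)

# Example: a prediction that overlaps the ground truth fairly well.
print(confidence_score(0.9, (50, 50, 150, 150), (60, 60, 160, 160)))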




FIGURE 8. Enhanced object detection model for identifying objects used by a person during an action.

Finally, to improve the mAP and the object detection, a head-anchor-based detection network model is used. The loss function used in the training phase of the object detection model mainly includes the bounding box location loss (L_BIoU), the confidence loss (L_conf), and the classification loss (L_cl), as defined below:

L = L_BIoU + L_conf + L_cl    (14)

L_BIoU = 1 - IoU + d²/c² + αν    (15)

L_conf = Σ_{i=0}^{S²} Σ_{j=0}^{B} K [-log(p) + BCE(n̂, n)]    (16)

L_cl = Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{i,j}^{noobj} [-log(1 - p_c)]    (17)

BCE(n̂, n) = -n̂ log(n) - (1 - n̂) log(1 - n)    (18)

α = ν / ((1 - IoU) + ν)    (19)

ν = (4/π²) (arctan(w^gt / h^gt) - arctan(w / h))²    (20)

K = 1_{i,j}^{obj}    (21)

where IoU stands for the intersection over union ratio of the predicted and ground-truth bounding boxes, and c and d denote the diagonal distance of the smallest box enclosing the two bounding boxes and the distance between their centers, respectively. The ground-truth bounding box's width and height are denoted by w^gt and h^gt, respectively, whereas the predicted bounding box's width and height are denoted by w and h, respectively. S represents the total number of grids, while B is the number of anchors for each grid. When an object is found in the jth anchor of the ith grid, the weight K has a value of 1; otherwise, it has a value of 0. n and n̂ denote the predicted and actual classes of the jth anchor in the ith grid, respectively, and p denotes the probability of the object.

D. ACTIVITY RECOGNITION ALGORITHM

The aim of this research is to develop an activity recognition system that can identify different actions performed within an activity, in chronological order and in accordance with predefined SOPs, while also detecting the objects used in each action. To achieve this, we must focus on both the person's body movements and the objects used during the actions, as well as the chronological order of the actions. Initially, we employ the proposed pose estimation architecture to obtain 132 landmark values that capture the person's body movements during the activity. These landmarks represent keypoints on the body and provide essential spatial information for recognizing actions. The landmark values are then fed into the three-layer LSTM network, which analyzes the temporal dynamics of the landmarks and learns the patterns and sequences of actions performed in chronological order. Following the LSTM layers, two fully connected layers are applied for additional encoding. These layers help extract higher-level features and representations from the temporal information captured by the LSTM network. The output is then passed through a SoftMax layer, which assigns probability values to each recognized action, providing probability distributions that indicate the likelihood of each action being performed. Concurrently, we employ the improved YOLOv4 object detection model to identify the specific objects being used during each action. Finally, we develop an algorithm that combines the action detection and object detection models. By integrating these two components, we enable the model to accurately recognize a person's activity while considering both the predefined SOPs and the objects used. The algorithm takes into account the sequence of actions, matches it with the predefined SOPs, and identifies the relevant objects being used during each action. The pseudocode outlining the activity recognition algorithm is provided in Algorithm 1, which details the steps for integrating action detection, object detection, and adherence to the predefined SOPs.
Algorithm 1 Person's Activity Recognition
Define the expected action sequence of each activity as a list of strings.
Define the action and object combination condition.
Input: Read a video V^m(F_n, F_i), where F_n represents the sequence of frames, F_i represents the 132 landmark values of the ith frame, and n ≥ 60.
Output: A person's activity with the action sequence and the objects being utilized.
1. Initialization: action detection
2. Loop over the expected actions in the sequence.
3. For each expected action, read the first 60 frames, F_n = 60.
4. Check whether the previous 10 frames are the same, F_n[-10:] is same; then
5. If res > T, where res is the normalized output vector with probabilities of each possible outcome and the threshold T = 0.6, then
6. Check condition: action sequence (Table 1).
7. If the sequence of the action is true
8. Initialization: object detection
9. If the previous 10 frames detect the same object, F_n[-10:] is same, then
10. Check condition: action-object combination (Table 1).
11. If the combination condition is true
12. Output: action, then action++
13. Output: activity
14. Else, output an error message "wrong object detected".
15. Else, output an appropriate error message "wrong action sequence: Expected, action sequence[i]".
16. Close video.
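A compact Python rendering of Algorithm 1 is sketched below. The SOP table, the helper callables detect_action and detect_object, and the example activity entry are illustrative assumptions that mirror the pseudocode rather than the authors' implementation.

# Sketch: control flow of Algorithm 1 (action sequence + action-object check).
# SOPS maps each activity to its expected (action, object) sequence, cf. Table 1.
SOPS = {
    "laundry": [("open lid", "washing machine"), ("load clothes", "clothes"),
                ("pour detergent", "detergent"), ("close lid", "washing machine")],
}
T = 0.6  # probability threshold for accepting an action prediction

def recognize_activity(windows, activity, detect_action, detect_object):
    """windows: iterator yielding consecutive 60-frame windows of a video;
    detect_action / detect_object wrap the trained models and return (label, probability)."""
    for step, (expected_action, expected_object) in enumerate(SOPS[activity]):
        window = next(windows)                       # read the next 60 frames
        action, prob = detect_action(window)         # LSTM action classifier
        if prob < T or action != expected_action:    # sequence check (steps 4-7)
            return f"wrong action sequence: expected '{expected_action}' at step {step}"
        obj, _ = detect_object(window[-10:])          # enhanced YOLOv4 on the last 10 frames
        if obj != expected_object:                    # action-object check (steps 9-11)
            return "wrong object detected"
    return f"activity '{activity}' completed according to its SOP"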

TABLE 1. Activities and corresponding chronological order of actions and objects used.

IV. EXPERIMENTAL SETUP AND RESULTS

A. DATASET DESCRIPTION

The focus of our study is primarily on three tasks. The first task involves identifying a person's actions, while the second task involves detecting the object used during the actions. The final task is to recognize the activity based on the sequence of actions. Despite the abundance of online datasets available for data acquisition, most of them focus solely on action detection and disregard the objects utilized during the actions and the sequence of the actions in the activity. Therefore, it becomes challenging to acquire a dataset for this kind of task. In this context, this research employs the approach of using our own video and image dataset. We have gathered an extensive collection of 243 videos depicting 27 distinct actions, where each action entails the use of an object. These actions are performed in a sequence with varying objects, forming distinct activities. As elaborated in Table 1, five activities were utilized, each with a distinct chronological order of actions and the corresponding objects used during these activities. The term 'action' here refers to the movement of the body while using an object, whereas 'activity' refers to the complete work being carried out. Given that each action is composed of a sequence of frames, we compiled F_n = 60 frames for each action while developing the proposed action detection model.

To develop our model, we utilize an approach that focuses solely on the objects being utilized by individuals during actions. We treat the person's hand and the object as one entity, disregarding any similar objects of the same class in the same frame that are not being used during the activity. The dataset particulars for the object detection model are given in Table 2.

B. EVALUATION METRICS

The performance of the proposed models was validated using a number of performance indicators, such as accuracy, precision, recall, and F1 score. These performance measurements are calculated using four parameters: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The aforementioned performance metrics are defined as follows.

1) ACCURACY
It defines the ratio of correctly detected activities over the total data:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (22)

2) PRECISION
It defines the ratio of a person's activities correctly detected over the total detected videos:

Precision = TP / (TP + FP)    (23)

3) RECALL
It defines the ratio of videos correctly detected as an activity to the total videos of that activity:

Recall = TP / (TP + FN)    (24)

4) F1 SCORE
The harmonic mean of precision and recall. The model performance is effectively summarized by this metric, which is calculated as follows:

F1 score = 2 × (precision × recall) / (precision + recall)    (25)

5) AP
The area under the precision-recall curve, denoted Average Precision, is defined as follows:

AP = ∫_0^1 P(r) dr    (26)

where P and r are the precision and recall, respectively. Precision and recall have values between 0 and 1. Finally, after calculating the AP values of the activities, the mean average precision (mAP) is calculated as follows:

mAP = (AP_1 + AP_2 + ... + AP_n) / n    (27)
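The metrics in (22)-(27) can be computed directly from the counts; the short sketch below is a plain NumPy illustration, where trapezoidal integration of the precision-recall curve is one common way to approximate the integral in (26).

# Sketch: evaluation metrics of Eqs. (22)-(27) from raw counts and a PR curve.
import numpy as np

def basic_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Eq. (22)
    precision = tp / (tp + fp) if tp + fp else 0.0         # Eq. (23)
    recall = tp / (tp + fn) if tp + fn else 0.0             # Eq. (24)
    f1 = (2 * precision * recall / (precision + recall)     # Eq. (25)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

def average_precision(precisions, recalls):
    """Eq. (26): area under the precision-recall curve (trapezoidal approximation)."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))

def mean_average_precision(ap_per_class):
    """Eq. (27): mean of the per-class AP values."""
    return float(np.mean(ap_per_class))

print(basic_metrics(tp=90, tn=85, fp=5, fn=10))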
C. ACTION DETECTION RESULT

A collection of F_n = 60 frames, each of which contains F_i = 132 landmark values, is obtained from each action video using our pose estimation and landmark extraction approach. Before feeding these values to the LSTM network for action detection, the entire video dataset was split into training and test datasets in an 8:2 ratio. We used the Adam optimizer [36] to train our network for 150 epochs in an effort to reduce the loss. The categorical cross-entropy loss function is used since the action detection model has twenty-seven classes. The action detection model achieved a test accuracy of 95.91% after training. Fig. 9 shows the normalized confusion matrix generated from the predictions made by the proposed action detection model on the test dataset. The results indicate that the model achieved high accuracy in recognizing the majority of the actions. However, some similar actions, such as opening or closing a bottle and wearing socks or shoes, were sometimes misclassified as false positives. This is likely due to the almost identical nature of these actions.

FIGURE 9. Normalized confusion matrix created using the predictions of the proposed action detection model on the test dataset.

TABLE 2. Description of the object detection dataset.

To evaluate the quality of our model, we used OpenPose and DeepPose as the standard references and trained two models, one with and the other without the proposed interpolation technique, using different recurrent neural networks, i.e., GRU (gated recurrent units) and LSTM, as shown in Table 3. Although the OpenPose model shows slightly better performance than the other estimation models, our approach with both networks performs much faster than the rest. This is because the proposed model only employs a two-step detector and tracker inference pipeline, where the detector only runs on the first frame or until a person's face is detected, and then the tracker is used to track the person in consecutive frames. To forecast heatmaps for all landmarks, we additionally employ a compact encoder-decoder network design, followed by another encoder that regresses directly to the landmark coordinates, allowing the model to become lighter and run faster in real-time inference. Also, the model trained with the interpolation technique performs better, as it uses well-interpolated landmarks for undetected and wrong-detected landmark frames. Furthermore, LSTM achieves better accuracy than GRU with the different pose estimation algorithms, because GRU has a simpler structure: it has only two gates (reset and update gates) and uses fewer training parameters. Consequently, GRU consumes less memory, executes faster, and trains faster than LSTM, whereas LSTM achieves better accuracy on datasets with longer sequences. The output results of the proposed action detection model are shown in Fig. 10.
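A brief sketch of the training configuration described above (8:2 split, Adam optimizer, categorical cross-entropy, 150 epochs) is given below, assuming the Keras model sketched in Section III-B and scikit-learn's train_test_split; the file names, batch size, and random seed are placeholders not stated in the paper.

# Sketch: training setup for the action detection model (8:2 split, Adam, 150 epochs).
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# X: (num_clips, 60, 132) landmark sequences; y: integer action labels in [0, 27).
X = np.load("landmark_sequences.npy")   # assumed pre-extracted feature file
y = np.load("action_labels.npy")

X_train, X_test, y_train, y_test = train_test_split(
    X, to_categorical(y, num_classes=27), test_size=0.2, random_state=42)

model = build_action_model()             # defined in the earlier model sketch
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=150, batch_size=16)
print("test accuracy:", model.evaluate(X_test, y_test)[1])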




FIGURE 10. Results of action detection using LSTM network and interpolated body landmarks obtained from pose estimation network.

TABLE 3. Performance comparison of various action detection models.

D. OBJECT DETECTION RESULT

Using the dataset listed in Table 2, the performance of the object detection model for the suggested person activity recognition system was evaluated. Before feeding the datasets into our object detection model, we randomly divided the data into 80% for training and split the remaining data into 10% for validation and 10% for test. The input images were also resized to 416 × 416 before being passed into training. After training for 500 epochs with the Adam optimizer to reduce the overall loss, and with an initial learning rate of 0.0001, the proposed object detection model achieves an overall mAP of 97.68% for detecting the objects being used while performing the actions. Fig. 11 shows the detection of the object being utilized by the person using the enhanced YOLOv4, disregarding any similar objects of the same class in the same frame that are not being used during the activity. For example, when the person is putting on the right sock, the model does not detect the left sock. This is because we consider the person's hand and the object being used as a single entity. Similarly, when the person is loading clothes into the washing machine, the model does not detect other objects such as the washing machine lid or buttons, as they are not relevant to the action.

A performance comparison of the different object detection models is shown in Table 4. It is clear that when IoU = 0.5, Faster R-CNN has a higher mAP but the lowest FPS among the compared models. This signifies that two-stage detection algorithms generally have higher detection accuracy but weaker real-time performance. Meanwhile, the FPS and mAP of our model are reasonably high when compared to other algorithms. Although our model is a little slower than the original YOLOv4 due to the extra computational load from the additional head, it delivers superior object detection performance for every frame in the video. This is due to the advantage of having an extra head that allows the model to detect objects of varying sizes with better accuracy.




FIGURE 11. Object detection results using enhanced YOLOv4 algorithm to identify objects used during actions.

TABLE 4. Performance comparison with other object detection models.

TABLE 5. Comparisons of various activity recognition models.

Considering both the mAP and FPS metrics, the proposed method is the most suitable for detecting the objects used during an activity.

E. RECOGNITION OF THE PERSON'S ACTIVITY

To achieve real-time predictions with the activity recognition model, we employ the proposed recognition algorithm outlined in Section III-D to analyze the person's activity output. We begin by looping through the frames with OpenCV and appending them. Once we have accumulated a set of 60 frames (F_n = 60), we feed them into the proposed action detection model. This model checks the action sequence and also detects the object being used during the action by examining the action and object combination (Table 1). The results from the activity recognition model are depicted in Fig. 12.

Table 5 presents the outcomes of using the proposed action detection model with different state-of-the-art object detection models for activity recognition. The results reveal that the Faster R-CNN object detection model combined with the proposed action detection model has a high mAP but a lower FPS compared to other models, making it unsuitable for real-time activity prediction.




FIGURE 12. The output of the proposed activity recognition model. The model identifies different actions that are performed in a chronological order and the objects utilized during each action.

However, the primary goal of this research is to recognize a person's activities by detecting action sequences and interactive objects in real time. Thus, we require a model that can quickly identify a person's actions and detect objects. According to the experimental findings, the enhanced YOLOv4 model combined with the proposed action detection model achieves a higher FPS and a reasonably high mAP, suggesting that this model is more suitable for the recognition problem.




Furthermore, it is worth noting that running the action detection and object detection models independently allows them to maximize their processing capabilities. Conversely, when these two models are integrated, there is an additional coordination overhead, resulting in a slight decrease in frames per second (FPS) compared to individual execution. Nonetheless, the integration offers the advantage of precise activity recognition by incorporating both actions and objects, thereby enabling a more profound comprehension of the activity at hand.

V. CONCLUSION

The proposed model incorporated a lightweight CNN-optimized top-down human pose estimation architecture to find the body landmarks from a sequence of frames, followed by interpolation to enhance the accuracy of pose estimation for undetected or wrong-detected landmarks. The transformed landmark values were then fed to multiple layers of an LSTM network, culminating in a SoftMax layer to predict the person's actions. Additionally, an object detection model was developed by enhancing YOLOv4 to detect the object used during the actions. Finally, the proposed activity recognition algorithm integrated these two models to create a real-time, lightweight, and robust activity recognition model. Our model achieved 95.91% accuracy in recognizing actions and 97.68% mAP in detecting the object used during the actions, with an overall FPS of 10.47. This model can help monitor and inspect human activities that follow a chronological order of actions when interacting with different objects within the activity. In manufacturing and assembly, our activity recognition model can be utilized to ensure workers follow predefined sequences when using tools and components, boosting efficiency and quality control. In sports analysis, it can accurately track players' movements, recognize the techniques and equipment used, and provide valuable insights for coaching and strategic analysis. In healthcare and rehabilitation, it can assist in monitoring patients' activities during therapy and offer real-time feedback to improve outcomes. In industrial environments, it can analyze workers' actions and equipment interactions to ensure safety compliance.

In the future, we plan to enhance the proposed method to recognize activities in industrial working environments and to detect additional objects such as helmets, gloves, masks, and shoes to ensure individual safety and prevent industrial accidents. Additionally, we aim to improve the FPS of our model without compromising accuracy by exploring model optimization techniques, leveraging hardware acceleration, considering algorithmic improvements, and upgrading the hardware infrastructure.

REFERENCES
[1] Q. Wu, Y. Wu, Y. Zhang, and L. Zhang, "A local-global estimator based on large kernel CNN and transformer for human pose estimation and running pose measurement," IEEE Trans. Instrum. Meas., vol. 71, pp. 1-12, 2022.
[2] F. Rustam, A. A. Reshi, I. Ashraf, A. Mehmood, S. Ullah, D. M. Khan, and G. S. Choi, "Sensor-based human activity recognition using deep stacked multilayered perceptron model," IEEE Access, vol. 8, pp. 218898-218910, 2020.
[3] C. Xu, D. Chai, J. He, X. Zhang, and S. Duan, "InnoHAR: A deep neural network for complex human activity recognition," IEEE Access, vol. 7, pp. 9893-9902, 2019.
[4] T. Zebin, P. J. Scully, N. Peek, A. J. Casson, and K. B. Ozanyan, "Design and implementation of a convolutional neural network on an edge computing smartphone for human activity recognition," IEEE Access, vol. 7, pp. 133509-133520, 2019.
[5] Y. Li, C. Wang, Y. Cao, B. Liu, J. Tan, and Y. Luo, "Human pose estimation based in-home lower body rehabilitation system," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Glasgow, U.K., Jul. 2020, pp. 1-8.
[6] W. Liu, X. Liu, Y. Hu, J. Shi, X. Chen, J. Zhao, S. Wang, and Q. Hu, "Fall detection for shipboard seafarers based on optimized BlazePose and LSTM," Sensors, vol. 22, no. 14, pp. 5449-5466, Jul. 2022.
[7] M. Abbas and R. L. B. Jeannès, "Exploiting local temporal characteristics via multinomial decomposition algorithm for real-time activity recognition," IEEE Trans. Instrum. Meas., vol. 70, pp. 1-11, 2021.
[8] W. Huang, L. Zhang, W. Gao, F. Min, and J. He, "Shallow convolutional neural networks for human activity recognition using wearable sensors," IEEE Trans. Instrum. Meas., vol. 70, pp. 1-11, 2021.
[9] Y. Zhang, G. Tian, S. Zhang, and C. Li, "A knowledge-based approach for multiagent collaboration in smart home: From activity recognition to guidance service," IEEE Trans. Instrum. Meas., vol. 69, no. 2, pp. 317-329, Feb. 2020.
[10] N. A. Capela, E. D. Lemaire, and N. Baddour, "Feature selection for wearable smartphone-based human activity recognition with able bodied, elderly, and stroke patients," PLoS ONE, vol. 10, no. 4, pp. 1-18, Apr. 2015.
[11] A. Prati, C. Shan, and K. I.-K. Wang, "Sensors, vision and networks: From video surveillance to activity recognition and health monitoring," J. Ambient Intell. Smart Environ., vol. 11, no. 1, pp. 5-22, Jan. 2019.
[12] S. Sankar, P. Srinivasan, and R. Saravanakumar, "Internet of Things based ambient assisted living for elderly people health monitoring," Res. J. Pharmacy Technol., vol. 11, no. 9, pp. 3900-3904, Dec. 2018.
[13] E. Zdravevski, P. Lameski, V. Trajkovik, A. Kulakov, I. Chorbev, R. Goleva, N. Pombo, and N. Garcia, "Improving activity recognition accuracy in ambient-assisted living systems by automated feature engineering," IEEE Access, vol. 5, pp. 5262-5280, 2017.
[14] X. Ji, J. Cheng, W. Feng, and D. Tao, "Skeleton embedded motion body partition for human action recognition using depth sequences," Signal Process., vol. 143, pp. 56-68, Feb. 2018.
[15] A. Jalal, Y.-H. Kim, Y.-J. Kim, S. Kamal, and D. Kim, "Robust human activity recognition from depth video using spatiotemporal multi-fused features," Pattern Recognit., vol. 61, pp. 295-308, Jan. 2017.
[16] C. Xu, L. N. Govindarajan, and L. Cheng, "Hand action detection from ego-centric depth sequences with error-correcting Hough transform," Pattern Recognit., vol. 72, pp. 494-503, Dec. 2017.
[17] O. K. Oyedotun and A. Khashman, "Deep learning in vision-based static hand gesture recognition," Neural Comput. Appl., vol. 28, no. 12, pp. 3941-3951, Apr. 2016.
[18] L. Pigou, A. van den Oord, S. Dieleman, M. Van Herreweghe, and J. Dambre, "Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video," Int. J. Comput. Vis., vol. 126, nos. 2-4, pp. 430-439, Oct. 2016.
[19] J. Qi, P. Yang, M. Hanneghan, S. Tang, and B. Zhou, "A hybrid hierarchical framework for gym physical activity recognition and measurement using wearable sensors," IEEE Internet Things J., vol. 6, no. 2, pp. 1384-1393, Apr. 2019.
[20] C. Aviles-Cruz, E. Rodriguez-Martinez, J. Villegas-Cortez, and A. Ferreyra-Ramirez, "Granger-causality: An efficient single user movement recognition using a smartphone accelerometer sensor," Pattern Recognit. Lett., vol. 125, pp. 576-583, Jul. 2019.
[21] I. Jegham, A. B. Khalifa, I. Alouani, and M. A. Mahjoub, "Vision-based human action recognition: An overview and real world challenges," Forensic Sci. Int., Digit. Invest., vol. 32, Mar. 2020, Art. no. 200901.
[22] V. Bazarevsky, I. Grishchenko, K. Raveendran, T. Zhu, F. Zhang, and M. Grundmann, "BlazePose: On-device real-time body pose tracking," 2020, arXiv:2006.10204.
[23] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Columbus, OH, USA, Jun. 2014, pp. 1653-1660.
[24] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 1, pp. 172-186, Jan. 2021.




[25] W. Li, L. Wen, M. Chang, S. N. Lim, and S. Lyu, "Adaptive RNN tree for large-scale human action recognition," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 1453-1461.
[26] Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 1110-1118.
[27] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, "Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks," in Proc. 30th AAAI Conf. Artif. Intell., Phoenix, AZ, USA, Feb. 2016, pp. 12-17.
[28] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 779-788.
[29] W. Liu, Z. Liu, Y. Li, H. Wang, C. Yang, D. Wang, and D. Zhai, "An automatic loose defect detection method for catenary bracing wire components using deep convolutional neural networks and image processing," IEEE Trans. Instrum. Meas., vol. 70, pp. 1-14, 2021.
[30] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele, "DeepCut: Joint subset partition and labeling for multi person pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 4929-4937.
[31] J. Taylor, J. Shotton, T. Sharp, and A. Fitzgibbon, "The vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, Jun. 2012, pp. 103-110.
[32] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735-1780, Nov. 1997.
[33] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[34] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[35] C. Wang, H. Mark Liao, Y. Wu, P. Chen, J. Hsieh, and I. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Seattle, WA, USA, Jun. 2020, pp. 1571-1580.
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.

YO-PING HUANG (Fellow, IEEE) received the Ph.D. degree in electrical engineering from Texas Tech University, Lubbock, TX, USA. He was a Professor and the Dean of Research and Development, the Dean of the College of Electrical Engineering and Computer Science, and the Department Chair with Tatung University, Taipei. He is currently the President of the National Penghu University of Science and Technology, Penghu, Taiwan. He is also a Chair Professor with the Department of Electrical Engineering, National Taipei University of Technology, Taipei, where he was the Secretary General. His current research interests include fuzzy system design and modeling, deep learning modeling, intelligent control, medical data mining, and rehabilitation systems design.
Dr. Huang is a fellow of IET, CACS, TFSA, and the International Association of Grey System and Uncertain Analysis. He was a recipient of the 2021 Outstanding Research Award from the Ministry of Science and Technology, Taiwan. He serves as the IEEE SMCS VP for Conferences and Meetings and the Chair of the IEEE SMCS Technical Committee on Intelligent Transportation Systems. He was the IEEE SMCS BoG, the President of the Taiwan Association of Systems Science and Engineering, the Chair of the IEEE SMCS Taipei Chapter and the IEEE CIS Taipei Chapter, and the CEO of the Joint Commission of Technological and Vocational College Admission Committee, Taiwan.

SATCHIDANAND KSHETRIMAYUM received the B.Tech. degree in computer science and engineering from the National Institute of Technology Manipur, India, and the M.Tech. degree in operations research from the National Institute of Technology Durgapur, India. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering, National Taipei University of Technology, Taipei, Taiwan. His current research interests include human activity recognition (HAR), computer vision, deep learning, and image processing.

CHUN-TING CHIANG received the bachelor's degree from the Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan. He is currently pursuing the master's degree in electrical engineering with the National Taipei University of Technology, Taipei, Taiwan. His current research interests include machine learning, deep learning, and image processing.

