Object-Based Hybrid Deep Learning Technique For Recognition of Sequential Actions
ABSTRACT Using different objects or tools to perform activities in a step-by-step manner is common practice in many settings, including workplaces, households, and recreational activities. However, this practice can pose hazards if the correct sequence of actions is not followed or if an object or tool is not used at the appropriate step; it must therefore be addressed to ensure safety and efficiency. These issues have garnered significant attention in recent years. Previous research has relied on body keypoints to detect actions, but not the objects or tools used during the activity. The lack of a system that identifies the target objects or tools being used while tasks are performed increases the risk of accidents and mishaps during the process. This study addresses the aforementioned issue by introducing a model that is both efficient and robust. The model uses video data to monitor and identify daily activities, as well as the objects involved in the process, enabling real-time feedback and alerts to enhance safety and productivity. The proposed model separates the overall recognition process into two components. First, it uses the BlazePose architecture for pose estimation and interpolates any undetected or wrong-detected landmarks to improve the precision of the posture estimation; the resulting features are forwarded to a long short-term memory network to identify the actions performed during the activity. Second, the model employs an enhanced YOLOv4 algorithm for object detection to accurately identify the objects used in the course of the activity. The resulting activity recognition model achieves a 95.91% accuracy rate in identifying actions, a mean average precision of 97.68% for detecting objects, and an overall processing rate of 10.47 frames per second.
INDEX TERMS Human activity recognition, long short-term memory (LSTM), object detection, pose
estimation, standard operating procedures (SOPs).
step-by-step processes to complete the task. Human pose estimation (HPE) is a popular research field in computer vision that plays a significant role in activity recognition [1], [2], [3]. The majority of these techniques rely on optical sensors that capture RGB images in order to determine body landmarks and the overall position. It is also possible to combine HPE with other computer vision technologies for 3D animation, fitness, virtual and augmented reality, and rehabilitation [4], [5], [6]. HAR, on the other hand, is a crucial computer vision task that enables machines to examine the body landmarks identified by HPE models and comprehend various human activities [7], [8], [9]. Many researchers have been driven to advance HAR systems in real-world settings by the rapid growth of artificial intelligence, smartphones, and CCTV systems. This drive has been motivated by the role of HAR systems in health, security, and behavioral studies. Some of their applications include patient monitoring systems [10], [11], ambient assisted living (AAL) [12], [13], surveillance systems [14], [15], gesture recognition [16], [17], behavior analysis [18], and a range of healthcare systems [19], [20].

In particular, vision-based human activity recognition systems, which evaluate input in the form of video or images to identify performed activities, are quite complicated. This is because the appearance of the body changes dynamically due to various types of clothing, occlusions caused by viewing angles, background context, etc. [21], and performance degrades when the occlusion is severe. It is also worth noting that the majority of current studies only address the recognition of an action, and none gives insight into the object used during the activity.

Fig. 1 shows some example pictures of confusing cases, where a person performs an action with and without an object, together with their skeleton representation generated from body landmarks. The physical differences between some actions are very small or even identical, making it difficult to identify activities that look identical yet involve interaction with different objects, such as household, recreational, and workplace activities involving machine operation, material movement, maintenance, assembly, product and process design, etc.

Therefore, with the growing popularity of HAR and object detection in the computer vision field, a system that can accurately recognize the action sequence in an activity as well as detect the objects used during the activity would be of profound benefit. This would aid in analyzing and monitoring a person's activity to determine whether they are adhering to the SOPs with the appropriate objects.

The goal of this research is to create an activity recognition model that, from video information, can detect a person's action sequence as well as the objects being used while the activity is performed. To achieve this, the person's pose is first estimated using BlazePose [22], undetected or wrong-detected landmarks are interpolated using a linear interpolation method, and the resulting information is processed by a recurrent neural network that can learn sequential order dependency, known as long short-term memory (LSTM). Object detection is carried out in the second part using an enhanced YOLOv4 algorithm to recognize the object in the person's hand while they are performing the activity. Finally, a lightweight and robust system for recognizing a person's activities is created by combining the two models. Fig. 2 depicts the suggested architecture. Three challenges are addressed in this study: (1) human pose estimation-based action detection using LSTM, (2) an object detection model to detect objects being used in an activity, and (3) an activity recognition model to classify the overall activity.
2) We proposed a technique to improve the accuracy of a person's pose estimation by interpolating the undetected and wrong-detected landmarks.
3) The object detection algorithm is further enhanced by introducing an extra YOLO head to detect the various objects of different shapes and sizes used by the person while performing the activity.
4) An activity recognition model is developed that can recognize the different actions performed within the activity in chronological order, in accordance with the predefined SOPs as well as the object being used.

This paper is organized as follows. A comprehensive literature review of existing related work is provided in Section II. The proposed methodology is described in Section III. Section IV presents the training dataset, experimental results, and discussions. In Section V, the conclusion and future research directions are given.

II. RELATED WORK
Artificial intelligence (AI) models that estimate body keypoints to characterize body position have become a potentially effective tool for assessing human actions. More specifically, convolutional neural networks (CNNs) are frequently used in human pose estimation to forecast a person's position by performing inference on input videos or images [1], [2]. Due to the numerous conceivable human positions, the high degree of freedom, appearance changes such as illumination and clothing, environmental changes, and occlusions, determining precise pixel coordinates of body keypoints is a challenging process [3]. Despite these challenges, a number of reliable models have been developed that function admirably in applications including sports training, rehabilitation, and fall detection [4], [5], [6]. While pose estimation models have been successful in these applications, accurate keypoint identification is still needed to track a person's activity, because engaging in the wrong activity can have adverse effects on production lines.

For body joint coordinate-based action recognition, the human pose estimation problem is formulated as a CNN-based regression problem toward body joints by the holistic model DeepPose [23]. Additionally, it employs a cascade of these regressors to improve the pose estimation. However, regression to XY locations is challenging and raises learning complexity, which inhibits generalization and results in subpar performance in some regions. A real-time multi-person posture estimation architecture designed for desktop settings, called OpenPose [24], was proposed as a solution and is commonly used in the pose estimation community. It generates a feature representation by first analyzing the image using the first 10 layers of the VGG-19 architecture. The captured feature representation is then fed into a two-branch multi-stage CNN to generate part confidence maps and vector fields of part affinities. One branch forecasts a collection of 2D body part confidence maps; the other indicates the relationship of parts through 2D vector fields of part affinities. These two branches are used to carry out K-partite graph matching for multi-person pose estimation. The primary drawbacks of this system are that it processes at only 0.4 frames per second, demands a lot of computational power, and is therefore difficult to apply to real-time videos. A two-step detector-tracker inference pipeline is used by Google's (Mountain View, CA) BlazePose model [22], where the detector is employed on the initial frame (and re-run until a person is detected) and the tracker then follows the person in consecutive frames. To predict heatmaps for each joint, this model employs an encoder-decoder network design followed by another encoder that regresses directly to the coordinates of all joints. It is well suited to estimating human pose for activity recognition due to its lightweight design and real-time inference capability. However, it may fail to detect body landmarks under large changes in appearance, clothing, and occlusions.

Recent advances in effective motion capture technologies and posture assessment algorithms have made it easier to obtain information about human joint coordinates. As a result, joint coordinate-based action recognition using deep learning methods has significantly outperformed previous methods in recent years and has become the standard approach. The recurrent neural network (RNN) [25] is now one of the most used frameworks in joint coordinate-based action recognition because of its ability to analyze sequential data. A hierarchical RNN [26] was proposed to classify activities based on skeleton data. An advanced LSTM network [27] that is fully coupled and includes a regularization strategy was developed to acquire the high-level temporal aspects of skeleton information. All these approaches rely on the RNN architecture and aim to improve action recognition while failing to recognize the object being used. Thus, many significant recognition errors occur among physically similar classes of person activity. The primary cause of these recognition errors is that such activities differ by tiny or similar body movements yet involve interaction with different objects.

Our work belongs to activity recognition, but it focuses on both the body movement of the person and the interacted objects, which has not been considered in the above methods. In this study, we modified YOLO (you only look once) [28] to enhance its ability to detect various objects of different shapes and sizes that are used by individuals while performing activities. The proposed method is a single convolutional network that predicts multiple bounding boxes and class probabilities from a single image frame in a single evaluation. By improving the accuracy of object detection, our model can provide a more comprehensive understanding of the actions being performed. This makes the proposed model suitable for a wide range of applications, including human activity recognition and surveillance.

III. PROPOSED METHODOLOGY
The proposed approach aims to develop a framework that is both lightweight and robust for classifying sequential actions in an activity. This framework focuses on capturing
the radius of a circle that encloses the entire body, and the angle of inclination of the line joining the midpoints of the shoulders and hips [31]. This also helps in tracking extremely complex situations in any kind of person's activity.
The model used an encoder-decoder network architecture to predict heatmaps for every joint of the person, followed by a second encoder that regresses directly to every landmark (joint coordinate). Then, to make the model lightweight enough to run on a low-end computer, the heatmap output is removed during inference, as shown in Fig. 6.

FIGURE 6. Architecture of the landmark detector network.

A list of 33 landmarks is returned by the architecture. Each landmark is represented as x, y, z, and v (the visibility). The coordinates x and y show where a particular joint of the person is located, normalized to the range between 0 and 1 by the image's width and height. z stands for the depth of the landmark, with the depth at the center of the hips as its origin. The term v describes whether or not a landmark is visible in the frame.

The scale and position of the person affect the landmarks that the pose estimation network generates, so the landmarks are transformed to become independent of the position and scale in the frame. Otherwise, the same person performing the same action could produce different landmark values in different frames depending on where they are in the frame. We collect these landmark values and save them as frame values to represent the sequence of events in an activity. For an activity video, V^m = [F_1, F_2, ..., F_n] is a matrix of pose-vectors with K landmarks, where V^m contains n frames of the person conducting the actions. Each frame consists of:

F_i = [l_i^1, l_i^2, ..., l_i^K],  i ∈ [1, n]    (1)

Since our model generates 33 landmarks (K = 33), the resulting vector has a length of 132 landmark values and the format:

F_i = [x_i^1, y_i^1, z_i^1, v_i^1, x_i^2, y_i^2, z_i^2, v_i^2, ..., x_i^33, y_i^33, z_i^33, v_i^33]    (2)

Depending on the recording settings and conditions, landmarks may be undetected or wrong-detected when CNN-based pose estimation models are applied to a video taken by a general camera. Action detection and analysis are negatively impacted by this kind of inaccurate landmark detection. To overcome this issue, we have incorporated interpolation techniques in conjunction with the BlazePose architecture. These techniques play a crucial role in enhancing the accuracy of posture estimation by effectively addressing any undetected or wrong-detected landmarks. Through interpolation, we fill in the gaps and correct any inaccuracies, ultimately boosting the overall precision of the posture estimation process. To do this, we exploit time-series correlations between identical body joints across several frames, because the estimated human pose is a collection of time-series data.

When landmarks in BlazePose cannot be detected, their x and y coordinate values are always 0. In this study, for person w's landmark l_w^f in frame f, if l_w^{f-1} and l_w^{f+1} are detected but l_w^f is not, we label frame f as an "undetected landmark frame" f′:

f′ = f,  where l_w^f = (0, 0), l_w^{f-1} ≠ (0, 0), and l_w^{f+1} ≠ (0, 0)    (3)

Similarly, for person w's landmark l_w^f in frame f, if l_w^{f-1} and l_w^{f+1} are detected but l_w^f is wrong-detected, we label frame f as a "wrong-detected landmark frame" f′′. Here we rely on the difference δ^f, defined as the spatial distance of landmark l_w between the two consecutive frames f − 1 and f, measured in pixels. Because resolutions and frame rates vary with the input video, we do not wish to specify an absolute threshold for δ^f; instead, we set a threshold θ on the ratio between the differences δ^f and δ^{f−1}:

f′′ = f,  where δ^f > θ · δ^{f−1}, l_w^{f-1} ≠ (0, 0), and l_w^{f+1} ≠ (0, 0)    (4)

The percentage of frames flagged as wrong-detected that were not actually wrong-detected was lower when the threshold was set to θ = 3. As a result, we use θ = 3 as the threshold in this study so that only frames that are clearly wrong-detected are interpolated. In this manner, we identify wrong-detected landmark frames according to the relative change of every landmark. Both undetected and wrong-detected landmark frames are then interpolated using the landmark coordinate information of the previous and following frames.
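To make the frame representation of Eqs. (1) and (2) concrete, the following is a minimal sketch of how a 132-value pose vector could be assembled from BlazePose output via the MediaPipe Python package; the helper names (frame_to_pose_vector, video_to_pose_matrix) and the zero-vector convention for missed detections are illustrative assumptions, not the authors' released code.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_pose = mp.solutions.pose

def frame_to_pose_vector(frame_bgr, pose):
    """Return a 132-value vector [x1, y1, z1, v1, ..., x33, y33, z33, v33] (Eq. 2).

    If BlazePose detects no person, a zero vector is returned; the later
    interpolation step treats such frames as undetected landmark frames.
    """
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = pose.process(rgb)
    if result.pose_landmarks is None:
        return np.zeros(33 * 4, dtype=np.float32)
    values = []
    for lm in result.pose_landmarks.landmark:  # 33 landmarks, x/y normalized to [0, 1]
        values.extend([lm.x, lm.y, lm.z, lm.visibility])
    return np.asarray(values, dtype=np.float32)

def video_to_pose_matrix(video_path, n_frames=60):
    """Stack per-frame vectors into the matrix V^m = [F_1, ..., F_n] (Eq. 1)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    with mp_pose.Pose(static_image_mode=False) as pose:
        while len(frames) < n_frames:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(frame_to_pose_vector(frame, pose))
    cap.release()
    return np.stack(frames) if frames else np.empty((0, 132), dtype=np.float32)
```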
It is crucial to extract the person's coordinate values from the neighboring frames in order to interpolate the missing values. We use linear interpolation to interpolate landmarks for undetected and wrong-detected landmark frames. This is based on the observation that a person's action does not change significantly over a short period of time; in most cases, the undetected or wrong-detected landmark l_w^f will be located close to the midpoint of landmarks l_w^{f−1} and l_w^{f+1}.

For an undetected frame f′, let the landmarks of person w_{f′} in frames f′ − 1 and f′ + 1 be l_w^{f′−1} and l_w^{f′+1}, respectively. We apply linear interpolation to the landmark l_w^{f′} whose x and y coordinate values are both 0:

l_w^{f′} = (l_w^{f′−1} + l_w^{f′+1}) / 2    (5)

For a wrong-detected frame f′′, let the landmarks of person w_{f′′} in frames f′′ − 1 and f′′ + 1 be l_w^{f′′−1} and l_w^{f′′+1}, respectively. We apply the same interpolation to the landmark l_w^{f′′} whose difference δ_{w,l}^{f′′} is larger than θ · δ_{w,l}^{f′′−1}:

l_w^{f′′} = (l_w^{f′′−1} + l_w^{f′′+1}) / 2    (6)

This combination of the BlazePose architecture and the proposed interpolation techniques results in a model that not only provides more reliable estimations of human posture but also exhibits enhanced robustness across diverse scenarios. By successfully handling challenging scenarios and adapting to various body types, clothing variations, and environmental conditions, our model ensures consistent and accurate posture estimations.

The interpolated landmark sequences are then fed to a long short-term memory (LSTM) network [32] to detect the actions performed in the video. The forget gate f_t and the input gate i_t of the LSTM cell are first computed from the current input and the previous hidden state:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)    (7)

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)    (8)

where x_t denotes the input data; f_t and i_t denote the forget and input gate outputs, respectively; h_{t−1} denotes the previous hidden state; and σ indicates the sigmoid function. Then, the intermediate cell state is calculated by:

c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)    (9)

The cell state c_{t−1} and c̃_t are then used to update the cell state c_t:

c_t = f_t · c_{t−1} + i_t · c̃_t    (10)

where · represents element-wise multiplication. The output gate o_t is derived by:

o_t = σ(W_o x_t + U_o h_{t−1} + b_o)    (11)

and the output h_t is obtained as:

h_t = o_t · tanh(c_t)    (12)

To classify the actions, the input video is processed in the form (V^m, F_n, F_i), where V^m is the action video, F_n is the number of frames in the video, and F_i is the vector of coordinate values of the 33 landmarks. It is fed into the first LSTM layer with 64 LSTM units, followed by 128 units in the second layer and 64 units in the third layer. The output of the LSTM layers is passed through two dense layers with 64 and 32 neurons, respectively, for additional encoding, and then to a SoftMax layer, which returns the probability that the input video belongs to a particular action, as shown in Fig. 7. The prediction with the highest probability is then taken as the class of the person's action.
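As a rough illustration of how the flagging and repair rules of Eqs. (3)-(6) could be applied in practice, the sketch below post-processes a sequence of per-frame landmark arrays; the function name, the array layout, and the choice to average the z and visibility channels along with x and y are assumptions made only for this example.

```python
import numpy as np

THETA = 3.0  # ratio threshold used to flag wrong-detected landmarks (Eq. 4)

def interpolate_landmarks(seq):
    """Repair undetected / wrong-detected landmarks in a pose sequence.

    seq: array of shape (n_frames, 33, 4) holding (x, y, z, v) per landmark.
    Returns a copy in which flagged frames are replaced by the midpoint of
    the previous and next frames (Eqs. 5 and 6).
    """
    out = seq.copy()
    n = seq.shape[0]
    for f in range(1, n - 1):
        for k in range(seq.shape[1]):
            prev_xy, cur_xy, nxt_xy = seq[f - 1, k, :2], seq[f, k, :2], seq[f + 1, k, :2]
            if not (np.any(prev_xy != 0) and np.any(nxt_xy != 0)):
                continue  # cannot interpolate without valid neighbouring frames
            undetected = np.all(cur_xy == 0)                      # Eq. (3)
            delta_cur = np.linalg.norm(cur_xy - prev_xy)          # δ^f
            delta_prev = np.linalg.norm(prev_xy - seq[f - 2, k, :2]) if f >= 2 else 0.0
            wrong = delta_prev > 0 and delta_cur > THETA * delta_prev  # Eq. (4)
            if undetected or wrong:
                # Midpoint of neighbours; z and v are averaged as well here,
                # which is an implementation choice for the sketch.
                out[f, k] = (seq[f - 1, k] + seq[f + 1, k]) / 2.0
    return out
```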
FIGURE 7. Proposed architecture of action detection model. Landmark values are the input features of the
action detection network.
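The stacked LSTM classifier of Fig. 7 (three LSTM layers with 64, 128, and 64 units, two dense layers with 64 and 32 neurons, and a SoftMax output over the 27 action classes) can be written, for instance, with the Keras API as below. The optimizer, loss, and epoch count follow the training setup reported later in the experimental results, while details the text does not specify, such as the dense-layer activations, are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES = 60      # F_n: frames per action clip
NUM_FEATURES = 132   # F_i: 33 landmarks x (x, y, z, v)
NUM_CLASSES = 27     # number of action classes

def build_action_model():
    model = models.Sequential([
        # Three stacked LSTM layers (64-128-64 units); the first two return
        # full sequences so the next LSTM layer receives a sequence input.
        layers.LSTM(64, return_sequences=True,
                    input_shape=(NUM_FRAMES, NUM_FEATURES)),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),
        # Two dense encoding layers (64 and 32 neurons); ReLU is an assumption.
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        # SoftMax over the action classes.
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage sketch: each training sample is one 60 x 132 matrix V^m produced by
# the landmark extraction and interpolation steps above.
# model = build_action_model()
# model.fit(train_x, train_y, validation_split=0.2, epochs=150)
```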
method accomplishes a more comprehensive understanding of the activity being performed and improves the accuracy of the analysis.

The input image frame is divided into S × S grids in order to detect the object. If the object's center falls within a grid cell, that grid cell is used to forecast a bounding box:

CS_g^b = P_{g,b} × IoU_{pred}^{truth}    (13)

where CS_g^b is the confidence score of the b-th bounding box in the g-th grid, P_{g,b} represents the class probability value of the b-th bounding box in the g-th grid, and IoU_{pred}^{truth} denotes the intersection over union (IoU) between the ground-truth and predicted bounding boxes of the objects.

The detection model structure consists of four main parts: input terminal, backbone, neck, and head, which help to clearly describe each stage of the suggested method. To ensure the detection of moving and stationary objects, the input image is processed at a resolution of 416 × 416 pixels. DarkNet53 was created as a result of YOLOv3 [34] incorporating the residual module and the properties of the ResNet structure. Based on this, YOLOv4 created CSPDarkNet53, which consists of 5 cross-stage partial (CSP) modules and 72 convolutional layers, exploiting the superior learning capabilities of the CSP network (CSPNet) [35]. By incorporating gradient changes into the feature maps, it minimizes computational bottlenecks and enables the CNN to achieve greater accuracy. Additionally, the initial CSP stages are transformed into the residual layers of the original DarkNet in order to increase accuracy as well as speed. Two convolutional layers and one skip connection are included in each residual module, and a batch normalization layer and a Mish activation function are included in each convolutional layer. The five CSP modules of the CSPDarkNet53 backbone contain residual layers stacked in a 1-2-8-8-4 configuration. SPPNet and PANet are the components of the neck portion. The input feature layer in SPPNet is first convolved three times, and then maximum pooling is performed using max-pooling kernels of different sizes. The pooled outputs are first concatenated and then convolved three more times, which enlarges the receptive field of the network. Following the operations of the backbone and SPPNet, PANet convolves the feature layers and up-samples them, doubling the height and width of the original feature layers.

The feature layer obtained after convolution and up-sampling is concatenated with the feature layer obtained from CSPDarkNet53 to achieve feature fusion, followed by down-sampling: the result is compressed in height and width and stacked with the previous feature layers for even more feature fusion. In contrast to the three detection heads of YOLOv4, the proposed model includes an additional prediction head that enhances the ability to detect extremely small objects, improves the stability of the detection, and mitigates the negative effects of object size variance. The introduced extra head enhances the object detection algorithm by effectively handling scale variations, improving localization accuracy, providing contextual understanding, and enabling accurate classification of objects. These benefits collectively contribute to the algorithm's enhanced performance and accuracy in detecting objects of different shapes and sizes used during activities. Although this additional head incurs higher computational and memory costs, it results in better detection performance due to the utilization of low-level yet high-resolution feature maps. The model structure is shown in Fig. 8. Finally, to improve mAP and object detection, the head-anchor-based detection network model is used. The loss function used in the training phase of the object detection model mainly includes the bounding box location loss (L_BIoU), the confidence loss (L_conf), and the classification loss (L_cl), as defined below:

L = L_BIoU + L_conf + L_cl    (14)
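A small numeric sketch of the confidence score in Eq. (13): the class probability of a predicted box is scaled by its IoU with the ground truth. The (x1, y1, x2, y2) corner convention and the example values are assumptions used only for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence_score(class_prob, pred_box, truth_box):
    """Eq. (13): CS = P_{g,b} * IoU between prediction and ground truth."""
    return class_prob * iou(pred_box, truth_box)

# Example: a prediction that overlaps the ground truth fairly well (~0.61)
print(confidence_score(0.9, (50, 50, 150, 150), (60, 60, 160, 160)))
```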
FIGURE 8. Enhanced object detection model for identifying objects used by a person during an action.
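As an illustration of the SPP-style neck block described above (three convolutions, parallel max-pooling with different kernel sizes, concatenation, then three more convolutions), the following Keras sketch uses the 5/9/13 pooling kernels that are conventional in YOLOv4; the kernel sizes, filter counts, and activation placement are assumptions, since the text only states that differently sized max-pooling kernels are used.

```python
import tensorflow as tf
from tensorflow.keras import layers

def mish(x):
    # Mish activation: x * tanh(softplus(x))
    return x * tf.math.tanh(tf.math.softplus(x))

def conv_bn_mish(x, filters, kernel_size):
    """Convolution + batch normalization + Mish, mirroring the backbone's conv blocks."""
    x = layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation(mish)(x)

def spp_block(x, filters=512, pool_sizes=(5, 9, 13)):
    """SPP neck block: three convolutions, parallel max-pools, concat, three convolutions."""
    x = conv_bn_mish(x, filters, 1)
    x = conv_bn_mish(x, filters * 2, 3)
    x = conv_bn_mish(x, filters, 1)
    pools = [layers.MaxPooling2D(pool_size=p, strides=1, padding="same")(x)
             for p in pool_sizes]
    x = layers.Concatenate()([x] + pools)  # fuse multi-scale context
    x = conv_bn_mish(x, filters, 1)
    x = conv_bn_mish(x, filters * 2, 3)
    x = conv_bn_mish(x, filters, 1)
    return x

# Example usage on a backbone feature map (the spatial size is illustrative):
# features = spp_block(tf.keras.Input(shape=(13, 13, 1024)))
```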
TABLE 1. Activities and corresponding chronological order of actions and objects used.

Algorithm 1 Person's Activity Recognition
Define the expected action sequence of each activity as a list of strings.
Define the action and object combination condition.
Input: Read a video V^m(F_n, F_i), where F_n represents the sequence of frames, F_i represents the 132 landmark values of the i-th frame, and n ≥ 60.
Output: A person's activity with its action sequence and the object being utilized.
1. Initialization: Action detection
2. Loop over the expected actions in the sequence.
3. For each expected action, read the first 60 frames, F_n = 60.
4. Check if the previous 10 frames are the same, F_n[−10:] is same, then
5. If res > T, where res is the normalized output vector with probabilities of each possible outcome and threshold T = 0.6, then
6. Check condition: action sequence (Table 1)
7. If the sequence of the action is true
8. Initialization: Object detection
9. If the previous 10 frames detect the same object, F_n[−10:] is same, then
10. Check condition: action-object combination (Table 1)
11. If the combination condition is true
12. Output: action, then action++
13. Output: activity
14. Else, output an error message "wrong object detected".
15. Else, output an appropriate error message "wrong action sequence: expected action_sequence[i]".
16. Close video
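A compact Python sketch of the control flow in Algorithm 1 is given below. The functions detect_action and detect_object are hypothetical placeholders for the LSTM action classifier and the enhanced YOLOv4 detector, and the ACTIVITY definition stands in for one row of Table 1; the frame-windowing details are simplified relative to the pseudocode.

```python
# Hypothetical activity definition in the spirit of Table 1:
# each step pairs an expected action with the object that must be in use.
ACTIVITY = [("open bottle", "bottle"), ("pour water", "bottle"), ("close bottle", "bottle")]
THRESHOLD = 0.6   # minimum action probability (T in Algorithm 1)
WINDOW = 10       # number of consecutive frames that must agree

def recognize_activity(frames, detect_action, detect_object):
    """Walk through the expected action sequence, checking actions and objects.

    detect_action(clip)  -> (label, probability) from the LSTM classifier.
    detect_object(frame) -> detected object label from the object detector.
    """
    step = 0
    recent_actions, recent_objects = [], []
    for frame_idx, frame in enumerate(frames):
        if step >= len(ACTIVITY):
            break
        expected_action, expected_object = ACTIVITY[step]
        label, prob = detect_action(frames[max(0, frame_idx - 59): frame_idx + 1])
        recent_actions = (recent_actions + [label])[-WINDOW:]
        recent_objects = (recent_objects + [detect_object(frame)])[-WINDOW:]
        stable_action = len(recent_actions) == WINDOW and len(set(recent_actions)) == 1
        stable_object = len(recent_objects) == WINDOW and len(set(recent_objects)) == 1
        if stable_action and prob > THRESHOLD:
            if label != expected_action:
                return f"wrong action sequence: expected {expected_action}"
            if stable_object:
                if recent_objects[-1] != expected_object:
                    return "wrong object detected"
                step += 1                      # action and object both match
                recent_actions, recent_objects = [], []
    return "activity recognized" if step == len(ACTIVITY) else "activity incomplete"
```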
The second task involves detecting the object used during the actions. The final task is to recognize the activity based on the sequence of actions. Despite the abundance of available online datasets for data acquisition, most of them focus solely on action detection and disregard the objects utilized during the actions and the sequence of the actions in the activity. Therefore, it becomes challenging to acquire a dataset for this kind of task. In this context, this research uses our own video and image dataset. We have gathered an extensive collection of 243 videos depicting 27 distinct actions, where each action entails the use of an object. These actions are performed in a sequence with varying objects, forming distinct activities. As elaborated in Table 1, five activities were utilized, each with a distinct chronological order of actions and the corresponding objects used during these activities. The term "action" here refers to the movement of the body while using an object, whereas "activity" refers to the complete work being carried out. Given that each action is composed of a sequence of frames, we compiled F_n = 60 frames for each action while developing the proposed action detection model.

To develop our model, we utilize an approach that focuses solely on the objects being utilized by individuals during actions. We treat the person's hand and the object as one entity, disregarding any similar objects of the same class in the same frame that are not being used during the activity. The dataset particulars for the object detection model are given in Table 2.

B. EVALUATION METRICS
The performance of the proposed models was validated using a number of performance indicators, such as accuracy, precision, recall, and F1 score. These performance measurements are calculated using four parameters: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The aforementioned performance metrics are defined as follows.

1) ACCURACY
The ratio of correctly detected activities to the total data:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (22)

2) PRECISION
The ratio of a person's activities correctly detected to the total videos detected as that activity:

Precision = TP / (TP + FP)    (23)

3) RECALL
The ratio of videos correctly detected as an activity to the total videos of that activity:

Recall = TP / (TP + FN)    (24)

4) F1 SCORE
The harmonic mean of precision and recall, which effectively summarizes the model performance:

F1 score = 2 × (Precision × Recall) / (Precision + Recall)    (25)
5) AP
The area under the precision-recall curve, denoted average precision (AP), is defined as follows:

AP = \int_0^1 P(r) dr    (26)

where P and r are the precision and recall, respectively; both take values between 0 and 1. Finally, after calculating the AP values of the activities, the mean average precision (mAP) is calculated as follows:

mAP = (AP_1 + AP_2 + ... + AP_n) / n    (27)
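For completeness, a small sketch of the metrics in Eqs. (22)-(27); the per-class AP values passed to the mAP helper are assumed to come from a separate precision-recall computation.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from raw counts (Eqs. 22-25)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

def mean_average_precision(ap_values):
    """mAP as the mean of the per-class average precisions (Eq. 27)."""
    return sum(ap_values) / len(ap_values)

# Example: counts from a hypothetical confusion matrix
print(classification_metrics(tp=90, tn=50, fp=5, fn=10))
```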
C. ACTION DETECTION RESULT
A collection of F_n = 60 frames, each of which contains F_i = 132 landmark values, is obtained from each action video using our pose estimation and landmark extraction approach. Before feeding these values to the LSTM network for action detection, the entire video dataset was split into training and test datasets in an 8:2 ratio. We used the Adam optimizer [36] to train our network for 150 epochs in an effort to reduce the loss. The categorical cross-entropy loss function is used since the action detection model has twenty-seven classes. The action detection model achieved a test accuracy of 95.91% after training. Fig. 9 shows the normalized confusion matrix generated from the predictions made by the proposed action detection model on the test dataset. The results indicate that the model achieved high accuracy in recognizing the majority of the actions. However, some similar actions, such as opening or closing a bottle and wearing socks or shoes, were sometimes misclassified as false positives. This is likely due to the almost identical nature of these actions.

FIGURE 9. Normalized confusion matrix created using the predictions of the proposed action detection model on the test dataset.

TABLE 2. Description of the object detection dataset.

To evaluate the quality of our model, we used OpenPose and DeepPose as standard references and trained two models, one with and the other without the proposed interpolation technique, using different recurrent neural networks, i.e., GRU (gated recurrent units) and LSTM, as shown in Table 3. Although the OpenPose model shows slightly better performance than the other estimation models, our approach with both networks runs much faster than the rest. This is because the proposed model only employs a two-step detector-tracker inference pipeline, where the detector only runs on the first frame or until a person's face is detected, and the tracker is then used to track the person in consecutive frames. To forecast heatmaps for all landmarks, we additionally employ a compact encoder-decoder network design, followed by another encoder that regresses directly to landmark coordinates, allowing the model to be lighter and run faster in real-time inference. Also, the model trained with the interpolation technique performs better, as it uses well-interpolated landmarks for undetected and wrong-detected landmark frames. Furthermore, LSTM performs slightly better than GRU with the different pose estimation algorithms, whereas GRU has a simpler structure: it has only two gates (reset and update gates) and uses fewer training parameters. Consequently, GRU consumes less memory, executes faster, and trains faster than LSTM, whereas LSTM achieves better accuracy on datasets with longer sequences. The output results of the proposed action detection model are shown in Fig. 10.

D. OBJECT DETECTION RESULT
Using the dataset listed in Table 2, the performance of the object detection model for the suggested person activity recognition system was evaluated. Before feeding the datasets into our object detection model, we randomly divided the data into 80% for training and split the remaining data into 10% for validation and 10% for testing. The input images are also resized to 416 × 416 before being passed into training.
FIGURE 10. Results of action detection using LSTM network and interpolated body landmarks obtained from pose estimation network.
TABLE 3. Performance comparison of various action detection models.

After training for 500 epochs with the Adam optimizer to reduce the overall loss, and with an initial learning rate of 0.0001, the proposed object detection model achieves an overall mAP of 97.68% for detecting the objects being used while performing the actions. Fig. 11 shows the detection of the object being utilized by the person using the enhanced YOLOv4, disregarding any similar objects of the same class in the same frame that are not being used during the activity. For example, when the person is putting on the right sock, the model does not detect the left sock. This is because we consider the person's hand and the object being used as a single entity. Similarly, when the person is loading clothes into the washing machine, the model does not detect other objects such as the washing machine lid or buttons, as they are not relevant to the action.

A performance comparison of the different object detection models is shown in Table 4. It is clear that at IoU = 0.5, Faster R-CNN has a higher mAP but the lowest FPS of all the models, which reflects the common trait of two-stage detection algorithms: higher detection accuracy but weaker real-time performance. Meanwhile, the FPS and mAP of our model are reasonably high when compared to the other algorithms. Although our model is a little slower than the original YOLOv4 due to the extra computational load from the additional head, it delivers superior object detection performance for every frame in the video. This is due to the advantage of having an extra head that allows the model to detect objects of varying sizes with better accuracy.
FIGURE 11. Object detection results using enhanced YOLOv4 algorithm to identify objects used during actions.
TABLE 4. Performance comparison with other object detection models.

TABLE 5. Comparisons on various activity recognition models.
FIGURE 12. The output of the proposed activity recognition model. The model identifies different actions that are performed in a chronological order
and the objects utilized during each action.
The goal of this research is to recognize a person's activities by detecting action sequences and interactive objects in real time. Thus, we require a model that can quickly identify a person's actions and detect objects. According to the experimental findings, the enhanced YOLOv4 model combined with the proposed action detection model achieves a higher FPS and a reasonably high mAP, suggesting that this model is more suitable for recognition problems.
Furthermore, it is worth noting that running the action detection and object detection models independently allows them to maximize their processing capabilities. Conversely, when these two models are integrated, there is an additional coordination overhead, resulting in a slight decrease in frames per second (fps) compared to individual execution. Nonetheless, the integration offers the advantage of precise activity recognition by incorporating both actions and objects, thereby enabling a more profound comprehension of the activity at hand.

V. CONCLUSION
The proposed model incorporated a lightweight, CNN-optimized, top-down human pose estimation architecture to find the body landmarks from a sequence of frames, followed by interpolation to enhance the accuracy of pose estimation for undetected or wrong-detected landmarks. The transformed landmark values were then fed to multiple layers of an LSTM network, culminating in a SoftMax layer to predict the person's actions. Additionally, an object detection model was developed by enhancing YOLOv4 to detect the object used during the actions. Finally, the proposed activity recognition algorithm integrated these two models to create a real-time, lightweight, and robust activity recognition model. Our model achieved 95.91% accuracy in recognizing actions and 97.68% mAP for detecting the object used during the actions, with an overall FPS of 10.47. This model can help monitor and inspect human activities that follow a chronological order of actions when interacting with different objects within the activity. In manufacturing and assembly, our activity recognition model can be utilized to ensure that workers follow predefined sequences when using tools and components, boosting efficiency and quality control. In sports analysis, it can accurately track players' movements, recognize the techniques and equipment used, and provide valuable insights for coaching and strategic analysis. In healthcare and rehabilitation, it can assist in monitoring patients' activities during therapy and offer real-time feedback to improve outcomes. In industrial environments, it can analyze workers' actions and equipment interactions to ensure safety compliance.

In the future, we plan to enhance the proposed method to recognize activity in industrial working environments and detect additional objects such as helmets, gloves, masks, and shoes to ensure individual safety and prevent industrial accidents. Additionally, we aim to enhance the fps of our model without compromising accuracy by exploring model optimization techniques, leveraging hardware acceleration, considering algorithmic improvements, and upgrading hardware infrastructure.

REFERENCES
[1] Q. Wu, Y. Wu, Y. Zhang, and L. Zhang, "A local–global estimator based on large kernel CNN and transformer for human pose estimation and running pose measurement," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–12, 2022.
[2] F. Rustam, A. A. Reshi, I. Ashraf, A. Mehmood, S. Ullah, D. M. Khan, and G. S. Choi, "Sensor-based human activity recognition using deep stacked multilayered perceptron model," IEEE Access, vol. 8, pp. 218898–218910, 2020.
[3] C. Xu, D. Chai, J. He, X. Zhang, and S. Duan, "InnoHAR: A deep neural network for complex human activity recognition," IEEE Access, vol. 7, pp. 9893–9902, 2019.
[4] T. Zebin, P. J. Scully, N. Peek, A. J. Casson, and K. B. Ozanyan, "Design and implementation of a convolutional neural network on an edge computing smartphone for human activity recognition," IEEE Access, vol. 7, pp. 133509–133520, 2019.
[5] Y. Li, C. Wang, Y. Cao, B. Liu, J. Tan, and Y. Luo, "Human pose estimation based in-home lower body rehabilitation system," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Glasgow, U.K., Jul. 2020, pp. 1–8.
[6] W. Liu, X. Liu, Y. Hu, J. Shi, X. Chen, J. Zhao, S. Wang, and Q. Hu, "Fall detection for shipboard seafarers based on optimized BlazePose and LSTM," Sensors, vol. 22, no. 14, pp. 5449–5466, Jul. 2022.
[7] M. Abbas and R. L. B. Jeannès, "Exploiting local temporal characteristics via multinomial decomposition algorithm for real-time activity recognition," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–11, 2021.
[8] W. Huang, L. Zhang, W. Gao, F. Min, and J. He, "Shallow convolutional neural networks for human activity recognition using wearable sensors," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–11, 2021.
[9] Y. Zhang, G. Tian, S. Zhang, and C. Li, "A knowledge-based approach for multiagent collaboration in smart home: From activity recognition to guidance service," IEEE Trans. Instrum. Meas., vol. 69, no. 2, pp. 317–329, Feb. 2020.
[10] N. A. Capela, E. D. Lemaire, and N. Baddour, "Feature selection for wearable smartphone-based human activity recognition with able bodied, elderly, and stroke patients," PLoS ONE, vol. 10, no. 4, pp. 1–18, Apr. 2015.
[11] A. Prati, C. Shan, and K. I.-K. Wang, "Sensors, vision and networks: From video surveillance to activity recognition and health monitoring," J. Ambient Intell. Smart Environ., vol. 11, no. 1, pp. 5–22, Jan. 2019.
[12] S. Sankar, P. Srinivasan, and R. Saravanakumar, "Internet of Things based ambient assisted living for elderly people health monitoring," Res. J. Pharmacy Technol., vol. 11, no. 9, pp. 3900–3904, Dec. 2018.
[13] E. Zdravevski, P. Lameski, V. Trajkovik, A. Kulakov, I. Chorbev, R. Goleva, N. Pombo, and N. Garcia, "Improving activity recognition accuracy in ambient-assisted living systems by automated feature engineering," IEEE Access, vol. 5, pp. 5262–5280, 2017.
[14] X. Ji, J. Cheng, W. Feng, and D. Tao, "Skeleton embedded motion body partition for human action recognition using depth sequences," Signal Process., vol. 143, pp. 56–68, Feb. 2018.
[15] A. Jalal, Y.-H. Kim, Y.-J. Kim, S. Kamal, and D. Kim, "Robust human activity recognition from depth video using spatiotemporal multi-fused features," Pattern Recognit., vol. 61, pp. 295–308, Jan. 2017.
[16] C. Xu, L. N. Govindarajan, and L. Cheng, "Hand action detection from ego-centric depth sequences with error-correcting Hough transform," Pattern Recognit., vol. 72, pp. 494–503, Dec. 2017.
[17] O. K. Oyedotun and A. Khashman, "Deep learning in vision-based static hand gesture recognition," Neural Comput. Appl., vol. 28, no. 12, pp. 3941–3951, Apr. 2016.
[18] L. Pigou, A. van den Oord, S. Dieleman, M. Van Herreweghe, and J. Dambre, "Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video," Int. J. Comput. Vis., vol. 126, nos. 2–4, pp. 430–439, Oct. 2016.
[19] J. Qi, P. Yang, M. Hanneghan, S. Tang, and B. Zhou, "A hybrid hierarchical framework for gym physical activity recognition and measurement using wearable sensors," IEEE Internet Things J., vol. 6, no. 2, pp. 1384–1393, Apr. 2019.
[20] C. Aviles-Cruz, E. Rodriguez-Martinez, J. Villegas-Cortez, and A. Ferreyra-Ramirez, "Granger-causality: An efficient single user movement recognition using a smartphone accelerometer sensor," Pattern Recognit. Lett., vol. 125, pp. 576–583, Jul. 2019.
[21] I. Jegham, A. B. Khalifa, I. Alouani, and M. A. Mahjoub, "Vision-based human action recognition: An overview and real world challenges," Forensic Sci. Int., Digit. Invest., vol. 32, Mar. 2020, Art. no. 200901.
[22] V. Bazarevsky, I. Grishchenko, K. Raveendran, T. Zhu, F. Zhang, and M. Grundmann, "BlazePose: On-device real-time body pose tracking," 2020, arXiv:2006.10204.
[23] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Columbus, OH, USA, Jun. 2014, pp. 1653–1660.
[24] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 1, pp. 172–186, Jan. 2021.
[25] W. Li, L. Wen, M. Chang, S. N. Lim, and S. Lyu, "Adaptive RNN tree for large-scale human action recognition," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 1453–1461.
[26] Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 1110–1118.
[27] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, "Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks," in Proc. 30th AAAI Conf. Artif. Intell., Phoenix, AZ, USA, Feb. 2016, pp. 12–17.
[28] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 779–788.
[29] W. Liu, Z. Liu, Y. Li, H. Wang, C. Yang, D. Wang, and D. Zhai, "An automatic loose defect detection method for catenary bracing wire components using deep convolutional neural networks and image processing," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–14, 2021.
[30] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele, "DeepCut: Joint subset partition and labeling for multi person pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 4929–4937.
[31] J. Taylor, J. Shotton, T. Sharp, and A. Fitzgibbon, "The vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, Jun. 2012, pp. 103–110.
[32] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[33] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[34] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[35] C. Wang, H. Mark Liao, Y. Wu, P. Chen, J. Hsieh, and I. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Seattle, WA, USA, Jun. 2020, pp. 1571–1580.
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.

Dr. Huang is a fellow of IET, CACS, TFSA, and the International Association of Grey System and Uncertain Analysis. He was a recipient of the 2021 Outstanding Research Award from the Ministry of Science and Technology, Taiwan. He serves as the IEEE SMCS VP for Conferences and Meetings and the Chair of the IEEE SMCS Technical Committee on Intelligent Transportation Systems. He was the IEEE SMCS BoG, the President of the Taiwan Association of Systems Science and Engineering, the Chair of the IEEE SMCS Taipei Chapter and the IEEE CIS Taipei Chapter, and the CEO of the Joint Commission of Technological and Vocational College Admission Committee, Taiwan.

SATCHIDANAND KSHETRIMAYUM received the B.Tech. degree in computer science and engineering from the National Institute of Technology Manipur, India, and the M.Tech. degree in operations research from the National Institute of Technology Durgapur, India. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering, National Taipei University of Technology, Taipei, Taiwan. His current research interests include human activity recognition (HAR), computer vision, deep learning, and image processing.