Article
Robust Human Activity Recognition by Integrating Image and
Accelerometer Sensor Data Using Deep Fusion Network
Junhyuk Kang 1, Jieun Shin 1, Jaewon Shin 1, Daeho Lee 2 and Ahyoung Choi 1,*
standing, sitting, lying, and stair climbing were measured at 10 Hz at 10 min intervals. The
measured data were subjected to a preprocessing process suitable for each feature type,
and features such as the mean, median, FFT-entropy, and signal vector magnitude were
extracted and used as inputs for the deep learning model. Six activities were classified
using two deep learning techniques: DNN and RNN. The RNN model showed an average
F1 score of 80%.
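For illustration, features of this kind can be computed per signal window with a few lines of NumPy; the sketch below only approximates the feature set named above (mean, median, FFT entropy, signal vector magnitude), and the function name and constants are hypothetical rather than taken from that study.

```python
import numpy as np

def window_features(acc: np.ndarray) -> np.ndarray:
    """Hand-crafted features for one window of 3-axis accelerometer data.

    acc: array of shape (n_samples, 3) holding x, y, z acceleration.
    Returns per-axis mean, median, FFT entropy, and the mean signal vector magnitude.
    """
    mean = acc.mean(axis=0)                      # per-axis mean
    median = np.median(acc, axis=0)              # per-axis median

    # FFT entropy: entropy of the normalized magnitude spectrum per axis.
    spec = np.abs(np.fft.rfft(acc, axis=0))
    p = spec / (spec.sum(axis=0, keepdims=True) + 1e-12)
    fft_entropy = -(p * np.log(p + 1e-12)).sum(axis=0)

    # Signal vector magnitude averaged over the window.
    svm = np.linalg.norm(acc, axis=1).mean()

    return np.concatenate([mean, median, fft_entropy, [svm]])

# Example: a 1 min window sampled at 10 Hz (600 samples, 3 axes).
features = window_features(np.random.randn(600, 3))
```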
Ahmadi et al. used machine learning techniques to classify the behavioral patterns
of adolescents and children with cerebral palsy into four classes: sedentary (SED), standing
utilitarian movements (SUM), comfortable walking (CW), and brisk walking (BW) [6]. They
used the signal vector magnitude feature extracted from 10 s non-overlapping segmented
signals, which were measured from 22 children and adolescents. Using the extracted
features, four classes were classified using three machine learning techniques: random
forest (RF), support vector machine (SVM), and binary decision tree (BDT). The results
showed that the SVM and RF performed better than the BDT, with an average of 82.0% to
89% for the SVM, 82.6% to 88.8% for the RF, and 76.1% to 86.2% for the BDT. By class, SED
was 94.1% to 97.9%, SUM was 74.0% to 96.6%, CW was 47.6% to 70.4%, and BW was 71.5%
to 86.0%, which showed a good performance in the SED and SUM classifications, but a
poor performance in the CW and BW classifications. Ignatov recognized the acceleration
signal based on a user-independent CNN model and analyzed it using the UCI database,
in which six types of behavioral data, including jogging, walking, and climbing stairs, were
collected [7]. Wang et al. recognized 19 behaviors using an auto-encoder composed of a
deep belief network in unsupervised learning based on signals collected through wearable
devices and showed a performance of 99.3% [8].
Unlike in previous studies, there is a trend of applying deep learning models based
on raw data without preprocessing [9,10]. It has been reported that a deep neural network
model performs better by finding more information than a shallow neural model [11].
Existing studies have used only accelerometer data or hand-crafted features extracted from
the sensor data; however, a deep model can discover usable features that humans cannot identify.
Moreover, when preprocessing is omitted, the model can learn better features because it extracts
them directly from the raw data [12]. River et al. proposed
an RNN-based human activity recognition model to classify six hand activities [13]. They
used inertial sensor data directly in the proposed model without preprocessing. Zhao et al.
proposed a customized long short-term memory (LSTM) model by varying the window
size in the data segmentation step [14]. They used an accelerometer, a gyroscope, and
magnetometer sensors as inputs without preprocessing. They found the optimized window
size to obtain improved results; their final recognition accuracy was 93.6% using the UCI
HAR database. Hassan et al. presented an end-to-end deep neural network (DNN) model
for recognizing human actions from temporally sparse data signals generated by passive
wearable sensors [15]. Wan et al. proposed an HAR architecture based on a smartphone
inertial accelerometer [16]. The smartphone gathered the sensory data sequence while
the participants went about their everyday activities and extracted the high-efficiency
features from the original data through numerous three-axis accelerometers. To extract
the relevant feature vectors, the data were preprocessed through denoising, normalization,
and segmentation. They applied CNN, LSTM, BLSTM, MLP, and SVM models using the
UCI and Pamap2 datasets and observed 93.21% accuracy with the CNN model using the
UCI dataset.
In addition to the accelerometer signal, research on recognizing user behavior based
on image signals is being actively conducted. For images, studies have actively extracted
features using the scale-invariant feature transform (SIFT) or speeded-up robust features,
and predicted poses by recognizing silhouettes, depth
information, and skeletons. Kale et al. proposed video-based human activity recognition
for a smart surveillance system [17]. The system extracted the features based on SIFT
and applied a K-nearest neighbor (KNN) and an SVM to recognize four to nine activities
including falling, fighting, walking, running, and sitting, among other general actions. The
results show that an SVM achieves a 92.91% accuracy rate, whereas a KNN has an accuracy
rate of 90.83%. Kim et al. proposed the activity recognition of elderly people using skeleton
joint features from a depth video [18]. They applied a hidden Markov model to distinguish
between diverse human behaviors. The experimental results demonstrate a high recognition
rate for elderly subjects, with a mean recognition rate of 84.33% over nine daily regular activities.
In recent years, the accuracy of behavior recognition has been significantly improved
by applying deep learning and machine learning technologies. Khaire et al. applied a
5-CNN model using various vision cues, such as RGB images, depth images, and skeletal
data as inputs. The performance of the 5-CNN model was 95% to 96% for classifying
27 activities, including bowling, boxing, tennis, and swinging [19]. In addition, self-supervised
learning studies have augmented small datasets with rotated data. However, because it is
difficult to determine whether an image has been rotated, lying and standing may be recognized
as different poses. Amir et al. classified 60 classes using NTU RGB-D data as the input of
the proposed 2-layer Part-Aware LSTM model [20]. The class contains 40 daily actions (e.g.,
drinking, eating, and reading), nine health-related actions (e.g., sneezing, staggering, and
falling), and 11 mutual actions (e.g., punching, kicking, and hugging). It was confirmed
that the proposed model showed a cross-subject accuracy of 62.93% and a cross-view
accuracy of 70.27%. Because of the complexity of human activity sequences, Shahroudy et al.
suggested a multimodal multipart learning method that supports the sparse combination
of multimodal part-based characteristics using depth and skeleton data [21].
Research on integrating multiple heterogeneous sensory information is being con-
ducted. Some existing studies were conducted to recognize behavior by extracting and
integrating various feature values, such as silhouette and depth information, from the
video signal. Khaire et al. proposed a method integrating vision data such as RGB, silhou-
ettes, and skeletons [19]. Amir et al. proposed RGB and depth data to recognize human
activities [20]. In addition, research integrating various sensor data such as accelerometers,
gyroscopes, and magnetic field signals to recognize behavior has been conducted [22].
Wei et al. proposed a CNN-based deep learning model to integrate the video and inertial-
sensing signal in order to detect human activities [23]. In this research, continuous motion
was expressed using a three-dimensional video volume and an input translated from a
one-dimensional acceleration signal into a two-dimensional image form using a spectro-
gram. CNN was employed in the behavior recognition model, as well as two types of
fusion models. Fusion was performed at the decision level in the first model following
classification for each input, while fusion was performed at the feature extraction level in
the second model. The fusion at the feature level was 94.1% accurate, and the fusion at the
decision level was 95.6% accurate.
Recently, much research has been undertaken to examine performance according to
the enhanced method of distinguishing the backdrop from the person in the image, data
segmentation, feature extraction, and feature selection, in order to increase the accuracy of
behavior identification. Kiran et al. proposed a deep learning model with five multi-layers
based on CNN that optimizes computation time [24]. Each deep learning model was used
for database normalization, transfer-learning-based optimal feature extraction, fusion and
classification, and the Shannon entropy theory, a statistical feature, was applied to feature
selection. In this study, by applying various databases such as UCF sport, KTH, and UT
interaction, it was verified through experiments whether there was an improvement in
processing speed while maintaining accuracy. Khan et al. proposed a cascaded framework
for action recognition from video sequences [25]. They used frame enhancement by contrast
stretching, luminance channel selection, and so on, to clearly distinguish between the
background and the person. After segmenting the person using background removal and a
saliency map, a morphological operation was applied to extract the human shape. Then,
various types of features such as HOG and SFTA were extracted from the image, fused,
and then classified by applying a neural network. The proposed method showed 97.2% to
99.9% performance in various open databases, such as KTH, UIUC, Muhavi, and WVU.
Helmi et al. suggested a lightweight feature selection method, called the GBOBWO algorithm,
for human activity categorization based on a gradient-based optimizer algorithm and a
support vector machine-based classifier. They used accelerometer signals as inputs and
extracted features in a conventional manner [26]. They then selected appropriate features with the
GBOBWO method and achieved 98% accuracy on the UCI-HAR and WISDM databases.
In summary, the accuracy of video and sensor-based behavior identification varies
greatly depending on how the properties of the constantly changing input signal are ex-
tracted, how segmentation is performed, how features are selected, and how the recognition
model is used. As a result, there have been studies conducted to normalize data and extract
step-by-step features through multiple CNN layers [24], a study on a method of deriving
features that can be distinguished from the background through preprocessing [25], and a
study on fusion of various feature values in the signal extraction and selection stage and ex-
tracting features of a light input signal that can be operated in a wearable environment [26].
A study
was also recently undertaken to integrate the picture signal and the acceleration signal and
analyze it with a CNN in order to complement the constraints of the input data and thereby
increase the accuracy of behavior recognition [23].
However, among existing studies, no study on generalized behavior recognition
models that support various types of input signals or minimize the preprocessing of input
signals, while also being robust to any noise that may occur in daily life, has been conducted.
In addition, in previous studies, deep learning models other than CNN were not applied
to fusion. Therefore, in this study, we propose a generalized deep learning model that is
robust to noise and not dependent on the input signals: features are extracted through a deep
learning model suited to each heterogeneous input signal, so that performance is maintained
while preprocessing of the input signals is minimized and the signals are integrated. In this work, we
propose a hybrid deep learning network that can recognize user behavior patterns using
heterogeneous image and accelerometer sensor signals. We also propose a fusion network
that uses the two sensors to maximize the identification rate for actions that cannot be reliably
identified from a single signal and to build a noise-resistant generalized recognition
model. With the proposed model, an image signal and an accelerometer signal are inputted
at the same time (for approximately 1 s) in the form of a time-series signal. A ResNet feature
is produced for an image signal, and a CNN and CBAM model is used to create a feature
for an accelerometer signal. The two signals are then concatenated to identify the activity.
The technical contribution of this paper is as follows. First, we propose a generalized
deep learning model that guarantees the recognition rate even when noise occurs in the
acceleration signal or when distortion such as left and right reversal of the image occurs.
The suggested model extracts features using a deep learning model that is appropriate for
the type of input signal. ResNet is used to extract the features of the image signal, and
CNN is used to extract the features of the multi-channel acceleration signal. LSTM and
CBAM, which adds attention weights, are used to reflect temporal characteristics. Second,
we use heterogeneous input data such as image and sensor data simultaneously to classify
human activity, which can maintain performance while minimizing preprocessing of the
input signal. Third, with respect to computational time, we offer an optimal input and a
model for behavior recognition that takes into account the computational resources and
processing time required based on the kind and size of input data. For activity recognition,
normalized skeleton data can be enough to classify the actions while maintaining accuracy.
The majority of present research has focused on increasing accuracy, although concerns
such as training time and processing speed are crucial for practical usage in everyday life.
The remainder of this paper describes the materials, proposed methods, and experiment
results in detail.
Figure 1. Model architecture.
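To make the data flow of Figure 1 concrete, the following is a minimal PyTorch sketch of the two-branch fusion described in this paper, assuming a 512-dimensional image feature, a 256-dimensional accelerometer feature, and a fully connected classifier over the concatenated vector; the encoder modules, class names, and layer sizes are placeholders, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class FusionHAR(nn.Module):
    """Two-branch fusion sketch: image encoder + accelerometer encoder + classifier."""
    def __init__(self, image_encoder: nn.Module, accel_encoder: nn.Module,
                 img_dim: int = 512, acc_dim: int = 256, n_classes: int = 9):
        super().__init__()
        self.image_encoder = image_encoder   # e.g., a ResNet-style CNN over skeleton frames
        self.accel_encoder = accel_encoder   # e.g., a 1D CNN with CBAM attention
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + acc_dim, 512),  # 768 -> 512 fully connected layer
            nn.ReLU(),
            nn.Linear(512, n_classes),          # 9 activity classes
        )

    def forward(self, frames: torch.Tensor, accel: torch.Tensor) -> torch.Tensor:
        f_img = self.image_encoder(frames)          # (batch, 512)
        f_acc = self.accel_encoder(accel)           # (batch, 256)
        fused = torch.cat([f_img, f_acc], dim=1)    # feature-level fusion
        return self.classifier(fused)               # activity logits
```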
2.1. Preprocessing
Each input dataset went through its own steps during the preparation stage. The image signal
processing was as follows. In this study, we classified 11 activities, such as jumping in place and
clapping hands. We used only 9 of the 11 activities because A09 (sitting down) is a combination
of A10 (sitting) and A11 (standing up), and it was assumed that sufficient information could be
transmitted without discriminating between the two actions in terms of type recognition.
To input consecutively stored image frames for activities, we removed the background from
the image based on the background subtraction algorithm. We then extracted 25 joint data points
through skeleton detection with the OpenPose API [27]. Because joint information alone can
identify activities, joint data were extracted to enhance the speed while maintaining the
identification and classification accuracy of the user's activity. The JSON file format of the
person in the frame was used to generate the joint data, which were obtained as the x and y
coordinates of the keypoints through a JSON file parser.
The acquired joint data were normalized after recognizing the skeleton so that only the parts
necessary for activity recognition were considered. In the normalization step, the acquired joint
data were scaled to a size of 100 × 100. The normalization process was applied to minimize the
differences in individual physical conditions, such as tallness and shortness or bulkiness and
smallness, and to improve the learning speed by lightening the data. To reflect the change over
time in the input, the data frames were bundled for approximately 1 s to form a single input.
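As a rough illustration of the normalization and framing steps above (the exact scaling used by the authors is not specified beyond the 100 × 100 target), the joint coordinates can be rescaled and grouped into 20-frame windows as follows; the function names and random example data are hypothetical.

```python
import numpy as np

def normalize_joints(joints: np.ndarray, size: int = 100) -> np.ndarray:
    """Scale one frame of 2D joints (25 x 2) into a size x size coordinate box,
    which removes differences in subject height/bulk and position in the image."""
    mins = joints.min(axis=0)
    extent = joints.max(axis=0) - mins
    extent[extent == 0] = 1.0                     # guard against degenerate frames
    return (joints - mins) / extent * (size - 1)

def frame_windows(frames: list, window: int = 20) -> np.ndarray:
    """Group consecutive normalized frames into ~1 s windows of 20 frames each."""
    n = len(frames) // window
    return np.stack([np.stack(frames[i * window:(i + 1) * window]) for i in range(n)])

# Example: 200 frames of 25 OpenPose joints -> 10 windows of shape (20, 25, 2);
# flattening the joints of each frame yields the (n_windows, 20, 50) coordinate input.
frames = [normalize_joints(np.random.rand(25, 2) * 640) for _ in range(200)]
windows = frame_windows(frames)
coord_vectors = windows.reshape(windows.shape[0], 20, 50)
```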
In this study, 20 single-skeleton images were grouped and framed. The reason for creating a
frame set by grouping 20 frames in the framing process was that the ratio of acceleration data to
image data was 1.4:1. In order for approximately 1 s of collected acceleration data to be used as
input, and for the skeleton image data to also be collected for approximately 1 s and used as
input, a set of 20 frames was created. The data format used as the input to the finally acquired
deep learning model is shown in Figure 2. Each dataset consisted of nine actions, one channel,
and five repetitions. After finding the skeleton by inputting sample data with a size of 640 × 480,
the size of the input image was reduced, and the image was normalized to a size of 100 × 100 to
refine unnecessary information. Framing was then applied to express the flow of time, and an
input form of 3978 × 20 × 100 × 100 was finally configured.
Figure 2. Skeleton image input data for deep learning.
In addition, we used the skeleton coordinate vector as another feature obtained from the image,
as shown in Figure 3. After detecting the skeleton, we obtained the joints of each frame and
normalized them into 100 × 100-sized images to simplify the input. Each 25-joint skeleton vector
was then flattened as input to the deep learning model.
Figure 3. Skeleton coordinate input data for deep learning.
For accelerometer data, the three-axis accelerometer data were measured in 6 bodily
locations of 12 subjects: wrists, hips, and feet. After obtaining raw three-axis accelerometer
data, we applied bandpass filtering to de-noise the signal, and then segmented the filtered
signal into 1 s windows that matched the image frame data. As input, we simultaneously
employed continuously collected image data and accelerometer data from the same action.
Both signals used 1 s of data as input because the behavior did not consist of a single
moment but rather varied over time. We employ a 1 s time-series segmented image
frame and an accelerometer signal to classify the time-series activity recognition. For
segmentation, we use a 1 s non-overlapping window, as shown in Figure 4. All data consist of
three-channel accelerometer signals (x, y, z). As a result, using the three axes measured at six
locations on the body as individual channels, the 18 channels were segmented into windows of
28 samples, each approximately 1 s long.
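A sketch of this segmentation step is shown below, assuming SciPy for the band-pass filter; the cutoff frequencies and the ~28 Hz sampling rate are illustrative assumptions, not values reported in the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(signal: np.ndarray, fs: float, low: float = 0.3, high: float = 10.0) -> np.ndarray:
    """Band-pass filter each accelerometer channel (cutoffs are placeholder values)."""
    b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, signal, axis=0)

def segment(signal: np.ndarray, win: int = 28) -> np.ndarray:
    """Cut an (n_samples, 18) multi-channel signal into non-overlapping windows
    of 28 samples (~1 s), matching the image frame windows."""
    n = signal.shape[0] // win
    return signal[: n * win].reshape(n, win, signal.shape[1])

# Example: six 3-axis sensors -> 18 channels, assumed sampled at ~28 Hz.
raw = np.random.randn(2800, 18)
windows = segment(bandpass(raw, fs=28.0))   # shape (100, 28, 18)
```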
Figure 4. Accelerometer input data for deep learning.
2.2. Feature Extraction and Multimodal Data Fusion with Deep Learning
Features were extracted from each set of input data using the proposed deep learn-
ing model. The proposed model architecture is illustrated in Figure 5. We used three
different types of inputs in this investigation and built a deep learning model for each
input. ResNet10 was used to prevent vanishing gradient problems according to depth for
the input images generated by connecting joints, such as the ankles and knees, neck, and
shoulder [28]. ResNet10 was used instead of ResNet101 or ResNet151 because the learning
results of ResNet10 and ResNet101 did not differ significantly during the test.
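ResNet10 is not a stock torchvision model, so the following is only a minimal residual feature extractor in the same spirit [28], producing the 512-dimensional image feature used later; treating the 20 stacked skeleton frames as input channels is an assumption made here for illustration.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """A standard two-convolution residual block with an identity shortcut."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.short = (nn.Sequential() if stride == 1 and in_ch == out_ch else
                      nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                                    nn.BatchNorm2d(out_ch)))

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.short(x))   # shortcut helps avoid vanishing gradients

class ResNet10Like(nn.Module):
    """Four stages of one BasicBlock each (~10 weight layers), 512-d output feature."""
    def __init__(self, in_ch: int = 20):          # 20 stacked skeleton frames as channels (assumption)
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, 64, 7, 2, 3, bias=False),
                                  nn.BatchNorm2d(64), nn.ReLU())
        self.stages = nn.Sequential(BasicBlock(64, 64), BasicBlock(64, 128, 2),
                                    BasicBlock(128, 256, 2), BasicBlock(256, 512, 2))
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                          # x: (batch, 20, 100, 100)
        return self.pool(self.stages(self.stem(x))).flatten(1)   # (batch, 512)
```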
Figure 5. Deep learning model of three input cases.
In the model using the vector of joint data as input, features were extracted based on a
two-layer bidirectional LSTM model. The bidirectional LSTM was used to learn the change of the
joint vector over time in both the forward and backward directions in a balanced manner [29].
Only two layers were used because it was confirmed experimentally that over-optimization
occurred when deeper layers were used, confirming that two layers were optimal. The joint
coordinates finally obtained were flattened to be used as an input to the LSTM. Because the
coordinate data were extracted from the skeleton images, the number of data points is the same
as the number of skeleton images. Therefore, similarly to the skeleton image input, the obtained
coordinates were grouped into a single input of 20 data points of a specific length rather than
treated as single data points. Adam was used as the overall optimizer, and the learning rate used
to train the model was 0.001. There were 512 feature values from the image processing model,
and a total of 12,800 features of the skeleton coordinate values were processed using the Bi-LSTM.
The preprocessed accelerometer data inputs were used to extract features through the CNN
and convolutional block attention module (CBAM) models. The CNN block consisted of two
1D convolution and two 1D batch normalization parts: batch normalization proceeded after each
convolution layer, and after batch normalization, attention was applied through the CBAM
attention technique [30]. CBAM attention is largely divided into channel attention and spatial
attention, which are applied sequentially. Attention was processed for each channel through the
channel attention part, and attention focusing on where features are located was processed
through the spatial attention part. A total of 256 feature values of the accelerometer data were
obtained by processing the CNN and CBAM models. The CNN model consists of two
convolutional layers; the first convolutional layer consists of 128 filters with a filter size of 3.
After the convolution layer, batch normalization and a ReLU function were applied, and the
CBAM attention model was then applied. The second convolutional layer consisted of 256 filters
with a filter size of 3; batch normalization and the ReLU function were applied as in the first step,
and the CBAM attention model was then applied. CBAM stands for convolutional block attention
module, an attention module to which the self-attention technique is applied. CBAM is an
improved version of BAM attention: BAM obtains attention through parallel processing of channel
attention and spatial attention, whereas CBAM obtains context and local information by
sequentially applying channel attention and spatial attention.
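As a rough sketch of how such a CBAM block can be realized for the 1D accelerometer features, following the channel-then-spatial scheme of [30], the module below applies channel attention and then spatial attention; the reduction ratio and kernel size are illustrative assumptions rather than the authors' settings.

```python
import torch
import torch.nn as nn

class ChannelAttention1d(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))

    def forward(self, x):                             # x: (batch, channels, length)
        avg = self.mlp(x.mean(dim=2))                 # squeeze over time (average pooling)
        mx = self.mlp(x.amax(dim=2))                  # squeeze over time (max pooling)
        w = torch.sigmoid(avg + mx).unsqueeze(2)      # per-channel weights
        return x * w

class SpatialAttention1d(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                             # x: (batch, channels, length)
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        w = torch.sigmoid(self.conv(stats))           # where along the window to attend
        return x * w

class CBAM1d(nn.Module):
    """Channel attention followed by spatial attention, applied after each conv block."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention1d(channels)
        self.spatial = SpatialAttention1d()

    def forward(self, x):
        return self.spatial(self.channel(x))
```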
The feature set from ResNet10 and the feature set from the CNN+CBAM model were
integrated to classify the activities through the fully connected layer and the output layer.
The batch size of the deep learning model was 16, and the input size of the feature stage
was 512 for the images and 256 for the accelerometer signals. The number of nodes in the
concatenation stage was 768, the number of nodes in the fully connected layer was 512, and
the model classified the 9 activities. The hyperparameters used in this study are listed in
Table 1. The hyperparameters were determined according to the type of the proposed deep
learning fusion model, and the parameters showing the best performance in each case are as
follows. The best overall performance was achieved when ResNet10, CNN, and CBAM were
combined, with the learning rate set to 0.001, the batch size to 4,
and the epoch to 36. The epoch is distinct since it is a result of the process of determining
an optimum parameter based on the training and validation losses. The magnitude of the
input data is related to the substantial difference in epochs depending on the model. The
size of the input data in the ResNet10+ Bi-LSTM+CNN+CBAM model, which incorporates
all input data, was as follows: (3978, 20, 100, 100) is the skeleton data matrix, (3978, 20, 50)
is the coordinate, and (3978, 28) is the accelerometer data. As a result, 1 epoch time for
training, or the time it takes to train the model, was the longest at 300 s, while the number
of epochs, or the iteration for learning the full model, was the shortest at 16. Conversely,
the data size was the smallest when vector data and sensor were utilized, but the training
time was 10 s and the epoch size was 135. That is, when the input data to be trained was
small, it was confirmed that the training time was reduced, but the iterations to find the
optimized model increased.
Table 1. Hyperparameters and performance of the proposed deep learning fusion models.

Training Models                          ResNet10+Bi-LSTM+CNN+CBAM   ResNet10+CNN+CBAM   Bi-LSTM+CNN+CBAM
Input type                               Image + vector + sensor     Image + sensor      Vector + sensor
Training parameters   Learning rate      0.001                       0.001               0.001
                      Batch size         16                          4                   16
                      Epoch size         13                          36                  135
Performance measures  1 epoch time 1     300 s                       275 s               10 s
                      Overall accuracy   93.1%                       94.8%               91.8%
1 One epoch time for training.
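The training setup implied by Table 1 (Adam with a 0.001 learning rate, small batches, and stopping once the validation loss no longer improves) can be sketched as follows; the data loaders, model interface, and patience value are placeholders rather than the authors' actual training script.

```python
import torch
from torch import nn, optim

def train(model: nn.Module, train_loader, val_loader, epochs: int = 36,
          lr: float = 1e-3, patience: int = 5, device: str = "cpu"):
    """Minimal training loop matching the reported optimizer and learning rate."""
    model.to(device)
    opt = optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    best_val, wait = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for frames, accel, y in train_loader:               # (image, sensor, label) batches
            opt.zero_grad()
            loss = loss_fn(model(frames.to(device), accel.to(device)), y.to(device))
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(f.to(device), a.to(device)), y.to(device)).item()
                      for f, a, y in val_loader) / max(len(val_loader), 1)
        if val < best_val:                 # keep training while the validation loss improves
            best_val, wait = val, 0
        else:
            wait += 1
            if wait >= patience:           # stop once the validation loss has converged
                break
```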
3. Results
3.1. Dataset
The dataset used was Berkeley MHAD [31]. The data used in this dataset were
Multiview video and acceleration data. The multi-view video data were in a 640 × 480 pgm
file and consisted of 12 camera data points. We used 4 out of the 12 camera data points.
The data from the 12 cameras were divided into three categories: data from the front, data
from above, and data from the back. In this work, we used data from the front camera
to reduce the number of cameras in order to use this study in everyday life. The degree
of generalization of the model was proven using image inversion experiments (left-right
inversion and up-down inversion). In this dataset, 11 actions were repeated 5 times by
7 males and 5 females aged 23–30. There were approximately 660 sequences in total. Table 2
lists the 11 action lists used and the number of samples. We used 9 of the 11 activities
because A09 (sitting down) is a combination of A10 (sitting) and A11 (standing up), and
it was assumed that sufficient information could be transmitted without discriminating
between the two actions in terms of type recognition.
The acceleration data consisted of six three-axis wireless accelerometers measuring the
wrist, ankle, and hip movements. The accelerometer signal data were obtained from the
three-axis accelerometer sensor data from six places on the body, i.e., both wrists, both hips,
and both feet, from the 12 subjects. The accelerometer signal input data points numbered
111,384, which included nine activities, three channels, six parts, and five repetitions. The
skeleton data consisted of nine actions, one channel, and five repetitions. After finding the
skeleton by inputting 79,560 sample data with a size of 640 × 480 as the input, the size
of the input image was reduced, and the image was normalized to a size of 100 × 100 to
refine the unnecessary information. Framing was then applied to express the flow of time,
and the input form of 3978 × 20 × 100 × 100 was finally configured. To achieve a lower
storage capacity than the image data, the joint coordinate data were encoded as a vector.
The coordinate values of the 25 joints were expressed as real numbers before being flattened
into a single numeric vector. Finally, a 3978 × 20 × 50 input vector was constructed in the
form of nine actions with one channel, which were repeated five times.
Joint data were acquired from the segmented human image based on the OpenPose program [27].
We used 25 joint points, as shown in Table 3. The joint data were composed of a JSON
file format of the person in the frame, which was acquired through the x- and y-coordinates
of the keypoints through the JSON file parser. In the normalization step, the acquired
joint data were scaled to a size of 100 × 100. The normalization process was conducted
to minimize the differences in the individual physical conditions, such as tallness and
shortness and bulkiness and smallness, and to improve the learning speed by lightening
the data.
Activity / Code / # of Instances / Accuracy (%): Image Only, Sensor Only, Image + Sensor, Vector + Sensor, Image + Vector + Sensor
Jumping in place A01 6025 53% 95% 96% 100% 100%
Jumping jacks A02 7824 67% 95% 98% 98% 97%
Bending—hands up A03 18,762 86% 99% 99% 98% 99%
Punching (boxing) A04 10,195 56% 72% 86% 75% 78%
Waving—two hands A05 9742 70% 85% 90% 90% 96%
Waving—one hand (right) A06 10,763 69% 81% 100% 94% 93%
Clapping hands A07 5242 48% 65% 89% 80% 88%
Throwing a ball A08 3584 49% 79% 93% 90% 83%
Sitting down, then standing up A09 21,086 83% 91% 98% 98% 100%
Average accuracy 70.9% 84.7% 94.8% 91.8% 93.1%
Standard deviation of activity accuracy 14% 11% 5% 9% 8%
When only the sensor was used, the accuracy was approximately 85%, and when the
ResNet-based prediction applying only an image was used, the accuracy was approximately
70%. However, the analysis using the accelerometer and image information confirmed
an accuracy of 94.8%. As a result of the analysis using the accelerometer and coordinate
information, it was 91.8%. As a result of the analysis using an accelerometer, image
information, and coordinate information, the accuracy was 93.1%. Through the confusion
matrix, we examined which behavioral analyses did not work well for a more extensive
investigation, as shown in Figure 6. The relative color indicates the relative accuracy in the
confusion matrix. We found that when both images and sensor data were used, the accuracy
of each behavior recognition was higher than when only sensors or image data were used
in most circumstances. However, all classes that are difficult to categorize are those that
include the use of the hands and arms. In the case of the sensor-only model, A04 (punching),
A05 (waving—two hands), A06(waving—one hand (right)), A07 (clapping hands), and A08
(throw balls), which are most of the actions using hands, had low individual classification
accuracy and were difficult to classify because of the similarities among the data, as shown
in Figure 6a. When the sensor and image data were used together, it was confirmed that the
performance of distinguishing A04, A07, and A08 was improved, as shown in Figure 6b–d.
A04 and A07 were difficult to differentiate in the image because they both comprised
stretched hands in a standing position, with no leg movement in the sensor data, making it
difficult to distinguish them solely on the basis of identical hand movements.
Figure 6. Confusion matrix of activity classification: (a) sensor-only confusion matrix, (b) accelerome-
ter and skeleton image, (c) skeleton coordinates and accelerometer, and (d) accelerometer, skeleton
vector, and skeleton image.
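Confusion matrices such as those in Figure 6 can be computed directly from the predicted and true activity labels; the snippet below is a generic scikit-learn sketch with random labels standing in for model outputs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# y_true/y_pred are activity labels (A01..A09 encoded as 0..8) for one model configuration.
y_true = np.random.randint(0, 9, size=500)
y_pred = np.random.randint(0, 9, size=500)
cm = confusion_matrix(y_true, y_pred, normalize="true")   # row-normalized accuracy per class
```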
The performance, analyzed in terms of time, is shown in Figure 7 and Table 1. Figure 7 shows
the training loss and validation loss of our proposed method. To avoid overfitting, we used the
training loss and validation loss to select optimal parameters. The point at which the training
loss converges to 0 and the validation loss converges was chosen as the stopping criterion, as the
overall accuracy improves up to that point. We measured the time it takes for one epoch to train
to evaluate performance over time, as shown in Table 1. For skeleton and accelerometer data
using the ResNet10+CNN+CBAM model, one epoch took approximately 275 s. It took 10 s to
apply Bi-LSTM+CNN+CBAM with skeleton coordinate data and acceleration data as inputs under
the same circumstances. It took 300 s to take all three forms of data and apply them to the
ResNet10+Bi-LSTM+CNN+CBAM model. The size of the input data in the
ResNet10+Bi-LSTM+CNN+CBAM model, which incorporates all input data, was as follows:
(3978, 20, 100, 100) is the skeleton data matrix, (3978, 20, 50) is the coordinate data, and (3978, 28)
is the accelerometer data. This indicates that the 1 epoch time for training, or the time it takes to
train the model, was the longest at 300 s. Conversely, the data size was the smallest when vector
data and sensor data were utilized, but the training time was 10 s. That is, when the input data to
be trained were small, it was confirmed that the training time was reduced, but the iterations
needed to find the optimized model increased.
Finally, in this study, a performance analysis was conducted from the perspective of model
robustness when various noises were added to the input signal. In the case of behavior
recognition achieved through images, if the angle of the camera is changed, a performance
degradation may occur during testing. We generated two kinds of input signal noise, as shown
in Figure 8. First, we tested the input data with Gaussian noise added to 1, 3, and 6 of the
18 channels of the acceleration data. Second, in order to add noise to the image data, the image
was used as input to the model after inverting it left and right and up and down.
Figure 8. Input data distortion: (a) adding Gaussian noise to one channel of an accelerometer
sensor and (b) a flipped image from left to right, bottom to top.
Among the 18-channel acceleration data, all 1-channel, 3-channel, and 6-channel values were
replaced with Gaussian noise, as shown in Figure 9a–c. Through the experiment, it was confirmed
that the performance was affected as the number of channels with errors increased. When noise
was added to 1 channel, around 0.5% of noise was formed, 16.7% of noise was generated on
3 channels, and 33.3% of noise was generated on 6 channels. As a result of the experiment, the
accuracy was 94.8% while using a skeleton image and sensor as an input, but 93.8% when noise
was generated in one channel, confirming that the performance degradation was 1%. The overall
accuracy was 77.1% when noise was applied to three channels, and 76.3% when noise was added
to six channels. It was confirmed that when the channel noise was 0.5% or above, the performance
of the acceleration signal degraded dramatically. In addition, the ninth action, sitting down,
exhibited a considerable decline in performance dependent on the channel distortion, as indicated
in the confusion matrix. As shown in Figure 9d–f, when the image in which the input signal was
inverted toward the left and right was mixed and tested with the existing image, the analysis
showed 93.41% accuracy, and when tested by mixing the vertically inverted image and the
existing image, the performance was 94.16%. In the case of mixing the left and right images, the
upside-down image, and the existing image, a good performance of 93.44% was demonstrated.
Figure 9. Confusion matrix of activity classification: (a) 1 channel sensor data distortion (0.5% noise),
(b) 3 channels sensor data distortion (16.7% noise), (c) 6 channels sensor data distortion (33.3% noise),
(d) image data distortion (normal and left-right reverse data), (e) image data distortion (normal and
top-bottom reverse data), (f) image data distortion (normal and left-right reverse and top-bottom
reverse data).
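The two input distortions used in this robustness test can be reproduced with a few lines of NumPy; the channel indices and random seed below are arbitrary choices made here for illustration.

```python
import numpy as np

def corrupt_channels(windows: np.ndarray, channels: list, rng=None) -> np.ndarray:
    """Replace the given channels of (n_windows, 28, 18) accelerometer data with Gaussian noise."""
    rng = rng or np.random.default_rng(0)
    out = windows.copy()
    out[:, :, channels] = rng.standard_normal((windows.shape[0], windows.shape[1], len(channels)))
    return out

def flip_frames(frames: np.ndarray, horizontal: bool = True) -> np.ndarray:
    """Flip (n, 20, 100, 100) skeleton frames left-right or top-bottom."""
    axis = 3 if horizontal else 2
    return np.flip(frames, axis=axis).copy()

# Example: 33.3% channel noise (6 of 18 channels) and left-right flipped images.
noisy_acc = corrupt_channels(np.random.randn(10, 28, 18), channels=[0, 1, 2, 3, 4, 5])
flipped = flip_frames(np.random.rand(10, 20, 100, 100))
```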
Figure 10 shows the ROC (receiver operating characteristic) plot and AUC (area under the ROC
curve) value of the result of classifying the behavior using the skeleton image and sensor data.
In most classes, the AUC value was 1, except for A04, A07, and A08. The AUC of A04 was 0.99,
the AUC of A07 was 0.97, and the AUC of A08 was 0.99. In general, a classifier is considered
good if its AUC is 0.95 or higher. Because all of the classifiers in this study had a score of 0.95 or
above, it can be concluded that the proposed strategy has produced an efficient classifier. The
ROC plot shows good classifier performance in most classes. It was confirmed that the classifier
performance did not decrease significantly even in the case of A07 and A08, which had relatively
low accuracy.
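Per-class AUC values such as those reported above can be computed in a one-vs-rest fashion from the predicted class probabilities; the following is a generic scikit-learn sketch, not the evaluation code used in this study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_class_auc(y_true: np.ndarray, y_score: np.ndarray) -> dict:
    """One-vs-rest AUC for each class from predicted class probabilities."""
    return {c: roc_auc_score((y_true == c).astype(int), y_score[:, c])
            for c in range(y_score.shape[1])}

# Example with random scores for 9 activity classes.
y_true = np.random.randint(0, 9, size=200)
y_score = np.random.dirichlet(np.ones(9), size=200)
aucs = per_class_auc(y_true, y_score)
```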
4. Discussion
In the case of deep-learning-based recognition, a robust deep learning model cannot be generated
if there are insufficient data or if the activity to be recognized is deformed. For example, if the
image is rotated, it is not possible to recognize whether the image has been rotated, and the
possibility of determining a different posture increases. That is, even in the case of the same
standing action, there is a problem in that a different pose can be recognized when lying down
or standing. To solve this problem, studies were conducted to generate learning data by applying
a self-supervised learning technique to augment a small amount of data [19,20].
From this study, we observed that the proposed deep learning model preserves the recognition
rate even when various poses are inputted, confirming experimentally the complementary
behavior recognition of the accelerometer and image sensor. The overall performance was
demonstrated by establishing a deep fusion network with heterogeneous inputs, and the
fluctuation in the recognition rate for each behavior was also reduced. Boxing or punching
movements, for instance, have a similar pattern of arm motions and
arm bending, which are comparable to those of clapping and throwing a ball. As a result,
when only an image signal was used, the accuracy was only approximately 50%; however,
when both the image and the sensor signal were used, the accuracy was determined to be
approximately 86%.
In addition, when noise or distortion occurs in the image or sensor data, it was confirmed that
the proposed system can recognize the behavior while maintaining an accuracy close to that
achieved in the absence of noise. The accuracy analysis indicated a 93.41% performance when the image
with the input signal was inverted toward the left or right and the existing image was
mixed and evaluated, and a 94.16% performance when the vertically inverted image and
the existing image were mixed and tested. The system performed well when merging the
left and right images, the upside-down image, and the current image, reaching an accuracy
of 93.44%. White noise was also added to one channel value in the case of the acceleration
signal, and the test confirmed that the performance was 93.23%. After testing the data with
inversion and noise data, it has been determined that the suggested model is resilient, with
a performance deterioration of approximately 1%.
By comparing the performance with the training time, we confirmed that activity
recognition above a certain level is possible if only skeleton data are needed for such
recognition. In terms of a simple temporal efficiency, the model using sensor and coordinate
information showed a good performance, after which the accelerometer and image model
and the model using accelerometer, image, and coordinate information showed similar
performance. Instead of using the complete image as the input for behavior recognition,
the proposed method employs a skeleton image and its coordinate values. This reduces the
learning time while keeping the recognition accuracy at a comparable level. When all data were analyzed at the same time,
it took approximately 300 s to learn the three proposed deep learning networks, but only
approximately 10 s when the skeleton coordinate vector was utilized as an input. As a
result, a performance improvement of approximately 30-fold in terms of time efficiency
was confirmed.
The limitations of the proposed study are as follows. The deep learning model
proposed in this study applied ResNet for image data, CNN and CBAM for time series
signals, and LSTM for skeleton vector data to apply a model suitable for input signals. This
model is widely used in the deep learning field, and it seems that additional performance
improvements can be expected. In particular, the existing skeleton vector produces poor
results. Additional speed increases can be predicted if one employs a learning method that
weighs coordinate changes based on a transformer or BERT model used for sequential data
modeling instead of LSTM. In addition, to optimize performance and processing power, a
skeleton, not an original image, was extracted and used as an input. However, considering
that research on generalizing models based on limited input signals in recent studies is
ongoing, further research on end-to-end models based on raw input signals is needed.
Lastly, behavior recognition was performed using the current 18-channel data extracted from
six wearable devices and camera images in four directions. More research is needed to
figure out how to simplify the input signal and put it to use in the actual world.
5. Conclusions
In this study, we proposed a hybrid deep learning model that uses heterogeneous
sensor data, an acceleration sensor, and an image as inputs. We used CNN and CBAM
models for the accelerometer data, and bi-LSTM and ResNet10 for the image data. From
the analysis of nine behaviors using the Berkeley MHAD dataset, the overall accuracy
was 94.08% with a skeleton image and accelerometer data, and 93.09% with a skeleton
image, the coordinates, and accelerometer data. In addition, when the image in which
the input signal was inverted toward the left or right was mixed and tested with the
existing image, the analysis showed an accuracy of 93.41%, and when tested by mixing
the vertically inverted image and the existing image, the performance was 94.16%. In the
case of mixing the left and right images, the upside-down image, and the existing image, a
good performance of 93.44% was demonstrated. In addition, in the case of the acceleration
signal, white noise was added to one channel value, and the test results confirmed that the
performance reached 93.23%. As a result of testing using inversion and noise data added, it
has been confirmed that the proposed model is robust, with a degradation in performance
of only approximately 1%. This enables a stable behavior recognition even when various
poses are changed or when noise is applied to the sensor data.
Author Contributions: Conceptualization, J.K. and A.C.; methodology, J.K., J.S. (Jieun Shin), J.S.
(Jaewon Shin) and A.C.; software, J.K., J.S. (Jieun Shin) and J.S. (Jaewon Shin); validation, J.K., J.S.
(Jieun Shin), J.S. (Jaewon Shin), D.L. and A.C.; formal analysis, J.K., J.S. (Jieun Shin), J.S. (Jaewon Shin),
D.L. and A.C.; investigation, J.K., J.S. (Jieun Shin), J.S. (Jaewon Shin) and A.C.; resources, J.K., J.S.
(Jieun Shin), J.S. (Jaewon Shin) and A.C.; data curation, J.K., J.S. (Jieun Shin) and J.S. (Jaewon Shin);
writing—original draft preparation, J.K., J.S. (Jieun Shin) and J.S. (Jaewon Shin); writing—review and
editing, D.L. and A.C.; visualization, J.K., J.S. (Jieun Shin) and J.S. (Jaewon Shin); supervision, D.L.
and A.C.; project administration, D.L. and A.C.; funding acquisition, D.L. and A.C. All authors have
read and agreed to the published version of the manuscript.
Funding: This work was supported by a National Research Foundation of Korea (NRF) grant funded
by the Korean government (MSIT) (No. NRF-2021R1F1A1062181) and by the Gachon University
Research Fund of 2019 (GCU-2019-0386).
Institutional Review Board Statement: Not applicable. In this study, we use the open databases to
collect and to analyze the data.
Informed Consent Statement: Not applicable. In this study, we use the open databases to collect
and to analyze the data.
Data Availability Statement: Berkeley MHAD. Available online: https://tele-immersion.citris-uc.
org/berkeley_mhad (accessed on 21 December 2021).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Bieber, G.; Voskamp, J.; Urban, B. Activity Recognition for Everyday Life on Mobile Phones. In Proceedings of the International
Conference on Universal Access in Human-Computer Interaction, San Diego, CA, USA, 19–24 July 2009; pp. 289–296.
2. Puyau, M.R.; Adolph, A.L.; Vohra, F.A.; Zakeri, I.; Butte, N.F. Prediction of Activity Energy Expenditure Using Accelerometers in
Children. Med. Sci. Sports Exerc. 2004, 36, 1625–1631. [CrossRef] [PubMed]
3. Andrea, M.; Mary, R.; Haskell, W.L.; Angelo, M.S.; Stephen, S.I. Activity Recognition in Youth Using Single Accelerometer Placed
at Wrist or Ankle. Med. Sci. Sports Exerc. 2017, 49, 801.
4. Stewart, G.T.; Yonglei, Z.; Weng-Keen, W. Machine learning for activity recognition: Hip versus wrist data. Physiol. Meas. 2014,
35, 2183–2189.
5. Anahita, H.; Shayan, F.; Eleanne, V.; Lia, V.; Rima, H.; Majid, S.; Alex, B. Children Activity Recognition: Challenges and Strategies.
In Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI,
USA, 18–21 July 2018.
6. Ahmadi, M.; O'Neil, M.; Fragala-Pinkham, M.; Lennon, N.; Trost, S. Machine learning algorithms for activity recognition in
ambulant children and adolescents with cerebral palsy. J. Neuroeng. Rehabil. 2018, 15, 1–9.
7. Ignatov, A. Real-time human activity recognition from accelerometer data using convolutional neural networks. Appl. Soft
Comput. 2018, 62, 915–922. [CrossRef]
8. Wang, L. Recognition of human activities using continuous autoencoders with wearable sensors. Sensors 2016, 16, 189. [CrossRef]
[PubMed]
9. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [CrossRef] [PubMed]
10. Nweke, H.F.; Teh, Y.W.; Al-Garadi, M.A.; Alo, U.R. Deep learning algorithms for human activity recognition using mobile and
wearable sensor networks: State of the art and research challenges. Expert Syst. Appl. 2018, 105, 233–261. [CrossRef]
11. Hammerla, N.Y.; Halloran, S.; Plötz, T. Deep, convolutional, and recurrent models for human activity recognition using wearables.
arXiv 2016, arXiv:1604.08880.
12. Chen, K.; Zhang, D.; Yao, L.; Guo, B.; Yu, Z.; Liu, Y. Deep Learning for Sensor-based Human Activity Recognition: Overview,
Challenges, and Opportunities. ACM Comput. Surv. (CSUR) 2021, 54, 1–40. [CrossRef]
13. River, P.; Valarezo, E.; Kim, T.S. Recognition of human hand activities based on a single wrist IMU using recurrent neural
networks. Int. J. Pharma Med. Biol. Sci. 2017, 6, 114–118. [CrossRef]
14. Zhao, Y.; Yang, R.; Chevalier, G.; Xu, X.; Zhang, Z. Deep residual Bidir-LSTM for human activity recognition using wearable
sensors. Math. Prob. Eng. 2018, 7316954, 1–13. [CrossRef]
15. Hassan, M.M.; Ullah, S.; Hossain, M.S.; Alelaiwi, A. An end-to-end deep learning model for human activity recognition from
highly sparse body sensor data in Internet of Medical Things environment. J. Supercomput. 2021, 77, 2237–2250. [CrossRef]
16. Wan, S.; Qi, L.; Xu, X.; Tong, C.; Gu, Z. Deep learning models for real-time human activity recognition with smartphones. Mob.
Netw. Appl. 2020, 25, 743–755. [CrossRef]
17. Kale, G.V. Human activity recognition on real time and offline dataset. Int. J. Intell. Syst. Appl. Eng. 2019, 7, 60–65. [CrossRef]
18. Kim, K.; Jalal, A.; Mahmood, M. Vision-based human activity recognition system using depth silhouettes: A smart home system
for monitoring the residents. J. Electr. Eng. Technol. 2019, 14, 2567–2573. [CrossRef]
19. Khaire, P.; Kumar, P.; Imran, J. Combining CNN streams of RGB-D and skeletal data for human activity recognition. Pattern
Recognit. Lett. 2018, 115, 107–116. [CrossRef]
20. Amir, S.; Jun, L.; Tian-Tsong, N.; Gang, W. NTU RGB+D: A Large-Scale Dataset for 3D Human Activity Analysis. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
21. Shahroudy, A.; Ng, T.T.; Yang, Q.; Wang, G. Multimodal multipart learning for action recognition in depth videos. IEEE Trans.
Pattern Anal. Mach. Intell. 2015, 38, 2123–2129. [CrossRef]
22. Ordóñez, F.; Roggen, D. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition.
Sensors 2016, 16, 115. [CrossRef]
23. Wei, H.; Jafari, R.; Kehtarnavaz, N. Fusion of Video and Inertial Sensing for Deep Learning–Based Human Action Recognition.
Sensors 2019, 19, 3680. [CrossRef]
24. Kiran, S.; Khan, M.A.; Javed, M.Y.; Alhaisoni, M.; Tariq, U.; Nam, Y.; Sharif, M. Multi-Layered Deep Learning Features Fusion for
Human Action Recognition. Comput. Mater. Contin. 2021, 69, 4061–4075. [CrossRef]
25. Khan, M.A.; Akram, T.; Sharif, M.; Javed, M.Y.; Muhammad, N.; Yasmin, M. An implementation of optimized framework
for action classification using multilayers neural network on selected fused features. Pattern Anal. Appl. 2019, 22, 1377–1397.
[CrossRef]
26. Helmi, A.M.; Al-Qaness, M.A.; Dahou, A.; Damaševičius, R.; Kavičius, T.; Elaziz, M.A. A novel hybrid gradient-based optimizer
and grey wolf optimizer feature selection method for human activity recognition using smartphone sensors. Entropy 2021, 23,
1065. [CrossRef] [PubMed]
27. OpenPose API. Available online: https://github.com/CMU-Perceptual-Computing-Lab/openpose (accessed on 21 December 2021).
28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
29. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
30. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference
on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19.
31. Berkeley MHAD. Available online: https://tele-immersion.citris-uc.org/berkeley_mhad (accessed on 21 December 2021).