Triple Cross-Domain Attention On Human Activity Recognition Using Wearable Sensors
Abstract—Efficiently identifying activities of daily living (ADL) provides very important contextual information that can improve the effectiveness of various sports tracking and healthcare applications. Recently, attention mechanisms that selectively focus on time series signals have been widely adopted in sensor-based human activity recognition (HAR), since they can enhance the interesting target activity and ignore irrelevant background activity. Several attention mechanisms have been investigated and achieve remarkable performance in HAR scenarios. Despite their success, these prior attention methods ignore the cross-interaction between different dimensions. In this paper, to avoid this shortcoming, we present a triplet cross-dimension attention for the sensor-based activity recognition task, where three attention branches are built to capture the cross-interaction between the sensor dimension, temporal dimension and channel dimension. The effectiveness of the triplet attention method is validated through extensive experiments on four public HAR datasets, namely UCI-HAR, PAMAP2, WISDM and UNIMIB-SHAR, as well as a weakly labeled HAR dataset. Extensive experiments show consistent improvements in classification performance with various backbone models such as a plain CNN and ResNet, demonstrating the good generalization ability of triplet attention. Visualization analysis is provided to support our conclusions, and an actual implementation is evaluated on a Raspberry Pi platform.

Index Terms—Activity recognition, attention, weakly supervised learning, wearable sensors, convolutional neural networks.

Manuscript received 15 July 2021; revised 2 November 2021; accepted 22 November 2021. Date of publication 5 January 2022; date of current version 23 September 2022. This work was supported in part by the National Science Foundation of China under Grant 61971228 and in part by the Natural Science Foundation of Jiangsu Province under Grant BK20191371. (Corresponding author: Lei Zhang.)

Yin Tang, Lei Zhang, Qi Teng, and Fuhong Min are with the School of Electrical and Automation Engineering, Nanjing Normal University, Nanjing 210023, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Aiguo Song is with the School of Instrument Science and Engineering, Southeast University, Nanjing 210096, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TETCI.2021.3136642

I. INTRODUCTION

DURING recent years, human activity recognition (HAR) using various motion sensors embedded in smartphones or other wearable devices has become a new research hotspot in ubiquitous and mobile computing, owing to the rapid growth of application demands in domains such as health care, life assistance and exercise monitoring. The sensor-based HAR task [1]–[3] can be regarded as a multi-channel time series classification problem, in which a fixed-length sliding window is used to split the time series signal into equal segments. Various traditional machine learning approaches such as Logistic Regression, Decision Trees, Random Forest and naive Bayesian methods have been widely adopted in the HAR area [4], [5] and have achieved remarkable performance. However, these shallow learning methods often require handcrafted feature extraction from the data, which depends heavily on expert knowledge of the specific domain [6]. Such handcrafted feature engineering inevitably restricts the practicability of a HAR model when the task is transferred from one domain to another.

Lately, deep learning techniques [7]–[9] have broken through the limits of shallow learning methods, enabling richer feature representations to be learned automatically without domain-specific knowledge. In particular, compared with shallow learning methods whose handcrafted features can only recognize low-level or simple activities, convolutional neural networks (CNNs) [7] are more suitable for recognizing complex activities thanks to their local dependency and scale invariance. CNNs have significantly pushed state-of-the-art performance in HAR scenarios given their rich representation ability. Despite this effectiveness, deep HAR still faces many key challenges, one of which is ground truth annotation [10]. In a supervised learning setting, the use of deep CNNs relies heavily on strictly labeled activity sensor data for training. Nevertheless, compared with HAR that uses video data (e.g., from a GoPro motion camera), the high-dimensional time series data from motion sensors such as accelerometers is much harder to interpret and annotate, which makes annotation for HAR cumbersome and arduous.

Such challenges can be tackled by utilizing an attention mechanism [11], [12], which has shown great potential in a large variety of computer vision and natural language processing tasks. The learning of attention weights can help the model focus on the target object, thereby improving recognition accuracy. On the other hand, for an annotator in charge of recording sensor data, it is much simpler to identify whether a target activity occurs in a long sensor sequence than to mark its exact extent. If a specific activity can be recognized from coarse or weak labels, it will significantly ease the burden of manual labeling. Intuitively, the attention mechanism is capable of telling where or what to focus on by selectively enhancing the interesting target activity while weakening redundant or even irrelevant information. It therefore deserves further research whether the attention mechanism can promote the state-of-the-art performance of HAR by consciously improving the output feature maps of a convolutional network.

Recently, hard attention [13] and soft attention [14] have been proposed for the weakly supervised learning scenario,
2471-285X © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: MAULANA AZAD NATIONAL INSTITUTE OF TECHNOLOGY. Downloaded on October 04,2023 at 05:23:17 UTC from IEEE Xplore. Restrictions apply.
1168 IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, VOL. 6, NO. 5, OCTOBER 2022
in which sensor data does not need to be strictly labeled. One only needs to know which kind of activity has occurred in a long sensor sequence, without the specific location of the target activity. The learned attention weights can help to focus on the target activity within a long background sequence. However, these two attention mechanisms can only tell us where to focus, ignoring channel information, which plays an important role in deciding what to focus on. The dual attention network [15] for weakly supervised HAR applications has demonstrated the advantages of computing multiple attentions. Although the dual attention mechanism provides significant performance improvements in the HAR scenario, it does not account for the importance of capturing cross-dimension interaction, which has been shown to have a favorable impact in computer vision tasks.

In this paper, we propose a novel triplet attention network for the HAR scenario, which blends three attention branches. Given a standard convolutional layer, consider its input tensor with shape C × T × S, in which C, T and S are the channel, temporal and sensor-modality dimensions respectively. Each branch is responsible for capturing the cross-dimension interaction between the spatial dimensions (T × S) and the channel dimension (C) of the sensor input. We conduct extensive experiments to evaluate the triplet attention network on several public benchmark HAR datasets, consisting of the UCI-HAR dataset [16], PAMAP2 dataset [17], WISDM dataset [18] and UNIMIB-SHAR dataset [19], as well as the weakly labeled HAR dataset. The experimental results show that triplet attention performs better than single or dual attention. The main contributions of this work are summarized as follows:

• Firstly, we propose a new architecture relying on a triple attention mechanism for the HAR task, which helps extract richer activity feature representations by building three attention branches to capture the cross-interaction between the sensor dimension, temporal dimension, and channel dimension.

• Second, the triple attention strengthens the importance of cross-dimension interaction, and is superior to its predecessors, i.e., single or dual attention.

• Finally, extensive experiments are conducted on several public HAR datasets, and several key hyperparameters are analyzed in detail. We also examine an actual implementation on a Raspberry Pi platform with an ARM-based computing core. The experimental results show that the triplet attention method provides competitive results at a negligible computational cost.

The rest of the paper is organized as follows. Section II introduces related works on attention-based HAR methods. Section III presents the overall architecture of the proposed triplet attention. In Section IV and Section V, we detail experimental results obtained on four public HAR datasets and the weakly labeled HAR dataset, which are compared with existing state-of-the-art methods. Moreover, several ablation studies on the triplet attention method are provided. Section VI summarizes our conclusion.

II. RELATED WORKS

Attention in human perception is everywhere: it selectively focuses on interesting parts while suppressing other irrelevant or even misleading information. During the past few years, the attention mechanism has been widely incorporated into various deep CNN architectures, where it can significantly improve performance on large-scale computer vision tasks. Several attention mechanisms related to our work are introduced as follows. Hu et al. first proposed the Squeeze-and-Excitation Networks (SENet) [20], which successfully utilize global average-pooled features to compute channel attention in an efficient way. This was followed by the introduction of the Convolutional Block Attention Module (CBAM) [21], in which the combination of channel attention and spatial attention leads to significant performance improvement. Global-Context Networks (GC-Net) [22] proposed a novel NL-block, which takes into account global context modeling and lightweight modular design. More recently, Misra et al. [23] adopted a triplet attention mechanism for a variety of computer vision tasks, which concentrates on cross-dimension interaction. However, the attention mechanism has rarely been explored in the sensor-based HAR scenario.

Due to the popularity of the attention mechanism in deep learning, a surge of research has emerged that utilizes attention for handling HAR tasks. Recently, Ma et al. [24] proposed a novel AttnSense for HAR, which incorporates the attention mechanism into a Gated Recurrent Units (GRU) subnet to capture the dependencies of sensor signals in both the spatial and temporal domains. Zeng et al. [25] highlighted the important parts of different time series and sensor modalities by designing temporal attention and sensor attention with Long Short-Term Memory (LSTM). Compared to recurrent neural networks, CNNs have a better ability for feature extraction. In recent works, two mainstream attention mechanisms, hard attention [13] and soft attention [14], have been incorporated into convolutional architectures to perform weakly supervised HAR tasks, but they ignore the importance of sensor channels. Gao et al. [15] proposed a novel dual attention method for HAR that blends channel attention and spatial attention, demonstrating obvious superiority in handling the multimodal HAR task. In order to capture the cross-domain interaction of sensor signals, we for the first time propose a new triple attention network for the HAR task, which is able to extract meaningful cross-dimensional features by building three main attention branches.

III. MODEL

The channel attention [20] typically computes a singular weight, i.e., a scalar for each channel of the input sensor tensor, which is used to scale the feature maps and thereby generate the attention effect. Although this lightweight channel attention is very effective, there is an obvious shortcoming in its computing process. Usually, in order to produce these singular weights for each channel, one has to use global average pooling to spatially subsample the input sensor tensor along each channel, which inevitably leads to a significant loss in
TANG et al.: TRIPLE CROSS-DOMAIN ATTENTION ON HUMAN ACTIVITY RECOGNITION USING WEARABLE SENSORS 1169
Fig. 1. The overview of our proposed triplet attention (TA) module for the HAR system. It depicts the three pipelines: data collection and preprocessing, model training, and activity recognition. T&S, T&C and C&S represent temporal-sensor interaction, temporal-channel interaction, and channel-sensor interaction, respectively.
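The TA module of Fig. 1 can be condensed into a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the paper's exact implementation: Z-Pooling is taken to be concatenated max- and average-pooling (consistent with the 2-channel shapes given in Section III), the convolution kernel is fixed at 3 × 1, and the learnable branch weights of Eq. (5) are initialized to 1/3 so that training starts from the plain average of Eq. (4).

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    # Paper's Z_Pooling, assumed here to concatenate max- and
    # average-pooled features along dim 1, yielding 2 channels.
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    # Z-Pool -> k x 1 conv -> BN -> sigmoid, applied multiplicatively.
    def __init__(self, k=3):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=(k, 1), padding=(k // 2, 0), bias=False),
            nn.BatchNorm2d(1),
        )

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three branches combined as in Eq. (5); the learnable weights
    alpha are initialized to 1/3 (an assumption) so the module starts
    from the simple average of Eq. (4)."""
    def __init__(self, k=3):
        super().__init__()
        self.branch_tc = AttentionGate(k)  # temporal-channel interaction
        self.branch_cs = AttentionGate(k)  # channel-sensor interaction
        self.branch_ts = AttentionGate(k)  # temporal-sensor interaction
        self.alpha = nn.Parameter(torch.full((3,), 1.0 / 3.0))

    def forward(self, x):                  # x: (N, C, T, S)
        # branch 1: rotate along the T axis -> (N, S, T, C), then back
        y1 = self.branch_tc(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # branch 2: rotate along the S axis -> (N, T, C, S), then back
        y2 = self.branch_cs(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # branch 3: no rotation, attends over the (T, S) plane
        y3 = self.branch_ts(x)
        return self.alpha[0] * y1 + self.alpha[1] * y2 + self.alpha[2] * y3
```

Because the gates only contain a 2-to-1 convolution and a batch norm, the module adds almost no parameters, matching the "negligible cost" claim.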
The first branch is in charge of calculating the cross-interaction between the temporal dimension and the channel dimension. Firstly, the tensor χ with input shape (C × T × S) is rotated 90° counter-clockwise along the T axis to generate a new tensor χ1 with the shape (S × T × C). χ1 is then fed into Z-Pooling, which generates a tensor χ1* with the shape (2 × T × C). In the third stage, χ1* is passed through a standard convolution with a k × 1 kernel (e.g., 3 × 1 or 5 × 1), followed by batch normalization, which results in an intermediate output of shape (1 × T × C). After passing through a sigmoid activation, the intermediate output is turned into the attention weights ω1, which are applied to χ1; the result is then rotated 90° clockwise along the T axis to keep the shape of the input χ.

In the second branch, the cross-interaction between the sensor dimension and the channel dimension is computed in a similar way. The tensor χ with input shape (C × T × S) is rotated 90° counter-clockwise along the S axis, which provides a new tensor χ2 with the shape (T × C × S). χ2 is then passed through the Z-Pooling layer, which generates a tensor χ2* with the shape (2 × C × S). In the third stage, χ2* is passed through a standard convolution with a k × 1 kernel (e.g., 3 × 1 or 5 × 1), followed by batch normalization, which results in an intermediate output of shape (1 × C × S). After passing through a sigmoid activation, the intermediate output is turned into the attention weights ω2, which are applied to χ2; the result is then rotated 90° clockwise along the S axis to maintain the shape of the input χ.

For the third branch, the channels of the input tensor χ are reduced to two by the Z-Pooling operation, which provides the tensor χ3 with the shape (2 × T × S). χ3 is then fed into a standard convolution with a k × 1 kernel (e.g., 3 × 1 or 5 × 1), followed by batch normalization, which results in an intermediate output. The output is then fed into a sigmoid activation, which generates the attention weights ω3 with shape (1 × T × S). The attention weights ω3 are then applied to the input χ.

Finally, the refined tensors from the three branches are aggregated. In the simplest case, this is a plain average:

Y = (1/3)(R(ω1 χ1)) + (1/3)(R(ω2 χ2)) + (1/3)(ω3 χ3),   (4)

where ω1, ω2 and ω3 are the three cross-dimensional attention weights, and χ1 and χ2 represent the rotated tensors obtained by rotating the input tensor χ 90° counter-clockwise along the T axis and S axis respectively. R denotes the corresponding 90° clockwise rotation. Compared with the simple averaging above, model performance can be further improved by introducing a combination of three learnable weight parameters, which can be formulated as:

Y = α1(R(ω1 χ1)) + α2(R(ω2 χ2)) + α3(ω3 χ3),   (5)

which will be detailed in Section V.B.

IV. EXPERIMENT

In the following, we describe the experimental setup and main results in detail. All the experiments are divided into three parts. Firstly, to demonstrate the superiority of the proposed triplet attention method, we compare classification results on four publicly available HAR datasets: UCI-HAR, WISDM, PAMAP2 and UNIMIB-SHAR. All datasets were recorded by various sensors such as accelerometers and gyroscopes, and reflect human activities in different scenarios. Secondly, detailed ablation experiments are provided to analyze the impact of several hyperparameters. Finally, we evaluate the performance of triplet attention on the weakly supervised activity recognition task, which uses the weakly labeled dataset collected by He et al. [26]. The impact of the different cross-dimension attentions for HAR is also explored.

A. Training Details

Our model is trained by minimizing the cross-entropy (CE) loss using mini-batch gradient descent, where the batch size is set to 200. An Adam optimizer with a dynamic learning rate is used. The initial learning rate is set to 0.001 and is reduced by a factor of 0.1 after every 100 epochs. All the experiments are implemented in Python using the PyTorch framework on a server with an Intel i7-6850K CPU, 64 GB RAM and an NVIDIA RTX 3090 GPU. Since the classes in various naturalistic activity datasets are highly imbalanced, class weights need to be reconsidered according to their sample proportions. Thus, the mean F1 score [27] is used as the metric to evaluate final performance.

B. Datasets

A comprehensive evaluation of the proposed method is conducted using four popular HAR datasets that include both high-dimensional and low-dimensional sensor modalities. The sensor data is segmented using the sliding window technique with different window sizes and step lengths, which has an important influence on the recognition system's practical performance. We select the same window size and step length adopted in previous successful cases [15], [27] to ensure fair comparison.

• UCI-HAR [16]: This dataset was collected from 30 recruited volunteers. Each was required to wear a Samsung Galaxy S II smartphone around the waist and perform six simple daily activities: "Walking," "Going upstairs," "Going downstairs," "Sitting," "Standing," and "Laying". Three-axis accelerometer and gyroscope signals were recorded at a fixed frequency of 50 Hz. The raw data is first preprocessed by a noise filter and then segmented by a sliding window with a fixed length of 128 and 50% overlap. Finally, the whole dataset is randomly split into two parts, with 70% for training and 30% for test.

• PAMAP2 [17]: The Physical Activity Monitoring for Aging People 2 dataset was collected from 9 participants performing 12 daily activities ("Walking," "Lying down," "Standing," etc.) and exercises ("Watching TV," "Computer work," "Car driving," etc.). Three inertial measurement units (IMUs) were placed on the hand, chest, and ankle of each subject to collect raw sensor data from an accelerometer, gyroscope, magnetometer, and heart rate monitor. At a 100 Hz sampling rate, the collection process lasted around 10 hours. To perform fair comparisons with previous works [27], the sensor signal is down-sampled to 33.3 Hz and segmented with a 5.12 s sliding window and 78% overlap. Generally, this
dataset is randomly divided into two parts, in which 80% is used for training and 20% for test.

• WISDM [18]: The WISDM samples were collected from 29 volunteer subjects who performed 6 discriminative human activities ("Walking," "Jogging," "Sitting," "Standing," "Going downstairs" and "Going upstairs") while carrying mobile phones with the Android operating system in a front leg pocket. It contains 1,098,213 samples recorded at a rate of 20 Hz from a triaxial accelerometer. Accordingly, the accelerometer data is preprocessed with a sliding window of 10 seconds and 95% overlap (200 readings/window). This dataset is split into two parts, with 80% for training and 20% for test.

• UNIMIB-SHAR [19]: This dataset includes 11,771 samples from 30 test subjects, intended for human pose estimation and fall detection. During data collection, a Samsung Galaxy Nexus I9250 smartphone embedded with a Bosch BMA220 3D accelerometer measured sensor signals at a frequency of 50 Hz. The dataset consists of 17 fine-grained categories, further split into 9 classes of activities of daily living and 8 classes of falls. Accordingly, the sliding windows are produced with a size T = 151 (151 readings/window). Our experiment divides this dataset into two parts, with 70% for training and the rest for test.

C. Comparison Algorithms

The triplet attention mechanism can be used to update existing network architectures at a negligible cost. Extensive experiments are conducted to evaluate the performance gain brought by the triplet attention part. To demonstrate the generalization ability of the triplet attention and analyze how it influences the classification results, we use a standard CNN and an equally-sized ResNet [28] as our backbones, introduced as follows. Table I presents their detailed architectures.

• Standard CNN: The baseline CNN consists of three standard convolution layers. Batch normalization and ReLU activation are applied after each convolutional layer.

V. DISCUSSION

The proposed method is compared with both baselines on the four public HAR datasets. We have three major observations from Table II. Firstly, it can be seen that ResNet outperforms the original CNN due to its strong feature extraction ability; for instance, ResNet outperforms the standard CNN by 0.21% in terms of accuracy on the UCI-HAR dataset. Secondly, the results indicate that our triplet attention can further improve performance by clear gains over these baselines. From Table II, it can easily be seen that the proposed method achieves 1.35% and 0.62% performance gains on the PAMAP2 dataset when using CNN and ResNet as backbones, respectively. Similar results are also observed on the WISDM dataset. Meanwhile, triplet attention with almost the same complexity is superior to the original CNN and the equally-sized ResNet by 0.96% and 1.47% in terms of accuracy on the UNIMIB-SHAR dataset, respectively. This comparison consistently verifies the effectiveness of our model on different baselines. That is to say, it can significantly boost the accuracy of baselines, demonstrating that it generalizes well across various models on HAR datasets. Lastly, we note that the triplet attention introduces no extra parameters compared to the plain counterparts, which motivates us to build new lightweight networks by applying our proposed module.

In addition, the triplet attention method is compared with other state-of-the-art algorithms [15], [31], [32], [35]. Table II summarizes the main experimental results. Compared with recent state-of-the-art methods, it obtains better or competitive results without increasing model complexity. As shown in Table II, the integration of triplet attention with ResNet is superior by 0.44% to Xiao et al.'s result [31], which uses a federated learning method, on the UCI-HAR dataset. Compared with Teng et al.'s result [27] using a local loss method, the triplet attention achieves a 0.23% performance gain in terms of accuracy on the PAMAP2 dataset. On the WISDM dataset, our method is also able to beat Janarthanan et al.'s result [35] by 1.11%. Finally, the triplet attention also achieves very competitive accuracy on the UNIMIB-SHAR dataset, outperforming all previous results [15], [19], [27], [36]. In particular, as mentioned above, this indicates that the triplet attention can be used to update existing network architectures.

A. Visualization Analysis

To evaluate whether the cross-dimensional interaction provided by triplet attention can capture richer internal representations of sensor signals, we provide sample visualizations to better understand the cross-dimensional interaction between the sensor dimension, temporal dimension and channel dimension on the PAMAP2 dataset. The results show that our triplet attention
TABLE II
THE CLASSIFICATION PERFORMANCE ON FOUR HAR DATASETS
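The training configuration described in Section IV-A (cross-entropy loss, batch size 200, Adam starting at 0.001 and decayed by 0.1 every 100 epochs) can be sketched as follows; the tiny linear model and random mini-batch are placeholders standing in for the backbones of Table I, not the paper's actual code.

```python
import torch
import torch.nn as nn

# Placeholder model: a flattened (128 x 9) segment -> 6 activity classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(128 * 9, 6))

# Optimizer and schedule from Section IV-A: Adam at 1e-3, decayed
# by a factor of 0.1 every 100 epochs; CE loss; batch size 200.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
criterion = nn.CrossEntropyLoss()

x = torch.randn(200, 128, 9)        # one mini-batch of sensor segments
y = torch.randint(0, 6, (200,))     # activity labels

for epoch in range(2):              # shortened; training runs far longer
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()                # lr would drop to 1e-4 at epoch 100
```

When class imbalance matters, `nn.CrossEntropyLoss(weight=...)` accepts per-class weights, which is one way to realize the reweighting by sample proportion mentioned in the text.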
TABLE III
PERFORMANCE FOR DIFFERENT TRIPLET ATTENTION BRANCHES
Fig. 6. The test mean F1 (%) score at different sliding window sizes.
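The sliding-window segmentation whose window size is varied in Fig. 6 can be sketched in a few lines; the function name and the example sizes below are illustrative, with the UCI-HAR setting (window 128, 50% overlap) used as the example.

```python
import numpy as np

def sliding_windows(signal, window, step):
    """Split a (T, S) multi-channel time series into fixed-length
    segments, as in the preprocessing stage of Fig. 1.
    window: segment length in samples; step: hop between segments
    (e.g., 50% overlap on UCI-HAR -> window=128, step=64)."""
    starts = range(0, signal.shape[0] - window + 1, step)
    return np.stack([signal[s:s + window] for s in starts])

# e.g., a 10 s recording at 50 Hz with 9 sensor channels:
x = np.random.randn(500, 9)
segments = sliding_windows(x, window=128, step=64)
# segments.shape == (6, 128, 9)
```

Each row of `segments` then becomes one classification sample, which is why the window size trades off temporal context against label granularity.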
TABLE V
THE MEAN F1 (%) SCORE OF LEAVE-ONE-SUBJECT-OUT
EXPERIMENT ON PAMAP2 DATASET
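The mean F1 score reported in tables such as Table V can be computed as a macro average of per-class F1 values, so minority classes count as much as majority ones; this is a plain NumPy sketch, and the exact averaging convention of [27] may differ.

```python
import numpy as np

def mean_f1(y_true, y_pred, num_classes):
    # Macro-averaged F1: per-class precision/recall folded into F1,
    # then averaged with equal weight across classes.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

# Three-class toy example:
score = mean_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], num_classes=3)
# score == (2/3 + 4/5 + 1) / 3, approximately 0.8222
```

Because every class contributes 1/num_classes to the average, this metric penalizes a classifier that ignores rare activities, which plain accuracy would not.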
in Table VI. Respectively, the proposed method achieves 2.88%, 2.24% and 2.92% performance gains over all baselines using CNN, VGGNet and ResNet as backbones. At the same time, our method is also superior to DeepConvLSTM [37] by a large margin of 2.91%. Compared with Wang et al.'s work [38], the triplet attention achieves a 0.3% performance gain. The results show that cross-dimensional attention is also conducive to enhancing the feature representation of weakly supervised learning.

In the final step, visualization analysis is provided to identify which part of the target signal is the most important along the temporal dimension. In the weakly labeled dataset, every signal window often contains the target activity and the background activity that submerges it, such as "walking," which differs from a strictly labeled HAR dataset. Four sensor signal windows, roughly labeled as "jogging," "jumping," "going downstairs" and "going upstairs," are shown in Fig. 12. Because our triplet attention method can focus on only the interesting part of the target activity and weaken the background activities, it is beneficial for ground truth data annotation.

Fig. 12. Some examples of the location of the target activity in the weakly labeled sensor data.

REFERENCES

[1] Z. Wang, M. Jiang, Y. Hu, and H. Li, "An incremental learning method based on probabilistic neural networks and adjustable fuzzy clustering for human activity recognition by using wearable sensors," IEEE Trans. Inf. Technol. Biomed., vol. 16, no. 4, pp. 691–699, Jul. 2012.
[2] M. A. Alsheikh, A. Selim, D. Niyato, L. Doyle, S. Lin, and H. P. Tan, "Deep activity recognition models with triaxial accelerometers," in Proc. 30th AAAI Conf. Artif. Intell., 2016, pp. 8–13.
[3] A. Akbari and R. Jafari, "Personalizing activity recognition models through quantifying different types of uncertainty using wearable sensors," IEEE Trans. Biomed. Eng., vol. 67, no. 9, pp. 2530–2541, Sep. 2020.
[4] Z. Wang, D. Wu, J. Chen, A. Ghoneim, and M. A. Hossain, "A triaxial accelerometer-based human activity recognition via EEMD-based features and game-theory-based feature selection," IEEE Sensors J., vol. 16, no. 9, pp. 3198–3207, May 2016.
[5] Z. Chen, Q. Zhu, Y. C. Soh, and L. Zhang, "Robust human activity recognition using smartphone sensors via CT-PCA and online SVM," IEEE Trans. Ind. Informat., vol. 13, no. 6, pp. 3070–3080, Dec. 2017.
[6] A. Bulling, U. Blanke, and B. Schiele, "A tutorial on human activity recognition using body-worn inertial sensors," ACM Comput. Surv., vol. 46, no. 3, pp. 1–33, 2014.
[7] M. Zeng et al., "Convolutional neural networks for human activity recognition using mobile sensors," in Proc. 6th Int. Conf. Mobile Comput. Appl. Serv., 2014, pp. 197–205.
[8] B. Meng, X. Liu, and X. Wang, "Human action recognition based on quaternion spatial-temporal convolutional neural network and LSTM in RGB videos," Multimedia Tools Appl., vol. 77, no. 20, pp. 26901–26918, 2018.
[9] X. Li, Y. Wang, B. Zhang, and J. Ma, "PSDRNN: An efficient and effective HAR scheme based on feature extraction and deep learning," IEEE Trans. Ind. Informat., vol. 16, no. 10, pp. 6703–6713, Oct. 2020.
[10] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache, "Learning visual features from large weakly supervised data," in Proc. Eur. Conf. Comput. Vis., New York, NY, USA: Springer, 2016, pp. 67–84.
[11] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[12] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, "A2-Nets: Double attention networks," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 352–361.
[13] K. Xu et al., "Show, attend and tell: Neural image caption generation with visual attention," in Proc. Int. Conf. Mach. Learn., 2015, pp. 2048–2057.
[14] S. Sharma, R. Kiros, and R. Salakhutdinov, "Action recognition using visual attention," in Proc. Neural Inf. Process. Syst. Time Ser. Workshop, 2015.
[15] W. Gao, L. Zhang, Q. Teng, J. He, and H. Wu, "DanHAR: Dual attention network for multimodal human activity recognition using wearable sensors," Appl. Soft Comput., vol. 111, 2021, Art. no. 107728.
[16] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, "A public domain dataset for human activity recognition using smartphones," in Proc. 21st Eur. Symp. Artif. Neural Netw. Comput. Intell. Mach. Learn., 2013, pp. 437–442.
[17] A. Reiss and D. Stricker, "Introducing a new benchmarked dataset for activity monitoring," in Proc. 16th Int. Symp. Wearable Comput., 2012, pp. 108–109.
[18] J. R. Kwapisz, G. M. Weiss, and S. A. Moore, "Activity recognition using cell phone accelerometers," ACM SigKDD Explorations Newslett., vol. 12, no. 2, pp. 74–82, 2011.
[19] D. Micucci, M. Mobilio, and P. Napoletano, "UniMiB SHAR: A dataset for human activity recognition using acceleration data from smartphones," Appl. Sci., vol. 7, no. 10, 2017, Art. no. 1101.
[20] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[21] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3–19.
[22] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "GCNet: Non-local networks meet squeeze-excitation networks and beyond," in Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2019.
[23] D. Misra, T. Nalamada, A. U. Arasanipalai, and Q. Hou, "Rotate to attend: Convolutional triplet attention module," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2021, pp. 3139–3148.
[24] H. Ma, W. Li, X. Zhang, S. Gao, and S. Lu, "AttnSense: Multi-level attention mechanism for multimodal human activity recognition," in Proc. 28th Int. Joint Conf. Artif. Intell., 2019, pp. 3109–3115.
[25] M. Zeng et al., "Understanding and improving recurrent networks for human activity recognition by continuous attention," in Proc. ACM Int. Symp. Wearable Comput., 2018, pp. 56–63.
[26] J. He, Q. Zhang, L. Wang, and L. Pei, "Weakly supervised human activity recognition from wearable sensors by recurrent attention learning," IEEE Sensors J., vol. 19, no. 6, pp. 2287–2297, Mar. 2019.
[27] Q. Teng, K. Wang, L. Zhang, and J. He, "The layer-wise training convolutional neural networks using local loss for sensor based human activity recognition," IEEE Sensors J., vol. 20, no. 13, pp. 7265–7274, Jul. 2020.
[28] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[29] Z. N. Khan and J. Ahmad, "Attention induced multi-head convolutional neural network for human activity recognition," Appl. Soft Comput., vol. 110, 2021, Art. no. 107671.
[30] A. Ignatov, "Real-time human activity recognition from accelerometer data using convolutional neural networks," Appl. Soft Comput., vol. 62, pp. 915–922, 2018.
[31] Z. Xiao, X. Xu, H. Xing, F. Song, X. Wang, and B. Zhao, "A federated learning system with enhanced feature extraction for human activity recognition," Knowl.-Based Syst., vol. 229, 2021, Art. no. 107338.
[32] S. Wan, L. Qi, X. Xu, C. Tong, and Z. Gu, "Deep learning models for real-time human activity recognition with smartphones," Mobile Netw. Appl., vol. 25, no. 2, pp. 743–755, 2020.
[33] K. Walse, R. Dharaskar, and V. Thakare, "Performance evaluation of classifiers on WISDM dataset for human activity recognition," in Proc. Second Int. Conf. Inf. Commun. Technol. Competitive Strategies, 2016, pp. 1–7.
[34] D. Ravi, C. Wong, B. Lo, and G.-Z. Yang, "Deep learning for human activity recognition: A resource efficient implementation on low-power devices," in Proc. IEEE 13th Int. Conf. Wearable Implantable Body Sensor Netw., 2016, pp. 71–76.
[35] R. Janarthanan, S. Doss, and S. Baskar, "Optimized unsupervised deep learning assisted reconstructed coder in the on-nodule wearable sensor for human activity recognition," Measurement, vol. 164, 2020, Art. no. 108050.
[36] T. Liu, S. Wang, Y. Liu, W. Quan, and L. Zhang, "A lightweight neural network framework using linear grouped convolution for human activity recognition on mobile devices," J. Supercomput., pp. 1–21, 2021.
[37] F. J. Ordóñez and D. Roggen, "Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition," Sensors, vol. 16, no. 1, p. 115, 2016.
[38] K. Wang, J. He, and L. Zhang, "Sequential weakly labeled multiactivity localization and recognition on wearable sensors using recurrent attention networks," IEEE Trans. Hum.-Mach. Syst., vol. 51, no. 4, pp. 355–364, Aug. 2021.

Yin Tang received the B.S. degree from the Hunan University of Engineering, Xiangtan, China, in 2018. He is currently working toward the M.S. degree with Nanjing Normal University, Nanjing, China. His research interests include activity recognition, computer vision, and machine learning.

Lei Zhang received the B.Sc. degree in computer science from Zhengzhou University, Zhengzhou, China, the M.S. degree in pattern recognition and intelligent system from the Chinese Academy of Sciences, Beijing, China, and the Ph.D. degree from Southeast University, Nanjing, China, in 2011. In 2008, he was a Research Fellow with IPAM, UCLA. He is currently an Associate Professor with the School of Electrical and Automation Engineering, Nanjing Normal University, Nanjing, China. His research interests include machine learning, human activity recognition, and computer vision.

Qi Teng received the B.S. degree from the Henan University of Engineering, Zhengzhou, China, in 2017. He is currently working toward the M.S. degree with Nanjing Normal University, Nanjing, China. His research interests include activity recognition, computer vision, and machine learning.

Fuhong Min received the master's degree from the School of Communication and Control Engineering, Jiangnan University, Wuxi, China, in 2003, and the Ph.D. degree from the School of Automation, Nanjing University of Science and Technology, Nanjing, China, in 2007. From 2009 to 2010, she was a Postdoctoral Fellow with the School of Mechanical Engineering, University of Southern Illinois, Carbondale, IL, USA. She is currently a Professor with the School of Electrical and Automation Engineering, Nanjing Normal University, Nanjing, China. Her research interests include circuits and signal processing.

Aiguo Song (Senior Member, IEEE) received the B.S. degree in automatic control and the M.S. degree in measurement and control from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 1990 and 1993, respectively, and the Ph.D. degree in measurement and control from Southeast University, Nanjing, China, in 1998. He was an Associate Researcher with the Intelligent Information Processing Laboratory, Southeast University. From 1998 to 2000, he was an Associate Professor with the Department of Instrument Science and Engineering, Southeast University. From 2000 to 2003, he was the Director of the Robot Sensor and Control Laboratory, Southeast University. From April 2003 to April 2004, he was a Visiting Scientist with the Laboratory for Intelligent Mechanical Systems, Northwestern University, Evanston, IL, USA. He is currently a Professor with the School of Instrument Science and Engineering, Southeast University. His research interests include teleoperation control, haptic display, Internet telerobotics, and distributed measurement systems.