GRU-based Attention Mechanism For Human Activity Recognition
Abstract—Sensor data based Human Activity Recognition (HAR) has gained interest due to its applications in practical fields. Among the increasing number of approaches incorporating feature learning of sequential time-series sensor data, the deep learning based ones in particular have performed reasonably well under a uniform labeled data distribution. However, most of these methods do not properly capture the temporal context of time-steps in sequential time-series data. Moreover, the situation becomes worse for an imbalanced class distribution, which is the usual case for HAR using body-worn sensor devices. To solve these issues, we have integrated a hierarchical attention mechanism with the recurrent units of a neural network in order to obtain the temporal context within the time-steps of a data sequence. The model introduced in this paper achieves better performance, with respect to well-defined evaluation metrics, under both uniform and imbalanced class distributions than the existing state-of-the-art deep learning based models.

Index Terms—Human Activity Recognition, Attention Mechanism, Gated Recurrent Unit

I. INTRODUCTION

Human Activity Recognition (HAR) is a domain of research aimed at recognizing human actions and movements from a series of observations. The increasing public adoption of smart devices with sensors such as accelerometers and gyroscopes has created the opportunity to organize considerable amounts of sensor data for the classification of human activity. Research focused on HAR incorporates the compilation of sensor readings into sequential time-series data and develops models for the recognition of activities by analyzing the acquired sequential sensor readings. HAR poses a variety of promising application domains, including physical activity annotation in the field of medical data analysis [1], personal assistant systems [2], augmented and virtual reality [3], and many others.

In past years, the state-of-the-art solutions to HAR were mainly based on traditional machine learning techniques. These techniques mainly depend on heuristic, hand-crafted feature engineering that relies on low level representations. As traditional machine learning models use low level representations, they lack generalization [4], [5]. High level abstraction along with low level representations is necessary for a likely solution to this generalization problem. Deep learning based methods deal with both low and high level representations of data. Therefore, two variants of deep learning methods, namely convolutional [6] and recurrent [7] neural network models, have recently become dominant over traditional methods in terms of performance. For example, for HAR, convolutional neural networks (CNN) are used in [8] and [9], and recurrent neural networks (RNN) are used in [7] and [10].

Although deep learning based methods show promising performance, the conventional sliding window based approach for CNN is unable to fully capture the temporal context of the sensor readings [11], which is required for better classification of activities. For sequence data, RNN performs better than CNN in most cases as it captures sequence information [12]. However, RNN faces long term dependency problems [13] when the sequence is long enough. Note that the sequences found in HAR data are usually long, so it is necessary to capture long term dependency information for better classification.

The Gated Recurrent Unit (GRU) is a variant of RNN which incorporates long term dependency information [14]. It is expected that GRU will perform better on HAR data, as it is able to capture the temporal context of sensor data. It is noteworthy that not all temporal contexts are equally important for classification; some are more important than others. Hence, it is necessary to give more attention to the important temporal contexts. Moreover, during the
acquisition of HAR data, the training data found for different activities are usually not equal in size, which causes a class imbalance problem. The emphasis on important temporal context also helps to address this problem.

To capture the nature of different continuous movements and to extract the salient features, in this paper we propose an attention mechanism based GRU model architecture. This architecture plays a crucial role in capturing the context of the sensor readings and exhibits class imbalance tolerance. The main contributions of this paper are as follows:
• We propose to use hierarchical temporal attention with GRU for capturing important temporal contexts.
• The hierarchical model proposed here is parallelizable.
• The model is able to handle the class imbalance problem.

The paper describes related work in Section II, where different approaches for the recognition of human activity are discussed. Section III describes the proposed methodology. Section IV contains the results as well as interpretations of the outcome of the proposed method. Section V concludes the paper.

II. RELATED WORK

The most cited work, proposed in [15], uses the fast Fourier transform for extracting features useful in recognizing different activities, and produces satisfactory results with numerous sensors placed on distinctive parts of the body, in conjunction with various data mining algorithms. Different approaches like K-nearest neighbors [16], decision trees [17], and multi-class support vector machines [18] are used to classify human activities. All of these approaches require hand-crafted features and show poor results when classifying similar types of activities, like walking down and walking up.

In recent years, deep learning has become prominent for learning models that represent features from low-level to high-level abstraction, as used in [4], [5], which allows features to be extracted automatically without hand-crafted feature engineering. A common form of neural network called the fully connected neural network (FCNN), with a Principal Component Analysis based feature technique, is used in [8] and [9] for HAR and sensor data. But FCNN is very expensive in terms of memory (weights) and computation (connections). It also has a great chance of overfitting, as every node is connected with every node in every layer. To extract additional features, a technique called shift-invariant sparse coding [9] was proposed and used in combination with FCNN and hand-crafted features. The convolutional neural network (CNN or convnet) [19] with dropout [20] for reducing overfitting is a recent breakthrough for feature extraction; it is used by [21] in gesture recognition with state-of-the-art results. A hierarchical model using convnets is proposed in [22]. To recognize human activity for unlabeled as well as labeled data, [23] used a semi-supervised convnet model to learn discriminative hidden features, where the convnet learns to recognize features of an object and combines these features to recognize larger objects.

A Recurrent Neural Network (RNN) based approach proposed in [7] to recognize human activities and abnormal behaviour shows some promise but leaves room for improvement. When the sequence is long, RNN faces long term dependency problems. To solve this problem, a combination of convnet and long short term memory [10] is used in [24], which outperforms other models on the KTH dataset. Gated Recurrent Units [25] are a variant of RNN that also addresses long term dependency issues. The Adam optimizer [26] is a popular choice for training such neural networks.

The attention mechanism, introduced for sequence to sequence tasks such as neural machine translation [27] and speech recognition [28], has also been used for classification tasks in the domain of natural language processing [29]. The context vector computed by attention helps the network learn where to focus on the representation generated by the encoder when generating the output at each time step, instead of compressing the entire sequence into a fixed vector at once. A simplified form of the attention mechanism [30] has been proposed for feed forward networks, which captures some long term dependencies.

The approaches described so far for activity recognition fail to capture the temporal context of sensor readings at different time steps of activity data, which is required for better accuracy and generalization. Another approach, proposed by [31], uses an attention mechanism on top of a complex DeepConvLSTM architecture for finding relevant temporal context for activity recognition. In that work, the attention score is generated by applying attention after the convolutional and pooling layers in DeepConvLSTM. This score does not reflect the hierarchy of simple features detected from raw sensor data and complex features detected from hidden state outputs in the case of RNN (or from deeper layers in CNN). For finding relevant features for activity recognition, the feature selection approaches used in [32], [33] can be applied. However, in this work, we propose an attention mechanism with GRU which distills more complex features that are helpful for better classification.

III. PROPOSED METHOD

The proposed method combines several building blocks for constructing the network. We use Gated Recurrent Units and two different types of attention mechanism, which are described in the following sections.

A. Gated Recurrent Unit

GRUs are a variant of recurrent networks that have been shown to capture long term dependencies in temporal data, while not suffering from the vanishing gradient problem of regular RNNs and requiring fewer parameters than LSTM. The hidden state h<t> calculation is based on (3), with the input vector X<t> and the previous hidden state h<t−1> going through the update and reset gates in (1) and (2).

Z<t> = σ(Wzx · X<t> + Wzh · h<t−1> + bz)    (1)

Γ<t> = σ(WΓx · X<t> + WΓh · h<t−1> + bΓ)    (2)
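To make the gate computations above concrete, here is a minimal NumPy sketch of a single GRU step. Equations (1) and (2) are taken directly from the paper; since equation (3) for the hidden state is not reproduced in this excerpt, the candidate state and the final gated interpolation below follow the standard GRU formulation [14], [25], and all variable and parameter names are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step over input vector x_t and previous state h_prev.

    z implements eq. (1) (update gate); g implements eq. (2) (reset gate).
    The candidate state and the final interpolation follow the standard GRU,
    since the paper's eq. (3) is not shown in this excerpt; note that the
    interpolation convention (z vs. 1 - z) varies between papers.
    """
    z = sigmoid(p["Wzx"] @ x_t + p["Wzh"] @ h_prev + p["bz"])              # update gate, eq. (1)
    g = sigmoid(p["Wgx"] @ x_t + p["Wgh"] @ h_prev + p["bg"])              # reset gate, eq. (2)
    h_tilde = np.tanh(p["Whx"] @ x_t + p["Whh"] @ (g * h_prev) + p["bh"])  # candidate state
    return z * h_tilde + (1.0 - z) * h_prev                                # gated interpolation

# Toy usage: 9 sensor channels per time step, hidden size 32, one 128-step window.
rng = np.random.default_rng(0)
n_in, n_h = 9, 32
p = {name: rng.standard_normal(shape) * 0.1
     for name, shape in [("Wzx", (n_h, n_in)), ("Wzh", (n_h, n_h)),
                         ("Wgx", (n_h, n_in)), ("Wgh", (n_h, n_h)),
                         ("Whx", (n_h, n_in)), ("Whh", (n_h, n_h))]}
p.update(bz=np.zeros(n_h), bg=np.zeros(n_h), bh=np.zeros(n_h))
h = np.zeros(n_h)
for x_t in rng.standard_normal((128, n_in)):
    h = gru_step(x_t, h, p)  # h now summarizes the sensor window
```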
Fig. 1: Stacked 2-layer GRU model with simplified and context-sensitive attention mechanisms
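The two attention variants named in the Fig. 1 caption are defined in parts of the paper not reproduced in this excerpt. For orientation only, below is a minimal sketch of additive attention over a sequence of GRU hidden states, in the spirit of [27], [29]; the scoring function, weight names, and dimensions are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def additive_attention(H, W, b, u):
    """Additive attention over a sequence of hidden states.

    H: (T, d) matrix of GRU hidden states for one sensor window.
    Each time step is scored for relevance, and the scores are normalized
    into weights that produce a single context vector for classification.
    Illustrative only; not the paper's exact formulation.
    """
    scores = np.tanh(H @ W + b) @ u  # (T,) unnormalized relevance per time step
    alpha = softmax(scores)          # attention weights over time steps
    return alpha @ H, alpha          # context vector (d,) and the weights

# Toy usage: 128 time steps, hidden size 32.
rng = np.random.default_rng(1)
T, d = 128, 32
H = rng.standard_normal((T, d))
W = rng.standard_normal((d, d)) * 0.1
b = np.zeros(d)
u = rng.standard_normal(d) * 0.1
context, alpha = additive_attention(H, W, b, u)
```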
TABLE II: Accuracy (%) on the given test data when 2/3 of a specific class's data is dropped from the training set

Model                       Class 1   Class 2   Class 3   Class 4   Class 5   Class 6
Baseline (CNN)              90.57     89.68     89.41     89.89     91.45     91.58
2-Stacked GRU + Attention   90.74     91.38     92.26     92.03     92.57     93.32
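Table II's setting removes two-thirds of the training windows of one class before training, then evaluates on the unchanged test set. The following is a minimal sketch of that subsampling step, assuming NumPy arrays of windows X and integer labels y; the function and argument names are illustrative, not from the paper.

```python
import numpy as np

def drop_class_fraction(X, y, target_class, frac=2/3, seed=0):
    """Remove `frac` of the training windows belonging to one class."""
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(y == target_class)                       # windows of the target class
    drop = rng.choice(idx, size=int(len(idx) * frac), replace=False)
    keep = np.setdiff1d(np.arange(len(y)), drop)                  # everything we did not drop
    return X[keep], y[keep]
```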
TABLE III: Per-class AUC on the given test data with subject-wise class drop, over 5-fold evaluation

Model Architecture                   k-fold    Class 1  Class 2  Class 3  Class 4  Class 5  Class 6
Baseline (CNN)                       0         0.9899   0.9880   0.9891   0.9902   0.9867   0.9926
                                     1         0.9829   0.9865   0.9846   0.9812   0.9919   0.9849
                                     2         0.9809   0.9871   0.9857   0.9871   0.9906   0.9840
                                     3         0.9894   0.9884   0.9843   0.9835   0.9912   0.9826
                                     4         0.9907   0.9903   0.9930   0.9861   0.9879   0.9900
                                     Average   0.9868   0.9881   0.9873   0.9856   0.9896   0.9868
Stacked GRU (2 Layers) + Attention   0         0.9891   0.9917   0.9937   0.9886   0.9893   0.9916
                                     1         0.9898   0.9846   0.9923   0.9879   0.9902   0.9907
                                     2         0.9929   0.9907   0.9921   0.9882   0.9911   0.9927
                                     3         0.9880   0.9921   0.9910   0.9871   0.9899   0.9935
                                     4         0.9918   0.9939   0.9890   0.9929   0.9921   0.9928
                                     Average   0.9903   0.9906   0.9916   0.9889   0.9905   0.9922
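Table III reports per-class AUC, which treats each activity as a one-vs-rest binary problem. A minimal sketch of how such scores could be computed with scikit-learn follows; roc_auc_score and label_binarize are real scikit-learn functions, but the model outputs here are random placeholders, not the paper's predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

def per_class_auc(y_true, y_prob, n_classes=6):
    """One-vs-rest AUC per activity class, as reported in Table III."""
    Y = label_binarize(y_true, classes=list(range(n_classes)))  # (N, n_classes) binary targets
    return [roc_auc_score(Y[:, c], y_prob[:, c]) for c in range(n_classes)]

# Placeholder usage with random predictions standing in for model outputs.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 6, size=200)
y_prob = rng.random((200, 6))
y_prob /= y_prob.sum(axis=1, keepdims=True)
print(per_class_auc(y_true, y_prob))
```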
Fig. 2: Confusion matrix when half of the training data for class 'Walking Downstairs' is dropped, for the stacked GRU model

Fig. 3: Confusion matrix when half of the training data for class 'Walking Downstairs' is dropped, for the CNN model