Multi-Class Stress Detection Through Heart Rate Variability A Deep Neural Network Based Study

This article presents a deep learning model to detect multi-class stress levels based on heart rate variability data. The model achieves 99.9% accuracy in classifying no stress, interruption stress, and time pressure stress using both time and frequency domain features of HRV. The deep learning model outperforms existing methods. Feature selection is also able to achieve high accuracy using only a subset of the available HRV features.


Received 11 April 2023, accepted 4 May 2023, date of publication 8 May 2023, date of current version 14 June 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3274478

Multi-Class Stress Detection Through Heart Rate Variability: A Deep Neural Network Based Study

JON ANDREAS MORTENSEN 1, MARTIN EFREMOV MOLLOV 1, AYAN CHATTERJEE 1,2, DEBASISH GHOSE 1,3 (Senior Member, IEEE), AND FRANK Y. LI 1


1 Department of Information and Communication Technology, University of Agder, N-4898 Grimstad, Norway
2 Department of Holistic Systems, Simula Metropolitan Center for Digital Engineering, N-0167 Oslo, Norway
3 School of Economics, Innovation, and Technology, Kristiania University College, N-5022 Bergen, Norway

Corresponding author: Debasish Ghose ([email protected])


This work was supported by the Research Council of Norway through the Orchestrating Internet of Things and Machine Learning for Early
Risk Detection to Ensure Inpatients Safety (StaySafe) Program under Grant 309257.

ABSTRACT Stress is a natural human reaction to demands or pressure, usually when these are perceived as harmful and/or toxic. When stress becomes constantly overwhelming and prolonged, it increases the risk of mental health problems and physiological uneasiness. Furthermore, chronic stress raises the likelihood of mental health conditions such as anxiety, depression, and sleep disorders. Although measuring stress using physiological parameters such as heart rate variability (HRV) is a common approach, achieving ultra-high accuracy based on HRV measurements remains a challenging task. HRV is not equivalent to heart rate. While heart rate is the average number of heartbeats per minute, HRV represents the variation of the time interval between successive heartbeats. The HRV measurements are related to the variance of RR intervals, which stand for the time between successive R peaks. In this study, we investigate the role of HRV features as stress detection bio-markers and develop a machine learning-based model for multi-class stress detection. More specifically, a convolutional neural network (CNN) based model is developed to detect multi-class stress, namely, no stress, interruption stress, and time pressure stress, based on both time- and frequency-domain features of HRV. Validated through a publicly available dataset, SWELL-KW, the achieved accuracy score of our model has reached 99.9% (Precision = 1, Recall = 1, F1-score = 1, and MCC = 0.99), thus outperforming the existing methods in the literature. In addition, this study demonstrates the effectiveness of essential HRV features for stress detection using a feature extraction technique, i.e., analysis of variance.

INDEX TERMS Stress detection, heart rate variability, convolutional neural network, feature extraction.

The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Callico.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

57470 VOLUME 11, 2023
J. A. Mortensen et al.: Multi-Class Stress Detection Through Heart Rate Variability

I. INTRODUCTION
Physical or mental imbalances caused by noxious stimuli trigger stress to maintain homeostasis. Under chronic stress, the sympathetic nervous system becomes overactive, leading to physical, psychological, and behavioral abnormalities [1]. Stress levels are often measured using subjective methods to extract perceptions of stress. Stress level measurement based on collected heart rate variability (HRV) data can help to reveal the presence of stress by observing its effects on the autonomic nervous system (ANS) [2].

Typically, people with anxiety disorders have chronically lower resting HRV compared with healthy people. As revealed in [2] and [3], HRV increases with relaxation and decreases with stress. Indeed, HRV is usually higher when a heart is beating slowly and vice versa. Therefore, heart rate and HRV generally have an inverse relationship [2], [3]. HRV varies over time based on activity levels and the amount of work-related stress.

Furthermore, stress is usually associated with a negative notion of a person and is considered to be a subjective feeling of human beings that might affect emotional and physical well-being. It is described as a psychological and biological reaction to internal or external stressors [4], including a biological or chemical agent and environmental stimulation that induce stress in an organism [5]. On a molecular scale, stress impacts the ANS [6], which uses sympathetic and parasympathetic components to regulate the cardiovascular system. The sympathetic component in a human body [7] works analogously to a car's gas pedal. It activates the fight-or-flight response, giving the body a boost of energy to respond to negative influences. In contrast, the parasympathetic component is the brake for a body. It stimulates the body's rest-and-digest reaction by relaxing the body when a threat has passed. Given the fact that the ANS regulates the mental stress level of a human being, physiological measurements such as electrocardiogram (ECG), electromyogram (EMG), galvanic skin response (GSR), HRV, heart rate, blood pressure, breath frequency, and respiration rate can be used to assess mental stress [8].

ECG signals are commonly adopted to extract HRV [9]. HRV is defined as the variation across intervals between consecutive regular RR intervals¹, and it is measured by determining the length between two successive heartbeat peaks from an ECG reading. Conventionally, HRV has been accepted as a term to describe variations of both instantaneous heart rate and RR intervals [12].

¹ An RR interval represents the time from an R-peak to the next R-peak [10]. It defines the time elapsed between two successive R-waves of the Q-wave, R-wave and S-wave (QRS) signal on the electrocardiogram [11].

Obtaining HRV from ECG readings requires clinical settings and specialized technical knowledge for data interpretation. Thanks to the recent technological advances on the Internet of medical things (IoMT) [17], it is possible to deploy commercially available wearable or non-wearable IoMT devices to monitor and record heart rate measurements.

Based on ECG data analysis (or HRV features), various machine learning (ML) and deep learning (DL) algorithms have been developed in recent years for stress prediction [20], [21], [22], [23], [24], [25], [26], [27] (see more details in Sec. II). Among the publicly available datasets for stress detection, SWELL-KW, developed in [13] and [14], is one of the two most popular ones. However, none of the existing ML and DL studies based on the SWELL-KW dataset have achieved ultra-high accuracy, especially for multi-class stress level classification [15], [16]. Therefore, there exists a research gap on developing novel ML models which are able to achieve ultra-high prediction accuracy.

Motivated by various existing applied ML and DL based studies on HRV feature processing for stress level classifications, we have designed and developed a one-dimensional convolutional neural network (1D CNN) model for multi-class stress classification and demonstrate its superiority over the state-of-the-art models based on the SWELL-KW dataset in terms of prediction accuracy. More specifically, we have performed studies on stress detection using both traditional machine learning algorithms and multi-layer perceptron (MLP) algorithms, which are inspired by the fully connected neural network (FCNN) architecture. In our work, we have developed a 1D CNN model which is based on the convolution operation. A CNN reduces the number of training parameters: whereas an MLP takes a vector as input, a CNN takes a tensor as input and can therefore capture spatial relations.

While the accuracy achieved with full features is nearly 100%, we have also introduced a feature reduction algorithm based on the analysis of variance (ANOVA) F-test and demonstrate that it is possible to achieve an accuracy score of 96.5% with less than half of the features that are available in the SWELL-KW dataset. Such a feature extraction reduces the computational load during the model training phase.

In a nutshell, the novelty and the main contributions of this study are summarized as follows:
• We have developed a novel 1D CNN model to detect multi-class stress status with outstanding performance, achieving 99.9% accuracy with a Precision, F1-score, and Recall score of 1.0 respectively and a Matthews correlation coefficient (MCC) score of 99.9%. We believe this is the first study that achieves such a high score of accuracy for multi-class stress classification.
• Furthermore, we reveal that not all 34 HRV features are necessary to accurately classify multi-class stress. We have performed feature optimization to select an optimized feature set to train a 1D CNN classifier, achieving a performance score that beats the existing classification models based on the SWELL-KW dataset.
• Our model with selected top-ranked HRV features does not require resource-intensive computation and it also achieves excellent accuracy without sacrificing critical information.

The remainder of the paper is organized as follows. After summarizing related work and pointing out the distinction between our work and the existing work in Sec. II, we briefly introduce the framework for stress status classification, dataset, and data preprocessing in Sec. III. Then the developed CNN model is presented in Sec. IV. Afterwards, Sec. V defines the performance metrics to evaluate the proposed classifier and Sec. VI presents the numerical results. Further discussions are provided in Sec. VII. Finally, the paper is concluded in Sec. VIII.

II. RELATED WORK
The related work considered in this study covers HRV data quality and various state-of-the-art ML/DL algorithms developed for stress detection.

For HRV data quality, a detailed review on data received from ECG and IoMT devices such as Elite HRV, H7, Polar, and Motorola Droid can be found in [18]. 23 studies indicated minor errors when comparing the HRV values obtained from commercially available IoMT devices with ECG instrument-based measurements. In practice, such a small-scale error in HRV measurements is reasonable, as getting HRVs using portable IoMT devices is more practical and cost-effective, and no laboratory/clinical equipment is required [18], [19].


On the other hand, there have been a lot of recent research efforts on ECG data analysis to classify stress through ML and DL algorithms [20], [21], [22], [23]. Existing algorithms have focused mainly on binary (stress versus non-stress) and multi-class stress classifications. For instance, the authors in [4] classified HRV data into stressed and normal physiological states. The authors compared different ML approaches for classifying stress, such as naive Bayes, k-nearest neighbour (KNN), support vector machine (SVM), MLP, random forest, and gradient boosting. The best recall score they achieved was 80%. A similar comparison study was performed in [27], where the authors showed that SVM with radial basis function (RBF) provided an accuracy score of 83.33% and 66.66% respectively, using the time-domain and frequency-domain features of HRV. Moreover, dimension reduction techniques have been applied to select the best temporal and frequency domain features in HRV [24]. Binary classification, i.e., stressed versus not stressed, was performed using CNN in [25], through which the authors achieved an accuracy score of 98.4%. Another study, StressClick [26], employed a random forest algorithm to classify stressed versus not stressed based on mouse-click events, i.e., the gaze-click pattern collected from the commercial computer webcam and mouse.

In [14], tasks for multi-class stress classification (e.g., no stress, interruption stress, and time pressure stress) were performed using SVM based on the SWELL-KW dataset. The highest accuracy they achieved was 90%. Furthermore, another publicly available dataset, WESAD, was used in [27] for multi-class (amusement versus baseline versus stress) and binary (stress versus non-stress) classifications. In their investigations, ML algorithms achieved accuracy scores up to 81.65% for three-class categorization. The authors also checked the performance of deep learning algorithms, where they achieved an accuracy level of 84.32% for three-class stress classification. Furthermore, it is worth mentioning that novel deep learning techniques, such as genetic deep learning convolutional neural networks (GDCNNs) [38], [39], have appeared as a powerful tool for two-dimensional data classification tasks. To apply GDCNN to 1D data, however, comprehensive modifications or adaptations are required and such a topic is beyond the scope of this paper.

As summarized in Tab. 5 of [15], in a recent study published online in August 2022, the best results for stress detection based on the SWELL-KW dataset for the single-dataset models developed therein are 88.64% (Accuracy), 93.01% (Precision), 92.68% (Recall), and 82.75% (F1-score) respectively. Compared with these state-of-the-art models, the model developed in this study has achieved much better performance (see more details in Subsec. VI-F, especially Tab. 3 of this paper).

III. FRAMEWORK OVERVIEW AND DATA PREPROCESSING
In this section, we give an overview of the framework for multi-class stress classification. While the overview and model preparation (including data collection, dataset, and data preprocessing) are outlined in this section, the CNN model itself is presented in the next section.

A. FRAMEWORK OVERVIEW
Fig. 1 illustrates the schematic diagram of the proposed stress level classification framework. Briefly, the framework consists of the following procedures.
• Data collection and datasets. HRV signals are collected and separated into a training dataset and a testing dataset. They are used to define the model's architecture and to assess the proposed model's effectiveness.
• Data preprocessing and feature extraction. Data are preprocessed to fit into the feature ranking algorithm. In this study, ANOVA F-tests [28] and forward sequential feature selection are employed for feature ranking and selection respectively.
• Classification and validation. The designed DL-based multi-class classifier is trained, tested, and validated with significant features and annotations (e.g., no stress, interruption condition, and time pressure) labeled by medical professionals.
• Testing. In the testing phase, distinctive features are considered from the new test samples, and the class label is resolved using all classification parameters estimated in training. Different numbers of features are extracted and tested.
• Performance assessment. The performance of the classifier is measured against discrimination analysis metrics, such as Accuracy, Precision, Recall, F1-score, and MCC.

FIGURE 1. Framework of the proposed stress status classification model: From data collection to stress level classification.

B. DATA COLLECTION AND DATASET
We adopt the SWELL-KW dataset, which was collected in a study reported in [13] and [14]. Various types of data have been recorded, including computer logging, facial expression from camera recordings, body postures from a Kinect 3-dimensional (3D) sensor, heart rate (variability), and skin conductance from body sensors.

In the experiments, 25 volunteers performed typical knowledge tasks (writing reports, making presentations, reading emails, searching for information) during which their psychological and biological status data were recorded. The working conditions of the participants were manipulated with two types of stressors: email interruptions and time pressure. The SWELL-KW dataset comprises HRV computed for stress and user modeling. The subjective experiences of participants with task load, mental effort, mood, and perceived stress were also recorded. Each participant was exposed to three different working environments and the data were then labeled by medical professionals as follows.
• No stress: The participants are permitted to work on the activities for as long as they need, up to 45 minutes. However, they are unaware of the maximum duration of the task.
• Time pressure: Under time pressure, the time to complete the same job was decreased to 2/3 of its time in the normal condition.
• Interruption: The participants were interrupted when they received 8 emails in the middle of a given activity. Some emails were pertinent to their tasks, and the participants were asked to take particular actions, whereas others were totally irrelevant to the ongoing tasks.

TABLE 1. Explanation.

FIGURE 2. Distribution of data in SWELL-KW [13].

The distribution of the collected data with three different stress classes is presented in Fig. 2. The HRV indices were computed by extracting an inter-beat interval (IBI) signal from the peaks of each participant's ECG signals. For each participant, the experiment lasted for approximately 3 hours. From the HRV data, various time-domain and frequency-domain features are extracted, as presented in Tab. 1. Furthermore, we illustrate in Fig. 3 the time-domain features, e.g., time intervals between consecutive heart beats (RR interval) and heart rate of HRV signals. Correspondingly, the frequency-domain features, i.e., the signal power levels with respect to low frequency (LF) and high frequency (HF), are illustrated in Fig. 4. These plots are generated using the first 1000 samples from the SWELL-KW dataset.
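To make the notion of time-domain HRV features more concrete, the following minimal sketch computes a few widely used quantities from a short series of RR intervals. It is an illustration under stated assumptions, not the procedure used to build SWELL-KW: the function name `hrv_time_domain`, the returned keys, the RR values, and the choice of heart rate as 60000 divided by the mean RR interval are all assumptions introduced here.

```python
import math
import statistics


def hrv_time_domain(rr_ms):
    """Compute common time-domain HRV features from RR intervals (milliseconds).

    Hypothetical helper for illustration; definitions follow standard
    textbook formulas, not necessarily those used for the dataset.
    """
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")
    # Differences between successive RR intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    mean_rr = statistics.fmean(rr_ms)
    return {
        "mean_rr": mean_rr,                                   # average RR interval
        "sdrr": statistics.stdev(rr_ms),                      # sample std of RR intervals
        "rmssd": math.sqrt(statistics.fmean([d * d for d in diffs])),
        "pnn50": 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs),
        "hr": 60000.0 / mean_rr,                              # beats per minute (assumed definition)
    }


# Hypothetical RR series in milliseconds, not taken from SWELL-KW.
rr_example = [800, 810, 790, 860, 800, 805]
features = hrv_time_domain(rr_example)
```

For the example series, two of the five successive differences exceed 50 ms, so the pNN50 value is 40%. Frequency-domain features such as LF and HF power require a spectral estimate of the RR series and are not covered by this sketch.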


FIGURE 3. Time-domain features of HRV.

FIGURE 4. Frequency-domain features of HRV.

C. DATA PREPROCESSING
The collected HRV data in the SWELL-KW dataset are time-variant. For classification, we re-construct the HRV data, which was a discrete time series with timestamps, to a series indexed with sequence numbers without timestamps. Moreover, we convert all data into the numerical format. We also remove participants' noisy, incomplete, or missing data. These processing steps result in data from 25 participants, with 410322 records and 34 features for stress level classification.

Moreover, we perform normality tests using methods such as Shapiro–Wilk [29] on each feature of the datasets, and the results reveal that the data samples do not look Gaussian. The normality tests are performed following the standard hypothesis testing method, where a P-value ≥ α = 0.05 indicates that a sample looks Gaussian. Further data preprocessing steps are performed as follows.
• Splitting data for training and testing as 80|20 for train|test datasets, respectively;
• Normalization with a standard scaler method to confine the feature values within the range [0, 1], as some of the selected features were in different magnitudes; and
• Reshaping of each row of the training features into a 1D vector so that it becomes an input to the input layer of the deep learning model.

IV. A CNN MODEL FOR STRESS STATUS CLASSIFICATION
In this section, we present the developed deep learning model for stress status classification. As shown on the right-hand side of Fig. 1, the model consists of feature ranking, feature extraction, and stress level classification.

A. FEATURE RANKING AND EXTRACTION
Firstly, we rank the essential features based on their relevance to the classification task. To do so, the ANOVA [31] F-test is adopted to select the significant features from the SWELL-KW dataset for feature ranking and extraction. ANOVA is a popular tool to perform a parametric statistical hypothesis test that assesses whether the means of two or more data samples (typically three or more) are from the same distribution or not. An F-statistic or F-test is a statistical test method that adopts ANOVA to calculate the ratio between variance values, such as variance from two different samples, or explained and unexplained variance. Furthermore, ANOVA can be used when one variable is numeric and the other one is categorical, such as when numerical input data and a classification outcome variable are compared in a classification task.

In this study, we first employ all features for stress classification and then drop the less significant features based on the importance of features (i.e., feature ranking) before performing the classification task. In the latter case, the training time is shortened while keeping the accuracy of the model.

B. A CNN DL MODEL FOR STRESS CLASSIFICATION
The designed DL model for stress level classification is developed based on the conventional, well-known CNN architectures [32]. CNN is a powerful tool for automatic feature extraction and learning from 1D data sequences. The HRV features of the CNN architecture that are used in our model are illustrated in Tab. 1. For our model design, we retain a reasonable number of neurons in each layer based on the common heuristics (e.g., validation loss, hidden units are a fraction of the input). The CNN kernels slide over the components of the 1D input pattern during convolution.

More specifically, our 1D CNN model consists of an input layer, multiple hidden layers, a max-pooling layer, a flattening layer, and an output layer, as depicted in Fig. 5. The input layer is a 1D convolutional layer, and it consists of 64 filters, a kernel of size 2, and a rectified linear unit (ReLU) activation function. The ReLU activation function helps to avoid the vanishing gradient so that a faster convergence can be obtained. The 1D max-pooling layer has been introduced to reduce the dimensions of the feature maps. The flattening layer has been adopted to convert the down-sampled data into a 1D vector that acts as an input to the output layer. A softmax activation function has been adopted in the output layer.
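The layer sequence described above (1D convolution with ReLU, max pooling, flattening, and a softmax output) can be traced with a small dependency-free forward pass. This is only a sketch under several assumptions that are not taken from the paper: the 34 features are treated as a single-channel sequence of length 34, a single convolutional layer with 4 filters stands in for the 64-filter input layer and the hidden layers, the pooling size is 2, and the weights are random and untrained. All function names are hypothetical.

```python
import math
import random


def conv1d_relu(x, kernels, biases):
    """Valid 1D convolution over a single-channel sequence, followed by ReLU."""
    k = len(kernels[0])
    fmaps = []
    for w, b in zip(kernels, biases):
        fmaps.append([
            max(0.0, sum(w[j] * x[i + j] for j in range(k)) + b)
            for i in range(len(x) - k + 1)
        ])
    return fmaps


def max_pool1d(fmaps, size=2):
    """Non-overlapping max pooling applied to each feature map."""
    return [
        [max(f[i:i + size]) for i in range(0, len(f) - size + 1, size)]
        for f in fmaps
    ]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def forward(x, params):
    """Conv1D + ReLU -> max pool -> flatten -> dense softmax output."""
    fmaps = conv1d_relu(x, params["kernels"], params["conv_bias"])
    pooled = max_pool1d(fmaps, size=2)
    flat = [v for f in pooled for v in f]
    logits = [
        sum(w * v for w, v in zip(row, flat)) + b
        for row, b in zip(params["dense_w"], params["dense_b"])
    ]
    return softmax(logits)


def init_params(n_features=34, n_filters=4, kernel_size=2, n_classes=3, seed=0):
    """Random, untrained weights; sizes other than kernel_size=2 are assumptions."""
    rng = random.Random(seed)
    pooled_len = (n_features - kernel_size + 1) // 2  # pool size 2, as in forward()
    flat_len = n_filters * pooled_len
    return {
        "kernels": [[rng.uniform(-0.5, 0.5) for _ in range(kernel_size)]
                    for _ in range(n_filters)],
        "conv_bias": [0.0] * n_filters,
        "dense_w": [[rng.uniform(-0.5, 0.5) for _ in range(flat_len)]
                    for _ in range(n_classes)],
        "dense_b": [0.0] * n_classes,
    }


params = init_params()
sample_rng = random.Random(1)
x = [sample_rng.gauss(0.0, 1.0) for _ in range(34)]  # one hypothetical scaled sample
probs = forward(x, params)  # three class probabilities that sum to 1
```

With a kernel of size 2, each filter produces 33 values from the 34 inputs, which pooling reduces to 16 before flattening. A trained model would learn the kernels and dense weights by minimizing a loss over the training set, which this sketch does not attempt.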


The output layer performs multi-class classification (no stress, time pressure, and interruption) based on a probability distribution.

FIGURE 5. The structure of the developed 1D CNN model for stress classification.

For loss calculation, we introduce the categorical cross-entropy loss function to compile our 1D CNN model. For model training, we adopt the adaptive moment estimation (ADAM) optimizer, as it is computationally efficient and requires less memory. To reduce the learning rate and improve the performance of our model, a validation split of 0.05 is configured.

As the platform to train and validate the developed model, we rely on Google Colab. Specifically, the model is trained with the default configuration of Google Colab, e.g., Intel(R) Xeon(R) central processing unit (CPU) @ 2.20 GHz and 12 GB random access memory (RAM). The initial input data shape is (328257, 34). Then the input data is reshaped to (328257, 1, 34), where each row of the input data is formed into a one-dimensional vector. The Fit() generator turns the training data into many batches, each of size 64, for training.

V. PERFORMANCE METRICS
The performance of the developed 1D CNN model for multi-class stress classification has been evaluated through discrimination analysis based on the SWELL-KW dataset. The discrimination analysis metrics are Precision (eq. (1)), Recall (eq. (2)), Accuracy (eq. (3)), F1-score (eq. (4)), MCC (eq. (5)), classification report, and confusion matrix [29], [30]. A confusion matrix is a 2-dimensional table (actual versus predicted) whose cells fall into four outcome categories, namely, true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

The cells, or a collection of cells, considered by the ratios for a particular class in multi-class classification are explained as follows [33]. TP is an outcome where the model estimates the positive class accurately; TN is an outcome in which the model correctly predicts the negative class; FP is an outcome where the model estimates the positive class inaccurately; and FN is an outcome in which the model forecasts the negative class incorrectly. Accordingly, the performance metrics for a given class are expressed respectively as follows [29].

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$\text{F1-score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \tag{4}$$

A higher value from the above expressions represents better performance of a model, and this applies to all performance metrics. On the other hand, bias is an error due to erroneous assumptions in the learning algorithm, and variance is an error from sensitivity to small fluctuations in the training set. While high bias leads to under-fitting, high variance results in overfitting. Accuracy and F1-scores can be misleading because they do not fully account for the sizes of the four categories of the confusion matrix in the final score calculation. In comparison, the MCC is more informative than the F1-score and Accuracy because it considers the balanced ratios of the four confusion matrix categories (i.e., TP, TN, FP, and FN). The F1-score depends on which class is defined as a positive class. However, MCC does not depend on which class is the positive class, and it has an advantage over the F1-score as it avoids incorrectly defining the positive class [34]. The MCC is expressed as follows [30].

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{5}$$

VI. CLASSIFICATION RESULTS AND DISCUSSIONS
In this section, we present the experimental results and reveal the importance of ANOVA-based feature selection.

A. FEATURE RANKING AND SELECTION FOR SWELL-KW
In this study, we have considered all 34 features provided by the SWELL-KW dataset. However, some of the features are irrelevant and act as outliers. In this regard, the ANOVA method plays an important role. Initially, it ranks the 34 features based on their F-values. Fig. 6 presents the ranking of the HRV features that are available in the SWELL-KW dataset. Typically, features with higher F-values are more important for final stress level categorization.
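As a rough illustration of ranking by F-values, the sketch below computes the one-way ANOVA F-statistic of each numeric feature against the class labels and sorts the features in descending order. It uses the textbook ratio of between-group to within-group mean squares; the function names, feature names, and toy values are hypothetical and not taken from SWELL-KW, and a library routine may handle edge cases differently.

```python
from collections import defaultdict


def anova_f(values, labels):
    """One-way ANOVA F-statistic of a numeric feature against class labels."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Variation of group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of samples around their own group mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(columns, labels):
    """Return (feature name, F-value) pairs sorted from highest to lowest F."""
    scores = {name: anova_f(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: three classes with three samples each.
labels = ["no stress"] * 3 + ["interruption"] * 3 + ["time pressure"] * 3
columns = {
    "feat_a": [1.0, 1.1, 0.9, 2.0, 2.1, 1.9, 3.0, 3.1, 2.9],  # separates the classes
    "feat_b": [1.0, 2.0, 3.0, 1.1, 2.1, 2.9, 0.9, 1.9, 3.1],  # overlaps across classes
}
ranking = rank_features(columns, labels)
```

In this toy example `feat_a` receives a far larger F-value than `feat_b`, because its class means differ strongly relative to the spread within each class.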


FIGURE 6. Feature ranking of the 34 features using ANOVA.

FIGURE 7. Accuracies with ANOVA-sorted features.

The most relevant and important subset of the rated features is further identified via a forward sequential feature selection method. The forward sequential feature selection forms the optimal subset of features from the 34 features in their ranked order by sequentially selecting the features.

In Fig. 7, we demonstrate the accuracy scores by sequentially selecting the ANOVA-sorted features. It can be observed that accuracy increases with the number of features adopted for model training. More specifically, the developed model achieves an accuracy above 95% with less than half of the ANOVA-sorted features, i.e., less than 17 features. In the following two subsections, we first evaluate the performance of our model in terms of Precision, Recall, F1-score, and MCC when all available features are applied to the classifier and then demonstrate the efficacy of the feature reduction algorithm for stress level detection when the top 15 features are selected.

B. PERFORMANCE WHEN ALL FEATURES ARE APPLIED
The developed CNN model has classified the SWELL-KW dataset into the following three stress categories based on emotional states, i.e., no stress, time pressure, and interruption, and it has obtained an extremely high level of accuracy.

TABLE 2. Performance of the proposed 1D CNN model for three-level classification with all features.
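Per-class scores of the kind reported for a three-class classifier can be derived from a confusion matrix by treating each class in turn as the positive class (one-vs-rest). The sketch below does this with the standard definitions of Precision, Recall, Accuracy, F1-score, and MCC for a single class. The confusion matrix counts are hypothetical and are not results from the paper, and the function names are introduced here for illustration only.

```python
import math


def one_vs_rest_counts(cm, c):
    """Return (TP, FP, FN, TN) for class index c.

    cm[i][j] is the number of samples with actual class i predicted as class j.
    """
    n = len(cm)
    tp = cm[c][c]
    fn = sum(cm[c][j] for j in range(n)) - tp   # actual c, predicted otherwise
    fp = sum(cm[i][c] for i in range(n)) - tp   # predicted c, actually otherwise
    tn = sum(sum(row) for row in cm) - tp - fn - fp
    return tp, fp, fn, tn


def class_metrics(cm, c):
    """Per-class Precision, Recall, Accuracy, F1-score, and MCC."""
    tp, fp, fn, tn = one_vs_rest_counts(cm, c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * recall * precision / (recall + precision)
          if recall + precision else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1, "mcc": mcc}


# Hypothetical counts; rows are actual classes, columns are predicted classes,
# in the order: no stress, time pressure, interruption.
cm = [
    [50, 2, 0],
    [3, 45, 2],
    [0, 5, 43],
]
no_stress = class_metrics(cm, 0)
```

For the "no stress" row of this made-up matrix, the one-vs-rest counts are TP = 50, FP = 3, FN = 2, and TN = 95, giving a Precision of 50/53 and a Recall of 50/52.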


More specifically, Tab. 2 demonstrates the performance of the developed 1D CNN model on stress level classifications. Clearly, we have achieved the highest accuracy score of 0.99 with Precision = 1, Recall = 1, F1-score = 1, and MCC = 0.99 respectively. Overall, the accuracy of the developed 1D CNN model reaches 99.9% for all three classification levels.

Fig. 8 presents the confusion matrix obtained from the developed 1D CNN model based on the SWELL-KW dataset. It is evident from the figure that the proposed classifier correctly predicts the true label with less than 0.01% error for all three classes.

FIGURE 8. Confusion matrix obtained based on stress class classification.

Furthermore, we have verified whether the proposed model is overfitted or not. Fig. 9 illustrates the training versus validation accuracy obtained through our experiments. From this figure, it is clear that the validation accuracy and training accuracy are nearly identical, with the validation loss being slightly higher than the training loss. In other words, the model is not overfitted, and it meets the criteria for a good fit model.

FIGURE 9. Training versus validation accuracy.

C. PERFORMANCE WITH TOP FIFTEEN FEATURES
We further investigate the performance of the model by employing only the top 15 ANOVA-sorted features, and the obtained results are listed in Tab. 4. Through the values shown in the table, we demonstrate that the average scores for Precision, Recall, F1-score, and MCC achieved by the proposed model are still excellent, reaching 96.5%, 94.6%, 97.0% and 92.9%, respectively. Overall, we have achieved an accuracy of 96.5% on average. Furthermore, the performance of the model using a 70/30 train-test split resulted in an accuracy of 0.961, precision of 0.960, recall of 0.956, F1-score of 0.957, and MCC of 0.935.

On the other hand, it is worth reiterating that the performance of our 1D CNN model with all features is extraordinary, outperforming the case with the top 15 features. However, such a benefit comes at the cost of a longer training time, especially when the size of a dataset is massive. In general, there is always a trade-off between performance and resource consumption. Therefore, whether to select all features or not depends on the key performance requirements of a system or service. In our experiments, the model training time with 15 features is 1733 seconds, which is 8 seconds less than the model training time with all features.

D. K-FOLD CROSS-VALIDATION
To validate the obtained results with the top 15 features, a k-fold cross-validation procedure has been performed and the results are compared with the ones obtained from the developed 1D CNN model. K-fold cross-validation divides the dataset into k equal-sized folds, training and evaluating the model k times, with each fold serving as the test set once and the remaining k-1 folds serving as the training set. The evaluation scores are then averaged across the k folds to obtain a more robust estimate of the model's performance.

For our validation, the default value, i.e., 5 splits, is configured. In each split, the model is trained and evaluated on the test data, and performance metrics in terms of Precision, Recall, Accuracy, F1-score, and MCC are calculated. The evaluation results based on these five splits show that the model achieves an average score of Precision = 0.944, Accuracy = 0.945, Recall = 0.933, F1 = 0.908, and MCC = 0.908, obtained based on the same test dataset. As such, it is evident that the developed model is capable of classifying the samples into their respective classes with ultra-high accuracy.

E. HYPERPARAMETER OPTIMIZATION
Initially, the model parameters are selected based on experience (as explained in Sec. IV-B). In what follows, we further investigate the impact of hyperparameter optimization on the performance of the developed model, using the Hyperband Tuning technique.

Using the top 15 features of the SWELL-KW dataset, hyperband [40] tuning is employed to optimize the hyperparameters of our model. The purpose of the tuning process

is to maximize the model's validation accuracy. Through the validation
procedure illustrated in Appendix A, the algorithm finds the best set
of hyperparameters to be filters = 160, kernel size = 5, and dense
units = 48, resulting in a validation accuracy of 0.99.
On the other hand, it is worth noting that, although hyperparameter
tuning can be effective in improving the performance of ML models, it
can be challenging to apply in real-life applications. This is due to
its demand for a significant amount of computational resources, which
may not always be available, especially for large-volume datasets and
complex models. Additionally, the optimal set of hyperparameters may be
specific to the dataset, model, and problem at hand, making it
difficult to develop a generalizable approach to hyperparameter tuning
[41], [42]. Thus, default hyperparameters or a small set of manually
tuned hyperparameters may suffice in many cases, including this study,
to achieve satisfactory performance.

TABLE 4. Performance of the proposed 1D CNN model for three-level classifications with top 15 ANOVA-sorted features.

F. QUANTITATIVE COMPARISON WITH EXISTING STUDIES
Finally, we make a quantitative comparison of our model versus other
related studies that have appeared in the literature. In Tab. 3, the
performance indicators from a few recent studies for automatic
classification of stress levels are compared with our 1D CNN model.

TABLE 3. Quantitative comparison of the results with other state-of-the-art models.

Existing studies that are based on publicly accessible datasets such as
SWELL-KW, WESAD, and AMIGOS concentrated on binary and multi-class
stress detection when assessing the effectiveness of their ML/DL
models. It is worth mentioning that we used the SWELL-KW dataset for
multi-class stress detection. Regarding performance evaluation, prior
studies, e.g., [13] and [24], considered merely the accuracy score as
the key performance metric. Although accuracy is a popular indicator,
it is sufficient only if the false positive and false negative rates
are essentially similar, and the dataset is symmetric.
Furthermore, Tab. 3 reveals that, when all features are considered
during model training, none of the existing ML/DL models reported in
the literature outperform the one developed in this study in terms of
Accuracy, Precision, Recall, F1-score, and MCC for categorizing stress
levels.
When a subset of features is selected for model training, the model
presented in [25] shows higher performance than the proposed model in
this study with top 15 ANOVA-sorted features. The reason is that the
authors in [25] considered all available features in the datasets, and
they did not apply any dimension reduction technique for performance
evaluation of their model.

VII. FURTHER DISCUSSIONS
Execution time of full features versus top-15 features: The execution
time difference between the all-feature-based model and the top-15
feature-based model reported in Subsec. VI-C seems small. There are two
reasons for this result. 1) The SWELL-KW dataset, which serves as the
basis for this study, has a moderate amount of data (410,322 records
and 34 features as mentioned in Subsec. III-C) and 2) our training and
validation procedures are performed on Google Colab, which has powerful
CPUs and graphics processing units (GPUs) as well as a large amount of
RAM. When the volume of a dataset becomes huge, which is typical

for big data processing, and/or the data processing machine is less
powerful, e.g., a personal computer or a server located at a clinic,
the benefit of our model with feature reduction will be more
significant, especially for validation. This is because, after the data
collection phase, model training can still be performed offline on
powerful CPUs/GPUs.
Model Applicability: The model developed in this study is built based
on the SWELL-KW dataset. Nevertheless, we believe that, with proper
parameter tuning or enhancement, the model may be applicable to other
datasets that target similar mental health status analysis. Within the
framework of an ongoing research project acknowledged below, we are
collecting real-life data including HR and RR for mental health
inpatients in a Norwegian hospital based on non-wearable Internet of
things (IoT) devices. We plan to assess the performance of the
developed model based on our own datasets. However, including the
validation results based on these inpatient datasets is beyond the
scope of this paper.

VIII. CONCLUDING REMARKS
In this study, we have developed a novel 1D CNN model for stress level
classification using HRV signals and validated the proposed model based
on a publicly available dataset, SWELL-KW. In our model, we also
applied an ANOVA feature selection technique for dimension reduction.
Through extensive training and validation, we demonstrate that our
model outperforms the state-of-the-art models in terms of major
performance metrics, i.e., Accuracy, Precision, Recall, F1-score, and
MCC when all features are employed. Furthermore, our approach with
ANOVA feature reduction also achieves excellent performance. For future
work, we plan to further investigate the feasibility of optimizing the
model to fit it into edge devices so that real-time stress detection
can become a reality.

REFERENCES
[1] H.-G. Kim, E.-J. Cheon, D.-S. Bai, Y. H. Lee, and B.-H. Koo, "Stress and heart rate variability: A meta-analysis and review of the literature," Psychiatry Invest., vol. 15, no. 3, pp. 235–245, Mar. 2018.
[2] D. Muhajir, F. Mahananto, and N. A. Sani, "Stress level measurements using heart rate variability analysis on Android based application," Proc. Comput. Sci., vol. 197, pp. 189–197, Jan. 2022.
[3] J. Held, A. Vîslă, C. Wolfer, N. Messerli-Bürgy, and C. Flückiger, "Heart rate variability change during a stressful cognitive task in individuals with anxiety and control participants," BMC Psychol., vol. 9, no. 1, p. 44, Mar. 2021.
[4] K. M. Dalmeida and G. L. Masala, "HRV features as viable physiological markers for stress detection using wearable devices," Sensors, vol. 21, no. 8, p. 2873, Apr. 2021.
[5] J. A. Miranda-Correa, M. K. Abadi, N. Sebe, and I. Patras, "AMIGOS: A dataset for affect, personality and mood research on individuals and groups," IEEE Trans. Affect. Comput., vol. 12, no. 2, pp. 479–493, Apr./Jun. 2021.
[6] E. Won and Y.-K. Kim, "Stress, the autonomic nervous system, and the immune-kynurenine pathway in the etiology of depression," Current Neuropharmacol., vol. 14, no. 7, pp. 665–673, Aug. 2016.
[7] B. Olshansky, H. N. Sabbah, P. J. Hauptman, and W. S. Colucci, "Parasympathetic nervous system and heart failure: Pathophysiology and potential implications for therapy," Circulation, vol. 118, no. 8, pp. 863–871, Aug. 2008.
[8] S. Goel, P. Tomar, and G. Kaur, "ECG feature extraction for stress recognition in automobile drivers," Electron. J. Biol., vol. 12, no. 2, pp. 156–165, Mar. 2016.
[9] V. N. Hegde, R. Deekshit, and P. S. Satyanarayana, "A review on ECG signal processing and HRV analysis," J. Med. Imag. Health Informat., vol. 3, no. 2, pp. 270–279, Jun. 2013.
[10] M. Vollmer, "A robust, simple and reliable measure of heart rate variability using relative RR intervals," in Proc. Comput. Cardiol. Conf. (CinC), Sep. 2015, pp. 609–612.
[11] M. H. Kryger, T. Roth, and W. C. Dement, Principles and Practice of Sleep Medicine, 5th ed. Amsterdam, The Netherlands: Elsevier, 2011.
[12] M. Malik, J. T. Bigger, A. J. Camm, R. E. Kleiger, A. Malliani, A. J. Moss, and P. J. Schwartz, "Heart rate variability: Standards of measurement, physiological interpretation, and clinical use," Eur. Heart J., vol. 17, no. 3, pp. 354–381, Mar. 1996.
[13] S. Koldijk, M. Sappelli, S. Verberne, M. A. Neerincx, and W. Kraaij, "The SWELL knowledge work dataset for stress and user modeling research," in Proc. 16th Int. Conf. Multimodal Interact., Nov. 2014, pp. 291–298.
[14] S. Koldijk, M. A. Neerincx, and W. Kraaij, "Detecting work stress in offices by combining unobtrusive sensors," IEEE Trans. Affect. Comput., vol. 9, no. 2, pp. 227–239, Apr. 2018.
[15] M. Albaladejo-González, J. A. Ruipérez-Valiente, and F. G. Mármol, "Evaluating different configurations of machine learning models and their transfer learning capabilities for stress detection using heart rate," J. Ambient Intell. Human. Comput., pp. 1–11, Aug. 2022, doi: 10.1007/s12652-022-04365-z.
[16] R. Walambe, P. Nayak, A. Bhardwaj, and K. Kotecha, "Employing multimodal machine learning for stress detection," J. Healthcare Eng., vol. 2021, Oct. 2021, Art. no. 9356452.
[17] A. Ibaida, A. Abuadbba, and N. Chilamkurti, "Privacy-preserving compression model for efficient IoMT ECG sharing," Comput. Commun., vol. 166, pp. 1–8, Jan. 2021.
[18] W. C. Dobbs, M. V. Fedewa, H. V. MacDonald, C. J. Holmes, Z. S. Cicone, D. J. Plews, and M. R. Esco, "The accuracy of acquiring heart rate variability from portable devices: A systematic review and meta-analysis," Sports Med., vol. 49, no. 3, pp. 417–435, Mar. 2019.
[19] C.-M. Chen, S. Anastasova, K. Zhang, B. G. Rosa, B. P. L. Lo, H. E. Assender, and G.-Z. Yang, "Towards wearable and flexible sensors and circuits integration for stress monitoring," IEEE J. Biomed. Health Informat., vol. 24, no. 8, pp. 2208–2215, Aug. 2020.
[20] R. A. Rahman, K. Omar, S. A. M. Noah, M. S. N. M. Danuri, and M. A. Al-Garadi, "Application of machine learning methods in mental health detection: A systematic review," IEEE Access, vol. 8, pp. 183952–183964, 2020.
[21] S. H. Jambukia, V. K. Dabhi, and H. B. Prajapati, "Application of machine learning methods in mental health detection: A systematic review," in Proc. Int. Conf. Adv. Comput. Eng. Appl., 2015, pp. 714–721.
[22] S. Celin and K. Vasanth, "ECG signal classification using various machine learning techniques," J. Med. Syst., vol. 42, no. 12, p. 241, Oct. 2018.
[23] A. Padha and A. Sahoo, "A parametrized quantum LSTM model for continuous stress monitoring," in Proc. 9th Int. Conf. Comput. Sustain. Global Develop. (INDIACom), Mar. 2022, pp. 261–266.
[24] S. Sriramprakash, V. D. Prasanna, and O. V. R. Murthy, "Stress detection in working people," Proc. Comput. Sci., vol. 115, pp. 359–366, Dec. 2017.
[25] P. Sarkar and A. Etemad, "Self-supervised learning for ECG-based emotion recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 3217–3221.
[26] M. X. Huang, J. Li, G. Ngai, and H. V. Leong, "Stressclick: Sensing stress from gaze-click patterns," in Proc. 24th ACM Int. Conf. Multimedia (MM), Oct. 2016, pp. 1395–1404.
[27] P. Bobade and M. Vani, "Stress detection with machine learning and deep learning using multimodal physiological data," in Proc. 2nd Int. Conf. Inventive Res. Comput. Appl. (ICIRCA), Jul. 2020, pp. 51–57.
[28] B. J. Feir-Walsh and L. E. Toothaker, "An empirical comparison of the ANOVA F-test, normal scores test and Kruskal–Wallis test under violation of assumptions," Educ. Psychol. Meas., vol. 34, no. 4, pp. 789–799, Dec. 1974.
[29] A. Chatterjee, M. W. Gerdes, and S. G. Martinez, "Identification of risk factors associated with obesity and overweight: A machine learning overview," Sensors, vol. 20, no. 9, p. 2734, May 2020.

[30] A. Chatterjee, N. Pahari, A. Prinz, and M. Riegler, "Machine learning and ontology in eCoaching for personalized activity level monitoring and recommendation generation," Sci. Rep., vol. 12, no. 1, pp. 1–26, Nov. 2022.
[31] L. Stahle and S. Wold, "Analysis of variance (ANOVA)," Chemometrics Intell. Lab. Syst., vol. 6, no. 4, pp. 259–272, Nov. 1989.
[32] S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, and D. J. Inman, "1D convolutional neural networks and applications: A survey," Mech. Syst. Signal Process., vol. 151, Apr. 2021, Art. no. 107398.
[33] F. Mattioli, C. Porcaro, and G. Baldassarre, "A 1D CNN for high accuracy classification and transfer learning in motor imagery EEG-based brain-computer interface," J. Neural Eng., vol. 18, no. 6, Jan. 2022, Art. no. 066053.
[34] D. Chicco and G. Jurman, "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation," BMC Genomics, vol. 21, no. 1, p. 6, Jan. 2020.
[35] K. Nkurikiyeyezu, K. Shoji, A. Yokokubo, and G. Lopez, "Thermal comfort and stress recognition in office environment," in Proc. 12th Int. Joint Conf. Biomed. Eng. Syst. Technol., 2019, pp. 256–263.
[36] P. Schmidt, A. Reiss, R. Duerichen, C. Marberger, and K. V. Laerhoven, "Introducing WESAD, a multimodal dataset for wearable stress and affect detection," in Proc. 20th ACM Int. Conf. Multimodal Interact., Oct. 2018, pp. 400–408.
[37] A. Arsalan, M. Majid, A. R. Butt, and S. M. Anwar, "Classification of perceived mental stress using a commercially available EEG headband," IEEE J. Biomed. Health Informat., vol. 23, no. 6, pp. 2257–2264, Nov. 2019.
[38] R. G. Babukarthik, V. A. K. Adiga, G. Sambasivam, D. Chandramohan, and J. Amudhavel, "Prediction of COVID-19 using genetic deep learning convolutional neural network (GDCNN)," IEEE Access, vol. 8, pp. 177647–177666, 2020.
[39] R. G. Babukarthik, D. Chandramohan, D. Tripathi, M. Kumar, and G. Sambasivam, "COVID-19 identification in chest X-ray images using intelligent multi-level classification scenario," Comput. Electr. Eng., vol. 104, Dec. 2022, Art. no. 108405.
[40] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar, "Hyperband: A novel bandit-based approach to hyperparameter optimization," J. Mach. Learn. Res., vol. 18, no. 1, pp. 6765–6816, 2017.
[41] M. Feurer, L. Kotthoff, and J. Vanschoren, "Hyperparameter optimization," in Automated Machine Learning: Methods, Systems, Challenges. Springer, 2019.
[42] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," J. Mach. Learn. Res., vol. 13, pp. 281–305, Feb. 2012.

JON ANDREAS MORTENSEN is currently pursuing the bachelor's degree in
computer science with the Department of Information and Communication
Technology, University of Agder, Norway. His research interests include
data analytics and machine learning.

MARTIN EFREMOV MOLLOV is currently pursuing the bachelor's degree in
computer science with the Department of Information and Communication
Technology, University of Agder, Norway. His research interests include
data analytics and machine learning.

AYAN CHATTERJEE received the B.Eng. degree in computer science and
engineering (CSE) from the West Bengal University of Technology, India,
in 2009, the master's degree in information technology from Jadavpur
University, India, in 2016, and the Ph.D. degree from the University of
Agder, Norway, in 2022. His Ph.D. thesis was on ICT-eHealth. He worked
as an Associate Consultant with Tata Consultancy Services, Ltd., India,
from 2009 to 2019, and was deputed to Denmark and the Netherlands for
3.4 years as a Java Solution Designer and a Data Analyst. He is
currently a Senior Researcher in AI and semantics with the Simula
Research Laboratory (SimulaMet), Oslo, Norway, and an Adjunct Associate
Professor of object-oriented programming with the University of Agder,
Kristiansand, Norway. He has a strong aptitude for object-oriented
programming concepts. His research interests include AI, eHealth,
recommendation technology, semantics, human-centered design, software
engineering, and bioinformatics.

DEBASISH GHOSE (Senior Member, IEEE) received the Ph.D. degree in
information and communication technology from the University of Agder,
Grimstad, Norway, in 2019. He was a System Developer with Confirmit,
Grimstad, Norway, from 2020 to 2021. From 2021 to 2022, he was a
Post-Doctoral Researcher with the University of Agder. He is currently
an Associate Professor with the School of Economics, Innovation, and
Technology, Kristiania University College, Bergen, Norway. His research
interests include protocol design, modeling, and performance evaluation
of the Internet of Things, edge and fog computing, data analytics,
cyber security, and machine learning.

FRANK Y. LI received the Ph.D. degree from the Department of Telematics
(now the Department of Information Security and Communication
Technology), Norwegian University of Science and Technology (NTNU),
Trondheim, Norway, in 2003. He was a Senior Researcher with the
UniK-University Graduate Center (now the Department of Technology
Systems), University of Oslo, Norway, before joining the Department of
Information and Communication Technology, University of Agder, Norway,
in 2007, as an Associate Professor and then a Full Professor. From 2017
to 2018, he was a Visiting Professor with the Department of Electrical
and Computer Engineering, Rice University, Houston, TX, USA. During the
past few years, he has been an active participant in multiple Norwegian
and EU research projects. His research interests include MAC mechanisms
and routing protocols in 5G and beyond mobile systems and wireless
networks, the Internet of Things, mesh and ad-hoc networks, wireless
sensor networks, D2D communications, cooperative communications,
cognitive radio networks, green wireless communications, dependability
and reliability in wireless networks, QoS, resource management, traffic
engineering in wired and wireless IP-based networks, and the analysis,
simulation, and performance evaluation of communication protocols and
networks. He was listed as a Lead Scientist by the European Commission
DG RTD Unit A.03 (Evaluation and Monitoring of Program) in 2007.
