
Journal of Reliable Intelligent Environments (2021) 7:189–213
https://doi.org/10.1007/s40860-021-00147-0

REVIEW

Trends in human activity recognition using smartphones


Anna Ferrari¹ · Daniela Micucci¹ · Marco Mobilio¹ · Paolo Napoletano¹

Received: 15 April 2021 / Accepted: 24 June 2021 / Published online: 3 July 2021
© The Author(s) 2021

Abstract
Recognizing human activities and monitoring population behavior are fundamental needs of our society. Population security, crowd surveillance, healthcare support and living assistance, and lifestyle and behavior tracking are some of the main applications that require the recognition of human activities. Over the past few decades, researchers have investigated techniques that can automatically recognize human activities. This line of research is commonly known as Human Activity Recognition (HAR). HAR involves many tasks, from signal acquisition to activity classification. The tasks involved are not simple and often require dedicated hardware, sophisticated engineering, and computational and statistical techniques for data preprocessing and analysis. Over the years, different techniques have been tested and different solutions have been proposed to achieve a classification process that provides reliable results. This survey presents the most recent solutions proposed for each task in the human activity classification process, that is, acquisition, preprocessing, data segmentation, feature extraction, and classification. Solutions are analyzed by emphasizing their strengths and weaknesses. For completeness, the survey also presents the metrics commonly used to evaluate the goodness of a classifier and the datasets of inertial signals from smartphones that are mostly used in the evaluation phase.

Keywords ADL · Human activity recognition · Machine learning · Deep learning · Smartphone

✉ Corresponding author: Daniela Micucci, [email protected]
¹ Department of Informatics, Systems and Communication, University of Milano - Bicocca, Milan, Italy

1 Introduction

The first work on human activity recognition dates back to the late '90s [1]. During the last 30 years, the Human Activity Recognition (HAR) research community has been very active, proposing several methods and techniques. In recent years, significant research has focused on experimenting with solutions that can recognize Activities of Daily Living (ADLs) from inertial signals. This is mainly due to two factors: the increasingly low cost of hardware and the wide spread of mobile devices equipped with inertial sensors. The use of smartphones to both acquire and process signals opens opportunities in a variety of application contexts such as surveillance, healthcare, and delivery [2–4].

In the context of HAR, most of the classification methods rely on the Activity Recognition Process (ARP) protocol. As depicted in Fig. 1, ARP consists of five steps: acquisition, preprocessing, segmentation, feature extraction, and classification.

The data acquisition step is in charge of acquiring data from sensors. Data generally originate from sensors such as accelerometers, compasses, and gyroscopes. Data acquired from sensors typically include artifacts and noise due to many causes, such as electronic fluctuation, sensor calibration, and malfunctions. Thus, data have to be processed.

The preprocessing step is responsible for removing artifacts and noise. Generally, preprocessing is based on filtering techniques. The output of this step is a set of filtered data that constitutes the input for the next step.

The data segmentation step is responsible for splitting data into segments, also called windows. Data segmentation is a common practice that facilitates the next step.

The feature extraction step aims to extract the most significant portion of information from the data to be given to the classification algorithm while reducing the data dimension.

Classification is the last step of the process. It consists in training and testing the algorithm. That is, the parameters of the classification model are estimated during the training procedure; thereafter, the classification performance of the model is assessed in the testing procedure.

This paper presents a review of the techniques and methods commonly adopted in the steps of the ARP process.
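To fix ideas, the five ARP steps can be written as a small pipeline. The following is a minimal sketch, not the protocol of any specific paper: the function names, the moving-average smoother, and the mean/deviation features are placeholder choices (Python with numpy assumed).

```python
import numpy as np

def preprocess(raw, fs=50.0):
    """Remove high-frequency noise with a simple moving-average smoother."""
    kernel = np.ones(5) / 5.0
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, raw)

def segment(signal, fs=50.0, win_s=2.0, overlap=0.5):
    """Split a (samples x channels) stream into fixed-size, overlapping windows."""
    size = int(win_s * fs)
    step = int(size * (1.0 - overlap))
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

def extract_features(window):
    """Toy hand-crafted features: per-channel mean and standard deviation."""
    return np.concatenate([window.mean(axis=0), window.std(axis=0)])

def recognize(raw, classifier, fs=50.0):
    """Acquisition is assumed done: raw is the recorded inertial stream."""
    windows = segment(preprocess(raw, fs), fs)
    features = np.stack([extract_features(w) for w in windows])
    return classifier.predict(features)  # one activity label per window
```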

Fig. 1 Activity recognition process: Data Acquisition → Preprocessing → Data Segmentation → Feature Extraction → Classification (from raw data to input data for the classifier)

The focus is on techniques and methods that have been experimented with and proposed for smartphones. Therefore, this review does not include other types of devices used in HAR. The choice to consider smartphones only is due to the increasing attention paid to these devices by the scientific community as a result of their valuable equipment and their wide diffusion.

The paper also provides an overview of the most used datasets for evaluating HAR techniques. Since this review is focused on smartphones, the datasets included are those of inertial signals collected using smartphones.

The analysis of the state of the art encompasses scientific articles selected based on the following criteria and keywords:

– first 100 papers found in Google Scholar with keywords: human activity recognition smartphone,
– first 100 papers found in Google Scholar with keywords: human activity recognition smartphone starting from 2015,
– first 100 papers found in Google Scholar with keywords: personalized human activity recognition smartphone,
– first 100 papers found in Google Scholar with keywords: personalized human activity recognition smartphone starting from 2015.

The selection of the papers was completed in March 2020. We initially removed duplicates from the resulting articles. Then, we manually checked the remaining papers by reading the abstract, the introduction, and the conclusion sections to quickly eliminate those articles that are out of the scope of our survey. The articles that we excluded are those that rely on devices other than smartphones, those that use smartphones in conjunction with other devices, those that use sensors different from the inertial ones, and those that deal with complex ADLs such as preparing a meal, taking transportation, and so on.

The paper is organized as follows. Section 2 introduces the problem related to human activity recognition. Section 3 describes the data acquisition step and, thus, the sensors that are mainly exploited in HAR for data acquisition. Section 4 presents the preprocessing activity that is normally performed on the raw data as acquired by the sensors. Sections 5 and 6 describe the commonly used segmentation strategies and features, respectively. Section 7 introduces the most recent classification methods, their strengths, and weaknesses. Moreover, the section discusses personalization and why it is important to improve the overall classification performance. Given the importance of datasets in the evaluation process of techniques and methods, Sect. 8 discusses the characteristics of a set of publicly available datasets often used in the evaluation of classifiers. Section 9 summarizes the lessons learned and provides some guidance on where the research should focus. Finally, Sect. 10 sketches the conclusions.

2 Background

This section provides a quick overview of the recognition process of activities of daily living. The details are then discussed in the respective sections.

The goal of human activity recognition (HAR) is to automatically analyze and understand human actions from signals acquired by multimodal wearable and/or environmental devices, such as accelerometers, gyroscopes, microphones, and cameras [5].

Recently, research has been shifting toward the use of wearable devices. There are several reasons that have led to this shift, including lower costs, since wearables do not require a special installation, their usability outside the home, and a greater willingness to use them, since users perceive them as less intrusive with respect to their privacy.

Among wearable devices, smartphones have recently become the most widely used compared to ad hoc devices. This is mainly due to the fact that the smartphone is now widely used even by the older population and is always 'worn' without being perceived as an additional element of disturbance, because it is integrated into the daily routine.

Figure 2 shows the recognition process. To the left are the sensors that are the source of the data required to recognize activities, whereas to the right are the activities of daily living that are recognized by the 'activity recognition' chain (in the middle).

The potentially recognizable activities vary in complexity: walking, jogging, sitting, and standing are examples of the simplest ones; preparing a meal, shopping, taking a bus, and driving a car are examples of the most complex ones. Depending on the complexity, different techniques and types of signals are employed. We are interested in activities that belong to the category of the simplest ones.

When the wearable device is a smartphone, the most commonly used sensors are the accelerometer, gyroscope, and magnetometer. Therefore, the first step of the Activity Recognition Process (ARP) introduced in Sect. 1 (Data Acquisition) requires being able to interface with the sensors and to acquire the signals at the required frequencies. This step is detailed in Sect. 3.

As the signals are acquired, they undergo an elaboration process whose purpose is to remove the noise caused by the user and the sensors. Generally, high-pass filters, low-pass filters, and average smoothing methods are applied. This corresponds to the second step (Preprocessing) of the ARP, which is detailed in Sect. 4.

The continuous pre-processed data stream is then split into segments whose dimensions and overlaps may vary according to several factors, such as the technique used to classify, the type of activity to be detected, and the type of signals to be processed. This corresponds to the third step (Segmentation) of the ARP process, which is detailed in Sect. 5.

The segments of pre-processed signals are then elaborated to extract significant features. This step (Feature extraction in the ARP process) is crucial for the performance of the final recognition. Two main types of features are commonly used: hand-crafted features (which are divided into time-domain and frequency-domain features) and learned features, which are automatically discovered. Feature extraction is detailed in Sect. 6.

The last step of the ARP process is Classification. For many years, this step was accomplished through the exploitation of traditional machine learning techniques. More recently, due to promising results in the field of video signal processing, deep learning techniques have also been used. Even more recently, due to the problem known as population diversity [6] (which is related to the natural heterogeneity of users in terms of data), researchers have applied recognition techniques based on personalization to obtain better results. Classification is detailed in Sect. 7.

3 Data acquisition

Historically, human activity recognition techniques exploited both environmental devices and ad hoc devices worn by subjects [7]. Commonly used environmental devices include cameras [8–11] and other sensors such as, for example, RFID [12], acoustic sensors [13], and WiFi [14]. The ad hoc devices were worn by people on different parts of their bodies and typically included inertial sensors [7].

Over the past decade, considerable progress in hardware and software technologies has modified the habits of the entire population and of businesses. On one hand, micro-electro-mechanical systems (MEMS) have reduced sensor size, cost, and power needs, while capacity, precision, and accuracy have increased. On the other hand, the Internet of Things (IoT) has enabled the spread of easy and fast connections between devices, objects, and environments. The pervasiveness and the reliability of these new technologies enable the acquisition and storage of a large amount of multimodal data [15].

Thanks to these technological advances, smartphones, smartwatches, home assistants, and drones are used daily and represent essential instruments for many businesses, such as remote healthcare, merchandise delivery, agriculture, and others [16]. These new technologies, together with the large availability of data, gained the attention of the research communities, including HAR.

The goal of this section is to present the most used wearable devices for data acquisition in HAR, which are a consequence of the technological advances discussed above.

Wearable devices encompass all accessories attached to the person's body or clothing incorporating computer technologies, such as smart clothing and ear-worn devices [17]. They enable the capture of attributes of interest such as motion, location, temperature, and ECG, among others.

Nowadays, smartphones and smartwatches are the most used wearable devices among the population. In particular, the smartphone is one of the most used devices in people's daily lives, and it has been stated that it is the first thing people reach for after waking up in the morning [18,19].

The smartphone's pervasiveness over the last years is mostly due to the fact that it provides the opportunity to connect with people, to play games, to read emails, and, in general, to access almost all online services that a user needs. In particular, their high diffusion is a crucial aspect, because the more the users, the more the data availability. The more the data availability, the more the information and the more the possibility to create robust models.

At the same time, smartphones are preferable over other wearables because a large number of sensors and software applications are already installed and permit the acquisition of many kinds of data, potentially all day long.

The choice of the sensors plays an important role in activity recognition performance [20]. Accelerometers, gyroscopes, and magnetometers are the most used sensors for HAR tasks and classification.


Fig. 2 An abstracted overview of the human activity recognition process

– Accelerometer. The accelerometer is an electromechanical sensor that captures the rate of change of the velocity of an object over a time lapse, that is, the acceleration. It is composed of many other sensors, including some microscopic crystal structures that become stressed due to accelerative forces. The accelerometer interprets the voltage coming from the crystals to understand how fast the device is moving and in which direction it is pointing. A smartphone records three-dimensional acceleration along the reference axes of the device. Thus, a trivariate time series is produced. The measuring unit is meters per second squared (m/s²) or g-forces.
– Gyroscope. The gyroscope measures three-axial angular velocity. Its unit of measure is degrees per second (degrees/s).
– Magnetometer. A magnetometer measures the change of a magnetic field at a particular location. The measurement unit is the Tesla (T), and it is usually recorded on the three axes.

In addition to accelerometers, gyroscopes, and magnetometers, other less common sensors are used in HAR. For example, Garcia-Ceja and Brena use a barometer to classify vertical activities, such as ascending and descending stairs [21]. Cheng et al. [22] and Foubert et al. [23] use pressure sensor arrays to detect, respectively, activities and lying and sitting transitions. Other researchers use biometric sensors. For example, Zia et al. use electromyography (EMG) for fine-grained motion detection [24], and Liu et al. use electrocardiography (ECG) in conjunction with an accelerometer to recognize activities [25].

The accelerometer is the most popular sensor in HAR, because it measures the directional movement of a subject's motion status over time [26–31]. Nevertheless, it struggles to resolve lateral orientation or tilt and to find out the location of the user, which are precious pieces of information for activity recognition.

For these reasons, some sensor combinations have been proposed as valid solutions in HAR. In most cases, the accelerometer and the gyroscope are used jointly, both to acquire more information about the device movements and to possibly infer the device position [32–36]. Moreover, Shoaib et al. demonstrated that gyroscope-based classification achieves better results than the accelerometer for specific activities, such as walking downstairs and upstairs [35]. Furthermore, as aforementioned, gyroscope data permit the inference of the device position, which drastically impacts recognition performance [37,38].

Other studies combined the accelerometer and the magnetometer simultaneously [39], acceleration and gyroscope with magnetometer [40,41], accelerometer with microphone and GPS [6], and other combinations [42].

An important factor to consider in the acquisition step is the sampling rate, which influences the number of available samples for the classification step. The sampling rate is defined as the number of data points recorded in a second and is expressed in Hertz. For instance, if the sampling rate is equal to 50 Hz, 50 values per second are recorded. This parameter is normally set during the acquisition phase.

In the literature, different sampling rates have been considered. For instance, in [43], the sampling rate is set at 50 Hz, in [44] at 45 Hz, and from 30 to 32 Hz in [32]. Although the choice is not unanimous in the literature, 50 Hz is a suitable sampling rate that properly permits the modeling of human activities [45].

4 Preprocessing

In a classification pipeline, data preprocessing is a fundamental step to prepare raw data for the subsequent steps.

Raw data coming from sensors often present artifacts due to the instruments, such as electronic fluctuation or sensor calibration, or to the physical activity itself. Data have to be cleaned to exclude these artifacts from the signals.

Moreover, the accelerometer signal combines the linear acceleration due to body motion and that due to gravity. The presence of gravity is a bias that can influence the accuracy of the classifier, and thus it is a common practice to remove the gravity component from the raw signal.


Fig. 3 State-of-the-art sliding window sizes

For all the reasons mentioned above, a filtering procedure is normally executed. Filters are powerful instruments that act on the frequency components of the signal.

The high-frequency component of the accelerometer signal is mostly related to the action performed by the subjects, while the low-frequency component is mainly related to the presence of gravity [46–48].

Usually, a low-pass filter with a cut-off frequency ranging between 0.1 and 0.5 Hz is used to isolate the gravity component. To find the body acceleration component, the low-pass filtered signal is subtracted from the original signal [49–51].

Filtering is also used to clear raw data from artifacts. It is stated that a cut-off frequency of 15 Hz is enough to capture human body motion, whose energy spectrum lies between 0 Hz and 15 Hz [49,52].
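As an illustration of this preprocessing, the following minimal sketch separates gravity from body acceleration with a Butterworth low-pass filter; the 0.3 Hz and 15 Hz cut-offs are example values from the ranges cited above, and scipy is assumed to be available.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def split_gravity(acc, fs=50.0, gravity_cutoff=0.3, noise_cutoff=15.0):
    """Return (body, gravity) components of a (samples x 3) accelerometer signal."""
    # 1) isolate gravity with a very low cut-off low-pass filter
    b, a = butter(3, gravity_cutoff, btype="low", fs=fs)
    gravity = filtfilt(b, a, acc, axis=0)
    # 2) body motion = raw signal minus the gravity estimate
    body = acc - gravity
    # 3) drop high-frequency artifacts: human motion lies below ~15 Hz
    b, a = butter(3, noise_cutoff, btype="low", fs=fs)
    body = filtfilt(b, a, body, axis=0)
    return body, gravity
```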
5 Data segmentation

Data segmentation partitions signals into smaller data segments, also called windows.

Data segmentation helps in overcoming some limitations due to acquisition and preprocessing aspects. First, data sampling: data recorded from different subjects may have different lengths in time, which is generally a limit for the classification process. Second, time consumption: multidimensional data can lead to a very high computational time consumption, and splitting data into smaller segments helps the algorithm to deal with high volumes of data. Third, it makes the feature extraction procedure simpler and less time consuming.

Window characteristics are influenced by: (a) the type of windowing, (b) the size of the window, and (c) the overlap among contiguous windows.

5.1 Window type

Three main types of windowing are used in HAR: activity-defined windows, event-defined windows, and sliding windows [53].

In activity-defined windowing, the initial and end points of each window are selected by detecting patterns of the activity changes.

In event-defined windowing, the window is created around a detected event. In some studies, it is also referred to as windowing around peaks [43].

In sliding windowing, data are split into windows of fixed size, without gaps between two consecutive windows, and, in some cases, overlapped, as shown in Fig. 4. Sliding windowing is the most widely employed segmentation technique in activity recognition, especially for periodic and static activities [54].

5.2 Window size

The choice of the window size influences the accuracy of the classification [55]. However, this choice is not trivial. Windows should be large enough to guarantee that they contain at least one cycle of an activity and to differentiate similar movements. At the same time, incrementing the window size does not necessarily improve the performance. Shoaib et al. show that 2 s is enough for recognizing basic physical activities [35].

Figure 3 shows the distribution of window sizes among the state-of-the-art studies we considered. It is possible to notice that the most frequently used window size does not exceed 3 s.

The impact of window sizes on the classification performance still remains a challenging task for the HAR community and continues to be largely studied in the literature [35,54,56].

5.3 Window overlap

Another parameter to consider is the percentage of overlap among consecutive windows. Sliding windows are often overlapped, which means that a percentage of a window is repeated in the subsequent window. This leads to two main advantages: it avoids noise due to the truncation of data during the windowing process, and it increases the performance by increasing the number of data points.

Generally, the higher the number of data points, the higher the classification performance. For these reasons, overlapped sliding windows are the most common choice in the literature.

Figure 5 shows the distribution of the percentage of overlap in the state of the art. In more than 50% of the proposals we selected, an overlap of 50% has been chosen. Some approaches avoid any overlap [29,32,44,57], claiming faster responses in real-time environments and better performance in the detection of short-duration movements.
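The sliding-window segmentation described above can be written compactly. A minimal sketch follows (the 2 s window and 50% overlap are the common defaults reported above; the function name is ours):

```python
import numpy as np

def sliding_windows(signal, fs=50.0, win_s=2.0, overlap=0.5):
    """Split a (samples x channels) signal into fixed-size windows.

    overlap is the fraction of each window repeated in the next one;
    overlap=0.0 yields non-overlapping, back-to-back windows.
    """
    size = int(win_s * fs)                       # samples per window
    step = max(1, int(size * (1.0 - overlap)))   # hop between window starts
    starts = range(0, len(signal) - size + 1, step)
    return np.stack([signal[s:s + size] for s in starts])

# 60 s of tri-axial data at 50 Hz -> 59 windows of shape (100, 3) with 50% overlap
windows = sliding_windows(np.random.randn(3000, 3))
```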


Fig. 4 Sliding windows with and without overlap

Fig. 5 Distribution of the percentage of overlap in the state of the art

At the end of the segmentation step, data are organized in vectors $v_i$ as follows:

$v_i = (\underbrace{x_1, x_2, \dots, x_n}_{x\text{-dimension}}, \underbrace{y_1, y_2, \dots, y_n}_{y\text{-dimension}}, \underbrace{z_1, z_2, \dots, z_n}_{z\text{-dimension}}),$

where x, y, and z are the three-axis acceleration values, and $v_i$ is a $1 \times (n \times k)$ vector that represents the i-th window. k refers to the dimensionality of the sensor; for instance, 3-axial acceleration has k = 3. The number n is the total length of the window, and it depends on two factors: the size of the window, normally in seconds, and the sampling rate.

6 Feature extraction

Theoretical analysis and experimental studies indicate that many algorithms scale poorly in domains with a large number of irrelevant and/or redundant data [58].

The literature shows that using a set of features instead of raw data improves the classification accuracy [59]. Furthermore, feature extraction reduces the data dimensionality while extracting the most important peculiarities of the signal by abstracting each data segment into a high-level representation of the same segment.

From a mathematical point of view, feature extraction is defined as a process that extracts a set of new features from the original data segment through some functional mapping [60]. For instance, let $x = \{x_1, x_2, \dots, x_n\} \in \mathbb{R}^n$ be a segment of data; an extracted feature $f_i$ is given by

$f_i = g_i(x_1, x_2, \dots, x_n) \quad \text{for } i = 1, \dots, m,$

where $g_i : \mathbb{R}^n \to \mathbb{R}$ is a map. The feature space is of dimension $m \le n$, which means that feature extraction reduces the dimension of the raw data space.

In the classification context, the choice of $g_i$ is crucial. In fact, in the recognition process, g has to be chosen such that the original data are mapped into separated regions of the feature space. In other words, the researcher assumes that in the feature space the data can be discriminated better than in the original space.

The accuracy of activity recognition approaches dramatically depends on the choice of the features [55].

In the literature, the features $g_i$ are divided into two main categories: hand-crafted features and learned features.

6.1 Hand-crafted features

Hand-crafted features are the most used features in HAR [61–63]. The term "hand-crafted" refers to the fact that the features are selected by an expert using heuristics.

Hand-crafted features are themselves generally split into time-domain and frequency-domain features. The signal domain is changed from time to frequency through the Fourier transform.

Table 1 shows some of the most used time-domain and frequency-domain features.

Low computational complexity and calculation simplicity make hand-crafted features still a good practice for activity recognition.

Nevertheless, they present many disadvantages, such as a high dependency on the sensor choice and the reliance on expert knowledge. Hence, a different set of features needs to be defined for each different type of input data, that is, accelerometer, gyroscope, time domain, and frequency domain. In addition, hand-crafted features highly depend on experts' prior knowledge and manual data investigation, and it is still not always clear which features are likely to work best.

It is a common practice to choose the features through empirical evaluation of different combinations of features or with the aid of feature selection algorithms [64].
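As an example, a few of the time-domain features of Table 1 below, computed per axis on a window. A minimal sketch assuming numpy/scipy; the function name is ours:

```python
import numpy as np
from scipy import stats

def time_domain_features(window):
    """Compute a few Table-1 features per axis of a (samples x 3) window."""
    feats = []
    for axis in window.T:                        # iterate over x, y, z
        q75, q25 = np.percentile(axis, [75, 25])
        feats += [
            axis.min(), axis.max(),              # minimum, maximum
            axis.mean(), np.median(axis),        # mean, median
            axis.std(ddof=1),                    # standard deviation (n-1 denominator)
            q75 - q25,                           # interquartile difference
            stats.skew(axis),                    # skewness
            stats.kurtosis(axis, fisher=False),  # kurtosis
            np.sqrt(np.mean(axis ** 2)),         # root mean square
        ]
    return np.array(feats)
```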


Table 1 Hand-crafted features in the time domain and frequency domain. $FFT(f)$ is the Fourier transform of the signal f. Unless otherwise noted, each feature is computed on a given segment in each dimension.

Time-domain features:
– Minimum: $\min_{j=1,\dots,n}(x_j)$
– Maximum: $\max_{j=1,\dots,n}(x_j)$
– Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
– Median: $Me = x_{0.5} : F(x_{0.5}) \le 0.5$
– Standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
– Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
– Interquartile difference: $ID = x_{0.75} - x_{0.25}$ (difference between the third and first quartiles)
– Skewness: $skw = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3 / s^3$
– Kurtosis: $kurt = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4 / s^4$
– Root mean square: $rms = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$
– Total sum: $ts = \sum_{i=1}^{n} x_i$
– Range: $R = \max - \min$
– Mean of peak distances: $mp = \frac{1}{s^2}\sum_{j=1}^{s}\sum_{i=1}^{s} d(p_i, p_j)$ (mean of the distances between peaks; here s denotes the number of peaks)
– Fourth central moment: $m_4 = \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^4$
– Fifth central moment: $m_5 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^5$

Frequency-domain features:
– Entropy: $H(x) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$ (normalized information entropy of the discrete FFT components)
– Sum of the spectral power components: $\sum_j |FFT_j|^2$ (total spectral power of the segment)
– Mean of the spectral components: $\mu_f = \frac{1}{n}\sum_{j=1}^{n} FFT_j$ (mean of the FFT distribution)
– Median of the spectral components: $Me_f = FFT_{0.5} : F(FFT_{0.5}) = 0.5$ (median of the FFT distribution)
– First cepstral coefficient: $c(1) = F^{-1}\{\log|FFT(f)|\}$ (first coefficient of the cepstrum transformation)
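And, for the frequency domain, a sketch of the spectral entropy and a few other Table-1 quantities, computed from the magnitudes of the discrete Fourier transform (numpy assumed; the function name and the epsilon guard are ours):

```python
import numpy as np

def frequency_domain_features(axis):
    """Spectral features of one axis of a window (Table 1, frequency domain)."""
    spectrum = np.abs(np.fft.rfft(axis))        # magnitudes of the FFT components
    power = spectrum ** 2
    p = power / power.sum()                     # normalize to a probability distribution
    entropy = -np.sum(p * np.log2(p + 1e-12))   # information entropy of the components
    return np.array([entropy, power.sum(), spectrum.mean(), np.median(spectrum)])

feats = frequency_domain_features(np.random.randn(100))
```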


6.2 Learned features

The goal of feature learning is to automatically discover meaningful representations of the raw data to be analyzed [65].

According to [66], the main feature learning methods for sensor data are the following:

– Codebooks [67,68] consider each sensor data window as a sequence, from which subsequences are extracted and grouped into clusters. Each cluster center is a codeword. Then, each sequence is encoded using a bag-of-words approach with the codewords as features.
– Principal Component Analysis (PCA) [69] is a multivariate technique, commonly used for dimensionality reduction. The main goal of PCA is the extraction of a set of orthogonal features, called principal components, which are linear combinations of the original data and such that the variance extracted from the data is maximal. It is also used for feature selection.
– Deep Learning uses Neural Network engines to learn patterns from data. Neural Networks are composed of a set of layers. In each layer, the input data are transformed through combinations of filters and topological maps. The output of each layer becomes the input of the following layer, and so on. At the end of this procedure, the result is a set of features that are more or less abstract depending on the number of layers: the higher the number of layers, the more abstract the features. These features can be used for classification. Different deep learning methods for feature extraction have been used for time series analysis [70].

Feature learning techniques avoid the issue of manually defining and selecting the features. Recently, promising results are leading the research community to exploit learned features in their analyses.

7 Classification

Over the last years, hardware and software development has increased the capability of wearable devices to deal with complex applications and tasks. For instance, smartphones are nowadays able to acquire, store, share, and elaborate huge amounts of data in a very short time. As a consequence of this technological development, new instruments related to data availability, data processing, and data analysis have emerged.

The capability of a simple smartphone to meet some complex tasks (e.g., step counting and lifestyle monitoring) is the result of very recent scientific changes regarding methods and techniques.

In general, more traditional data analysis methods, based on model-driven paradigms, have been largely substituted by more flexible techniques, developed during recent years, based on data-driven paradigms. The main difference between these two approaches is given by the a priori assumption about the relationship between the independent and response variables. Thus, given a classification model y = f(x), model-driven approaches state that f is (or can be) determined by assumptions on the distribution of the underlying stochastic process that generates x; f is built through a set of rules, or algorithms, whose choices depend on data with an unknown distribution. On the opposite side, in data-driven paradigms, f is unknown and depends directly on the data and on the choice of the algorithm.

The strength and the success of data-driven approaches are due to their capability to manage and analyze the large number of variables that characterize a phenomenon without assuming any a priori relation between the independent and response variables. From a certain point of view, this flexibility can be a weakness, because the lack of a well-known relation can also be interpreted as a lack of cause–effect knowledge.

In model-driven approaches, in contrast, the cause–effect relation is known by definition. However, model-driven approaches lose performance in estimating high-dimensionality relations.

In the activity recognition context, model-driven approaches are less powerful, and data-driven approaches are preferred [71].

Among data-driven algorithms, Artificial Intelligence (AI) has produced very promising results over the last years and has been largely used for data analysis, information extraction, and classification tasks. AI algorithms encompass machine learning which, in turn, encompasses deep learning methods.

Machine learning uses statistical exploration techniques to enable the machine to learn and improve with experience without being explicitly programmed. Deep learning emulates the human neural system to analyze and extract features from data. In this survey, we focus on machine learning and deep learning algorithms.

The choice of the classification algorithm drastically influences the classification performance, but up to now, there is no evidence of a best classifier, and its choice still remains a challenging task for the HAR community.

In particular, machine learning and deep learning methods struggle to achieve good performance for new, unseen users. This loss of performance is mostly caused by subject variability, also called population diversity [6], which is related to the natural heterogeneity of users in terms of data. The following sections present both traditional state-of-the-art machine learning and deep learning techniques, and personalized machine learning and deep learning techniques as solutions to overcome the population diversity problem.


7.1 Traditional learning methods

Artificial Intelligence (AI) algorithms are based on the emulation of the human learning process. According to [72], the word learning refers to a process of acquiring knowledge or skill about a thing. A thing can always be viewed as a system, and the general architecture of the knowledge of the thing follows the FCBPSS architecture, in which F is a function that refers to the role a particular structure plays in a particular context; C is a context that refers to the environment as well as the pre-condition and post-condition surrounding a structure; B is a behavior that refers to causal relationships among states of a structure; P is a principle that refers to the knowledge that governs a behavior of a structure; S is a state that describes the property or character of a structure; and S is a structure that represents elements or components of the system or thing along with their connections [73].

Machine learning and deep learning both refer to the word learning and, indeed, they are implemented so that they emulate the human capability of learning.

Machine learning techniques used in HAR are mostly divided into supervised and unsupervised approaches. Supervised machine learning encompasses all techniques that rely on labeled data. Unsupervised machine learning techniques are based on data devoid of labels.

Let x and y be, respectively, a set of data and their corresponding labels. A classification task is a procedure whose goal is to predict the value of the label ŷ from the data input x. In other terms, assuming that there exists a linear or non-linear relation f between x and y, the goal of the classification is to find f such that the prediction error, that is, the distance between y and ŷ, is minimal.

In supervised machine learning, data and corresponding labels are known, and the algorithm learns f by iterating a procedure until the global minimum of a loss function is reached. The loss function is again a measure of the prediction error, estimated by the difference between y and ŷ. The optimization procedure, that is, finding the global minimum of the loss, is computed on the training dataset, which is a subset of the whole dataset. Once the minimum is achieved, the model is ready to be tested on the test dataset. The algorithm performance measures the model's capability to classify new instances (Sect. 7.3 discusses details about the performance measures).

In unsupervised approaches, the labels y are unknown and the evaluation of the algorithm's goodness is based on statistical indices, such as the variance or the entropy.

Consequently, the choice between supervised and unsupervised methods determines how the relation f between x and y is learnt. Since a human activity recognition system should return a label such as walking, sitting, running, and so on, most HAR systems work in a supervised fashion. Indeed, it might be very hard to discriminate activities in a completely unsupervised context [7].

Figure 6 shows the distribution of traditional machine learning and deep learning algorithms used for human activity recognition in the papers we selected for this survey. In the following paragraphs, we describe the most used techniques in HAR with the related literature.

7.1.1 Traditional machine learning

Machine learning techniques have been largely used for activity recognition tasks. More and more sophisticated methods have been developed to deal with the complexity related to activity recognition tasks. In this section, we describe the traditional machine learning algorithms that have been mostly exploited for HAR, according to Fig. 6.

Support Vector Machine (SVM) belongs to domain transform algorithms. It implements the following idea: it is assumed that the input data x are not linearly separable with respect to the classes y in the data space, but there exists a higher dimensional space where linearity is achieved. Once data are mapped into this space, a linear decision surface (or hyperplane) is constructed and used as the recognition model. Thus, guided by the data, the algorithm searches for the optimal decision surface by minimizing the error function. The projection of the optimal decision surface into the original space marks the areas belonging to a specific class, which is used for the classification [74]. The transformation of the original space into a higher dimensional space is made through a kernel, which is defined as a linear or non-linear combination of the data, for example, the polynomial kernel, the sigmoid kernel, and the radial basis function (RBF) kernel; see Table 2.

Table 2 Kernels in support vector machines: linear, $x_i^T x_j$; polynomial, $(x_i^T x_j + c)^d$; RBF, $\exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$

Originally, SVMs were implemented as two-class classifiers. The computation of the multi-class SVM is based on two strategies: one-versus-all, where one class is labeled as 0 and the other classes as 1, and one-versus-one, where the classification is made between two classes at a time [75].

Among HAR classifiers, SVM is the most popular one [32,34,48,75–79].

k-Nearest Neighbors (k-NN) is a particular case of instance-based method. The nearest-neighbor algorithm compares each new instance with existing ones using a distance metric (see Table 3), and the closest existing instance is used to assign the class to the new one. This is the simplest case, where k = 1. If k > 1, the majority class of the closest k neighbors is assigned to the new instance [80].
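As a concrete illustration of this supervised setup, a minimal sketch that trains the two classifiers above on hand-crafted feature vectors with scikit-learn (assumed available); the data here are random placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# placeholder dataset: 500 windows x 27 features, 4 activity labels
X = np.random.randn(500, 27)
y = np.random.randint(0, 4, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# RBF-kernel SVM; scikit-learn handles multi-class via one-versus-one
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# k-NN with Euclidean distance and k = 5
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_train, y_train)

print("SVM accuracy:", svm.score(X_test, y_test))
print("k-NN accuracy:", knn.score(X_test, y_test))
```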


Fig. 6 Traditional machine learning and deep learning classifiers distribution

Table 3 Distance metrics in k-nearest neighbors (C is the covariance matrix):
– Euclidean: $\sqrt{\sum_{i=1}^{n}(x_i - x_j)^2}$
– City block: $\sum_{i=1}^{n} |x_i - x_j|$
– Chebychev: $\max_{i=1,\dots,n} |x_i - x_j|$
– Cosine: $1 - \frac{x_i x_j^T}{\sqrt{(x_i x_i^T)(x_j x_j^T)}}$
– Correlation: $1 - \frac{(x_i - \bar{x}_i)(x_j - \bar{x}_j)^T}{\sqrt{(x_i - \bar{x}_i)(x_i - \bar{x}_i)^T}\,\sqrt{(x_j - \bar{x}_j)(x_j - \bar{x}_j)^T}}$
– Mahalanobis: $\sqrt{(x_i - x_j)\, C^{-1} (x_i - x_j)^T}$

k-NN is a very simple algorithm and belongs to the lazy algorithms. Lazy algorithms have no parameters to learn from the training phase [32,75–77,81]; k-NN depends only on the number k of nearest neighbors.

Decision tree algorithms build a hierarchical model in which input data are mapped from the root to the leaves through branches. The path between the root and a leaf is a classification rule [7]. Sometimes, the length of the trees has to be modified, and growing and pruning algorithms are used. The construction of a tree involves determining the split criterion, the stopping criterion, and the class assignment rule [82]. J48 and C4.5 are the most used decision trees in HAR [29,30,77,81].

Random Forest (RF) is a classifier consisting of a collection of tree-structured classifiers $\{h(x, \Theta_k),\ k = 1, \dots\}$, where the $\{\Theta_k\}$ are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x [83]. Random Forest generally achieves high performance with high-dimensional data by increasing the number of trees [29,40,48,56,75,84,85].

Naive Bayes (NB) belongs to the Bayesian methods, whose prediction of new instances is based on the estimation of the posterior probability as a product of the likelihood, which is a conditional probability estimated on the training set given the class, and a prior probability. In Naive Bayes, the data are assumed independent given the class values. Thus, given a certain class y and the data $x_1, \dots, x_n$, the Naive Bayes classifier, based on the Bayesian rules, splits the likelihood into the product of the conditional probabilities given the class:

$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)} = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}.$


The decision rule is the maximum a posteriori (MAP), given by

$\arg\max_y P(y \mid x_1, \dots, x_n) = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y).$

Naive Bayes has been applied in activity recognition despite the simplistic assumption on the likelihood, which is usually violated in practice [29,77,81,86].

Adaboost is part of the classifier ensembles. Classifier ensembles encompass all algorithms that combine different classifiers together. The combination of classifiers is meant in two ways: either using the same classifiers with different parameter settings (e.g., random forests with different lengths), or using different classifiers together (e.g., random forest, support vector machines, and k-NN).

Ensemble classifiers encompass bagging, stacking, blending, and boosting. In bagging, n samplings are generated from the training set and a model is created on each; the final output is a combination of each model's prediction, normally either the average or a quantile. In stacking, the whole training dataset is given to the multiple classifiers, which are trained using k-fold cross-validation; after training, they are combined for the final prediction. In blending, the same procedure as stacking is performed, but instead of cross-validation, the dataset is divided into training and validation sets. Finally, in boosting, the final classifier is composed of a weighted combination of models; the weights are initially equal for each model and are iteratively updated based on the models' performance, as in Adaboost [6,7,87].

7.1.2 Traditional deep learning

Generally, the relation between input data and labels is very complex and mostly non-linear. Among Artificial Intelligence algorithms, Artificial Neural Networks (ANNs) are a set of supervised machine learning techniques that emulate the human neural system with the aim of extracting non-linear relations from data for classification.

The human neural system is composed of neurons (about 86 billion), which are connected by synapses (around 10¹⁴). Neurons receive input signals from the outside (e.g., visual or olfactory) and, based on the synaptic strength, they fire and produce output signals to be transmitted to other neurons. Artificial Neural Networks are based on the same concept of neurons and synapses.

In a traditional ANN, each data input value is associated with a neuron, and its synaptic strength is measured by a functional combination of the input data x and randomly initialized weights w. This value is passed to an activation function σ, which is responsible for determining the synapse strength and eventually firing the neuron. The output of the activation function is given by $y = \sigma(w^T x)$. If the neuron fires, the output becomes the next neuron's input. Table 4 provides more details about activation functions.

A set of neurons is called a layer. A set of layers and synapses is called a network. The input data x are passed from the first layer to the last layer, called, respectively, the input layer and the output layer, through intermediary layers, called hidden layers. The term Deep Learning comes from the network's depth, that is, when the number of hidden layers grows.

Neurons belonging to the same layer do not communicate with each other, while neurons belonging to different layers are connected and share the information passed through the activation function. If each neuron of the previous layer is connected to all neurons of the next layer, the former is called a fully connected or dense layer. The output layer, also called the classification layer in the case of a classification task or the regression layer in the case of continuous estimation, is responsible for estimating the predicted value ŷ of the labels y. Once the last output is computed, the feed-forward procedure is completed.

Thereafter, an iterative procedure is computed to minimize the loss function. This procedure is called back propagation and is responsible for minimizing the loss function with respect to the weights $w_i$. The weight values, indeed, represent how strong the relation between neurons belonging to different layers is and how far the input information has to be transferred through the network. The minimization procedure is based on gradient descent algorithms, which iteratively search for weights that reduce the value of the gradient of the loss until it meets the global minimum or a stopping criterion. In general, a greedy-wise tuning procedure over the hyper-parameters is performed with the aim of achieving the best network configuration. The most important hyper-parameters are: the number of layers, the kernel number and size, the pooling size, and the regularization parameters, such as the learning rate.

According to Fig. 6, the most used deep learning algorithms are described in the following.

Multi-Layer Perceptron (MLP) is the most widely used Artificial Neural Network (ANN). It is a collection of neurons organized in a layered structure, connected in an acyclic graph. Each neuron that belongs to a layer produces an output which becomes the input of the neurons of the next adjacent layer. The most common layer type is the fully connected layer, where all neurons share their output with each neuron of the adjacent layer, while neurons of the same layer are not connected. MLP is made up of the input layer, one (or more) hidden layers, and the output layer [88]. Used in HAR as a baseline for deep learning techniques, it has often been compared with machine learning techniques, such as SVM [48,89], RF [48], k-NN [89], and DT [89], and with deep learning techniques, such as LSTM [90] and CNN [89,90].

123
200 Journal of Reliable Intelligent Environments (2021) 7:189–213

Convolutional Neural Networks (ConvNet or CNN) are a class of ANN based on convolution products between kernels and small patches of the input data of the layer. The input data are organized in channels if needed (e.g., in tri-axial accelerometer data, each axis is represented by one channel), and normally the convolution is performed independently on each channel. The convolutional function is computed by sliding a convolutional kernel of size m × m over the input of the layer. That is, the calculation of the l-th convolutional layer is given by

$x_i^{l,j} = f\left(\sum_{a=1}^{m} w_a^j \cdot x_{i+a-1}^{l-1,j} + b_j\right),$

where m is the kernel size, $x_i^{l,j}$ is the output of the j-th kernel on the i-th unit of the convolutional layer l, $w_a^j$ is the convolutional kernel matrix, and $b_j$ is the bias of the convolutional kernel. This value is mapped through the activation function σ. Thereafter, a pooling layer is responsible for computing the maximum or average value on a patch of size r × r of the resulting activation output. Mathematically, a local output after the max pooling or the average pooling process is given by

max pooling: $x_i^{l,j} = \max_{a,b=1,\dots,r}(x_{a,b})$
average pooling: $x_i^{l,j} = \frac{1}{r^2}\sum_{a,b=1}^{r} x_{a,b}.$

The pooling layer is responsible for extracting important features and reducing the data size. This convolution-activation-pooling block can be repeated many times if necessary; the number of repetitions determines the depth of the network.

Generally, between the last block and the output layer, one (or more) fully connected layer is added to perform a fusion of the information extracted from all sensor channels [88]. After the feed-forward procedure has ended, back propagation is performed on the convolutional weights until convergence to the global minimum or until a stopping criterion is met. Figure 7 depicts a CNN example in HAR, with six channels, corresponding to xyz-acceleration and xyz-angular velocity data, two convolution-activation-max pooling layers, one fully connected layer, and a soft-max layer which computes the class probability given the input data.

CNN is a robust model under many aspects: in terms of local dependency due to the signal correlation, in terms of scale invariance for different paces or frequencies, and in terms of sensor position [31,71]. For these reasons, CNNs have been largely studied in HAR [91].

Additionally, CNNs have been compared to other techniques. CNN outperforms SVM in [78] and a baseline Random Forest in [27]. Roano et al. demonstrate that CNN outperforms state-of-the-art techniques, which all use hand-crafted features [92]. More recently, an ensemble classification algorithm with CNN-2 and CNN-7 showed better performance when compared with machine learning random forest, boosting, and a traditional CNN-7 [40].

Residual Neural Networks (ResNet) are a particular convolutional neural network composed of blocks and skip connections, which permit increasing the number of layers in the network. The success of deep neural networks has been credited to the additional layers, but He et al. empirically showed that there exists a maximum threshold for the network's depth beyond which vanishing/exploding gradient issues cannot be avoided [93].

In Residual Neural Networks, the output $x_{t-1}$ is both passed as an input to the next convolution-activation-pooling block and directly added to the output of the block $f(x_{t-1})$. The latter addition is called a shortcut connection. The resulting output is $x_t = f(x_{t-1}) + x_{t-1}$. This procedure is repeated many times and permits deepening the network without adding either extra parameters or computational complexity. Figure 8 shows an example of ResNet. Bianco et al. state that ResNet represents the most performing network in the state of the art [94], while Ferrari et al. demonstrated that ResNet outperforms traditional machine learning techniques [59,95].

Long Short-Term Memory (LSTM) networks are a variant of the Recurrent Neural Network which enables the storing of information over time in an internal memory, overcoming the gradient vanishing issue. Given a sequence of inputs $x = \{x_1, x_2, \dots, x_n\}$, the LSTM's external inputs are its previous cell state $c_{t-1}$, the previous hidden state $h_{t-1}$, and the current input vector $x_t$. The LSTM associates each time step with an input gate, a forget gate, and an output gate, denoted, respectively, as $i_t$, $f_t$, and $o_t$, all of which are computed by applying an activation function to a linear combination of weights, input $x_t$, and hidden state $h_{t-1}$. An intermediate state $\tilde{c}_t$ is also computed through the tanh of a linear combination of weights, input $x_t$, and hidden state $h_{t-1}$. Finally, the cell and hidden states are updated as

$c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t$
$h_t = o_t \cdot \tanh(c_t).$

The forget gate decides how much of the previous information is going to be forgotten. The input gate decides how to update the state vector using the information from the current input. Finally, the output gate decides what information to output at the current time step [30]. Figure 9 represents the network schema.
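To make the convolution and pooling equations above concrete, a minimal numpy sketch of one convolution-activation-max-pooling block on a single channel; the kernel size, pooling size, and the ReLU choice are illustrative:

```python
import numpy as np

def conv1d(x, w, b):
    """x_i = sum_{a=1..m} w_a * x_{i+a-1} + b, for one channel and one kernel."""
    m = len(w)
    return np.array([np.dot(w, x[i:i + m]) + b for i in range(len(x) - m + 1)])

def max_pool1d(x, r):
    """Keep the maximum of each non-overlapping patch of size r."""
    n = len(x) - len(x) % r
    return x[:n].reshape(-1, r).max(axis=1)

rng = np.random.default_rng(1)
signal = rng.normal(size=100)       # one 2 s window of one accelerometer axis at 50 Hz
kernel, bias = rng.normal(size=5), 0.0

feature_map = np.maximum(0.0, conv1d(signal, kernel, bias))  # convolution + ReLU
pooled = max_pool1d(feature_map, r=2)                        # halves the resolution
```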


Fig. 7 Convolutional neural network schema

Table 4 Activation functions:
– Step: 0 if x < 0; 1 if x ≥ 0
– Sigmoid: $\frac{1}{1 + e^{-x}}$
– Tanh: $\tanh(x)$
– ReLU: $\max(0, x)$

Table 5 Loss functions for neural networks (M = number of classes, x = input data, y = class, $p_{x,y}$ = probability of being y given x):
– Cross-entropy: $-\sum_{y=1}^{M} y \cdot \log(p_{x,y})$
– Hinge: $\max(0, 1 - \hat{y} \cdot y)$
– Euclidean: $\sum_{y=1}^{M} (\hat{y} - y)^2$
– Absolute value: $\sum_{y=1}^{M} |\hat{y} - y|$
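A minimal sketch of the Table 4 activations and the Table 5 cross-entropy loss in numpy, useful for checking the formulas numerically (a one-hot encoding of y is assumed):

```python
import numpy as np

step = lambda x: np.where(x < 0, 0.0, 1.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh = np.tanh
relu = lambda x: np.maximum(0.0, x)

def cross_entropy(p, y_onehot):
    """-sum_y y * log(p_y) for one sample; p are predicted class probabilities."""
    return -np.sum(y_onehot * np.log(p + 1e-12))  # epsilon guards log(0)

p = np.array([0.7, 0.1, 0.1, 0.1])   # predicted distribution over 4 classes
y = np.array([1.0, 0.0, 0.0, 0.0])   # true class in one-hot form
print(cross_entropy(p, y))           # ~0.357
```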

Although LSTM is a very powerful technique when temporal dependencies in the data have to be considered during classification, it takes into account only past information. Bidirectional LSTM (BLSTM) offers the possibility of considering both past and future information. Hammerla et al. illustrate how their results based on LSTM and BLSTM, verified on a large benchmark dataset, are the state of the art [96].

7.1.3 Traditional machine learning vs traditional deep learning

Machine learning techniques have been demonstrated to perform well even with low amounts of labeled data, and they are low time-consumption methods.

Nevertheless, machine learning techniques remain highly expertise-dependent algorithms. The input data feeding machine learning algorithms are normally features, a processed version of the data. Features permit the reduction of data dimensionality and computational time. However, features are hand-crafted and are expert-knowledge and task dependent.

Furthermore, engineered features cannot represent the salient features of complex activities, and they involve time-consuming feature selection techniques to select the best features [97,98].

Additionally, approaches using hand-crafted features make comparisons between different algorithms very difficult, due to different experimental grounds, and encounter difficulty in discriminating very similar activities [40].

In recent years, deep learning techniques have become more and more attractive in human activity recognition. First applied to the 3D and 2D contexts, in particular in the computer vision domain [99,100], deep learning methods have been shown to be valid methods also adapted to the 1D case, that is, for time series classification [101], such as HAR.

Deep learning techniques have shown many advantages over machine learning, among them the capability to automatically extract features. In particular, depending on the depth of the algorithm, it is possible to achieve a very high abstraction level for the features, in contrast to machine learning techniques [71]. In these terms, deep learning techniques are considered valid algorithms to overcome the machine learning dependency on the feature extraction procedure and show crucial advantages in algorithm performance.


However, deep learning techniques, unlike traditional machine learning approaches, require a large number of samples and expensive hardware to estimate the model [95]. Large-scale inertial datasets with millions of signals recorded by hundreds of human subjects are still not available; instead, several smaller datasets made of thousands of signals and dozens of human subjects are publicly available [102]. It is therefore not obvious in this domain which method between deep and machine learning is the most appropriate, especially in those cases where the hardware is low cost.
Scarcity of data results in an important limitation of machine learning and deep learning approaches in HAR: the difficulty of generalizing the models to the variety of movements performed by different subjects [103]. This variety occurs in relation to the heterogeneity of the hardware on which the inertial data are collected, the different inherent capabilities or attributes of the users themselves, and the differences in the environment in which the data are collected.
One of the most relevant difficulties in facing new situations is the population diversity problem [6], that is, the natural differences between users' activity patterns, which implies that different executions of the same activity are different. A solution is to leverage labeled user-specific data for a personalized approach to activity recognition [104].

7.2 Personalized learning methods

Traditional systems are limited in their ability to generalize to new users and/or new environments, and require considerable effort and customization to achieve good performance in a real context [105,106].
As previously mentioned, one of the most relevant challenges when facing new situations is the population diversity problem: as the number of users of mobile sensing applications increases, the differences between people cause the classification accuracy to degrade quickly [6].
Ideally, algorithms should be trained on a representative
number of subjects and on as many cases as possible. The
number of subjects present in the dataset does not just impact
the quality and robustness of the induced model, but also the
ability to evaluate the consistency of results across subjects
[107].
Furthermore, although new technologies potentially enable storing large amounts of data from varied devices, the actual availability of data is scarce and public datasets are normally very limited (see Sect. 8). In particular, it is very difficult to source the labeled data necessary to train supervised machine learning algorithms. To face this problem, activity classification models should be able to generalize as much as possible with respect to the final user.

Fig. 8 ResNet full schema
Fig. 9 LSTM networks schema. Source: "Nonlinear Dynamic Soft Sensor Modeling With Supervised Long Short-Term Memory Network", by X. Yuan, L. Li, and Y. Wang, 2020, IEEE Transactions on Industrial Informatics, vol. 16, no. 5, pp. 3168–3176, copyright IEEE

Fig. 10 A graphical representation of the three main classification models

The following sections discuss state-of-the-art results related to the population diversity issue, based on the personalization of machine learning and deep learning algorithms.

7.2.1 Personalized machine learning

To achieve generalizable activity recognition models based on machine learning algorithms, three approaches are mainly adopted in the literature:

– Data-based approaches encompass three data split configurations: subject-independent, subject-dependent, and hybrid (a minimal sketch of the three splits is given after this list). The subject-independent (also called impersonal) model does not use the end user's data for the development of the activity recognition model. It is based on the definition of a single activity recognition model that must be flexible enough to generalize the diversity between users, and it should perform well once a new user is to be classified.
The subject-dependent (also called personal) model only uses the end user's data for the development of the activity recognition model. The specific model, being built with the data of the final user, is able to capture her/his peculiarities, and thus, it should generalize well in the real context. The flaw is that it must be implemented for each end user [108].
The hybrid model uses both the end user's data and the data of the other users for the development of the activity recognition model. In other words, the classification model is trained on the data of other users and partially on data from the final user. The idea is that the classifier should more easily recognize the activities performed by the final user.
Figure 10 shows a graphical depiction of the three models to better clarify their differences. Tapia et al. [109] introduced the subject-independent and subject-dependent models, and later Weiss et al. [29] the hybrid model. The models were compared by different researchers and also extended to achieve better performance.
Medrano et al. [110] demonstrated that the subject-dependent approach achieves higher performance than the subject-independent approach for fall detection, the two being called personal and generic fall detectors, respectively.
Shen et al. [111] achieved similar results for activity recognition and came to the conclusion that the subject-dependent (termed personalized) model tends to perform better than the subject-independent (termed generalized) one, because the user's training data carry her/his personalized activity information.
Lara et al. [112] consider the subject-independent approach more challenging because, in practice, a real-time activity recognition system should be able to fit any individual, and they consider it inconvenient in many cases to train the activity model for each subject.
Weiss et al. [29] and Lockhart et al. [61] compared the subject-independent and the subject-dependent (termed impersonal and personal, respectively) with the hybrid model. They concluded that the models built on the subject-dependent and the hybrid approaches achieve the same performance and outperform the model based on the subject-independent approach.
Similar conclusions are achieved by Lane et al. [6], who compare subject-dependent and subject-independent (respectively named isolated and single) models with another model called multi-naive. In this case, the subject-dependent approach outperformed the other two approaches as the amount of available data increases.
Chen et al. [75] compared the subject-independent, subject-dependent, and hybrid (respectively called rest-to-one, one-to-one, and all-to-one) models, and once again the subject-dependent model outperforms the subject-independent model, whereas the hybrid model achieves the best performance. The authors also classify subject-independent and hybrid models as generalized models, while the subject-dependent model falls into the category of the personalized models.
The same results have been achieved by Vaizman et al. [113], who compared the subject-independent, subject-dependent, and hybrid (respectively called universal, individual, and adapted) models. Furthermore, they introduced context-based information by exploiting many sensors, such as location, audio, and phone-state sensors.
– Similarity-based approaches consider the similarity between users as a crucial factor for obtaining a classification model able to adapt to new situations.
Sztyler et al. [114,115] proposed a personalized variant of the hybrid model. The classification model is trained using the data of those users that are similar to the final user based on signal pattern similarity. They found that people with the same fitness level also have similar acceleration patterns regarding the running activity, whereas gender and physique could characterize the walking activity. The heterogeneity of the data is not eliminated, but it is managed in the classification procedure.
A similar approach is presented by Lane et al. [6]. The proposed approach consists in exploiting the similarity between users to weight the collected data. The similarities are calculated based on signal pattern data, on physical data (e.g., age and height), or on a lifestyle index. The value of the similarity is used as a weight: the higher the weight, the more similar two users are, and the more the signals from those users are used for classification.
Garcia-Ceja et al. [116,117] exploited inter-class similarity instead of the similarity between subjects (called inter-user similarity) presented by Lane et al. [6]. The final model is trained using, for each class, only the instances that are similar to those of the target user.
– Classifier-based approaches obtain generalization from several combinations of activity recognition models.
Hong et al. [105] proposed a solution where the generalization is obtained by a combination of activity recognition models (trained by a subject-dependent approach). This combination permits achieving better activity recognition performance for the final user.
Reiss et al. [118] proposed a model that consists of a set of weighted classifiers (experts). Initially, all the weights have the same values. The classifiers are adapted to a new user by considering a new set of suitable weights that better fit the labeled data of the new user.
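The three data-based configurations can be expressed in terms of which subjects contribute to the training set. The following sketch is a minimal illustration; the personal_fraction parameter and the random shuffling are arbitrary assumptions, not values prescribed by the works cited above:

```python
import numpy as np

def make_split(subjects, target, config, personal_fraction=0.3, seed=0):
    """Return train/test indices for one target subject.

    config: 'independent' (no target data in training),
            'dependent'   (only target data in training),
            'hybrid'      (others' data plus part of the target's data).
    """
    rng = np.random.default_rng(seed)
    target_idx = np.flatnonzero(subjects == target)
    others_idx = np.flatnonzero(subjects != target)
    personal = rng.permutation(target_idx)        # shuffle the target's windows
    n_personal = int(personal_fraction * len(personal))

    if config == "independent":
        train, test = others_idx, target_idx
    elif config == "dependent":
        train, test = personal[:n_personal], personal[n_personal:]
    elif config == "hybrid":
        train = np.concatenate([others_idx, personal[:n_personal]])
        test = personal[n_personal:]
    else:
        raise ValueError(config)
    return train, test

subjects = np.repeat(np.arange(5), 20)  # 5 subjects, 20 windows each
train, test = make_split(subjects, target=0, config="hybrid")
```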
Ferrari et al. have recently proposed a similarity-based approach that does not fall into the above classification [70]. The proposed approach is a combination of the data-based and similarity-based approaches. The authors trained the algorithms by exploiting the similarity between subjects and different data splits. They stated that the hybrid models enriched with similarity achieve the best performance with respect to state-of-the-art algorithms.

7.2.2 Personalized deep learning

Personalized deep learning techniques have been explored in the literature and mainly refer to two main approaches:

– Incremental learning refers to recognition methods that can learn from streaming data and adapt to the new moving style of a new, unseen person without retraining [119]. Yu et al. [120] exploited the hybrid model and compared it to a new model called the incremental hybrid model. The latter is trained first with the subject-independent approach, and then, it is incrementally updated based on personal data from a specific user. The difference from the hybrid model is that the incremental hybrid model gives more weight to personal data during training.
Similarly, Siirtola et al. [41] proposed an incremental learning method. The method initially uses a subject-independent model, which is updated with a two-step feature extraction method from the test subject's data. Afterwards, the same authors proposed a four-step subject-dependent model [39]. The proposed method initially uses a subject-independent model, collects and labels the data from the user based on the subject-independent
model, trains a subject-dependent model on the collected and labeled data, and classifies activities based on the subject-dependent model.
Vo et al. [121] exploited a similar approach. The proposed approach first trains a subject-dependent model from the data of subject A. The model of subject A is then transferred to subject B. Then, the unlabeled samples of subject B are classified with the model of subject A. These data are finally used to adjust the model for subject B.
Abdallah et al. [122] propose an incremental learning algorithm based on a clustering procedure, which aims at tuning the general model to recognize a given user's personalized activities.
– Transfer learning is based on a pre-trained network whose weights are adjusted using the new user's data. This procedure permits reducing the time consumption of the training phase. In addition, it is a powerful method when the scarcity of labeled data does not permit training a network from scratch.
Rokni et al. [123] propose to personalize their HAR models with transfer learning, as sketched after this list. In the training phase, a CNN is first trained with data collected from a few participants. In the test phase, only the top layers of the CNN are fine-tuned with a small amount of data from the target users.
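A minimal PyTorch sketch of this fine-tuning strategy follows; the network layout and the choice to freeze everything but the last layer are illustrative assumptions, not the exact architecture of [123]:

```python
import torch
import torch.nn as nn

# a small CNN assumed to be pre-trained on the source subjects
model = nn.Sequential(
    nn.Conv1d(3, 32, kernel_size=5), nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=5), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(64, 6),  # top layer: 6 activity classes
)

# freeze the convolutional feature extractor...
for param in model.parameters():
    param.requires_grad = False
# ...and fine-tune only the top classification layer on the target user's data
for param in model[-1].parameters():
    param.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```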
A recent personalization approach proposed by Ferrari et al. relies on the similarity among subjects. The similarity is used to select the m subjects most similar to the target, and the algorithm is trained with their data [124].
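The subject-selection step can be sketched as follows; using physical metadata (age, height, weight) as the similarity space and the Euclidean distance is an assumption made for illustration, not necessarily the similarity measure of [124]:

```python
import numpy as np

def most_similar_subjects(profiles, target_id, m=5):
    """profiles: dict subject_id -> profile vector (e.g., age, height, weight).
    Returns the m subject ids closest to the target in the profile space."""
    target = np.asarray(profiles[target_id], dtype=float)
    distances = {
        sid: np.linalg.norm(np.asarray(vec, dtype=float) - target)
        for sid, vec in profiles.items() if sid != target_id
    }
    return sorted(distances, key=distances.get)[:m]

profiles = {"s1": (25, 180, 75), "s2": (62, 165, 80), "s3": (27, 178, 72)}
print(most_similar_subjects(profiles, "s1", m=1))  # ['s3']
```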
7.3 Metrics for performance evaluation

In supervised machine learning algorithms, the classification uses three sets of data: the training, the validation, and the test datasets. The training set is designed to estimate the relation between input and output, together with the model parameters. The validation set is designed to refine and tune the model parameters and hyper-parameters. By hyper-parameters are meant the parameters that are not necessarily directly involved in the model, but that define the structure of the algorithm, such as, for instance, the number of channels in a deep network. Finally, the test set is used to evaluate the classification performance of the resulting classification model.
Training, validation, and test sets are generally defined as a partition of the original dataset, mostly representing, for instance, 70%, 20%, and 10% of the number of samples.
It is a common practice to perform the k-fold cross-validation procedure [48,81,125]. The k-fold cross-validation is a procedure that helps to achieve more robust results and avoids that the algorithm specializes on a specific partition of the original dataset. In particular, it consists in splitting the training and test set into k folds. The entire classification procedure is performed on each split, k times. Thus, k models are estimated, and their performances are evaluated and averaged.
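As a minimal sketch of this procedure with scikit-learn, the example below uses GroupKFold with subject identifiers as groups, so that no subject appears in both the training and the test folds; this subject-wise grouping, the classifier, and all sizes are assumptions of the sketch, not prescriptions of the works cited above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

X = np.random.randn(600, 17)                   # one feature vector per window
y = np.random.randint(0, 6, size=600)          # activity labels
subjects = np.random.randint(0, 10, size=600)  # subject id of each window

scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=subjects):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(f"mean accuracy over 5 folds: {np.mean(scores):.3f}")
```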
The classification performance is calculated through heuristic metrics based on the correctly classified samples. In particular, these metrics are all derived from the confusion matrix.
In supervised machine learning, the confusion matrix compares the groundtruth (the observed labels) with the estimated labels. The binary case is shown in Table 6.

Table 6 Confusion matrix

Groundtruth | Estimated: 1         | Estimated: 0
1           | True positives (TP)  | False negatives (FN)
0           | False positives (FP) | True negatives (TN)

True positives are observed 1-class samples which are classified as 1. True negatives represent the number of observed 0-class samples which are classified as 0. False negatives are observed 1-class samples which are classified as 0. Vice versa, false positives represent the number of samples classified as 1 but which truly belong to the 0 class.
The confusion matrix can be extended to the multi-class classification problem. In this case, the numbers of correctly classified samples are displayed on the principal diagonal, while the mis-classified samples are listed outside the principal diagonal.
The classification performance can be measured by focusing either on the number of correctly classified samples or by giving more importance to the mis-classifications. The choice of the evaluation metric accentuates either one or the other aspect of the classification.
In the context of HAR, the accuracy is the most used metric for the evaluation of the classification performance [6,33,40,126]. According to the confusion matrix shown in Table 6, accuracy (Acc) is defined as follows:

Acc = (TP + TN) / (TP + TN + FP + FN).

It calculates the percentage of correctly classified samples over the total number of samples. The accuracy highlights the correct classification performance and gives more emphasis to the classification of the true positives and of the true negatives.
In some cases, it is required that the evaluation of the classification performance accentuates the mis-classifications, that is, the false positive or false negative cases. For instance, in the case of fall detection, the algorithm should be penalized more if it does not recognize a fall when it occurs
(false negative) than when it recognizes a normal behavior as a fall (false positive).
An appropriate metric for this case is the Fβ-score. It is defined as a function of recall and precision.
Recall, also called sensitivity or true positive rate, is calculated as the number of correct positive predictions divided by the total number of positives; the best value corresponds to 1, the worst to 0.
Precision, also called positive predictive value, is calculated as the number of correct positive predictions divided by the total number of positive predictions; the best precision is 1, whereas the worst is 0.
The formulas are given by:

precision = TP / (TP + FP)
recall = TP / (TP + FN)
Fβ = ((1 + β²) · precision · recall) / (β² · precision + recall)

If β = 1, the F1-score is the harmonic mean of precision and recall.
The specificity, also called true negative rate, is calculated as the number of correct negative predictions divided by the total number of negatives. The best value corresponds to 1, while the worst is 0. Together with the sensitivity, the specificity helps to determine the best parameter value when a tuning procedure is computed. A common practice is to calculate the area under the curve (AUC) created by plotting values of the sensitivity against 1-specificity. This curve is called the Receiver-Operating Characteristic (ROC) curve. The value of the parameter that maximizes the classification performance corresponds to the point on the ROC curve where the AUC is maximal.
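All the metrics discussed in this section derive from the four entries of the binary confusion matrix, so they can be computed in a few lines; the example labels below are arbitrary:

```python
import numpy as np

def binary_metrics(y_true, y_pred, beta=1.0):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    f_beta = ((1 + beta**2) * precision * recall
              / (beta**2 * precision + recall))
    return dict(accuracy=accuracy, precision=precision, recall=recall,
                specificity=specificity, f_beta=f_beta)

y_true = [1, 1, 0, 0, 1, 0, 1, 0]  # e.g., 1 = fall, 0 = normal activity
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```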
8 Datasets Position-aware and position-unaware scenarios have been
presented in [35]. In position-aware scenarios, the recogni-
In recent years, the spread of wearable devices has lead to tion accuracy on different positions is evaluated individu-
a huge availability of physical activity data. Smartphones ally, while in position-unaware scenarios, the classification
and smartwatches have become more and more pervasive performance of the combination of devices positions is mea-
and ubiquitous in our everyday life. This high diffusion and sured. It is shown that the latter approach highly improves the
portability of wearable devices has enabled researchers to classification performances for some activities, such as walk-
easily produce plenty of labeled raw data for human activity ing, walking upstairs, and walking downstairs. Almaslukh et
recognition. al. exploited deep learning technique for classification and
Several public datasets are open to the HAR community demonstrated its capability to produce an effective position-
and are freely accessible on the web, see for instance the UC independent HAR.
Irvine Machine Learning Repository [127]. Column Sensors lists the sensors exploited in data collec-
Table 7 shows the main characteristics of the most used tion. Tri-axial acceleration sensor (A) is the most exploited
datasets in the state-of-the-art. inertial sensor among the literature [7]. Datasets D9, D14,
Most of the datasets used contain signals recorded by and D15 even collected just acceleration data. Acceleration
smartphones. Some datasets also contain signals from both is very popular, because it both directly captures the subjects’
smartphones and IMUs, and from both smartphones and physiology motion status and it consumes low energy [137].
smartwatches (datasets D03, D010, D11, and D16).

Table 7 Public HAR datasets collecting inertial signals recorded from smartphones

ID  | Dataset                                                                            | # Activities                    | # Subjects | Devices       | Sensors             | Sampling rate (Hz) | Metadata                                    | Reference
D01 | UCI HAR                                                                            | 6 ADL                           | 30         | SP(1)         | A,G                 | 50                 | No                                          | [49]
D02 | Smartphone-based recognition of human activities and postural transitions data set | 6 ADL                          | 30         | SP(1)         | A,G                 | 50                 | No                                          | [49]
D03 | HHAR                                                                               | 6 ADL                           | 9          | SP(8),SW(4)   | A,G                 | H                  | No                                          | [128]
D04 | Physical activity recognition dataset using smartphone sensors                    | 6 ADL                           | 4          | SP(4)         | A,G,M               | 50                 | No                                          | [20]
D05 | Sensors activity dataset                                                           | 7 ADL                           | 10         | SP(5)         | A,G,M,LA            | 50                 | No                                          | [35]
D06 | Complex human activities dataset                                                   | 13 ADL                          | 10         | SP(2)         | A,G,LA              | 50                 | No                                          | [81]
D07 | MotionSense                                                                        | 6 ADL                           | 24         | SP(1)         | A,G,AT              | 50                 | Gender, age, height, weight                 | [129]
D08 | MobiAct                                                                            | 11 ADL, 4 F                     | 67         | SP(1)         | A,G,OR              | 87                 | Gender, age, height, weight                 | [130]
D09 | UniMiB-SHAR                                                                        | 9 ADL, 8 F                      | 30         | SP(1)         | A                   | 50                 | Gender, age, height, weight                 | [43]
D10 | UMAFall                                                                            | 12 ADL, 3 F                     | 19         | SP(1),IMU(4)  | A,G,M               | 200, 20            | Gender, age, height, weight                 | [131]
D11 | RealWorld                                                                          | 8 ADL                           | 15         | SP(6),SW(1)   | A,G,GPS,L,M,S       | 50                 | Gender, age, height, weight                 | [37]
D12 | WISDM                                                                              | 6 ADL                           | 29         | SP(1)         | A                   | 20                 | No                                          | [62]
D13 | Smartphone dataset for HAR in ambient assisted living (AAL) data set              | 6 ADL                           | 30         | SP(1)         | A,G                 | 50                 | No                                          | [49]
D14 | Daily activity dataset                                                             | 5 ADL                           | 8          | SP(1)         | A                   | 40                 | No                                          | [132]
D15 | HASC2010                                                                           | 6 ADL                           | 96         | SP(1)         | A                   | [10–100]           | Gender, height, weight, shoes, floor, place | [133]
D16 | Extrasensory dataset                                                               | 7 ADL + 109 specific activities | 60         | SP(1),SW      | A,G,M,CO,LO,S,SM,ST | 40, 25             | No                                          | [113]

ADL = activity of daily living, F = falls; A = accelerometer, LA = linear acceleration sensor, G = gyroscope, M = magnetometer, AT = attitude, OR = orientation, L = light, S = sound, SM = sound magnitude, GPS = Global Positioning System, CO = compass, LO = location, ST = phone state, H = highest frequency possible; SP = smartphone, SW = smartwatch, IMU = inertial measurement unit
Acceleration has been combined with other sensors, such as gyroscopes, magnetometers, GPS, and biosensors, with the aim of improving the activity classification performance.
In general, data captured from several sensors carry additional information about the activity and about the device settings. For instance, information derived from the gyroscope is used to maintain the reference direction in the motion system and permits determining the orientation of the smartphone [32,51].
Performance comparisons between the gyroscope, acceleration, and their combination for human activity recognition have been explored in many studies. For example, Ferrari et al. showed that the accelerometer performs better than the gyroscope and that their combination leads to an overall improvement of about 10% [36]. Shoaib et al. stated that in situations where the accelerometer and gyroscope individually perform with low accuracies, their combination improves the overall performance, while when one of the sensors performs with higher accuracy, the performance is not impacted by the combination of the sensors [35].
Column Sampling Rate shows the frequency at which the data are acquired. The sampling rate has to be high enough to capture the most significant behavior of the data. In HAR, the most commonly used sampling rate when recording inertial data is 50 Hz (see Table 7).
Column Metadata lists characteristics regarding the subjects that performed the activities. In D07–D11 and D15, physical characteristics are annotated. In D15, environmental characteristics have also been stored, such as the kind of shoes worn, floor characteristics, and the places where the activities have been performed. As discussed in Sect. 7, metadata are precious additional information, which helps to overcome the population diversity problem.

9 Lessons learned and future directions

In this study, we covered the main steps of the Activity Recognition Process (ARP). For each phase of the ARP pipeline, we highlighted the key aspects that are being considered and that are most challenging in the field of HAR.
Specifically, when considering the Data Acquisition phase, we noted that the number and kind of available devices is constantly increasing and new devices and sensors are introduced every day. To take advantage of this aspect, new sensors should be experimented with in HAR applications to determine whether or not they can be employed to recognize actions. Moreover, new combinations (data fusion) are possible, which may again increase the ability of the data to represent the performed activities.
This increase in sensor numbers and types, while ensuring the availability of more data sources, may pose a challenge in terms of heterogeneity, as not all the devices and sensors share the exact same specifications. As an example, some accelerometers may output signals including the low frequencies of the gravity acceleration, while others may exclude it internally. For this reason, the preprocessing phase is of paramount importance to reduce signal differences due to heterogeneous sources and improve the consistency between the in vitro training (usually performed with specific devices and sensors) and real-world use, where the devices and sensors may be similar, but not equal, to the ones used when the models have been trained.
Moreover, we covered the fact that the way the signal is segmented and fed to the classification model may have a significant impact on the results. In the literature, sliding windows with a 50% overlap are the most common choice.
Another aspect we highlighted during this study is the importance of the features used to train the model, as they have a significant impact on the performance of the classifiers. Specifically, hand-crafted features may better model some already known traits of the signals, but automatically extracted features are free of bias and may uncover unforeseen patterns and characteristics.
New and improved features that are able to better represent the peculiar traits of human activities are needed: ideally, they would combine the domain knowledge of the experts given in the hand-crafted features and the lack of bias provided by the automatically generated features.
Finally, regarding the Classification phase, we highlighted how Model-Driven approaches are being replaced by Data-Driven approaches, as the latter are usually better performing. Among the Data-Driven approaches, we find that both traditional ML approaches and more modern DL techniques can be applied to HAR problems. Specifically, we learned that while DL methods outperform traditional ML most of the time and are able to automatically extract the features, they require significantly larger amounts of computational power and data than traditional ML techniques, which makes the latter still a good fit for many use cases.
Regardless of the classification method, we discussed how population diversity may impact the performance of HAR applications. To alleviate this problem, we mentioned some recent trends regarding the personalization of the models. Personalizing a classification model means identifying only a portion of the population that is similar to the current subject under some perspective, and then using only this subset to train the classifier. The resulting model should better fit the end user. This, however, may exacerbate the issue of data scarcity, since only small portions of the full datasets may be used to train the model for that specific user.
To solve this issue, more large-scale data collection campaigns are needed, as well as further studies in the field of dataset combination and preprocessing pipelines to effectively combine and reduce differences among data acquired from different sources.
10 Conclusions

This paper surveyed the state-of-the-art and new trends in human activity recognition using smartphones. In particular, we went through the activity recognition process: the data acquisition, preprocessing, data segmentation, feature extraction, and classification steps.
Each step has been analyzed by detailing its objectives and discussing the techniques mainly adopted for its realization.
We conclude the review by providing some considerations on the state of maturity of the techniques currently employed in each step and by providing some ideas for future research in the field.
We do not claim to have included everything that has been published on human activity recognition, but we believe that our paper can be a good guide for all those researchers and practitioners that approach this topic for the first time.

Funding Open access funding provided by Università degli Studi di Milano - Bicocca within the CRUI-CARE Agreement.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Foerster F, Smeja M, Fahrenberg J (1999) Detection of posture and motion by accelerometry: a validation study in ambulatory monitoring. Comput Hum Behav 15(5):571
2. Sun S, Folarin AA, Ranjan Y, Rashid Z, Conde P, Stewart C, Cummins N, Matcham F, Dalla Costa G, Simblett S et al (2020) Using smartphones and wearable devices to monitor behavioral changes during COVID-19. J Med Intern Res 22(9)
3. Mukherjee D, Mondal R, Singh PK, Sarkar R, Bhattacharjee D (2020) EnsemConvNet: a deep learning approach for human activity recognition using smartphone sensors for healthcare applications. Multimed Tools Appl 79(41):31663
4. Iyengar K, Upadhyaya GK, Vaishya R, Jain V (2020) COVID-19 and applications of smartphone technology in the current pandemic. Diabetes Metab Syndrome: Clin Res Rev 14(5):733
5. Shoaib M, Bosch S, Incel O, Scholten H, Havinga P (2015) A survey of online activity recognition using mobile phones. Sensors 15(1):2059
6. Lane ND, Xu Y, Lu H, Hu S, Choudhury T, Campbell AT, Zhao F (2011) Enabling large-scale human activity inference on smartphones using community similarity networks (CSN). In: Proceedings of the international conference on ubiquitous computing (UbiComp)
7. Lara OD, Labrador MA et al (2013) A survey on human activity recognition using wearable sensors. IEEE Commun Surv Tutor 15(3):1192
8. Peng X, Wang L, Wang X, Qiao Y (2016) Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. Comput Vis Image Underst 150:109
9. Shou Z, Chan J, Zareian A, Miyazawa K, Chang SF (2017) CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5734–5743
10. Zhang S, Wei Z, Nie J, Huang L, Wang S, Li Z (2017) A review on human activity recognition using vision-based method. J Healthcare Eng 2017
11. Gonzàlez J, Moeslund TB, Wang L et al (2012) Semantic understanding of human behaviors in image sequences: from video-surveillance to video-hermeneutics. Comput Vis Image Underst 116(3):305
12. Buettner M, Prasad R, Philipose M, Wetherall D (2009) Recognizing daily activities with RFID-based sensors. In: Proceedings of the 11th international conference on ubiquitous computing, pp 51–60
13. Ofli F, Chaudhry R, Kurillo G, Vidal R, Bajcsy R (2013) Berkeley MHAD: a comprehensive multimodal human action database. In: 2013 IEEE workshop on applications of computer vision (WACV), IEEE, pp 53–60
14. Wang W, Liu AX, Shahzad M, Ling K, Lu S (2017) Device-free human activity recognition using commercial WiFi devices. IEEE J Sel Areas Commun 35(5):1118
15. Qi J, Yang P, Waraich A, Deng Z, Zhao Y, Yang Y (2018) Examining sensor-based physical activity recognition and monitoring for healthcare using Internet of Things: a systematic review. J Biomed Inf 87:138
16. Sreenilayam SP, Ahad IU, Nicolosi V, Garzon VA, Brabazon D (2020) Advanced materials of printed wearables for physiological parameter monitoring. Mater Today 32:147
17. Godfrey A, Hetherington V, Shum H, Bonato P, Lovell N, Stuart S (2018) From A to Z: wearable technology explained. Maturitas 113:40
18. Chotpitayasunondh V, Douglas KM (2016) How "phubbing" becomes the norm: the antecedents and consequences of snubbing via smartphone. Comput Hum Behav 63:9
19. Perlow LA (2012) Sleeping with your smartphone: how to break the 24/7 habit and change the way you work. Harvard Business Press, Harvard
20. Shoaib M, Scholten H, Havinga PJ (2013) Towards physical activity recognition using smartphone sensors. In: 2013 IEEE 10th international conference on ubiquitous intelligence and computing and 2013 IEEE 10th international conference on autonomic and trusted computing, IEEE, pp 80–87
21. Muralidharan K, Khan AJ, Misra A, Balan RK, Agarwal S (2014) Barometric phone sensors: more hype than hope! In: Proceedings of the 15th workshop on mobile computing systems and applications, pp 1–6
22. Cheng J, Sundholm M, Zhou B, Hirsch M, Lukowicz P (2016) Smart-surface: large scale textile pressure sensors arrays for activity recognition. Pervas Mob Comput 30:97
23. Foubert N, McKee AM, Goubran RA, Knoefel F (2012) Lying and sitting posture recognition and transition detection using a pressure sensor array. In: 2012 IEEE international symposium on medical measurements and applications proceedings, IEEE, pp 1–6
24. Rehman M, Ziaur Waris A, Gilani SO, Jochumsen M, Niazi IK, Jamil M, Farina D, Kamavuako EN (2018) Multiday EMG-based classification of hand motions with deep learning techniques. Sensors 18(8):2497
25. Liu J, Chen J, Jiang H, Jia W, Lin Q, Wang Z (2018) Activity recognition in wearable ECG monitoring aided by accelerometer data. In: 2018 IEEE international symposium on circuits and systems (ISCAS), IEEE, pp 1–4
26. Bao L, Intille SS (2004) Activity recognition from user-annotated acceleration data. In: International conference on pervasive computing, Springer, New York, pp 1–17
27. Lee SM, Yoon SM, Cho H (2017) Human activity recognition from accelerometer data using convolutional neural network. In: 2017 IEEE international conference on big data and smart computing (BigComp), IEEE, pp 131–134
28. Shakya SR, Zhang C, Zhou Z (2018) Comparative study of machine learning and deep learning architecture for human activity recognition using accelerometer data. Int J Mach Learn Comput 8:577
29. Weiss GM, Lockhart JW (2012) The impact of personalization on smartphone-based activity recognition. In: Proceedings of the AAAI workshop on activity context representation: techniques and languages
30. Milenkoski M, Trivodaliev K, Kalajdziski S, Jovanov M, Stojkoska BR (2018) Real time human activity recognition on smartphones using LSTM networks. In: 2018 41st international convention on information and communication technology, electronics and microelectronics (MIPRO), IEEE, pp 1126–1131
31. Almaslukh B, Artoli AM, Al-Muhtadi J (2018) A robust deep learning approach for position-independent smartphone-based human activity recognition. Sensors 18(11):3726
32. Alruban A, Alobaidi H, Clarke N, Li F (2019) Physical activity recognition by utilising smartphone sensor signals. In: 8th international conference on pattern recognition applications and methods, SciTePress, pp 342–351
33. Hernández F, Suárez LF, Villamizar J, Altuve M (2019) Human activity recognition on smartphones using a bidirectional LSTM network. In: 2019 XXII symposium on image, signal processing and artificial vision (STSIVA), IEEE, pp 1–5
34. Hassan MM, Uddin MZ, Mohamed A, Almogren A (2018) A robust human activity recognition system using smartphone sensors and deep learning. Fut Gen Comput Syst 81:307
35. Shoaib M, Bosch S, Incel OD, Scholten H, Havinga PJ (2014) Fusion of smartphone motion sensors for physical activity recognition. Sensors 14(6):10146
36. Ferrari A, Micucci D, Mobilio M, Napoletano P (2019) Human activities recognition using accelerometer and gyroscope. In: European conference on ambient intelligence, Springer, New York, pp 357–362
37. Sztyler T, Stuckenschmidt H (2016) On-body localization of wearable devices: an investigation of position-aware activity recognition. In: 2016 IEEE international conference on pervasive computing and communications (PerCom), IEEE, pp 1–9
38. Bharti P, De D, Chellappan S, Das SK (2018) HuMAn: complex activity recognition with multi-modal multi-positional body sensing. IEEE Trans Mob Comput 18(4):857
39. Siirtola P, Koskimäki H, Röning J (2019) From user-independent to personal human activity recognition models exploiting the sensors of a smartphone. arXiv:1905.12285
40. Zhu R, Xiao Z, Li Y, Yang M, Tan Y, Zhou L, Lin S, Wen H (2019) Efficient human activity recognition solving the confusing activities via deep ensemble learning. IEEE Access 7:75490
41. Siirtola P, Koskimäki H, Röning J (2019) Personalizing human activity recognition models using incremental learning. arXiv:1905.12628
42. Li F, Shirahama K, Nisar MA, Köping L, Grzegorzek M (2018) Comparison of feature learning methods for human activity recognition using wearable sensors. Sensors 18(2):679
43. Micucci D, Mobilio M, Napoletano P (2017) UniMiB SHAR: a dataset for human activity recognition using acceleration data from smartphones. Appl Sci 7(10):1101
44. Khan AM, Lee YK, Lee SY, Kim TS (2010) Human activity recognition via an accelerometer-enabled-smartphone using kernel discriminant analysis. In: 2010 5th international conference on future information technology, IEEE, pp 1–6
45. Ravi N, Dandekar N, Mysore P, Littman ML (2005) Activity recognition from accelerometer data. In: Proceedings of the conference on innovative applications of artificial intelligence (IAAI)
46. Lester J, Choudhury T, Borriello G (2006) A practical approach to recognizing physical activities. In: International conference on pervasive computing, Springer, New York, pp 1–16
47. Gyllensten IC, Bonomi AG (2011) Identifying types of physical activity with a single accelerometer: evaluating laboratory-trained algorithms in daily life. IEEE Trans Biomed Eng 58(9):2656
48. Bayat A, Pomplun M, Tran DA (2014) A study on human activity recognition using accelerometer data from smartphones. Proc Comput Sci 34:450
49. Anguita D, Ghio A, Oneto L, Parra X, Reyes-Ortiz JL (2013) A public domain dataset for human activity recognition using smartphones. In: Proceedings of the European symposium on artificial neural networks, computational intelligence and machine learning (ESANN13)
50. Bo X, Huebner A, Poellabauer C, O'Brien MK, Mummidisetty CK, Jayaraman A (2017) Evaluation of sensing and processing parameters for human action recognition. In: 2017 IEEE international conference on healthcare informatics (ICHI), IEEE, pp 541–546
51. Su X, Tong H, Ji P (2014) Activity recognition with smartphone sensors. Tsinghua Sci Technol 19(3):235
52. Antonsson EK, Mann RW (1985) The frequency content of gait. J Biomech 18(1):39
53. Quigley B, Donnelly M, Moore G, Galway L (2018) A comparative analysis of windowing approaches in dense sensing environments. In: Multidisciplinary Digital Publishing Institute Proceedings, vol 2, p 1245
54. Banos O, Galvez JM, Damas M, Pomares H, Rojas I (2014) Window size impact in human activity recognition. Sensors 14(4):6474
55. Chen K, Zhang D, Yao L, Guo B, Yu Z, Liu Y (2020) Deep learning for sensor-based human activity recognition: overview, challenges and opportunities. arXiv:2001.07416
56. Janidarmian M, Roshan Fekr A, Radecka K, Zilic Z (2017) A comprehensive analysis on wearable acceleration sensors in human activity recognition. Sensors 17(3):529
57. Capela NA, Lemaire ED, Baddour N (2015) Improving classification of sit, stand, and lie in a smartphone human activity recognition system. In: 2015 IEEE international symposium on medical measurements and applications (MeMeA) proceedings, IEEE, pp 473–478
58. Langley P (1996) Elements of machine learning. Morgan Kaufmann, New York
59. Ferrari A, Micucci D, Mobilio M, Napoletano P (2019) Hand-crafted features vs residual networks for human activities recognition using accelerometer. In: Proceedings of the IEEE international symposium on consumer technologies (ISCT)
60. Liu H, Motoda H (1998) Feature extraction, construction and selection: a data mining perspective, vol 453. Springer, New York
61. Lockhart JW, Weiss GM (2014) The benefits of personalized smartphone-based activity recognition models. In: Proceedings of the 2014 SIAM international conference on data mining, SIAM, pp 614–622
62. Kwapisz JR, Weiss GM, Moore SA (2011) Activity recognition using cell phone accelerometers. ACM SIGKDD Explor Newsl 12(2):74
63. Altun K, Barshan B, Tunçel O (2010) Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recogn 43(10):3605
64. Sani S, Massie S, Wiratunga N, Cooper K (2017) Learning deep and shallow features for human activity recognition. In: International conference on knowledge science, engineering and management, Springer, New York, pp 469–482
65. Plötz T, Hammerla NY, Olivier PL (2011) Feature learning for activity recognition in ubiquitous computing. In: Twenty-second international joint conference on artificial intelligence
66. Lago P, Inoue S (2019) Comparing feature learning methods for human activity recognition: performance study in new user scenario. In: 2019 joint 8th international conference on informatics, electronics & vision (ICIEV) and 2019 3rd international conference on imaging, vision & pattern recognition (icIVPR), IEEE, pp 118–123
67. Wang J, Liu P, She MF, Nahavandi S, Kouzani A (2013) Bag-of-words representation for biomedical time series classification. Biomed Signal Process Control 8(6):634
68. Shirahama K, Grzegorzek M (2017) On the generality of codebook approach for sensor-based human activity recognition. Electronics 6(2):44
69. Abdi H, Williams LJ (2010) Principal component analysis. Wiley Interdiscip Rev Comput Stat 2(4):433
70. Ferrari A, Micucci D, Mobilio M, Napoletano P (2020) On the personalization of classification models for human activity recognition. IEEE Access 8:32066
71. Wang J, Chen Y, Hao S, Peng X, Hu L (2019) Deep learning for sensor-based activity recognition: a survey. Pattern Recogn Lett 119:3
72. Zhang W, Yang G, Lin G, Ji C, Gupta MM (2018) On definition of deep learning. In: 2018 world automation congress (WAC), IEEE, pp 1–5
73. Lin Y, Zhang W (2004) Towards a novel interface design framework: function-behavior-state paradigm. Int J Hum Comput Stud 61(3):259
74. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273
75. Chen Y, Shen C (2017) Performance analysis of smartphone-sensor behavior for human activity recognition. IEEE Access 5:3095
76. Amezzane I, Fakhri Y, El Aroussi M, Bakhouya M (2018) Towards an efficient implementation of human activity recognition for mobile devices. EAI Endorsed Trans Context-Aware Syst Appl 4(13)
77. Vaughn A, Biocco P, Liu Y, Anwar M (2018) Activity detection and analysis using smartphone sensors. In: 2018 IEEE international conference on information reuse and integration (IRI), IEEE, pp 102–107
78. Xu W, Pang Y, Yang Y, Liu Y (2018) Human activity recognition based on convolutional neural network. In: 2018 24th international conference on pattern recognition (ICPR), IEEE, pp 165–170
79. Jalal A, Quaid MAK, Hasan AS (2018) Wearable sensor-based human behavior understanding and recognition in daily life for smart environments. In: 2018 international conference on frontiers of information technology (FIT), IEEE, pp 105–110
80. Witten IH, Frank E, Hall MA (2005) Practical machine learning tools and techniques. Morgan Kaufmann
81. Shoaib M, Bosch S, Incel OD, Scholten H, Havinga PJ (2016) Complex human activity recognition using smartphone and wrist-worn motion sensors. Sensors 16(4):426
82. Rokach L, Maimon OZ (2008) Data mining with decision trees: theory and applications, vol 69. World Scientific, Singapore
83. Breiman L (1999) Random forests–random features. Technical report, University of California, Berkeley
84. Polu SK (2018) Human activity recognition on smartphones using machine learning algorithms. Int J Innovat Res Sci Technol 5(6):31
85. Bansal A, Shukla A, Rastogi S, Mittal S (2018) Micro activity recognition of mobile phone users using inbuilt sensors. In: 2018 8th international conference on cloud computing, data science & engineering (Confluence), IEEE, pp 225–230
86. Antal P (1998) Construction of a classifier with prior domain knowledge formalised as Bayesian network. In: IECON'98, proceedings of the 24th annual conference of the IEEE industrial electronics society (Cat. No. 98CH36200), vol 4, IEEE, pp 2527–2531
87. Nguyen H, Tran KP, Zeng X, Koehl L, Tartare G (2019) Wearable sensor data based human activity recognition using machine learning: a new approach. arXiv:1905.03809
88. Yu T, Chen J, Yan N, Liu X (2018) A multi-layer parallel LSTM network for human activity recognition with smartphone sensors. In: 2018 10th international conference on wireless communications and signal processing (WCSP), IEEE, pp 1–6
89. Suto J, Oniga S, Lung C, Orha I (2018) Comparison of offline and real-time human activity recognition results using machine learning techniques. Neural Computing and Applications, pp 1–14
90. Nair N, Thomas C, Jayagopi DB (2018) Human activity recognition using temporal convolutional network. In: Proceedings of the 5th international workshop on sensor-based activity recognition and interaction, pp 1–8
91. Demrozi F, Pravadelli G, Bihorac A, Rashidi P (2020) Human activity recognition using inertial, physiological and environmental sensors: a comprehensive survey. arXiv:2004.08821
92. Ronao CA, Cho SB (2015) Deep convolutional neural networks for human activity recognition with smartphone sensors. In: International conference on neural information processing, Springer, pp 46–53
93. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778
94. Bianco S, Cadene R, Celona L, Napoletano P (2018) Benchmark analysis of representative deep neural network architectures. IEEE Access 6:64270
95. Ferrari A, Micucci D, Mobilio M, Napoletano P (2019) Hand-crafted features vs residual networks for human activities recognition using accelerometer. In: 2019 IEEE 23rd international symposium on consumer technologies (ISCT), IEEE, pp 153–156
96. Hammerla NY, Halloran S, Plötz T (2016) Deep, convolutional, and recurrent models for human activity recognition using wearables. arXiv:1604.08880
97. Friday NH, Al-garadi MA, Mujtaba G, Alo UR, Waqas A (2018) Deep learning fusion conceptual frameworks for complex human activity recognition using mobile and wearable sensors. In: 2018 international conference on computing, mathematics and engineering technologies (iCoMET), IEEE, pp 1–7
98. Yang J, Nguyen MN, San PP, Li XL, Krishnaswamy S (2015) Deep convolutional neural networks on multichannel time series for human activity recognition. In: Proceedings of the international joint conference on artificial intelligence (IJCAI 15)
99. Coşar S, Donatiello G, Bogorny V, Garate C, Alvares LO, Brémond F (2016) Toward abnormal trajectory and event detection in video surveillance. IEEE Trans Circ Syst Video Technol 27(3):683
for intelligent video surveillance systems: a review. Expert Syst human activity recognition models: the importance of human AI
Appl 91:480 collaboration. Sensors 19(23):5151
101. LeCun Y, Bengio Y et al (1995) Convolutional networks for 120. Yu T, Zhuang Y, Mengshoel OJ, Yagan O (2016) Hybridizing
images, speech, and time series. Handb Brain Theory Neural Netw personal and impersonal machine learning models for activity
3361(10):1995 recognition on mobile devices. In: Proceedings of the EAI interna-
102. Siirtola P, Koskimäki H, Röning J (2018) OpenHAR: A Matlab tional conference on mobile computing, applications and services
toolbox for easy access to publicly open human activity data sets. (MobiCASE)
In: Proceedings of the ACM international joint conference and 121. Vo QV, Hoang MT, Choi D (2013) Personalization in mobile activ-
international symposium on pervasive and ubiquitous computing ity recognition system using K-medoids clustering algorithm. Int
and wearable computers (UbiComp18) J Distrib Sens Netw 9(7):315841
103. Bianchi V, Bassoli M, Lombardo G, Fornacciari P, Mordonini M, 122. Abdallah ZS, Gaber MM, Srinivasan B, Krishnaswamy S (2015)
De Munari I (2019) IoT wearable sensor and deep learning: an Adaptive mobile activity recognition system with evolving data
integrated approach for personalized human activity recognition streams. Neurocomputing 150:304
in a smart home environment. IEEE Internet of Things J 6(5):8553 123. Rokni SA, Nourollahi M, Ghasemzadeh H (2018) Personalized
104. Burns DM, Whyne CM (2020) Personalized activity recognition human activity recognition using convolutional neural networks.
with deep triplet embeddings. arXiv:2001.05517 In: Thirty-second AAAI conference on artificial intelligence
105. Hong JH, Ramos J, Dey AK (2016) Toward personalized activity 124. Ferrari A, Micucci D, Mobilio M, Napoletano P (2020) On the
recognition systems with a semipopulation approach. IEEE Trans personalization of classification models for human activity recog-
Hum-Mach Syst 46(1):101–112 nition. arXiv:2009.00268 (2020)
106. Igual R, Medrano C, Plaza I (2015) A comparison of public 125. Ronao CA, Cho SB (2014) Human activity recognition using
datasets for acceleration-based fall detection. Med Eng Phys smartphone sensors with two-stage continuous hidden Markov
37(9):870 models. In: 2014 10th International conference on natural com-
107. Lockhart JW, Weiss GM (2014) Limitations with activity recogni- putation (ICNC), IEEE, pp 681–686
tion methodology & data sets. In: Proceedings of the 2014 ACM 126. Su X, Tong H, Ji P (2014) Accelerometer-based activity recog-
international joint conference on pervasive and ubiquitous com- nition on smartphone. In: Proceedings of the 23rd ACM interna-
puting: adjunct publication, pp 747–756 tional conference on conference on information and knowledge
108. Berchtold M, Budde M, Schmidtke HR, Beigl M (2010) An exten- management, pp 2021–2023
sible modular recognition concept that makes activity recognition 127. Bay SD, Kibler D, Pazzani MJ, Smyth P (2000) The UCI KDD
practical. In: Annual conference on artificial intelligence (AAAI) archive of large data sets for data mining research and experimen-
109. Tapia EM, Intille SS, Haskell W, Larson K, Wright J, King A, tation. ACM SIGKDD Explor Newsl 2(2):81
Friedman R (2007) Real-time recognition of physical activities 128. Stisen A, Blunck H, Bhattacharya S, Prentow TS, Kjaergaard MB,
and their intensities using wireless accelerometers and a heart rate Dey A, Sonne T, Jensen MM (2015) Smart devices are different:
monitor. In: Proceeding of the IEEE international symposium on assessing and mitigating mobile sensing heterogeneities for activ-
wearable computers (ISWC) ity recognition. In: Proceedings of the 13th ACM conference on
110. Medrano C, Igual R, Plaza I, Castro M (2014) Detecting falls embedded networked sensor systems, pp 127–140
as novelties in acceleration patterns acquired with smartphones. 129. Malekzadeh M, Clegg RG, Cavallaro A, Haddadi H (2018) Pro-
PLoS One 9(4):e94811 tecting sensory data against sensitive inferences. In: Proceedings
111. Shen C, Chen Y, Yang G (2016) On motion-sensor behavior analy- of the workshop on privacy by design in distributed systems (W-
sis for human-activity recognition via smartphones. In: 2016 Ieee P2DS18)
International Conference on Identity, Security and Behavior Anal- 130. Vavoulas G, Chatzaki C, Malliotakis T, Pediaditis M, Tsik-
ysis (Isba), IEEE, pp 1–6 nakis M (2016) The MobiAct dataset: recognition of activities
112. Lara OD, Pérez AJ, Labrador MA, Posada JD (2012) Centinela: of daily living using smartphones. In: Proceedings of Information
a human activity recognition system based on acceleration and and Communication Technologies for Ageing Well and e-Health
vital sign data. Pervasiv Mob Comput 8(5):717 (ICT4AgeingWell16)
113. Vaizman Y, Ellis K, Lanckriet G (2017) Recognizing detailed 131. Casilari E, Santoyo-Ramón JA, Cano-García JM (2017)
human context in the wild from smartphones and smartwatches. UMAFall: a multisensor dataset for the research on automatic
IEEE Pervasive Comput 16(4):62 fall detection. Procedia Comput Sci 110:32
114. Sztyler T, Stuckenschmidt H (2017) Online personalization of 132. Siirtola P, Röning J (2012) Recognizing human activities user-
cross-subjects based activity recognition models on wearable independently on smartphones based on accelerometer data.
devices. In: Proceedings of the IEEE international conference on IJIMAI 1(5):38
pervasive computing and communications (PerCom) 133. Kawaguchi N, Watanabe H, Yang T, Ogawa N, Iwasaki Y, Kaji K,
115. Sztyler T, Stuckenschmidt H, Petrich W (2017) Position-aware Terada T, Murao K, Hada H, Inoue S et al (2012) Hasc2012corpus:
activity recognition with wearable devices. Pervasiv Mob Comput large scale human activity corpus and its application. In: Proceed-
38:281 ings of the second international workshop of mobile sensing: from
116. Garcia-Ceja E, Brena R (2015) Building personalized activity smartphones and wearables to big data, pp 10–14
recognition models with scarce labeled data based on class simi- 134. Ferrari A, Mobilio M, Micucci D, Napoletano P (2019) On the
larities. In: International conference on ubiquitous computing and homogenization of heterogeneous inertial-based databases for
ambient intelligence, Springer, New York, pp 265–276 human activity recognition. In: 2019 IEEE world congress on
117. Garcia-Ceja E, Brena R (2016) Activity recognition using com- services (SERVICES), IEEE, pp 295–300
munity data to complement small amounts of labeled instances. 135. Ferrari A, Micucci D, Marco M, Napoletano P (2019) On the
Sensors 16(6):877 homogenization of heterogeneous inertial-based databases for
118. Reiss A, Stricker D (2013) Personalized mobile physical activity human activity recognition. In: Proceedings of IEEE services
recognition. In: Proceeding of the IEEE international symposium workshop on big data for public health policy making
on wearable computers (ISWC) 136. Krupitzer C, Sztyler T, Edinger J, Breitbach M, Stuckenschmidt
H, Becker C (2018) Hips do lie! a position-aware mobile fall detec-
137. Huynh DTG (2008) Human activity recognition with wearable sensors. Ph.D. thesis, Technische Universität Darmstadt

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.