0% found this document useful (0 votes)

10 views

Federated Learning For Activity Recognition A System Level Perspective

Uploaded by

sadiasajidshaik

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views

Federated Learning For Activity Recognition A System Level Perspective

Uploaded by

sadiasajidshaik

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

Received 9 June 2023, accepted 19 June 2023, date of publication 26 June 2023, date of current version 30 June 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3289220

Federated Learning for Activity Recognition:

A System Level Perspective
STEFAN KALABAKOV 1,2,3 , BORCHE JOVANOVSKI 1 , DANIEL DENKOVSKI 1 ,
VALENTIN RAKOVIC 1 , (Senior Member, IEEE), BJARNE PFITZNER 2 , ORHAN KONAK 2,

BERT ARNRICH 2 , AND HRISTIJAN GJORESKI 1

1 Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University in Skopje, 1000 Skopje, North Macedonia
2 Digital Health—Connected Healthcare Group, Hasso Plattner Institute, University of Potsdam, 14469 Potsdam, Germany
3 Jožef Stefan International Postgraduate School, 1000 Ljublana, Slovenia
Corresponding author: Valentin Rakovic ([email protected])
This work was supported by the WideHealth Project—European Union’s Horizon 2020 Research and Innovation Programme under Grant
952279. The work of Stefan Kalabakov was supported by the Slovene Human Resources Development and Scholarship Fund (Ad Futura).

ABSTRACT The past decade has seen substantial growth in the prevalence and capabilities of wearable
devices. For instance, recent human activity recognition (HAR) research has explored using wearable devices
in applications such as remote monitoring of patients, detection of gait abnormalities, and cognitive disease
identification. However, data collection poses a major challenge in developing HAR systems, especially
because of the need to store data at a central location. This raises privacy concerns and makes continuous
data collection difficult and expensive due to the high cost of transferring data from a user’s wearable device
to a central repository. Considering this, we explore the adoption of federated learning (FL) as a potential
solution to address the privacy and cost issues associated with data collection in HAR. More specifically,
we investigate the performance and behavioral differences between FL and deep learning (DL) HAR models,
under various conditions relevant to real-world deployments. Namely, we explore the differences between
the two types of models when (i) using data from different sensor placements, (ii) having access to users
with data from heterogeneous sensor placements, (iii) considering bandwidth efficiency, and (iv) dealing
with data with incorrect labels. Our results show that FL models suffer from a consistent performance deficit
in comparison to their DL counterparts, but achieve these results with much better bandwidth efficiency.
Furthermore, we observe that FL models exhibit very similar responses to those of DL models when exposed
to data from heterogeneous sensor placements. Finally, we show that the FL models are more robust to data
with incorrect labels than their centralized DL counterparts.

INDEX TERMS Human activity recognition, federated learning, deep learning, system-level aspects,
different and heterogeneous sensor placements, FL optimizers, fraction fit, bandwidth efficiency, data errors,
feature selection, model complexity.

I. INTRODUCTION context information acquired through them, which enables

The ever-increasing ubiquity of devices such as smartphones, applications such as (i) remote monitoring of patients [1],
smartwatches, fitness trackers, and smart glasses has paved (ii) prevention and detection of high-risk situations, such
the way for many new applications that could be offered as falls [2], (iii) fitness and lifestyle improvements [3],
to users. This is mainly due to the incredibly valuable (iv) detection of cognitive diseases such as Parkinson’s dis-
ease [4], [5], and (v) automatic activity log generation [6].
The associate editor coordinating the review of this manuscript and Although today’s wearable devices contain many different
approving it for publication was Mostafa M. Fouda . types of sensors and can capture large amounts of diverse

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
64442 VOLUME 11, 2023
S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

data, data collection remains one of the most prevalent prob- •different server-side model aggregation strategies for FL
lems in Human Activity Recognition (HAR). This is due to (i.e., FL optimizers),
the fact that the process is time-consuming, expensive, and • a different percentage of clients participating in the
usually performed only once, at the start of the development learning process,
of any specific HAR pipeline. One of the reasons behind • communication bandwidth efficiency,
performing data collection only once is that the technologies • model size and model complexity as a result of feature
used to develop HAR models require data to be centrally selection,
stored before being used. This introduces significant privacy • data with corrupted labels.
risks and hinders continuous data collection both because The lessons learned throughout the paper can later serve
of security concerns and the potentially substantial costs as comprehensive guidelines for designing and optimizing
incurred by sending large volumes of data from a user’s federated learning systems for HAR.
device to a central data store. The paper is organized as follows. Section II presents the
A possible solution to these problems could be the use related work at the intersection of FL and HAR. Section III
of Federated Learning (FL) [7], instead of the widely used presents the two HAR datasets used for training and evalua-
centralized classical machine learning (ML) and deep learn- tion of our models. Next, Section IV describes the method-
ing (DL) methods. FL is a distributed learning paradigm ology, namely, the feature extraction that was performed, the
that focuses on developing a shared model using clients who model architecture used as well as the FL system architecture
each only have access to their own data. The primary advan- used. Section V presents the evaluation setup, metrics and the
tage of FL is that a client’s data never leaves their device, details of the experiments. Section VI presents and discusses
which substantially decreases any security risks related to the results from the experiments. Section VII compiles the
sharing sensitive information. In addition, the only infor- lessons learned though our results and, finally, Section VIII
mation that leaves the user’s device when using FL is the provides a summary of the paper and discusses potential
computed updates/weights to the local model, which substan- directions for future work.
tially reduces the volume of data that needs to be sent to a
central data store compared to when users send actual sensor II. RELATED WORK
readings. Both the improved security and the reduced volume State-of-the-art ML and DL solutions for HAR usually
of data leaving a user’s device increase the feasibility of require data from different sensors and users to be located
performing continuous data collection, which, in turn, would in one central location before being used to develop models.
significantly impact the ability of models to improve over The disadvantages of training centralized models appear in
time and adapt to changes in the distribution of the data. the form of privacy concerns and the inability to perform
However, efficient deployment and optimal operation of continuous data collection due to both the security risk and
FL in real-world scenarios is far from a trivial task. FL is the high cost of transferring large amounts of data from a
commonly deployed on communication and computationally user’s device to a central data store. FL can mitigate these
constrained devices, and requires a better understanding of disadvantages by constructing a shared model using only
how various system-level factors impact its reliability and the updates/weights computed by each client on their local
applicability. Such an understanding has immense potential machine and data.
to facilitate the development of more effective FL-based mod- Over the past few years, numerous studies have inves-
els, which would advance the practical application of FL in tigated the use of FL in the field of HAR. The major-
real-world settings. ity of these studies have concentrated on exploring new
Rather than proposing a new model or FL optimizer, this FL applications in the context of activity recognition or
paper aims for a more significant and wider impact. Our enhancing FL pipelines and methodologies [8], [9], [10],
main goal is to provide a comprehensive and rigorous sys- [11], [12], [13], [14], [15]. These works usually aim to
tem level analysis of federated learning for human activity enhance the accuracy and resilience of the FL model, but they
recognition. This paper is the first, to our knowledge, to offer seldom focus on a broad exploration of the real-world deploy-
such a system-level perspective that covers various practical ment requirements of FL at the system level. Specifically,
aspects and considerations. In particular, this work initially there is a lack of research on how various factors, such as
focuses on analyzing the performance gaps between the cen- fusion of data from different sensor placements, exposure to
tralized deep learning and the distributed federated learning clients with data from heterogeneous sensor placements, and
approaches using two different HAR datasets with different exposure to data with corrupted labels, affect the accuracy,
sensor placements. Finally, this paper aims to characterize the communication efficiency, and complexity of FL systems for
behavior of FL by comparing it to that of a centralized DL, HAR. Furthermore, a head-to-head comparison of DL and FL
when considering the following important practical aspects models under varying conditions, in order to quantify and
for real-world deployments: define the differences between the two paradigms, is also
• data from different sensor placements, rarely performed.
• heterogeneous sensor placements in clients that partici- Only a limited number of studies have attempted to pro-
pate in the training at the same time, vide a deeper understanding of the system-level specifics

VOLUME 11, 2023 64443

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

of FL in the field of HAR. One example of such analysis

is presented in [16], where the impact of non-iid data on
the performance of a FL-based activity recognition system
was investigated. Specifically, the authors examined how the
performance was affected by clients having access to different
subsets of activities, unbalanced numbers of examples from
the activities they performed, and corrupted data. They also
proposed a technique to address the issue of corrupted data.
Another study that focused on the heterogeneity of clients’
data is [17], where a device selection strategy was proposed
to alleviate problems such as activity class imbalance and
varying data sizes per client. Finally, in [18], the authors
evaluated and compared various FL optimizers, including
personalized ones. The study found that the federated aver-
aging approach provided better global performance than the
other more complex personalized approaches.
Although the mentioned papers have made contribu-
tions to the understanding of some system level aspects of
FL-based HAR, they fail to provide a broad and thorough
enough investigation of the requirements and implications
in real-world deployments of these systems. More specifi-
cally, these works fail to investigate issues associated with
diverse sensor placements in clients and data fusion, datasets
containing corrupt labels, and the trade-off between optimal
FL-specific hyperparameters, model accuracy, and communi-
cation and computational overhead. This work aims to build
upon these limitations by exploring the effects of various
FL-related factors on the overall system performance. Specif-
ically, this work investigates how different sensor placements,
FL optimizers and FL-specific hyperparameters, and data
fusion affect model performance. In addition, the analysis
encompasses the effects of communication bandwidth, model
complexity, and data with corrupted labels on the overall
precision, robustness and overhead efficiency of FL models.

III. DATA AND PREPROCESSING

For the purposes of training and evaluation of our mod-
els, in this work, we used the JSI-FOS [19], [20] and
PAMAP2 [21], [22] (hereinafter referred to simply as
PAMAP) datasets. Both of these datasets contain recordings
of activities of daily living (ADL) made using Inertial Mea-
surement Units (IMUs) which users wore attached to different
parts of their bodies.
More specifically, the JSI-FOS dataset consists of record- FIGURE 1. The distribution of activities in the aggregated training and
test subsets of the (a) JSI-FOS dataset, and the (b) PAMAP dataset.
ings collected from ten subjects while performing the
following activities: walking, standing, sitting, running,
lying, lying_exercising, kneeling, cycling, allfours_moving, performing the following activities: lying, walking, transi-
allfours. Although more IMU placements were available, tion, sitting, standing, ascending_stairs, descending_stairs,
in this analysis, we considered only the data collected by ironing, vacuum_cleaning, nordic_walking, rope_jumping,
the IMUs placed on the wrist of the dominant hand and the cycling, running. Here, as well, we chose to work with only a
thigh of the dominant leg. Furthermore, we only considered subset of the IMU locations and sensor modalities available in
data coming from the accelerometer and gyroscope. During the dataset, limiting ourselves to data coming from the wrist
data collection, values from the sensors were sampled using of the dominant hand and the chest of the user and coming
a frequency of 50 Hz. from either an accelerometer or a gyroscope. Originally, the
Similarly to the JSI-FOS dataset, the PAMAP dataset values from the sensors were sampled using a frequency
consists of recordings collected from nine subjects while of 100 Hz.

64444 VOLUME 11, 2023

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

Before applying the feature extraction procedure described

in Section IV-A to both of these datasets, we first performed
some common preprocessing steps. Firstly, we downsampled
the data in the PAMAP dataset to a sampling frequency of
50 Hz to reduce the complexity of the problem. Then, we han-
dled PAMAP’s missing values by performing a backward
fill operation followed by a forward fill operation (to handle
missing values at the end of recordings).
Regarding the preprocessing steps applied to both datasets,
we first segmented the continuous data streams into smaller
windows. More specifically, we used windows of two sec-
onds without any overlap. The label of each window was
determined as the label most commonly found among the
readings contained in the window. Next, we calculated the
magnitude of the vectors provided by the accelerometer and
the gyroscope at each sampling point. Finally, we filtered the
FIGURE 2. The architecture of the used feed-forward neural network for
raw sensor data using a band-pass filter to remove both the training/inference.
gravitational component and the noise inherently present in
the data. It is important to mention that we kept and used both
the unfiltered and the filtered versions of the data. sensors used for HAR [23]. The features we extracted, cate-
Fig. 1(a) and Fig. 1(b) show the distribution of the labels in gorized in three groups, are the following:
the training and testing subsets (as described in Section V-A),
• generic: mean, standard deviation, median, min, max,
after the segmentation of the original continuous recordings
of the JSI-FOS and PAMAP datasets, respectively. range, interquartile range, kurtosis, skewness, root mean
The data in both datasets are distributed equally among the square
• HAR-specific: integral, mean crossing rate, number of
subjects who participated in the data collection, except for
‘subject 109’ in the PAMAP dataset. This subject performed peaks, average height of peaks, peak-to-Average power
only a small subset of the activities and recorded very little ratio, sum, squared sum
• frequency-domain: energy, entropy, binned distribu-
data from them. It is also worth noting that the activity
distribution in each subject’s data is similar to the one shown tion, three largest PSD magnitudes and their frequencies,
in either Fig. 1(a) or Fig. 1(b), depending on the dataset to skewness, kurtosis
which the subject belongs. Additionally, it is evident that the The same features were extracted from both the accelerom-
activity distribution and frequency of occurrence are consis- eter and gyroscope data and more specifically, for each
tent between the test and train subsets, irrespective of the of their channels (including the magnitude). Furthermore,
dataset under consideration. as was previously mentioned, features were extracted from
both the filtered and the unfiltered versions of the signal
IV. METHODOLOGY values. In total, for each window we extracted 1184 features.
This section explains the methodology used in our study,
i.e., the feature extraction process, the deep learning model B. MODEL ARCHITECTURE
architecture and the federated learning setup, respectively After performing feature extraction, the last step of the
elaborated in the following subsections. pipeline is the learning/inference step, performed by the
aforementioned FFNN. The architecture of the FFNN is
A. FEATURE EXTRACTION shown in Fig. 2. It consists of an input layer with as many
To reduce the complexity of both the training and infer- neurons as there are features that describe a single window
ence phase in our experiments, we decided to use a simple of data, followed by two fully-connected layers with 64 and
Feed-Forward Neural Network (FFNN) that operates on 32 neurons, respectively. Both of these layers use the ReLU
extracted features instead of the raw sensor data. Moreover, activation function [24]. The two fully-connected layers are
utilizing a simplistic FFNN for HAR has more practical appli- followed by a single dropout layer with a rate of 0.2 [25].
cability, because of the limited computational and energy Finally, the last layer in the network is a softmax layer with
capacity of HAR-related Internet-of-Things (IoT) devices. either 10 or 13 neurons, depending on the dataset used for
This means that after preprocessing the data, the next step training and evaluation. This network is used in the learning/
in the ML pipeline is to extract an informative and diverse set inference step, regardless of whether the pipeline is used in a
of features with which the FFNN would be able to achieve DL or FL context.
high classification accuracy. Considering this, we extracted Finally, we want to point out that in all experiments, the
several types of features which have proved effective when models are trained using the Adam optimizer [26], with a
analyzing time-series data and in particular, data from inertial learning rate of 0.0003, the categorical cross-entropy loss

VOLUME 11, 2023 64445

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

function, and a batch size of 256. It is important to note that, unique suitability of FL for developing personalized models.
when using FL, we also experimented with the use of the SGD To implement this setup, we divided the data of each user
(Stochastic Gradient Descent) optimizer, particularly because in both datasets into training and test subsets. The training
of the fact that it is stateless. However, the results suggested subset typically consisted of approximately 80% of the user’s
that there is no performance advantage of using SGD instead data, equivalent to around 100 minutes of labeled data (about
of Adam. 3100 windows/instances) in the JSI-FOS dataset and around
46 minutes of labeled data (about 1390 windows/instances)
C. FEDERATED LEARNING SETUP in the PAMAP dataset (except for ‘subject 109’). The
As previously mentioned, the core idea of FL is training remaining 20% of the user’s data, equivalent to around
a shared model using clients that never have to share data 20 minutes of labeled data (about 700 instances/windows)
between themselves or with a server [29]. The depiction of in the JSI-FOS dataset and around 13 minutes of labeled
a general FL implementation (and the one we use) is given in data (about 390 instances/windows) in the PAMAP dataset,
Fig. 3. A federated learning setup usually consists of a server formed the test set. No validation sets were used in this study
that holds the shared model and coordinates the training as there was no parameter tuning involved, and our focus
process, as well as clients which all hold their own local was solely on reporting performance changes using different
data and models. The training of a shared model is achieved setups on the test data from each user.
by aggregating the updates/weights that the clients make to To mitigate the potential issue of high similarity between
their local models using their local data. This way, clients do windows containing data from the same user in close tempo-
not have to share their data with the server, but instead, only ral proximity, we took precautions during the data splitting
share the updates/weights of their local model. One training process. We ensured that windows belonging to a continuous
iteration of the shared model in FL is referred to as a round. performance of a specific activity (activity segment) were
A more detailed illustration of what are the individual only present in either the training or test subset, but not
steps in a single round of training is given in Fig. 4. The both, in each of the two datasets. This was achieved through
whole process starts on the server-side with the initialization the following steps: (i) identifying activity segments in the
of the weights of the shared model. This only happens in data of each user, (ii) grouping activity segments based on
the first round of training (thus, it is depicted with a dashed the performed activity, (iii) iterating through the groups of
line). Next, the server picks a subset of clients (S) which activity segments and assigning each segment to either the
will participate in the specific training round. This is done to training or test subset.
simulate the fact that not all clients are available to participate During step (iii), we assigned activity segments from each
in each round. The number of clients selected in each round group to the training or test subset in such a way that approx-
of training is denoted as C. After picking the subset of clients imately 80% of the windows in the group belonged to the
that will participate, the server broadcasts the weights of the training subset of the user, while around 20% of the win-
current shared model to all of the clients that are included in dows in the group belonged to the test subset of the user.
the training round. This approach ensured that the evaluation of the model was
After receiving the broadcasted weights, each of the not biased by unintentional repetition of similar data during
included clients (client x ∈ S) creates a local copy of training and testing, and helped maintain the integrity of the
the shared model. This local model is then trained using evaluation process.
their local data for a few epochs. Subsequently, each of the Due to the inherent differences between DL and FL, the
included clients sends only their updates/weights of the local utilization of the training and test subsets varied for each
model to the server. It is important to note that when referring paradigm during the training and evaluation process. For
to updates, we mean the difference between the received DL models trained on one of the two available datasets, the
model and the local model after training using local data. training subsets of all users in that dataset were concatenated
Finally, after receiving the updates/weights from all partic- to update the model in each epoch. The concatenated test
ipating clients, the server is ready to update the shared model. subsets of all users in the same dataset were used to evaluate
This is done using some form of aggregation of the multiple the model after each epoch and at the end of the training
received updates/weights. The updated shared model is used procedure. In contrast, for FL models trained on one of the
as the starting point for training in the next round. two datasets, the training subset of each user was used to
train a local model in each round of FL. Simultaneously, the
V. EXPERIMENTAL SETUP test set of each user was used to evaluate the respective local
In the following subsections, we provide detailed information model’s performance. However, after each training round, the
about the evaluation setup, metrics of interest, and experi- shared global model was also evaluated using the concate-
ments conducted in our study. nated data from the test subsets of all users in the dataset.
This distinct approach in utilizing training and test subsets in
A. EVALUATION SETUP DL and FL models accounts for the differences in how data is
Instead of using a Leave-One-Subject-Out strategy, we opted aggregated and utilized in each paradigm, taking into consid-
for a more personalized evaluation setup due to the eration the distributed and collaborative nature of federated

64446 VOLUME 11, 2023

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

FIGURE 3. The architecture of a typical FL system.

utilized as the performance metric in this study. The macro

F-score avoids bias towards activities with a larger number
of examples, as it calculates the F-score for each activity
separately and reports the average of those results.
The F-score is a harmonic mean of the precision and recall
metrics for a specific label. While it may not be as easily
interpretable as accuracy, higher F-score and macro F-score
values (closer to 1.0) indicate better classification perfor-
mance, while lower values (closer to 0.0) indicate poorer
performance. It is worth noting that the macro F-score and
accuracy metric may report similar values on datasets with a
balanced distribution of activities.

C. EXPERIMENTS DEFINITION
The following section introduces all the experiments we con-
veyed in our study, providing descriptions, configurations and
targets of the experiment analysis.

1) SENSOR PLACEMENT IMPACT

Our first experiment performs a head-to-head comparison
between DL and FL models. Specifically, the comparison
includes observing the performance of FL models and their
gap to DL models when (i) using data from different sen-
sor placements and (ii) when using different numbers of
training epochs/rounds. In particular, the experiment inves-
tigates whether FL models exhibit similar behavior to DL
FIGURE 4. A step-by-step depiction of a single round of training when models when the above-mentioned conditions are varied,
using the federated learning paradigm. and analyzes the performance gap between the two learning
paradigms.
learning compared to the centralized training in deep More specifically, we trained six models (three DL models
learning. and three FL models based on the three different sensor
placements), on each of the two datasets, and evaluated their
B. METRICS performance after each epoch/round of training. The max-
To account for the imbalanced distribution of activities in imum number of epochs/rounds used for training both DL
the JSI-FOS and PAMAP datasets, the macro F-score was and FL models was 50. It is important to note that when

VOLUME 11, 2023 64447

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

comparing the DL and FL models, we treated one epoch (DL) contained data from only one sensor placement, depending
and one round (FL) as equivalent. This approach is intended on what type of data the user was chosen to have access to.
to provide fairness in the comparison, as FL locally operates As was the case previously, after each round, the model
on smaller amount of data, compared to DL, but exploits was evaluated on a test subset that was a combination of the
more local epochs. Also, the updates of the shared (global) individual test subsets of all users (clients). This effectively
FL model, occur in every round, which is equivalent to the meant that the test subset used to evaluate the model, had
model update at each epoch in the DL case. When training roughly the same ratio of examples from different placement
FL models, the C parameter was set to 6 and the number of as the ratio of users who had access to data from different
local epochs used, was 5. sensor placements.
Furthermore, as already mentioned, we also varied the Furthermore, aside from varying the number of users who
sensor placement whose data we used for training and testing. had access to each location, we also varied the number of
Namely, we used three possible sensor placements: (i) the clients used for training in each round and the number of
wrist of the dominant hand, (ii) the thigh of the right leg, rounds used to train each model.
when using the JSI-FOS dataset, or the chest when using the
PAMAP dataset, and (iii) a combination of both available sen- 4) BANDWIDTH EFFICIENCY ANALYSIS
sor placements. When using data from two different sensor One of the most prominent advantages of FL is the exchange
placements, the data were simply concatenated and examples of the model information, instead of the complete dataset.
from both placements had the same weight while training. This results in decreased volume of shared information, that
facilitates higher bandwidth efficiency and easier collabora-
2) FL OPTIMIZER IMPACT tion and model building. However, the improved bandwidth
The goal of this experiment is to explore the behavior of efficiency can result in performance decline. This experiment
different FL optimizers - FedAdagrad [27], FedYoGi [28], aims at analyzing the effects of bandwidth efficiency on the
and FedAvg [29], with respect to their macro F-score per- overall FL model performance. Specifically, the experiment
formance. Specifically, we will evaluate these optimizers strives to analyze how the number of clients and the volume
when using both sensor locations. Due to the different of the exchanged data impacts the precision and robustness
approach in computing the global model, it is expected that of the FL model.
some optimizers should operate more accurately for the case It is intuitive that DL will have an advantage compared
of HAR. to FL due to the larger volume data that is available to the
model at any point in time. However, this larger data volume
hampers the deployment of DL in real-world scenarios, where
3) IMPACT OF CLIENTS WITH HETEROGENEOUS SENSOR bandwidth limitation and efficiency is of utmost importance
PLACEMENTS to IoT-based HAR systems. Conversely, the experiment also
Our third experiment investigates the impact of building a compares the FL and DL performances for the same amount
shared model using clients that have access to data from dif- of exchanged data. The comparison provides further insights
ferent sensor placements. In real-world scenarios, not every regarding the applicability of FL when compared to DL.
person who uses an activity recognition service will wear The experiment setup and system configuration for the
their sensor-equipped device at the same location on their bandwidth efficiency analysis is the same as described in
body. For example, if that device is a smartphone, one person Section V-C1. The performance analysis is conducted with
may wear it in the pocket of their trousers, and another might respect to the attained macro F-score as a function of the
wear it in the pocket of their jacket or even in a backpack. volume of data transmitted to a server. For FL, the data
This means that some of the clients of a FL model might transfer volume is calculated as:
send updates computed on data from one sensor placement,
while others send updates computed on data from another DFL = C · Ntr · Nw · P (1)
sensor placement. Considering this, this experiment aims to
where C is the number of random clients that participate in the
explore the effects that receiving updates corresponding to
round, Ntr is the number of training rounds executed in order
data from heterogeneous sensor placements might have on
to attain the given macro F-score, Nw is the number of weights
the performance of the shared model.
of the client’s model and P is the memory size of each weight
To that end, we varied the number of users who only had
in the model (i.e., 4B per weight, assuming single precision
access to data from one sensor placement but not the other,
floating point). For DL, the data transfer volume is calculated
and observed the performance changes that occurred. In each
as:
training round, all clients, regardless of what data they had
access to, were eligible to be used for training, while the DDL = F · Nf · Ndr · P (2)
selection of which clients had access to a particular type of
data was done randomly. The whole process was repeated where F is the fraction of data used for training the DL
ten times to reduce the effects of randomness. It should be model, Nf is the number of features used for training (1184
pointed out that in each repetition, the test subset of each user in total), Ndr is the total amount of data rows (cumulative for

64448 VOLUME 11, 2023

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

all clients), and P is the feature precision (i.e., 4B, assuming A. SENSOR PLACEMENT IMPACT
single precision floats). Fig. 5 presents the main results from our first experiment.
For both, the DL and the FL strategy, multiple runs were More specifically, Fig. 5(a) and Fig. 5(b) show the achieved
conducted to calculate the 95% confidence intervals of the macro F-score in dependence of the number of training
macro F-score. For comparability reasons, only two DL vari- rounds/epochs, for the DL (shown using dotted lines) and
ations were considered, i.e., DL trained with 10% and 50% the FL models (solid lines) when using either the JSI-FOS or
of the training part of the dataset. PAMAP dataset for training and evaluation, respectively. The
three models per learning paradigm differ only in the sensor
5) MODEL COMPLEXITY AND THE EFFECTS OF FEATURE placement that provided the data they processed.
SELECTION When using JSI-FOS for training and evaluation, Fig. 5(a)
Often HAR-based systems rely on devices that have limited shows a clear ranking between the models that differ only
energy, computational and communication capabilities. Since in the sensor placement they used, regardless of whether
FL relies on local model building, it is crucial to minimize the DL or FL was used. For example, the worst performance
model complexity. However, straight-forward minimization was generated by DL and FL models that used data from
of the model complexity can have detrimental effects on the wrist sensor placement, while substantially better results
the overall performance of FL. As a result, there exists a were produced by those using either the thigh placement or a
requirement for exploring the possibilities that minimize the combination of both sensor placements. In fact, the best DL
model complexity without significantly decreasing the FL and FL models were produced using the combination of both
performances. placements. Furthermore, the results show that all models
Feature selection represents one of the most auspicious tend to plateau once the number of training epochs/rounds
ways of minimizing the model complexity while attaining a reaches 20, with models that use either both sensor place-
certain level of robustness and precision of the FL model. ments or the thigh sensor placement, converging slightly
This experiment analyzes the effects of model complexity faster than the models that use the wrist sensor placement.
minimization by feature selection, and discusses the potential When comparing models based on their type, i.e., DL or
benefits and pitfalls. FL, the results show that DL models always produced
For the purposes of this experiment, the performed feature slightly better results across the whole range of training
selection process is a Recursive Feature Elimination (RFE). epochs/rounds when compared to the corresponding FL
The goal of the feature selection was set as selecting the best model. Additionally, this performance gap between the two
100 features out of the total of 1184. Afterwards, the models types of models seems to remain almost constant across the
were trained and tested on these 100 most important features whole range of epochs/rounds, with the exception of the case
when DL and FL models are trained on data from both sensor
locations and when the number of epochs/rounds is above 20.
6) EFFECTS OF DATA WITH CORRUPTED LABELS
It is also evident that these models behave very similarly and
In real-world deployments, the available data is non-ideal usually generate test macro F-score curves that have nearly
and exhibits different negative properties, such as data will identical shapes, with FL models taking a slightly larger
be noisier and data labels can be incorrect. This experiment number of rounds to achieve their best performance.
analyzes the performance behavior of FL when considering The results presented in Fig. 5(b) indicate that using
non-ideal datasets. Specifically, the experiment analyzes the PAMAP as the dataset for model training and evaluation
FL performances when there exist errors in the labeling of the yields similar outcomes. It is worth noting that models of the
data. The amount of erroneous data (wrong labels) is varied same type maintain a consistent ranking. In particular, deep
for both DL and FL. Since FL relies on a subset instead learning (DL) and federated learning (FL) models that utilize
of all clients during each round of the training phase, it is data from both sensor locations perform better than those
very important to analyze how the volume of erroneous data using data from the chest alone, which in turn perform better
correlates with the number of active clients per round, and than those using data solely from the dominant wrist. How-
how it compares to the DL case. ever, a key difference when training and evaluating on the
The dataset with corrupted labels is generated from the PAMAP dataset is that the gap in the performance between
JSI-FOS dataset. The process of generating the erroneous models trained on wrist sensor data and models trained using
labels, is as follows: (i) randomly select specific amount of the chest location or data from multiple sensor locations is
labels (i.e., 1%, 10% or 20%) that will be incorrect; (ii) for the substantially smaller compared to that which is present when
selected labels, choose a different label based on a uniform using the JSI-FOS dataset. For instance, the FL model trained
random distribution from all available ones in the dataset; on chest data performs worse than the DL model trained on
(iii) use the newly generated dataset for training. wrist data, which is not observed in the case of using the
JSI-FOS dataset. We hypothesize that this discrepancy arises
VI. RESULTS AND DISCUSSION because data from the chest sensor placement is inherently
This section presents and elaborates on the main results we less informative for predicting the target activities compared
obtained from all the experiments introduced in Section V-C. to data coming from a sensor placed at the user’s thigh.
VOLUME 11, 2023 64449
S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

FIGURE 5. Comparison of macro F-scores [%] between DL and FL models at varying numbers of training
epochs/rounds when using the (a) JSI-FOS dataset, and the (b) PAMAP dataset.

Regarding the relative behavior of DL and FL when for training and evaluation, and to streamline our analysis,
using the PAMAP dataset, things remain unchanged. Again, we decided to exclusively present the results obtained on the
DL models always produce slightly better results across the JSI-FOS dataset from this point forward.
whole range of training epochs/rounds when compared to the Fig. 6 takes an even closer look into the relative per-
corresponding FL model. Additionally, the performance gap formance of the FL models compared to the DL models.
between these two models seems to remain constant as the It presents two confusion matrices, generated from the pre-
training of the model progresses. Furthermore, as was the dictions of a DL model and an FL model, both using data
case when using the JSI-FOS dataset, the results show that from both sensor locations for training and evaluation on
all models tend to plateau around the 20th epoch/round, with the JSI-FOS dataset. By comparing the confusion matrices,
models that use either both sensor placements or the chest we can observe that both DL and FL models exhibit very
sensor placement converging slightly faster than the models similar detection performance per activity class. Specifically,
that use the wrist sensor placement. Finally, here we can once both models achieve the best performance for activities such
more observe that the different models produce test macro as standing, lying, cycling and running. The worst perfor-
F-score curves that have nearly identical shapes. mances are attained for activities such as kneeling. It is also
Given that the relative performance of DL and FL models interesting to note that DL and FL models make mistakes
does not appear to change when using different datasets in roughly the same situations, namely, confusing lying for

64450 VOLUME 11, 2023

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

show that FedAvg provides best performances in terms of

the achieved macro F-score. The figure also shows that the
FedYoGi optimizer has comparable performances to FedAvg
for higher number of training rounds. The worst performance
is achieved by the FedAdagrad optimizer, attaining the lowest
macro F-score, and exhibiting large performance oscillations.
In all remaining experiments we use FedAvg, as it provides
best performances.

C. IMPACT OF CLIENTS WITH HETEROGENEOUS SENSOR

PLACEMENTS
Fig. 8 depicts the macro F-scores attained by FL models that
were trained for 50 rounds on the JSI-FOS dataset, using
varying values for the C parameter, and for different amounts
of data from the two sensor placements. The quantity of
data from each sensor placement is regulated by the number
of users who have access to the data from that particular
placement. The x-axis shows the sensor split i.e. how many
clients had access to data from the wrist sensor placement
(w) or the thigh sensor placement (t). The y-axis shows the
achieved macro F-score. The red line presents the results of a
DL model, while the rest of the lines correspond to FL models
that use different values for the C parameter. It is evident
that both DL and FL behave in a very similar manner. They
achieve the best performances for the case when all data is
derived from only the thigh (i.e. w0_t10), and achieve the
worst performance when all data is derived from the only the
wrist sensors (i.e. w10_t0). Moreover, it is noticeable that FL
closes the performance gap to DL for the case of w10_t0.
Fig. 8 also shows that FL models require a slightly larger
number of clients with data from the thigh sensor placement
before achieving more substantial improvements in perfor-
mance. Notably, when the data split corresponds to w10_t0,
w9_t1, w8_t2, w7_t3, or w6_t4 FL models perform at a level
close to the maximum achieved by models that use only
wrist sensor data in the first experiment (Fig. 5(a)). It is only
when at least five users have access to thigh sensor data that
FL models in this experiment start to show more substantial
improvement.
Our hypothesis is that the FL models’ inability to leverage
data from a potentially more informative sensor placement
results from the fact that, during each training round, only a
FIGURE 6. The confusion matrices generated by the (a) DL model, and the subset of clients (four, six, or eight) contribute their data for
(b) FL model, trained using both sensor placements on the JSI-FOS training, thereby limiting the models’ exposure to the entire
dataset.
training set and hindering their ability to properly adapt to
using data from two different sensor placements. However,
sitting, walking for standing, kneeling for standing, etc. How- the results from the first experiment (Fig. 5(a)) demonstrate
ever, the FL model tends to confuse lying for sitting and that when models have access to twice the amount of training
bending for standing a lot more than the DL model. data, they can more easily utilize data from two sensor place-
ments and generate superior results. Thus, a possible solution
B. FL OPTIMIZER IMPACT to mitigate these negative effects is to involve clients who can
Fig. 7 investigates the performances of different FL opti- provide more data than those included in these experiments.
mizers, i.e., the FedAdagrad, the FedYoGi and FedAvg, It is also interesting to note that there does not seem to be a
when combining both sensor placements. The optimizers noticeable advantage to using any of the investigated C values
do not undergo a hyperparameter tuning process, in order over the other, as they perform comparable to one another
to foster a more generic and fair comparison. The results across the whole range of possible training data compositions.

VOLUME 11, 2023 64451

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

FIGURE 7. Macro F1-scores [%] of different FL optimizers.

FIGURE 8. Macro F-scores [%] achieved by DL and FL models using different compositions of the
training data.

Additionally, the analysis in this section focuses on the D. BANDWIDTH EFFICIENCY ANALYSIS
statistical behavior of the FL models. Fig. 9 shows the statis- The results of our analysis regarding bandwidth efficiency
tical performances of FL models (mean and 95% confidence are presented in Fig. 10(a). More specifically, Fig. 10(a)
interval) that had been trained for either 10, 30 or 50 rounds, shows a comparison between DL and FL models that use
that used eight clients for training in each round (C = 8), different amounts of training data from a full-featured version
and that used different ratios of clients with heterogeneous of the JSI-FOS dataset. As a distributed learning strategy, FL,
sensor placements. The results reveal a substantial perfor- transfers the model weights to the centralized server in each
mance gap depending on the number of rounds chosen for round of operation. In contrast, for DL, the data needs to be
training. Specifically, opting for a low number of rounds, such completely transferred to the central server to perform the
as 10, yields relatively poor results in terms of mean macro training of the model. The FL-based macro F-score curves are
F-score values, whereas a higher value like 30 or 50 leads to presented as continuous with respect to data transfer volume
better performance. However, the difference between choos- and the DL results are depicted as discrete points on the macro
ing 30 and 50 rounds for training is small, consistent with F-score vs. data transfer volume plots.
the results from the first experiment, where the models tend In terms of the FL performances, Fig. 10(a) shows that the
to plateau in performance after the 20th round. Furthermore, FL strategy with one active client per round (C = 1) can
choosing a larger number of rounds for training (e.g., above achieve the near optimal macro F-score with about 15MB of
30) and/or using only thigh sensor data, yields results with data transferred, while FL with five and nine active clients
a lower standard deviation (i.e. smaller 95% confidence per round needs ∼30 and ∼45MB, respectively, to achieve
interval). the near-optimal macro F-scores. The FL results also show

64452 VOLUME 11, 2023

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

FIGURE 9. Impact of the number of training rounds on a FL model’s (C = 8) macro F-score [%]
performance for different compositions of training data from the JSI-FOS dataset.

that the confidence intervals for the macro F-score decrease only differences: an increase in the confidence interval for DL
as the number of active clients increases, meaning that a bit of trained with 10% of the dataset and a slight increase in macro
bandwidth efficiency needs to be sacrificed for an increased F-score for the DL trained with 50% of the dataset. DL with
stability of the FL models. In conclusion, there is a clear the reduced feature set (100 features) provides a dominant
trade-off between the bandwidth efficiency, model accuracy bandwidth efficiency, i.e., a macro F-score of ≈0.83 for 5MB
and model stability for the FL strategy. of data volume transferred.
Fig. 10(a) also depicts the DL results for the macro In conclusion, DL with optimized feature set might come
F-scores and confidence intervals vs. the data transfer vol- as a satisfactory solution for bandwidth efficient ML for
ume. It is clear that the DL model using only 10% of the HAR. However, the online principle of operation, privacy
dataset for training is outperformed by all FL scenarios in preservation, reasonable performances and bandwidth effi-
terms of bandwidth efficiency. The DL model trained with ciency, still remain the main benefits of the FL strategy.
50% of the dataset, shows slightly better macro F-scores at the Furthermore, the drop in macro F-score performances of FL
price of a wider confidence interval (lower model stability) with reduced feature set may come as a result of the low
than FL with a larger number of active clients per round (≥5). number of epochs used to train the local FL models (=5), i.e.,
the inability of the local models to converge for the reduced
E. MODEL COMPLEXITY AND THE EFFECTS OF FEATURE feature set. The optimization of these aspects will be part of
SELECTION the authors’ future work.
The results of our analysis regarding model complexity are
presented in Fig 10(b). They are consistent with the ones F. EFFECTS OF DATA WITH CORRUPTED LABELS
presented in Section VI-D. The data volumes are reduced in The results of our analysis into the effects of data with cor-
compliance with equations 1 and 2. rupted labels are presented in Fig. 11.
Comparing the results between Fig. 10(a) and Fig. 10(b), A general observation is that DL is more vulnerable to
there is a significant improvement of the bandwidth efficiency this phenomenon than the FL models. It is intuitive that the
of the FL strategy. In particular, FL with one active client increase of percentage of incorrect labels will decrease the
per round needs about 4MB to achieve a near-optimal macro macro F-score of the DL model, which is also confirmed
F-score. FL with a higher number of clients (five and nine) by the results. Furthermore, as the number of epochs grows,
does not converge in the inspected data volume range. The the DL performances drop even more significantly, since
increase of the bandwidth efficiency comes at the price the model has more opportunities to fine-tune to data with
of a reduced model accuracy. Comparing Fig. 10(a) and incorrect labels.
Fig. 10(b), there is a noticeable drop in performances for the On the opposite, the FL strategy is more robust to label
FL strategy. There is about a 5% drop in macro F-score at a errors, dropping only 1-4% in macro F-score as the percent-
lower number of rounds, as well as a noticeable increase in age of label errors grows to 20%, depending on the number of
the confidence intervals (model instability) for all inspected active clients. It is also clear that FL with more active clients
FL use-cases (C = 1, 5, 9). (C = 6) is more robust to label errors. This is mostly due
On the contrary, the DL strategy preserves the macro to the online operation and the weight averaging principle of
F-score performances with the reduced feature set, compared the FL strategy. This is a very important advantage of the FL
to DL with the full feature set (Fig. 10(a)). These are the paradigm, since in real-world scenarios flawed or imprecise

VOLUME 11, 2023 64453

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

VII. LESSONS LEARNED

This section summarizes our observations from the multiple
performed experiments related to FL for HAR using wearable
sensors. The following lessons can be learned based on our
findings:
1) Federated vs. deep learning general observations.
It has been consistently observed that DL models
outperform FL models in terms of classification per-
formance. This is intuitive from an information theory
perspective, since distributed learning cannot achieve
higher accuracy than a centralized DL model when they
use the same underlying neural network. DL models
are trained on the entire dataset, while FL models only
train local models on portions of the dataset and then
combine them into a global model, which can result
in valuable information being lost due to partitioning
and averaging. However, this trade-off is necessary for
increased privacy and data protection. After examining
various use cases, it has been found that the macro
F-score performance gaps between FL and DL typi-
cally range between 5-10% for the region of a lower
number of rounds/epochs (<15), and below 5% for
larger numbers of rounds/epochs (>20). These results
and all previously discussed conclusions are consistent
for two different datasets, namely the JSI-FOS and the
PAMAP datasets, that were analysed using the same
pipeline. Given that the performance of DL models
serves as an upper limit to the performance of the FL
models and the gap are not so significant when the mod-
els are trained in a sufficient number of epochs/rounds,
FIGURE 10. A comparison between FL and DL in terms of bandwidth
choosing the appropriate neural network architecture is
efficiency, i.e., macro F-score vs. data transfer volume to achieve the a critical step that can greatly impact the performance
respective scores. of FL. The goal should be to select a neural network
architecture that maximizes the upper limit, thereby
pushing the FL classification performances as high as
possible.
2) FL optimizer impact. We conducted an investigation
to compare the performance of various FL optimizers,
namely FedAdagrad, FedYoGi, and FedAvg, for the
purpose of HAR. The findings showed that despite its
simplicity, FedAvg outperformed the other optimizers
in terms of both convergence and macro F-score. Fur-
ther investigation is required in this area since some
of the optimizers were used with default initialization
parameters.
3) Sensor placement impact. The impact of sensor place-
ment on model performance highlights the importance
FIGURE 11. Impact of label errors on the macro F-score performance of of careful selection of sensor placements for accurate
FL and DL models. recognition of different activities. The results from the
analysis in this paper show that the thigh (JSI-FOS
data might seriously degrade performances. In each round of dataset) and the chest placement (PAMAP dataset)
operation, the global FL model is calculated based on averag- prove to be more informative regarding HAR in com-
ing local models from a random subset of clients. This means parison to the wrist sensor placement. It is important
that erroneous local FL model weights are averaged out with to note that this observation is true for both DL and
more accurate ones and the effect of error propagation is FL models. Furthermore we observed that, the combi-
diminished. nation of either thigh or chest sensor data with wrist

64454 VOLUME 11, 2023

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

sensor data yielded the best performances in terms of The previously discussed conclusions and lessons learned
macro F-scores, again, for both the DL and the FL can serve as valuable and comprehensive guidelines for
models. This is due to the fact the wrist sensor can designing, developing and implementing efficient federated
contribute to better performances for some specific learning solutions for human activity recognition. Most of the
type of activities. The DL and FL results, as well as the conclusions are also generalizable to other federated learning
gaps between the DL and the FL models are consistent applications beyond human activity recognition.
for the two investigated datasets.
4) Clients with heterogeneous sensor placements. The VIII. CONCLUSION
experiment conducted on clients with heterogeneous This paper presents a performance analysis for FL-based
sensor placements revealed that compared to DL mod- HAR, from a system level perspective and under various real-
els, FL models needed a slightly higher number of world conditions, such as communication cost/bandwidth
clients that have access to data from the more infor- efficiency, model complexity, erroneous data, etc. The analy-
mative sensor placement before they are able to start sis also provides a head-on comparison between FL and DL
leveraging this data source and improve their results. when using two different datasets. The results clearly show
In addition, our results also showed that, when using that various system parameters and configurations like the
clients with data from heterogeneous sensor place- type of sensor placement, FL optimizer, model complexity,
ments, choosing one C value (fraction of clients) over data volume as well as erroneous data can play a crucial role
the others does not make much sense as there was no in the robustness and applicability of FL-based HAR.
substantial difference between their results. Future work will focus on several different optimality and
5) Bandwidth efficiency. Regarding bandwidth effi- optimization aspects that will build upon the findings from
ciency, FL demonstrated better performance than DL this work. Specifically, the future work will investigate the
by achieving a nearly optimal macro F-score with the analytical tractability and generalization of the optimization
transfer of only tens of megabytes of data. The inves- problem related to system-level parameters, including band-
tigation also looked into the C parameter and revealed width efficiency, energy efficiency, model complexity, and
that increasing the number of active clients per round the model performance. Additionally, it will broaden the
led to improved model stability but required more data analysis of the erroneous data effect, by including non-iid
to be transferred for the FL models to converge. In other data points, noising of the data samples, as well as label
words, the study highlighted a clear trade-off between smoothing.
bandwidth efficiency, model accuracy, and model sta-
bility for the FL paradigm. REFERENCES
[1] M. Luštrek et al., ‘‘A personal health system for self-management of
6) Model complexity and feature selection. The exper- congestive heart failure (HeartMan): Development, technical evaluation,
iment used the Recursive Feature Elimination (RFE) and proof-of-concept randomized controlled trial,’’ JMIR Med. Informat.,
to select the best 100 features, and the models were vol. 9, no. 3, Mar. 2021, Art. no. e24501.
[2] I. Kiprijanovska, H. Gjoreski, and M. Gams, ‘‘Detection of gait
trained and tested on these 100 features. The results abnormalities for fall risk assessment using wrist-worn inertial
showed a substantial improvement in the bandwidth sensors and deep learning,’’ Sensors, vol. 20, no. 18, p. 5373,
efficiency of the FL strategy when compared to the Sep. 2020.
[3] I. Husain and D. Spence, ‘‘Can healthy people benefit from health apps?’’
full feature set, with a 4MB data volume needed for BMJ, Clin. Res. Ed., vol. 350, p. h2520, Apr. 2015.
near-optimal macro F-score. However, this increase in [4] Y. Chen, X. Qin, J. Wang, C. Yu, and W. Gao, ‘‘FedHealth: A federated
bandwidth efficiency came at the cost of reduced model transfer learning framework for wearable healthcare,’’ IEEE Intell. Syst.,
accuracy, with a noticeable drop in macro F-score and vol. 35, no. 4, pp. 83–93, Jul./Aug. 2020.
[5] Y. Chen, X. Yang, B. Chen, C. Miao, and H. Yu, ‘‘PdAssist: Objective and
an increase in confidence intervals for all inspected quantified symptom assessment of Parkinson’s disease via smartphone,’’
FL use cases. The main conclusion is that DL with in Proc. IEEE Int. Conf. Bioinf. Biomed. (BIBM), Kansas City, MO, USA,
optimized feature sets may be a satisfactory solution for Nov. 2017, pp. 939–945.
[6] M. Lee, A. M. Khan, J. Kim, Y. Cho, and T. Kim, ‘‘A single tri-axial
bandwidth-efficient ML for HAR, but FL still remains accelerometer-based real-time personal life log system capable of activ-
the main choice for online operation, privacy preserva- ity classification and exercise information generation,’’ in Proc. Annu.
tion, and reasonable performances. Int. Conf. IEEE Eng. Med. Biol., Buenos Aires, Argentina, Aug. 2010,
pp. 1390–1393.
7) Erroneous data effect. The experiment compared the [7] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas,
performance of FL to that of traditional DL when ‘‘Communication-efficient learning of deep networks from decentralized
working with a dataset that has a varying percentage of data,’’ in Proc. AISTATS, Ft. Lauderdale, FL, USA, vol. 54, Apr. 2017,
pp. 1273–1282.
erroneous labels. The results of the experiment show
[8] K. Kirsten, B. Pfitzner, L. Löper, and B. Arnrich, ‘‘Sensor-based obsessive-
that the DL model is more vulnerable to label errors compulsive disorder detection with personalised federated learning,’’ in
than the FL model. This finding highlights the advan- Proc. 20th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Pasadena, CA,
tage of FL in mitigating the effect of erroneous data, USA, Dec. 2021, pp. 333–339.
[9] S. Ek, F. Portet, P. Lalanda, and G. Vega, ‘‘A federated learning aggregation
limiting the error propagation due to the averaging algorithm for pervasive computing: Evaluation and comparison,’’ in Proc.
process for the global model update. IEEE PerCom, Kassel, Germany, Mar. 2021, pp. 1–10.

VOLUME 11, 2023 64455

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

[10] R. Presotto, G. Civitarese, and C. Bettini, ‘‘Semi-supervised and person- STEFAN KALABAKOV received the B.Sc. degree
alized federated activity recognition based on active learning and label in computer technologies and engineering from the
propagation,’’ Pers. Ubiquitous Comput., vol. 26, no. 5, pp. 1281–1298, Faculty of Electrical Engineering and Information
Oct. 2022. Technologies (FEEIT), Skopje, North Macedonia,
[11] L. Tu, X. Ouyang, J. Zhou, Y. He, and G. Xing, ‘‘FedDL: and the M.Sc. degree from the Joef Stefan Inter-
Federated learning via dynamic layer sharing for human activity national Postgraduate School, Ljubljana, Slovenia.
recognition,’’ in Proc. ACM SenSys, Coimbra, Portugal, Nov. 2021, He is currently pursuing the Ph.D. degree.
pp. 15–28.
He is also a Research Assistant with the Dig-
[12] C. Li, D. Niu, B. Jiang, X. Zuo, and J. Yang, ‘‘Meta-HAR:
ital Health-Connected Healthcare Group, Hasso
Federated representation learning for human activity recog-
nition,’’ in Proc. Web Conf., Ljubljana, Slovenia, Apr. 2021,
Plattner Institute (HPI), Germany. His research
pp. 912–922. interests include federated learning, electronic health records, and human
[13] G. K. Gudur and S. K. Perepu, ‘‘Resource-constrained federated activity recognition.
learning with heterogeneous labels and models for human
activity recognition,’’ in Proc. DL-HAR, Kyoto, Japan, Jan. 2021,
pp. 57–69.
BORCHE JOVANOVSKI received the B.Sc.
[14] X. Ouyang, Z. Xie, J. Zhou, J. Huang, and G. Xing, ‘‘ClusterFL:
A similarity-aware federated learning system for human degree in electrical engineering and information
activity recognition,’’ in Proc. ACM MobiSys, Jun. 2021, technologies and in the field of telecommuni-
pp. 54–66. cations and the M.Sc. degree in electrical and
[15] X. Zhou, W. Liang, J. Ma, Z. Yan, and K. I. Wang, ‘‘2D federated information technology and in the field of wire-
learning for personalized human activity recognition in cyber-physical- less systems, services and applications from the
social systems,’’ IEEE Trans. Netw. Sci. Eng., vol. 9, no. 6, pp. 3934–3944, Faculty of Electrical Engineering and Information
Nov. 2022. Technologies (FEEIT), Ss. Cyril and Methodius
[16] K. Sozinov, V. Vlassov, and S. Girdzijauskas, ‘‘Human University in Skopje (UKIM), Skopje, Macedonia,
activity recognition using federated learning,’’ in Proc. IEEE in 2019 and 2021, respectively, where he is cur-
ISPA/IUCC/BDCloud/SocialCom/SustainCom, Melbourne, VIC, rently pursuing the Ph.D. degree in electrical engineering and information
Australia, Dec. 2018, pp. 1103–1111. technologies. He is also a Research Associate and part of the Laboratory
[17] H. Cho, A. Mathur, and F. Kawsar, ‘‘Device or user: Rethinking federated for Wireless and Mobile Networks, UKIM in Skopje. His research interests
learning in personal-scale multi-device environments,’’ in Proc. ACM Sen- include wireless networks, wireless communications, cloud computing, and
Sys, 2021, pp. 446–452. recently application of machine learning and federated learning in different
[18] S. Ek, F. Portet, P. Lalanda, and G. Vega, ‘‘Evaluation of federated learning
domains.
aggregation algorithms: application to human activity recognition,’’ in
Proc. ACM UbiComp-ISWC, 2020, pp. 638–643.
[19] S. Kozina, H. Gjoreski, M. Gams, and M. Luštrek, ‘‘Three-layer
activity recognition combining domain knowledge and meta- DANIEL DENKOVSKI is currently an Associate
classification,’’ J. Med. Biol. Eng., vol. 33, no. 4, pp. 406–414, Professor with the Faculty of Electrical Engi-
Jan. 2013.
neering and Information Technologies, Ss. Cyril
[20] H. Gjoreski, B. Kaluža, M. Gams, R. Milić, and M. Luštrek,
and Methodius University in Skopje. His major
‘‘Context-based ensemble method for human energy expen-
diture estimation,’’ Appl. Soft Comput., vol. 37, pp. 960–970,
research interests include concentrated on signal
Dec. 2015. processing, information theory, wireless commu-
[21] A. Reiss and D. Stricker, ‘‘Introducing a new benchmarked dataset nications, cloud computing, and recently machine
for activity monitoring,’’ in Proc. 16th Int. Symp. Wearable Comput., learning and federated learning and their applica-
Newcastle, U.K., Jun. 2012, pp. 108–109. tion in different domains. He has notable research
[22] A. Reiss and D. Stricker, ‘‘Creating and benchmarking a new dataset for experience having worked on 12 internationally
physical activity monitoring,’’ in Proc. PETRA, Crete, Greece, Jun. 2012, funded research projects (FP7, H2020, and NATO SpS) and several domestic
pp. 1–8. projects in his research areas. Besides theoretical research, he has serious
[23] S. Kalabakov, S. Stankoski, I. Kiprijanovska, A. Andova, N. Reščič, prototyping experience, which has resulted in several awarded ICT system
V. Janko, M. Gjoreski, M. Gams, and M. Luštrek, ‘‘What actually works prototypes. He has more than 60 publications, out of which 16 in top journals
for activity recognition in scenarios with significant domain shift: Lessons with impact factor, and seven chapters in Springer books. He was a recipient
learned from the 2019 and 2020 sussex-huawei challenges,’’ Sensors, of the Award ‘‘Best Young Scientist’’ for 2014 from the President of the
vol. 22, no. 10, p. 3613, May 2022. Republic of Macedonia.
[24] X. Glorot, A. Bordes, and Y. Bengio, ‘‘Deep sparse rectifier neural net-
works,’’ in Proc. AISTATS, Ft. Lauderdale, FL, USA, vol. 15, 2011,
pp. 315–323.
[25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and VALENTIN RAKOVIC (Senior Member, IEEE)
R. Salakhutdinov, ‘‘Dropout: A simple way to prevent neural networks received the Dipl.-Ing., M.Sc., and Ph.D. degrees
from overfitting,’’ J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, in telecommunications from the Faculty of Elec-
2014. trical Engineering and Information Technologies,
[26] D. P. Kingma and J. Ba, ‘‘Adam: A method for stochastic optimization,’’ Ss Cyril and Methodius University in Skopje
2014, arXiv:1412.6980.
(UKIM), in 2008, 2010, and 2016, respectively.
[27] J. Wang, Z. Xu, Z. Garrett, Z. Charles, L. Liu, and G. Joshi, ‘‘Local
He currently holds the position of an Associate
adaptivity in federated learning: Convergence and consistency,’’ 2021,
arXiv:2106.02305. Professor and the Head of the Laboratory for
[28] I. Tenison, S. A. Sreeramadas, V. Mugunthan, E. Oyallon, E. Belilovsky, Wireless and Mobile Networks, Faculty of Elec-
and I. Rish, ‘‘Gradient masked averaging for federated learning,’’ 2022, trical Engineering and Information Technologies
arXiv:2201.11986. (FEEIT), UKIM in Skopje. He has coauthored more than 70 publications
[29] H. Mcmahan, E. Moore, D. Ramage, and B. Yarcas, ‘‘Federated learn- in international conferences and journals. His research interests include
ing of deep networks using model averaging,’’ 2016, arXiv:1602. wireless networks, signal processing, optimization theory, machine learning,
05629. and the prototyping of wireless networking solutions.

64456 VOLUME 11, 2023

S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective

BJARNE PFITZNER received the M.Eng. degree BERT ARNRICH is currently the Head of
in computing from the Imperial College London. the Chair Digital Health—Connected Healthcare,
He is currently pursuing the Ph.D. degree with joint Digital-Engineering Faculty, Hasso Plattner
the Digital Health—Connected Healthcare Group, Institute (HPI), and the University of Potsdam.
Hasso Plattner Institute (HPI), Germany. He is also He has been a PI in several European and national
a Research Assistant with the Digital Health— research projects. He studied ‘‘Informatics in the
Connected Healthcare Group, HPI. For the last Natural Sciences.’’ In his Ph.D. thesis, he imple-
four years, he worked in the area of federated mented an early big data approach that collects and
learning with a focus on privacy-preserving algo- consolidates patient data for scientific data anal-
rithms using differential privacy and healthcare ysis. At ETH Zurich, he established and headed
applications, such as medical imaging and risk stratification for the intensive the Wearable Computing Laboratory, Research Group Pervasive Healthcare.
care unit. He received a Marie Curie Cofound Fellowship from the European Union and
was appointed to tenure track professorship with the Computer Engineering
Department, Bosporus University. He was the Science Manager of Emerging
Technologies with Accenture Technology Solutions.

HRISTIJAN GJORESKI received the M.Sc. and

Ph.D. degrees in information and communication
technologies from the Jozef Stefan Postgraduate
School, Slovenia, in 2011 and 2015, respectively.
From 2010 to 2016, he was a Researcher with the
Department of Intelligent Systems, Jozef Stefan
Institute, Slovenia. In 2017, he was a Postdoctoral
ORHAN KONAK received the Diploma degree Research Fellow with the University of Sussex,
in mathematics and the M.Sc. degree in U.K. Currently, he is an Associate Professor with
mathematics—computational engineering from the Ss. Cyril and Methodius University in Skopje,
the University of Applied Sciences Berlin North Macedonia. His research interests include artificial intelligence,
(BHT), Germany. He is currently pursuing machine learning, and wearable computing. He was highly successful at
the Ph.D. degree with the Digital Health— several machine learning competitions and received first place award at
Connected Healthcare, Hasso Plattner Institute the EvAAL Activity Recognition Challenge, in 2013, ChallengeUP Fall
(HPI), Germany. His research interests include Detection Competition, in 2019, and Emteq Activity Recognition Challenge
sensor-based human activity recognition, espe- at Ubicomp 2019 in London, U.K. He has received the award ‘‘Best Young
cially in developing new algorithms to determine Scientist’’ for 2016 from the President of Republic of Macedonia and was
the optimal sensor placement and augmenting given datasets through syn- selected in the top-cited 2% scientists worldwide, in 2021.
thetically generated data.

VOLUME 11, 2023 64457

Pocket Guide To Nutrition Assessment of The Patient With Kidney Disease
100% (2)
Pocket Guide To Nutrition Assessment of The Patient With Kidney Disease
44 pages
Experiment 6: Colorimetric Estimation of Cu in Brass
No ratings yet
Experiment 6: Colorimetric Estimation of Cu in Brass
16 pages
Technical Report
No ratings yet
Technical Report
35 pages
FL Profiling Paper
No ratings yet
FL Profiling Paper
10 pages
Heterogeneity-aware-device-selection-for-eff_2024_International-Journal-of-I
No ratings yet
Heterogeneity-aware-device-selection-for-eff_2024_International-Journal-of-I
9 pages
sensors-24-01299
No ratings yet
sensors-24-01299
18 pages
A Detailed Survey On Federated Learning Attacks and Defenses
No ratings yet
A Detailed Survey On Federated Learning Attacks and Defenses
18 pages
Asynchronous federated learning on heterogeneous devices A survey
No ratings yet
Asynchronous federated learning on heterogeneous devices A survey
15 pages
2411.06263v1
No ratings yet
2411.06263v1
6 pages
ip2024_12_002
No ratings yet
ip2024_12_002
12 pages
a148-hamideche final-1
No ratings yet
a148-hamideche final-1
7 pages
Newres 5
No ratings yet
Newres 5
23 pages
Prioritized Multi-Criteria Federated Learning: Vito Walter Anelli, Yashar Deldjoo, Tommaso Di Noia and Antonio Ferrara
No ratings yet
Prioritized Multi-Criteria Federated Learning: Vito Walter Anelli, Yashar Deldjoo, Tommaso Di Noia and Antonio Ferrara
18 pages
Implementation and Analysis of A Federated Learning Architecture Using CIFAR 10 Dataset 1
No ratings yet
Implementation and Analysis of A Federated Learning Architecture Using CIFAR 10 Dataset 1
6 pages
EdgeFed Optimized Federated Learning Based On Edge Computing
No ratings yet
EdgeFed Optimized Federated Learning Based On Edge Computing
8 pages
Blockchain For Deep Learning: Review and Open Challenges
No ratings yet
Blockchain For Deep Learning: Review and Open Challenges
25 pages
Bharati Et Al 2022 Federated Learning Applications Challenges and Future Directions
No ratings yet
Bharati Et Al 2022 Federated Learning Applications Challenges and Future Directions
17 pages
1-s2.0-S0167739X21004726-main
No ratings yet
1-s2.0-S0167739X21004726-main
9 pages
Role of Federated Learning in Edge Computing A Sur
No ratings yet
Role of Federated Learning in Edge Computing A Sur
23 pages
Decentralized Federated Learning for Healthcare Networks a Case Study on Tumor Segmentation (2)
No ratings yet
Decentralized Federated Learning for Healthcare Networks a Case Study on Tumor Segmentation (2)
16 pages
Federated Learning Article
No ratings yet
Federated Learning Article
68 pages
Deng 2022 Auction
No ratings yet
Deng 2022 Auction
14 pages
Multi-View Ensemble Federated Learning For Efficient Prediction of Consumer Electronics Applications in Fog Networks
No ratings yet
Multi-View Ensemble Federated Learning For Efficient Prediction of Consumer Electronics Applications in Fog Networks
8 pages
ieee-fed-strategies
No ratings yet
ieee-fed-strategies
10 pages
Real-Time Prediction Using Fog-Based Federated Learning and Genetic Hyperparameter Optimisation
No ratings yet
Real-Time Prediction Using Fog-Based Federated Learning and Genetic Hyperparameter Optimisation
10 pages
A Survey On Federated Learning For Resource-Constrained IoT Devices
No ratings yet
A Survey On Federated Learning For Resource-Constrained IoT Devices
24 pages
Secure_and_Efficient_Federated_Learning_Through_Layering_and_Sharding_Blockchain
No ratings yet
Secure_and_Efficient_Federated_Learning_Through_Layering_and_Sharding_Blockchain
15 pages
Deep Learning On Mobile Devices-A Review: March 2019
No ratings yet
Deep Learning On Mobile Devices-A Review: March 2019
16 pages
Heterogeneous Federated Learning For Non-IID Smartwatch Data Classification
No ratings yet
Heterogeneous Federated Learning For Non-IID Smartwatch Data Classification
8 pages
Privacy
No ratings yet
Privacy
14 pages
1 s2.0 S0020025523008204 Main
No ratings yet
1 s2.0 S0020025523008204 Main
22 pages
Wahab Et Al. - 2021 - Federated Machine Learning Survey, Multi-Level CL
No ratings yet
Wahab Et Al. - 2021 - Federated Machine Learning Survey, Multi-Level CL
50 pages
Chapter Five
No ratings yet
Chapter Five
7 pages
22-tbd
No ratings yet
22-tbd
14 pages
Defending Against Label-Flipping Attacks in Federated Learning Systems Using Uniform Manifold Approximation and Projection
No ratings yet
Defending Against Label-Flipping Attacks in Federated Learning Systems Using Uniform Manifold Approximation and Projection
8 pages
Ubiccjournalvolume2no3 2 28
No ratings yet
Ubiccjournalvolume2no3 2 28
10 pages
Abstract in All
No ratings yet
Abstract in All
13 pages
Mobile Keyword Prediction Using Federated Learning
No ratings yet
Mobile Keyword Prediction Using Federated Learning
10 pages
FedAFR Enhancing Federated Learning With Adaptive Fea - 2024 - Computer Communi
No ratings yet
FedAFR Enhancing Federated Learning With Adaptive Fea - 2024 - Computer Communi
8 pages
InfoSec Project Report
No ratings yet
InfoSec Project Report
7 pages
Federated Learning a Survery
No ratings yet
Federated Learning a Survery
31 pages
1 s2.0 S1566253523002063 Main
No ratings yet
1 s2.0 S1566253523002063 Main
12 pages
Base Paper
No ratings yet
Base Paper
10 pages
Ubiccjournalvolume2no3 2 28
No ratings yet
Ubiccjournalvolume2no3 2 28
10 pages
s41467-022-29763-x
No ratings yet
s41467-022-29763-x
8 pages
s10586-024-04629-7
No ratings yet
s10586-024-04629-7
18 pages
AI_Skin Skin Disease Recognition Based on Self-learning and Wide Data
No ratings yet
AI_Skin Skin Disease Recognition Based on Self-learning and Wide Data
9 pages
MLLM-FL: Multimodal Large Language Model Assisted Federated Learning On Heterogeneous and Long-Tailed Data
No ratings yet
MLLM-FL: Multimodal Large Language Model Assisted Federated Learning On Heterogeneous and Long-Tailed Data
10 pages
Federated Learning- Hope and Scope
No ratings yet
Federated Learning- Hope and Scope
3 pages
Final Submission Research Paper Batch 16
No ratings yet
Final Submission Research Paper Batch 16
8 pages
A Hybrid Cloud Enterprise Strategic Management Sys
No ratings yet
A Hybrid Cloud Enterprise Strategic Management Sys
18 pages
download
No ratings yet
download
12 pages
Federated Learning With Differential Privacy Algorithms and Performance Analysis
No ratings yet
Federated Learning With Differential Privacy Algorithms and Performance Analysis
16 pages
Federated Learning A Survey on EnablingACCESS3013541
No ratings yet
Federated Learning A Survey on EnablingACCESS3013541
28 pages
7
No ratings yet
7
20 pages
N Device Federated Learning With Lower
No ratings yet
N Device Federated Learning With Lower
5 pages
2206.00807v2
No ratings yet
2206.00807v2
13 pages
Federated Inception-Multi-Head Attention Models For Cyber Attacks Detection
No ratings yet
Federated Inception-Multi-Head Attention Models For Cyber Attacks Detection
17 pages
sensors-23-06986
No ratings yet
sensors-23-06986
21 pages
Securing The Future 3rd
No ratings yet
Securing The Future 3rd
14 pages
MS-FL A Federated Learning Framework Based On Mult
No ratings yet
MS-FL A Federated Learning Framework Based On Mult
12 pages
Essential Federated Learning: AI at the Edge
From Everand
Essential Federated Learning: AI at the Edge
Robert Johnson
No ratings yet
Hci h2 Chem p4 QP With Ans Ms
No ratings yet
Hci h2 Chem p4 QP With Ans Ms
13 pages
List of Students - Zscaler
No ratings yet
List of Students - Zscaler
30 pages
Sunny Engineers
No ratings yet
Sunny Engineers
4 pages
The Effect of Growth Mindset On Mathematical Performance in Algeb
No ratings yet
The Effect of Growth Mindset On Mathematical Performance in Algeb
31 pages
Dehemi UAE 2024
No ratings yet
Dehemi UAE 2024
2 pages
Progress in Civil, Architectural and Hydraulic Engineering: Editor: Yun-Hae Kim
100% (2)
Progress in Civil, Architectural and Hydraulic Engineering: Editor: Yun-Hae Kim
1,447 pages
ManishChaulagain Resume
No ratings yet
ManishChaulagain Resume
3 pages
FM Lecture Notes
No ratings yet
FM Lecture Notes
118 pages
Greenheck Presentation
No ratings yet
Greenheck Presentation
69 pages
01 Uic Algebra 1 2017
No ratings yet
01 Uic Algebra 1 2017
3 pages
The Power of Breakthrough Praise Ebook
No ratings yet
The Power of Breakthrough Praise Ebook
59 pages
Bima Global
No ratings yet
Bima Global
2 pages
School of Mount St. Mary: First Periodical Test in Technology and Livelihood Education Iii Third Year S.Y. 2011-2012
No ratings yet
School of Mount St. Mary: First Periodical Test in Technology and Livelihood Education Iii Third Year S.Y. 2011-2012
6 pages
LC Handbook Complete 2
No ratings yet
LC Handbook Complete 2
163 pages
A Review of Newly Diagnosed Glioblastoma
No ratings yet
A Review of Newly Diagnosed Glioblastoma
10 pages
Updated Information Brochure For MBBS 2023 Admission in AFMC Pune
No ratings yet
Updated Information Brochure For MBBS 2023 Admission in AFMC Pune
29 pages
Priority Queues - 1
No ratings yet
Priority Queues - 1
11 pages
CNF ST Q4
No ratings yet
CNF ST Q4
2 pages
As-2 Inventory Valuation-Cp5u1
No ratings yet
As-2 Inventory Valuation-Cp5u1
19 pages
Cycle Counting - The Secret To Inventory Accuracy
No ratings yet
Cycle Counting - The Secret To Inventory Accuracy
40 pages
1611 en
No ratings yet
1611 en
16 pages
Basic Janam Patri
No ratings yet
Basic Janam Patri
7 pages
Manufacturing Performance and Evolution of TPM
No ratings yet
Manufacturing Performance and Evolution of TPM
14 pages
PE Pipe Production Process & QA-QC Requirements & Standards
0% (1)
PE Pipe Production Process & QA-QC Requirements & Standards
35 pages
Lasting Legacies-Sudha Murthy
No ratings yet
Lasting Legacies-Sudha Murthy
6 pages
8606 Assignment Answer
No ratings yet
8606 Assignment Answer
19 pages
Ix Icse Hindi - Prelim-1 - Set A QP
No ratings yet
Ix Icse Hindi - Prelim-1 - Set A QP
8 pages
Cvxpy - Convex - Optimization (tcmm2014, Slides)
No ratings yet
Cvxpy - Convex - Optimization (tcmm2014, Slides)
43 pages

Federated Learning For Activity Recognition A System Level Perspective

Uploaded by

Federated Learning For Activity Recognition A System Level Perspective

Uploaded by

Received 9 June 2023, accepted 19 June 2023, date of publication 26 June 2023, date of current version 30 June 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3289220

Federated Learning for Activity Recognition:

BERT ARNRICH 2 , AND HRISTIJAN GJORESKI 1

I. INTRODUCTION context information acquired through them, which enables

VOLUME 11, 2023 64443

of FL in the field of HAR. One example of such analysis

III. DATA AND PREPROCESSING

64444 VOLUME 11, 2023

Before applying the feature extraction procedure described

VOLUME 11, 2023 64445

64446 VOLUME 11, 2023

FIGURE 3. The architecture of a typical FL system.

utilized as the performance metric in this study. The macro

1) SENSOR PLACEMENT IMPACT

VOLUME 11, 2023 64447

64448 VOLUME 11, 2023

64450 VOLUME 11, 2023

show that FedAvg provides best performances in terms of

C. IMPACT OF CLIENTS WITH HETEROGENEOUS SENSOR

VOLUME 11, 2023 64451

FIGURE 7. Macro F1-scores [%] of different FL optimizers.

64452 VOLUME 11, 2023

VOLUME 11, 2023 64453

VII. LESSONS LEARNED

64454 VOLUME 11, 2023

VOLUME 11, 2023 64455

64456 VOLUME 11, 2023

HRISTIJAN GJORESKI received the M.Sc. and

VOLUME 11, 2023 64457

You might also like