Federated Learning for Activity Recognition: A System Level Perspective
ABSTRACT The past decade has seen substantial growth in the prevalence and capabilities of wearable
devices. For instance, recent human activity recognition (HAR) research has explored using wearable devices
in applications such as remote monitoring of patients, detection of gait abnormalities, and cognitive disease
identification. However, data collection poses a major challenge in developing HAR systems, especially
because of the need to store data at a central location. This raises privacy concerns and makes continuous
data collection difficult and expensive due to the high cost of transferring data from a user’s wearable device
to a central repository. Considering this, we explore the adoption of federated learning (FL) as a potential
solution to address the privacy and cost issues associated with data collection in HAR. More specifically,
we investigate the performance and behavioral differences between FL and deep learning (DL) HAR models,
under various conditions relevant to real-world deployments. Namely, we explore the differences between
the two types of models when (i) using data from different sensor placements, (ii) having access to users
with data from heterogeneous sensor placements, (iii) considering bandwidth efficiency, and (iv) dealing
with data with incorrect labels. Our results show that FL models suffer from a consistent performance deficit
in comparison to their DL counterparts, but achieve these results with much better bandwidth efficiency.
Furthermore, we observe that FL models exhibit very similar responses to those of DL models when exposed
to data from heterogeneous sensor placements. Finally, we show that the FL models are more robust to data
with incorrect labels than their centralized DL counterparts.
INDEX TERMS Human activity recognition, federated learning, deep learning, system-level aspects,
different and heterogeneous sensor placements, FL optimizers, fraction fit, bandwidth efficiency, data errors,
feature selection, model complexity.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
64442 VOLUME 11, 2023
S. Kalabakov et al.: Federated Learning for Activity Recognition: A System Level Perspective
data, data collection remains one of the most prevalent problems in Human Activity Recognition (HAR). This is due to the fact that the process is time-consuming, expensive, and usually performed only once, at the start of the development of any specific HAR pipeline. One of the reasons behind performing data collection only once is that the technologies used to develop HAR models require data to be centrally stored before being used. This introduces significant privacy risks and hinders continuous data collection, both because of security concerns and the potentially substantial costs incurred by sending large volumes of data from a user's device to a central data store.

A possible solution to these problems could be the use of Federated Learning (FL) [7], instead of the widely used centralized classical machine learning (ML) and deep learning (DL) methods. FL is a distributed learning paradigm that focuses on developing a shared model using clients who each only have access to their own data. The primary advantage of FL is that a client's data never leaves their device, which substantially decreases any security risks related to sharing sensitive information. In addition, the only information that leaves the user's device when using FL is the computed updates/weights to the local model, which substantially reduces the volume of data that needs to be sent to a central data store compared to when users send actual sensor readings. Both the improved security and the reduced volume of data leaving a user's device increase the feasibility of performing continuous data collection, which, in turn, would significantly impact the ability of models to improve over time and adapt to changes in the distribution of the data.

However, efficient deployment and optimal operation of FL in real-world scenarios is far from a trivial task. FL is commonly deployed on communication- and computationally constrained devices, and requires a better understanding of how various system-level factors impact its reliability and applicability. Such an understanding has immense potential to facilitate the development of more effective FL-based models, which would advance the practical application of FL in real-world settings.

Rather than proposing a new model or FL optimizer, this paper aims for a more significant and wider impact. Our main goal is to provide a comprehensive and rigorous system-level analysis of federated learning for human activity recognition. This paper is the first, to our knowledge, to offer such a system-level perspective that covers various practical aspects and considerations. In particular, this work initially focuses on analyzing the performance gaps between the centralized deep learning and the distributed federated learning approaches using two different HAR datasets with different sensor placements. Finally, this paper aims to characterize the behavior of FL by comparing it to that of a centralized DL, when considering the following important practical aspects for real-world deployments:
• data from different sensor placements,
• heterogeneous sensor placements in clients that participate in the training at the same time,
• different server-side model aggregation strategies for FL (i.e., FL optimizers),
• a different percentage of clients participating in the learning process,
• communication bandwidth efficiency,
• model size and model complexity as a result of feature selection,
• data with corrupted labels.
The lessons learned throughout the paper can later serve as comprehensive guidelines for designing and optimizing federated learning systems for HAR.

The paper is organized as follows. Section II presents the related work at the intersection of FL and HAR. Section III presents the two HAR datasets used for training and evaluation of our models. Next, Section IV describes the methodology, namely, the feature extraction that was performed, the model architecture, as well as the FL system architecture used. Section V presents the evaluation setup, metrics, and the details of the experiments. Section VI presents and discusses the results from the experiments. Section VII compiles the lessons learned through our results and, finally, Section VIII provides a summary of the paper and discusses potential directions for future work.

II. RELATED WORK
State-of-the-art ML and DL solutions for HAR usually require data from different sensors and users to be located in one central location before being used to develop models. The disadvantages of training centralized models appear in the form of privacy concerns and the inability to perform continuous data collection, due to both the security risk and the high cost of transferring large amounts of data from a user's device to a central data store. FL can mitigate these disadvantages by constructing a shared model using only the updates/weights computed by each client on their local machine and data.

Over the past few years, numerous studies have investigated the use of FL in the field of HAR. The majority of these studies have concentrated on exploring new FL applications in the context of activity recognition or enhancing FL pipelines and methodologies [8], [9], [10], [11], [12], [13], [14], [15]. These works usually aim to enhance the accuracy and resilience of the FL model, but they seldom focus on a broad exploration of the real-world deployment requirements of FL at the system level. Specifically, there is a lack of research on how various factors, such as fusion of data from different sensor placements, exposure to clients with data from heterogeneous sensor placements, and exposure to data with corrupted labels, affect the accuracy, communication efficiency, and complexity of FL systems for HAR. Furthermore, a head-to-head comparison of DL and FL models under varying conditions, in order to quantify and define the differences between the two paradigms, is also rarely performed.

Only a limited number of studies have attempted to provide a deeper understanding of the system-level specifics
function, and a batch size of 256. It is important to note that, when using FL, we also experimented with the use of the SGD (Stochastic Gradient Descent) optimizer, particularly because of the fact that it is stateless. However, the results suggested that there is no performance advantage of using SGD instead of Adam.

C. FEDERATED LEARNING SETUP
As previously mentioned, the core idea of FL is training a shared model using clients that never have to share data between themselves or with a server [29]. A depiction of the general FL implementation (and the one we use) is given in Fig. 3. A federated learning setup usually consists of a server that holds the shared model and coordinates the training process, as well as clients which all hold their own local data and models. The training of a shared model is achieved by aggregating the updates/weights that the clients make to their local models using their local data. This way, clients do not have to share their data with the server, but instead, only share the updates/weights of their local model. One training iteration of the shared model in FL is referred to as a round.

A more detailed illustration of the individual steps in a single round of training is given in Fig. 4. The whole process starts on the server side with the initialization of the weights of the shared model. This only happens in the first round of training (thus, it is depicted with a dashed line). Next, the server picks a subset of clients (S) which will participate in the specific training round. This is done to simulate the fact that not all clients are available to participate in each round. The number of clients selected in each round of training is denoted as C. After picking the subset of clients that will participate, the server broadcasts the weights of the current shared model to all of the clients that are included in the training round.

After receiving the broadcasted weights, each of the included clients (client x ∈ S) creates a local copy of the shared model. This local model is then trained using their local data for a few epochs. Subsequently, each of the included clients sends only their updates/weights of the local model to the server. It is important to note that when referring to updates, we mean the difference between the received model and the local model after training using local data.

Finally, after receiving the updates/weights from all participating clients, the server is ready to update the shared model. This is done using some form of aggregation of the multiple received updates/weights. The updated shared model is used as the starting point for training in the next round.

V. EXPERIMENTAL SETUP
In the following subsections, we provide detailed information about the evaluation setup, metrics of interest, and experiments conducted in our study.

A. EVALUATION SETUP
Instead of using a Leave-One-Subject-Out strategy, we opted for a more personalized evaluation setup due to the unique suitability of FL for developing personalized models. To implement this setup, we divided the data of each user in both datasets into training and test subsets. The training subset typically consisted of approximately 80% of the user's data, equivalent to around 100 minutes of labeled data (about 3100 windows/instances) in the JSI-FOS dataset and around 46 minutes of labeled data (about 1390 windows/instances) in the PAMAP dataset (except for 'subject 109'). The remaining 20% of the user's data, equivalent to around 20 minutes of labeled data (about 700 windows/instances) in the JSI-FOS dataset and around 13 minutes of labeled data (about 390 windows/instances) in the PAMAP dataset, formed the test set. No validation sets were used in this study, as there was no parameter tuning involved, and our focus was solely on reporting performance changes using different setups on the test data from each user.

To mitigate the potential issue of high similarity between windows containing data from the same user in close temporal proximity, we took precautions during the data splitting process. We ensured that windows belonging to a continuous performance of a specific activity (activity segment) were only present in either the training or test subset, but not both, in each of the two datasets. This was achieved through the following steps: (i) identifying activity segments in the data of each user, (ii) grouping activity segments based on the performed activity, and (iii) iterating through the groups of activity segments and assigning each segment to either the training or test subset.

During step (iii), we assigned activity segments from each group to the training or test subset in such a way that approximately 80% of the windows in the group belonged to the training subset of the user, while around 20% of the windows in the group belonged to the test subset of the user. This approach ensured that the evaluation of the model was not biased by unintentional repetition of similar data during training and testing, and helped maintain the integrity of the evaluation process.

Due to the inherent differences between DL and FL, the utilization of the training and test subsets varied for each paradigm during the training and evaluation process. For DL models trained on one of the two available datasets, the training subsets of all users in that dataset were concatenated to update the model in each epoch. The concatenated test subsets of all users in the same dataset were used to evaluate the model after each epoch and at the end of the training procedure. In contrast, for FL models trained on one of the two datasets, the training subset of each user was used to train a local model in each round of FL. Simultaneously, the test set of each user was used to evaluate the respective local model's performance. However, after each training round, the shared global model was also evaluated using the concatenated data from the test subsets of all users in the dataset. This distinct approach to utilizing training and test subsets in DL and FL models accounts for the differences in how data is aggregated and utilized in each paradigm, taking into consideration the distributed and collaborative nature of federated learning.
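The activity-segment-based 80/20 split described above can be sketched as follows. This is a minimal illustration, assuming each user's data is summarized as (activity, window-count) segment tuples; the greedy assignment rule is our own simplification, not necessarily the paper's exact procedure.

```python
from collections import defaultdict

def split_segments(segments, train_ratio=0.8):
    """Assign whole activity segments to train/test so that roughly 80% of
    each activity's windows land in the training subset.

    `segments` is a list of (activity_label, n_windows) tuples for one user;
    a segment is never split between the two subsets.
    """
    groups = defaultdict(list)
    for label, n_windows in segments:       # (ii) group segments by activity
        groups[label].append(n_windows)
    train, test = [], []
    for label, sizes in groups.items():     # (iii) assign per activity group
        total, assigned = sum(sizes), 0
        for n in sizes:
            # Greedy rule (our assumption): keep filling the training subset
            # while it stays within the target ratio; first segment always
            # goes to training so the subset is never empty.
            if assigned + n <= train_ratio * total or assigned == 0:
                train.append((label, n))
                assigned += n
            else:
                test.append((label, n))
    return train, test
```

Because whole segments are assigned, the achieved ratio is only approximate, which matches the "approximately 80%/20%" wording above.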
C. EXPERIMENTS DEFINITION
The following section introduces all the experiments conducted in our study, providing descriptions, configurations, and the targets of the experiment analysis.
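All of the experiments below train FL models using the round structure from Section IV-C. As a minimal sketch, assuming a simple client interface and plain unweighted averaging (the defaults C = 6 and 5 local epochs mirror the configuration used in the experiments):

```python
import random
import numpy as np

def fedavg_round(global_weights, clients, C=6, local_epochs=5):
    """One FL round: sample C clients, broadcast the shared weights, train
    locally, and aggregate by averaging (plain, unweighted FedAvg).

    Each client is assumed to expose local_train(weights, epochs) -> weights;
    this interface is illustrative, not the paper's actual implementation.
    """
    S = random.sample(clients, C)                 # server picks the subset S
    local = [c.local_train([w.copy() for w in global_weights],
                           epochs=local_epochs)   # local training on-device
             for c in S]
    # Aggregate the returned weights into the new shared model.
    return [np.mean([lw[i] for lw in local], axis=0)
            for i in range(len(global_weights))]
```

The returned weights become the starting point for the next round, as described in Section IV-C.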
comparing the DL and FL models, we treated one epoch (DL) and one round (FL) as equivalent. This approach is intended to provide fairness in the comparison, as FL locally operates on a smaller amount of data compared to DL, but exploits more local epochs. Also, the updates of the shared (global) FL model occur in every round, which is equivalent to the model update at each epoch in the DL case. When training FL models, the C parameter was set to 6 and the number of local epochs used was 5.

Furthermore, as already mentioned, we also varied the sensor placement whose data we used for training and testing. Namely, we used three possible sensor placements: (i) the wrist of the dominant hand, (ii) the thigh of the right leg when using the JSI-FOS dataset, or the chest when using the PAMAP dataset, and (iii) a combination of both available sensor placements. When using data from two different sensor placements, the data were simply concatenated and examples from both placements had the same weight while training.

2) FL OPTIMIZER IMPACT
The goal of this experiment is to explore the behavior of different FL optimizers, i.e., FedAdagrad [27], FedYogi [28], and FedAvg [29], with respect to their macro F-score performance. Specifically, we evaluate these optimizers when using both sensor locations. Due to the different approaches in computing the global model, it is expected that some optimizers should operate more accurately for the case of HAR.

3) IMPACT OF CLIENTS WITH HETEROGENEOUS SENSOR PLACEMENTS
Our third experiment investigates the impact of building a shared model using clients that have access to data from different sensor placements. In real-world scenarios, not every person who uses an activity recognition service will wear their sensor-equipped device at the same location on their body. For example, if that device is a smartphone, one person may wear it in the pocket of their trousers, and another might wear it in the pocket of their jacket or even in a backpack. This means that some of the clients of an FL model might send updates computed on data from one sensor placement, while others send updates computed on data from another sensor placement. Considering this, this experiment aims to explore the effects that receiving updates corresponding to data from heterogeneous sensor placements might have on the performance of the shared model.

To that end, we varied the number of users who only had access to data from one sensor placement but not the other, and observed the performance changes that occurred. In each training round, all clients, regardless of what data they had access to, were eligible to be used for training, while the selection of which clients had access to a particular type of data was done randomly. The whole process was repeated ten times to reduce the effects of randomness. It should be pointed out that in each repetition, the test subset of each user contained data from only one sensor placement, depending on what type of data the user was chosen to have access to. As was the case previously, after each round, the model was evaluated on a test subset that was a combination of the individual test subsets of all users (clients). This effectively meant that the test subset used to evaluate the model had roughly the same ratio of examples from different placements as the ratio of users who had access to data from different sensor placements.

Furthermore, aside from varying the number of users who had access to each location, we also varied the number of clients used for training in each round and the number of rounds used to train each model.

4) BANDWIDTH EFFICIENCY ANALYSIS
One of the most prominent advantages of FL is the exchange of model information instead of the complete dataset. This results in a decreased volume of shared information, which facilitates higher bandwidth efficiency and easier collaboration and model building. However, the improved bandwidth efficiency can result in a performance decline. This experiment aims at analyzing the effects of bandwidth efficiency on the overall FL model performance. Specifically, the experiment strives to analyze how the number of clients and the volume of the exchanged data impact the precision and robustness of the FL model.

It is intuitive that DL will have an advantage compared to FL due to the larger volume of data that is available to the model at any point in time. However, this larger data volume hampers the deployment of DL in real-world scenarios, where bandwidth limitations and efficiency are of utmost importance to IoT-based HAR systems. Conversely, the experiment also compares the FL and DL performances for the same amount of exchanged data. The comparison provides further insights regarding the applicability of FL when compared to DL.

The experiment setup and system configuration for the bandwidth efficiency analysis are the same as described in Section V-C1. The performance analysis is conducted with respect to the attained macro F-score as a function of the volume of data transmitted to a server. For FL, the data transfer volume is calculated as:

D_FL = C · N_tr · N_w · P    (1)

where C is the number of random clients that participate in a round, N_tr is the number of training rounds executed in order to attain the given macro F-score, N_w is the number of weights of the client's model, and P is the memory size of each weight in the model (i.e., 4 B per weight, assuming single-precision floating point). For DL, the data transfer volume is calculated as:

D_DL = F · N_f · N_dr · P    (2)

where F is the fraction of data used for training the DL model, N_f is the number of features used for training (1184 in total), N_dr is the total amount of data rows (cumulative for
all clients), and P is the feature precision (i.e., 4 B, assuming single-precision floats).

For both the DL and the FL strategy, multiple runs were conducted to calculate the 95% confidence intervals of the macro F-score. For comparability reasons, only two DL variations were considered, i.e., DL trained with 10% and 50% of the training part of the dataset.

5) MODEL COMPLEXITY AND THE EFFECTS OF FEATURE SELECTION
HAR-based systems often rely on devices that have limited energy, computational, and communication capabilities. Since FL relies on local model building, it is crucial to minimize the model complexity. However, straightforward minimization of the model complexity can have detrimental effects on the overall performance of FL. As a result, there exists a requirement for exploring possibilities that minimize the model complexity without significantly decreasing the FL performance.

Feature selection represents one of the most promising ways of minimizing the model complexity while attaining a certain level of robustness and precision of the FL model. This experiment analyzes the effects of model complexity minimization by feature selection, and discusses the potential benefits and pitfalls.

For the purposes of this experiment, the performed feature selection process is Recursive Feature Elimination (RFE). The goal of the feature selection was set as selecting the best 100 features out of the total of 1184. Afterwards, the models were trained and tested on these 100 most important features.

6) EFFECTS OF DATA WITH CORRUPTED LABELS
In real-world deployments, the available data is non-ideal and exhibits different negative properties: the data can be noisier and the labels can be incorrect. This experiment analyzes the performance behavior of FL when considering non-ideal datasets. Specifically, the experiment analyzes the FL performance when there are errors in the labeling of the data. The amount of erroneous data (wrong labels) is varied for both DL and FL. Since FL relies on a subset instead of all clients during each round of the training phase, it is very important to analyze how the volume of erroneous data correlates with the number of active clients per round, and how it compares to the DL case.

The dataset with corrupted labels is generated from the JSI-FOS dataset. The process of generating the erroneous labels is as follows: (i) randomly select a specific amount of labels (i.e., 1%, 10%, or 20%) that will be incorrect; (ii) for the selected labels, choose a different label based on a uniform random distribution from all available ones in the dataset; (iii) use the newly generated dataset for training.

VI. RESULTS AND DISCUSSION
This section presents and elaborates on the main results we obtained from all the experiments introduced in Section V-C.

A. SENSOR PLACEMENT IMPACT
Fig. 5 presents the main results from our first experiment. More specifically, Fig. 5(a) and Fig. 5(b) show the achieved macro F-score as a function of the number of training rounds/epochs, for the DL (shown using dotted lines) and the FL models (solid lines) when using either the JSI-FOS or PAMAP dataset for training and evaluation, respectively. The three models per learning paradigm differ only in the sensor placement that provided the data they processed.

When using JSI-FOS for training and evaluation, Fig. 5(a) shows a clear ranking between the models that differ only in the sensor placement they used, regardless of whether DL or FL was used. For example, the worst performance was generated by DL and FL models that used data from the wrist sensor placement, while substantially better results were produced by those using either the thigh placement or a combination of both sensor placements. In fact, the best DL and FL models were produced using the combination of both placements. Furthermore, the results show that all models tend to plateau once the number of training epochs/rounds reaches 20, with models that use either both sensor placements or the thigh sensor placement converging slightly faster than the models that use the wrist sensor placement.

When comparing models based on their type, i.e., DL or FL, the results show that DL models always produced slightly better results across the whole range of training epochs/rounds when compared to the corresponding FL model. Additionally, this performance gap between the two types of models seems to remain almost constant across the whole range of epochs/rounds, with the exception of the case when DL and FL models are trained on data from both sensor locations and the number of epochs/rounds is above 20. It is also evident that these models behave very similarly and usually generate test macro F-score curves that have nearly identical shapes, with FL models taking a slightly larger number of rounds to achieve their best performance.

The results presented in Fig. 5(b) indicate that using PAMAP as the dataset for model training and evaluation yields similar outcomes. It is worth noting that models of the same type maintain a consistent ranking. In particular, deep learning (DL) and federated learning (FL) models that utilize data from both sensor locations perform better than those using data from the chest alone, which in turn perform better than those using data solely from the dominant wrist. However, a key difference when training and evaluating on the PAMAP dataset is that the gap in performance between models trained on wrist sensor data and models trained using the chest location or data from multiple sensor locations is substantially smaller compared to that which is present when using the JSI-FOS dataset. For instance, the FL model trained on chest data performs worse than the DL model trained on wrist data, which is not observed in the case of using the JSI-FOS dataset. We hypothesize that this discrepancy arises because data from the chest sensor placement is inherently less informative for predicting the target activities compared to data coming from a sensor placed at the user's thigh.
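Throughout the following subsections, transfer volumes are computed with equations (1) and (2) from Section V-C4, which can be evaluated directly. In the sketch below, the 10 000-weight model size and 31 000 data rows are illustrative assumptions; only the 1184-feature count and the 4 B single-precision weight size come from the setup above.

```python
def fl_transfer_bytes(C, N_tr, N_w, P=4):
    """Eq. (1): D_FL = C * N_tr * N_w * P -- weights uploaded by C clients
    per round, over N_tr rounds, with P bytes per single-precision weight."""
    return C * N_tr * N_w * P

def dl_transfer_bytes(F, N_f, N_dr, P=4):
    """Eq. (2): D_DL = F * N_f * N_dr * P -- raw feature rows shipped to the
    central server when a fraction F of the dataset is used for training."""
    return F * N_f * N_dr * P

# Illustrative comparison: 5 clients over 50 rounds with a 10,000-weight
# model, vs. DL shipping 50% of 31,000 rows of 1184 features.
d_fl = fl_transfer_bytes(C=5, N_tr=50, N_w=10_000)   # 10 MB
d_dl = dl_transfer_bytes(F=0.5, N_f=1184, N_dr=31_000)
```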
FIGURE 5. Comparison of macro F-scores [%] between DL and FL models at varying numbers of training
epochs/rounds when using the (a) JSI-FOS dataset, and the (b) PAMAP dataset.
Regarding the relative behavior of DL and FL when using the PAMAP dataset, things remain unchanged. Again, DL models always produce slightly better results across the whole range of training epochs/rounds when compared to the corresponding FL model. Additionally, the performance gap between these two models seems to remain constant as the training of the model progresses. Furthermore, as was the case when using the JSI-FOS dataset, the results show that all models tend to plateau around the 20th epoch/round, with models that use either both sensor placements or the chest sensor placement converging slightly faster than the models that use the wrist sensor placement. Finally, here we can once more observe that the different models produce test macro F-score curves that have nearly identical shapes.

Given that the relative performance of DL and FL models does not appear to change when using different datasets for training and evaluation, and to streamline our analysis, we decided to exclusively present the results obtained on the JSI-FOS dataset from this point forward.

Fig. 6 takes an even closer look into the relative performance of the FL models compared to the DL models. It presents two confusion matrices, generated from the predictions of a DL model and an FL model, both using data from both sensor locations for training and evaluation on the JSI-FOS dataset. By comparing the confusion matrices, we can observe that both DL and FL models exhibit very similar detection performance per activity class. Specifically, both models achieve the best performance for activities such as standing, lying, cycling and running. The worst performances are attained for activities such as kneeling. It is also interesting to note that DL and FL models make mistakes in roughly the same situations, namely, confusing lying for
FIGURE 8. Macro F-scores [%] achieved by DL and FL models using different compositions of the
training data.
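The client compositions behind Fig. 8 follow the procedure of Section V-C3: a chosen number of users is randomly assigned the wrist placement and the rest the thigh placement, repeated ten times. A sketch of one such repetition, with the function shape being our own illustration:

```python
import random

def assign_placements(n_clients, n_wrist, seed=0):
    """Randomly decide which clients hold wrist data and which hold thigh
    data; n_wrist clients receive the wrist placement, the rest the thigh
    placement. One repetition of the ten used in the experiment."""
    rng = random.Random(seed)
    placements = ["wrist"] * n_wrist + ["thigh"] * (n_clients - n_wrist)
    rng.shuffle(placements)          # random assignment of data access
    return placements
```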
Additionally, the analysis in this section focuses on the statistical behavior of the FL models. Fig. 9 shows the statistical performance of FL models (mean and 95% confidence interval) that had been trained for either 10, 30 or 50 rounds, that used eight clients for training in each round (C = 8), and that used different ratios of clients with heterogeneous sensor placements. The results reveal a substantial performance gap depending on the number of rounds chosen for training. Specifically, opting for a low number of rounds, such as 10, yields relatively poor results in terms of mean macro F-score values, whereas a higher value like 30 or 50 leads to better performance. However, the difference between choosing 30 and 50 rounds for training is small, consistent with the results from the first experiment, where the models tend to plateau in performance after the 20th round. Furthermore, choosing a larger number of rounds for training (e.g., above 30) and/or using only thigh sensor data yields results with a lower standard deviation (i.e., a smaller 95% confidence interval).

D. BANDWIDTH EFFICIENCY ANALYSIS
The results of our analysis regarding bandwidth efficiency are presented in Fig. 10(a). More specifically, Fig. 10(a) shows a comparison between DL and FL models that use different amounts of training data from a full-featured version of the JSI-FOS dataset. As a distributed learning strategy, FL transfers the model weights to the centralized server in each round of operation. In contrast, for DL, the data needs to be completely transferred to the central server to perform the training of the model. The FL-based macro F-score curves are presented as continuous with respect to data transfer volume, and the DL results are depicted as discrete points on the macro F-score vs. data transfer volume plots.

In terms of the FL performance, Fig. 10(a) shows that the FL strategy with one active client per round (C = 1) can achieve a near-optimal macro F-score with about 15 MB of data transferred, while FL with five and nine active clients per round needs ∼30 and ∼45 MB, respectively, to achieve near-optimal macro F-scores. The FL results also show
FIGURE 9. Impact of the number of training rounds on an FL model’s (C = 8) macro F-score [%]
performance for different compositions of training data from the JSI-FOS dataset.
that the confidence intervals for the macro F-score decrease as the number of active clients increases, meaning that a bit of bandwidth efficiency needs to be sacrificed for an increased stability of the FL models. In conclusion, there is a clear trade-off between the bandwidth efficiency, model accuracy, and model stability for the FL strategy.

Fig. 10(a) also depicts the DL results for the macro F-scores and confidence intervals vs. the data transfer volume. It is clear that the DL model using only 10% of the dataset for training is outperformed by all FL scenarios in terms of bandwidth efficiency. The DL model trained with 50% of the dataset shows slightly better macro F-scores at the price of a wider confidence interval (lower model stability) than FL with a larger number of active clients per round (≥5).

E. MODEL COMPLEXITY AND THE EFFECTS OF FEATURE SELECTION
The results of our analysis regarding model complexity are presented in Fig. 10(b). They are consistent with the ones presented in Section VI-D. The data volumes are reduced in compliance with equations (1) and (2).

Comparing the results between Fig. 10(a) and Fig. 10(b), there is a significant improvement in the bandwidth efficiency of the FL strategy. In particular, FL with one active client per round needs about 4 MB to achieve a near-optimal macro F-score. FL with a higher number of clients (five and nine)

only differences: an increase in the confidence interval for DL trained with 10% of the dataset and a slight increase in macro F-score for the DL trained with 50% of the dataset. DL with the reduced feature set (100 features) provides a dominant bandwidth efficiency, i.e., a macro F-score of ≈0.83 for 5 MB of data volume transferred.

In conclusion, DL with an optimized feature set might come as a satisfactory solution for bandwidth-efficient ML for HAR. However, the online principle of operation, privacy preservation, reasonable performance, and bandwidth efficiency still remain the main benefits of the FL strategy. Furthermore, the drop in macro F-score performance of FL with the reduced feature set may come as a result of the low number of epochs used to train the local FL models (= 5), i.e., the inability of the local models to converge for the reduced feature set. The optimization of these aspects will be part of the authors' future work.

F. EFFECTS OF DATA WITH CORRUPTED LABELS
The results of our analysis into the effects of data with corrupted labels are presented in Fig. 11. A general observation is that DL is more vulnerable to this phenomenon than the FL models. It is intuitive that an increase in the percentage of incorrect labels will decrease the macro F-score of the DL model, which is also confirmed by the results. Furthermore, as the number of epochs grows,
does not converge in the inspected data volume range. The the DL performances drop even more significantly, since
increase of the bandwidth efficiency comes at the price the model has more opportunities to fine-tune to data with
of a reduced model accuracy. Comparing Fig. 10(a) and incorrect labels.
Fig. 10(b), there is a noticeable drop in performances for the On the opposite, the FL strategy is more robust to label
FL strategy. There is about a 5% drop in macro F-score at a errors, dropping only 1-4% in macro F-score as the percent-
lower number of rounds, as well as a noticeable increase in age of label errors grows to 20%, depending on the number of
the confidence intervals (model instability) for all inspected active clients. It is also clear that FL with more active clients
FL use-cases (C = 1, 5, 9). (C = 6) is more robust to label errors. This is mostly due
On the contrary, the DL strategy preserves the macro to the online operation and the weight averaging principle of
F-score performances with the reduced feature set, compared the FL strategy. This is a very important advantage of the FL
to DL with the full feature set (Fig. 10(a)). These are the paradigm, since in real-world scenarios flawed or imprecise
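The bandwidth figures discussed above follow from a simple counting argument: in every round, each active client downloads the current global model and uploads its local update, so traffic scales with rounds, active clients, and model size. The sketch below is illustrative only; the 0.25 MB model size, the function name, and the round counts are assumptions for the example and are not taken from the paper's measurements.

```python
def fl_transfer_volume_mb(num_rounds: int, active_clients: int,
                          model_size_mb: float) -> float:
    """Total server-client traffic for FL training: each active client
    downloads the global model and uploads its update once per round."""
    return num_rounds * active_clients * 2 * model_size_mb

# With a hypothetical 0.25 MB model, 30 rounds at C = 1 move 15 MB,
# while C = 5 stays near 30 MB only if it converges in far fewer rounds.
print(fl_transfer_volume_mb(30, 1, 0.25))  # 15.0
print(fl_transfer_volume_mb(12, 5, 0.25))  # 30.0
```

By this accounting, a larger number of active clients per round costs proportionally more per round, so FL with larger C remains bandwidth-competitive only when the extra participation lets the global model converge in fewer rounds, which is consistent with the sub-linear growth of the data volumes observed in Fig. 10(a).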
sensor data yielded the best performances in terms of macro F-scores, again, for both the DL and the FL models. This is due to the fact that the wrist sensor can contribute to better performance for some specific types of activities. The DL and FL results, as well as the gaps between the DL and FL models, are consistent across the two investigated datasets.
4) Clients with heterogeneous sensor placements. The experiment conducted on clients with heterogeneous sensor placements revealed that, compared to DL models, FL models needed a slightly higher number of clients with access to data from the more informative sensor placement before they were able to start leveraging this data source and improve their results. In addition, our results also showed that, when using clients with data from heterogeneous sensor placements, choosing one C value (fraction of clients) over the others does not make much sense, as there was no substantial difference between their results.
5) Bandwidth efficiency. Regarding bandwidth efficiency, FL demonstrated better performance than DL by achieving a nearly optimal macro F-score with the transfer of only tens of megabytes of data. The investigation also looked into the C parameter and revealed that increasing the number of active clients per round led to improved model stability but required more data to be transferred for the FL models to converge. In other words, the study highlighted a clear trade-off between bandwidth efficiency, model accuracy, and model stability for the FL paradigm.
6) Model complexity and feature selection. The experiment used Recursive Feature Elimination (RFE) to select the best 100 features, and the models were trained and tested on these 100 features. The results showed a substantial improvement in the bandwidth efficiency of the FL strategy when compared to the full feature set, with a 4 MB data volume needed for a near-optimal macro F-score. However, this increase in bandwidth efficiency came at the cost of reduced model accuracy, with a noticeable drop in macro F-score and an increase in confidence intervals for all inspected FL use cases. The main conclusion is that DL with optimized feature sets may be a satisfactory solution for bandwidth-efficient ML for HAR, but FL still remains the main choice for online operation, privacy preservation, and reasonable performance.
7) Erroneous data effect. The experiment compared the performance of FL to that of traditional DL when working with a dataset that has a varying percentage of erroneous labels. The results of the experiment show that the DL model is more vulnerable to label errors than the FL model. This finding highlights the advantage of FL in mitigating the effect of erroneous data, limiting error propagation through the averaging process for the global model update.

The previously discussed conclusions and lessons learned can serve as valuable and comprehensive guidelines for designing, developing, and implementing efficient federated learning solutions for human activity recognition. Most of the conclusions are also generalizable to other federated learning applications beyond human activity recognition.

VIII. CONCLUSION
This paper presents a performance analysis of FL-based HAR from a system-level perspective and under various real-world conditions, such as communication cost/bandwidth efficiency, model complexity, and erroneous data. The analysis also provides a head-on comparison between FL and DL using two different datasets. The results clearly show that various system parameters and configurations, such as the type of sensor placement, FL optimizer, model complexity, data volume, and erroneous data, can play a crucial role in the robustness and applicability of FL-based HAR.

Future work will focus on several different optimality and optimization aspects that will build upon the findings from this work. Specifically, it will investigate the analytical tractability and generalization of the optimization problem related to system-level parameters, including bandwidth efficiency, energy efficiency, model complexity, and model performance. Additionally, it will broaden the analysis of the erroneous-data effect by including non-IID data points, noising of the data samples, as well as label smoothing.
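The averaging mechanism credited in lesson 7 for FL's robustness to label errors can be made concrete with a short sketch of sample-size-weighted model averaging (a generic FedAvg-style global update; the function name, parameter vectors, and client sizes below are invented for illustration and do not reproduce the models from the experiments):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client model parameters,
    as used for the FedAvg-style global-model update."""
    coeffs = np.asarray(client_sizes, dtype=float) / sum(client_sizes)
    return (coeffs[:, None] * np.stack(client_weights)).sum(axis=0)

# Five clients with clean labels push the parameters toward [1, 1];
# one client trained on mislabeled data pushes toward [-1, 3].
honest = [np.array([1.0, 1.0]) for _ in range(5)]
noisy = [np.array([-1.0, 3.0])]
global_update = fedavg(honest + noisy, [100] * 6)
print(global_update)  # roughly [0.67, 1.33]: the corrupted update is diluted
```

A single client's mislabeled data therefore shifts the global model only by that client's sample-weighted share, which is consistent with the observation that FL with more active clients per round degrades less as the label-error percentage grows.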
STEFAN KALABAKOV received the B.Sc. degree in computer technologies and engineering from the Faculty of Electrical Engineering and Information Technologies (FEEIT), Skopje, North Macedonia, and the M.Sc. degree from the Jožef Stefan International Postgraduate School, Ljubljana, Slovenia. He is currently pursuing the Ph.D. degree. He is also a Research Assistant with the Digital Health—Connected Healthcare Group, Hasso Plattner Institute (HPI), Germany. His research interests include federated learning, electronic health records, and human activity recognition.

BORCHE JOVANOVSKI received the B.Sc. degree in electrical engineering and information technologies, in the field of telecommunications, and the M.Sc. degree in electrical engineering and information technology, in the field of wireless systems, services, and applications, from the Faculty of Electrical Engineering and Information Technologies (FEEIT), Ss. Cyril and Methodius University in Skopje (UKIM), Skopje, Macedonia, in 2019 and 2021, respectively, where he is currently pursuing the Ph.D. degree in electrical engineering and information technologies. He is also a Research Associate with the Laboratory for Wireless and Mobile Networks, UKIM in Skopje. His research interests include wireless networks, wireless communications, cloud computing, and, more recently, the application of machine learning and federated learning in different domains.

DANIEL DENKOVSKI is currently an Associate Professor with the Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University in Skopje. His research interests are concentrated on signal processing, information theory, wireless communications, cloud computing, and, more recently, machine learning and federated learning and their application in different domains. He has notable research experience, having worked on 12 internationally funded research projects (FP7, H2020, and NATO SpS) and several domestic projects in his research areas. Besides theoretical research, he has extensive prototyping experience, which has resulted in several awarded ICT system prototypes. He has more than 60 publications, of which 16 are in top journals with an impact factor, and seven chapters in Springer books. He was a recipient of the ‘‘Best Young Scientist’’ Award for 2014 from the President of the Republic of Macedonia.

VALENTIN RAKOVIC (Senior Member, IEEE) received the Dipl.-Ing., M.Sc., and Ph.D. degrees in telecommunications from the Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University in Skopje (UKIM), in 2008, 2010, and 2016, respectively. He currently holds the position of Associate Professor and is the Head of the Laboratory for Wireless and Mobile Networks, Faculty of Electrical Engineering and Information Technologies (FEEIT), UKIM in Skopje. He has coauthored more than 70 publications in international conferences and journals. His research interests include wireless networks, signal processing, optimization theory, machine learning, and the prototyping of wireless networking solutions.
BJARNE PFITZNER received the M.Eng. degree in computing from Imperial College London. He is currently pursuing the Ph.D. degree with the Digital Health—Connected Healthcare Group, Hasso Plattner Institute (HPI), Germany, where he is also a Research Assistant. For the last four years, he has worked in the area of federated learning, with a focus on privacy-preserving algorithms using differential privacy and on healthcare applications, such as medical imaging and risk stratification for the intensive care unit.

BERT ARNRICH is currently the Head of the Chair Digital Health—Connected Healthcare, joint Digital-Engineering Faculty, Hasso Plattner Institute (HPI), and the University of Potsdam. He has been a PI in several European and national research projects. He studied ‘‘Informatics in the Natural Sciences.’’ In his Ph.D. thesis, he implemented an early big data approach that collects and consolidates patient data for scientific data analysis. At ETH Zurich, he established and headed the Wearable Computing Laboratory, Research Group Pervasive Healthcare. He received a Marie Curie COFUND Fellowship from the European Union and was appointed to a tenure-track professorship with the Computer Engineering Department, Bosporus University. He was the Science Manager of Emerging Technologies with Accenture Technology Solutions.