
An Explainable Method for Cost-Efficient Multi-View Fall Detection


Amani ELAOUD∗, Achraf KHAZRI∗ and Walid BARHOUMI∗†
∗ Université de Tunis El Manar, Institut Supérieur d'Informatique, Research Team on Intelligent Systems in Imaging and Artificial Vision (SIIVA), LR16ES06 Laboratoire de recherche en Informatique, Modélisation et Traitement de l'Information et de la Connaissance (LIMTIC), 2 Rue Abou Rayhane Bayrouni, 2080 Ariana, Tunisia
† Université de Carthage, Ecole Nationale d'Ingénieurs de Carthage (ENICarthage), 45 Rue des Entrepreneurs, 2035 Tunis-Carthage, Tunisia

Abstract—Human fall detection is a crucial topic to study, since there are a lot of cases of falls at hospitals, homes and retirement homes. In fact, falls are very costly, especially for elderly people and persons with special needs, since they may cause death or serious injuries that require immediate medical intervention. In order to prevent further repercussions after this type of accident, modern automated fall detection methods are presented as effective alerting systems that are widely used for emerging healthcare applications. In this study, we present a multi-view-based fall detection method that runs in real time using only a CPU, and consequently it can be deployed in hospitals, retirement homes and nurseries without any financial problems related to expensive hardware. Indeed, a lightweight human pose estimator has been adopted in order to detect human body key-points from two different views, so as to solve the problem of image depth ambiguity. Then, we extract a few explainable features, based on the automatically detected key-points, that are associated with confirmed descriptors of posture and balance biomechanics. The extracted features are thereafter fed into a machine learning classifier in order to predict whether there is a fall or not. The proposed method has been tested on a challenging public dataset, and the preliminary results show its effectiveness compared to other relevant state-of-the-art methods.

Index Terms—Fall detection, pose estimation, explainable features, multi-view, healthcare

I. INTRODUCTION

In recent years, human motion analysis has been explored in various applications and many effective tools have been developed to accurately analyze and interpret human behaviour [31]. In fact, by extracting significant characteristics, human behaviour can be analyzed and understood very well, without the need for sensors or markers that limit the movements of persons [13]. In particular, human fall detection is a crucial topic to study since there are a lot of cases of falls at hospitals, homes and retirement homes. Across the globe, according to the World Health Organization (WHO), falls are the second leading cause of unintentional injury deaths worldwide, and each year there are over 37 million falls [1]. Indeed, the repercussions are generally injuries that require medical intervention, hospitalization and sometimes death (approximately 684,000 deaths each year [1]). Thus, several approaches have been proposed in the scientific literature in order to solve the problem of fall detection automatically [32]. Overall, although marker-based methods have proved to be accurate for detecting falls, their effectiveness decreases remarkably when markers are placed incorrectly on the body, in addition to being impractical and difficult to install in the environment or to carry around all day. Thus, recent works are focusing on non-wearable methods, notably those based only on video sequences [33]. These methods can be classified into three main groups. The first group consists of methods based on hand-crafted feature extraction followed by a supervised classification procedure. However, methods belonging to this first class are generally not sufficiently accurate, and the extraction of hand-crafted features heavily relies on human experience, while not being thorough enough to cover all the different cases. The second group consists of end-to-end deep learning-based methods. These methods promise very good results but, on the other hand, they learn the fall prediction from the annotated data, without explicit regularization and with the risk of overfitting. The third main group consists of hybrid methods that are based on deep learned features that are thereafter processed using supervised classifiers. Nevertheless, the methods composing this last group also face the risk of overfitting, in addition to the black-box aspect of the extracted features.

In this study, we are interested in the first class, which is based on investigating hand-crafted features, given that fall detection systems should be designed to perform an increasingly complex task in environments shared with humans, and thus the need for explainable models will persist. However, this problem remains difficult to solve for two major reasons. The first reason is related to the lack of representative datasets: since most of the available datasets are recorded in laboratories and do not reflect the real use case, the presented solutions cannot generalize well. In addition, in real-world situations, we may face other difficulties such as occlusion, variation in human clothing, lighting conditions and depth ambiguities in images. The second reason is the problem of deploying fall detection solutions. Generally, solutions presented in the literature cannot be deployed in hospitals, homes or retirement homes, and this is generally related to the use of expensive or impractical hardware. In order to address the above-mentioned problems, we propose a multi-view-based fall detection method that runs in real time, using only a CPU and two regular RGB cameras. Thus, the proposed method can be deployed anywhere without any financial problems related to expensive hardware. Indeed, we start by estimating human poses in videos using a lightweight human pose estimator. Then, we extract explainable features, according to relevant posture and balance biomechanics patterns, based exclusively on some human body keypoints. These features are thereafter filtered before being fed to a supervised classifier in order to predict falls. In addition to being cost efficient, the main contribution of the method resides in investigating interpretable hand-crafted features while exploring multiple cameras in order to overcome the issues of perspective ambiguity and occlusions.

The rest of this paper is organized as follows. In Section 2, we briefly review the related work on fall detection based on video analysis. In Section 3, we describe the proposed method. Experimental results are discussed in Section 4 to demonstrate the effectiveness of the proposed method. Finally, in Section 5, we conclude the proposed work and present some ideas for future investigations.

II. RELATED WORK

Due to the advances in technology, cameras have become accessible to everyone and can be installed everywhere, from hospitals and airports to retirement homes. In particular, this has made solving the problem of fall detection more practical than using other sensors. Thus, various methods have been proposed in order to solve the fall detection problem and, as described in the previous section, these methods can be grouped into three main groups: hand-crafted based methods, methods based on end-to-end deep learning models, and hybrid methods.

A. Hand-crafted based methods

Within the framework of fall detection based on hand-crafted features, Gunale et al. [2] have implemented a fall detection method based on background subtraction followed by feature extraction and classification stages. In fact, they have proposed to form silhouettes using visual features such as Motion History Image (MHI), aspect ratio and orientation angle, and then tested different classifiers such as SVM, KNN and decision tree. Similarly, Feng et al. [24] have adopted background subtraction as a preprocessing step, while combining deep belief networks and restricted Boltzmann machines for the classification stage. Differently, Zhao et al. [3] have used the Microsoft Kinect sensor in order to capture RGB-D images as well as the person's position and skeleton joints. The provided data is thereafter investigated in order to compute the person's velocity and determine whether it is abnormal or not. If an abnormal velocity is detected, the body joints are checked to determine whether they are on the ground or not before concluding that a fall has occurred. Likewise, Yang et al. [4] have detected falls through head tracking with a dense Spatio-Temporal Context (STC) algorithm based on Kinect RGB-D images. Other studies have relied on human pose estimators to provide the positions of the human body joints used for fall detection. For example, in the work of [5], a video-based fall detection method was proposed using a pose estimator in the first step, followed by a classification step of a sequence of images into two main classes, Fall/Not Fall, using a fully convolutional neural network. Similarly, the authors in [17] have utilized a human pose estimator in order to detect human body joints in RGB video frames. In fact, the extracted joints are used to build spatial and temporal features classified by a Long Short-Term Memory (LSTM) recurrent neural network with the aim of detecting falls.

B. End-to-end deep learning based methods

Within the context of automated fall detection methods based on end-to-end deep learning models, Lu et al. [25] have designed an effective deep learning fall detector based on the combination of a 3D convolutional neural network and an LSTM recurrent architecture. The presented model has been trained on various kinematic video data and has been shown to be capable of automatically extracting spatial and temporal features. Similarly, Abobakr et al. [21] have investigated an end-to-end deep learning model for fall detection, taking videos recorded using a Kinect RGB-D sensor as inputs. In fact, the proposed model uses two different types of neural networks: convolutional neural networks followed by LSTM recurrent neural networks. Differently, the fall detection system introduced in [23] has relied on multi-stream 3D convolutional neural networks for identifying human falls. In fact, the proposed method fuses multiple consecutive images in a first stage, then feeds them into a multi-stream 3D convolutional neural network in order to predict one of the following classes: standing, falling, fallen and others. Furthermore, in their work, Li et al. [27] have adopted a convolutional neural network model similar to AlexNet [28] in order to detect falls in a video surveillance environment.

C. Hybrid methods

Hybrid methods are based on extracting features using deep learning architectures before classifying inputs, according to the extracted features, using machine learning classifiers. For instance, Chhetri et al. [22] have proposed an automated vision-based fall detection system that uses optical flow for the data pre-processing, followed by a convolutional neural network for feature extraction and classification. In this work, transfer learning was adopted as the fine-tuning technique for the convolutional neural network. Likewise, Anishchenko [29] has also adopted transfer learning to train the AlexNet [28] convolutional neural network architecture to solve the problem of fall detection. Similarly, in [30], a wheelchair fall detector system has been introduced, based on a low-cost, lightweight inertial sensing method relying on a hybrid scheme and an unsupervised One-Class SVM (OCSVM) for the detection of cases leading to falls. However, hybrid methods also face the risk of overfitting, in addition to the black-box aspect of the extracted features.

III. PROPOSED METHOD


Recently, fallen-person detection systems have attracted growing interest from researchers seeking to advance assistive systems. Fall detection in healthcare, in particular, is a crucial area of research that aims to improve the safety of persons and prevent further damage after falls. The injuries caused by falls include fractures of the hip, forearm, humerus and pelvis, which usually result from the combined effect of falls and osteoporosis. In fact, falls require immediate attention to reduce the risk of injuries, yet hospitals and retirement homes can often afford neither enough nurses to monitor persons all day long nor expensive fall detection systems. In this work, we propose a practical real-time multi-view fall detection method able to run on CPU, without requiring any expensive hardware. The input is two simultaneous videos of a person from different views, and consequently only two RGB cameras are needed for video acquisition.

As shown in Fig. 1, the proposed method can be divided into four main parts. The first part consists in feeding the Human Pose Estimator (HPE) with two videos of two different views. In fact, for each frame in both videos, we automatically extract 32 body keypoints. Then, we proceed to a feature extraction stage to construct spatial and temporal features. These features are thereafter filtered in the next step in order to reduce the data size while eliminating meaningless information. Finally, a classifier performs a prediction based on the filtered features to detect whether or not a fall has occurred.

Fig. 1: Outline of the proposed fall detection method.

A. Human Pose Estimation

In this work, we have investigated skeletons from RGB videos using the BlazePose extractor [6]. In the literature, the BlazePose estimator has been used in several applications, including sign language recognition [7], human action recognition [8], [9], [26] and fall detection [11]. This estimator is a single-person body pose estimator that predicts 32 key body points. It is a lightweight model that runs in real time on CPUs. The architecture of this estimator consists of two main parts: a body pose detector that initially runs to predict the body's key points, and a tracker that follows the detected key points until the person disappears or is occluded. The pose detector uses a face detection system that detects faces and predicts the circle surrounding the whole person as well as other important features. These features are then integrated into a neural network model to detect the key points on the body. The tracker is a neural network model that predicts the presence of the person and the refined body key points. In this work, we chose the BlazePose estimator based on two main factors: it is a lightweight model that we can use without the need for powerful hardware, and it performs well in estimating human pose.
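To make this step concrete, the following is a minimal sketch of per-frame keypoint extraction with MediaPipe's Python implementation of BlazePose; the function name, the video-reading loop and the confidence thresholds are illustrative assumptions rather than details from the paper:

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def extract_keypoints(video_path):
    """Yield, for each frame, the list of BlazePose landmarks
    (normalized x, y, z and a visibility score per keypoint)."""
    cap = cv2.VideoCapture(video_path)
    # model_complexity=0 selects the lightest BlazePose variant,
    # consistent with the paper's CPU-only, real-time constraint.
    with mp_pose.Pose(static_image_mode=False,
                      model_complexity=0,
                      min_detection_confidence=0.5,
                      min_tracking_confidence=0.5) as pose:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks is not None:
                yield result.pose_landmarks.landmark
    cap.release()
```

In the proposed pipeline, this extraction would simply be run once per camera, producing the two per-view keypoint streams consumed by the feature extraction stage below.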

B. Feature Extraction

Based on the key points of the human body extracted by the human pose estimator, we investigate four types of features, namely the angle between the axis of the body's torso and the vertical axis, the mean angular velocity, the height/width ratio of the human body and the projection of the body's center of gravity on the ground, all grounded in confirmed descriptors of posture and balance biomechanics [19].

1) Angle and Previous Angle: This feature illustrates the angle between the torso axis (the vertical axis of the human body) and the vertical axis (Fig. 2). Generally, when this angle exceeds 90°, the human body is most likely lying on the ground. This feature has been widely used in the literature [12], and the common formula used to calculate this angle is given as follows:

tan(θ) = (m1 − m2) / (1 + m1·m2),   (1)

where m1 and m2 denote the slopes of the vertical and the torso axes, respectively. The previous angle is the angle calculated in the previous frame in relation to the current frame of a video scene, and it is defined as follows:

β(t) = θ(t − 1).   (2)

2) Angular Mean Velocity (AMV): This feature has been used in the literature by [14], presenting the change in body angle between two successive images. From a visual point of view, when a person falls, the angle between the vertical axis and the axis of the torso changes considerably in a short space of time, making it possible to determine whether or not a fall has occurred:

AMV = θ(t) − θ(t − 1).   (3)
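As an illustration of Equations (1)-(3), the sketch below computes the torso angle and the AMV from two keypoints. Defining the torso axis by the mid-shoulder and mid-hip points, and using atan2 rather than the slope form of Eq. (1) (which is undefined for a perfectly vertical axis), are our assumptions about one reasonable implementation:

```python
import math

def torso_angle(mid_shoulder, mid_hip):
    """Angle (degrees) between the torso axis and the vertical axis.
    Equivalent in intent to Eq. (1), but computed with atan2 to avoid
    the infinite slope of the vertical axis."""
    dx = mid_shoulder[0] - mid_hip[0]
    dy = mid_shoulder[1] - mid_hip[1]
    # In image coordinates y grows downward, so an upright torso has
    # dy < 0 and yields an angle of 0 degrees.
    return abs(math.degrees(math.atan2(dx, -dy)))

def angular_mean_velocity(theta_t, theta_prev):
    """Eq. (3): change of the torso angle between successive frames."""
    return theta_t - theta_prev
```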

3) Height/Width Ratio (HWR): In this work, the proposed height and width do not reflect the actual height and width of the person. Instead, we form a rectangle from the following four key points: the left shoulder, the right shoulder, the left knee and the right knee, before dividing its height by its width (Fig. 2). From a visual point of view, a person falls when the height of the rectangle is less than its width. We have not chosen the real height and width of the person because, if we did, we might get a ratio of less than 1 when the person is not falling or has already fallen, but is simply opening his arms and legs wide. This feature is important and has been widely used in the literature to solve the problem of fall detection [15].

4) Gravity Center Projection (GCP): According to the physics of the human body [19], [20], a standing person is stable when the projection of the center of gravity on the ground is within the base of support, which is the area between the feet (Fig. 2). When the center of gravity's projection leaves the support base, the person becomes unstable. Indeed, a person in a state of instability is very likely to fall. In general, the center of gravity has no fixed location and changes according to how the person moves and their body mass; in the case of a standing adult, it is located in the stomach, while for babies it is located between the shoulders, as babies' heads are generally heavy. In this study, we have chosen to place the center of gravity at the center of the hips for reasons of simplicity. In the literature, this characteristic has been used to detect a person's fall [15].

Fig. 2: A schematic diagram that describes the spatial features: (a) Angle and Height/Width Ratio (HWR), (b) Gravity Center Projection (GCP).
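A hedged sketch of the two remaining spatial features follows. The paper does not specify the numeric encoding of GCP; representing it as a binary inside/outside-the-base-of-support indicator is our assumption:

```python
def height_width_ratio(l_sho, r_sho, l_knee, r_knee):
    """HWR over the rectangle bounding the two shoulders and the two
    knees (not the person's real height and width)."""
    xs = [p[0] for p in (l_sho, r_sho, l_knee, r_knee)]
    ys = [p[1] for p in (l_sho, r_sho, l_knee, r_knee)]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    return height / width if width > 0 else float("inf")

def gcp_outside_base(l_hip, r_hip, l_foot, r_foot):
    """1 if the horizontal projection of the approximate center of
    gravity (the hip center, as chosen in the paper) falls outside
    the base of support delimited by the two feet, else 0."""
    cog_x = (l_hip[0] + r_hip[0]) / 2.0
    left, right = sorted((l_foot[0], r_foot[0]))
    return 0 if left <= cog_x <= right else 1
```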
C. Features Filtering

In the proposed fall detection method, we have five features extracted from the automatically detected body key points. As we have used two cameras, these features are doubled. The filtering step aims to reduce the data size and eliminate unnecessary information, thus maximizing accuracy while minimizing the processing time. It consists of halving the number of features and retaining only one feature per type. For each feature, we perform the following operations to select the best one (for the "previous angle", we choose the one that is related to the selected angle), as illustrated in the sketch below:

• Angle = max(Angle1, Angle2)
• AMV = max(AMV1, AMV2)
• HWR = min(HWR1, HWR2)
• GCP = max(GCP1, GCP2)
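The view-fusion rules above can be summarized in a few lines; the dictionary layout and key names are illustrative assumptions:

```python
def filter_two_views(f1, f2):
    """Fuse the per-view feature dictionaries into a single vector,
    keeping one value per feature type as described above."""
    fused = {
        "angle": max(f1["angle"], f2["angle"]),
        "amv":   max(f1["amv"],   f2["amv"]),
        "hwr":   min(f1["hwr"],   f2["hwr"]),
        "gcp":   max(f1["gcp"],   f2["gcp"]),
    }
    # The "previous angle" follows whichever view supplied the
    # selected angle, as stated in the filtering rule.
    src = f1 if f1["angle"] >= f2["angle"] else f2
    fused["prev_angle"] = src["prev_angle"]
    return fused
```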
D. Classification

Once the spatial and temporal features have been extracted and filtered, we predict whether a person has fallen or not using one of three relevant classifiers: Multilayer Perceptron (MLP), Support Vector Machine (SVM) and Logistic Regression (LR). To train and test these classifiers, we have used the UP-Fall dataset presented by Martínez-Villaseñor et al. [16] (Fig. 3). This dataset contains videos recorded from two views of a single person performing 11 different activities: 1-falling forward using the hands, 2-falling forward with the knees, 3-falling backward, 4-falling sideways, 5-falling sitting on an empty chair, 6-walking, 7-standing, 8-sitting, 9-gripping an object, 10-jumping and 11-lying. These activities were performed by 17 different individuals, with 3 trials for each activity. In total, this dataset contains 1122 videos.

The main challenges encountered in the UP-Fall dataset are the occlusion and the variety of actions that can be performed. Indeed, the investigated videos contain multiple objects and people performing different actions, such as walking and sitting, without falling. We used this dataset for two main reasons. Firstly, the videos are recorded in high resolution, so we have no problems with the pose estimator detection. Secondly, for each fall case, two videos have been recorded from two orthogonal views, which helps us to solve the problem of depth ambiguity when estimating certain key body points. The training of the classifiers has been based on the video images. We extracted all video frames from the dataset and labeled each frame (Fall/Not Fall). In videos where there is no fall, all frames have been labeled "Not Fall"; for videos that contain falls, we labeled the frames before the fall occurs as "Not Fall" and, from the moment the person starts to fall, during the fall and after the fall when lying down, we labeled the frames as "Fall".

Fig. 3: Example from the UP-Fall dataset [16]: (a) and (b) represent a young man performing the "falling forward using the hands" action from two views; (c) and (d) represent a young woman performing the "gripping an object" action from two views.
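A minimal training sketch with scikit-learn is given below, assuming per-frame feature vectors X and Fall/Not-Fall labels y. Apart from the 19-neuron hidden layer reported for the best MLP and the 70%/30% split described in Section IV, all hyperparameters are library defaults, not values from the paper:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def train_classifiers(X, y):
    """Train the three investigated classifiers on per-frame feature
    vectors X with Fall/Not-Fall labels y (70%/30% split)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
    models = {
        "MLP-19": MLPClassifier(hidden_layer_sizes=(19,), max_iter=500),
        "SVM": SVC(),
        "LR": LogisticRegression(max_iter=1000),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, "test accuracy:", model.score(X_te, y_te))
    return models, (X_te, y_te)
```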
IV. RESULTS

To evaluate the proposed method, we have used the UP-Fall dataset. In fact, this dataset contains 200,000 images, and we have divided it into two parts: 70% for training and 30% for testing. To measure the performance of the investigated machine learning models, we have used the confusion matrix, which is a visualization table that stores the predictions made by machine learning models. After obtaining the confusion matrix, various metrics are calculated in order to evaluate the model's performance numerically. These measures are as follows:

• The Accuracy is the most used evaluation metric in classification problems. It is defined as follows:

Accuracy = (TN + TP) / (TP + TN + FN + FP)   (4)

• The Precision is the ratio of the number of correctly classified positive samples to the entire set of samples classified as positive.

Precision = TP / (TP + FP)   (5)

• The Recall, also called sensitivity or true positive rate, is the ratio of the correctly classified positive samples over the sum of the correctly classified positive samples and the positive samples incorrectly classified as negative.

Recall = TP / (TP + FN)   (6)

• The F1-score is a combination of the precision and the recall metrics:

F1-score = 2 × (Precision × Recall) / (Precision + Recall),   (7)

where:
– True Positive (TP) represents the total of correct predictions for the positive class.
– True Negative (TN) denotes the total of correct predictions for the negative class.
– False Positive (FP) represents the total of incorrect predictions for the positive class.
– False Negative (FN) illustrates the total of incorrect predictions for the negative class.

In fact, we have firstly calculated the accuracy and the F1-score per image for each investigated model (MLP, SVM and LR). To evaluate the method as a whole, we have presented a video-based assessment, predicting that there is a fall in a video when n successive frames have been detected as a fall. We have tested n from 1 to 6 to see which number of successive frames gives the best performance for each model. To demonstrate the effectiveness of the proposed features, we have used all the features initially proposed, before eliminating one feature at a time to determine whether the eliminated feature had an influence on prediction or not.
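The video-level decision rule and the metric formulas (4)-(7) can be sketched as follows (function names are illustrative):

```python
def video_has_fall(frame_preds, n):
    """Video-level rule: a fall is declared when at least n successive
    frames are predicted as fall (n is tested from 1 to 6)."""
    run = 0
    for is_fall in frame_preds:
        run = run + 1 if is_fall else 0
        if run >= n:
            return True
    return False

def scores(tp, tn, fp, fn):
    """Eqs. (4)-(7) computed from the confusion-matrix counts."""
    accuracy = (tn + tp) / (tp + tn + fn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```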
A. MLP classifier

For the multilayer perceptron, we have built an architecture with an input layer, a hidden layer and an output layer. To fine-tune this machine learning model, we have changed the number of neurons in the hidden layer.

We have tested the model each time on the testing set to see which configuration performs better. As a result, the MLP with 19 neurons in the hidden layer has outperformed the models with other numbers of neurons in the hidden layer, with an accuracy of 93.17%. For this choice, the normalized confusion matrix shown in Fig. 4, whose diagonal elements concentrate the majority of the percentages (the proportion of instances for which the predicted label is equal to the actual label), allows us to conclude that MLP-19 performs well in terms of prediction. Using the confusion matrix, we have also calculated the accuracy and the F1-score, giving us 87.16% for the accuracy and 86.37% for the F1-score.

Fig. 4: Confusion matrix for MLP-19 video tests.

Furthermore, Fig. 5 shows the evolution of the loss as a function of the training and test data for 50 epochs and, as we can clearly see, the training and test losses are very close and converge towards zero. Thus, we can conclude that we have no over-fitting problem. Besides, in order to identify how many successive frames are needed to make the best decision, we have tested MLP-19 (the MLP with 19 neurons in the hidden layer) on all the videos. Each time, we have made a decision based on n successive images (from 1 to 6 images). Moreover, in order to test the effectiveness of the proposed features, we have carried out several tests. For each test, we subtract one feature to see the impact of that feature on the model performance. Table I shows that all features have an influence on prediction. The importance of the features is ranked in ascending order as follows: HWR, GCP, Angle, AMV, and "Previous angle".

B. SVM classifier

For the SVM classifier, we have performed the same tests as for the MLP. In fact, we have tested the whole method on videos, looking for the best number of successive images, and we have finally investigated the features' effectiveness by subtracting one feature at a time from the five we set.
For the image test, we have obtained 93.6% accuracy and 65.29% F1-score. For the video test, we have recorded an accuracy of 88.97% and an F1-score of 88.51% using four successive images. These quantitative assessment metrics have been computed according to the normalized confusion matrix shown in Fig. 6.

Fig. 5: Loss function curve of the MLP model over the training epochs.

TABLE I: MLP-19 feature effectiveness test.

Feature                               | Accuracy | F1-score
Angle, Previous angle, AMV, HWR, GCP  | 87.16%   | 86.37%
Without Previous angle                | 86.98%   | 85.94%
Without AMV                           | 86.44%   | 85.71%
Without Angle                         | 86.26%   | 85.03%
Without GCP                           | 85.35%   | 84.62%
Without HWR                           | 82.46%   | 79.66%

Fig. 6: Confusion matrix for SVM video test.

Fig. 7 shows the results of the SVM test with 4 successive images using all the 5 features. As a result, the angle has no influence on prediction (subtracting this feature has improved the results) and, for the rest of the features, their importance is ranked in ascending order as follows: HWR, AMV, Previous angle, GCP.

Fig. 7: SVM features effectiveness test.

C. LR classifier

As for the MLP and the SVM, we have trained the logistic regression model on the frames, and the validation on the test data has yielded 90.53% accuracy and 39.77% for the F1-score. The second phase of the test has consisted in testing the videos, and it has given the best result, with an accuracy of 80.47% and an F1-score of 80.85%, using just one image. Indeed, the normalized confusion matrix for the video test (Fig. 8) has confirmed that logistic regression has some difficulty in predicting falls compared with SVM and MLP, but performs well in predicting non-fall cases.

Fig. 8: Confusion matrix for LR video test.

The final phase of the test, checking the effectiveness of the chosen features, has shown (Table II) that all features had an influence on prediction, sorted in descending order as follows: HWR, GCP, Angle, AMV, Previous angle.
TABLE II: LR feature effectiveness test.

Feature                               | Accuracy | F1-score
Angle, Previous angle, AMV, HWR, GCP  | 80.47%   | 80.85%
Without Previous angle                | 80.29%   | 80.43%
Without GCP                           | 77.39%   | 76.1%
Without Angle                         | 81.73%   | 82.19%
Without AMV                           | 81.19%   | 81.16%
Without HWR                           | 78.48%   | 78.79%

Furthermore, Table III shows the best results, including the effective and ineffective predictive features of each model. As we can see, the SVM classifier has achieved the best results, followed by MLP, while LR comes last with poor results compared to the other models. Moreover, the HWR and GCP features are the most effective compared to the other three investigated features. HWR has been widely used in the literature and the results confirm this, whereas GCP has not been used much in the literature despite its explanation in the physics of human body stability [19].

TABLE III: Comparison of the performances between the investigated models.

Classifier | Accuracy | F1-score | Predictive features
MLP-19     | 87.16%   | 86.37%   | Angle, Previous angle, AMV, HWR, GCP
SVM        | 89.15%   | 88.72%   | Previous angle, AMV, HWR, GCP
LR         | 81.73%   | 82.19%   | Angle, Previous angle, AMV, HWR, GCP

Overall, the results obtained by testing our method are slightly lower than those reported in the literature (Fig. 9), which were obtained on the same public UP-Fall dataset, and this difference is due to the presence of several people in the videos of the dataset. In fact, as the BlazePose estimator that we have used deals only with a single person's pose, the presence of several people in the image can lead to some confusion as to which person to choose, so we could calculate a pose estimate for two different people from the two different views. Fig. 10 shows an example illustrating this case. Indeed, the HPE has estimated the wrong person in the second view and, as a result, incorrect features have been calculated, which could have a negative influence on the prediction. However, most of the time, given that the person performing the fall is the most visible in the videos, the pose estimator predicts the right person well. Moreover, the proposed method is based solely on images from RGB cameras, and the HPE we used does not require expensive hardware to run in real time; therefore, it is more practical to use in real use cases compared to other works in the literature based on sensors or heavyweight deep learning models.

Fig. 9: Comparison between the proposed method and two existing relevant methods that have used the same challenging UP-Fall dataset.

Fig. 10: An illustrative sample of a human pose estimator failure case: (a) View 1, (b) View 2.

V. CONCLUSION

In recent years, human motion analysis has been explored in various applications and many effective tools have been developed to accurately analyze human behaviour. In fact, by extracting significant characteristics, human behaviour can be analyzed and understood very well, without the need for sensors or markers that limit the movements of persons. In particular, human fall detection is a crucial topic to study since there are a lot of cases of falls at hospitals, homes and retirement homes. In this work, we have proposed an effective human fall detection method that is easy to deploy and does not require expensive equipment to operate. In fact, we have presented a study analyzing the effectiveness of five main features for multi-view-based fall detection, and this study led to the conclusion that the height/width ratio and the projection of the body's center of gravity are the most important features of the entire list presented, while being clearly explainable. Moreover, we have evaluated the proposed fall detection method with three relevant machine learning classifiers (MLP, SVM and LR), while comparing the effectiveness of the features by eliminating one feature at a time, before rigorously examining the quantitative results as well as the qualitative ones. In further work, we aim to develop a multi-person fall detection method that uses person tracking and matching techniques in order to solve the problems encountered in this work. We can also test other recent multi-person pose estimators that can estimate human body pose with similar performance to the BlazePose estimator.

ACKNOWLEDGMENT

This work was funded by the Tunisian Ministry of Higher Education and Scientific Research (MESRS) and the French Ministry of Foreign Affairs and Ministry of Higher Education under the PHC Utique program in the CMCU project number 23G1411. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of MESRS.

REFERENCES

[1] World Health Organization (WHO), "Falls," 11 November 2022. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/falls
[2] Gunale, K., & Mukherji, P. (2018). Indoor human fall detection system based on automatic vision using computer vision and machine learning algorithms. J. Eng. Sci. Technol., 13(8), 2587-2605.
[3] Zhao, F., Cao, Z.-G., Xiao, Y., Mao, J., & Yuan, J. (2018). Real-time detection of fall from bed using a single depth camera. IEEE Transactions on Automation Science and Engineering. doi: 10.1109/TASE.2018.2861382
[4] Yang, L., Ren, Y., Hu, H., & Tian, B. (2015). New fast fall detection method based on spatio-temporal context tracking of head by using depth images. Sensors, 15(9), 23004-23019. doi: 10.3390/s150923004
[5] Chen, Z., Wang, Y., & Yang, W. (2021). Video based fall detection using human poses. doi: https://doi.org/10.48550/arXiv.2107.14633
[6] Bazarevsky, V., Grishchenko, I., Raveendran, K., Zhu, T., Zhang, F., & Grundmann, M. (2020). BlazePose: On-device real-time body pose tracking. doi: https://doi.org/10.48550/arXiv.2006.10204
[7] Samaan, G., Wadie, A., Attia, A., Asaad, A., Kamel, A., Slim, S., Abdallah, M., & Cho, Y.-I. (2022). MediaPipe's landmarks with RNN for dynamic sign language recognition. Electronics, 11, 3228. doi: https://doi.org/10.3390/electronics11193228
[8] Alsawadi, M., El-Kenawy, E.-S., & Rio, M. (2022). Using BlazePose on spatial temporal graph convolutional networks for action recognition. Computers, Materials & Continua, 74. doi: 10.32604/cmc.2023.032499
[9] Maaoui, H., Elaoud, A., & Barhoumi, W. (2023). An accurate random forest-based action recognition technique using only velocity and landmarks' distances. In International Conference on Information and Knowledge Systems (pp. 129-144). Cham: Springer Nature Switzerland. doi: https://doi.org/10.1007/978-3-031-51664-1_9
[10] Gutierrez-Gallego, J., Rodriguez, V., & Martín, S. (2022). Fall detection system based on far infrared images, 1-7. doi: 10.1109/TAEE54169.2022.9840598
[11] Liu, W., Liu, X., Hu, Y., Shi, J., Chen, X., Zhao, J., Wang, S., & Hu, Q. (2022). Fall detection for shipboard seafarers based on optimized BlazePose and LSTM. Sensors, 22(14), 5449. doi: https://doi.org/10.3390/s22145449
[12] Hernandez-Mendez, S., Maldonado-Mendez, C., Marin-Hernandez, A., & Rios-Figueroa, H. V. (2017). Detecting falling people by autonomous service robots: A ROS module integration approach. In Proceedings of the 2017 International Conference on Electronics, Communications and Computers (CONIELECOMP), Cholula, Mexico, 22-24 February 2017 (pp. 1-7). doi: 10.1109/CONIELECOMP.2017.7891823
[13] Elaoud, A., Barhoumi, W., Zagrouba, E., & Agrebi, B. (2020). Skeleton-based comparison of throwing motion for handball players. Journal of Ambient Intelligence and Humanized Computing, 11, 419-431. doi: https://doi.org/10.1007/s12652-019-01301-6
[14] Bourke, A. K., & Lyons, G. M. (2008). A threshold-based fall-detection algorithm using a bi-axial gyroscope sensor. Medical Engineering & Physics, 30, 84-90. doi: 10.1016/j.medengphy.2006.12.001
[15] Yajai, A., Rodtook, A., Chinnasarn, K., & Rasmequan, S. (2015). Fall detection using directional bounding box. In 12th International Joint Conference on Computer Science and Software Engineering (JCSSE) (pp. 52-57). doi: 10.1109/JCSSE.2015.7219769
[16] Martínez-Villaseñor, L., Ponce, H., Brieva, J., Moya-Albor, E., Núñez-Martínez, J., & Peñafort-Asturiano, C. (2019). UP-Fall detection dataset: A multimodal approach. Sensors, 19(9), 1988. doi: 10.3390/s19091988
[17] Heinrich, C., Koita, S., Taufeeque, M., Spicher, N., & Deserno, T. (2021). Abstract: Multi-camera, multi-person, and real-time fall detection using long short term memory. doi: 10.1007/978-3-658-33198-6_29
[18] Espinosa, R., Ponce, H., Gutiérrez, S., Martínez-Villaseñor, L., Brieva, J., & Moya-Albor, E. (2019). A vision-based approach for fall detection using multiple cameras and convolutional neural networks: A case study using the UP-Fall detection dataset. Computers in Biology and Medicine, 115, 103520. doi: https://doi.org/10.1016/j.compbiomed.2019.103520
[19] Richmond, S. B., Fling, B. W., Lee, H., & Peterson, D. S. (2021). The assessment of center of mass and center of pressure during quiet stance: Current applications and future directions. Journal of Biomechanics, 123, 110485. doi: https://doi.org/10.1016/j.jbiomech.2021.110485
[20] Morasso, P. (2020). Centre of pressure versus centre of mass stabilization strategies: The tightrope balancing case. Royal Society Open Science, 7(9), 200111. doi: https://doi.org/10.1098/rsos.200111
[21] Abobakr, A., Hossny, M., Abdelkader, H., & Nahavandi, S. (2018). RGB-D fall detection via deep residual convolutional LSTM networks. In Digital Image Computing: Techniques and Applications (DICTA). IEEE. doi: 10.1109/DICTA.2018.8615759
[22] Chhetri, S., Alsadoon, A., Al-Dala'in, T., Prasad, P. W. C., Rashid, T. A., & Maag, A. (2021). Deep learning for vision-based fall detection system: Enhanced optical dynamic flow. Computational Intelligence, 37(1), 578-595. doi: https://doi.org/10.48550/arXiv.2104.05744
[23] Alanazi, T., & Muhammad, G. (2022). Human fall detection using 3D multi-stream convolutional neural networks with fusion. Diagnostics, 12(12), 3060. doi: https://doi.org/10.3390/diagnostics12123060
[24] Feng, P., Yu, M., Naqvi, S. M., & Chambers, J. A. (2014). Deep learning for posture analysis in fall detection. In 2014 19th International Conference on Digital Signal Processing (pp. 12-17). IEEE. doi: 10.1109/ICDSP.2014.6900806
[25] Lu, N., Wu, Y., Feng, L., & Song, J. (2018). Deep learning for fall detection: Three-dimensional CNN combined with LSTM on video kinematic data. IEEE Journal of Biomedical and Health Informatics, 23(1), 314-323. doi: 10.1109/JBHI.2018.2808281
[26] Elaoud, A., Barhoumi, W., Drira, H., & Zagrouba, E. (2019). Weighted linear combination of distances within two manifolds for 3D human action recognition. In VISIGRAPP (5: VISAPP). doi: 10.5220/0007369006930703
[27] Li, X., Pang, T., Liu, W., & Wang, T. (2017). Fall detection for elderly person care using convolutional neural networks. In 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE. doi: 10.1109/CISP-BMEI.2017.8302004
[28] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems (pp. 1097-1105). doi: https://doi.org/10.1145/3065386
[29] Anishchenko, L. (2018). Machine learning in video surveillance for fall detection. In 2018 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT), Yekaterinburg, Russia (pp. 99-102). doi: 10.1109/USBEREIT.2018.8384560
[30] Sheikh, S. Y., & Jilani, M. T. (2023). A ubiquitous wheelchair fall detection system using low-cost embedded inertial sensors and unsupervised one-class SVM. Journal of Ambient Intelligence and Humanized Computing, 14, 147-162. doi: https://doi.org/10.1007/s12652-021-03279-6
[31] Elaoud, A., Barhoumi, W., Drira, H., & Zagrouba, E. (2020). Modeling trajectories for 3D motion analysis. In Computer Vision, Imaging and Computer Graphics Theory and Applications: 14th International Joint Conference, VISIGRAPP 2019, Revised Selected Papers (pp. 409-429). Springer International Publishing. doi: https://doi.org/10.1007/978-3-030-41590-7_17
[32] Bhola, G., & Vishwakarma, D. K. (2024). A review of vision-based indoor HAR: State-of-the-art, challenges, and future prospects. Multimedia Tools and Applications, 83(1), 1965-2005.
[33] Yao, L., Yang, W., & Huang, W. (2022). A fall detection method based on a joint motion map using double convolutional neural networks. Multimedia Tools and Applications, 1-18.
