
This article has been accepted for publication in IEEE Transactions on Mobile Computing. This is the author's version, which has not been fully edited, and content may change prior to final publication. Citation information: DOI 10.1109/TMC.2023.3309938

EdgeActNet: Edge Intelligence-enabled Human Activity Recognition using Radar Point Cloud

Fei Luo, Salabat Khan, Anna Li, Yandao Huang, Kaishun Wu, Fellow, IEEE

Abstract—Human activity recognition (HAR) has become a research hotspot because of its wide range of application prospects. It has high requirements for real-time and power-efficient processing. However, the large amount of data transferred between sensors and servers, together with computation-intensive recognition models, hinders the implementation of real-time HAR systems. Recently, edge computing has been proposed to address this challenge by moving computational and data storage resources to the sensors, rather than depending on a centralized server/cloud. In this paper, we investigated binary neural networks for edge intelligence-enabled HAR using radar point cloud. Point clouds can provide 3-dimensional spatial information, which helps to improve recognition accuracy. Time-series point clouds also bring challenges, such as larger data volume, 4-dimensional data processing, and more intensive computation. To tackle these challenges, we adopt 2-dimensional histograms for point cloud multi-view processing and propose the EdgeActNet, a binary neural network for point cloud-based human activity classification on edge devices. In the evaluation, the EdgeActNet achieved the best results, with average accuracies of 97.63% on the MMActivity dataset and 95.03% on the point cloud samples of the DGUHA dataset, and saved 16.9× memory consumption and 11.5× inference time compared to its full-precision version. Our work is also the first to apply 2D histogram-based multi-view representation and BNNs to time-series point cloud classification.

Index Terms—Human activity recognition, radar, point cloud, binary neural network, edge intelligence

Fei Luo, Salabat Khan, Yandao Huang, and Kaishun Wu are with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China. Anna Li (corresponding author) is with the School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK, E1 4NS. E-mail: [email protected], [email protected], [email protected], [email protected], [email protected].

I. INTRODUCTION

Human Activity Recognition (HAR) has attracted researchers' attention due to its various applications in healthcare, surveillance, smart homes, human-computer interaction, etc. With the widespread adoption of IoT, various sensors and smart devices have been used for HAR. Current HAR techniques are categorized into two types: vision-based and sensor-based techniques [1]. Vision-based HAR infers people's activity by examining images or videos captured by cameras. Due to the wide deployment of cameras and the fact that image/video data conform to observation with the naked eye, vision-based techniques have been widely studied and applied in activity recognition. Sensor-based techniques can be further divided into wearable sensors and external sensors according to how the sensors are used [2]. Wearable sensors, including Bluetooth/RFID tags, electrocardiogram (ECG) sensors, and the accelerometers and gyroscopes in smartphones, need to be worn or carried by users continuously. External sensors are placed outside the human body. They do not need to be worn or carried and can achieve non-contact HAR. Currently, two widely researched external sensor-based activity recognition techniques are WiFi CSI and radar. Compared to WiFi CSI, radar usually has a larger detection range, higher time resolution, and stronger anti-interference ability; ultra-wideband (UWB) radar also has a higher range resolution. Various radar sensors with different frequency bands, waveforms, and antenna configurations have been used in HAR. Generally, radar-based HAR has the following advantages: a larger detection range, a certain penetration ability, non-contact and imperceptible detection, insensitivity to lighting, and privacy preservation. Hence, radar-based activity recognition has a great application prospect.

Most radar-based HAR systems use the micro-Doppler signatures of radar to differentiate human activities. Micro-Doppler signatures refer to the extra modulations superimposed on the main Doppler frequency shift. However, it is also possible to collect the point cloud reflected from human targets by using radar. In [3], the authors used a millimeter-wave radar to generate a 3D point cloud. In [4], the authors used four radar sensors to collect point clouds for automotive applications. The spatial resolution of a radar is related to its bandwidth as $d_{res} = \frac{c}{2B}$, where $d_{res}$ denotes the range resolution of the radar, $B$ denotes the bandwidth swept by the chirp of the radar, and $c$ denotes the speed of light. Thus, a radar with higher bandwidth can provide better range resolution [5]. Compared to Lidar, radar can collect point clouds in a low-cost and weatherproof way. In [6], the authors used point clouds only to localize and track individuals and inferred their activities from their trajectories. The limitation is that this can merely recognize the activities that cause changes in people's location. Human posture keeps changing during activities, and point clouds can provide 3D information about human postures. Hence, besides human detection, object classification, and scene segmentation using point clouds, point cloud-based HAR is attracting more and more attention.

Sensor-based HAR schemes rely on machine-learning models to learn the mapping from sensor-generated signal profiles to human activities. Due to the hierarchical structure of deep-learning models, they can learn feature representations from unprocessed data with little data pre-processing. These models have excelled in image classification, object recognition, game competition, etc. Research has recently identified significant leads of deep-learning models over traditional machine learning models for HAR [7], [8].


Because of the robust learning capability of deep learning, it has turned into an ideal and leading approach for HAR and is extensively used [6], [9], [10]. Nonetheless, deep learning models are usually computation-intensive. Most of them are installed on a cloud platform to analyze IoT sensor-generated data and feed back the outputs, which is inappropriate for real-time and time-sensitive applications. HAR has a high demand for time-sensitive and real-time performance. For instance, providing timely healthcare assistance after a fall is detected or raising an alarm after a theft occurs is necessary. It is highly recommended to move artificial intelligence (AI) to the network edge to avoid communication delays and enable quick responses [11]. Technologies and methods developed by AI help edge computing (EC), and platforms developed by EC facilitate AI applications. Edge intelligence (EI) is born from the solid demand to integrate EC and AI. In [12], EI is defined as enabling edges to execute AI algorithms. In a fast-changing environment, EI can provide an understanding of sensor-generated data and offer real-time decision-making and prediction. In terms of real-time performance at the network edge, this meets the requirements of HAR.

There are many challenges facing EI since it is still in its infancy. A major challenge is balancing optimality and efficiency. While AI offers optimal solutions, it is crucial to consider the trade-off between efficiency and optimality when dealing with resource-constrained edges [11]. To overcome this challenge and enable EI, both Zhou et al. [13] and Zhang et al. [12] describe a few technological solutions that allow model inference at the edge, including model compression, partitioning, and conditional computation. Model compression is one of the most active directions that facilitates AI at the network edge. It reduces model complexity and resource requirements to enhance model inference. Presently, model compression methods comprise pruning, quantization, low-rank factorization, and knowledge distillation [14]. Quantization focuses on reducing the number of bits used for weights and activations in neural networks (NNs), simplifying the implementation of NNs on hardware. One extreme case of quantization is the binary neural network (BNN), where the weights and activations are bounded to +1 and −1. Several recent studies have shown impressive results for BNNs. A BNN can provide 58× speedup on CPUs and 32× memory savings by converting both weights and activations to binary values [15]. MeliusNet, proposed in [16], outperforms the full-precision network MobileNetV1 [17] in terms of accuracy for the first time. ReActNet [18] even surpasses ResNet in terms of classification accuracy on the ImageNet dataset. These results reveal that BNNs have great potential to realize AI at the network edge.

Currently, little work implements BNNs in HAR while considering resource-constrained devices and communication latency. In particular, time-series point clouds generally have a larger volume than other sensor data, which can result in higher latency in data transfer. Figure 1 illustrates our model for EI-enabled HAR using radar point cloud. Instead of transmitting point cloud data to the cloud platform, an edge device layer is set up near the radar sensors to offer real-time signal processing and activity detection. It only submits a few detection outputs to the cloud platform for requesting different services and delivering timely applications for surveillance, assisted living, healthcare monitoring, etc. This work focuses on point cloud data processing and implementing deep-learning models on resource-limited edge devices. We leverage BNNs to realize real-time HAR with low latency and low memory. This work aims to enhance the practicality of sensor-based HAR by reducing the delay and resource usage of deep learning on edge devices.

In this paper, we conducted radar point cloud-based HAR. We constructed two-dimensional (2D) histograms from the radar point cloud, adopted several classic BNN structures, including BinaryResNet, BiRealNet, and BinaryDenseNet, and adapted them for 4D time-series point cloud data. For comparison, we also implemented several classic deep-learning models, including ResNet, DenseNet, and SwinTransformer, as baselines. Our contributions can be summarized as follows:

1) We implemented 2D histogram multi-view processing for the time-series point cloud. Multi-view-based point cloud classification has been widely applied, but 2D histogram-based multi-view representation for point clouds is implemented for the first time in our work. We found that 2D histograms can reduce the volume of the processed data and improve activity classification accuracy.

2) We adapted three well-known binary neural networks for 4D time-series point cloud classification for the first time. Since most BNNs are proposed for image classification, there is still no BNN proposed for 4D data, especially for dynamic time-series point clouds. We modified the structures of BirealNet, BiResNet, and DenseNet into BirealNet3D, BiResNet3D, and BinaryDenseNet3D to adapt them to time-series point cloud data.

3) We proposed a BNN, the EdgeActNet, for point cloud-based HAR. We evaluated the EdgeActNet on a Raspberry Pi 4B board, not just in terms of accuracy but also memory footprint and inference latency. It achieved state-of-the-art results, with average classification accuracies of 97.63% on the MMActivity dataset and 95.03% on the point cloud samples of the DGUHA dataset, a memory footprint of 1.8 MB, and an inference latency of around 12 ms. Overall, the proposed EdgeActNet outperforms BirealNet3D, BiResNet3D, BinaryDenseNet3D, ResNet3D, DenseNet3D, and SwinTransformer in terms of classification accuracy, memory footprint, and latency.

The remainder of this paper is organized as follows. A brief review of sensor-based HAR, EI, binary neural networks, and point cloud classification is provided in Section II. In Section III, we introduce the methodologies and techniques for sensor-based HAR and binary neural networks, and describe our 2D histogram multi-view point cloud processing and the architecture and design of the proposed EdgeActNet in detail. We evaluate the EdgeActNet with four metrics and analyze the results in Section IV. The final section concludes the paper.


Fig. 1. The architecture of EI-enabled HAR using radar point cloud. The EC architecture comprises four layers: the sensor layer, edge device layer, cloud layer, and application layer. The red arrow indicates the high latency and high bandwidth incurred by directly offloading huge volumes of data from the sensor layer to the distant centralized cloud layer.

II. RELATED WORK

As our work focuses on adopting BNNs for HAR on edge devices using radar point cloud, this section reviews state-of-the-art works addressing sensor-based HAR, edge intelligence, BNNs, and point cloud classification.

A. Sensor-based HAR

HAR has attracted high attention and interest from the mobile computing and ubiquitous sensing communities due to its wide applications in the military, healthcare, and security. With the prosperity of IoT, various sensors have been applied in HAR. These sensors are divided into wearable and external. Wearable sensors are attached to users' bodies. Sensors such as accelerometers, gyroscopes, electroencephalography (EEG), and magnetometers are commonly used in HAR. The accelerometer is the most widely used among them, as it is widely embedded in smartphones and smartwatches. Ambient sensors, including pressure sensors, microphones, thermometers, and hygrometers, have also been used to capture the changes in environmental factors and variables generated by human activity [19]. Several benchmark datasets have been widely used for wearable sensor-based HAR research, including the UCI-HAR dataset [20], UniMiB SHAR [21], MotionSense [22], etc. GPS tags have also been applied to track users' locations and infer their activities based on their trajectories on a city or regional scale [23]. External sensors are deployed in a specific area of interest without requiring users to wear anything. The commonly used external HAR sensors include cameras, WiFi routers, and radar. HAR using cameras is also called vision-based HAR, which recognizes human action by inspecting images or video collected from cameras. The changes of WiFi RSSI and CSI [24] caused by the multi-path effect enable activity recognition. Radar micro-Doppler signatures can classify human actions [25]. Lidar can track people's location with mm-level accuracy and has been used in HAR based on users' trajectories [6]. Each type of sensor has its advantages and disadvantages. Vision-based HAR data are visually understandable to the naked eye, and it is easy to collect data by using widespread cameras. However, vision-based techniques are illumination-sensitive and privacy-intrusive. Wearable sensors have the potential to achieve high accuracy but require people to wear them constantly. Despite their device-free capabilities, radar and WiFi are sensitive to environmental noise. Even though LiDAR and GPS can map actions and locations, they can merely deduce the activities and actions that cause location changes, and GPS has high latency. For different activities and contexts, it is essential to choose appropriate sensors. In this work, we investigate HAR using radar point cloud, which belongs to the device-free techniques.

B. Edge intelligence and binary neural network

HAR is a very complicated time-series procedure. Currently, human activity recognition relies on machine learning techniques. Conventional machine learning approaches need features that are extracted from raw sensor signals. Their performance heavily relies on the quality of the extracted features. Deep learning can overcome this limitation. It automatically learns high-level representations from sensor-generated data through its deep hierarchical structure without needing expert knowledge in signal processing and feature extraction.


CNNs and RNNs are massively used to learn the spatial patterns and temporal dependencies of sensor signals in HAR. Several deep learning structures have been presented for sensor-based HAR. The temporal convolutional network [26] was created to identify dependencies in time-series signal channels from human actions. Squeeze-and-Excitation Networks [27] were presented to model the relationships and interdependencies between signal channels. Temporal Transformer Networks were presented to recognize 3D actions based on comprehension of discriminative warps between time series [28]. Deep learning models are used in sensor-based HAR, but the majority of them rely on GPUs and a huge memory for inference. To deploy deep learning on edge devices, model compression techniques are expected to reduce memory and computation consumption.

Furthermore, an enormous quantity of data is generated by sensors. When these data are offloaded to a centralized server, they suffer from high delay and bandwidth consumption. A new computing concept, edge computing (EC), is on the horizon because of the robust demand for transitioning computation from the cloud platform to the network edge. EC reduces communication delay and bandwidth cost as well as enables faster responses by moving computation and communication resources to the network edge [11]. Edge devices refer to computing or networking resources that reside among IoT devices and are connected to cloud-based data centers [29]. A smartphone that connects wearable sensors and home items to a cloud platform in a smart home can act as an edge device. There have been studies justifying the possible benefits of EC.

To analyze large volumes of data, AI is important. It can offer a profound comprehension of data characteristics. Since enormous portions of data are produced at the network edge, there is a solid need to combine EC and AI, giving birth to edge intelligence (EI) [11]. EI, however, faces some challenges. AI is computationally intensive, while edge devices are resource-limited. Developing edge-friendly AI algorithms requires balancing accuracy and efficiency. The most intense quantization method is binarization, a 1-bit quantization process where weights and activations are defined using only +1 and −1. In binarized neural networks, the heavy matrix multiplication operations are substituted by lightweight XNOR (bitwise) and bitcount operations. Hubara et al. [30] have shown that using BNNs in HAR can save up to 32× memory and gain 58× speedup on CPUs. In this work, we propose a BNN, the EdgeActNet, which is inspired by recent BNN structure designs and further adapted for 4D time-series point cloud classification.

C. Point cloud classification

3D sensors are commonly available, and different kinds of 3D sensors, including LiDAR, RGB-D cameras, and radars, have been used in remote sensing, self-driving, and localization. 3D sensors can offer a better comprehension of the environment. The point cloud is a frequently used format of 3D data. Point clouds maintain the genuine geometric information in 3D space [31]. With the requirement of object recognition and scene understanding in many applications (e.g., autonomous driving and robotics), point cloud classification and segmentation are attracting more and more attention. Point cloud classification techniques can be categorized into multi-view-, volumetric-, and point-based approaches. Multi-view-based approaches map 3D point clouds into 2D images, which fill the gap between 2D and 3D learning. They render several views for a given 3D body. Bradski et al. [32] pioneered leveraging 2D views to classify 3D objects. MVCNN was the first usage of 2D CNNs for recognizing 3D objects. It leverages max pooling to combine features from different views. Subsequent works, including RotationNet [33], ViewGCN [34], and MVTN [35], proposed novel deep learning structures to model the inter-relationship between views. Volumetric-based approaches usually voxelize point clouds into 3D grids and then apply a 3D CNN for classification. VoxNet [36] recognized 3D objects by introducing a volumetric occupancy grid. Density occupancy grid representations of PointNet [37] were used for the input data and integrated into a 3D CNN. OctNet utilizes a cluster of unbalanced octrees to partition a point cloud. Each leaf in the octree maintains a pooled feature representation, which requires much less memory and facilitates deeper networks without degrading resolution. Point-based approaches directly rely on raw point clouds, without projection or voxelization. They are classified into pointwise MLP (Multi-Layer Perceptron)-, convolution-, graph-, and hierarchical data structure-based methods, etc. Pointwise MLP techniques model each point independently with several shared MLPs and aggregate a global attribute employing a symmetric aggregation function [31]. Convolution-based methods use continuous or discrete convolution for irregular point clouds. Graph-based methods take each point as a node of a graph and yield directed edges based on the neighbors of each point. Hierarchical data structure-based methods are implemented on hierarchical structures (e.g., kd-tree and octree). They learn point features hierarchically, starting from the leaves to the root of a tree. There are still many other methods that address 3D point cloud classification; more related work can be found in [31].

Although wide research has been done on 3D point cloud classification, these works are implemented on static point cloud data. Human activities are dynamic and change through time. Recently, some researchers have begun to study HAR using 4D time-series point cloud classification. In [38], the authors used sparse radar point clouds to differentiate 21 gestures articulated by 41 participants in two indoor environments and achieved 95% accuracy. In [39], the authors proposed Tesla-Rapture, a real-time implementation of point cloud-based gesture recognition using a mmWave radar. In [40], [41], radar point clouds were used to construct 3D human poses by using deep learning approaches. In [42], the authors designed a transformer architecture to distinguish human activities by using radar point clouds. Although radar point cloud-based human activity recognition is getting more attention from researchers, there is still no research that considers using BNNs to accelerate HAR at the edge. The collected point clouds contain many frames and have a larger data volume.


It is not enough to model only the spatial patterns of point clouds; the time dependencies through the frames of point clouds also need to be modeled. In addition, human action recognition has high requirements for real-time performance and edge-enabled applications. All these challenges drive us to implement this work.

III. METHODOLOGY

Point cloud-based HAR comprises point cloud data processing and activity recognition model design. To process the data, we implemented a multi-view representation using 2D histograms to convert a point cloud into multiple 2D histogram maps. For model design, we designed a binary neural network for computation-efficient and low-latency human activity classification. In this section, we first give a mathematical overview of point cloud-based HAR; then, we describe the 2D histogram method. Finally, we describe the architecture of our BNN in detail.

A. Overview of point cloud-based HAR

Assume a user performs several types of activities drawn from a predefined set of activities A [19]:

$A = \{A_i\}_{i=1}^{n}$   (1)

where n represents the number of activity classes. For each activity, a set of point cloud frames F is collected from a 3D sensor:

$F = \{f^1, f^2, \dots, f^t, \dots, f^T\}, \quad f^t = \{p^t_1, p^t_2, \dots, p^t_i, \dots, p^t_m\}$   (2)

where F is a set that contains T frames of point cloud, $f^t$ denotes the point cloud at time t, with t = 1, 2, 3, ..., T; $p^t_i$ denotes a point in $f^t$, and each point contains 3 values x, y, z denoting its location in 3D space.

Since we apply a multi-view representation to the point cloud, a conversion function, or projection $\mathcal{P}$, is used to convert each point cloud frame $f^t$ into multiple 2D images:

$v^t = \mathcal{P}(f^t), \quad V = \{v^1, v^2, \dots, v^t, \dots, v^T\} = \mathcal{P}(F)$   (3)

where $v^t$ is the output of $\mathcal{P}$; it is a set of 2D images. V is the set of all $v^t$; it contains all the outputs of a point cloud sequence after conversion. V is fed into a recognition model for activity classification.

$\mathcal{F}$ is a model that predicts the activity class based on V and outputs a categorical distribution over the n class labels with softmax values $[l_1, l_2, \dots, l_n]$, where $l_j$ denotes the softmax score for class $A_j$. The learning task of the model on a 4D point cloud dataset labeled with activities is formulated as follows:

$\arg\min \sum_{k=1}^{K} \mathcal{L}(\mathcal{F}(V_k), y_k) = \arg\min \sum_{k=1}^{K} \mathcal{L}(\mathcal{F}(\mathcal{P}(F_k)), y_k), \quad y_k \in A$   (4)

where $\mathcal{L}$ represents the cross-entropy classification loss defined over the K point cloud sequence samples in the dataset. HAR tries to learn $\mathcal{F}$ by minimizing the overall $\mathcal{L}$. In our work, $\mathcal{F}$ is the binary neural network described in Section III-C.

B. Multi-view representation with 2D histograms

We adopted the multi-view-based method for point cloud sequence classification in our work. The reason is that there is still no suitable network structure for time-series point cloud classification. By converting 4D time-series point clouds into multiple 2D images, the data dimension can be reduced. In this way, we can use a 3D CNN to accomplish point cloud classification in the same way as video classification.

In our work, we first project point clouds onto three 2D planes: XY, XZ, and YZ. The three planes are enough to reflect a point's location in each dimension. Instead of just feeding the projected 2D images to the classifier, we further apply two-dimensional (2D) histogram processing to these images. A 2D histogram illustrates the relationship of intensities at the same position between two images. It has been frequently used in image thresholding [43]–[45]. Abutaleb [44] used the original gray-level histogram along with a local average of the neighboring pixels to form a 2D histogram, which was further refined by maximizing the smaller of the two entropies relating to the background and object classes [46]. By applying a 2D histogram to projected views of point clouds, it is possible to find the correlation between different views of the point cloud. A 2D histogram can be mathematically expressed as follows.

Consider a 2D image (a projected view of a point cloud) $F = [f(x, y)]_{M \times N}$ of dimension M × N with [0, ..., L − 1] gray levels, where f(x, y) is the gray level (value) of the pixel at point (x, y). Consider another 2D picture $G = [g(x, y)]_{M \times N}$, where g(x, y) is the gray level of the pixel at location (x, y). The 2D histogram of pictures F and G is an L × L matrix $C = [c_{ij}]_{L \times L}$, which carries statistical information concerning the number of gray-level co-occurrences of two pixels at the same point (x, y). The 2D histogram entry for any pair of pixels f(x, y) and g(x, y) with the respective gray levels (i, j) can be formalized as follows [47]:

$c_{ij} = \mathrm{Card}\{(f(x, y), g(x, y)) \mid f(x, y) = i,\ g(x, y) = j\}$   (5)

where Card represents cardinality.

Figure 2 shows the process of multi-view representation with 2D histograms. The color of points in the projected view implies different densities. Each 2D histogram is generated from a pair of views/images (XY-XZ, XY-YZ, XZ-YZ). As each frame of the point cloud is projected onto three views, it generates three 2D histograms from these views.
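To make the projection and pairwise-histogram construction concrete, here is a minimal NumPy sketch of converting one point cloud frame into the three 2D histogram maps described above. The function name, bin count, and axis bounds are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def frame_to_histograms(points, bins=50, bounds=((-3, 3), (-3, 3), (0, 3))):
    """Convert one point cloud frame (m x 3 array of x, y, z) into three
    2D histogram maps, one per pair of projected views (XY-XZ, XY-YZ, XZ-YZ).

    `bins` and `bounds` are illustrative choices; the paper uses 50 x 50 maps.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    x_rng, y_rng, z_rng = bounds

    # Project the frame onto the XY, XZ, and YZ planes as 2D occupancy images.
    xy, _, _ = np.histogram2d(x, y, bins=bins, range=(x_rng, y_rng))
    xz, _, _ = np.histogram2d(x, z, bins=bins, range=(x_rng, z_rng))
    yz, _, _ = np.histogram2d(y, z, bins=bins, range=(y_rng, z_rng))

    # Pairwise 2D histogram between two views: count co-occurrences of the
    # (quantized) intensities of both views at the same pixel, as in Eq. (5).
    def pairwise_hist(a, b, levels=bins):
        a_q = np.clip(a, 0, levels - 1).astype(int).ravel()
        b_q = np.clip(b, 0, levels - 1).astype(int).ravel()
        c = np.zeros((levels, levels), dtype=np.int32)
        np.add.at(c, (a_q, b_q), 1)   # c[i, j] = Card{a == i and b == j}
        return c

    return np.stack([pairwise_hist(xy, xz),
                     pairwise_hist(xy, yz),
                     pairwise_hist(xz, yz)], axis=-1)   # shape (50, 50, 3)

# Example: a sparse frame of 40 random points.
frame = np.random.rand(40, 3) * [6, 6, 3] - [3, 3, 0]
maps = frame_to_histograms(frame)
print(maps.shape)   # (50, 50, 3)
```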
C. Edge-oriented time-series point cloud classification

After point cloud processing for the multi-view representation, a model is required for activity classification. As the point cloud has a large volume and human activity recognition demands high real-time performance, we decided to build an edge-oriented classifier. Our decision to adopt BNNs is prompted by the limited memory and computational power of edge devices.


Fig. 2. Multi-view representation of point clouds using 2D histograms.

1) Binary neural networks: Binary neural networks represent the floating-point weights w and activations α using 1 bit. One hurdle during the training of a BNN is that the binarization function, such as sign, is not differentiable and its gradient vanishes, which prevents the direct application of the backward propagation (BP) algorithm based on common gradient descent to update the binary weights. Luckily, Hubara et al. [30] presented a straight-through estimator (STE) to fix this gradient hurdle while training BNNs. Through the STE, BNNs can be trained using the same gradient descent procedure as full-precision NNs.
which is a binary neural network that adopts 3D CNN for The growth rate of function Hl shows the number of features
time-series multi-view-based point cloud classification. Our produced Hl , which helps to control the DenseNet’s growth
EdgeActNet adopts several structural advantages of several speed.
classic BNNs for increasing the quality and capacity of fea- Binary dense block uses binarization on the dense block. In
tures efficiently. Before describing the implementation details addition, the dense block structure is re-designed to balance
of EdgeActNet, it is necessary to introduce the components of the precision loss induced by binarization. The change is the
it. mitigation of bottleneck design in binary dense blocks. Though
Binary 3D dense block: The main building block of the bottleneck design can decrease the input parameters, it
EdgeActNet comprises a binary 3D dense block, and a binary also reduces information propagation through BNNs [49]. As
3D improvement block follows each dense block. The dense illustrated in Figure 3(b), the bottleneck design is enhanced by
block is the basic unit in the DenseNet architecture. The substituting two convolution layers with one 3×3 convolution.
dense block increases the information capacity by densely The growth rate is halved to keep the number of parameters


The growth rate is halved to keep the number of parameters and the information flow in BNNs, and the number of layer combinations is doubled.

The binary dense block described above is only suitable for the analysis of static 2D images. To apply the binary dense block to the time-series image sequences generated from time-series point clouds, the 2D convolution in binary dense blocks is replaced with 3D convolution, as shown in Figure 3(c). 3D convolution convolves with feature maps not just along the width and height but also along the depth (the frames of the point cloud), which is helpful for modeling the temporal profile.

Fig. 4. Transition block of different network architectures.

Binary 3D transition block: Because the pooling process alters the volume of feature maps, the transition block is placed between dense blocks to facilitate the concatenation operation. Transition blocks also control the number of feature maps generated from the dense block and improve model compactness. In this method, the network is partitioned into multiple dense blocks with identical feature map sizes in each block. As shown in Figure 4(a), the transition blocks in DenseNet contain: 1) a batch normalization layer, 2) a 1 × 1 convolutional layer, and 3) a 2 × 2 average pooling layer. In BNNs, the transition block is also redesigned to compensate for the precision loss. It is replaced by MaxPool → ReLU → 1 × 1 Conv, as given in Figure 4(b). In EdgeActNet, the 2D convolution in the transition block is also replaced with 3D convolution, as shown in Figure 4(c). Again, the transition block in EdgeActNet relies on a full-precision layer to maintain the information flow. Transition blocks can halve the dimension of the feature map with a MaxPool layer, and the channels are also reduced in the 1 × 1 downsampling convolution.

Fig. 5. Binary 2D (a) and 3D (b) improvement blocks.

Binary 3D improvement block: The improvement block was first proposed in [16] to increase the quality of the newly concatenated channels in dense blocks. As shown in Figure 5, it uses a binary convolution to compute 64 channels based on the input feature map with channel size c. The 64 output channels are added to the previously output 64 channels through a residual connection without altering the original c − 64 feature map channels. Thus, the addition enhances the quality of the last 64 channels [16]. The improvement block can improve the quality with fewer operations than a fully residual connection. In EdgeActNet, all binary 2D convolutions in improvement blocks are replaced with binary 3D convolutions.
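As an illustration of how such building blocks could be composed, the sketch below defines a binary 3D convolution (weights and activations binarized with the STE), a dense-block layer combination that concatenates newly derived channels onto its input, and an improvement block that adds a binary-convolution residual to the newest channels only. The layer sizes and the growth value of 64 follow the description above, but this is a simplified reconstruction, not the authors' released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

@tf.custom_gradient
def ste_sign(x):
    def grad(dy):
        return dy * tf.cast(tf.abs(x) <= 1.0, dy.dtype)
    return tf.sign(x), grad

class BinaryConv3D(layers.Layer):
    """3x3x3 convolution with binarized weights and inputs (illustrative)."""
    def __init__(self, filters):
        super().__init__()
        self.filters = filters

    def build(self, input_shape):
        self.w = self.add_weight(
            name="w",
            shape=(3, 3, 3, int(input_shape[-1]), self.filters),
            initializer="glorot_uniform", trainable=True)

    def call(self, x):
        return tf.nn.conv3d(ste_sign(x), ste_sign(self.w),
                            strides=[1, 1, 1, 1, 1], padding="SAME")

def dense_layer_combination(x, growth=64):
    """BN -> binary 3x3x3 conv -> concatenate `growth` new channels (cf. Fig. 3(c))."""
    new = BinaryConv3D(growth)(layers.BatchNormalization()(x))
    return layers.Concatenate(axis=-1)([x, new])

def improvement_block(x, improved=64):
    """Refine only the newest `improved` channels with a binary-conv residual (cf. Fig. 5(b))."""
    older, recent = x[..., :-improved], x[..., -improved:]
    correction = BinaryConv3D(improved)(layers.BatchNormalization()(x))
    return layers.Concatenate(axis=-1)([older, recent + correction])

# Tiny example feature map: (batch, depth, height, width, channels).
feat = tf.random.normal([1, 8, 16, 16, 64])
feat = dense_layer_combination(feat)   # 64 -> 128 channels
feat = improvement_block(feat)         # still 128 channels, last 64 refined
print(feat.shape)                      # (1, 8, 16, 16, 128)
```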
Fig. 6. Standard convolution (a) and group convolution (b). The latter partitions the input feature maps into disjoint groups.

3D group convolution: Group convolution was first introduced in AlexNet [50]. A group convolution splits the feature maps into several groups and uses the same number of kernels to convolve with each group respectively. The major benefit of group convolution is that it can efficiently reduce the parameters in neural networks. In ResNeXt [51], it was shown that group convolution can result in a wider neural network, which improves classification accuracy. Standard convolution layers (Figure 6(a)) output O features by applying a convolution kernel on all R input features, leading to a computation cost of R × O. Group convolution (Figure 6(b)) lowers the computation cost by dividing the input features into G groups, each producing O/G outputs, lowering the computation cost by a factor of G to (R × O)/G [52]. In EdgeActNet, we adopt group convolution to reduce the computation cost. The conventional group convolution is modified by using 3D convolution, which can be called 3D group convolution.
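A minimal sketch of a 3D group convolution is given below: the input channels are split into G groups, each group is convolved independently, and the results are concatenated. Recent TensorFlow/Keras versions also expose an equivalent `groups` argument on `Conv3D`; the sizes here mirror the first group convolution layer of the EdgeActNet described below (32 feature maps, 4 groups) and are otherwise illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def group_conv3d(x, filters, groups, kernel_size=3):
    """3D group convolution: split channels into `groups`, convolve each
    group independently, then concatenate (cost ~ (R x O) / G)."""
    splits = tf.split(x, num_or_size_splits=groups, axis=-1)
    outs = [layers.Conv3D(filters // groups, kernel_size, padding="same")(s)
            for s in splits]
    return layers.Concatenate(axis=-1)(outs)

# Example: 32 input feature maps, 4 groups, 32 output feature maps.
x = tf.random.normal([1, 8, 16, 16, 32])
y = group_conv3d(x, filters=32, groups=4)
print(y.shape)   # (1, 8, 16, 16, 32)
```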
The implementation details of EdgeActNet: The EdgeActNet is composed of the above components. The architecture of the EdgeActNet is illustrated in Figure 7. Assuming the point cloud sequence contains n frames, each frame can generate three 2D histograms through the process introduced in Section III-B. Hence, the input shape is (w, h, d, c), where w and h are the width and height of the 2D histogram, respectively, d is the depth, which is equivalent to the number of frames n, and c is the number of channels, which is 3 here (the three 2D histograms generated from each frame). The backbone of the EdgeActNet contains a 3D convolution layer, two 3D group convolution layers, and four stages (S1, S2, S3, S4), where each stage contains several combinations of a binary 3D dense block and a binary 3D improvement block. These four stages are connected by three binary 3D transition blocks. Both the 3D convolution layer and the 3D group convolution layers use kernels of size 3 × 3 × 3, and they are full-precision to keep the information flow. Both the 3D convolution layer and the first 3D group convolution layer have 32 feature maps.


The first 3D group convolution layer has 4 groups, and the second 3D group convolution layer has 8 groups and 64 feature maps. The number of output feature maps of each stage depends on the number of dense blocks in it. The first binary 3D transition block generates 160 feature maps (channels), the second one outputs 224 feature maps, and the third one generates 256 feature maps. After the backbone, a 3D global average pooling layer follows to reduce the feature map parameters by calculating the mean value of each feature map. Finally, a full-precision dense layer with softmax activation outputs the prediction. Note that there are four hyperparameters in the EdgeActNet to customize the number of combinations in each stage; these hyperparameters can be denoted as (s1, s2, s3, s4). In the experiments, we will investigate the impact of these hyperparameters.

Fig. 7. The architecture of the proposed EdgeActNet.
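A schematic sketch of how such a backbone could be assembled with the Keras functional API is shown below. The stand-in stage and transition functions use ordinary full-precision layers in place of the binary 3D dense, improvement, and transition blocks; the stage sizes follow the default (2, 3, 2, 2) hyperparameters and the transition widths quoted above (160, 224, 256). It is a simplified reconstruction under those assumptions, not the released model, and the Conv3D `groups` argument requires a reasonably recent TensorFlow.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def stage(x, n_combinations, growth=64):
    # Stand-in for n_combinations x (binary 3D dense block + improvement block);
    # here each combination simply concatenates `growth` new channels.
    for _ in range(n_combinations):
        new = layers.Conv3D(growth, 3, padding="same", activation="relu")(
            layers.BatchNormalization()(x))
        x = layers.Concatenate()([x, new])
    return x

def transition(x, channels):
    # Stand-in for the binary 3D transition block: MaxPool -> ReLU -> 1x1x1 conv.
    x = layers.MaxPooling3D(pool_size=2)(x)
    x = layers.ReLU()(x)
    return layers.Conv3D(channels, 1)(x)

def build_backbone(input_shape=(50, 50, 60, 3), n_classes=5, stages=(2, 3, 2, 2)):
    inp = layers.Input(input_shape)
    # Full-precision stem: one 3x3x3 conv followed by two 3D group convolutions.
    x = layers.Conv3D(32, 3, padding="same")(inp)
    x = layers.Conv3D(32, 3, padding="same", groups=4)(x)   # 4 groups, 32 maps
    x = layers.Conv3D(64, 3, padding="same", groups=8)(x)   # 8 groups, 64 maps
    widths = (160, 224, 256)   # transition widths quoted in the text
    for i, n in enumerate(stages):
        x = stage(x, n)
        if i < len(widths):
            x = transition(x, widths[i])
    x = layers.GlobalAveragePooling3D()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)  # full-precision head
    return Model(inp, out)

model = build_backbone()
model.summary()
```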
In training, the categorical cross-entropy loss is used to optimize the network. We used the Adam optimizer [53] with an initial learning rate of 0.001. The batch size is 12 because a bigger batch size would result in memory overflow due to the large input. The EdgeActNet was implemented using the TensorFlow [54] deep learning framework.
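Under the stated settings, the training configuration could look like the following sketch; the data tensors and the small stand-in model are placeholders.

```python
import numpy as np
import tensorflow as tf

# Placeholder data with the input shape used in the paper: (50, 50, 60, 3).
x_train = np.random.rand(24, 50, 50, 60, 3).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 5, size=24), num_classes=5)

# Stand-in classifier (the real model would be the EdgeActNet).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(50, 50, 60, 3)),
    tf.keras.layers.Conv3D(8, 3, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling3D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # lr 0.001 as stated
              loss="categorical_crossentropy",                          # categorical cross-entropy
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=12, epochs=1)                    # batch size 12 as stated
```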
Challenges in designing EdgeActNet: There are several challenges in designing a suitable BNN architecture for point cloud-based HAR. At the very beginning, we planned to use graph neural networks, as a point cloud can be taken as a graph after connecting neighboring points. However, current graph neural networks, including Point-GNN [55], PC-RGNN [56], and PointView-GCN [57], can only handle static point clouds. For time-series point cloud analysis, there is still no suitable graph neural network. This drove us to adopt the multi-view representation and CNNs for time-series point cloud classification. However, we found that the traditional multi-view representation does not improve the performance of HAR, so we tried to use 2D histograms to model the relationship between different views of the point cloud. For classification, we tried to use ReActNet, which achieved the SOTA by a large margin. However, when we tried to adapt ReActNet to 4D time-series point clouds, the dimensions of feature maps between adjacent ReActNet blocks could not be kept consistent. We then directly binarized classic CNN structures (including ResNet and DenseNet), but the results were hardly satisfactory. We realized that direct binarization seriously decreased the information capacity. We learned to keep the first layer and the last layer at full precision to reduce information loss, and the performance improved significantly. Further, we adopted the structures of the dense block and the improvement block to increase the information capacity and quality, which forms the proposed EdgeActNet.


IV. EXPERIMENTS, RESULTS, AND ANALYSIS

We implemented our proposed methods on a point cloud dataset for HAR. We processed the dataset for multi-view representation using 2D histograms. We evaluated and compared the performance of all classifiers not just in terms of accuracy, but also memory consumption and inference time. The majority of research focuses on improving the accuracy of HAR. Model inference at the network edge, however, must take into account both latency and memory footprint, since devices are generally computation- and memory-constrained. In this work, four metrics are used to measure the performance of classifiers in EI: accuracy, F1, delay (inference time), and memory footprint. The accuracy and F1 are calculated from 5-fold cross-validation; they are reported as the mean accuracy/F1 ± standard deviation. The inference time (latency) was computed on a Raspberry Pi 4B with Raspberry Pi OS (32-bit), as the Raspberry Pi is commonly utilized for edge computing [58]–[60]. We take the average inference time over 50 iterations in this work.
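The latency metric (mean single-sample inference time over 50 runs) can be measured with a simple timing loop such as the sketch below; the stand-in model and input shape are placeholders, not the authors' benchmarking code.

```python
import time
import numpy as np
import tensorflow as tf

# Stand-in model and one input sample shaped like a 2D-histogram sequence.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(50, 50, 60, 3)),
    tf.keras.layers.Conv3D(8, 3, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling3D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
sample = np.random.rand(1, 50, 50, 60, 3).astype("float32")

model(sample)                       # warm-up run, excluded from timing
times = []
for _ in range(50):                 # the paper averages over 50 iterations
    t0 = time.perf_counter()
    model(sample, training=False)
    times.append(time.perf_counter() - t0)
print(f"mean latency: {1000 * np.mean(times):.1f} ms "
      f"(std {1000 * np.std(times):.1f} ms)")
```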
A. MMActivity dataset

The MMActivity dataset [5] is the first point cloud dataset for HAR. A TI IWR1443BOOST radar was used to collect the MMActivity dataset [5]. The TI IWR1443BOOST radar is an FMCW (Frequency Modulated Continuous Wave) radar that operates in the 76-81 GHz frequency range. It contains three transmitter and four receiver antennas that estimate both azimuth and elevation angles, enabling object detection in a 3D space [61]. The dataset is collected from 5 different activities performed by multiple participants. These five activities are walking, jumping, jumping jacks, squats, and boxing. The radar uses a 30 frames-per-second sampling rate. The data have been divided into train and test files, with 71.6 minutes of data in the train set and 21.4 minutes of data in the test set. More information about the MMActivity dataset can be found in [5]. We used a sliding window of 2 s (60 frames) with a sliding factor of 0.33 s (10 frames) for point cloud sequence segmentation. Finally, we get 12097 training samples and 3538 testing samples. Figure 8 shows a sample of each activity; each sample contains 60 frames extracted by the sliding window. As can be seen, the radar point cloud is sparse. Each sample is further processed for the 2D histogram-based multi-view representation. The size of the generated 2D histograms is 50 × 50. There are 3 histograms generated from each frame, and there are 60 frames in each sample. Hence, each sample has a shape (input shape) of 50 × 50 × 60 × 3.
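The segmentation described above can be sketched as follows, assuming `recording` is a list of per-frame point arrays and `to_histograms` is the per-frame 2D histogram conversion from Section III-B; the names are illustrative.

```python
import numpy as np

def sliding_windows(frames, window=60, stride=10):
    """Cut a recording (list of per-frame point arrays) into 60-frame samples
    with a 10-frame stride (2 s windows, 0.33 s sliding factor at 30 fps)."""
    return [frames[s:s + window]
            for s in range(0, len(frames) - window + 1, stride)]

def frames_to_sample(window_frames, to_histograms):
    """Stack per-frame 2D histogram maps into one (50, 50, 60, 3) sample."""
    maps = [to_histograms(f) for f in window_frames]   # each (50, 50, 3)
    return np.stack(maps, axis=2)                      # -> (50, 50, 60, 3)

# Example with a fake 10-second recording at 30 fps (300 frames of random points).
recording = [np.random.rand(np.random.randint(5, 40), 3) for _ in range(300)]
windows = sliding_windows(recording)
print(len(windows))   # (300 - 60) / 10 + 1 = 25 samples
```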
B. Evaluation

TABLE I. Performance of the EdgeActNet with different hyperparameters

Model                   | Acc (%)      | F1 (%)       | Memory   | Latency
EdgeActNet              | 97.63 ± 0.09 | 97.44 ± 0.16 | 1.8 MB   | 12.4 ms
EdgeActNet (4,5,4,4)    | 96.86 ± 0.23 | 96.65 ± 0.31 | 3.43 MB  | 19.5 ms
EdgeActNet (4,6,8,6)    | 96.57 ± 0.6  | 96.39 ± 0.47 | 5.49 MB  | 40.7 ms
EdgeActNet (5,8,14,10)  | 97.41 ± 0.15 | 97.22 ± 0.29 | 12.08 MB | 63.5 ms
EdgeActNet FP (2,3,2,2) | 96.32 ± 0.31 | 96.07 ± 0.33 | 30.41 MB | 142.4 ms
EdgeActNet FP (1,2,1,1) | 96.91 ± 0.2  | 96.72 ± 0.24 | 15.04 MB | 107.9 ms
EdgeActNet FP (1,2,1)   | 97.65 ± 0.11 | 97.51 ± 0.14 | 10.96 MB | 89.6 ms
EdgeActNet FP (1,2)     | 97.08 ± 0.16 | 96.83 ± 0.22 | 7.3 MB   | 53.9 ms

We evaluate the proposed EdgeActNet on the MMActivity dataset by using 5-fold cross-validation. As mentioned in Section III-C, the hyperparameters of the EdgeActNet are the block numbers (s1, s2, s3, s4) of each stage. We tried four sets of hyperparameters: (2, 3, 2, 2), (4, 5, 4, 4), (4, 6, 8, 6), and (5, 8, 14, 10). The performance of the EdgeActNet is shown in Table I. The EdgeActNet with the hyperparameters of (2, 3, 2, 2) achieved 97.63% in accuracy and 97.44% in F1, with a memory footprint of 1.8 MB and an inference time of 12.4 ms. Although the EdgeActNet (4, 5, 4, 4) and the EdgeActNet (4, 6, 8, 6) contain more blocks in each stage, they do not improve the classification accuracy and cost more memory and time for human activity inference. The EdgeActNet (5, 8, 14, 10) is only slightly inferior to the EdgeActNet (2, 3, 2, 2) in activity classification accuracy, while it costs 12.08 MB of memory and 63.5 ms for inference. Moreover, we also compared the EdgeActNet with its full-precision version (EdgeActNet FP). The EdgeActNet FP only achieved 96.32% in accuracy and 96.07% in F1, which is more than 1.3% lower than the EdgeActNet (2, 3, 2, 2); and it needs 30.41 MB of memory and takes 142.4 ms for inference. Compared to its full-precision counterpart, the EdgeActNet can save about 16.9× memory consumption and 11.5× inference time. Generally, a full-precision model performs better than its binarized version. We think this is because the full-precision model overfits the MMActivity dataset. To validate this thought, we further investigated the performance of the full-precision models including EdgeActNet FP (1, 2, 1, 1), EdgeActNet FP (1, 2, 1), and EdgeActNet FP (1, 2). The EdgeActNet FP (1, 2, 1, 1) has fewer blocks in each stage; compared to the EdgeActNet FP, its performance improves. The EdgeActNet FP (1, 2, 1) improves further with stage S4 removed and achieves a higher accuracy than the EdgeActNet (2, 3, 2, 2). This suggests that the EdgeActNet FP overfits because of its high complexity. By reducing blocks and stages, the complexity goes down and the overfitting problem is alleviated, which results in better accuracy. However, the performance of the EdgeActNet FP (1, 2) declines when stage S3 is further removed. This is because the model reduces its feature representation capability and underfits when too many neurons are removed. Although the EdgeActNet FP (1, 2, 1) achieved better accuracy than the EdgeActNet (2, 3, 2, 2), it costs 6× memory and 7× time for inference. By overall consideration of accuracy, memory, and latency, the EdgeActNet (2, 3, 2, 2) is still the best option. In the following sections, the hyperparameters of the EdgeActNet default to (2, 3, 2, 2).

Figure 9 shows the validation accuracy and loss of the EdgeActNets.


Fig. 8. The point cloud samples (raw data from 60 frames) of 5 different activities.
Fig. 9. The validation accuracy (a) and loss (b) of the EdgeActNet with different hyperparameters.
Fig. 10. The validation accuracy (a) and loss (b) of the EdgeActNets without three components respectively.

C. Ablation study on the components of the EdgeActNet

The proposed EdgeActNet is composed of 3D group convolutions, binary 3D dense blocks, binary 3D improvement blocks, and binary 3D transition blocks. To evaluate the contribution of each type of component in the EdgeActNet, we removed the 3D group convolutions, binary 3D dense blocks, and binary 3D improvement blocks from the EdgeActNet respectively and compared their performance. As for the binary 3D transition block, it is used to connect the four stages in the EdgeActNet. It can control the number of feature maps generated from the dense blocks. When the dense blocks are removed from the EdgeActNet, it can also serve as the connection between improvement blocks in two adjacent stages. Hence, to keep the architecture of the EdgeActNet, we did not remove the transition block and analyze its contribution.

After we removed the group convolutions, dense blocks, and improvement blocks from the EdgeActNet respectively, we got three new networks: the EdgeActNet without group convolutions (EdgeActNet woGC), the EdgeActNet without dense blocks (EdgeActNet woDB), and the EdgeActNet without improvement blocks (EdgeActNet woIB). The performance of these three networks is shown in Table II. As can be seen, these three components affect the performance of the EdgeActNet at different levels. Compared to the original EdgeActNet, the EdgeActNet woGC loses 1.24% accuracy and saves 0.1 MB of memory, the EdgeActNet woDB loses more than 2% accuracy and saves about 0.9 MB of memory, and the EdgeActNet woIB only loses about 0.5% accuracy and saves 0.54 MB of memory. This shows that the dense blocks contribute the most to the EdgeActNet, and they are also the part with the largest proportion in the EdgeActNet, followed by the group convolutions, while the improvement blocks have the least effect. From the results, all three components improve the performance of the EdgeActNet.

Figure 10 shows the validation accuracy and loss of the three EdgeActNets with the group convolutions, dense blocks, and improvement blocks removed, respectively.


11

TABLE II
P ERFORMANCE OF THE E DGE ACT N ETS WITH THREE COMPONENTS
REMOVED RESPECTIVELY

Model Acc(%) F1(%) Memory Latency


EdgeActNet woGC 96.37±0.5 96.24±0.37 1.7 MB 11.8 ms
EdgeActNet woDB 95.39±0.53 95.31±0.51 957.64 KB 8.7 ms
EdgeActNet woIB 97.17±0.2 97.07±0.19 1.26 MB 11.2 ms

Fig. 12. The validation accuracy (a) and loss (b) of four BNNs.

TABLE III
P ERFORMANCE OF FOUR BNN S

Model Acc(%) F1(%) Memory Latency


EdgeActNet 97.63±0.09 97.44±0.16 1.8 MB 12.4 ms
Birealnet3D 93.97±0.48 93.76±0.63 4.88 MB 28.7 ms
BiResNet3D 92.68±0.35 92.47±0.52 4.65 MB 139.3 ms
Fig. 11. The validation accuracy (a) and loss (b) of the EdgeActNet achieved
BinaryDenseNet3D 96.50±0.54 96.20±0.6 1.84 MB 170.6 ms
on 2D histograms and the raw project views.

D. Ablation study on multi-view representation of point cloud

As we implemented the multi-view representation of point cloud using 2D histograms, it is necessary to find out whether 2D histograms are better than the original projected views for improving the recognition rate of HAR. We re-implemented the EdgeActNet on samples that are constructed from the 3 projected views of the point cloud. Each projected view is resized to 60 × 60, so the input shape is 60 × 60 × 60 × 3. The performance in validation is shown in Figure 11. In the test, the EdgeActNet implemented on the projected views (EdgeActNet+the projected views) only achieved 85.29% in accuracy and 85.14% in F1, which is more than 12% lower than the EdgeActNet implemented on the 2D histograms (EdgeActNet+2D histograms). The comparison shows that 2D histograms provide better performance for time-series point cloud-based HAR, even though the resolution of the 2D histograms (50 × 50) is lower than that of the projected views (60 × 60). We believe this is because 2D histograms not only reflect the spatial distribution of points in the projected views but also represent the correlation between different views. It is worth noting that the usage of 2D histograms also reduces the input data size. The total size of the point cloud samples in [5] reaches 71.68 GB, while the total size of our point cloud samples is 52.5 GB, saving 19.18 GB of storage. Compared to the voxelized representation of point cloud in [5], our input data size is shrunk by about 27%.

Figure 11 shows the validation accuracy and loss of the EdgeActNet+2D histograms and the EdgeActNet+the projected views.
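As a concrete illustration of this representation, the following minimal NumPy sketch (our own; the bin settings follow the 50 × 50 resolution and 60-frame samples described above, while the spatial extents of the field of view are assumptions) turns one point cloud sample into a 60 × 50 × 50 × 3 tensor of xy, xz, and yz occupancy histograms:

```python
import numpy as np

BINS = 50                                     # 2D histogram resolution (50 x 50)
# Assumed spatial extents of the radar field of view in metres.
RANGES = {"x": (-3.0, 3.0), "y": (0.0, 6.0), "z": (-2.0, 2.0)}

def frame_to_histograms(points):
    """points: (N, 3) array of x, y, z coordinates for one radar frame.
    Returns a (50, 50, 3) stack of xy, xz and yz occupancy histograms."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    h_xy, _, _ = np.histogram2d(x, y, bins=BINS, range=[RANGES["x"], RANGES["y"]])
    h_xz, _, _ = np.histogram2d(x, z, bins=BINS, range=[RANGES["x"], RANGES["z"]])
    h_yz, _, _ = np.histogram2d(y, z, bins=BINS, range=[RANGES["y"], RANGES["z"]])
    return np.stack([h_xy, h_xz, h_yz], axis=-1)

def sample_to_tensor(frames):
    """frames: list of 60 (N_i, 3) point arrays -> (60, 50, 50, 3) input tensor."""
    return np.stack([frame_to_histograms(f) for f in frames], axis=0)
```

Each of the three histogram planes corresponds to one of the projected views, so the frame content is summarized without storing the full projected images.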
E. Comparison with other binary networks

Apart from the EdgeActNet, we also implemented three other binary neural networks, BirealNet3D, BiResNet3D, and BinaryDenseNet3D, which are modified from three well-known binary neural networks: BirealNet [62], BiResNet [49], and BinaryDenseNet [49], respectively. BirealNet uses a simple shortcut to enhance the representation capability. BiResNet is a binarized version of ResNet. BinaryDenseNet is a binarized version of DenseNet. The major modification is that we replaced binary convolution layers with binary 3D convolution layers to adapt to the time-series point cloud data. We also used some tricks in network binarization. For example, we keep the first and last layers full-precision to reduce the loss of information flow, and we put batch normalization after instead of before max-pooling [15]. The BiResNet3D and BinaryDenseNet3D implemented in our work contain 18 and 28 convolution layers respectively.

TABLE III
PERFORMANCE OF FOUR BNNS

Model               Acc (%)       F1 (%)        Memory     Latency
EdgeActNet          97.63±0.09    97.44±0.16    1.8 MB     12.4 ms
BirealNet3D         93.97±0.48    93.76±0.63    4.88 MB    28.7 ms
BiResNet3D          92.68±0.35    92.47±0.52    4.65 MB    139.3 ms
BinaryDenseNet3D    96.50±0.54    96.20±0.6     1.84 MB    170.6 ms

Fig. 12. The validation accuracy (a) and loss (b) of the four BNNs.

As shown in Table III, the EdgeActNet outperforms the other three BNNs. The BinaryDenseNet3D follows the EdgeActNet, achieving 96.5% in accuracy and 96.2% in F1 while costing only 0.04 MB more memory than the EdgeActNet; however, its inference time is 13.75× that of the EdgeActNet. The BirealNet3D achieved 93.97% in classification accuracy and 93.76% in F1, and it consumes 4.88 MB of memory and takes 28.7 ms for inference. The BiResNet3D is around 5% lower than the EdgeActNet in both accuracy and F1; its memory consumption is close to that of the BirealNet3D, and its inference time is more than 11× that of the EdgeActNet.

The results show that the EdgeActNet is superior to several current BNNs. It is worth noting that not all BNNs can be adapted to 4D time series. In the future, we will try to modify more BNN benchmarks to provide a more complete comparison.

Figure 12 shows the validation accuracy and loss of the four BNNs.
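To make these binarization tricks concrete, the sketch below is a simplified illustration of one possible TensorFlow implementation (our assumption, not the authors' released code): a sign binarizer with a straight-through estimator and a binary 3D convolution block that places batch normalization after max-pooling, as described above. In a full network the first and last layers would be kept full-precision.

```python
import tensorflow as tf

@tf.custom_gradient
def binarize(x):
    """Forward: sign(x) in {-1, +1}. Backward: clipped identity (STE)."""
    y = tf.where(x >= 0, tf.ones_like(x), -tf.ones_like(x))
    def grad(dy):
        # Pass gradients only where |x| <= 1 (hard-tanh window).
        return dy * tf.cast(tf.abs(x) <= 1.0, x.dtype)
    return y, grad

class BinaryConv3D(tf.keras.layers.Layer):
    """3D convolution with sign-binarized weights and activations."""
    def __init__(self, filters, kernel_size=3, **kwargs):
        super().__init__(**kwargs)
        self.filters, self.k = filters, kernel_size

    def build(self, input_shape):
        self.kernel = self.add_weight(
            name="kernel",
            shape=(self.k, self.k, self.k, int(input_shape[-1]), self.filters),
            initializer="glorot_uniform", trainable=True)

    def call(self, x):
        x_b = binarize(x)            # binary activations
        w_b = binarize(self.kernel)  # binary weights
        return tf.nn.conv3d(x_b, w_b, strides=[1, 1, 1, 1, 1], padding="SAME")

def binary_block(x, filters):
    """BinaryConv3D -> MaxPool3D -> BatchNorm (BN placed after pooling, as in [15])."""
    x = BinaryConv3D(filters)(x)
    x = tf.keras.layers.MaxPool3D(pool_size=2)(x)
    return tf.keras.layers.BatchNormalization()(x)
```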


TABLE IV
PERFORMANCE OF THE EDGEACTNET AND OTHER THREE NETWORKS

Model             Acc (%)       F1 (%)        Memory      Latency
EdgeActNet        97.63±0.09    97.44±0.16    1.8 MB      12.4 ms
ResNet3D          88.72±0.68    88.65±0.75    55.35 MB    1.21 s
SwinTransformer   91.26±0.89    91.45±0.36    33.01 MB    1.24 s
DenseNet3D        94.48±0.5     94.32±0.25    49.43 MB    2.17 s

Fig. 13. The validation accuracy (a) and loss (b) of the EdgeActNet and the other three networks.

TABLE V
PERFORMANCE OF THE EDGEACTNET ON THE DGUHA DATASET

Model              Acc (%)       F1 (%)        Memory     Latency
EdgeActNet+AHC     95.03±0.61    95.03±0.53    1.8 MB     11.7 ms
ST-GCN+AHC [66]    93.79         93.8          2.97 MB    –

Fig. 14. The validation accuracy (a) and loss (b) of the EdgeActNet on the DGUHA dataset.
F. Comparison with full-precision networks

In this section, we investigated three well-known full-precision neural networks, ResNet3D, DenseNet3D, and SwinTransformer, for point cloud-based human activity recognition. These three networks have achieved state-of-the-art results on several benchmark datasets. ResNet3D is a 3D version of ResNet; the ResNet3D implemented here contains 18 convolution layers. DenseNet3D is a 3D version of DenseNet; the DenseNet3D implemented here contains 4 dense blocks with 8, 8, 12, and 10 convolution layers respectively. SwinTransformer has been proposed recently and has achieved state-of-the-art results on several benchmark datasets including ImageNet [63] and COCO [64], which makes it a good option for comparison in time-series point cloud classification. In our work, we select Swin-T (a tiny version of SwinTransformer) [65] to perform human activity classification.

As illustrated in Table IV, the ResNet3D gets 88.72% in accuracy and 88.65% in F1; the SwinTransformer achieves 91.26% in accuracy and 91.45% in F1; the DenseNet3D achieves 94.48% in accuracy and 94.32% in F1. Among these three networks, the DenseNet3D has the highest classification accuracy while it takes the longest time of 2.17 s for inference, and the ResNet3D achieved the lowest classification accuracy with the largest memory consumption (55.35 MB). However, all three full-precision networks are not comparable to the EdgeActNet in accuracy, memory consumption, and inference time, which implies that the architecture of the EdgeActNet is more suitable for time-series point cloud analysis.

G. Validation on the DGUHA dataset

For validating the superiority of the EdgeActNet, it is necessary to evaluate it on a new dataset. Currently, some works have published their radar point cloud datasets, but few of them can be used for HAR, and many open datasets do not have data structure descriptions, which makes them difficult to parse. Fortunately, in [66], the authors published the DGUHA dataset for human activity recognition. The DGUHA dataset includes both point cloud and skeleton data. It was collected from 19 subjects and contains seven human activities: running, jumping, sitting down and standing up, both upper limb extension, falling forward, right limb extension, and left limb extension. The point cloud data in the DGUHA dataset was collected using a mmWave radar sensor, TI's IWR1443BOOST radar, which was mounted parallel to the ground at a height of 1.2 m in the experiment. The sampling rate of the radar was 20 Hz and each activity was performed for 20 s (400 frames). The point cloud is augmented by using three upsampling techniques (Zero-Padding, Gaussian Noise, and Agglomerative Hierarchical Clustering); among them, Agglomerative Hierarchical Clustering (AHC) has the best performance. The dimension of the point cloud samples in the DGUHA dataset is (400, 25, 3, 2). The data has been split into two parts: 80% is used for training and the remaining 20% is used for testing. We trained and tested the EdgeActNet on the point cloud samples of the DGUHA dataset. The results and the comparison are shown in Table V. As can be seen, the EdgeActNet achieves 95.03% in accuracy and F1, which is 1.2% higher than the ST-GCN in [66], and the EdgeActNet is 1.17 MB smaller than the ST-GCN. The results further prove that our EdgeActNet is superior to other models in point cloud-based HAR.

Figure 14 shows the validation accuracy and loss of the EdgeActNet on the DGUHA dataset.
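As a minimal illustration of this evaluation protocol (our own sketch; the file names are hypothetical placeholders, not part of the DGUHA release), the 80/20 split over samples of shape (400, 25, 3, 2) can be reproduced as follows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical file names used only for illustration.
samples = np.load("dguha_point_cloud_samples.npy")  # shape: (num_samples, 400, 25, 3, 2)
labels = np.load("dguha_activity_labels.npy")       # one of the seven activity classes

# 80% of the samples for training, the remaining 20% for testing.
x_train, x_test, y_train, y_test = train_test_split(
    samples, labels, test_size=0.2, random_state=0)
print(x_train.shape, x_test.shape)
```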
H. Analysis

In [5], the best performance is 90.47%, which is achieved by a time-distributed CNN + bidirectional LSTM classifier. This classifier is inferior to our proposed EdgeActNet, and that work did not take the computation consumption into consideration.

In the experiments, we only measured the inference latency of the EdgeActNet. It is necessary to measure the latency of the whole pipeline of point cloud-based HAR on edge devices. Although we cannot implement the whole pipeline since we used the public dataset, it can be estimated by accumulating the latency of data collection, data processing, and model inference. For data collection, the direct output of the radar sensor is point cloud frames; the time consumption of point cloud construction on the radar board is already covered by the sampling rate of 30 Hz. Since we used a sliding window of 2 s (60 frames) with a sliding step of 0.333 s (10 frames) in data collection, the time interval between samples is 0.333 s, which is the latency for data collection. For point cloud data processing, we processed each point cloud frame into 3 2D histograms and each point cloud sample has 60 frames, which generates 180 2D histograms. We measured the average time consumption of the point cloud processing on a Raspberry Pi 4B, which is 134 ms.
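The sketch below (ours; the frame buffering details are assumptions) shows how the sliding window turns the 30 Hz frame stream into overlapping samples, why the data collection latency equals the sliding step, and where the 180 histograms per sample come from:

```python
FRAME_RATE_HZ = 30        # radar outputs point cloud frames at 30 Hz
WINDOW_FRAMES = 60        # 2 s window
STEP_FRAMES = 10          # 0.333 s sliding step
VIEWS_PER_FRAME = 3       # xy, xz, yz histograms per frame

def sliding_windows(frames):
    """Yield overlapping samples of WINDOW_FRAMES frames every STEP_FRAMES frames."""
    for start in range(0, len(frames) - WINDOW_FRAMES + 1, STEP_FRAMES):
        yield frames[start:start + WINDOW_FRAMES]

collection_latency_ms = 1000.0 * STEP_FRAMES / FRAME_RATE_HZ  # ~333.3 ms per new sample
histograms_per_sample = WINDOW_FRAMES * VIEWS_PER_FRAME       # 180 2D histograms
print(collection_latency_ms, histograms_per_sample)
```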


For the model inference on the Raspberry Pi 4, as we list in Table I, the inference latency of the EdgeActNet is 12.4 ms. So the latency of the whole pipeline of our solution is 482.7 ms (333.3 ms + 137 ms + 12.4 ms). The latency requirement of different applications is different. Our work has exceeded many related works that claimed to be real-time; for example, the latencies of HAR in [67]–[69] are 0.506 s, 2.95 s, and 1.5 s, respectively. We believe 482 ms is quite acceptable for most applications such as human-computer interaction, fall detection, and healthcare monitoring. It is worth noting that if we use a smaller sliding step for the sliding window, the latency of data collection will get smaller, and data collection can be processed in parallel with data processing and HAR inference, which will save more time.

The evaluation in our work only illustrates the performance of the EdgeActNet. For evaluating the whole pipeline of HAR, it is worth performing near real-time HAR prediction with inconsistent human activities, which has been done in [9]. The capture of the start or end time of a human activity in the point cloud also requires study. In this work, we primarily focus on the compression and optimization of the HAR model; we expect to continue the above work in our future research.

V. CONCLUSION

In this paper, we built a BNN upon time-series point cloud data for EI-enabled HAR. Firstly, we implemented a 2D histogram-based multi-view representation upon three projected views of point cloud. Then we proposed the EdgeActNet on the generated histograms for human activity classification. The EdgeActNet is a binary neural network that consists of binary 3D dense blocks, binary 3D transition blocks, and binary 3D improvement blocks. Our EdgeActNet achieved average classification accuracies of 97.63% on the MMActivity dataset and 95.03% on the point cloud samples of the DGUHA dataset respectively, with a memory footprint of 1.8 MB and an inference time of around 12 ms. Compared to its full-precision version, it saves 16.9× memory consumption and 11.5× inference time. For comparison, we modified three of the latest BNNs for time-series point cloud classification. The results show that the EdgeActNet outperforms these BNNs and three well-known neural networks including ResNet3D, DenseNet3D, and SwinTransformer. Through the ablation study on the representation of point cloud, it is found that the 2D histogram-based multi-view representation of point cloud can greatly improve the classification rate, by about 12%. Through the analysis of the components in the EdgeActNet, we found that the dense blocks contribute most, followed by the group convolutions, then the improvement blocks. It is worth mentioning that this is the first time a 2D histogram-based multi-view representation has been used for point cloud, and also the first time a 3D BNN has been proposed for time-series point cloud classification.

Our work shows that binary neural networks have great potential to realize HAR on edge devices, especially for large-volume data like point cloud. Since edge intelligence is still in its infancy, our work can also be considered an application case for it.

REFERENCES

[1] L. M. Dang, K. Min, H. Wang, M. J. Piran, C. H. Lee, and H. Moon, "Sensor-based and vision-based human activity recognition: A comprehensive survey," Pattern Recognition, p. 107561, 2020.
[2] N. Golestani and M. Moghaddam, "Human activity recognition using magnetic induction-based motion signals and deep recurrent neural networks," Nature Communications, vol. 11, no. 1, pp. 1–11, 2020.
[3] K. Qian, Z. He, and X. Zhang, "3d point cloud generation with millimeter-wave radar," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 4, no. 4, pp. 1–23, 2020.
[4] O. Schumann, M. Hahn, N. Scheiner, F. Weishaupt, J. F. Tilly, J. Dickmann, and C. Wöhler, "Radarscenes: A real-world radar point cloud data set for automotive applications," arXiv preprint arXiv:2104.02493, 2021.
[5] A. D. Singh, S. S. Sandha, L. Garcia, and M. Srivastava, "Radhar: Human activity recognition from point clouds generated through a millimeter-wave radar," in Proceedings of the 3rd ACM Workshop on Millimeter-wave Networks and Sensing Systems, 2019, pp. 51–56.
[6] F. Luo, S. Poslad, and E. Bodanese, "Temporal convolutional networks for multiperson activity recognition using a 2-d lidar," IEEE Internet of Things Journal, vol. 7, no. 8, pp. 7432–7442, 2020.
[7] A. B. Sargano, P. Angelov, and Z. Habib, "A comprehensive review on handcrafted and learning-based action representation approaches for human activity recognition," Applied Sciences, vol. 7, no. 1, p. 110, 2017.
[8] F. Luo, S. Poslad, and E. Bodanese, "Human activity detection and coarse localization outdoors using micro-doppler signatures," IEEE Sensors Journal, vol. 19, no. 18, pp. 8079–8094, 2019.
[9] ——, "Kitchen activity detection for healthcare using a low-power radar-enabled sensor network," in ICC 2019-2019 IEEE International Conference on Communications (ICC). IEEE, 2019, pp. 1–7.
[10] V. Bianchi, M. Bassoli, G. Lombardo, P. Fornacciari, M. Mordonini, and I. De Munari, "Iot wearable sensor and deep learning: An integrated approach for personalized human activity recognition in a smart home environment," IEEE Internet of Things Journal, vol. 6, no. 5, pp. 8553–8562, 2019.
[11] S. Deng, H. Zhao, W. Fang, J. Yin, S. Dustdar, and A. Y. Zomaya, "Edge intelligence: the confluence of edge computing and artificial intelligence," IEEE Internet of Things Journal, 2020.
[12] X. Zhang, Y. Wang, S. Lu, L. Liu, W. Shi et al., "Openei: An open framework for edge intelligence," in 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2019, pp. 1840–1851.
[13] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, "Edge intelligence: Paving the last mile of artificial intelligence with edge computing," Proceedings of the IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
[14] T. Choudhary, V. Mishra, A. Goswami, and J. Sarangapani, "A comprehensive survey on model compression and acceleration," Artificial Intelligence Review, pp. 1–43, 2020.
[15] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
[16] J. Bethge, C. Bartz, H. Yang, Y. Chen, and C. Meinel, "Meliusnet: Can binary neural networks achieve mobilenet-level accuracy?" arXiv preprint arXiv:2001.05936, 2020.
[17] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[18] Z. Liu, Z. Shen, M. Savvides, and K.-T. Cheng, "Reactnet: Towards precise binary neural network with generalized activation functions," in European Conference on Computer Vision. Springer, 2020, pp. 143–159.
[19] J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu, "Deep learning for sensor-based activity recognition: A survey," Pattern Recognition Letters, vol. 119, pp. 3–11, 2019.
[20] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, "A public domain dataset for human activity recognition using smartphones," in ESANN, 2013.
[21] D. Micucci, M. Mobilio, and P. Napoletano, "Unimib shar: A dataset for human activity recognition using acceleration data from smartphones," Applied Sciences, vol. 7, no. 10, p. 1101, 2017.
[22] M. Malekzadeh, R. G. Clegg, A. Cavallaro, and H. Haddadi, "Protecting sensory data against sensitive inferences," in Proceedings of the 1st Workshop on Privacy by Design in Distributed Systems, 2018, pp. 1–6.


[23] S. Van der Spek, J. Van Schaick, P. De Bois, and R. De Haan, "Sensing human activity: Gps tracking," Sensors, vol. 9, no. 4, pp. 3033–3055, 2009.
[24] W. Wang, A. X. Liu, M. Shahzad, K. Ling, and S. Lu, "Understanding and modeling of wifi signal based human activity recognition," in Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, 2015, pp. 65–76.
[25] V. C. Chen, D. Tahmoush, and W. J. Miceli, Radar Micro-Doppler Signatures. Institution of Engineering and Technology, 2014.
[26] N. Nair, C. Thomas, and D. B. Jayagopi, "Human activity recognition using temporal convolutional network," in Proceedings of the 5th International Workshop on Sensor-based Activity Recognition and Interaction, 2018, pp. 1–8.
[27] B. Zhang, H. Xu, H. Xiong, X. Sun, L. Shi, S. Fan, and J. Li, "A spatiotemporal multi-feature extraction framework with space and channel based squeeze-and-excitation blocks for human activity recognition," Journal of Ambient Intelligence and Humanized Computing, pp. 1–13, 2020.
[28] S. Lohit, Q. Wang, and P. Turaga, "Temporal transformer networks: Joint learning of invariant and discriminative time warping," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12426–12435.
[29] W. Shi and S. Dustdar, "The promise of edge computing," Computer, vol. 49, no. 5, pp. 78–81, 2016.
[30] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 4107–4115.
[31] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun, "Deep learning for 3d point clouds: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[32] G. Bradski and S. Grossberg, "Recognition of 3-d objects from multiple 2-d views by a self-organizing neural architecture," in From Statistics to Neural Networks. Springer, 1994, pp. 349–375.
[33] A. Kanezaki, Y. Matsushita, and Y. Nishida, "Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5010–5019.
[34] X. Wei, R. Yu, and J. Sun, "View-gcn: View-based graph convolutional network for 3d shape analysis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1850–1859.
[35] A. Hamdi, S. Giancola, and B. Ghanem, "Mvtn: Multi-view transformation network for 3d shape recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1–11.
[36] D. Maturana and S. Scherer, "Voxnet: A 3d convolutional neural network for real-time object recognition," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 922–928.
[37] A. Garcia-Garcia, F. Gomez-Donoso, J. Garcia-Rodriguez, S. Orts-Escolano, M. Cazorla, and J. Azorin-Lopez, "Pointnet: A 3d convolutional neural network for real-time object class recognition," in 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016, pp. 1578–1584.
[38] S. Palipana, D. Salami, L. A. Leiva, and S. Sigg, "Pantomime: Mid-air gesture recognition with sparse millimeter-wave radar point clouds," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 1, pp. 1–27, 2021.
[39] D. Salami, R. Hasibi, S. Palipana, P. Popovski, T. Michoel, and S. Sigg, "Tesla-rapture: A lightweight gesture recognition system from mmwave radar sparse point clouds," IEEE Transactions on Mobile Computing, 2022.
[40] A. Sengupta and S. Cao, "mmpose-nlp: A natural language processing approach to precise skeletal pose estimation using mmwave radars," IEEE Transactions on Neural Networks and Learning Systems, 2022.
[41] H. Xue, Y. Ju, C. Miao, Y. Wang, S. Wang, A. Zhang, and L. Su, "mmmesh: towards 3d real-time dynamic human mesh construction using millimeter-wave," in Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, 2021, pp. 269–282.
[42] Y. Gao, N. Ziems, S. Wu, H. Wang, and M. Daneshmand, "Human health activity intelligence based on mmwave sensing and attention learning," in GLOBECOM 2022-2022 IEEE Global Communications Conference. IEEE, 2022, pp. 1391–1396.
[43] J. N. Kapur, P. K. Sahoo, and A. K. Wong, "A new method for gray-level picture thresholding using the entropy of the histogram," Computer Vision, Graphics, and Image Processing, vol. 29, no. 3, pp. 273–285, 1985.
[44] A. S. Abutaleb, "Automatic thresholding of gray-level pictures using two-dimensional entropy," Computer Vision, Graphics, and Image Processing, vol. 47, no. 1, pp. 22–32, 1989.
[45] P. K. Sahoo and G. Arora, "A thresholding method based on two-dimensional renyi's entropy," Pattern Recognition, vol. 37, no. 6, pp. 1149–1161, 2004.
[46] A. Brink, "Thresholding of digital images using two-dimensional entropies," Pattern Recognition, vol. 25, no. 8, pp. 803–808, 1992.
[47] E. A. Yonekura and J. Facon, "Postal envelope segmentation by 2-d histogram clustering through watershed transform," in Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings. IEEE, 2003, pp. 338–342.
[48] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[49] J. Bethge, H. Yang, M. Bornstein, and C. Meinel, "Back to simplicity: How to train accurate bnns from scratch?" arXiv preprint arXiv:1906.08637, 2019.
[50] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
[51] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.
[52] G. Huang, S. Liu, L. Van der Maaten, and K. Q. Weinberger, "Condensenet: An efficient densenet using learned group convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2752–2761.
[53] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," Computer Science, 2014.
[54] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "Tensorflow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[55] W. Shi and R. Rajkumar, "Point-gnn: Graph neural network for 3d object detection in a point cloud," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1711–1719.
[56] Y. Zhang, D. Huang, and Y. Wang, "Pc-rgnn: Point cloud completion and graph neural network for 3d object detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3430–3437.
[57] S. S. Mohammadi, Y. Wang, and A. Del Bue, "Pointview-gcn: 3d shape classification with multi-view point clouds," in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 3103–3107.
[58] J. Hochstetler, R. Padidela, Q. Chen, Q. Yang, and S. Fu, "Embedded deep learning for vehicular edge computing," in 2018 IEEE/ACM Symposium on Edge Computing (SEC). IEEE, 2018, pp. 341–343.
[59] B. Chen, J. Wan, A. Celesti, D. Li, H. Abbas, and Q. Zhang, "Edge computing in iot-based manufacturing," IEEE Communications Magazine, vol. 56, no. 9, pp. 103–109, 2018.
[60] R. Xu, S. Y. Nikouei, Y. Chen, A. Polunchenko, S. Song, C. Deng, and T. R. Faughnan, "Real-time human objects tracking for smart surveillance at the edge," in 2018 IEEE International Conference on Communications (ICC). IEEE, 2018, pp. 1–6.
[61] T. Instruments, "Iwr1443boost evaluation module mmwave sensing solution," User's Guide, May, 2017.
[62] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng, "Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 722–737.
[63] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[64] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[65] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, "Video swin transformer," arXiv preprint arXiv:2106.13230, 2021.
[66] G. Lee and J. Kim, "Mtgea: A multimodal two-stream gnn framework for efficient point cloud and skeleton data alignment," Sensors, vol. 23, no. 5, p. 2787, 2023.
[67] I. E. Jaramillo, J. G. Jeong, P. R. Lopez, C.-H. Lee, D.-Y. Kang, T.-J. Ha, J.-H. Oh, H. Jung, J. H. Lee, W. H. Lee et al., "Real-time human activity recognition with imu and encoder sensors in wearable exoskeleton robot via deep learning networks," Sensors, vol. 22, no. 24, p. 9690, 2022.


[68] R. Parada, K. Nur, J. Melià-Seguí, and R. Pous, "Smart surface: Rfid-based gesture recognition using k-means algorithm," in 2016 12th International Conference on Intelligent Environments (IE). IEEE, 2016, pp. 111–118.
[69] A. Wickramasinghe, R. L. S. Torres, and D. C. Ranasinghe, "Recognition of falls using dense sensing in an ambient assisted living environment," Pervasive and Mobile Computing, vol. 34, pp. 14–24, 2017.

