Failure-Resilient Distributed Inference with Model Compression over Heterogeneous Edge Devices
Li Wang, Senior Member, IEEE, Liang Li, Member, IEEE, Lianming Xu, Xian Peng, and Aiguo Fei

Abstract—The distributed inference paradigm enables the computation workload to be distributed across multiple devices, facilitating the implementation of deep learning based intelligent services in extremely resource-constrained Internet of Things (IoT) scenarios. Yet it raises great challenges to perform complicated inference tasks relying on a cluster of IoT devices that are heterogeneous in their computing/communication capacity and prone to crash or timeout failures. In this paper, we present RoCoIn, a robust cooperative inference mechanism for locally distributed execution of deep neural network-based inference tasks over heterogeneous edge devices. It creates a set of independent and compact student models that are learned from a large model using knowledge distillation for distributed deployment. In particular, the devices are strategically grouped to redundantly deploy and execute the same student model such that the inference process is resilient to any local failures, while a joint knowledge partition and student model assignment scheme is designed to minimize the response latency of the distributed inference system in the presence of devices with diverse capacities. Extensive simulations are conducted to corroborate the superior performance of our RoCoIn for distributed inference compared to several baselines, and the results demonstrate its efficacy in timely inference and failure resiliency.

Index Terms—Edge intelligence, Internet of Things, distributed deep learning inference, knowledge distillation.

Li Wang, Xian Peng, and Aiguo Fei are with the School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications (BUPT), China (Email: [email protected]; [email protected]; [email protected]). Liang Li is with the Frontier Research Center, Peng Cheng Laboratory, China (Email: [email protected]). Lianming Xu is with the School of Electronic Engineering, Beijing University of Posts and Telecommunications (BUPT), China (Email: [email protected]). (Corresponding author: Liang Li) This work was supported by the National Natural Science Foundation of China under grants 62201071 and 62171054.

I. INTRODUCTION

Exciting breakthroughs in deep learning (DL) and the Internet of Things (IoT) have opened up new possibilities for pervasive intelligence at the network edge. This is achieved through the deployment of deep neural networks (DNNs) on various mobile edge devices, in response to the growing need for on-device intelligent services across a wide range of application domains, ranging from intelligent assistants (such as Google Now and Amazon Echo) in smart homes to advanced video analytics in smart cities. Basically, the outstanding performance of DNNs in accurate human-centric content processing notoriously relies on increasingly complex parameterized models that are memory-hungry and computation-intensive, which poses significant challenges to implementing DNNs on embedded IoT devices with limited resources (e.g., memory, central processing units (CPUs), battery, bandwidth). For instance, microcontrollers like Arm Cortex-M, commonly used in IoT applications such as smart healthcare and keyword spotting, typically have limited available memory, around 500 KB. As a result, models like Residual Network-50 (ResNet-50) with 50 convolutional layers, which require over 95 megabytes of memory for storage and involve numerous floating-point multiplications for each image calculation, may not be feasible for deployment on such devices. Therefore, there is a fundamental need for alternative approaches that are more memory-efficient and computationally lightweight to enable fast and IoT-device-friendly DNN inference.

To alleviate the computation burden of edge devices for DNN inference, some model compression techniques exist in the literature, wherein knowledge distillation (KD) stands out as an effective solution to produce a more compact model that can substitute for a complex model [1]. Specifically, KD-based methods train a more compact neural network (a.k.a. student model) with far fewer layers/width to mimic the output of a larger network (a.k.a. teacher model) we want to compress [2]. The basic idea behind KD is to distill the knowledge from the teacher model into the student model, using a distillation algorithm that is optimized for imitation performance. This process involves three essential components: the knowledge, the distillation algorithm, and the teacher-student architecture, which have been improved by researchers from both theoretical and empirical perspectives. However, a lightweight student model that is compatible with extremely resource-constrained IoT devices may not have the necessary capacity to represent the teacher's knowledge, thereby suffering from severe accuracy loss.

Several pioneering works propose to exploit available computation resources within a manageable range for distributed inference, instead of keeping all computation at a single local device, under the emerging edge intelligence paradigm [3]–[7]. As such, by forming a collaborative DNN computing system, the inference workload is partitioned and distributed from the source device to a cluster of devices in proximity via local wireless connections such as WiFi. Jouhari et al. in [8] divide the layers of the given DNN into multiple subsets, each of which is executed on a separate device, with intermediate feature maps transferred between the corresponding devices at runtime. Mao et al. in [9] enable execution parallelism among multiple mobile devices by partitioning the neurons of each layer, where the overlapped parts of the layer inputs need to be transferred among the devices during computation. To alleviate the prohibitively huge communication burden of intermediate exchanges, Bhardwaj et al. in [10] take this approach a step further and design a new distributed inference paradigm called Network of Neural Networks (NoNN), which compresses a large pre-trained 'teacher' deep network into a set of independent 'student' networks via KD. These individual students can then be deployed on separate resource-constrained IoT devices to perform the distributed inference. In this way, only the outputs of the student networks require aggregation for inferring the final result, thereby reducing communication overhead to a large extent.

Despite the benefits of parallelized computational workloads and negligible accuracy loss, there are still a few issues that hinder the efficient execution of DNN inference tasks over massive resource-constrained edge devices. On the one hand, edge devices are usually heterogeneous in their computation capacity and communication condition, which disables those distributed inference schemes that uniformly partition the knowledge of the teacher model and transfer it to student models with the same structure. It is non-trivial to distribute the inference workload over heterogeneous devices for timely response with full utilization of edge resources while alleviating the bottleneck effect from stragglers with insufficient capacities. On the other hand, since the cooperative mechanism parallelizes DNN inference in a distributed manner, the crash of any edge device or a network timeout can result in system breakdown and invalidate the inference result. Such failures are unknown to the task requester a priori and hard to harness proactively for maintaining the inference performance. Therefore, it fundamentally calls for the development of more flexible and adaptable approaches to partitioning the knowledge of the teacher model and distributing it to the student models, as well as designing robust cooperative inference systems that can handle the failure of individual edge devices without compromising the responsiveness or accuracy of the overall system. Achieving these goals will be critical to realizing the full potential of edge intelligence and enabling efficient and scalable distributed inference on resource-constrained edge devices.

In this paper, we develop a failure-resilient model compression and distribution scheme, named RoCoIn, for cooperative deep learning model inference. Our scheme employs a similar parallel workflow as NoNN, where a set of independent student models are distilled from the large model for distributed deployment. In particular, RoCoIn enables knowledge dissemination across heterogeneous devices and ensures the resilience of the cooperative inference system against local failures. With a focus on minimizing response latency, RoCoIn strategically groups devices to redundantly handle the same student model with a resilience guarantee, while incorporating a joint knowledge partition and model assignment method to accommodate devices' diverse capacities without sacrificing inference accuracy. As a result, devices can deploy individual student models with varying complexities for parallel computing, with only the outputs requiring aggregation to infer the final result. RoCoIn lays the groundwork for intermediary interaction-free cooperative inference across heterogeneous edge devices and sets the stage for further enhancements and integration with other resource scheduling policies. Our salient contributions are listed as follows:

• We present RoCoIn, a robust cooperative inference mechanism to enable failure-resilient distributed execution of deep neural network-based inference tasks via local knowledge replication. It distills the knowledge of the deep model into a set of independent and compact models, while enabling edge devices' local deployments with knowledge replication.

• We propose a knowledge assignment algorithm that partitions the knowledge of the original deep model with importance balancing and designates the target locally-deployed models for minimizing the inference latency with an accuracy guarantee. Accommodating the edge devices' diverse capacities (i.e., the processors' computational performance, the memory budget, and the transmission quality), the algorithm integrates similarity-aware device grouping and normalized cut-based knowledge partitioning, where the Kuhn-Munkres method is further used to obtain the optimal device-knowledge-student matching.

• We evaluate the performance of RoCoIn with our knowledge assignment scheme via simulations, which verify the efficacy of our scheme with various data sources and system configurations. Several baselines are implemented to validate the superiority of RoCoIn in terms of timely inference and failure resiliency.

The remainder of the paper is organized as follows. Section II reviews related work on edge inference. Section III elaborates on the RoCoIn design and builds the system model. Section IV formulates the knowledge assignment problem and presents our algorithm. Section V gives the performance evaluation, and Section VI finally concludes the paper.

II. RELATED WORK

A. Model Compression

To enable DNN training or inference relying on resource-limited IoT devices, one possible solution could be to use more optimized and compact deep learning architectures specifically designed for these devices, subject to their precious memory and processing power. This motivates some model compression techniques in the literature to enhance training efficiency or inference efficiency [11]–[13]. While they operate at different stages of the machine learning pipeline and the objectives may differ between the two scenarios, both aim to reduce the computational resources required by the model. As a result, many model compression techniques (such as pruning, quantization, knowledge distillation, etc.) can be adapted or extended to address both training and inference efficiency. The rationale behind this is that deep models usually learn a large number of redundant or useless weights, which can be removed or compactly represented while sacrificing accuracy to a moderate extent. Specifically, pruning and quantization aim to reduce the number of weights and the number of bits required to represent weights in deep networks, respectively [14]. KD-based methods aim to train a more compact neural network (a.k.a. student model) with far fewer layers/width to mimic the output of a larger network (a.k.a. teacher model) we want to compress [15].
Using an ensemble KD technique, Bharadhwaj et al. in [16] improved the performance of tiny vehicle detection models that enable edge device-based vehicle tracking and counting for real-time traffic analytics. Xu et al. in [17] proposed a hybrid KD framework to compress a complex long short-term memory model for machine remaining useful life prediction, which includes a generative adversarial network based knowledge distillation for disparate-architecture knowledge transfer and a learning-during-teaching based knowledge distillation for identical-architecture knowledge transfer. However, the ability of model compression to reduce resource overhead is limited for those IoT devices with extremely constrained capacity, and excessive compression will result in severe degradation of model performance in intelligent data understanding.

B. Cooperative Edge Inference

Some recent efforts have been devoted to mitigating the computational bottleneck at a single edge device by coordinating multiple devices to jointly perform intelligent tasks via proper model/data partition and workload distribution, facilitated by controllable inter-device communication [3]–[7], [9], [18]. Neurosurgeon [19] first proposed to partition a DNN model and offload partial inference workload onto a powerful cloud server for follow-up execution, where the offloading efficiency highly depends on the transmission condition of unstable wide-area network connections. To fully utilize the decentralized resources of massive edge devices, Jouhari et al. in [8] divided the layers of the given DNN into multiple subsets, each of which is executed on a separate device, and the intermediate feature maps are transferred between the corresponding devices at runtime. Considering complex model architectures with a directed acyclic graph (DAG) rather than a chain of layers, Hu et al. in [20] presented EdgeFlow to enable distributed inference of general DAG-structured DNN models. EdgeFlow partitions model layers into execution units and orchestrates the intermediate results flowing through these units to fulfill the complicated layer dependencies. Such model-parallelism paradigms are particularly beneficial when the model is too large to fit in a single device's memory or when certain model components require specialized hardware for efficient computation.

Instead of partitioning the model, Zeng et al. in [21] proposed to split the input data to match the available resources of edge devices, which does not sacrifice model accuracy as it preserves the input data and model parameters of the given DNN model. Data-parallelism schemes can be effective when the model size is manageable and the input data can be easily partitioned into smaller batches. Yet they assume that edge devices have enough memory to accommodate the entire DNN model, which may hinder their applicability. To alleviate the prohibitively huge communication burden of intermediate exchanges while conserving the memory of edge devices, Bhardwaj et al. in [10] designed a new distributed inference paradigm, named Network of Neural Networks (NoNN), that compresses a large pre-trained 'teacher' deep network into a set of independent student networks via KD. These individual students can then be deployed on separate resource-constrained IoT devices to perform the distributed inference, where only the outputs of the student networks require aggregation for inferring the final result. However, NoNN uniformly distributes the learned knowledge into student models with identical structures, which are weak and vulnerable in adapting to the varying network conditions and capabilities among edge devices, especially in cases where some of the devices become stragglers due to crash or timeout issues.

III. SYSTEM MODEL AND ROCOIN WORKFLOW

We describe our target distributed DNN inference system as follows with basic notations. We assume that there is a set of N edge devices, denoted by D = {d_1, d_2, ..., d_n, ..., d_N}, distributed in a restricted area. In particular, for a certain type of inference task, we suppose that there is an edge device serving as a source, which broadcasts the raw data for cooperative computation and aggregates the parallelized outputs from the other devices to generate the final inference result. Taking the face recognition task in a smart home scenario as an example, the raw image is generally captured by an inspection camera that can serve as the source device and trigger the cooperative inference procedures for the task. We denote by d_1 the source device without loss of generality and suppose that d_1 can communicate with all other edge devices via wireless connections. Assume that the channel coefficient is Rayleigh distributed, i.e., h_n ~ CN(0, λ), so that the channel gain g_n = |h_n|^2 follows an exponential distribution. We use a tuple (c_n^core, c_n^mem, r_n^tran, p_n^out) to specify the resource profile of edge device d_n. Here, c_n^core and c_n^mem represent d_n's FLOP and memory budgets for inference tasks, respectively, reflecting the computing capability of d_n at a coarse granularity. For a single device that only processes DNN workloads, c_n^mem is the volume of memory excluding the space taken by the underlying system services, e.g., I/O services, compiler, etc. r_n^tran denotes the wireless transmission rate of the link d_n → d_1, and p_n^out represents its transmission outage probability.
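To make the resource profile concrete, the following is a minimal sketch (not from the paper; the names DeviceProfile and sample_outage_probability, and all numeric ranges, are illustrative assumptions) of how each device's tuple (c_n^core, c_n^mem, r_n^tran, p_n^out) could be represented and how a Rayleigh-fading outage probability could be derived from the exponential channel gain.

```python
import math
import random
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    """Resource profile (c_core, c_mem, r_tran, p_out) of one edge device."""
    c_core: float   # FLOP budget (FLOPs per second)
    c_mem: float    # memory budget in bits, excluding system services
    r_tran: float   # wireless transmission rate of the link d_n -> d_1 (bit/s)
    p_out: float    # transmission outage probability of that link

def sample_outage_probability(mean_gain: float, gain_threshold: float) -> float:
    """Rayleigh fading: |h_n|^2 is exponential with mean `mean_gain`, so the
    outage probability P(|h_n|^2 < gain_threshold) has a closed form."""
    return 1.0 - math.exp(-gain_threshold / mean_gain)

# Example: build profiles for N = 8 heterogeneous devices (arbitrary values).
devices = [
    DeviceProfile(
        c_core=random.uniform(5e6, 30e6),
        c_mem=random.uniform(0.2e6, 4e6) * 8,
        r_tran=random.uniform(0.5e3, 1e3),
        p_out=sample_outage_probability(mean_gain=1.0,
                                        gain_threshold=random.uniform(0.05, 0.3)),
    )
    for _ in range(8)
]
```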

Fig. 1. An illustration of RoCoIn mechanism.

We suppose that the teacher contains several convolutional layers and one or more fully-connected (FC) layers for prediction. RoCoIn employs a similar parallel workflow as NoNN [10], where a collection of lightweight student models that each focus only on a part of the teacher's knowledge are separately deployed on edge devices to perform the distributed inference. Thus, it parallelizes the execution on multiple devices at runtime. The key is to partition the knowledge of the teacher model, which can be achieved by clustering the filters in the teacher's final convolution layer according to their activation patterns and using them to train individual student modules. The rationale behind this is that features for various classes are learned at different filters in CNNs, and the activation patterns reveal how the teacher's knowledge gets distributed at the final convolution layer. While NoNN creates a filter activation network to characterize the distribution pattern of the teacher's knowledge, RoCoIn proposes to optimize knowledge assignment over edge devices with appropriate student architecture selection against any unexpected failures during inference, adapting to heterogeneous edge resources and transmission conditions. We define F = {f_1, f_2, ..., f_m, ..., f_M} as the set of filters belonging to the teacher's final convolution layer. Let P = {P_1, P_2, ..., P_k, ..., P_K} be the set of filter partitions, where P_k ⊂ F for any k. Assume that there are J types of student architectures, denoted by S = {s_1, s_2, ..., s_j, ..., s_J}, with different computation loads R_j (FLOPs) and memory requirements Q_j (bits), which can be selected for deployment on edge devices after learning certain knowledge from the teacher. Note here that, in practice, the edge devices that participate in the cooperative inference task are diverse in their types and functions, and the locally processed results may not be aggregated successfully due to uncertain system factors, e.g., edge device crashes, unexpected channel conditions, concurrent computation tasks, etc. This may greatly damage the performance of the distributed inference system since the source device is oblivious to the uncertainty a priori. To make the system failure-resilient, we propose to assign knowledge to the edge devices with replication, which allows multiple devices to undertake the same part of an inference task by accounting for the potential uncertainty in advance. Here, we use the set G = {G_1, G_2, ..., G_k, ..., G_K} to indicate the collection of edge device groups, where G_k ⊂ D for any k, and there will be a one-to-one matching between each device group and each filter cluster. The main notations used throughout the paper are listed in Table I.

Fig. 1 shows the workflow of our RoCoIn, which consists of an offline setup phase and a runtime execution phase. In the offline setup phase, RoCoIn records the execution profile of each device and creates a cooperative inference plan that determines the knowledge partitions and their student model assignment using the knowledge assignment algorithm. Separate students are then trained to mimic parts of the teacher's knowledge, and they are deployed on the corresponding edge devices according to the cooperation plan to enable parallel execution. The runtime execution phase starts when the source device receives a DNN inference query. In response, the source device establishes connections with the cooperative edge devices according to the cooperation strategy and distributes the input data, e.g., an image, to them. All the participating devices feed the data into their local student models in parallel and generate a portion of the final convolution layer's output. These portions are aggregated by the source device and merged by a fully-connected (FC) layer to yield the final prediction in response to the query. As the teacher's knowledge is redundantly assigned to the students to resist any failures, the source device can initiate the FC-layer execution when receiving a necessary number of disjoint portions, rather than waiting for feedback from all of the devices. Notice that our individual student models, which are well selected to adhere to the heterogeneous memory and FLOP constraints of the edge devices, do not communicate until the final fully-connected layer. Consequently, RoCoIn incurs significantly lower memory, computation, and communication costs, while improving robustness by injecting redundancy into the teacher's knowledge assignment.

We note that the performance of our RoCoIn system strongly relies on appropriate knowledge assignment, wherein the following questions need to be answered: i) What replication rule should be used to determine the edge devices that act as backups for each other? ii) How should the teacher's knowledge be partitioned into disjoint portions catering to the intrinsic characteristics of the teacher model? iii) Which student architectures should be selected for the edge devices, tailored to their diverse resource capacities, while fully learning the corresponding knowledge partition? Notice that the three-fold strategies are closely intertwined, and thus there is a great demand for joint optimization to make full use of edge resources while drawing sufficient knowledge from the teacher, which will be elaborated in the subsequent section.
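As a minimal illustration of the runtime phase described above, the sketch below shows how a source device could collect student outputs and trigger the final FC layer as soon as one replica of every knowledge partition has arrived, instead of waiting for all devices. All names (collect_and_predict, fc_head, the arrival tuples) are hypothetical, not the paper's implementation.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def collect_and_predict(
    arrivals: Iterable[Tuple[int, int, List[float]]],  # (device_id, partition_id, portion)
    num_partitions: int,
    fc_head: Callable[[List[float]], int],             # final FC layer over concatenated portions
) -> int:
    """Aggregate student outputs; start the FC head once every partition
    k = 1..K is covered by at least one replica (early aggregation)."""
    received: Dict[int, List[float]] = {}
    for device_id, partition_id, portion in arrivals:
        # Keep the first replica that arrives for each partition; later replicas
        # from the same device group are redundant and can be ignored.
        received.setdefault(partition_id, portion)
        if len(received) == num_partitions:
            ordered = [received[k] for k in sorted(received)]
            return fc_head(sum(ordered, []))            # concatenate portions, then predict
    raise RuntimeError("not all knowledge partitions were received")
```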

TABLE I
SUMMARY OF NOTATIONS.

D = {d_n}, n = 1..N : set of edge devices
F = {f_m}, m = 1..M : set of filters in the final convolution layer
P = {P_k}, k = 1..K : set of filter partitions
G = {G_k}, k = 1..K : set of device groups
c_n^core : FLOP budget of device d_n
c_n^mem : memory budget of device d_n
r_n^tran : wireless transmission rate of device d_n
p_n^out : transmission outage probability of device d_n
S = {s_j}, j = 1..J : set of available student models
α_kj : binary student assignment indicator
R_j : computation load of student model s_j
Q_j : output size of student model s_j
p^th : transmission failure probability threshold
d^th : device similarity threshold
F : filter activation pattern graph
E = {e_mm'} : edge set of graph F
A = [A_mm'] : weight matrix of graph F
a_m : average activity of filter f_m
z_m : degree of node f_m
Z : degree matrix of graph F
W(P_k, P_k') : cut weight of P_k and P_k'
w(G_k, P_k') : assignment weight of G_k and P_k'
L : Laplacian matrix of graph F
H = [h_mk] : indicator matrix for filter partitioning

IV. KNOWLEDGE ASSIGNMENT SCHEME DESIGN

In this section, we begin with the formulation of the knowledge assignment problem, followed by the elaboration on the design of the knowledge assignment scheme.

A. Problem Formulation

We first introduce a binary variable α_kj, where α_kj = 1 if student architecture s_j is used to learn the knowledge regarding the filter partition P_k, and α_kj = 0 otherwise. In optimizing the knowledge assignment strategy, we aim to minimize the inference completion delay, in spite of some failures in the local output aggregation that may compromise the inference performance. Particularly, it requires us to make joint decisions on device grouping G = {G_1, G_2, ..., G_K}, filter partition P = {P_1, P_2, ..., P_K}, and student assignment α_kj under the constraints of heterogeneous edge resources. Towards this goal, we establish the knowledge assignment problem for our RoCoIn as:

\min_{K,\, \mathcal{G}=\{G_1,...,G_K\},\, \mathcal{P}=\{P_1,...,P_K\},\, \alpha_{kj}} \; \max_{k} \; \min_{n: d_n \in G_k} \; \sum_{j} \alpha_{kj} \left( \frac{C_j^{flops}}{c_n^{core}} + \frac{Q_j}{r_n^{tran}} \right)   (1a)

s.t.  \bigcup_{k} G_k = \mathcal{D},   (1b)
      \bigcup_{k} P_k = \mathcal{F},   (1c)
      G_k \cap G_{k'} = \emptyset, \;\forall k \neq k',\; k, k' \in \{1, ..., K\},   (1d)
      P_k \cap P_{k'} = \emptyset, \;\forall k \neq k',\; k, k' \in \{1, ..., K\},   (1e)
      \prod_{n: d_n \in G_k} p_n^{out} \leq p^{th}, \;\forall k \in \{1, ..., K\},   (1f)
      \sum_{j} \alpha_{kj} C_j^{para} \leq \min_{n: d_n \in G_k} c_n^{mem}, \;\forall k \in \{1, ..., K\},   (1g)
      Loss(\theta_S \,|\, \theta_T) \leq \varepsilon^{th}.   (1h)

Here, the objective function in (1a) represents the inference completion delay, which is blocked by the slowest group of devices that return their local output. Specifically, C_j^flops / c_n^core calculates the execution delay for performing student model s_j at edge device d_n, while Q_j / r_n^tran is the time consumed to transmit its output over the wireless channel. The constraints in (1b)-(1c) enforce that all the devices and the filters are partitioned for distributed inference. Constraints (1d)-(1e) further ensure that an edge device, as well as a filter, can be assigned to no more than one group/partition. Constraint (1f) guarantees that the cumulative transmission failure probability of the devices in the same group does not exceed a threshold p^th, so that the portion of the output corresponding to each device group can be returned for aggregation with a high-reliability guarantee. Constraint (1g) enforces that the memory required to run the student models on the edge devices does not exceed their diverse capacities. Further, (1h) ensures that the student architectures selected for the edge devices can fully learn the teacher's knowledge without sacrificing accuracy, i.e., the loss attained by the student cluster is lower than the threshold ε^th.
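For intuition, here is a small sketch (illustrative only; the function name and data layout are assumptions, not from the paper) that evaluates the completion-delay objective (1a) for a candidate grouping and student assignment: each group's delay is that of its fastest member, and the system delay is determined by the slowest group.

```python
def completion_delay(groups, assignment, devices, students):
    """groups: list of lists of device indices (G_1..G_K).
    assignment: assignment[k] = index j of the student model chosen for group k.
    devices: objects with attributes c_core and r_tran (see DeviceProfile above).
    students: list of dicts with 'flops' (C_j^flops) and 'out_bits' (Q_j).
    Implements max_k min_{n in G_k} (C_j^flops / c_core_n + Q_j / r_tran_n)."""
    group_delays = []
    for k, group in enumerate(groups):
        s = students[assignment[k]]
        # All members of a group run the same student; the first replica that
        # returns its portion determines the group's delay.
        fastest = min(s["flops"] / devices[n].c_core + s["out_bits"] / devices[n].r_tran
                      for n in group)
        group_delays.append(fastest)
    # The final FC layer can only start after every partition has arrived.
    return max(group_delays)
```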

B. Algorithm Design

Considering the intertwined relations among the variables, we decouple the original knowledge assignment problem in (1) and integrate three functional modules into the algorithm, i.e., device grouping, knowledge partition, and student assignment. The three modules run sequentially and determine the decisions on device grouping G = {G_1, G_2, ..., G_K}, filter partition P = {P_1, P_2, ..., P_K}, and student assignment α_kj under the constraints of heterogeneous edge resources, respectively. To elaborate, the algorithm first groups the devices via a modified follow-the-leader procedure based on a well-defined similarity distance. This process ensures that edge devices with similar computational capacities and satisfactory transmission reliability are clustered together to serve as replicas of each other. With the determined number of device groups, the filters from the teacher's final convolution layer are clustered into knowledge partitions of the corresponding quantity. This is facilitated by constructing a weighted filter graph and optimizing it through the normalized cut. After that, the algorithm targets the optimal matching among device groups, knowledge partitions, and student models, thereby minimizing inference delay while preserving accuracy. To reduce computational complexity, we simplify the three-dimensional matching into a bipartite matching problem and integrate the KM algorithm to find its optimum. In the following, we detail the three modules in our knowledge assignment scheme.

1) Device grouping: We first define the capacity similarity of any two edge devices using the Euclidean distance, which is calculated by

sim(d_n, d_{n'}) = \sqrt{(c_n^{mem} - c_{n'}^{mem})^2 + (c_n^{core} - c_{n'}^{core})^2}.   (2)

We use a modified follow-the-leader method to group the edge devices with approximately equal computational capacity, subject to the constraint on transmission reliability. The procedure does not require initializing the number of device groups, and uses an iterative process to compute the cluster centroids. It starts by randomly setting a device as the centroid of the group G_1, which is denoted by \bar{G}_1. Then we calculate its capacity similarity sim(\bar{G}_1, d_n) with the other devices d_n ∈ D \ G_1. The group G_1 involves, one by one, the devices satisfying both sim(\bar{G}_1, d_n) ≤ d^th and \prod_{n: d_n \in G_k} p_n^{out} > p^{th}, and recomputes the centroid repeatedly. The devices that do not meet the conditions for any of the existing groups are regarded as a new group, and all the groups continue to run the same procedure to absorb the remaining unassigned devices. The process is controlled by the distance threshold d^th, which is chosen through trial and error.

2) Knowledge partition: We distribute knowledge from the teacher's final convolution layer to individual students to enable parallel inference. We suppose that the teacher contains several convolutional layers and one or more FC layers for prediction. When passing an image from the validation set through the teacher network, each filter in the final convolution layer produces a certain feature map. Inspired by [10], we use the average activity metric as a measure of the importance of a filter for a given class of images, which is defined as the averaged value of the corresponding output channel of the teacher's final convolution layer. Basically, the higher the average activity of a filter, the more important it is for the classification of some classes of images. Let a_m denote the average activity of a filter m for a given image in the validation set. Then a weighted graph F = (F, E, A) of filter activation patterns can be built with the filters f_m ∈ F, ∀m as nodes, where every two nodes (f_m, f_{m'}), m ≠ m', f_m, f_{m'} ∈ F are connected by an edge e_{mm'} ∈ E with A_{mm'} = \sum_{val} a_m a_{m'} |a_m - a_{m'}| as the weight.

On this basis, the partition of the filters for the teacher's knowledge distribution can be regarded as a K-cut problem on the weighted graph F = (F, E, A), which requires us to split the graph into K sub-graphs. For any filter partition P_k corresponding to a sub-graph, we denote by vol(P_k) the size of P_k. It is defined as vol(P) = \sum_{f_m \in P} z_m, where z_m = \sum_{f_{m'} \in F} A_{mm'} is the degree of the node f_m. We further define the weight of the cut between any two disjoint node sets P_k and P_{k'} as W(P_k, P_{k'}) = \sum_{m \in P_k, m' \in P_{k'}} A_{mm'}. Here, vol(P_k) measures the volume of connections between P_k and the rest of the graph, and W(P_k, P_{k'}) measures the volume of connections between P_k and P_{k'}. Notice that the rule we use to weight the edges encourages connections between very important and less important filters. We would like to partition the filters so that the knowledge of the teacher model is distributed uniformly across the students. To this end, this work applies the normalized cut, a prevalent spectral clustering method, to split the graph with minimized cut weights while encouraging the weights within each sub-graph to be large. In this way, it prevents isolated nodes, i.e., filters of the final convolution layer, from being separated from the rest of the graph. Specifically, for a given number of partitions K, the normalized cut, denoted by Ncut, of K partitions P = {P_1, P_2, ..., P_K} is given by

Ncut(P_1, P_2, ..., P_K) = \frac{1}{2} \sum_{k=1}^{K} \frac{W(P_k, \bar{P}_k)}{vol(P_k)},   (3)

where \bar{P}_k represents the complementary set of P_k, i.e., \bar{P}_k = F − P_k. The minimum of \sum_k 1/vol(P_k) is achieved if all vol(P_k) coincide, which ensures that the partitions are as balanced as possible. To facilitate the solution, we relax the Ncut minimization problem by introducing indicator vectors and discarding the discreteness condition, which fits in with efficient spectral algorithms that find the smallest nonzero eigenvalues of the graph Laplacian and threshold the entries of the corresponding eigenvectors [22]. The relaxed problem is given by

\min_{H \in \mathbb{R}^{M \times K}} \; tr\!\left(H^{T} Z^{-1/2} L Z^{-1/2} H\right) \quad s.t. \; H^{T} H = I.   (4)

Here, the Laplacian matrix L of the graph F is a symmetric matrix defined as L = Z − A, where Z is the degree matrix of the vertices with elements z_m and A is the adjacency matrix. The problem above is a standard trace minimization problem, and thus the filter partition problem is converted into optimizing an M-by-K indicator matrix H whose element h_mk > 0 if the filter f_m belongs to partition P_k and h_mk = 0 otherwise. Basically, the solution is given by the matrix whose columns are the eigenvectors associated with the K smallest eigenvalues of the normalized Laplacian L_sym = Z^{-1/2} L Z^{-1/2}, which can be obtained via the eigenvalue decomposition of L_sym.
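The knowledge-partition step above is, in essence, normalized-cut spectral clustering on the filter graph. The sketch below is a minimal NumPy/scikit-learn illustration (not the authors' code): it assumes a single vector of per-filter average activities is already available (the paper accumulates the weights over the validation set), builds the adjacency A_{mm'} = a_m a_{m'} |a_m − a_{m'}|, forms L_sym, and clusters the rows of the K smallest eigenvectors with k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def partition_filters(avg_activity: np.ndarray, num_partitions: int) -> np.ndarray:
    """avg_activity: length-M vector of average activities a_m of the final-layer filters.
    Returns an array of length M giving the partition index of each filter."""
    a = avg_activity
    # Edge weights A_mm' = a_m * a_m' * |a_m - a_m'| (zero on the diagonal).
    A = np.outer(a, a) * np.abs(a[:, None] - a[None, :])
    np.fill_diagonal(A, 0.0)
    z = A.sum(axis=1)                                   # node degrees z_m
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(z, 1e-12))
    L = np.diag(z) - A                                  # unnormalized Laplacian L = Z - A
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]   # Z^{-1/2} L Z^{-1/2}
    # Eigenvectors of the K smallest eigenvalues form the relaxed indicator matrix H.
    _, eigvecs = np.linalg.eigh(L_sym)
    H = eigvecs[:, :num_partitions]
    # Cluster the rows of H into K partitions (discretization of the relaxed Ncut solution).
    return KMeans(n_clusters=num_partitions, n_init=10).fit_predict(H)
```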

3) Student assignment: Recall that a certain knowledge partition of the teacher can be learned by a proper student model that is replicated across the edge devices in each device group, such that the devices can generate output replicas of the students against any local failures. Based on the device grouping and knowledge partition strategies obtained before, assigning students across the edge devices calls for a three-dimensional matching among device groups, knowledge partitions, and student models.

We would like to form a device-knowledge-student assignment in a manner that maximizes the inference accuracy while minimizing both the computation time (R_j / c_n^core) and the communication time (Q_j / r_n^tran). Basically, a student model with a relatively complex structure and more parameters is more powerful in learning and mimicking a large-sized knowledge partition, while the selected models are forced to meet the specific memory and FLOP constraints of the edge devices. Let C^para(P_k) indicate the size of the knowledge partition P_k. The accuracy performance can then be interpreted by the ratio between the size of any student model s_j, ∀j, and the size of any knowledge partition P_k, ∀k, denoted as R_j / C^para(P_k). It reflects the efficacy of using appropriately sized student models for learning specific knowledge, as employing a larger model for a smaller knowledge partition often yields better representation performance.

Notice that, for a fixed device-knowledge pair, we can find the most appropriate student model from the student set S by optimizing the accuracy-latency trade-off under the device's hardware constraints. Thus, the three-dimensional matching problem can be reduced to a bipartite matching between K device groups and K knowledge partitions. We first narrow the set of applicable student models for each device group according to the memory constraints of the devices, where the narrowed set is denoted by S_k with S_k ⊂ S. Accordingly, the device group G_k may have one possible link with the k'-th knowledge partition, where the edge weight is defined as

w(G_k, P_{k'}) = \max_{s_j \in S_k} \frac{R_j}{C^{para}(P_{k'}) \left( \frac{R_j}{c_n^{core}} + \frac{Q_j}{r_n^{tran}} \right)},   (5)

wherein the weight is determined by the maximum accuracy-to-delay performance achievable across all feasible student model structures for device group G_k.

After constructing the weighted bipartite graph, the well-known Kuhn-Munkres algorithm can be used to give the optimal one-to-one pairing between device groups and knowledge partitions that maximizes the sum weight. The detailed matching process is summarized in Algorithm 1.

Algorithm 1 Knowledge Assignment Algorithm
Input: Capacity similarity threshold d^th; transmission failure probability threshold p^th
Output: Device groups G = {G_1, ..., G_k, ..., G_K}; filter partitions P = {P_1, ..., P_k, ..., P_K}; assignment strategy α_kj, ∀k, j
Initialization: G_1 ← d_1, G ← G_1, K = 1, P ← ∅
1: Device grouping:
2: Compute the centroid of every group G_k as \bar{G}_k, ∀k = 1, ..., K
3: for n = 2, 3, ..., N do
4:   Compute sim(\bar{G}_k, d_n), ∀k = 1, ..., K
5:   for k = 1 → K do
6:     if sim(\bar{G}_k, d_n) ≤ d^th and \prod_{n: d_n \in G_k}(1 − p_n) ≤ p^th then
7:       G_k ← G_k ∪ {d_n}, update \bar{G}_k
8:       break
9:   if d_n is unassigned then
10:      G_{K+1} ← {d_n}, update \bar{G}_{K+1}
11:      G ← G ∪ G_{K+1}, K ← K + 1
12: Knowledge partition:
13: Construct F = (F, E, A) with A_{ij} = A_{ji} = a_i a_j |a_i − a_j| as the weight of the edge e_{ij}
14: Compute the degree matrix of F as Z
15: Compute the normalized Laplacian L_sym ← Z^{-1/2} L Z^{-1/2}, where L = Z − A
16: Compute the K smallest eigenvectors u_1, ..., u_K of L_sym
17: Compute the indicator matrix H ← [u_1, ..., u_K], H ∈ R^{M×K}
18: Cluster the rows of H into K clusters using K-means and generate the corresponding knowledge partition strategy P = {P_1, ..., P_K}
19: Student assignment:
20: Construct S_k for every device group G_k
21: for k, k' = 1, ..., K do
22:   Compute w(G_k, P_{k'}) through (5)
23: Use the KM algorithm to obtain the assignment matrix Λ
24: for k = 1, ..., K do
25:   Pick s_j from S_k with the maximum value of R_j / (C^para(P_{k'}) (R_j / c_n^core + Q_j / r_n^tran)) and obtain the student model assignment strategy α_kj for the device group G_k
26: return G, P, S, α

4) Complexity analysis: Algorithm 1 consists of three main functional modules, i.e., device grouping, knowledge partition, and student assignment, each contributing to the overall complexity of the algorithm. Device grouping employs follow-the-leader clustering procedures, typically scaling as O(NK). The knowledge partition process involves the normalized cut algorithm with complexity O(M^2). To optimize the student assignment strategy efficiently, we reduce the three-dimensional matching among devices, filters, and students to a bipartite matching, which is optimally solvable using the Kuhn-Munkres algorithm with complexity O(K^3). It is worth noting that this complexity can potentially be further reduced by employing alternative low-complexity bipartite matching algorithms, albeit with varying degrees of performance sacrifice. Consequently, the overall complexity of Algorithm 1 is O(max(NK, M^2, K^3)).
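As a small illustration of the student-assignment step (Algorithm 1, lines 19-23), the following sketch builds a weight matrix in the spirit of Eq. (5) and solves the group-partition pairing with the Hungarian (Kuhn-Munkres) method via scipy.optimize.linear_sum_assignment. The data layout, the helper name, and the choice of evaluating the weight at the group's slowest member are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_students(groups, partition_size, students, devices):
    """groups: list of lists of device indices; partition_size[k'] = C^para(P_k').
    students: list of dicts with 'R' (FLOPs), 'Q' (output bits), 'mem' (memory footprint).
    Returns (pairing, chosen) with pairing[k] = k' and chosen[k] = student index j."""
    K = len(groups)
    W = np.zeros((K, K))
    best = {}
    for k, group in enumerate(groups):
        mem_cap = min(devices[n].c_mem for n in group)
        feasible = [j for j, s in enumerate(students) if s["mem"] <= mem_cap]  # the set S_k
        # Assumption: evaluate the delay term at the slowest member of the group.
        n = min(group, key=lambda i: devices[i].c_core)
        for k_p in range(K):
            scores = [students[j]["R"] / (partition_size[k_p] *
                      (students[j]["R"] / devices[n].c_core +
                       students[j]["Q"] / devices[n].r_tran))
                      for j in feasible]
            W[k, k_p] = max(scores)
            best[(k, k_p)] = feasible[int(np.argmax(scores))]
    # Kuhn-Munkres maximizes the sum weight (scipy minimizes, hence the sign flip).
    rows, cols = linear_sum_assignment(-W)
    pairing = {int(k): int(k_p) for k, k_p in zip(rows, cols)}
    return pairing, {k: best[(k, pairing[k])] for k in pairing}
```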

V. PERFORMANCE EVALUATION

In this section, we evaluate the performance of the proposed cooperative inference mechanism RoCoIn via extensive simulations. Particularly, we compare RoCoIn with several baselines in terms of model complexity, failure resiliency, and inference latency under various system conditions. We also explore the impact of the cumulative transmission failure probability threshold and the average transmission success probability on the inference latency.

A. Evaluation Setup

We use the CIFAR-10 and CIFAR-100 datasets in our experiments, which are two of the most widely used datasets for machine learning research. The CIFAR-10 dataset contains 60,000 32x32 color images in 10 different classes, while the CIFAR-100 dataset consists of 100 classes with 20 superclasses and makes the image classification task more complex to learn than that of CIFAR-10. We use WideResNet-16-4 and WideResNet-28-10 as the teacher networks, trained on the CIFAR-10 and CIFAR-100 datasets, respectively, for image classification applications. For the student networks, there are two types of backbone models available to select, i.e., MobileNet and WideResNet. MobileNet is a typical lightweight deep neural network specially designed for edge devices, and WideResNet is a variant of ResNet with decreased depth and increased width, which is far superior to its commonly used thin and very deep counterparts. We set S = {WideResNet-22-1, WideResNet-16-1, MobileNet-v2} for CIFAR-10 and S = {WideResNet-16-3, WideResNet-16-2, WideResNet-22-1} for CIFAR-100, where MobileNet-v2 and WideResNet-16-3 have the minimum and maximum computational loads and memory footprints, respectively. For all simulations, the number of edge devices that cooperatively perform the DNN inference task is set to 8. We assume that the computational capacity of each edge device ranges from 5M to 30M FLOPS, and randomly set the transmission rate r_n in the range [0.5, 1] kbps for each device. Here, the student models are trained using the following loss function:

Loss(\theta_S) = \underbrace{(1-\alpha)\, H(y, P_S) + \alpha\, H(P_T^{\tau}, P_S^{\tau})}_{\text{KD loss}} \;+\; \underbrace{\beta \sum_{P_k \in \mathcal{P}} \left\| \frac{v_T^F(p)}{\|v_T^F(p)\|} - \frac{v_S^F(p)}{\|v_S^F(p)\|} \right\|_2^2}_{\text{AT loss}},   (6)

where the first term is the standard knowledge distillation loss integrating hard- and soft-label cross-entropy losses, and the last term is the activation-transfer loss that reflects the error between the activations of the teacher's filters belonging to the given partition and the activations of the filters in the corresponding student.
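A minimal PyTorch-style sketch of a training objective in the spirit of (6) is shown below: it combines hard/soft-label distillation with an activation-transfer term over the student's assigned filter partition. The hyperparameter names (alpha, beta, tau) follow (6), but the function name, tensor shapes, and the use of KL divergence for the soft-label term are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def student_distillation_loss(student_logits, teacher_logits, labels,
                              student_act, teacher_act_partition,
                              alpha=0.9, beta=1e3, tau=4.0):
    """student_act: student's final-conv activations, one row per sample.
    teacher_act_partition: teacher activations of the filters in this student's partition."""
    # KD loss: (1 - alpha) * hard-label CE + alpha * soft-label term at temperature tau.
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                    F.softmax(teacher_logits / tau, dim=1),
                    reduction="batchmean") * (tau * tau)
    kd_loss = (1.0 - alpha) * hard + alpha * soft
    # AT loss: squared L2 distance between normalized activation vectors (last term of (6)).
    vs = F.normalize(student_act.flatten(1), dim=1)
    vt = F.normalize(teacher_act_partition.flatten(1), dim=1)
    at_loss = (vs - vt).pow(2).sum(dim=1).mean()
    return kd_loss + beta * at_loss
```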

In the runtime stage, we launch an image classification task on one image from CIFAR-10 or CIFAR-100 for the IoT edge cluster, and take the average inference latency and accuracy over 100 repeated trials as the results. Particularly, we compare the performance of our RoCoIn against the following baselines:

1) RoCoIn-G employs a similar cooperative inference workflow as RoCoIn, but adopts a simple heuristic method to decide the knowledge assignment.
2) NoNN partitions the knowledge of the model equally and generates student models with the same architecture via knowledge distillation [10].
3) HetNoNN improves NoNN by distributing the teacher's knowledge based on the devices' memory and computing capabilities, but overlooks device grouping for resisting communication failures.
4) Teacher gives the performance of the original large model, which preserves the highest accuracy without knowledge loss but cannot be locally deployed on edge devices.

B. Evaluation Results

We first validate the superiority of distributed DNN inference in alleviating the IoT devices' computational load and evaluate the performance of the proposed RoCoIn scheme through a comparison with the other baselines. In Tables II and III, we summarize the specifications of the locally-deployed DNN models with the different schemes for performing a certain inference task.

TABLE II
RESULTS OF IMAGE CLASSIFICATION ON THE CIFAR-10 DATASET.

Method   | Model           | Parameters (largest) | FLOPs (largest) | Accuracy
Teacher  | WideResNet-16-4 | 2.75M                | 507.84M         | 91.86%
RoCoIn   | WideResNet-22-1 | 0.28M                | 48.58M          | 91.62%
RoCoIn-G | WideResNet-22-1 | 0.28M                | 48.58M          | 91.40%
HetNoNN  | WideResNet-22-1 | 0.28M                | 48.58M          | 91.51%
NoNN     | WideResNet-16-1 | 0.18M                | 34.25M          | 91.32%

TABLE III
RESULTS OF IMAGE CLASSIFICATION ON THE CIFAR-100 DATASET.

Method   | Model            | Parameters (largest) | FLOPs (largest) | Accuracy
Teacher  | WideResNet-28-10 | 36.5M                | 10.9G           | 74.66%
RoCoIn   | WideResNet-16-3  | 1.56M                | 575.3M          | 72.42%
RoCoIn-G | WideResNet-16-3  | 1.56M                | 575.3M          | 72.31%
HetNoNN  | WideResNet-16-3  | 1.56M                | 575.3M          | 71.68%
NoNN     | WideResNet-16-2  | 0.71M                | 260.1M          | 70.78%

As evidenced, the distributed inference solutions result in student models with significantly fewer parameters and lower computational loads compared to the original teacher model, allowing them to fit within the resource-constrained IoT devices. We notice that NoNN may result in smaller model sizes compared to RoCoIn, which is primarily due to the following factors: i) NoNN mandates that all edge devices deploy student models with identical structures, rendering it bottlenecked by low-end devices with small memory budgets. Thus, an extremely sparse student model will be selected for all the devices to cater to those "stragglers", even though the majority of devices can handle denser models with higher accuracy. ii) RoCoIn ensures the resilience of the cooperative inference system by strategically introducing redundancy when distributing the knowledge to the edge devices. With a fixed number of devices, the amount of "knowledge" that needs to be learned by individual devices increases, which may require denser student models to attain satisfactory accuracy. Despite saving memory and FLOPS, NoNN considerably degrades the accuracy performance as it applies lightweight student models to all the devices subject to the tightest capacity constraint. It shows that RoCoIn maintains high classification accuracy to the greatest extent among all the distributed solutions while ensuring lightweight computation and memory overhead. This is attributed to the appropriate knowledge assignment, which motivates the powerful devices to deploy parameter-rich student models that learn complex and important knowledge partitions. Fig. 2 further reveals the training performance of the networks of student models produced by the different knowledge assignment schemes, where the accuracy and loss are calculated by aggregating the students' outputs and yielding the final predictions. The results validate that the test accuracy achieved by our RoCoIn is consistently higher than the baselines on both the CIFAR-10 and CIFAR-100 datasets.

Fig. 2. Training performance: (a) CIFAR-10; (b) CIFAR-100.
Fig. 3. RoCoIn's performance under different system configurations: (a) inference latency; (b) inference accuracy.

Basically, a high transmission requirement, i.e., a small transmission failure probability threshold p^th, for device grouping could increase the number of student replicas and thus result in high failure resilience and low resource utilization. Fig. 3(a) depicts the runtime inference latency under different configurations. We observe that the inference latency is non-increasing with the growing average success probability of the devices under different probability thresholds p^th. One reason for this is that, with a fixed threshold p^th, favorable communication conditions for devices can not only speed up the aggregation of local outputs but also potentially divide the teacher's knowledge into smaller partitions for more diverse distribution among devices, which reduces the need for redundant backup devices and ultimately decreases computational latency. A similar effect can also be achieved by increasing the threshold p^th, as verified in Fig. 3(a). As a consequence, however, the robustness of the RoCoIn system will be compromised when there are not sufficient replicas in place to compensate for the loss of any student model's outputs. This conclusion can be drawn from Fig. 3(b). In essence, an extremely small p^th can make RoCoIn unable to find a feasible device grouping solution due to the strict target on the groups' cumulative transmission reliability. Yet the value of p^th can be designated to strike a balance between robustness and latency in practice.

To further illustrate the impact of p^th on the robustness of the RoCoIn system, we fix the average success probability of the devices at 0.8 in this simulation and examine the inference accuracy of RoCoIn in the presence of local failures under different thresholds p^th. Fig. 3(b) gives the comparison results for different probability thresholds p^th, and the computational loads and parameters of the corresponding student networks are summarized in Fig. 4. Here, we use S-Total and S-Valid to represent all student models including replicas and the vital student models excluding replicas, respectively. A larger ratio of the valid value to the total value means better resource utilization efficiency. As can be observed in Fig. 3(b) and Fig. 4, a smaller p^th achieves better failure resilience at the cost of lower resource utilization, which coincides with our basic design idea of RoCoIn. This result demonstrates that the transmission failure probability threshold p^th plays an important role in the performance of RoCoIn.
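To make the role of p^th concrete, here is a small sketch (hypothetical helpers, not from the paper) that checks the grouping reliability constraint (1f): the probability that every replica in a group fails must stay below the threshold, so a smaller p^th forces larger, more redundant groups.

```python
import math

def group_outage(p_out_list):
    """Probability that *all* members of a group fail to return their portion,
    i.e., the product of per-device outage probabilities in constraint (1f)."""
    return math.prod(p_out_list)

def group_is_reliable(p_out_list, p_th):
    return group_outage(p_out_list) <= p_th

# Example: two replicas with 30% outage each already satisfy p_th = 0.1,
# whereas a single such device (outage 0.3) would not.
assert group_is_reliable([0.3, 0.3], p_th=0.1)
assert not group_is_reliable([0.3], p_th=0.1)
```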

Fig. 4. Student model profiles for the different redundancy modes.
Fig. 5. Inference accuracy with failed devices (with known failure probabilities): (a) CIFAR-10; (b) CIFAR-100.

Fixing p^th to 0.25 and the average success probability to 0.7, we then examine the robustness of the distributed inference schemes against cases in which some devices become unavailable due to power depletion or communication failure, as shown in Fig. 5 and Fig. 6. Particularly, all schemes in Fig. 5 are configured with prior knowledge of the device failure probabilities, whereas Fig. 6 illustrates a more realistic scenario in which all schemes operate with unknown device failure probability distributions. Here, we emulate such local failures by simply zeroing out the inference results of the devices that are considered to experience failures when performing the global aggregation. Then we study the impact of eliminating different numbers of devices. For each setting, we randomly select from the device set a certain number of devices to delete and repeat this for 30 trials, based on which we obtain the averaged inference accuracy to compare the failure-resilience performance of the different distributed inference schemes. The results in Fig. 5 show that the absence of several devices degrades the cooperative inference accuracy for all the schemes, wherein our RoCoIn exhibits the most favorable performance in maintaining desirable accuracy. It is shown that, even if half of the devices fail to contribute to the inference outcome, RoCoIn keeps the classification accuracy over 88% and 64% for CIFAR-10 and CIFAR-100, respectively. That is to say, RoCoIn provides a failure-resilience guarantee when some of the local outputs get lost due to timeouts or crashes. In contrast, HetNoNN and NoNN are more sensitive to local failures, resulting in significant accuracy drops as the number of failed devices increases. We attribute this to the fact that RoCoIn tends to strategically group the devices for student replication and hence exhibits higher resilience. In the case where prior knowledge of the device failure probabilities is unavailable, as shown in Fig. 6, RoCoIn exhibits a more significant performance gain compared to the baselines, due to its advantage of proactive replica deployment. This further underscores that in practical wireless distributed inference systems with environmental randomness, appropriate knowledge assignment with on-demand replication can effectively mitigate the detrimental impact of local errors on the overall inference performance.
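The failure emulation used in these experiments can be summarized by the following sketch (illustrative only; the data layout and names are assumptions): the outputs of the devices drawn as "failed" are zeroed out before the global aggregation, which is how the accuracy-versus-failed-devices curves above are obtained.

```python
import random

def aggregate_with_failures(portions, num_failed, fc_head, trials=30, seed=0):
    """portions: dict partition_id -> list of (device_id, output vector) replicas.
    Emulates crashes by zeroing the outputs of `num_failed` randomly chosen devices,
    then aggregates the surviving portions and returns the predictions over `trials` runs."""
    rng = random.Random(seed)
    device_ids = [d for reps in portions.values() for d, _ in reps]
    predictions = []
    for _ in range(trials):
        failed = set(rng.sample(device_ids, num_failed))
        merged = []
        for pid in sorted(portions):
            # Use any surviving replica for this partition; a zero vector if all replicas failed.
            alive = [vec for d, vec in portions[pid] if d not in failed]
            merged.extend(alive[0] if alive else [0.0] * len(portions[pid][0][1]))
        predictions.append(fc_head(merged))
    return predictions
```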

Fig. 6. Inference accuracy with failed devices (without known failure probabilities): (a) CIFAR-10; (b) CIFAR-100.
Fig. 7. Inference latency under heterogeneous environments.

TABLE IV
LEVELS OF HETEROGENEITY.

Heterogeneity level      | 0 | 1   | 2   | 3   | 4   | 5
Range of FLOPS (M)       | 0 | 10  | 15  | 20  | 25  | 30
Range of data rate (bps) | 0 | 100 | 200 | 300 | 400 | 500

Fig. 7 further evaluates the impact of the heterogeneity of the devices' computational capacity and communication condition on the inference latency. We define a "heterogeneity level" to control the variation range of the computing capability (FLOPS) and transmission rate among the devices. Here, we set six levels of heterogeneity, as described in Table IV, and randomly distribute the processing speed and transmission rate of each device within the corresponding range. Fig. 7 elucidates that a high level of device heterogeneity has a negative impact on cooperative inference and impairs time efficiency. Among the schemes tested, NoNN brings out the worst performance, especially in the cases of high heterogeneity, since it uniformly partitions and distributes the teacher's knowledge to the devices, ignoring their diverse capacities for handling workloads. In contrast, our proposed RoCoIn scheme, which integrates heterogeneity-aware knowledge assignment, outperforms the other baselines in overcoming the straggler issue in parallelized inference systems, regardless of the heterogeneity level. RoCoIn allows each device to run a well-selected student model that accommodates its computing and memory capacity, exhibiting greater adaptability than the others in coping with scenarios of high heterogeneity across devices while maintaining a low inference latency.

We also apply our RoCoIn scheme to an object detection task to assess its universality, utilizing the Yolov5 model [23] and the VisDrone dataset [24]. This dataset comprises 288 video clips captured by various drone-mounted cameras, with manually annotated bounding boxes of targets such as pedestrians, cars, bicycles, and tricycles. To generate student models, we distill and parallelize the compute-intensive layers of the Yolo backbone and neck modules to improve model compression efficiency. Here, Yolov5-BC is a student architecture modified from Yolov5 with a compressed backbone module, while Yolov5-BNC compresses both the backbone and neck modules. We evaluate the performance of RoCoIn with 2 devices and 3 devices, respectively, and present the results in Table V. We observe that our RoCoIn consistently reduces memory and computational costs to a certain extent thanks to computation parallelization, even with the more complex architecture of the Yolov5 model. Although Yolov5-BC necessitates maintaining a relatively large student model at each device, it achieves higher inference accuracy compared to Yolov5-BNC. It can be envisioned that, for complex DNN tasks with intricate model architectures, RoCoIn enables developers to determine which modules should be compressed and parallelized to strike a balance between accuracy and costs.

TABLE V
RESULTS OF OBJECT DETECTION ON THE VISDRONE2019 DATASET.

Method             | Model      | Parameters (per device) | FLOPs (per device)    | mAP(0.5)
Teacher            | Yolov5-s   | 7.23M                   | 16.6G                 | 48%
RoCoIn (2 devices) | Yolov5-BC  | 4.97M / 4.97M           | 11.2G / 11.2G         | 41%
RoCoIn (2 devices) | Yolov5-BNC | 1.76M / 1.76M           | 2.07G / 2.07G         | 28.2%
RoCoIn (3 devices) | Yolov5-BNC | 1.76M / 0.98M / 0.98M   | 2.07G / 1.27G / 1.27G | 28.5%

VI. CONCLUSION

In this work, we have presented the RoCoIn scheme to enable failure-resilient distributed inference across multiple resource-constrained edge devices for offering deep neural network-based services. Considering the heterogeneous computing and communication capacities of the devices, we have proposed to partition the knowledge of the original large model into independent modules and assign the computation workload of every knowledge module to edge devices with compressed student models, aiming to minimize the response latency of the distributed inference system.

To make the cooperative inference system resilient to local failures, we use a clustering-based method to group the devices for redundantly deploying the same student model and performing the corresponding computation workload. Extensive simulations have been conducted to evaluate RoCoIn's performance. The results have shown that the proposed mechanism exhibits great potential in accommodating the heterogeneity of edge devices and improving the system's robustness against local crash or timeout failures.

REFERENCES

[1] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, "Model compression and acceleration for deep neural networks: The principles, progress, and challenges," IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126–136, 2018.
[2] J. Gou, B. Yu, S. J. Maybank, and D. Tao, "Knowledge distillation: A survey," International Journal of Computer Vision, vol. 129, pp. 1789–1819, 2021.
[3] C. Hu, W. Bao, D. Wang, and F. Liu, "Dynamic adaptive DNN surgery for inference acceleration on the edge," in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications. IEEE, 2019, pp. 1423–1431.
[4] F. Xue, W. Fang, W. Xu, Q. Wang, X. Ma, and Y. Ding, "EdgeLD: Locally distributed deep learning inference on edge device clusters," in 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS). IEEE, 2020, pp. 613–619.
[5] T. Mohammed, C. Joe-Wong, R. Babbar, and M. Di Francesco, "Distributed inference acceleration with adaptive DNN partitioning and offloading," in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications. IEEE, 2020, pp. 854–863.
[6] L. Zhang, J. Wu, S. Mumtaz, J. Li, H. Gacanin, and J. J. Rodrigues, "Edge-to-edge cooperative artificial intelligence in smart cities with on-demand learning offloading," in 2019 IEEE Global Communications Conference (GLOBECOM). IEEE, 2019, pp. 1–6.
[7] R. Schlegel, S. Kumar, E. Rosnes, and A. G. i Amat, "Privacy-preserving coded mobile edge computing for low-latency distributed inference," IEEE Journal on Selected Areas in Communications, vol. 40, no. 3, pp. 788–799, 2022.
[8] M. Jouhari, A. K. Al-Ali, E. Baccour, A. Mohamed, A. Erbad, M. Guizani, and M. Hamdi, "Distributed CNN inference on resource-constrained UAVs for surveillance systems: Design and optimization," IEEE Internet of Things Journal, vol. 9, no. 2, pp. 1227–1242, 2021.
[9] J. Mao, X. Chen, K. W. Nixon, C. Krieger, and Y. Chen, "MoDNN: Local distributed mobile computing system for deep neural network," in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. IEEE, 2017, pp. 1396–1401.
[10] K. Bhardwaj, C.-Y. Lin, A. Sartor, and R. Marculescu, "Memory- and communication-aware model compression for distributed deep learning inference on IoT," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, pp. 1–22, 2019.
[11] Z. Wang, T. Luo, R. S. M. Goh, and J. T. Zhou, "EDCompress: Energy-aware model compression for dataflows," IEEE Transactions on Neural Networks and Learning Systems, 2022.
[12] L. Wang, L. Xiang, J. Xu, J. Chen, X. Zhao, D. Yao, X. Wang, and B. Li, "Context-aware deep model compression for edge cloud computing," in 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2020, pp. 787–797.
[13] L. Li, D. Shi, R. Hou, H. Li, M. Pan, and Z. Han, "To talk or to work: Flexible communication compression for energy efficient federated learning over heterogeneous mobile edge devices," in Proc. of IEEE Conference on Computer Communications (INFOCOM), Virtual Conference, May 2021.
[14] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[15] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[16] M. Bharadhwaj, G. Ramadurai, and B. Ravindran, "Detecting vehicles on the edge: Knowledge distillation to improve performance in heterogeneous road traffic," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3192–3198.
[17] Q. Xu, Z. Chen, K. Wu, C. Wang, M. Wu, and X. Li, "KDnet-RUL: A knowledge distillation framework to compress deep neural networks for machine remaining useful life prediction," IEEE Transactions on Industrial Electronics, vol. 69, no. 2, pp. 2022–2032, 2021.
[18] Z. Zhao, K. M. Barijough, and A. Gerstlauer, "DeepThings: Distributed adaptive deep learning inference on resource-constrained IoT edge clusters," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2348–2359, 2018.
[19] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, "Neurosurgeon: Collaborative intelligence between the cloud and mobile edge," ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 615–629, 2017.
[20] C. Hu and B. Li, "Distributed inference with deep learning models across heterogeneous edge devices," in IEEE INFOCOM 2022 - IEEE Conference on Computer Communications. IEEE, 2022, pp. 330–339.
[21] L. Zeng, X. Chen, Z. Zhou, L. Yang, and J. Zhang, "CoEdge: Cooperative DNN inference with adaptive workload partitioning over heterogeneous edge devices," IEEE/ACM Transactions on Networking, vol. 29, no. 2, pp. 595–608, 2020.
[22] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," in Proc. of Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, December 2001.
[23] G. Jocher, "Ultralytics YOLOv5," 2020. [Online]. Available: https://github.com/ultralytics/yolov5
[24] P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, "Detection and tracking meet drones challenge," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7380–7399, 2021.