A Lite Distributed Semantic Communication System for Internet of Things

Huiqiang Xie, Graduate Student Member, IEEE, and Zhijin Qin, Member, IEEE

Abstract—The rapid development of deep learning (DL) and widespread applications of the Internet of Things (IoT) have made the devices smarter than before, and enabled them to perform more intelligent tasks. However, it is challenging for any IoT device to train and run DL models independently due to its limited computing capability. In this paper, we consider an IoT network where the cloud/edge platform performs the DL based semantic communication (DeepSC) model training and updating while IoT devices perform data collection and transmission based on the trained model. To make it affordable for IoT devices, we propose a lite distributed semantic communication system based on DL, named L-DeepSC, for text transmission with low complexity, where the data transmission from the IoT devices to the cloud/edge works at the semantic level to improve transmission efficiency. Particularly, by pruning the model redundancy and lowering the weight resolution, the L-DeepSC becomes affordable for IoT devices and the bandwidth required for model weight transmission between IoT devices and the cloud/edge is reduced significantly. Through analyzing the effects of fading channels in forward-propagation and back-propagation during the training of L-DeepSC, we develop a channel state information (CSI) aided training process to decrease the effects of fading channels on transmission. Meanwhile, we tailor the semantic constellation to make it implementable on capacity-limited IoT devices. Simulation demonstrates that the proposed L-DeepSC achieves competitive performance compared with traditional methods, especially in the low signal-to-noise (SNR) region; in particular, it can reach a compression ratio as large as 40x without performance degradation.

Index Terms—Internet of Things, neural network compression, pruning, quantization, semantic communication.

I. INTRODUCTION

With the widely deployed connected devices, Internet of Things (IoT) networks are providing more and more intelligent services, i.e., smart home, intelligent manufacturing, and smart cities, by processing a massive amount of data generated by those connected devices [1], [2]. Deep learning (DL) [3] has demonstrated great potentials in processing various types of data, i.e., images and texts. The DL-enabled IoT devices are capable of exploiting and processing different types of data more effectively as well as handling more intelligent tasks than before. Although some IoT devices have certain capability to process simple DL models, the limited memory, computing, and battery capability still prevent wide applications of DL [4]. Therefore, the burden of DL model updates is usually transferred to the cloud/edge platform [5]. Particularly, the DL model is trained at the cloud/edge platform based on data from the IoT devices, and then the trained model is distributed to IoT devices. However, data transmitted over the air could be distorted by wireless channels, which may cause improper trained results, i.e., a local optimum. Moreover, the large number of parameters in DL models leads to high latency when distributing the DL models with limited bandwidth. Therefore, transmitting accurate data to the cloud/edge platform over wireless channels for model training and reducing the number of parameters in DL models for lower latency and power consumption at the IoT devices are two crucial problems.

To address the first problem on accurate data transmission in an IoT network, semantic communication, which interprets information at the semantic level rather than as bit sequences [6], is promising. To make a decision based on the received information, there are usually three steps: i) the traditional communication receiver to recover the raw data [7]; ii) the feature extractor to obtain and interpret the meanings of the raw data for the decision [8]; and iii) the effect network to produce the desired effects according to the extracted features [9], [10]. Correspondingly, the communication is categorized into three levels [11], including the transmission level to guarantee the transmission accuracy of bit sequences, the semantic level to guarantee the exchange of semantic information, and the effectiveness level to measure the corresponding effects or caused actions of the transmitted information, i.e., network re-configuration, which is illustrated in Fig. 1. The traditional communication system works at the transmission level shown in Fig. 1(a), which aims at transmitting and receiving symbols accurately [12]. The following feature extractor network and effect network are designed separately based on applications. However, designing these modules separately may lead to error propagation and prevent reaching joint optimality. For example, the feature network is not able to correct errors from the traditional receiver, which will affect the subsequent decision making in the effect network. Thus, through designing the traditional receiver and feature extractor network jointly (the semantic level) or merging the traditional receiver, feature extractor network, and effect network together (the effectiveness level), communication systems have the capability of error correction at the semantic level and effectiveness level, respectively. In this paper, we will focus on distributed semantic communications for IoT networks and leave effectiveness level communication to future research.

With the recent advancements on DL, it is promising to represent a traditional transceiver or each individual signal processing block by a deep neural network (DNN) [13]. Inspired by the autoencoder in DL techniques, an end-to-end (E2E) communication system has been proposed to merge the signal processing blocks in traditional communication [14].

Huiqiang Xie and Zhijin Qin are with the School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK (e-mail: [email protected], [email protected]).

Fig. 1. Illustration of three communication levels at the receiver: (a) transmission level; (b) semantic level; (c) effectiveness level.

Missing channel gradients becomes the bottleneck of training E2E communication systems. There are several works for mitigating this problem [15]–[17]. Dörner et al. proposed a two-phase training process [15] by training the transceiver with a stochastic channel model first and fine-tuning the receiver over real channels. Aoudia et al. estimated the channel gradients by sampling from a relaxed distribution based on a stochastic reinforcement learning policy [16], where the transmitter and receiver are trained separately. Ye et al. proposed a generative adversarial network (GAN) to approximate the unknown channel model [17] so that the channel gradients can be estimated by the GAN.

There have been some initial works related to deep semantic communications [18]–[22]. Bourtsoulatze et al. [18] proposed joint source-channel coding for wireless image transmission based on the convolutional neural network (CNN), where the peak signal-to-noise ratio (PSNR) is used to measure the accuracy of image recovery at the receiver. Taking image classification tasks into consideration, Lee et al. [19] developed a transmission-recognition communication system by merging wireless image transmission with the effect network as DNNs, i.e., image classification, which achieves higher image classification accuracy than performing them separately. For texts, Farsad et al. [21] designed joint source-channel coding for the erasure channel by using a recurrent neural network (RNN) and a fully-connected neural network (FCN), where the system recovers the text directly rather than performing channel and source decoding separately. In order to understand texts better and adapt to dynamic environments, Xie et al. [22] developed a semantic communication system based on the Transformer, named DeepSC, which clarifies the concepts of semantic information and semantic error at the sentence level for the first time. In brief, compared with traditional approaches, the semantic communication systems are more robust to channel variation and are able to achieve better performance in terms of source recovery and image classification, especially in the low signal-to-noise (SNR) regime.

To deal with the second problem of reducing the number of parameters, the network slimmer has attracted extensive attention to compress DL models without degrading performance since neural networks are usually over-sized [23]. Parameter pruning and quantization are two main approaches for DL model compression. Parameter pruning is to remove the unnecessary connections between two neurons or unimportant neurons. Han et al. [24] proposed an iterative pruning approach, where the model is trained first, then pruned by a given threshold, and is fine-tuned to recover performance in terms of image classification. This approach could reduce the connections without losing accuracy. Liu et al. [25] proposed to prune the filters in CNNs by training the model with L1 regularization so that redundant weights converge to zero directly without sacrificing the performance. By analyzing the connection sensitivity among neurons and layers, Li et al. [26] remove the insensitive layers, which further increases inference speed. By applying these pruning approaches, DL models can be compressed by 13 to 20 times. Quantization aims to represent a weight parameter with lower precision (fewer bits), which reduces the required bitwidth of data flowing through the neural network model in order to shrink the model size for memory saving and simplify the operations for computing acceleration [27]. With vector quantization, Gong et al. [28] quantize the DL models. Similarly, Zhou et al. [29] investigated an iterative quantization, which starts with a trained full-resolution model and then quantizes only a portion of the model followed by several epochs of re-training to recover the accuracy loss from quantization. A mixed precision quantization by Li et al. [30] quantizes weights while keeping the activations at full resolution. The training algorithm by Jacob et al. [31] preserves the model accuracy after post-quantization. With quantization, the weights can generally be compressed from 32-bit to 8-bit without performance loss. Similarly, pruning and quantization can also be used in DL-enabled communication systems. For example, Guo et al. [32] have shown that model compression can accelerate the processing of channel state information (CSI) acquisition and signal detection in massive multiple-input multiple-output (MIMO) systems without performance degradation.

Through applying the network slimmer to our existing work DeepSC, the aforementioned two challenges in IoT networks can be effectively addressed. Although the above works validate the feasibility, we still face the following issues to make it affordable for IoT devices:

• Question 1: How to design semantic communication systems over wireless fading channels?
• Question 2: How to form the constellation to make it affordable for capacity-limited IoT devices?
• Question 3: How to compress semantic models for fast model transmission and low-cost implementation on IoT devices?

In this paper, we design a distributed semantic communication system for IoT networks. Especially, a lite DeepSC (L-DeepSC) is proposed to address the above questions. Different from our previous work [22], this work solves the problem of training DeepSC over fading channels with imperfect CSI and considers different wireless channel models to show the generalization of our method. Moreover, this work extends [22] to a more practical IoT scenario, where two key problems, model updating and broadcasting, are solved. This work also addresses the issue of finite constellation sizes for capacity-constrained IoT devices while [22] assumes infinite constellations. The main contributions of this paper are summarized as follows.

• We design a distributed semantic communication network under power and latency constraints, in which the receiver and feature extractor networks are jointly optimized by overcoming fading channels.
• By identifying the impacts of CSI on DL model training over fading channels, we propose a CSI-aided semantic communication system to speed up convergence, where the CSI is refined by a de-noise neural network. This addresses the aforementioned Question 1.
• To make data transmission and receiving affordable for capacity-constrained devices, we design a finite-bits constellation to solve Question 2.
• Due to over-parametrization, we propose a model compression algorithm, including network sparsification and quantization, to reduce the size of DL models by pruning the redundant connections and quantizing the weights, which addresses the aforementioned Question 3.

The rest of this paper is organized as follows. The distributed semantic communication system model is introduced and the corresponding problems are identified in Section II. Section III presents the proposed L-DeepSC. Numerical results are used to verify the performance of the proposed L-DeepSC in Section IV. Finally, Section V concludes this paper.

Notation: $\mathbb{C}^{n\times m}$ and $\mathbb{R}^{n\times m}$ represent the sets of complex and real matrices of size $n \times m$, respectively. Bold-font variables denote matrices or vectors. $x \sim \mathcal{CN}(\mu, \sigma^2)$ means variable $x$ follows the circularly-symmetric complex Gaussian distribution with mean $\mu$ and covariance $\sigma^2$. $(\cdot)^T$ and $(\cdot)^H$ denote the transpose and Hermitian of a vector or a matrix, respectively. $\Re\{\cdot\}$ and $\Im\{\cdot\}$ refer to the real and the imaginary parts of a complex number.

II. SYSTEM MODEL AND PROBLEM FORMULATION

Text is an important type of source data, which can be sensed from speaking and typing, environmental monitoring, etc. By training DL models with these text data at the cloud/edge platform, the DL-based IoT devices have the capability to understand text data and generate semantic features to be transmitted to the center to perform intelligent tasks, i.e., intelligent assistants, human emotion understanding, and environment humidity and temperature adjustment based on human preference [33].

As shown in Fig. 2(a), we focus on distributed semantic communications for IoT networks. The considered system consists of various IoT networks with two layers, the cloud/edge platform and distributed IoT devices. The cloud/edge platform is equipped with huge computation power and big memory, which can be used to train the DL model by the received semantic features. The semantic communication enabled IoT devices perform intelligent tasks by understanding sensed texts, and are with limited memory and power but expected long lifetime, i.e., up to 10 years. Particularly, our considered distributed semantic communication system consists of the following three steps:

1) Model Initialization/Update: The cloud/edge platform first trains the semantic communication model by an initial dataset. The trained model is updated in the subsequent iterations by the received semantic features from IoT devices.
2) Model Broadcasting: The cloud/edge platform broadcasts the trained DL model to each IoT device.
3) Semantic Features Upload: The IoT devices constantly capture the text data, which are encoded by the proposed semantic transmitter shown in Fig. 2(b). The extracted semantic features are then transmitted to the cloud/edge for model update and subsequent processing.

The aforementioned Questions 1-3 correspond to model initialization/update, semantic features uploading, and model broadcasting, respectively. Different from the traditional information transmission, semantic features can be not only used for recovering the text at the semantic level accurately, but also exploited as the input of other modules, i.e., emotion classification, dialog systems, and human-robot interaction, for training effect networks and performing various intelligent tasks directly. The devices can also exchange semantic features, which has been previously discussed in our work in [22]. We focus on the communication between cloud/edge platforms and local IoT devices to make the semantic communication model affordable.

A. Semantic Communication System

The DeepSC shown in Fig. 2(b) can be mainly divided into three parts, the transmitter network, the physical channel, and the receiver network, where the transmitter network includes the semantic encoder and the channel encoder, and the receiver network consists of the semantic decoder and the channel decoder.

We assume that the input of the DeepSC is a sentence, $\mathbf{s} = [w_1, w_2, \cdots, w_N]$, where $w_n$ represents the $n$-th word in the sentence. The encoded symbol stream can be represented as

$$\mathbf{X} = C_{\alpha}\left(S_{\beta}(\mathbf{s})\right), \qquad (1)$$

where $S_{\beta}(\cdot)$ is the semantic encoder network with parameter set $\beta$ and $C_{\alpha}(\cdot)$ is the channel encoder with parameter set $\alpha$. If $\mathbf{X}$ is sent through a wireless fading channel, the signal received at the receiver can be given by

$$\mathbf{Y} = f_{\mathbf{H}}(\mathbf{X}) = \mathbf{H}\mathbf{X} + \mathbf{N}, \qquad (2)$$

where $\mathbf{H}$ represents the channel gain between the transmitter and the receiver, and $\mathbf{N} \sim \mathcal{N}(0, \sigma_n^2)$ is additive white Gaussian noise (AWGN). (Here, we have omitted the discussion of complex channels. If the complex channel is $\bar{\mathbf{H}}$, then $\bar{\mathbf{H}} = [\Re(\mathbf{H}), -\Im(\mathbf{H}); \Im(\mathbf{H}), \Re(\mathbf{H})]$.)

The decoded signal can be represented as

$$\hat{\mathbf{s}} = S_{\chi}^{-1}\left(C_{\delta}^{-1}(\mathbf{Y})\right), \qquad (3)$$

where $\hat{\mathbf{s}}$ is the recovered sentence, $C_{\delta}^{-1}(\cdot)$ is the channel decoder with parameter set $\delta$, $S_{\chi}^{-1}(\cdot)$ is the semantic decoder network with parameter set $\chi$, and the superscript $-1$ represents the decoding operation.
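To make the end-to-end flow of (1)-(3) concrete, the following is a minimal NumPy sketch of one forward pass through a hypothetical semantic transceiver over a fading channel. The encoder and decoder here are stand-in linear layers rather than the Transformer-based networks used by DeepSC, and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in parameter sets (alpha, beta, chi, delta): single linear layers.
d_embed, d_sym = 128, 16
S_beta  = rng.normal(size=(d_embed, d_embed)) * 0.01   # semantic encoder
C_alpha = rng.normal(size=(d_sym, d_embed))  * 0.01    # channel encoder
C_delta = rng.normal(size=(d_embed, d_sym))  * 0.01    # channel decoder
S_chi   = rng.normal(size=(d_embed, d_embed)) * 0.01   # semantic decoder

def transmit(s_embed):
    """Eq. (1): X = C_alpha(S_beta(s)) for one embedded sentence (d_embed x N)."""
    return C_alpha @ (S_beta @ s_embed)

def fading_channel(X, snr_db=12.0):
    """Eq. (2): Y = H X + N with a random (real-equivalent) channel H."""
    H = rng.normal(size=(X.shape[0], X.shape[0])) / np.sqrt(X.shape[0])
    noise_std = np.sqrt(10 ** (-snr_db / 10))
    N = noise_std * rng.normal(size=X.shape)
    return H @ X + N, H

def receive(Y):
    """Eq. (3): s_hat = S_chi^{-1}(C_delta^{-1}(Y)), with decoders as stand-ins."""
    return S_chi @ (C_delta @ Y)

s_embed = rng.normal(size=(d_embed, 30))   # a 30-word sentence after embedding
X = transmit(s_embed)
Y, H = fading_channel(X)
s_hat = receive(Y)
print(X.shape, Y.shape, s_hat.shape)       # (16, 30) (16, 30) (128, 30)
```

In the actual system the four mappings are trained jointly with the loss in (4); the sketch only fixes the data flow and shapes.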

Fig. 2. The framework of semantic communications for IoT networks: (a) the proposed distributed semantic communication network; (b) the semantic communication system.

The whole semantic communication network can be trained by the cross-entropy (CE) loss function, which is given by

$$L_{CE}(\mathbf{s}, \hat{\mathbf{s}}) = \sum_{i}\left(q(w_i) - 1\right)\log\left(1 - p(w_i)\right) - \sum_{i} q(w_i)\log\left(p(w_i)\right), \qquad (4)$$

where $q(w_i)$ is the real probability that the $i$-th word, $w_i$, appears in the source sentence $\mathbf{s}$, and $p(w_i)$ is the predicted probability that the $i$-th word, $w_i$, appears in $\hat{\mathbf{s}}$. CE can measure the difference between the two distributions. Through minimizing the CE loss, the network can learn the word distribution, $q(w_i)$, in the source sentence, $\mathbf{s}$. Consequently, the syntax, phrases, and the meaning of words in the context can be learnt by DNNs.

B. Problem Description

Instead of bits, the input sentence, $\mathbf{s}$, in the DeepSC will cause the learned constellation to be no longer limited to a few points. After transmitting $\mathbf{X}$, the fading channel increases the difficulty of model training compared with the AWGN channel. Meanwhile, the huge number of parameters, $\alpha, \beta, \chi, \delta$, indicates the complexity of the whole model. These factors limit DeepSC for IoT networks and incur the aforementioned Questions 1-3, including feasible constellation design, training over fading channels, and model compression.

1) Training over fading channels: In DL, the training process can be divided into forward-propagation to predict the target and back-propagation to converge the neural network, as stated in the following.

Forward-propagation: From the received signal, the estimated sentence used to recover the semantic information is given by

$$\hat{\mathbf{s}} = S_{\chi}^{-1}\left(C_{\delta}^{-1}(\mathbf{Y})\right). \qquad (5)$$

Back-propagation: Taking the semantic encoder as an example, the parameter vector at the $t$-th iteration is updated by

$$\boldsymbol{\beta}(t) = \boldsymbol{\beta}(t-1) - \eta \frac{\partial L_{CE}}{\partial \boldsymbol{\beta}}, \qquad (6)$$

where $\eta$ is the learning rate and $\frac{\partial L_{CE}}{\partial \boldsymbol{\beta}}$ is the gradient, computed by

$$\frac{\partial L_{CE}}{\partial \boldsymbol{\beta}} = \frac{\partial L_{CE}}{\partial \hat{\mathbf{s}}}\frac{\partial \hat{\mathbf{s}}}{\partial \mathbf{Y}}\frac{\partial \mathbf{Y}}{\partial \mathbf{X}}\frac{\partial \mathbf{X}}{\partial \boldsymbol{\beta}} = \frac{\partial L_{CE}}{\partial \hat{\mathbf{s}}}\frac{\partial \hat{\mathbf{s}}}{\partial \mathbf{Y}}\,\mathbf{H}\,\frac{\partial \mathbf{X}}{\partial \boldsymbol{\beta}}. \qquad (7)$$

In (7), $\mathbf{H}$ will introduce stochasticity during weight updating. For an AWGN channel, $\mathbf{H} = \mathbf{I}$ will not affect it. However, for fading channels, $\mathbf{H}$ is random, which may cause $\boldsymbol{\beta}$ to fail to converge to the global optimum, while the forward-propagation in (5) is unable to recover semantic information accurately based on the local optimum. Thus, it is critical to design the training process to mitigate the effects of $\mathbf{H}$, which also makes the DeepSC applicable to fading channels.
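As a toy illustration of (6) and (7), consider a one-layer linear transmitter with a squared-error loss; the sketch below, a simplification and not the actual DeepSC training loop, computes the gradient by hand and shows how the channel matrix H enters the chain rule and inflates the variance of the weight update when H is random.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 8, 8

def grad_through_channel(W, H, s, target):
    """Gradient of L = 0.5*||H W s - target||^2 w.r.t. W; note the leading H^T."""
    y = H @ (W @ s)                  # forward pass through transmitter and channel
    return H.T @ (y - target) @ s.T  # chain rule: dL/dW = H^T (y - t) s^T, cf. (7)

W = rng.normal(size=(d_out, d_in)) * 0.1
s = rng.normal(size=(d_in, 1))
target = rng.normal(size=(d_out, 1))

# AWGN-like case (H = I) versus random fading realizations of H.
g_awgn = grad_through_channel(W, np.eye(d_out), s, target)
g_fading = [grad_through_channel(W, rng.normal(size=(d_out, d_out)) / np.sqrt(d_out),
                                 s, target) for _ in range(1000)]

print("||grad|| with H = I        :", np.linalg.norm(g_awgn))
print("std of grad entries over H :", np.std([g[0, 0] for g in g_fading]))
```

With H = I the gradient is deterministic given the data, whereas under fading the same data point yields a different update for every channel realization, which is the perturbation analyzed in (10) below.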

2) Feasible constellation design: Generally, the DL models run on floating-point operations (FLOPs), which means that the input, output, and weights are in a large range of $\pm 1.40129 \times 10^{-45}$ to $\pm 3.40282 \times 10^{+38}$ [34]. Although DeepSC can learn the constellations from the source information and channel statistics, the learned constellation points, such as cluster constellations [35], are disordered in the range of $\pm 1.40129 \times 10^{-45}$ to $\pm 3.40282 \times 10^{+38}$, which brings an additional burden to the hardware of IoT devices; for instance, the high-resolution phase-shift and amplitude-shift pose high requirements on the circuit. Therefore, it is desired to form feasible constellations with only finite points for current radio frequency (RF) systems. In other words, we have to design a smaller constellation for the DeepSC.

3) Model communication: The more parameters DeepSC has, the stronger its signal processing ability, which however increases computational complexity and model size and results in high power consumption. In the distributed DeepSC system, the trained DeepSC model deployed at local IoT devices is frequently updated to perform intelligent tasks better. The IoT application limits the bandwidth and cost of distributing the DeepSC model. Furthermore, to extend the IoT network lifetime, especially the battery lifetime, most local devices are with finite storage and computation capability, which limits the size of DeepSC. Therefore, compressing DeepSC not only reduces the latency of model transmission between the cloud/edge platform and local devices but also makes it possible to run the DL model on local devices.

III. PROPOSED LITE DISTRIBUTED SEMANTIC COMMUNICATION SYSTEM

To address the identified challenges in Section II, we propose a lite distributed semantic communication system, named L-DeepSC. We analyze the effects of CSI on the model training under fading channels and design a CSI-aided training process to overcome the fading effects, which successfully deals with Question 1. Besides, weight pruning and quantization are investigated to address Question 3. Finally, our finite-points constellation design solves Question 2 effectively.

A. Deep De-noise Network based CSI Refinement and Cancellation

The most common method to reduce the effects of fading channels in wireless communication is to use the known channel properties of a communication link, i.e., CSI. Similarly, CSI can also reduce the channel impacts in training L-DeepSC. Next, we will first analyze the role of CSI in L-DeepSC training.

In order to simplify the analysis, we assume the transmitter and the receiver are with a one-layer dense network with sigmoid activation, where the transmitter has an additional untrainable embedding layer, and the receiver also has an untrainable de-embedding layer. The IoT devices are with the trained transmitter model and the cloud/edge platform works as the receiver, as shown in the system model in Fig. 2. The IoT devices and the cloud/edge platform are equipped with the same number of antennas. After the embedding layer, the source message, $\mathbf{s}$, is embedded into $\mathbf{S}$. Then, $\mathbf{S}$ is encoded into

$$\mathbf{X} = \sigma\left(\mathbf{W}_T \mathbf{S} + \mathbf{b}_T\right), \qquad (8)$$

where $\mathbf{X}$ is the semantic features transmitted from the IoT devices to the cloud/edge platform, $\mathbf{W}_T$ and $\mathbf{b}_T$ are the trainable parameters to extract the features from the source message $\mathbf{s}$, and $\sigma(\cdot)$ is the sigmoid activation function. (Here, we have avoided the discussion of complex signals. If the complex signal is $\bar{\mathbf{X}}$, then $\bar{\mathbf{X}} = [\Re(\mathbf{X}), \Im(\mathbf{X})]$.)

The received symbol at the cloud/edge platform is affected by the channel $\mathbf{H}$ and AWGN as in (2). From the received symbol, the cloud/edge platform recovers the embedding matrix by

$$\hat{\mathbf{S}} = \sigma\left(\mathbf{W}_R \mathbf{Y} + \mathbf{b}_R\right), \qquad (9)$$

where the estimated source message, $\hat{\mathbf{s}}$, can be obtained after the de-embedding layer, and $\mathbf{W}_R$ and $\mathbf{b}_R$ can learn to recover $\mathbf{s}$. The L-DeepSC can be optimized by the loss function in (4). The fading channel not only contaminates the gradients in the back-propagation, but also restricts the representation power in the forward-propagation, as stated in the following.

Back-propagation: It updates the parameter $\mathbf{W}_T$ by its gradient

$$\frac{\partial L_{CE}(\hat{\mathbf{s}}, \mathbf{s})}{\partial \mathbf{W}_T} = \left(\mathbf{F}_R \mathbf{W}_R \mathbf{H} \mathbf{F}_T\right)^T \nabla_{\hat{\mathbf{s}}} L_{CE}(\hat{\mathbf{s}}, \mathbf{s})\, \mathbf{s}^T, \qquad (10)$$

where $\mathbf{F}_R = \mathrm{diag}\left(\sigma'(\mathbf{W}_R \mathbf{y} + \mathbf{b}_R)\right)$ and $\mathbf{F}_T = \mathrm{diag}\left(\sigma'(\mathbf{W}_T \mathbf{s} + \mathbf{b}_T)\right)$. In (10), $\mathbf{H}$ is untrainable and random, therefore it will cause perturbation for the weight updating, i.e., weight updating with higher variance. If the transmitter consists of very deep neural networks, the perturbation will affect the back-propagation of the whole transmitter network, where the perturbation will propagate through the whole transmitter network by the chain rule.

Forward-propagation: With the received signal $\mathbf{Y}$, the source messages can be recovered by

$$\hat{\mathbf{S}} = \sigma\left(\mathbf{W}_R \mathbf{Y} + \mathbf{b}_R\right) = \sigma\left(\mathbf{W}_R \mathbf{H}\mathbf{X} + \mathbf{W}_R \mathbf{N} + \mathbf{b}_R\right). \qquad (11)$$

In (11), $\mathbf{W}_R$ has to learn how to deal with the channel effects and decode at the same time, which increases the training burden and reduces the network expression capability. Meanwhile, the errors caused by channel effects also propagate to the subsequent layers for the L-DeepSC receiver with multiple layers.

The impacts of the channel can be mitigated by exploiting CSI at the cloud/edge. If the channel $\mathbf{H}$ is known, then the received symbol can be processed by

$$\tilde{\mathbf{Y}} = \left(\mathbf{H}^H \mathbf{H}\right)^{-1} \mathbf{H}^H \mathbf{Y} = \mathbf{X} + \tilde{\mathbf{N}}, \qquad (12)$$

where $\tilde{\mathbf{N}} = \left(\mathbf{H}^H \mathbf{H}\right)^{-1} \mathbf{H}^H \mathbf{N}$. In (12), the channel effect is transferred from multiplicative noise to additive noise, $\tilde{\mathbf{N}}$, which provides the possibility of stable back-propagation as well as a stronger capability of network representation. With (12), back-propagation and forward-propagation can be performed by setting $\mathbf{H} = \mathbf{I}$ in (10) and (11), respectively. Therefore, the channel effects can be completely removed.

The above discussion shows the importance of CSI in model training. However, CSI can generally only be estimated, i.e., by least-squares (LS), linear minimum mean-squared error (LMMSE), or minimum mean-squared error (MMSE) estimators. Due to exploiting prior channel statistics, LMMSE and MMSE estimators usually perform better than the LS estimator; however, LMMSE and MMSE estimators are sensitive to the accuracy of the channel statistics while the LS estimator requires no prior channel information. Meanwhile, DL techniques can also be used to improve the performance of channel estimation [36], [37].

For simplicity, we initially use the LS estimator. Then, we adopt the deep de-noise network to increase the resolution of the LS estimator as in [38], shown in Fig. 3. Particularly, the rough CSI is first estimated by the LS estimator with few pilots, denoted by

$$\mathbf{H}_{\mathrm{rough}} = \mathbf{Y}_p \mathbf{X}_p^H = \mathbf{H} + \mathbf{N}\mathbf{X}_p^H, \qquad (13)$$

where $\mathbf{Y}_p = \mathbf{H}\mathbf{X}_p + \mathbf{N}$ is the received pilot signal and $\mathbf{X}_p$ is the transmitted pilot signal. Then, (13) can be represented as

$$\mathbf{H}_{\mathrm{rough}} = \mathbf{H} + \hat{\mathbf{N}}, \qquad (14)$$

where $\hat{\mathbf{N}} = \mathbf{N}\mathbf{X}_p^H$.

Fig. 3. The proposed CSI refinement and cancellation based on de-noise neural networks.

From (14), $\mathbf{H}_{\mathrm{rough}}$ consists of the exact $\mathbf{H}$ and the noise, $\hat{\mathbf{N}}$. De-noise neural networks are used to recover $\mathbf{H}$ more accurately from $\mathbf{H}_{\mathrm{rough}}$ by considering $\mathbf{H}$ and $\mathbf{H}_{\mathrm{rough}}$ as the original picture and the noisy picture, respectively. Here, we exploit the attention-guided denoising convolutional neural network (ADNet) [39] to refine the CSI. ADNet includes four blocks: a sparse block, a feature enhancement block, an attention block, and a reconstruction block. After the input image, the sparse block is used to extract useful features from the given noisy image. The attention block can extract the noise information hidden in the complex background and is integrated into the feature enhancement block to reduce the complexity. Finally, the de-noised image is reconstructed by the reconstruction block.

The refined CSI, $\mathbf{H}_{\mathrm{refine}}$, is denoted by

$$\mathbf{H}_{\mathrm{refine}} = \mathrm{ADNet}\left(\mathbf{H}_{\mathrm{rough}}\right). \qquad (15)$$

In (15), the $\mathrm{ADNet}(\cdot)$ is trained with the loss function $L(\mathbf{H}_{\mathrm{refine}}, \mathbf{H}) = \frac{1}{2}\left\|\mathbf{H}_{\mathrm{refine}} - \mathbf{H}\right\|_F^2$. Since the performance of the LS estimator is similar to that of the LMMSE and MMSE estimators in the high SNR region, we pay more attention to the low SNR region when training ADNet. With proper training, ADNet can mitigate the impacts from noise without any prior channel information, especially in the low SNR region. Such a design provides a good solution for Question 1.
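The following NumPy sketch mirrors (12)-(15) under simplifying assumptions: the LS estimate is formed from unitary pilots, a placeholder `denoise` function stands in for the trained ADNet, and the refined estimate is then used for zero-forcing cancellation as in (12).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 16                                            # number of antennas / symbol dimension

H = rng.normal(size=(n, n)) / np.sqrt(n)          # true (real-equivalent) channel
Xp = np.eye(n)                                    # orthonormal pilot block, Xp Xp^H = I
Yp = H @ Xp + 0.1 * rng.normal(size=(n, n))       # received pilots

H_rough = Yp @ Xp.T                               # LS estimate, eqs. (13)-(14)

def denoise(H_noisy):
    """Placeholder for the trained ADNet of eq. (15); here a crude magnitude shrinkage."""
    return np.where(np.abs(H_noisy) > 0.05, H_noisy, 0.0)

H_refine = denoise(H_rough)

# Channel cancellation, eq. (12): Y_tilde = (H^H H)^{-1} H^H Y = X + N_tilde.
X = rng.normal(size=(n, 30))
Y = H @ X + 0.1 * rng.normal(size=(n, 30))
Y_tilde = np.linalg.solve(H_refine.T @ H_refine, H_refine.T @ Y)

print("estimation error:", np.linalg.norm(H_refine - H) / np.linalg.norm(H))
print("residual after cancellation:", np.linalg.norm(Y_tilde - X) / np.linalg.norm(X))
```

In the actual L-DeepSC the denoiser is the ADNet trained with the Frobenius-norm loss above, and the cancelled output is what the receiver network and the back-propagation in (10)-(11) operate on.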

Fig. 4. Flowchart of the proposed joint pruning-quantization: (a) the original weights matrix; (b) the weights after pruning, where the example pruning function is x = 0 for |x| < 0.5; (c) the weights after quantization, where the example quantization function is x = sign(x).

B. Model Compression

Through applying CSI into model training, the cloud/edge platform can extract the semantic features from L-DeepSC. However, the size and complexity of the trained L-DeepSC model are still very large, which causes high latency for the cloud/edge platform to broadcast the updated L-DeepSC. Note that both weight pruning and quantization can reduce the model size and complexity; therefore, we compress the DeepSC model by a joint pruning-quantization scheme to make it affordable for IoT devices. As shown in Fig. 4, the original weights are first pruned at a high-precision level by identifying and removing the unnecessary weights, which makes the network sparse. Quantization is then used to convert the trained L-DeepSC model into a low-precision level. The proposed network sparsification and quantization can address Question 3 and are introduced in detail in the following.

1) Network Sparsification: A proper criterion to disable neural connections is important. Obviously, the connections with small weight values can be pruned. Therefore, the pruning issue here turns into setting a proper pruning threshold.

As shown in Fig. 2(b), the DeepSC consists of neural networks, $\alpha, \beta, \chi, \delta$, where each includes multiple layers. As the DeepSC mainly consists of dense layers, we choose the unstructured pruning method in this paper, where the computation workload of the sparse model can be reduced by sparsity algorithms and FPGA designs [40], [41], i.e., sparse matrix-vector multiplication. Assume there are in total $N$ layers in the pre-trained DeepSC model with $W_{i,j}^{(n)}$ being the weight of the connection between the $i$-th neuron of the $(n+1)$-th layer and the $j$-th neuron of the $n$-th layer. With a pruning threshold $w_{\mathrm{thre}}$, the model weights can be pruned by

$$W_{i,j}^{(n)} = \begin{cases} W_{i,j}^{(n)}, & \text{if } \left|W_{i,j}^{(n)}\right| > w_{\mathrm{thre}}, \\ 0, & \text{otherwise}. \end{cases} \qquad (16)$$

We determine the pruning threshold by

$$w_{\mathrm{thre}} = s_{M \times \gamma}, \qquad (17)$$

where $\mathbf{s} = \mathrm{sort}\left(\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \cdots, \mathbf{W}^{(N)}\right)$ is the sorted weight values from the least important one to the most important one, $M$ is the total number of connections, and $\gamma$, the sparsity ratio between 0 and 1, indicates the proportion of zero values in the weights. The weight pruning can be divided into two steps, weight pruning to disable some neuron connections and fine-tuning to recover the accuracy, as shown in Algorithm 1.

Algorithm 1 Network Sparsification
Input: The pre-trained weights W, the sparsity ratio γ.
Output: The pruned weights W_pruned.
1: Count the total number of connections, M.
2: Sort the whole connections from small to large, s.
3: Obtain the threshold by (17) with M and γ, w_thre.
4: for n = 1 to N do
5:   Prune the connections by (16), W^(n)_pruned.
6: end for
7: Fine-tune the pruned model by the loss function (4).
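A compact NumPy rendering of Algorithm 1 and (16)-(17) is given below; it is a sketch under the assumption of simple dense weight matrices, and the fine-tuning step is only indicated by a comment since it needs the full training pipeline.

```python
import numpy as np

def prune_by_sparsity(weights, gamma):
    """Magnitude-prune a list of dense weight matrices to a global sparsity ratio gamma.

    Implements (17): the threshold is the (M * gamma)-th smallest magnitude over all
    connections, and (16): connections at or below the threshold are zeroed.
    """
    all_mags = np.sort(np.concatenate([np.abs(w).ravel() for w in weights]))
    M = all_mags.size                                  # total number of connections
    k = int(M * gamma)
    w_thre = all_mags[k - 1] if k > 0 else -np.inf     # global pruning threshold
    pruned = [np.where(np.abs(w) > w_thre, w, 0.0) for w in weights]
    # In Algorithm 1 the pruned model is then fine-tuned with the CE loss (4)
    # to recover the accuracy; that step is omitted in this sketch.
    return pruned, w_thre

rng = np.random.default_rng(3)
layers = [rng.normal(size=(128, 128)), rng.normal(size=(16, 128))]
pruned, thr = prune_by_sparsity(layers, gamma=0.9)
sparsity = sum((w == 0).sum() for w in pruned) / sum(w.size for w in pruned)
print(f"threshold {thr:.3f}, achieved sparsity {sparsity:.2%}")
```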

2) Network Quantization: The quantization includes weight quantization and activation quantization. The weights, $W_{i,j}^{(n)}$, from a trained model can be converted from 32-bit floating point to $m$-bit integers through applying the quantization function

$$\tilde{W}_{i,j}^{(n)} = \mathrm{round}\left(q_w\left(W_{i,j}^{(n)} - \min\left(\mathbf{W}^{(n)}\right)\right)\right), \qquad (18)$$

where $q_w$ is the scale-factor to map the dynamic range of floating points to an $m$-bit integer, which is given by

$$q_w = \frac{2^m - 1}{\max\left(\mathbf{W}^{(n)}\right) - \min\left(\mathbf{W}^{(n)}\right)}. \qquad (19)$$

For activation quantization, the results of matrix multiplication are stored in accumulators. Due to the limited dynamic range of integer formats, it is possible that the accumulator overflows quickly if the bit-width for the weights and activations is the same. Therefore, accumulators are usually implemented with higher bit-widths, for example, INT32 += INT8 × INT8. Besides, the range of activations is dynamic and dependent on the input data. Therefore, the output of activations has to be re-quantized into $m$-bit integers for the subsequent calculation. Unlike the weights, which are constant, the output of activations usually includes elements that are statistical outliers, which expand the actual dynamic range. For example, even if 99% of the data is distributed between -100 and 100, an outlier of 10,000 will extend the dynamic range from -100 to 10,000, which significantly reduces the mapping resolution. In order to reduce the influence from the outliers, an exponential moving average (EMA) is used by

$$x_{\min}^{(n)}(t+1) = (1-c)\, x_{\min}^{(n)}(t) + c \min\left(\mathbf{X}^{(n)}(t)\right), \qquad (20)$$

and

$$x_{\max}^{(n)}(t+1) = (1-c)\, x_{\max}^{(n)}(t) + c \max\left(\mathbf{X}^{(n)}(t)\right), \qquad (21)$$

where $x_{\min}^{(n)}(t+1)$ and $x_{\max}^{(n)}(t+1)$ are used for the range of activation quantization, $x_{\min}^{(n)}(1) = \min\left(\mathbf{X}^{(n)}(1)\right)$, $x_{\max}^{(n)}(1) = \max\left(\mathbf{X}^{(n)}(1)\right)$, $\mathbf{X}^{(n)}(t)$ is the output of the activations at the $n$-th layer with the $t$-th batch of data, and $c \in [0, 1)$ represents the correlation between the current $x_{\min}^{(n)}/x_{\max}^{(n)}$ and its past value. The effects from outliers can be mitigated by the past normal values. After $t+1$ epochs, $x_{\min}^{(n)}$ and $x_{\max}^{(n)}$ are fixed based on $x_{\min}^{(n)}(t+1)$ and $x_{\max}^{(n)}(t+1)$. Then, the output of the activations can be quantized by

$$\tilde{\mathbf{X}}^{(n)} = \mathrm{clamp}\left(\mathrm{round}\left(q_x\left(\mathbf{X}^{(n)} - x_{\min}^{(n)}\right)\right); -T, T\right), \qquad (22)$$

where $q_x = (2^m - 1)/\left(x_{\max}^{(n)} - x_{\min}^{(n)}\right)$ is the scale-factor and $\mathrm{clamp}(\cdot)$ is used to eliminate the quantized outliers, which is given by

$$\mathrm{clamp}\left(\mathbf{X}^{(n)}; -T, T\right) = \min\left(\max\left(\mathbf{X}^{(n)}, -T\right), T\right), \qquad (23)$$

where $T = 2^m - 1$, which is the border of the $m$-bit integer format.

Algorithm 2 Network Quantization
Input: The pre-trained weights W, the quantization level m, the correlation coefficient c, and the calibration data K.
Output: The quantized weights W_quantized and the range of activations x_min and x_max.
1: Phase 1: Weights Quantization.
2: for n = 1 to N do
3:   Compute the range of weights, max(W^(n)) and min(W^(n)).
4:   Quantize the weights by (18), W̃^(n).
5: end for
6: Phase 2: Activations Quantization.
7: for t = 1 to K do
8:   for n = 1 to N do
9:     Update the dynamic range of activations by (20) and (21), x^(n)_min(t) and x^(n)_max(t).
10:   end for
11: end for
12: Quantize the activations by (22).
13: Fine-tune the quantized model by STE and the loss function (4).

As shown in Algorithm 2, the network quantization includes two phases: i) weight quantization; ii) activation quantization. In phase 1, the weights of each layer can be quantized by (18) directly. In phase 2, a calibration process is applied by running a few calibration batches in order to obtain the activation statistics. In each batch, $x_{\min}^{(n)}(t)$ and $x_{\max}^{(n)}(t)$ are updated based on the activation statistics from the previous batches. These quantization processes might lead to slight accuracy degradation. Quantization-aware training (QAT) is required to re-train the model to minimize the loss of accuracy. Since the rounding operation is not differentiable, a straight-through estimator (STE) is used to estimate the gradient of the quantized weights in the back-propagation [42].
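To make the two-phase procedure of Algorithm 2 concrete, here is a minimal NumPy sketch of (18)-(23); it is an illustrative approximation (the clamping bound is applied symmetrically and the STE-based fine-tuning is left out), not the exact implementation used for L-DeepSC.

```python
import numpy as np

def quantize_weights(W, m=8):
    """Eqs. (18)-(19): affine-map a float matrix onto {0, ..., 2^m - 1} integers."""
    w_min, w_max = W.min(), W.max()
    q_w = (2 ** m - 1) / (w_max - w_min)
    return np.round(q_w * (W - w_min)).astype(np.int32), q_w, w_min

def ema_range(batches, c=0.1):
    """Eqs. (20)-(21): EMA of per-batch activation min/max to suppress outliers."""
    x_min, x_max = batches[0].min(), batches[0].max()
    for X in batches[1:]:
        x_min = (1 - c) * x_min + c * X.min()
        x_max = (1 - c) * x_max + c * X.max()
    return x_min, x_max

def quantize_activations(X, x_min, x_max, m=8):
    """Eqs. (22)-(23): scale, round, and clamp the activations."""
    q_x = (2 ** m - 1) / (x_max - x_min)
    T = 2 ** m - 1
    return np.clip(np.round(q_x * (X - x_min)), -T, T).astype(np.int32)

rng = np.random.default_rng(4)
W_int, q_w, w_min = quantize_weights(rng.normal(size=(128, 128)))
calib = [rng.normal(size=(64, 128)) for _ in range(10)]       # calibration batches
x_min, x_max = ema_range(calib)
X_int = quantize_activations(calib[0], x_min, x_max)
print(W_int.min(), W_int.max(), X_int.min(), X_int.max())
```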

C. Constellation Design with Fewer Quantization Bits

The cloud/edge platform can further reduce the size of L-DeepSC with model compression after the model is trained, which not only reduces the latency significantly for broadcasting the updated DeepSC to IoT devices, but also changes DeepSC into L-DeepSC with low complexity. However, a high-resolution waveform poses high requirements on cost-sensitive IoT devices. In other words, the cost-sensitive IoT devices are usually capacity-limited and cannot afford a large number of constellation points whose phases and amplitudes are close to each other.

Different from bits, the source message, $\mathbf{s}$, is more complicated and the learned constellation will not be limited to a few points, which brings an additional burden on the hardware. Besides, the DL models generally run in FP32, which also expands the range of the constellation. Thus, we aim to reduce the size of the learned constellation without degrading performance, where $\mathbf{X}$ is the learned constellation and is also the output of the activation of the last layer at the local IoT devices. Inspired by the network quantization, we convert the learned high-resolution constellation into a low-resolution one with few points. Thus, we use a two-stage quantization to narrow the range of constellations, which is represented by

$$\mathbf{X}_{\mathrm{dequantize}} = \frac{\mathbf{X}_{\mathrm{quantize}}}{q_x} + x_{\min}, \qquad (24)$$

where $\mathbf{X}_{\mathrm{quantize}}$ is the quantized $\mathbf{X}$ from (22), $q_x$ is the scale-factor, $x_{\min}$ is obtained by (20), and $\mathbf{X}_{\mathrm{dequantize}}$ is the dequantized $\mathbf{X}$.

First, we quantize $\mathbf{X}$ into $m$-bit integers so that the range of $\mathbf{X}$ is narrowed to a size of $2^m$. For example, when $m = 8$, the size of the constellation is reduced to 256. Then, $\mathbf{X}_{\mathrm{quantize}}$ is dequantized to restore $\mathbf{X}$. Such an $\mathbf{X}_{\mathrm{dequantize}}$ has a similar distribution to $\mathbf{X}$ but with fewer constellation points, which helps to lower the hardware cost at the transmitter and preserves the performance as much as possible, and therefore provides the solution for Question 2.

In summary, by exploiting the solutions for the aforementioned Questions, we develop a lite distributed semantic communication system, named L-DeepSC, which could reduce the latency for model exchange under limited bandwidth, run the models at IoT devices with low power consumption, and deal with the distortion from fading channels when uploading semantic features. As a result, the proposed L-DeepSC becomes a good candidate for IoT networks.

IV. NUMERICAL RESULTS

In this section, we compare the proposed L-DeepSC with traditional methods under different fading channels, including Rayleigh and Rician fading channels. The weight pruning and quantization are also verified under fading channels. For the Rayleigh fading channel, the channel coefficient follows $\mathcal{CN}(0, 1)$; for the Rician fading channel, it follows $\mathcal{CN}(\mu, \sigma^2)$ with $\mu = \sqrt{k/(k+1)}$ and $\sigma = \sqrt{1/(k+1)}$, where $k$ is the Rician coefficient and we use $k = 2$ in our simulation.

The transmitter of L-DeepSC is the same as that of DeepSC in [22]. The parameters for the decoding network at the receiver are shown in Table I for the fading channels, where the sum of the outputs of Dense 3 and Dense 5 is the input of the LayerNorm layer. The Transformer encoder and decoder are the semantic encoder and decoder [22], respectively, which enables the systems to understand text and extract semantic information. We also prune the whole network since we consider the communications between the cloud/edge platform and each IoT device as well as the communications between IoT devices.

TABLE I
THE SETTING OF L-DEEPSC TRANSCEIVER

             Layer Name              Units             Activation
Transmitter  Embedding layer         128               None
             4×Transformer Encoder   128 (8 heads)     None
             Dense 1                 256               ReLU
             Dense 2                 16                None
Receiver     Dense 3                 128               ReLU
             Dense 4                 512               ReLU
             Dense 5                 128               None
             LayerNorm               None              None
             4×Transformer Decoder   128 (8 heads)     None
             Prediction Layer        Dictionary Size   Softmax

The output features are with 8 symbols per word. We initialize the learnable embedding matrix from $\mathcal{N}(0, 1)$ with shape (vocab size, embedding dim). The embedding dim is set to 128 in our program and the vocab size depends on the training dataset. The batch size is 64 and the learning rate is $128^{-0.5}\min\left(\mathrm{step}^{-0.5}, \mathrm{step} \times 4000^{-1.5}\right)$, where step is the counting number in the back-propagation. This corresponds to increasing the learning rate linearly for the first 4000 training steps and decreasing it thereafter proportionally to the inverse square root of the step number. We also adopt L2 regularization and the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\varepsilon = 10^{-8}$.

The adopted dataset is the proceedings of the European Parliament [43], which consists of around 2.0 million sentences and 53 million words. The dataset is pre-processed into sentences with lengths of 4 to 30 words and is split into training data and testing data with a 0.1 ratio. The benchmark approach is based on separate source coding and channel coding technologies, which adopt variable-length coding (Huffman coding) for source coding, where we build the Huffman codes by counting the frequency of letters and punctuation so that the look-up table is not large, and turbo coding and Reed-Solomon (RS) coding [44] for channel coding, where the turbo decoding method is the log-MAP algorithm with 5 iterations, combined with quadrature amplitude modulation (QAM). The bilingual evaluation understudy (BLEU) score is used to measure the performance [45].

A. Constellation Design

Fig. 5 compares the full-resolution constellation and the 4-bit constellation. The full-resolution constellation points in Fig. 5(a) contain more information due to the higher resolution, but require complicated hardware, which is almost impossible to design. Through mapping the full-resolution constellation into a finite space, the 4-bit constellation points in Fig. 5(b) become simplified, which makes it possible to implement them in existing RF systems. Note that the 4-bit constellation keeps a similar distribution to the full-resolution constellation. For example, there exist certain blank regions at the edge of the constellation in Fig. 5(a), and the 4-bit constellation shows a similar trend in Fig. 5(b). Such a similar distribution prevents sharp performance degradation when the resolution of the constellation decreases significantly.

Fig. 6 shows the BLEU scores versus SNR for different constellation sizes under AWGN, including the 4-bit constellation, 8-bit constellation, and full-resolution constellation. All of them achieve very similar performance when SNR > 9 dB, which demonstrates that the constellation design is effective and causes no significant performance degradation. The full-resolution and 8-bit constellations perform slightly better than the 4-bit constellation when SNR is low. This is because some weight information used for denoising is lost when the resolution of the constellation is small.
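To illustrate the two-stage constellation quantization of (22) and (24), the sketch below maps a set of learned (full-resolution) constellation points to a 4-bit grid and back; the number of distinct transmit levels drops to at most 2^m per dimension while the overall distribution is roughly preserved. This is a stand-alone illustration with simplified (one-sided) clamping, not the exact implementation used in the experiments.

```python
import numpy as np

def constellation_quantize(X, x_min, x_max, m=4):
    """Two-stage quantization of the learned constellation, cf. (22) and (24)."""
    q_x = (2 ** m - 1) / (x_max - x_min)
    X_quantize = np.clip(np.round(q_x * (X - x_min)), 0, 2 ** m - 1)
    return X_quantize / q_x + x_min               # eq. (24): dequantize

rng = np.random.default_rng(5)
# Pretend these are learned I/Q constellation points from the channel encoder.
X = rng.normal(size=(2, 2000))
X_low = constellation_quantize(X, X.min(), X.max(), m=4)

print("distinct levels per dimension:", len(np.unique(X_low[0])), len(np.unique(X_low[1])))
print("mean squared distortion:", np.mean((X - X_low) ** 2))
```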

Fig. 5. The comparison between the full-resolution constellation and the 4-bit constellation: (a) full-resolution constellation; (b) 4-bit constellation.

Fig. 6. The BLEU scores of different constellation sizes versus SNR under AWGN.

Fig. 7. The MSE for the MMSE estimator, the LS estimator, and the proposed ADNet based LS estimator.

B. Performance over Fading Channels

Fig. 7 compares the channel estimation MSEs of the LS, MMSE, and ADNet-aided LS estimators versus SNR under the Rayleigh fading channels. Note that MMSE equals LMMSE for the AWGN channels. The MMSE and LS estimators have similar accuracy in the high SNR region, thus the range of training SNRs for the ADNet is set from 0 dB to 10 dB to improve the performance of the LS estimator in the low SNR region. As a result, the MSE of the ADNet based LS estimator is significantly lower than that of the LS and MMSE estimators when SNR is low. With increasing SNR, the MSE of the ADNet based LS estimator approaches that of the LS and MMSE estimators. Therefore, the ADNet based LS estimator can be substituted by the LS estimator to reduce the complexity in the high SNR region.

Fig. 8 and Fig. 9 illustrate the relationship between the BLEU score and SNR with the 4-bit constellation over the Rician and the Rayleigh fading channels, respectively, where DeepSC is trained with perfect CSI and the L-DeepSC is trained with perfect CSI, rough CSI by (14), refined CSI by (15), and without CSI, respectively. The traditional approaches are Huffman coding with (5,7) RS coding and with turbo coding (rate 1/2), both with 64-QAM. We observe that all DL-enabled approaches are more competitive under the fading channels. RS coding is better than turbo coding in terms of BLEU score. This is because RS coding is linear block coding with long block-length, which can correct long bit sequences; however, turbo coding is convolutional coding with short block-length, where the coded bits are only related to the previous m bits, i.e., m = 3, so that the adjacent words result in a higher error rate. The performance of L-DeepSC is very close to that of DeepSC in terms of BLEU score, but requires much less bandwidth for communications. The system trained without CSI performs worse than those trained with CSI, especially under the Rayleigh fading channels, which also confirms the analysis of (10) and (11). Without CSI, the performance difference between the Rayleigh channels and the Rician channels is caused by the line-of-sight (LOS) component, which can help the systems recognize the semantic information during training. Besides, with the aid of CSI, the effects of the fading channels are mitigated significantly, as we have analyzed before. When SNR is low, the system with perfect CSI or refined CSI outperforms that with rough CSI. As SNR increases, all these systems, L-DeepSC with perfect CSI, refined CSI, and rough CSI, converge to similar performance gradually.

Fig. 8. The BLEU scores versus SNR under Rician fading channels, with perfect CSI, rough CSI, refined CSI, and no CSI.

Fig. 9. The BLEU scores versus SNR under Rayleigh fading channels, with perfect CSI, rough CSI, refined CSI, and no CSI.

Fig. 10. The BLEU scores of different SNRs versus sparsity ratio, γ, under Rician fading channels with the refined CSI.

Fig. 11. The BLEU scores of different SNRs versus quantization level, m, under Rician fading channels with the refined CSI.

C. Model Compression

In this experiment, we investigate the performance of the network slimmer, including network sparsification, network quantization, and the combination of both. The pre-trained model used for pruning and quantization is trained with the 4-bit constellation under the Rician fading channels.

Fig. 10 shows the influence of the network sparsity ratio, γ, on the BLEU scores with different SNRs under the Rician fading channels, where the system is pruned directly when γ increases from 0 to 0.9 and is pruned with fine-tuning when γ increases further to 0.99. The proposed L-DeepSC achieves almost the same BLEU scores when γ increases from 0 to 0.9, which shows that there exists a mass of weight redundancy in the trained DeepSC model. When γ increases to 0.99, the BLEU scores still drop slightly even with the fine-tuning process, where the performance loss at 0 dB and 6 dB is larger than that at 12 dB and 18 dB. Thus, for the high SNR cases, the model can be pruned directly with only slight performance degradation. For the low SNR region, it is possible to prune 99% of the weights without significant performance degradation when the system is sensitive to power consumption.

Fig. 11 demonstrates the relationship between the BLEU score and the quantization bit number, m, under the Rician fading channels, where m is defined in (19), and the system is quantized with QAT when m is smaller than 2. The performance with m = 8 to m = 20 is similar, which indicates the effectiveness of low-resolution neural networks. If the system is more sensitive to power consumption and can tolerate certain performance degradation, the resolution of the neural networks can be further reduced to the 4-bit level. However, the BLEU score decreases dramatically from m = 4 to m = 2 over the whole SNR range since most of the key information is removed in the low-resolution neural network.

Table II compares the BLEU scores and compression ratios under different combinations of weight pruning and weight quantization with SNR = 12 dB, where the compression ratio is computed by

$$\psi = \frac{M \times 32}{M_{\mathrm{pruned}} \times m}, \qquad (25)$$

where $M$ is the number of weights before pruning, $M_{\mathrm{pruned}}$ is the number of weights remaining after pruning, 32 is the number of required bits for FP32, and $m$ is the number of required bits after quantization. The performance decreases when γ increases or m decreases, which is consistent with Fig. 10 and Fig. 11. From the table, different compression ratios could lead to similar performance. For example, the BLEU score with γ = 30% and m = 8 is similar to that with γ = 90% and m = 12, but the compression ratios differ by about five times, i.e., 5.714 and 26.667. By properly choosing a suitable sparsity ratio and a quantization level, the same performance can be achieved but with a higher compression ratio.

TABLE II
THE BLEU SCORE AND COMPRESSION RATIO, ψ, COMPARISONS VERSUS DIFFERENT SPARSITY RATIO, γ, AND QUANTIZATION LEVEL, m, AT SNR = 12 dB

Pruned Model   m = 4 (BLEU, ψ)     m = 8 (BLEU, ψ)     m = 12 (BLEU, ψ)     m = 16 (BLEU, ψ)     m = 32 (BLEU, ψ)
γ = 0          0.811194, 8         0.906763, 4         0.902354, 2.667      0.903089, 2          0.895602, 1
γ = 0.3        0.838967, 11.429    0.892745, 5.714     0.908537, 3.81       0.910184, 2.857      0.89851, 1.429
γ = 0.6        0.835863, 20.0      0.897143, 10.0      0.90815, 6.667       0.900468, 5.0        0.9093, 2.5
γ = 0.9        0.810322, 80.0      0.895306, 40.0      0.898784, 26.667     0.910554, 20.0       0.89515, 10
γ = 0.95       0.779685, 160.0     0.875814, 80.0      0.873426, 53.333     0.877221, 40.0       0.87653, 20

Table III compares the DeepSC and L-DeepSC with 60% weight sparsity and 8-bit quantization when SNR is 12 dB, where we mainly consider the transmission of the weights. The simulation is performed on a CPU, an Intel Core [email protected]. After the network slimmer, the model size is reduced from 12.3 MB to 1.28 MB while achieving a similar BLEU score, which means the bandwidth resource can be saved significantly without degrading the performance. Besides, the runtime slightly decreases from 20 ms to 18 ms since the unstructured pruning method is employed, and there exist the communication time between flash memory and some operations that cannot be optimized. If the model size were bigger, the L-DeepSC could save more runtime.

TABLE III
THE COMPARISON BETWEEN L-DEEPSC AND DEEPSC TRANSCEIVERS IN PARAMETERS, SIZE, RUNTIME, AND BLEU SCORE

Model              Parameters   Size      Runtime   BLEU score
γ = 0, m = 32      3,333,120    12.3 MB   20 ms     0.895602
γ = 0.6, m = 8     1,333,247    1.28 MB   18 ms     0.897143

V. CONCLUSION

In this paper, we proposed a lite distributed semantic communication system, named L-DeepSC, for Internet of Things (IoT) networks, where the participating devices are usually with limited power and computing capabilities. Specially, the receiver and feature extractor were designed jointly for text transmission. Firstly, we analyzed the effectiveness of CSI in forward-propagation and back-propagation during system training over the fading channels. The analytical results reveal that the fading channels contaminate the weight updates and restrict the model representation capability. Thus, a refined LS estimator with fewer pilot overheads was developed to eliminate the effects of fading channels. Besides, we map the full-resolution original constellation into a finite-bits constellation to lower the cost of IoT devices, which was verified by simulation results. Finally, due to the limited narrow bandwidth and computational capability in IoT networks, two model compression approaches have been proposed: 1) network sparsification to prune the unnecessary weights, and 2) network quantization to reduce the weight resolution. The simulation results validated that the proposed L-DeepSC outperforms the traditional methods, especially in the low SNR regime, and provided insights into the balance among compression ratio, sparsity ratio, and quantization level. Therefore, our proposed L-DeepSC is a promising candidate for intelligent IoT networks, especially in the low SNR regime.

REFERENCES

[1] L. Atzori, A. Iera, and G. Morabito, "The Internet of Things: a survey," Computer Netw., vol. 54, no. 15, pp. 2787–2805, Oct. 2010.
[2] T. Qiu, N. Chen, K. Li, M. Atiquzzaman, and W. Zhao, "How can heterogeneous Internet of Things build our future: A survey," IEEE Commun. Surv. Tutorials, vol. 20, no. 3, pp. 2011–2027, Feb. 2018.
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[4] M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, "Deep learning for IoT big data and streaming analytics: A survey," IEEE Commun. Surv. Tutorials, vol. 20, no. 4, pp. 2923–2960, Jun. 2018.
[5] H. Li, K. Ota, and M. Dong, "Learning IoT in edge: Deep learning for the Internet of Things with edge computing," IEEE Network, vol. 32, no. 1, pp. 96–101, Jan. 2018.
[6] R. Carnap, Y. Bar-Hillel et al., An Outline of A Theory of Semantic Information. RLE Technical Reports 247, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, Oct. 1952.
[7] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge University Press, 2005.
[8] I. Guyon, S. Gunn, M. Nikravesh, and L. A. Zadeh, Feature Extraction: Foundations and Applications. Springer, 2008, vol. 207.
[9] R. Szeliski, Computer Vision: Algorithms and Applications. Springer Science & Business Media, 2010.
[10] N. Indurkhya and F. J. Damerau, Handbook of Natural Language Processing. CRC Press, 2010, vol. 2.
[11] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. The University of Illinois Press, 1949.
[12] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge University Press, 2005.