
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, VOL. 3, NO. 4, DECEMBER 2017

An Introduction to Deep Learning for the Physical Layer

Timothy O’Shea, Senior Member, IEEE, and Jakob Hoydis, Member, IEEE

Abstract—We present and discuss several novel applications of deep learning for the physical layer. By interpreting a communications system as an autoencoder, we develop a fundamental new way to think about communications system design as an end-to-end reconstruction task that seeks to jointly optimize transmitter and receiver components in a single process. We show how this idea can be extended to networks of multiple transmitters and receivers and present the concept of radio transformer networks as a means to incorporate expert domain knowledge in the machine learning model. Lastly, we demonstrate the application of convolutional neural networks on raw IQ samples for modulation classification which achieves competitive accuracy with respect to traditional schemes relying on expert features. This paper is concluded with a discussion of open challenges and areas for future investigation.

Index Terms—Machine learning, deep learning, physical layer, digital communications, modulation, radio communication, cognitive radio.

Manuscript received February 24, 2017; revised July 11, 2017 and September 18, 2017; accepted September 21, 2017. Date of publication October 2, 2017; date of current version December 22, 2017. The associate editor coordinating the review of this paper and approving it for publication was M. Zorzi. (Corresponding author: Timothy O’Shea.)
T. O’Shea is with the Bradley Department of Electrical and Computer Engineering, Virginia Tech and DeepSig, Arlington, VA 22203 USA (e-mail: [email protected]).
J. Hoydis is with the Department of Software-Defined Mobile Networks, Nokia Bell Labs, 91620 Nozay, France (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCCN.2017.2758370

I. INTRODUCTION

COMMUNICATIONS is a field of rich expert knowledge about how to model channels of different types [1], [2], compensate for various hardware imperfections [3], [4], and design optimal signaling and detection schemes that ensure a reliable transfer of data [5]. As such, it is a complex and mature engineering field with many distinct areas of investigation which have all seen diminishing returns with regards to performance improvements, in particular on the physical layer. Because of this, there is a high bar of performance over which any machine learning (ML) or deep learning (DL) based approach must pass in order to provide tangible new benefits.

In domains such as computer vision and natural language processing, DL shines because it is difficult to characterize real world images or language with rigid mathematical models. For example, while it is an almost impossible task to write a robust algorithm for detection of handwritten digits or objects in images, it is straightforward today to implement DL algorithms that learn to accomplish this task beyond human levels of accuracy [6], [7]. In communications, however, we can design transmit signals that enable straightforward analytic algorithms for symbol detection for a variety of channel and system models (e.g., detection of a constellation symbol in additive white Gaussian noise (AWGN)). Thus, as long as such models sufficiently capture real effects, we do not expect DL to yield significant improvements on the physical layer.

Nevertheless, we believe that the DL applications which we explore in this paper are a useful and insightful way of fundamentally rethinking the communications system design problem, and hold promise for performance improvements in complex communications scenarios that are difficult to describe with tractable mathematical models. Our main contributions are as follows:

• We demonstrate that it is possible to learn full transmitter and receiver implementations for a given channel model which are optimized for a chosen loss function (e.g., minimizing block error rate (BLER)). Interestingly, such “learned” systems can be competitive with respect to the current state-of-the-art. The key idea here is to represent transmitter, channel, and receiver as one deep neural network (NN) that can be trained as an autoencoder. The beauty of this approach is that it can even be applied to channel models and loss functions for which the optimal solutions are unknown.
• We extend this concept to an adversarial network of multiple transmitter-receiver pairs competing for capacity. This leads to the interference channel for which finding the best signaling scheme is a long-standing research problem. We demonstrate that such a setup can also be represented as an NN with multiple inputs and outputs, and that all transmitter and receiver implementations can be jointly optimized with respect to a common or individual performance metric(s).
• We introduce radio transformer networks (RTNs) as a way to integrate expert knowledge into the DL model. RTNs allow, for example, to carry out predefined correction algorithms (“transformers”) at the receiver (e.g., multiplication by a complex-valued number, convolution with a vector) which may be fed with parameters learned by another NN. This NN can be integrated into the end-to-end training process of a task performed on the transformed signal (e.g., symbol detection).

• We study the use of NNs on complex-valued IQ samples for the problem of modulation classification and show that convolutional neural networks (CNNs), which are the cornerstone of most DL systems for computer vision, can outperform traditional classification techniques based on expert features. This result mirrors a relentless trend in DL for various domains, where learned features ultimately outperform and displace long-used expert features, such as the scale-invariant feature transform (SIFT) [8] and Bag-of-words [9].

The ideas presented in this paper provide a multitude of interesting avenues for future research that will be discussed in detail. We hope that these will stimulate wide interest within the research community.

The rest of this article is structured as follows: Section I-A discusses potential benefits of DL for the physical layer. Section I-B presents related work. Background of DL is presented in Section II. In Section III, several DL applications for communications are presented. Section IV contains an overview and discussion of open problems and key areas of future investigation. Section V concludes the article.

A. Potential of DL for the Physical Layer

Apart from the intellectual beauty of a fully “learned” communications system, there are some reasons why DL could provide gains over existing physical layer algorithms.

First, most signal processing algorithms in communications have solid foundations in statistics and information theory and are often provably optimal for tractable mathematical models. These are generally linear, stationary, and have Gaussian statistics. A practical system, however, has many imperfections and non-linearities [4] (e.g., non-linear power amplifiers (PAs), finite resolution quantization) that can only be approximately captured by such models. For this reason, a DL-based communications system (or processing block) that does not require a mathematically tractable model and that can be optimized for a specific hardware configuration and channel might be able to better optimize for such imperfections.

Second, one of the guiding principles of communications systems design is to split the signal processing into a chain of multiple independent blocks; each executing a well defined and isolated function (e.g., source/channel coding, modulation, channel estimation, equalization). Although this approach has led to the efficient, versatile, and controllable systems we have today, it is not clear that individually optimized processing blocks achieve the best possible end-to-end performance. For example, the separation of source and channel coding for many practical channels and short block lengths (see [10] and references therein) as well as separate coding and modulation [11] are known to be sub-optimal. Attempts to jointly optimize each of these components, e.g., based on factor graphs [12], provide gains but lead to unwieldy and computationally complex systems. A learned end-to-end communications system, however, does not require such a rigid modular structure as it is optimized for end-to-end performance.

Third, it has been shown that NNs are universal function approximators [13] and recent work has shown a remarkable capacity for algorithmic learning with recurrent NNs [14] that are known to be Turing-complete [15]. Since the execution of NNs can be highly parallelized on concurrent architectures and implemented with low-precision data types [16], there is evidence that “learned” algorithms taking this form could be executed faster and at lower energy cost than their manually “programmed” counterparts.

Fourth, massively parallel processing architectures with distributed memory architectures, such as graphic processing units (GPUs) but also increasingly specialized chips for NN inference (e.g., [17]), have shown to be very energy efficient and capable of impressive computational throughput when fully utilized by concurrent algorithms [18]. The performance of such architectures, however, has been largely limited by the ability of algorithms and higher level programming languages to make efficient use of them. The inherently concurrent nature of computation and memory access across wide and deep NNs has demonstrated a surprising ability to readily achieve high resource utilization on these architectures with minimal application specific tuning or optimization required.

B. Historical Context and Related Work

Applications of ML in communications have a long history covering a wide range of applications. These comprise channel modeling and prediction, localization, equalization, decoding, quantization, compression, demodulation, modulation recognition, and spectrum sensing, to name a few [19], [20] (and references therein). However, to the best of our knowledge and due to the reasons mentioned above, few of these applications have been commonly adopted or led to a wide commercial success. It is also interesting that essentially all of these applications focus on individual receiver processing tasks alone, while the consideration of the transmitter or a full end-to-end system is entirely missing in the literature.

The advent of open-source DL libraries (see Section II-B) and readily available specialized hardware along with the astonishing progress of DL in computer vision have stimulated renewed interest in the application of DL for communications and networking [21] (and other papers in the special issue). There are currently essentially two different main approaches of applying DL to the physical layer. The goal is to either improve/augment parts of existing algorithms with DL, or to completely replace them.

Among the papers trying to improve existing algorithms are [22]–[24] that consider belief propagation channel decoding and multiple-input multiple-output (MIMO) detection, respectively. These works are inspired by the idea of deep unfolding [25] of existing iterative algorithms by essentially interpreting each iteration as a set of NN layers. In a similar manner, [26] aims at improving the solution of sparse linear inverse problems with DL.

The approach of replacing existing algorithms through DL is adopted in [27], dealing with blind detection for MIMO systems with low-resolution quantization, and [28], in which detection for molecular communications for which no mathematical channel model exists is studied. The idea of learning to solve complex optimization tasks for wireless resource allocation is investigated in [29]. Some of us have obtained initial results in the area of learned end-to-end communications systems [30] and considered the problems


of modulation recognition [31], signal compression [32], and channel decoding [33], [34] with state-of-the-art DL tools.

Notations: We use boldface upper- and lower-case letters to denote matrices and column vectors, respectively. For a vector x, x_i denotes its ith element, ‖x‖ its Euclidean norm, x^T its transpose, and x ⊙ y the element-wise product with y. For a matrix X, X_ij or [X]_ij denotes the (i, j)-element. R and C denote the sets of real and complex numbers, respectively. N(m, R) and CN(m, R) are the multivariate Gaussian and complex Gaussian distributions with mean vector m and covariance matrix R, respectively. Bern(α) is the Bernoulli distribution with success probability α and ∇ is the gradient operator.

II. DEEP LEARNING BASICS

A feedforward NN (or multilayer perceptron (MLP)) with L layers describes a mapping f(r_0; θ) : R^{N_0} → R^{N_L} of an input vector r_0 ∈ R^{N_0} to an output vector r_L ∈ R^{N_L} through L iterative processing steps:

r_ℓ = f_ℓ(r_{ℓ−1}; θ_ℓ),  ℓ = 1, . . . , L    (1)

where f_ℓ(r_{ℓ−1}; θ_ℓ) : R^{N_{ℓ−1}} → R^{N_ℓ} is the mapping carried out by the ℓth layer. This mapping depends not only on the output vector r_{ℓ−1} from the previous layer but also on a set of parameters θ_ℓ. Moreover, the mapping can be stochastic, i.e., f_ℓ can be a function of some random variables. We use θ = {θ_1, . . . , θ_L} to denote the set of all parameters of the network. The ℓth layer is called dense or fully-connected if f_ℓ(r_{ℓ−1}; θ_ℓ) has the form

f_ℓ(r_{ℓ−1}; θ_ℓ) = σ(W_ℓ r_{ℓ−1} + b_ℓ)    (2)

where W_ℓ ∈ R^{N_ℓ × N_{ℓ−1}}, b_ℓ ∈ R^{N_ℓ}, and σ(·) is an activation function which will be defined shortly. The set of parameters for this layer is θ_ℓ = {W_ℓ, b_ℓ}. Table I lists several other layer types together with their mapping functions and parameters which are used in this manuscript. All layers with stochastic mappings generate a new random mapping each time they are called. For example, the noise layer simply adds a Gaussian noise vector with zero mean and covariance matrix βI_{N_{ℓ−1}} to the input. Thus, it generates a different output for the same input each time it is called. The activation function σ(·) in (2) introduces a non-linearity which is important for the so-called expressive power of the NN. Without this non-linearity there would be not much of an advantage of stacking multiple layers on top of each other. Generally, the activation function is applied individually to each element of its input vector, i.e., [σ(u)]_i = σ(u_i). Some commonly used activation functions are listed in Table II.¹

TABLE I
LIST OF LAYER TYPES

TABLE II
LIST OF ACTIVATION FUNCTIONS

¹The linear activation function is typically used at the output layer in the context of regression tasks, i.e., estimation of a real-valued vector.
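As a concrete illustration of (1) and (2), a minimal NumPy sketch of the forward pass of such a network could look as follows; the layer sizes and the choice of ReLU and softmax activations are arbitrary examples and are not taken from the released source code [44].

import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def forward(r0, params, activations):
    # Iteratively apply (2): r_l = sigma(W_l r_{l-1} + b_l) for l = 1, ..., L.
    r = r0
    for (W, b), sigma in zip(params, activations):
        r = sigma(W @ r + b)
    return r

# Example: N0 = 4 inputs, one hidden layer with 8 units, NL = 3 outputs.
rng = np.random.default_rng(0)
params = [(rng.standard_normal((8, 4)), np.zeros(8)),
          (rng.standard_normal((3, 8)), np.zeros(3))]
rL = forward(rng.standard_normal(4), params, [relu, softmax])
print(rL, rL.sum())   # a probability vector over 3 classes, summing to 1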
NNs are generally trained using labeled training data, i.e., a set of input-output vector pairs (r_{0,i}, r*_{L,i}), i = 1, . . . , S, where r*_{L,i} is the desired output of the neural network when r_{0,i} is used as input. The goal of the training process is to minimize the loss

L(θ) = (1/S) Σ_{i=1}^{S} l(r*_{L,i}, r_{L,i})    (3)

with respect to the parameters in θ, where l(u, v) : R^{N_L} × R^{N_L} → R is the loss function and r_{L,i} is the output of the NN when r_{0,i} is used as input. Several relevant loss functions are provided in Table III. Different norms (e.g., L1, L2) of parameters or activations can be added to the loss function to favor solutions with small or sparse values (a form of regularization). The most popular algorithm to find good sets of parameters θ is stochastic gradient descent (SGD) which starts with some random initial values of θ = θ_0 and then updates θ iteratively as

θ_{t+1} = θ_t − η∇L̃(θ_t)    (4)

where η > 0 is the learning rate and L̃(θ) is an approximation of the loss function which is computed for a random mini-batch of training examples S_t ⊂ {1, 2, . . . , S} of size S_t at each iteration, i.e.,

L̃(θ) = (1/S_t) Σ_{i∈S_t} l(r*_{L,i}, r_{L,i}).    (5)

TABLE III
LIST OF LOSS FUNCTIONS
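The update rule (4)-(5) amounts to only a few lines of code. The NumPy sketch below applies mini-batch SGD to a toy linear model under a squared loss; the model, data, and hyper-parameters are illustrative assumptions and do not correspond to any experiment in this paper.

import numpy as np

rng = np.random.default_rng(1)
S, N = 1000, 5                                   # training set size, input dimension
X = rng.standard_normal((S, N))
w_true = rng.standard_normal(N)
y = X @ w_true + 0.1 * rng.standard_normal(S)    # labeled pairs (r_0, r*_L)

theta = np.zeros(N)                              # initial parameters theta_0
eta, batch_size = 0.05, 32                       # learning rate eta and mini-batch size |S_t|

for t in range(500):
    St = rng.choice(S, size=batch_size, replace=False)   # random mini-batch S_t
    err = X[St] @ theta - y[St]
    grad = X[St].T @ err / batch_size            # gradient of the mini-batch loss (5)
    theta = theta - eta * grad                   # SGD step (4)

print(np.allclose(theta, w_true, atol=0.05))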
By choosing S_t small compared to S, the gradient computation complexity is significantly reduced while still reducing weight update variance. Note that there are many variants of the SGD algorithm which dynamically adapt the learning rate to improve convergence [35, Ch. 8.5]. The gradient in (4) can be very efficiently computed through the backpropagation algorithm [35, Ch. 6.5]. Definition and training of NNs of almost arbitrary shape can be done with one of the many existing DL libraries presented in Section II-B.

A. Convolutional Layers

Convolutional neural network (CNN) layers were introduced in [6] to provide an efficient learning method for 2D images.


By tying adjacent shifts of the same weights together in a way similar to that of a filter sliding across an input vector, convolutional layers are able to force the learning of features with an invariance to shifts in the input. In doing so, they also greatly reduce the model complexity (as measured by the number of free parameters in the layer’s weight matrix) required to represent equivalent shift-invariant features using fully connected layers, reduce SGD optimization complexity, and improve generalization on appropriate datasets.

In general, a convolutional layer consists of a set of F filter weights Q^f ∈ R^{a×b}, f = 1, . . . , F (F is called the depth), which each generate a so-called feature map Y^f ∈ R^{n′×m′} from an input matrix X ∈ R^{n×m} according to the following convolution²:

Y^f_{i,j} = Σ_{k=0}^{a−1} Σ_{ℓ=0}^{b−1} Q^f_{a−k, b−ℓ} X_{1+s(i−1)−k, 1+s(j−1)−ℓ}    (6)

where s ≥ 1 is an integer parameter called stride, n′ = 1 + ⌈(n + a − 2)/s⌉ and m′ = 1 + ⌈(m + b − 2)/s⌉, and it is assumed that X is padded with zeros, i.e., X_{i,j} = 0 for all i ∉ [1, n] and j ∉ [1, m]. The output dimensions can be reduced by either increasing the stride s or by adding a pooling layer. The pooling layer partitions Y into p × p regions for each of which it computes a single output value, e.g., maximum or average value, or L2-norm.

For example, taking a vectorized grayscale image input consisting of 28 × 28 pixels and connecting it to a dense layer with the same number of activations results in a single weight matrix with 784 × 784 = 614,656 free parameters. However, if we use a convolutional feature map containing six filters each sized 5 × 5 pixels, we obtain a much reduced number of free parameters of 6 · 5 · 5 = 150. For the right kind of dataset, this technique can be extremely effective. We will see an application of convolutional layers in Section III-D. For more details on CNNs, we refer to [35, Ch. 9].

²In image processing, X is commonly a three-dimensional tensor with the third dimension corresponding to color channels. The filters’ weights are also three-dimensional and work on all input channels simultaneously.
more details on CNNs, we refer to [35, Ch. 9]. end communications system as an autoencoder and train it via
SGD. This idea is then extended to multiple transmitters and
B. Deep Learning Libraries receivers and we study as an example the two-user interference
channel. We will then introduce the concept of RTNs to
In recent times, numerous tools and algorithms have
improve performance on fading channels, and demonstrate the
emerged that make it easy to build and train large NNs. Tools
application of CNNs to raw radio frequency time-series data
to deploy such training routines from high level language to
for the task of modulation classification.
massively parallel GPU architectures have been key enablers.
Among these are Caffe [38], MXNet [39], TensorFlow [40],
Theano [41], and Torch [42] (just to name a few), which allow A. Autoencoders for End-to-End Communications
for high level algorithm definition in various programming Systems
languages or configuration files, automatic differentiation of In its simplest form, a communications system consists of a
training loss functions through arbitrarily large networks, and transmitter, a channel, and a receiver, as shown in Fig. 1. The
compilation of the network’s forwards and backwards passes transmitter wants to communicate one out of M possible mes-
into hardware optimized concurrent dense matrix algebra ker- sages s ∈ M = {1, 2, . . . , M} to the receiver making n discrete
nels. Keras [43] provides an additional layer of NN primitives uses of the channel. To this end, it applies the transformation
with Theano and TensorFlow as its back-end. It has a highly f : M → Rn to the message s to generate the transmitted sig-
customizable interface to quickly experiment with and deploy nal x = f (s) ∈ Rn .3 Generally, the hardware of the transmitter
deep NNs, and has become our primary tool used to generate imposes certain constraints on x, e.g., an energy constraint
the numerical results for this manuscript [44].
3 We focus here on real-valued signals only. Extensions to complex-valued
2 In image processing, X is commonly a three-dimensional tensor with the signals are discussed in Section IV. Alternatively, one can consider a mapping
third dimension corresponding to color channels. The filters’ weights are also to R2n , which can be interpreted as a concatenation of the real and imaginary
three-dimensional and work on all input channels simultaneously. parts of x. This approach is adopted in Sections III-B–III-D.


‖x‖²₂ ≤ n, an amplitude constraint |x_i| ≤ 1 ∀i, or an average power constraint E[|x_i|²] ≤ 1 ∀i. The communication rate of this communications system is R = k/n [bit/channel use], where k = log₂(M). In the sequel, the notation (n,k) means that a communications system sends one out of M = 2^k messages (i.e., k bits) through n channel uses. The channel is described by the conditional probability density function p(y|x), where y ∈ R^n denotes the received signal. Upon reception of y, the receiver applies the transformation g : R^n → M to produce the estimate ŝ of the transmitted message s.

From a DL point of view, this simple communications system can be seen as a particular type of autoencoder [35, Ch. 14]. Typically, the goal of an autoencoder is to find a low-dimensional representation of its input at some intermediate layer which allows reconstruction at the output with minimal error. In this way, the autoencoder learns to non-linearly compress and reconstruct the input. In our case, the purpose of the autoencoder is different. It seeks to learn representations x of the messages s that are robust with respect to the channel impairments mapping x to y (i.e., noise, fading, distortion, etc.), so that the transmitted message can be recovered with small probability of error. In other words, while most autoencoders remove redundancy from input data for compression, this autoencoder (the “channel autoencoder”) often adds redundancy, learning an intermediate representation robust to channel perturbations.

An example of such an autoencoder is shown in Fig. 2. Here, the transmitter consists of a feedforward NN with multiple dense layers followed by a normalization layer that ensures that physical constraints on x are met. Note that the input s to the transmitter is encoded as a one-hot vector 1_s ∈ R^M, i.e., an M-dimensional vector, the sth element of which is equal to one and zero otherwise. The channel is represented by an additive noise layer with a fixed variance β = (2RE_b/N_0)^{−1}, where E_b/N_0 denotes the energy per bit (E_b) to noise power spectral density (N_0) ratio. The receiver is also implemented as a feedforward NN. Its last layer uses a softmax activation whose output p ∈ (0, 1)^M is a probability vector over all possible messages. The decoded message ŝ corresponds then to the index of the element of p with the highest probability.

Fig. 2. A communications system over an AWGN channel represented as an autoencoder. The input s is encoded as a one-hot vector, the output is a probability distribution over all possible messages from which the most likely is picked as output ŝ.

The autoencoder can then be trained end-to-end using SGD on the set of all possible messages s ∈ M using the well suited categorical cross-entropy loss function between 1_s and p.⁴

Fig. 3a compares the block error rate (BLER), i.e., Pr(ŝ ≠ s), of a communications system employing binary phase-shift keying (BPSK) modulation and a Hamming (7,4) code with either binary hard-decision decoding or maximum likelihood decoding (MLD) against the BLER achieved by the trained autoencoder (7,4) (with fixed energy constraint ‖x‖²₂ = n). Both systems operate at rate R = 4/7. For comparison, we also provide the BLER of uncoded BPSK (4,4). This result shows that the autoencoder has learned without any prior knowledge an encoder and decoder function that together achieve the same performance as the Hamming (7,4) code with MLD. The layout of the autoencoder is provided in Table IV. Although a single layer can represent the same mapping from message index to corresponding transmit vector, our experiments have shown that SGD converges to a better global solution using two transmit layers instead of one. This increased dimension parameter search space may actually help to reduce likelihood of convergence to sub-optimal minima by making such solutions more likely to emerge as saddle points during optimization [47]. Training was done at a fixed value of E_b/N_0 = 7 dB (see Section IV-B) using Adam [46] with learning rate 0.001. We have observed that increasing the batch size while reducing the learning rate during training helps to improve accuracy. For all other implementation details, we refer to the source code [44].
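A minimal Keras sketch of such a channel autoencoder is given below. It follows the structure described above and in Fig. 2 (one-hot input, dense transmitter layers, an energy normalization, an AWGN noise layer, and a softmax receiver), but the layer widths, amount of training data, and training snippet are illustrative assumptions rather than the exact released implementation [44].

import numpy as np
from keras.layers import Input, Dense, GaussianNoise, Lambda
from keras.models import Model
from keras import backend as K

M, n = 16, 7                         # a (7,4) system: M = 2^4 messages over n = 7 channel uses
R = np.log2(M) / n                   # communication rate k/n
EbN0 = 10 ** (7 / 10.0)              # training SNR of E_b/N_0 = 7 dB
noise_std = np.sqrt(1 / (2 * R * EbN0))

s = Input(shape=(M,))                                        # one-hot message 1_s
x = Dense(M, activation='relu')(s)                           # transmitter
x = Dense(n, activation='linear')(x)
x = Lambda(lambda v: (n ** 0.5) * K.l2_normalize(v, axis=1))(x)   # fixed energy ||x||^2 = n
y = GaussianNoise(noise_std)(x)                              # AWGN channel (active during training)
p = Dense(M, activation='relu')(y)                           # receiver
p = Dense(M, activation='softmax')(p)                        # probability vector p

autoencoder = Model(s, p)
autoencoder.compile(optimizer='adam', loss='categorical_crossentropy')

# Train on random one-hot messages; the labels equal the inputs.
S = np.eye(M)[np.random.randint(M, size=10000)]
autoencoder.fit(S, S, epochs=10, batch_size=256, verbose=0)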
⁴A more memory-efficient approach to implement this architecture is by replacing the one-hot encoded input and the first dense layer by an embedding that turns message indices into vectors. The loss function can then be replaced by the sparse categorical cross-entropy that accepts message indices rather than one-hot vectors as labels. This was done in our experiments [44].
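A sketch of the memory-efficient variant described in footnote 4, reusing the toy dimensions of the sketch above; the intermediate transmitter, channel, and receiver layers are elided here and would be inserted as before.

from keras.layers import Input, Embedding, Flatten, Dense
from keras.models import Model

M = 16
s_idx = Input(shape=(1,), dtype='int32')                        # message index instead of one-hot vector
e = Embedding(input_dim=M, output_dim=M, input_length=1)(s_idx)  # learned lookup replacing the first dense layer
e = Flatten()(e)
# ... transmitter, normalization, channel, and receiver layers as in the previous sketch ...
p = Dense(M, activation='softmax')(e)                           # placeholder for the remaining layers

model = Model(s_idx, p)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')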

Fig. 3b shows a similar comparison but for an (8,8) and (2,2) communications system, i.e., R = 1. Surprisingly, while the autoencoder achieves the same BLER as uncoded BPSK for (2,2), it outperforms the latter for (8,8) over the full range of E_b/N_0. This implies that it has learned some joint coding and modulation scheme, such that a coding gain is achieved. For a truly fair comparison, this result should be compared to


Fig. 3. BLER versus Eb /N0 for the autoencoder and several baseline communication schemes.

a higher-order modulation scheme using a channel code (or the optimal sphere packing in eight dimensions). A detailed performance comparison for various channel types and parameters (n, k) with different baselines is out of the scope of this paper and left to future investigations.

Fig. 4 shows the learned representations x of all messages for different values of (n, k) as complex constellation points, i.e., the x- and y-axes correspond to the first and second transmitted symbols, respectively. In Fig. 4d for (7, 4), we depict the seven-dimensional message representations using a two-dimensional t-distributed stochastic neighbor embedding (t-SNE) [48] of the noisy observations y instead. Fig. 4a shows the simple (2, 2) system which converges rapidly to a classical quadrature phase shift keying (QPSK) constellation with some arbitrary rotation. Similarly, Fig. 4b shows a (2, 4) system which leads to a rotated 16-PSK constellation. The impact of the chosen normalization becomes clear from Fig. 4c for the same parameters but with an average power normalization instead of a fixed energy constraint (that forces the symbols to lie on the unit circle). This results in an interesting mixed pentagonal/hexagonal grid arrangement (with indistinguishable BLER performance from 16-QAM). The constellation has a symbol in the origin surrounded by five equally spaced nearest neighbors, each of which has six almost equally spaced neighbors. Fig. 4d for (7, 4) shows that the t-SNE embedding of y leads to a similarly shaped arrangement of clusters.

Fig. 4. Constellations produced by autoencoders using parameters (n, k): (a) (2, 2), (b) (2, 4), (c) (2, 4) with average power constraint, (d) (7, 4) 2-dimensional t-SNE embedding of received symbols.

The examples of this section treat the communication task as a classification problem and the representation of s by an M-dimensional vector becomes quickly impractical for large M. To circumvent this problem, it is possible to use more compact representations of s such as a binary vector with log₂(M) dimensions. In this case, output activation functions such as sigmoid and loss functions such as MSE or binary cross-entropy are more appropriate. Nevertheless, scaling such an architecture to very large values of M remains challenging due to the size of the training set and model. We recall that a very important property of the autoencoder is that it can learn to communicate over any channel, even one for which no information-theoretically optimal scheme is known.

B. Autoencoders for Multiple Transmitters and Receivers

The autoencoder concept from Section III-A can be readily extended to multiple transmitters and receivers that share a common channel. As an example, we consider here the two-user AWGN interference channel as shown in Fig. 5. Transmitter 1 wants to communicate message s₁ ∈ M to Receiver 1 while Transmitter 2 wants to communicate message s₂ ∈ M to Receiver 2. Extensions to K users with possibly different rates, i.e., s_k ∈ M_k ∀k, as well as to other channel


Fig. 5. The two-user interference channel seen as a combination of two interfering autoencoders that try to reconstruct their respective messages.

TABLE IV
LAYOUT OF THE AUTOENCODER USED IN FIGS. 3a AND 3b. IT HAS (2M + 1)(M + n) + 2M TRAINABLE PARAMETERS, RESULTING IN 62, 791, AND 135,944 PARAMETERS FOR THE (2,2), (7,4), AND (8,8) AUTOENCODER, RESPECTIVELY

types are straightforward. Both transmitter-receiver pairs are implemented as NNs and the only difference with respect to the autoencoder from the last section is that the transmitted messages x₁, x₂ ∈ C^n now interfere at the receivers, resulting in the noisy observations

y₁ = x₁ + x₂ + n₁    (7)
y₂ = x₂ + x₁ + n₂    (8)

where n₁, n₂ ∼ CN(0, βI_n) is Gaussian noise. For simplicity, we have adopted the complex-valued notation, rather than considering real-valued vectors of size 2n. That is, the notation (n, k) means each of the 2^k messages is transmitted over n complex-valued channel uses. Denote by

l₁ = −log([ŝ₁]_{s₁}),  l₂ = −log([ŝ₂]_{s₂})    (9)

the individual cross-entropy loss functions of the first and second transmitter-receiver pair, respectively, and by L̃₁(θ_t), L̃₂(θ_t) the associated losses for mini-batch t (see (5)). In such a context, it is less clear how one should train two coupled autoencoders with conflicting goals. One approach consists of minimizing a weighted sum of both losses, i.e., L̃ = αL̃₁ + (1 − α)L̃₂ for some α ∈ [0, 1]. If one would minimize L̃₁ alone (i.e., α = 1), Transmitter 2 would learn to transmit a constant signal independent of s₂ that Receiver 1 could simply subtract from y₁. The opposite is true for α = 0. However, giving equal weight to both losses (i.e., α = 0.5) does not necessarily result in equal performance. We have observed in our experiments that it generally leads to highly unfair and suboptimal solutions. For this reason, we have adopted dynamic weights α_t for each mini-batch t:

α_{t+1} = L̃₁(θ_t) / (L̃₁(θ_t) + L̃₂(θ_t)),  t > 0    (10)

where α₀ = 0.5. Thus, the smaller L̃₁(θ_t) is compared to L̃₂(θ_t), the smaller is its weight α_{t+1} for the next mini-batch. There are many other possibilities to train such a system and we do not claim any optimality of our approach. However, it has led in our experiments to the desired result of identical BLERs for both transmitter-receiver pairs.
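The dynamic weighting (10) can be realized with a simple update between mini-batches. The sketch below uses two decaying stand-in loss sequences in place of the actual per-autoencoder mini-batch losses, purely to show the bookkeeping; it is not the training loop used to produce Fig. 6.

import numpy as np

rng = np.random.default_rng(3)

def loss1(t):   # stand-in for the mini-batch loss of autoencoder 1 (hypothetical)
    return 1.0 / (1 + t) + 0.05 * rng.random()

def loss2(t):   # stand-in for the mini-batch loss of autoencoder 2 (hypothetical)
    return 1.5 / (1 + t) + 0.05 * rng.random()

alpha = 0.5                         # alpha_0 = 0.5
for t in range(100):
    L1, L2 = loss1(t), loss2(t)
    total = alpha * L1 + (1 - alpha) * L2      # the weighted objective for this mini-batch
    # Dynamic weight update (10): the smaller L1 is compared to L2,
    # the smaller its weight alpha_{t+1} for the next mini-batch.
    alpha = L1 / (L1 + L2)

print(round(alpha, 3))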
Fig. 6. BLER versus E_b/N_0 for the two-user interference channel achieved by the autoencoder (AE) and 2^{2k/n}-QAM time-sharing (TS) for different parameters (n, k).

Fig. 6 shows the BLER of one of the autoencoders (denoted by AE) as a function of E_b/N_0 for the sets of parameters (n, k) = {(1, 1), (2, 2), (4, 4), (4, 8)}. The NN-layout for both autoencoders is that provided in Table IV by replacing n by 2n. We have used an average power constraint to be competitive with higher-order modulation schemes (see Fig. 4c). As a baseline for comparison, we provide the BLER of uncoded 2^{2k/n}-QAM that has the same rate when used together with time-sharing (TS) between both transmitters.⁵ While the autoencoder and time-sharing have identical BLERs for (1, 1) and (2, 2), the former achieves substantial gains of around 0.7 dB for (4, 4) and 1 dB for (4, 8) at a BLER of 10⁻³. The reasons for this are similar to those explained in Section III-A.

⁵For (1, 1), (2, 2), and (4, 4), each transmitter sends a 4-QAM (i.e., QPSK) symbol on every other channel use. For (4, 8), 16-QAM is used instead.

It is interesting to have a look at the learned message representations which are shown in Fig. 7. For (1, 1), the transmitters have learned to use binary phase shift keying (BPSK)-like constellations in orthogonal directions (with an arbitrary rotation around the origin). This achieves the same performance as QPSK with time-sharing. However, for (2, 2), the learned constellations are not orthogonal anymore and can be interpreted as some form of super-position coding. For the


Fig. 7. Learned constellations for the two-user interference channel with parameters (a) (1, 1), (b) (2, 2), (c) (4, 4), and (d) (4, 8). The constellation points of Transmitter 1 and 2 are represented by red dots and black crosses, respectively.

first symbol, Transmitter 1 uses high power and Transmitter 2 low power. For the second symbol, the roles are changed. For (4, 4) and (4, 8), the constellations are more difficult to interpret, but we can see that the constellations of both transmitters resemble ellipses with orthogonal major axes and varying focal distances. This effect is more visible for (4, 8) than for (4, 4) because of the increased number of constellation points. An in-depth study of learned constellations and how they are impacted by the chosen normalization and NN weight initializations is out of the scope of this paper but a very interesting topic of future investigations.

We would like to point out that one can also consider other types of multi-transmitter/receiver communications systems with this approach. These comprise the general multiple access channel (MAC) and broadcast channel (BC), as well as systems with jammers and eavesdroppers. As soon as some of the transmitters and receivers are non-cooperative, adversarial training strategies could be adopted (see [49], [50]).

C. Radio Transformer Networks for Augmented Signal Processing Algorithms

Many of the physical phenomena undergone in a communications channel and in transceiver hardware can be inverted using compact parametric models/transformations. Widely used transformations include re-sampling to estimated symbol/clock timing, mixing with an estimated carrier tone, and convolving with an inverse channel impulse response. The estimation processes for parameters to seed these transformations (e.g., frequency offset, symbol timing, impulse response) is often very involved and specialized based on signal specific properties and/or information from pilot tones (see [3]).

One way of augmenting DL models with expert propagation domain knowledge but not signal specific assumptions is through the use of an RTN as shown in Fig. 8. This RTN could be used in place of the receiver shown in Fig. 2. An RTN consists of three parts: (i) a learned parameter estimator g_ω : R^n → R^p which computes a parameter vector ω ∈ R^p from its input y, (ii) a parametric transform t : R^n × R^p → R^n that applies a deterministic (and differentiable) function to y which is parameterized by ω and suited to the propagation phenomena, and (iii) a learned discriminative network g : R^n → M which produces the estimate ŝ of the transmitted message (or other label information) from the canonicalized input ȳ ∈ R^n. By allowing the parameter estimator g_ω to take the form of an NN, we can train the system end-to-end to optimize for a given loss function. Importantly, the training process of such an RTN does not seek to directly improve the parameter estimation itself but rather optimizes the way the parameters are estimated to obtain the best end-to-end performance (e.g., BLER). While the example above describes an RTN for receiver-side processing, it can similarly be used wherever parametric transformations seeded by estimated parameters are needed. RTNs are a form of learned feed-forward attention inspired by Spatial Transformer Networks (STNs) [51] which have worked well for computer vision problems.

The basic functioning of an RTN is best understood from a simple example, such as the problem of phase offset estimation and compensation. Let y_c = e^{jφ} ỹ_c ∈ C^n be a vector of IQ samples that have undergone a phase rotation by the phase offset φ, and let y = [Re{y_c}^T, Im{y_c}^T]^T ∈ R^{2n}. The goal of g_ω is to estimate a scalar φ̂ = ω = g_ω(y) that is close to the phase offset φ, which is then used by the parametric transform t to compute ȳ_c = e^{−jφ̂} y_c. The canonicalized signal ȳ = [Re{ȳ_c}^T, Im{ȳ_c}^T]^T is thus given by

ȳ = t(φ̂, y) = [ cos(φ̂)Re{y_c} + sin(φ̂)Im{y_c} ;  cos(φ̂)Im{y_c} − sin(φ̂)Re{y_c} ]    (11)

and then fed into the discriminative network g for further processing, such as classification.
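The parametric transform (11) itself is a fixed, differentiable operation. A NumPy version acting on the stacked real/imaginary representation is sketched below; the estimator g_ω that would produce φ̂ is omitted and a known offset is used instead for illustration.

import numpy as np

def phase_transform(y, phi_hat):
    # Apply (11): de-rotate the complex samples hidden in y = [Re; Im] by phi_hat.
    n = y.shape[0] // 2
    re, im = y[:n], y[n:]
    re_bar = np.cos(phi_hat) * re + np.sin(phi_hat) * im
    im_bar = np.cos(phi_hat) * im - np.sin(phi_hat) * re
    return np.concatenate([re_bar, im_bar])

# Rotate a random complex signal by phi and undo it with a perfect estimate phi_hat = phi.
rng = np.random.default_rng(4)
yc_tilde = rng.standard_normal(8) + 1j * rng.standard_normal(8)
phi = 0.7
yc = np.exp(1j * phi) * yc_tilde
y = np.concatenate([yc.real, yc.imag])
y_bar = phase_transform(y, phi)
print(np.allclose(y_bar[:8] + 1j * y_bar[8:], yc_tilde))   # True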
A compelling example demonstrating the advantages of RTNs is shown in Fig. 9 which compares the BLER of an autoencoder (8,4) with and without RTN over a multipath fading channel with L = 3 channel taps. We assume complex-valued channel uses, so that transmitter and receiver have 2n real-valued inputs and outputs. The received signal y = [Re{y_c}^T, Im{y_c}^T]^T ∈ R^{2n} is given as

y_{c,i} = Σ_{ℓ=1}^{L} h_{c,ℓ} x_{c,i−ℓ+1} + n_{c,i}    (12)

where h_c ∼ CN(0, L^{−1}I_L) are i.i.d. Rayleigh fading channel taps, n_c ∼ CN(0, (RE_b/N_0)^{−1}I_n) is receiver noise, and x_c ∈ C^n is the transmitted signal, where we assume in (12) x_{c,i} = 0 for i ≤ 0. Here, the goal of the parameter estimator is to predict a complex-valued vector ω_c (represented by 2L real values) that is used in the transformation layer to compute the complex convolution of y_c with ω_c. Thus, the RTN tries to equalize the channel output through inverse filtering in order to simplify the task of the discriminative network. We have implemented the estimator as an NN with two dense layers with tanh activations followed by a dense output layer with linear activations.

Fig. 8. A radio receiver represented as an RTN. The input y first runs through a parameter estimation network g_ω(y), has a known transform t(y, ω) applied to generate the canonicalized signal ȳ, and then is classified in the discriminative network g(ȳ) to produce the output ŝ.

Fig. 9. BLER versus E_b/N_0 for various communication schemes over a channel with L = 3 Rayleigh fading taps.
Fig. 10. Autoencoder training loss with and without RTN.
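The channel model (12) and the complex convolution performed by the RTN's transformation layer can be simulated in a few lines. The sketch below draws Rayleigh taps, applies the channel, and then convolves the output with a parameter vector ω_c; the crude one-tap choice of ω_c is only a placeholder for what the parameter estimator network would produce.

import numpy as np

rng = np.random.default_rng(5)
n, L = 16, 3
R, EbN0 = 1.0, 10 ** (10 / 10.0)

# Transmitted complex signal and i.i.d. Rayleigh fading taps h ~ CN(0, I_L / L).
x = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
h = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) * np.sqrt(1 / (2 * L))
noise_var = 1 / (R * EbN0)
w = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) * np.sqrt(noise_var / 2)

# Channel model (12): y_i = sum_l h_l x_{i-l+1} + n_i, with x_i = 0 for i <= 0.
y = np.convolve(x, h)[:n] + w

# Transformation layer: complex convolution of y with a length-L parameter vector
# omega_c (in the RTN, omega_c would be produced by the estimator network g_omega(y)).
omega_c = np.conj(h[:1]) / np.abs(h[0]) ** 2   # crude one-tap placeholder, illustration only
y_eq = np.convolve(y, omega_c)[:n]
print(y.shape, y_eq.shape)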

While the plain autoencoder struggles to meet the performance of differential BPSK (DBPSK) with maximum likelihood sequence estimation (MLE) and a Hamming (7,4) code, the autoencoder with RTN outperforms it. Another advantage of RTNs is faster training convergence which can be seen from Fig. 10 that compares the validation loss of the autoencoder with and without RTN as a function of the training epochs. We have observed in our experiments that the autoencoder with RTN consistently outperforms the plain autoencoder, independently of the chosen hyper-parameters. However, the performance differences diminish when the encoder and decoder networks are made wider and trained for more iterations. Although there is theoretically nothing an RTN-augmented NN can do that a plain NN cannot, the RTN helps by incorporating domain knowledge to simplify the target manifold, similar to the role of convolutional layers in imparting translation invariance where appropriate. This leads to a simpler search space and improved generalization.

The autoencoder and RTN as presented above can be extended through minor modifications to operate directly on IQ samples rather than symbols to effectively deal with problems such as pulse shaping, timing-, frequency- and phase-offset compensation. This is an exciting and promising area of research that we leave to future investigations. Interesting applications of this approach could also arise in optical communications dealing with highly non-linear channel impairments that are notoriously difficult to model and compensate for [52].

D. CNNs for Classification Tasks

Many signal processing functions within the physical layer can be learned as either regression or classification tasks. Here we look at the well-known problem of modulation classification of single carrier modulation schemes based on sampled radio frequency time-series data, i.e., IQ samples. This task has been accomplished for years through the approach of expert


feature engineering and either analytic decision trees (single trees are widely used in practice) or trained discrimination methods operating on a compact feature space, such as support vector machines, random forests, or small feedforward NNs [53]. Some recent methods take a step beyond this using pattern recognition on expert feature maps, such as the spectral coherence function or α-profile, combined with NN-based classification [54]. However, approaches to this point have not sought to use feature learning on raw time-series data in the radio domain. This is however now the norm in computer vision which motivates our approach here.

As is widely done for image classification, we leverage a series of narrowing convolutional layers followed by dense/fully-connected layers and terminated with a dense softmax layer for our classifier (similar to a VGG architecture [55]). The layout is provided in Table V and we refer to the source code [44] for implementation details. The dataset⁶ for this benchmark consists of 1.2M sequences of 128 complex-valued baseband IQ samples corresponding to ten different digital and analog single-carrier modulation schemes (AM, FM, PSK, QAM, etc.) that have gone through a wireless channel with harsh impairments including multipath fading and both clock and carrier rate offset [31]. The samples are taken at 20 different SNRs within the range from −20 dB to 18 dB.

TABLE V
LAYOUT OF THE CNN FOR MODULATION CLASSIFICATION WITH 324,330 TRAINABLE PARAMETERS

⁶RML2016.10b—https://radioml.com/datasets/radioml-2016-10-dataset/.
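A Keras sketch in the spirit of this architecture is shown below; the number of filters, kernel sizes, and dense width are illustrative guesses and not the exact layout of Table V or of the released source code [44].

from keras.models import Sequential
from keras.layers import Conv2D, Dropout, Flatten, Dense

num_classes = 10                        # ten modulation schemes

model = Sequential()
# Input: a 2 x 128 real tensor holding the I and Q components of 128 samples.
model.add(Conv2D(64, (1, 3), activation='relu', padding='valid',
                 input_shape=(2, 128, 1)))
model.add(Dropout(0.5))
model.add(Conv2D(16, (2, 3), activation='relu', padding='valid'))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()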
Fig. 11. Classifier performance comparison versus SNR.
Fig. 12. Confusion matrix of the CNN (SNR = 10 dB).

In Fig. 11, we compare the classification accuracy of the CNN against that of extreme gradient boosting⁷ with 1000 estimators, as well as a single scikit-learn decision tree [56], operating on a mix of 16 analog and cumulant expert features as proposed in [53] and [57]. The short-time nature of the examples places this task on the difficult end of the modulation classification spectrum since we cannot compute expert features with high stability over long periods of time. The CNN outperforms the boosted feature-based classifier by around 4 dB in the low to medium SNR range while the performance at high SNR is similar. Performance in the single tree case is about 6 dB worse than the CNN at medium SNR and 3.5% worse at high SNR. Fig. 12 shows the confusion matrix for the CNN at SNR = 10 dB, revealing confusing cases between QAM16 and QAM64 and between Wideband FM (WBFM) and double-sideband AM (AM-DSB). Despite the high SNR, classification is imperfect due to several other impairments as described above. The distinction between AM-DSB and WBFM is additionally complicated by the small observation window (0.64 ms of modulated speech per example) and low information rate with frequent silence between words. Discriminating between QAM16 and QAM64 also suffers from short-time observations over only a few symbols since constellations are higher order and share common points. The accuracy of the feature-based classifier saturates at high SNR for the same reasons, and neither classifier reaches a perfect score on this dataset. George and Huerta [58] report on a successful application of a similar CNN for the detection of black hole mergers in astrophysics from noisy time-series data.

⁷At the time of writing of this document, XGB (http://xgboost.readthedocs.io/) was together with CNNs the ML model that consistently won competitions on the data-science platform Kaggle (https://www.kaggle.com/).

IV. DISCUSSION AND OPEN RESEARCH CHALLENGES

A. Data Sets and Challenges

In order to compare the performance of ML models and algorithms, it is crucial to have common benchmarks and open datasets. While this is the rule in the computer vision, voice


recognition, and natural language processing domains (e.g., MNIST⁸ or ImageNet⁹), nothing comparable exists for communications. This domain is somewhat different because it deals with inherently man-made signals that can be accurately generated synthetically, allowing the possibility of standardizing data generation routines rather than just data in some cases. It would be also desirable to establish a set of common problems and the corresponding datasets (or data-generating software) on which researchers can benchmark and compare their algorithms. One such example task is modulation classification in Section III-D; others could include mapping of impaired received IQ samples or symbols to codewords or bits. Even “autoencoder competitions” could be held for a standardized set of benchmark impairments, taking the form of canonical “impairment layers” that would need to be made available for some of the major DL libraries (see Section II-B).

⁸http://yann.lecun.com/exdb/mnist/
⁹http://www.image-net.org/

B. Data Representation, Loss Functions, and Training SNR

As DL for communications is a new field, little is known about optimal data representations, loss-functions, and training strategies. For example, binary signals can be represented as binary or one-hot vectors, modulated (complex) symbols, or integers, and the optimal representation might depend among other factors of the NN architecture, learning objective, and loss function. In decoding problems, for instance, one would have the choice between plain channel observations or (clipped) log-likelihood ratios. In general, it seems that there is a representation which is most suited to solve a particular task via an NN. Similarly, it is not obvious at which SNR(s) DL processing blocks should be trained. It is clearly desirable that a learned system operates at any SNR, regardless at which SNR or SNR-range it was trained. However, we have observed that this is generally not the case. Training at low SNR for instance does not allow for the discovery of structure important in higher SNR scenarios. Training across wide ranges of SNR can also severely affect training time. George and Huerta [58] have observed that starting off the training at high SNR and then gradually lowering it with each epoch led to significant performance improvements for their application. A related question is the optimal choice of loss function. In Sections III-A–III-C, we have treated communications as a classification problem for which the categorical cross-entropy is a common choice. However, for alternate output data representations, the choice is less obvious. Applying an inappropriate loss function can lead to poor results.

Choosing the right NN architecture and training parameters for SGD (such as mini-batch size and learning rate) are also important practical questions for which no satisfying hard rules exist. Some guidelines can be found in [35, Ch. 11], but methods for how to select such hyper-parameters are currently an active area of research and investigation in the DL world. Examples include architecture search guided by hyper-gradients and differential hyper-parameters [59] as well as genetic algorithm or particle swarm style optimization [60].

C. Complex-Valued Neural Networks

Owing to the widely used complex baseband representation, we typically deal with complex numbers in communications. Most related signal processing algorithms rely on phase rotations, complex conjugates, absolute values, etc. For this reason, it would be desirable to have NNs operate on complex rather than real numbers [61]. However, none of the previously described DL libraries (see Section II-B) currently support this due to several reasons. First, it is possible to represent all mathematical operations in the complex domain with a purely real-valued NN of twice the size, i.e., each complex number is simply represented by two real values. For example, an NN with a scalar complex input and output connected through a single complex weight, i.e., y = wx, where y, w, x ∈ C, can be represented as a real-valued NN y = Wx, where the vectors y, x ∈ R² contain the real and imaginary parts of y and x in each dimension and W ∈ R^{2×2} is a weight matrix. Note that the real-valued version of this NN has four parameters while the complex-valued version has only two. Second, a complication arises in complex-valued NNs since traditional loss and activation functions are generally not holomorphic so that their gradients are not defined. A solution to this problem is Wirtinger calculus [62]. Although complex-valued NNs might be simpler to train and consume less memory, we currently believe that they do not provide any significant advantage in terms of expressive power. Nevertheless, we keep them as an interesting topic for future research.
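The equivalence described above is easy to check numerically: multiplying by a complex weight w = a + jb is the same as multiplying the stacked real/imaginary vector by a 2 × 2 real matrix, as the short NumPy example below shows (the particular values of w and x are arbitrary).

import numpy as np

w = 0.8 + 0.3j                       # single complex weight (two real parameters)
x = 1.5 - 2.0j                       # scalar complex input

# Real-valued equivalent: four matrix entries, but only two degrees of freedom.
W = np.array([[w.real, -w.imag],
              [w.imag,  w.real]])
x_vec = np.array([x.real, x.imag])

y_complex = w * x
y_real = W @ x_vec
print(np.allclose(y_real, [y_complex.real, y_complex.imag]))   # True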
tions or (clipped) log-likelihood ratios. In general, it seems
that there is a representation which is most suited to solve a
D. ML-Augmented Signal Processing
particular task via an NN. Similarly, it is not obvious at which
SNR(s) DL processing blocks should be trained. It is clearly The biggest challenge of learned end-to-end communica-
desirable that a learned system operates at any SNR, regard- tions systems is that of scalability to large message sets M .
less at which SNR or SNR-range it was trained. However, Already for k = 100 bits, i.e., M = 2100 possible messages,
we have observed that this is generally not the case. Training the training complexity is prohibitive since the autoencoder
at low SNR for instance does not allow for the discovery of must see at least every message once. Also naive neural chan-
structure important in higher SNR scenarios. Training across nel decoders (as studied in [33]) suffer from this “curse of
wide ranges of SNR can also severely effect training time. dimensionality” since they need to be trained on all possi-
George and Huerta [58] have observed that starting off the ble codewords. Thus, rather than switching immediately to
training at high SNR and then gradually lowering it with each learned end-to-end communications systems or fully replac-
epoch led to significant performance improvements for their ing certain algorithms by NNs, one more gradual approach
application. A related question is the optimal choice of loss might be that of augmenting only specific sub-tasks with DL.
function. In Sections III-A–III-C, we have treated communi- A very interesting approach in this context is deep unfolding
cations as a classification problem for which the categorical of existing iterative algorithms outlined in [25]. This approach
cross-entropy is a common choice. However, for alternate out- offers the potential to leverage additional side information
put data representations, the choice is less obvious. Applying from training data to improve an existing signal processing
an inappropriate loss function can lead to poor results. algorithm. It has been recently applied in the context of chan-
Choosing the right NN architecture and the training parameters for SGD (such as the mini-batch size and learning rate) raises further important practical questions for which no satisfying hard rules exist. Some guidelines can be found in [35, Ch. 11], but methods for selecting such hyper-parameters are currently an active area of research and investigation in the DL world. Examples include architecture search guided by hyper-gradients and differential hyper-parameters [59] as well as genetic algorithm or particle swarm style optimization [60].

8 http://yann.lecun.com/exdb/mnist/
9 http://www.image-net.org/
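As a deliberately simple member of this family, the sketch below (our own placeholder example) performs a random search [60] over the learning rate and mini-batch size of a toy classifier; the model, data, and search ranges are illustrative only.

    # Hypothetical sketch: random search over SGD hyper-parameters in the spirit of [60].
    import numpy as np
    from tensorflow.keras import layers, models, optimizers

    x = np.random.randn(2000, 16).astype("float32")       # toy features
    y = (x.sum(axis=1) > 0).astype("float32")             # toy binary labels

    def build(lr):
        m = models.Sequential([
            layers.Dense(32, activation="relu", input_shape=(16,)),
            layers.Dense(1, activation="sigmoid"),
        ])
        m.compile(optimizers.SGD(learning_rate=lr), "binary_crossentropy",
                  metrics=["accuracy"])
        return m

    best = None
    for _ in range(20):                                    # 20 random trials
        lr = 10 ** np.random.uniform(-4, -1)               # log-uniform learning rate
        batch = int(np.random.choice([32, 64, 128, 256]))  # mini-batch size
        hist = build(lr).fit(x, y, epochs=5, batch_size=batch,
                             validation_split=0.2, verbose=0)
        score = hist.history["val_accuracy"][-1]
        if best is None or score > best[0]:
            best = (score, lr, batch)
    print("best val_accuracy %.3f with lr=%.4g, batch=%d" % best)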
and activation functions are generally not holomorphic so that their gradients are not defined. A solution to this problem is Wirtinger calculus [62]. Although complex-valued NNs might be simpler to train and consume less memory, we currently believe that they do not provide any significant advantage in terms of expressive power. Nevertheless, we keep them as an interesting topic for future research.
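As a small illustration of how this issue is commonly sidestepped in practice (our own example, not code from the paper's implementation), the snippet below treats complex baseband samples as pairs of real numbers and expresses a complex phase rotation as an equivalent real-valued 2x2 matrix. Since every operation stays real-valued, standard backpropagation applies without resorting to Wirtinger calculus.

    # Hypothetical illustration: complex samples handled with real-valued operations only.
    import numpy as np

    rng = np.random.default_rng(0)
    x_c = rng.normal(size=8) + 1j * rng.normal(size=8)   # complex baseband samples
    x_r = np.stack([x_c.real, x_c.imag], axis=-1)        # real-valued (I, Q) representation

    phi = 0.3                                             # phase offset in radians
    rot = np.array([[np.cos(phi), -np.sin(phi)],
                    [np.sin(phi),  np.cos(phi)]])         # real 2x2 equivalent of exp(j*phi)

    y_r = x_r @ rot.T                                     # rotation in the real representation
    y_c = x_c * np.exp(1j * phi)                          # the same rotation in complex arithmetic
    assert np.allclose(y_r, np.stack([y_c.real, y_c.imag], axis=-1))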
D. ML-Augmented Signal Processing

The biggest challenge of learned end-to-end communications systems is that of scalability to large message sets M. Already for k = 100 bits, i.e., M = 2^100 possible messages, the training complexity is prohibitive since the autoencoder must see at least every message once. Naive neural channel decoders (as studied in [33]) also suffer from this "curse of dimensionality" since they need to be trained on all possible codewords. Thus, rather than switching immediately to learned end-to-end communications systems or fully replacing certain algorithms by NNs, a more gradual approach might be that of augmenting only specific sub-tasks with DL. A very interesting approach in this context is the deep unfolding of existing iterative algorithms outlined in [25]. This approach offers the potential to leverage additional side information from training data to improve an existing signal processing algorithm. It has recently been applied in the context of channel decoding and MIMO detection [22]–[24]. For instance, in [22] it was shown that training with a single codeword is sufficient since the structure of the code is embedded in the NN through the Tanner graph. The concept of RTNs as presented in Section III-C is another way of incorporating both side information from existing models and information derived from a rich dataset into a DL algorithm to improve performance while reducing model and training complexity.
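The sketch below is a minimal, self-contained illustration of the unfolding idea rather than the decoders or detectors of [22]–[24]: a fixed number of gradient-descent steps for recovering x from y = Hx + n, with one trainable step size per unfolded iteration. The channel matrix, depth, and training loop are placeholder choices of ours.

    # Hypothetical sketch of deep unfolding: a fixed-depth gradient-descent detector
    # with a learnable step size per unfolded layer.
    import numpy as np
    import tensorflow as tf

    n_tx, n_rx, depth = 4, 8, 5
    H = tf.constant(np.random.randn(n_rx, n_tx).astype("float32"))
    steps = [tf.Variable(0.05) for _ in range(depth)]              # learnable step sizes

    def unfolded_detector(y):
        x_hat = tf.zeros_like(tf.matmul(y, H))                     # start from zero estimate
        for delta in steps:                                        # each iteration = one "layer"
            residual = y - tf.matmul(x_hat, H, transpose_b=True)
            x_hat = x_hat + delta * tf.matmul(residual, H)         # gradient step on ||y - Hx||^2
        return x_hat

    opt = tf.keras.optimizers.Adam(1e-2)
    for _ in range(500):                                           # toy training loop
        x = tf.constant(np.sign(np.random.randn(256, n_tx)).astype("float32"))  # BPSK symbols
        y = tf.matmul(x, H, transpose_b=True) + 0.1 * tf.random.normal((256, n_rx))
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(unfolded_detector(y) - x))
        opt.apply_gradients(zip(tape.gradient(loss, steps), steps))

Because the structure of the classical algorithm is fixed and only a handful of parameters are learned, such an unfolded network needs far less training data than a fully learned replacement.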
E. System Identification for End-to-End Learning

In Sections III-A–III-C, we have tacitly assumed that the transfer function of the channel is known so that the
backpropagation algorithm can compute its gradient. For example, for a Rayleigh fading channel, the autoencoder needs to know during the training phase the exact realization of the channel coefficients to compute how a slight change in the transmitted signal x impacts the received signal y. While this is straightforward for simulated systems, it poses a major challenge for end-to-end learning over real channels and hardware. In essence, the hardware and channel together form a black box whose input and output can be observed, but for which no exact analytic expression is known a priori. Constructing a model for a black box from data is called system identification [63], which is widely used in control theory. Transfer learning [64], which has worked well in other domains (e.g., computer vision), is one appealing candidate for adapting an end-to-end communications system trained on a statistical channel model to a real-world implementation. An important related question is that of how one can learn a general model for a wide range of communication scenarios and tasks that would avoid retraining from scratch for every individual setting.
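As a toy illustration of the system identification idea (our own, with a synthetic nonlinearity standing in for the real hardware and channel), the sketch below fits a small NN surrogate to recorded input/output pairs of a black-box channel. Once fitted, the surrogate is differentiable and could stand in for the analytic channel model during end-to-end training.

    # Hypothetical sketch: system identification of a black-box channel from (x, y) pairs.
    # Sizes, data, and architecture are illustrative only; x_rec/y_rec would normally be
    # measured over real hardware.
    import numpy as np
    from tensorflow.keras import layers, models

    n = 8                                  # samples per transmission (placeholder)
    x_rec = np.random.randn(10000, n).astype("float32")
    y_rec = (0.7 * x_rec + 0.1 * x_rec ** 3            # toy nonlinearity as the "unknown" box
             + 0.05 * np.random.randn(10000, n)).astype("float32")

    surrogate = models.Sequential([
        layers.Dense(64, activation="relu", input_shape=(n,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(n),                   # regress the received samples
    ])
    surrogate.compile(optimizer="adam", loss="mse")
    surrogate.fit(x_rec, y_rec, epochs=10, batch_size=256, verbose=0)

    # The fitted surrogate is differentiable, so gradients can be backpropagated
    # through it when a transmitter NN is trained in front of it.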
F. Feedback, CSI, and Privacy

Feedback is a fundamental enabler in many adaptive communications systems (e.g., adaptive coding and modulation (ACM)), while accurate channel state information (CSI) is needed for multi-user MIMO communications. For this reason, current cellular communication systems invest significant resources (energy and time) in the acquisition and feedback of CSI. In this work, we consider only learned physical layer schemes that can cope with, but do not adapt to, time-varying channels. However, [50] introduces the concept of learning an information encoder and decoder that adapt to a variable key in order to provide privacy. A similar approach could be used to provide privacy at the physical layer and to extend the channel autoencoder to support learning schemes with CSI feedback. Finally, CSI is generally only used for precoding and mode selection, rather than being further analyzed to extract information. Storing and analyzing large amounts of CSI (or other radio data), possibly enriched with location information, holds significant potential for novel big-data-driven physical-layer understanding algorithms beyond immediate radio environment needs. New applications outside the traditional scope of communications, such as tracking and identification of humans (through walls) [65] as well as gesture and emotion recognition [66], could be achieved using ML on radio signals.
V. CONCLUSION

We have discussed several promising new applications of DL to the physical layer. Most importantly, we have introduced a new way of thinking about communications as an end-to-end reconstruction optimization task using autoencoders to jointly learn transmitter and receiver implementations as well as signal encodings without any prior knowledge. Comparisons with traditional baselines in various scenarios reveal extremely competitive BLER performance, although the scalability to long block lengths remains a challenge. Apart from potential performance improvements in terms of reliability or latency, our approach can provide interesting insight into the optimal communication schemes (e.g., constellations) in scenarios where the optimal schemes are unknown (e.g., the interference channel). We believe that this is the beginning of a wide range of studies into DL and ML for communications and are excited at the possibilities this could lend towards future wireless communications systems as the field matures. For now, there are a great number of open problems to solve and practical gains to be had. We have identified important key areas of future investigation and highlighted the need for benchmark problems and data sets that can be used to compare the performance of different ML models and algorithms.


REFERENCES

[1] T. S. Rappaport, Wireless Communications: Principles and Practice, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2002.
[2] R. M. Gagliardi and S. Karp, Optical Communications, 2nd ed. New York, NY, USA: Wiley, 1995.
[3] H. Meyr, M. Moeneclaey, and S. A. Fechtel, Digital Communication Receivers: Synchronization, Channel Estimation, and Signal Processing. New York, NY, USA: Wiley, 1998.
[4] T. Schenk, RF Imperfections in High-Rate Wireless Systems: Impact and Digital Compensation. Dordrecht, The Netherlands: Springer, 2008.
[5] J. Proakis and M. Salehi, Digital Communications, 5th ed. Boston, MA, USA: McGraw-Hill Educ., 2007.
[6] Y. LeCun, "Generalization and network design strategies," in Connectionism in Perspective. Amsterdam, The Netherlands: North-Holland, 1989, pp. 143–155.
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in Proc. IEEE Int. Conf. Comput. Vis., Santiago, Chile, 2015, pp. 1026–1034.
[8] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. IEEE Int. Conf. Comput. Vis., 1999, pp. 1150–1157.
[9] Z. S. Harris, "Distributional structure," Word, vol. 10, nos. 2–3, pp. 146–162, 1954.
[10] A. Goldsmith, "Joint source/channel coding for wireless channels," in Proc. IEEE Veh. Technol. Conf., vol. 2. Chicago, IL, USA, 1995, pp. 614–618.
[11] E. Zehavi, "8-PSK trellis codes for a Rayleigh channel," IEEE Trans. Commun., vol. 40, no. 5, pp. 873–884, May 1992.
[12] H. Wymeersch, Iterative Receiver Design, vol. 234. Cambridge, U.K.: Cambridge Univ. Press, 2007.
[13] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Netw., vol. 2, no. 5, pp. 359–366, 1989.
[14] S. Reed and N. de Freitas, "Neural programmer-interpreters," arXiv preprint arXiv:1511.06279, 2015.
[15] H. T. Siegelmann and E. D. Sontag, "On the computational power of neural nets," in Proc. 5th Annu. Workshop Comput. Learn. Theory, Pittsburgh, PA, USA, 1992, pp. 440–449.
[16] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Proc. Deep Learn. Unsupervised Feature Learn. NIPS Workshop, vol. 1, 2011, p. 4.
[17] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[18] R. Raina, A. Madhavan, and A. Y. Ng, "Large-scale deep unsupervised learning using graphics processors," in Proc. Int. Conf. Mach. Learn. (ICML), Montreal, QC, Canada, 2009, pp. 873–880.
[19] M. Ibnkahla, "Applications of neural networks to digital communications—A survey," Elsevier Signal Process., vol. 80, no. 7, pp. 1185–1215, 2000.
[20] M. Bkassiny, Y. Li, and S. K. Jayaweera, "A survey on machine-learning techniques in cognitive radios," IEEE Commun. Surveys Tuts., vol. 15, no. 3, pp. 1136–1159, 3rd Quart., 2013.
[21] M. Zorzi, A. Zanella, A. Testolin, M. D. F. De Grazia, and M. Zorzi, "Cognition-based networks: A new perspective on network optimization using learning and distributed intelligence," IEEE Access, vol. 3, pp. 1512–1530, 2015.
[22] E. Nachmani, Y. Be'ery, and D. Burshtein, "Learning to decode linear codes using deep learning," in Proc. IEEE Annu. Allerton Conf. Commun. Control Comput. (Allerton), Monticello, IL, USA, 2016, pp. 341–346.
[23] E. Nachmani, E. Marciano, D. Burshtein, and Y. Be'ery, "RNN decoding of linear block codes," arXiv preprint arXiv:1702.07560, 2017.
[24] N. Samuel, T. Diskin, and A. Wiesel, "Deep MIMO detection," arXiv preprint arXiv:1706.01151, 2017.
[25] J. R. Hershey, J. L. Roux, and F. Weninger, "Deep unfolding: Model-based inspiration of novel deep architectures," arXiv preprint arXiv:1409.2574, 2014.
[26] M. Borgerding and P. Schniter, "Onsager-corrected deep learning for sparse linear inverse problems," in Proc. IEEE Glob. Conf. Signal Inf. Process. (GlobalSIP), Washington, DC, USA, 2016, pp. 227–231.
[27] Y.-S. Jeon, S.-N. Hong, and N. Lee, "Blind detection for MIMO systems with low-resolution ADCs using supervised learning," in Proc. IEEE Int. Conf. Commun. (ICC), Paris, France, 2017, pp. 1–6.
[28] N. Farsad and A. Goldsmith, "Detection algorithms for communication systems using deep learning," arXiv preprint arXiv:1705.08044, 2017.
[29] H. Sun et al., "Learning to optimize: Training deep neural networks for wireless resource management," arXiv preprint arXiv:1705.09412, 2017.
[30] T. J. O'Shea, K. Karra, and T. C. Clancy, "Learning to communicate: Channel auto-encoders, domain specific regularizers, and attention," in Proc. IEEE Int. Symp. Signal Process. Inf. Technol. (ISSPIT), Limassol, Cyprus, 2016, pp. 223–228.
[31] T. J. O'Shea, J. Corgan, and T. C. Clancy, "Convolutional radio modulation recognition networks," in Proc. Int. Conf. Eng. Appl. Neural Netw., Aberdeen, U.K., 2016, pp. 213–226.
[32] T. J. O'Shea, J. Corgan, and T. C. Clancy, "Unsupervised representation learning of structured radio communication signals," in Proc. IEEE Int. Workshop Sens. Process. Learn. Intell. Mach. (SPLINE), Aalborg, Denmark, 2016, pp. 1–5.
[33] T. Gruber, S. Cammerer, J. Hoydis, and S. T. Brink, "On deep learning-based channel decoding," in Proc. IEEE 51st Annu. Conf. Inf. Sci. Syst. (CISS), Baltimore, MD, USA, 2017, pp. 1–6.
[34] S. Cammerer, T. Gruber, J. Hoydis, and S. T. Brink, "Scaling deep learning-based decoding of polar codes via partitioning," arXiv preprint arXiv:1702.06901, 2017.
[35] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[36] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[37] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Mach. Learn. (ICML), Haifa, Israel, 2010, pp. 807–814.
[38] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, Orlando, FL, USA, 2014, pp. 675–678.
[39] T. Chen et al., "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," arXiv preprint arXiv:1512.01274, 2015.
[40] M. Abadi et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. [Online]. Available: http://tensorflow.org/
[41] R. Al-Rfou et al., "Theano: A Python framework for fast computation of mathematical expressions," arXiv preprint arXiv:1605.02688, 2016.
[42] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A MATLAB-like environment for machine learning," in Proc. BigLearn NIPS Workshop, 2011, pp. 1–6.
[43] F. Chollet. (2015). Keras. [Online]. Available: https://github.com/fchollet/keras
[44] T. O'Shea and J. Hoydis. (2017). Source Code. [Online]. Available: https://github.com/radioml/introdlphy/
[45] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[46] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, 2015, pp. 1–15.
[47] Y. N. Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Montreal, QC, Canada, 2014, pp. 2933–2941.
[48] L. V. D. Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
[49] I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, pp. 2672–2680.
[50] M. Abadi and D. G. Andersen, "Learning to protect communications with adversarial neural cryptography," arXiv preprint arXiv:1610.06918, 2016.
[51] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015, pp. 2017–2025.
[52] J. Estaran et al., "Artificial neural networks for linear and non-linear impairment mitigation in high-baudrate IM/DD systems," in Proc. 42nd Eur. Conf. Opt. Commun. (ECOC), Düsseldorf, Germany, 2016, pp. 1–3.
[53] A. K. Nandi and E. E. Azzouz, "Algorithms for automatic modulation recognition of communication signals," IEEE Trans. Commun., vol. 46, no. 4, pp. 431–436, Apr. 1998.
[54] A. Fehske, J. Gaeddert, and J. H. Reed, "A new approach to signal classification using spectral correlation and neural networks," in Proc. IEEE Int. Symp. New Front. Dyn. Spectr. Access Netw. (DYSPAN), Baltimore, MD, USA, 2005, pp. 144–150.
[55] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, 2015.
[56] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Feb. 2011.
[57] A. Abdelmutalab, K. Assaleh, and M. El-Tarhuni, "Automatic modulation classification based on high order cumulants and hierarchical polynomial classifiers," Phys. Commun., vol. 21, pp. 10–18, Dec. 2016.
[58] D. George and E. A. Huerta, "Deep neural networks to enable real-time multimessenger astrophysics," arXiv preprint arXiv:1701.00008, 2016.
[59] D. Maclaurin, D. Duvenaud, and R. P. Adams, "Gradient-based hyper-parameter optimization through reversible learning," in Proc. 32nd Int. Conf. Mach. Learn. (ICML), Lille, France, 2015, pp. 2113–2122.
[60] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," J. Mach. Learn. Res., vol. 13, pp. 281–305, Jan. 2012.
[61] A. Hirose, Complex-Valued Neural Networks. Heidelberg, Germany: Springer, 2006.
[62] M. F. Amin, M. I. Amin, A. Y. H. Al-Nuaimi, and K. Murase, "Wirtinger calculus based gradient descent and Levenberg–Marquardt learning algorithms in complex-valued neural networks," in Proc. Int. Conf. Neural Inf. Process., Shanghai, China, 2011, pp. 550–559.
[63] G. C. Goodwin and R. L. Payne, Dynamic System Identification: Experiment Design and Data Analysis. New York, NY, USA: Academic Press, 1977.
[64] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[65] F. Adib, C.-Y. Hsu, H. Mao, D. Katabi, and F. Durand, "Capturing the human figure through a wall," ACM Trans. Graph., vol. 34, no. 6, p. 219, 2015.
[66] M. Zhao, F. Adib, and D. Katabi, "Emotion recognition using wireless signals," in Proc. ACM Annu. Int. Conf. Mobile Comput. Netw., New York, NY, USA, 2016, pp. 95–108.

Timothy O'Shea (S'05–M'08–SM'13) received the M.S. degree in electrical engineering from North Carolina State University in 2007. He is currently pursuing the Ph.D. degree in electrical engineering with Virginia Tech, where he is a Research Associate. He is the Founder and CTO of DeepSig Inc., Arlington, VA, USA. He was an Engineering Researcher with the UMIACS affiliated research center of the University of Maryland. He has been a Core Contributor and a Technical Advisor to the GNU Radio Project since 2006. His research interests include the application of machine learning and deep learning to software radio and novel applications in signal processing and synthesis of radio communications and sensing systems.

Jakob Hoydis (S'08–M'12) received the Diploma (Dipl.-Ing.) degree in electrical engineering and information technology from RWTH Aachen University, Germany, in 2008 and the Ph.D. degree from Supélec, Gif-sur-Yvette, France, in 2012. He is a Technical Staff Member with Nokia Bell Labs, France, where he is investigating applications of deep learning for the physical layer. He was the Co-Founder and CTO of the social network SPRAED and worked for Alcatel-Lucent Bell Labs, Stuttgart, Germany. His research interests are in the areas of machine learning, cloud computing, SDR, large random matrix theory, information theory, and signal processing, and their applications to wireless communications. He was a recipient of the 2012 Publication Prize of the Supélec Foundation, the 2013 VDE ITG Förderpreis, the 2015 Leonard G. Abraham Prize of the IEEE COMSOC, and the WCNC'2014 Best Paper Award, and has been nominated as an Exemplary Reviewer 2012 for the IEEE COMMUNICATIONS LETTERS.
