

An Introduction to Deep Learning for the Physical Layer

Tim O'Shea, Senior Member, IEEE, and Jakob Hoydis, Member, IEEE

arXiv:1702.00832v2 [cs.IT] 11 Jul 2017

Abstract—We present and discuss several novel applications of deep learning (DL) for the physical layer. By interpreting a communications system as an autoencoder, we develop a fundamental new way to think about communications system design as an end-to-end reconstruction task that seeks to jointly optimize transmitter and receiver components in a single process. We show how this idea can be extended to networks of multiple transmitters and receivers and present the concept of radio transformer networks (RTNs) as a means to incorporate expert domain knowledge in the machine learning (ML) model. Lastly, we demonstrate the application of convolutional neural networks (CNNs) on raw IQ samples for modulation classification which achieves competitive accuracy with respect to traditional schemes relying on expert features. The paper is concluded with a discussion of open challenges and areas for future investigation.

I. INTRODUCTION

Communications is a field of rich expert knowledge about how to model channels of different types [1], [2], compensate for various hardware imperfections [3], [4], and design optimal signaling and detection schemes that ensure a reliable transfer of data [5]. As such, it is a complex and mature engineering field with many distinct areas of investigation which have all seen diminishing returns with regards to performance improvements, in particular on the physical layer. Because of this, there is a high bar of performance over which any machine learning (ML) or deep learning (DL) based approach must pass in order to provide tangible new benefits.

In domains such as computer vision and natural language processing, DL shines because it is difficult to characterize real-world images or language with rigid mathematical models. For example, while it is an almost impossible task to write a robust algorithm for detection of handwritten digits or objects in images, it is almost trivial today to implement DL algorithms that learn to accomplish this task beyond human levels of accuracy [6], [7]. In communications, on the other hand, we can design transmit signals that enable straightforward algorithms for symbol detection for a variety of channel and system models (e.g., detection of a constellation symbol in additive white Gaussian noise (AWGN)). Thus, as long as such models sufficiently capture real effects, we do not expect DL to yield significant improvements on the physical layer.

Nevertheless, we believe that the DL applications which we explore in this paper are a useful and insightful way of fundamentally rethinking the communications system design problem, and hold promise for performance improvements in complex communications scenarios that are difficult to describe with tractable mathematical models. Our main contributions are as follows:

• We demonstrate that it is possible to learn full transmitter and receiver implementations for a given channel model which are optimized for a chosen loss function (e.g., minimizing block error rate (BLER)). Interestingly, such "learned" systems can be competitive with respect to the current state-of-the-art. The key idea here is to represent transmitter, channel, and receiver as one deep neural network (NN) that can be trained as an autoencoder. The beauty of this approach is that it can even be applied to channel models and loss functions for which the optimal solutions are unknown.

• We extend this concept to an adversarial network of multiple transmitter-receiver pairs competing for capacity. This leads to the interference channel for which finding the best signaling scheme is a long-standing research problem. We demonstrate that such a setup can also be represented as an NN with multiple inputs and outputs, and that all transmitter and receiver implementations can be jointly optimized with respect to a common or individual performance metric(s).

• We introduce radio transformer networks (RTNs) as a way to integrate expert knowledge into the DL model. RTNs allow, for example, to carry out predefined correction algorithms ("transformers") at the receiver (e.g., multiplication by a complex-valued number, convolution with a vector) which may be fed with parameters learned by another NN. This NN can be integrated into the end-to-end training process of a task performed on the transformed signal (e.g., symbol detection).

• We study the use of NNs on complex-valued IQ samples for the problem of modulation classification and show that convolutional neural networks (CNNs), which are the cornerstone of most DL systems for computer vision, can outperform traditional classification techniques based on expert features. This result mirrors a relentless trend in DL for various domains, where learned features ultimately outperform and displace long-used expert features, such as the scale-invariant feature transform (SIFT) [8] and Bag-of-words [9].

The ideas presented in this paper provide a multitude of interesting avenues for future research that will be discussed in detail. We hope that these will stimulate wide interest within the research community.

Footnote: T. O'Shea is with the Bradley Department of Electrical and Computer Engineering, Virginia Tech and DeepSig, Arlington, VA, US ([email protected]). J. Hoydis is with Nokia Bell Labs, Route de Villejust, 91620 Nozay, France ([email protected]).

The rest of this article is structured as follows: Section I-A discusses potential benefits of DL for the physical layer. Section I-B presents related work. Background of deep learning is presented in Section II. In Section III, several DL applications for communications are presented. Section IV contains an overview and discussion of open problems and key areas of future investigation. Section V concludes the article.

A. Potential of DL for the physical layer

Apart from the intellectual beauty of a fully "learned" communications system, there are some reasons why DL could provide gains over existing physical layer algorithms.

First, most signal processing algorithms in communications have solid foundations in statistics and information theory and are often provably optimal for tractable mathematical models. These are generally linear, stationary, and have Gaussian statistics. A practical system, however, has many imperfections and non-linearities [4] (e.g., non-linear power amplifiers (PAs), finite resolution quantization) that can only be approximately captured by such models. For this reason, a DL-based communications system (or processing block) that does not require a mathematically tractable model and that can be optimized for a specific hardware configuration and channel might be able to better optimize for such imperfections.

Second, one of the guiding principles of communications systems design is to split the signal processing into a chain of multiple independent blocks, each executing a well-defined and isolated function (e.g., source/channel coding, modulation, channel estimation, equalization). Although this approach has led to the efficient, versatile, and controllable systems we have today, it is not clear that individually optimized processing blocks achieve the best possible end-to-end performance. For example, the separation of source and channel coding for many practical channels and short block lengths (see [10] and references therein) as well as separate coding and modulation [11] are known to be sub-optimal. Attempts to jointly optimize each of these components, e.g., based on factor graphs [12], provide gains but lead to unwieldy and computationally complex systems. A learned end-to-end communications system, on the other hand, is unlikely to have such a rigid modular structure as it is optimized for end-to-end performance.

Third, it has been shown that NNs are universal function approximators [13], and recent work has shown a remarkable capacity for algorithmic learning with recurrent NNs [14], which are known to be Turing-complete [15]. Since the execution of NNs can be highly parallelized on concurrent architectures and easily implemented with low-precision data types [16], there is evidence that "learned" algorithms taking this form could be executed faster and at lower energy cost than their manually "programmed" counterparts.

Fourth, massively parallel processing architectures with distributed memory architectures, such as graphics processing units (GPUs) but also increasingly specialized chips for NN inference (e.g., [17]), have shown to be very energy efficient and capable of impressive computational throughput when fully utilized by concurrent algorithms [18]. The performance of such architectures, however, has been largely limited by the ability of algorithms and higher-level programming languages to make efficient use of them. The inherently concurrent nature of computation and memory access across wide and deep NNs has demonstrated a surprising ability to readily achieve high resource utilization on these architectures with minimal application-specific tuning or optimization required.

B. Historical context and related work

Applications of ML in communications have a long history covering a wide range of applications. These comprise channel modeling and prediction, localization, equalization, decoding, quantization, compression, demodulation, modulation recognition, and spectrum sensing to name a few [19], [20] (and references therein). However, to the best of our knowledge and due to the reasons mentioned above, few of these applications have been commonly adopted or led to a wide commercial success. It is also interesting that essentially all of these applications focus on individual receiver processing tasks alone, while the consideration of the transmitter or a full end-to-end system is entirely missing in the literature.

The advent of open-source DL libraries (see Section II-B) and readily available specialized hardware, along with the astonishing progress of DL in computer vision, have stimulated renewed interest in the application of DL for communications and networking [21]. There are currently essentially two different main approaches of applying DL to the physical layer: the goal is to either improve/augment parts of existing algorithms with DL, or to completely replace them.

Among the papers falling into the first category are [22], [23] and [24], which consider improved belief propagation channel decoding and multiple-input multiple-output (MIMO) detection, respectively. These works are inspired by the idea of deep unfolding [25] of existing iterative algorithms by essentially interpreting each iteration as a set of NN layers. In a similar manner, [26] aims at improving the solution of sparse linear inverse problems with DL.

In the second category, papers include [27], dealing with blind detection for MIMO systems with low-resolution quantization, and [28], in which detection for molecular communications, for which no mathematical channel model exists, is studied. The idea of learning to solve complex optimization tasks for wireless resource allocation, such as power control, is investigated in [29]. Some of us have also demonstrated initial results in the area of learned end-to-end communications systems [30] as well as considered the problems of modulation recognition [31], signal compression [32], and channel decoding [33], [34] with state-of-the-art DL tools.

Notations: We use boldface upper- and lower-case letters to denote matrices and column vectors, respectively. For a vector x, x_i denotes its ith element, ||x|| its Euclidean norm, x^T its transpose, and x ⊙ y the element-wise product with y. For a matrix X, X_ij or [X]_ij denotes the (i,j)-element. R and C denote the sets of real and complex numbers, respectively. N(m, R) and CN(m, R) are the multivariate Gaussian and complex Gaussian distributions with mean vector m and covariance matrix R, respectively. Bern(α) is the Bernoulli distribution with success probability α, and ∇ is the gradient operator.

II. DEEP LEARNING BASICS

A feedforward NN (or multilayer perceptron (MLP)) with L layers describes a mapping f(r_0; θ): R^{N_0} → R^{N_L} of an input vector r_0 ∈ R^{N_0} to an output vector r_L ∈ R^{N_L} through L iterative processing steps:

    r_ℓ = f_ℓ(r_{ℓ-1}; θ_ℓ),   ℓ = 1, ..., L        (1)

where f_ℓ(r_{ℓ-1}; θ_ℓ): R^{N_{ℓ-1}} → R^{N_ℓ} is the mapping carried out by the ℓth layer. This mapping depends not only on the output vector r_{ℓ-1} from the previous layer but also on a set of parameters θ_ℓ. Moreover, the mapping can be stochastic, i.e., f_ℓ can be a function of some random variables. We use θ = {θ_1, ..., θ_L} to denote the set of all parameters of the network. The ℓth layer is called dense or fully-connected if f_ℓ(r_{ℓ-1}; θ_ℓ) has the form

    f_ℓ(r_{ℓ-1}; θ_ℓ) = σ(W_ℓ r_{ℓ-1} + b_ℓ)        (2)

where W_ℓ ∈ R^{N_ℓ × N_{ℓ-1}}, b_ℓ ∈ R^{N_ℓ}, and σ(·) is an activation function which will be defined shortly. The set of parameters for this layer is θ_ℓ = {W_ℓ, b_ℓ}. Table I lists several other layer types together with their mapping functions and parameters which are used in this manuscript.

Table I: List of layer types

    Name            f_ℓ(r_{ℓ-1}; θ_ℓ)                               θ_ℓ
    Dense           σ(W_ℓ r_{ℓ-1} + b_ℓ)                            W_ℓ, b_ℓ
    Noise           r_{ℓ-1} + n,  n ∼ N(0, β I_{N_{ℓ-1}})           none
    Dropout [36]    d ⊙ r_{ℓ-1},  d_i ∼ Bern(α)                     none
    Normalization   e.g., √(N_{ℓ-1}) r_{ℓ-1} / ||r_{ℓ-1}||_2        none

All layers with stochastic mappings generate a new random mapping each time they are called. For example, the noise layer simply adds a Gaussian noise vector with zero mean and covariance matrix β I_{N_{ℓ-1}} to the input. Thus, it generates a different output for the same input each time it is called. The activation function σ(·) in (2) introduces a non-linearity which is important for the so-called expressive power of the NN. Without this non-linearity there would be not much of an advantage of stacking multiple layers on top of each other. Generally, the activation function is applied individually to each element of its input vector, i.e., [σ(u)]_i = σ(u_i). Some commonly used activation functions are listed in Table II (Footnote 1).

Table II: List of activation functions

    Name            [σ(u)]_i                    Range
    linear          u_i                         (−∞, ∞)
    ReLU [37]       max(0, u_i)                 [0, ∞)
    tanh            tanh(u_i)                   (−1, 1)
    sigmoid         1 / (1 + e^{−u_i})          (0, 1)
    softmax         e^{u_i} / Σ_j e^{u_j}       (0, 1)

Footnote 1: The linear activation function is typically used at the output layer in the context of regression tasks, i.e., estimation of a real-valued vector.

NNs are generally trained using labeled training data, i.e., a set of input-output vector pairs (r_{0,i}, r*_{L,i}), i = 1, ..., S, where r*_{L,i} is the desired output of the neural network when r_{0,i} is used as input. The goal of the training process is to minimize the loss

    L(θ) = (1/S) Σ_{i=1}^{S} l(r*_{L,i}, r_{L,i})        (3)

with respect to the parameters in θ, where l(u, v): R^{N_L} × R^{N_L} → R is the loss function and r_{L,i} is the output of the NN when r_{0,i} is used as input. Several relevant loss functions are provided in Table III.

Table III: List of loss functions

    Name                        l(u, v)
    MSE                         ||u − v||_2^2
    Categorical cross-entropy   −Σ_j u_j log(v_j)

Different norms (e.g., L1, L2) of parameters or activations can be added to the loss function to favor solutions with small or sparse values (a form of regularization). The most popular algorithm to find good sets of parameters θ is stochastic gradient descent (SGD), which starts with some random initial values of θ = θ_0 and then updates θ iteratively as

    θ_{t+1} = θ_t − η ∇L̃(θ_t)        (4)

where η > 0 is the learning rate and L̃(θ) is an approximation of the loss function which is computed for a random mini-batch of training examples S_t ⊂ {1, 2, ..., S} of size S_t at each iteration, i.e.,

    L̃(θ) = (1/S_t) Σ_{i ∈ S_t} l(r*_{L,i}, r_{L,i}).        (5)

By choosing S_t small compared to S, the gradient computation complexity is significantly reduced while still reducing weight update variance. Note that there are many variants of the SGD algorithm which dynamically adapt the learning rate to improve convergence [35, Ch. 8.5]. The gradient in (4) can be very efficiently computed through the backpropagation algorithm [35, Ch. 6.5]. Definition and training of NNs of almost arbitrary shape can be easily done with one of the many existing DL libraries presented in Section II-B.
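To make the mappings in (1)-(5) concrete, the following minimal NumPy sketch (our illustration, not part of the original paper; all layer sizes and data are arbitrary placeholders) evaluates the forward pass of (1)-(2) for a two-layer dense network and computes the mini-batch loss (5) with the categorical cross-entropy from Table III. In practice, the gradient step (4) is obtained via backpropagation inside a DL library rather than by hand.

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def dense(r_prev, W, b, sigma):
    # Eq. (2): f_l(r_{l-1}; theta_l) = sigma(W_l r_{l-1} + b_l)
    return sigma(W @ r_prev + b)

def forward(r0, params):
    # Eq. (1): iterate r_l = f_l(r_{l-1}; theta_l), l = 1, ..., L
    r = r0
    for (W, b, sigma) in params:
        r = dense(r, W, b, sigma)
    return r

def cross_entropy(target, output):
    # Categorical cross-entropy from Table III: -sum_j u_j log(v_j)
    return -np.sum(target * np.log(output + 1e-12))

rng = np.random.default_rng(0)
N0, N1, N2 = 16, 32, 4                      # placeholder layer widths
params = [
    (rng.standard_normal((N1, N0)) * 0.1, np.zeros(N1), relu),
    (rng.standard_normal((N2, N1)) * 0.1, np.zeros(N2), softmax),
]

# A toy labeled set (r_{0,i}, r*_{L,i}), i = 1, ..., S, with one-hot targets
S = 8
inputs = rng.standard_normal((S, N0))
targets = np.eye(N2)[rng.integers(0, N2, S)]

# Eq. (5): average loss over the mini-batch; an SGD step (4) would move each
# parameter along the negative gradient of this quantity.
loss = np.mean([cross_entropy(t, forward(x, params)) for x, t in zip(inputs, targets)])
print(f"mini-batch loss = {loss:.3f}")
```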

A. Convolutional layers

Convolutional neural network (CNN) layers were introduced in [6] to provide an efficient learning method for 2D images. By tying adjacent shifts of the same weights together in a way similar to that of a filter sliding across an input vector, convolutional layers are able to force the learning of features with an invariance to shifts in the input vector. In doing so, they also greatly reduce the model complexity (as measured by the number of free parameters in the layer's weight matrix) required to represent equivalent shift-invariant features using fully connected layers, reducing SGD optimization complexity and improving generalization on appropriate datasets.

In general, a convolutional layer consists of a set of F filter weights Q^f ∈ R^{a×b}, f = 1, ..., F (F is called the depth), which each generate a so-called feature map Y^f ∈ R^{n'×m'} from an input matrix X ∈ R^{n×m} (Footnote 2) according to the following convolution:

    Y^f_{i,j} = Σ_{k=0}^{a-1} Σ_{ℓ=0}^{b-1} Q^f_{a-k, b-ℓ} X_{1+s(i-1)-k, 1+s(j-1)-ℓ}        (6)

where s ≥ 1 is an integer parameter called stride, n' = 1 + ⌊(n+a−2)/s⌋ and m' = 1 + ⌊(m+b−2)/s⌋, and it is assumed that X is padded with zeros, i.e., X_{i,j} = 0 for all i ∉ [1, n] and j ∉ [1, m]. The output dimensions can be reduced by either increasing the stride s or by adding a pooling layer. The pooling layer partitions Y into p × p regions for each of which it computes a single output value, e.g., maximum or average value, or L2-norm.

For example, taking a vectorized grayscale image input consisting of 28 × 28 pixels and connecting it to a dense layer with the same number of activations results in a single weight matrix with 784 × 784 = 614,656 free parameters. On the other hand, if we use a convolutional feature map containing six filters each sized 5 × 5 pixels, we obtain a much reduced number of free parameters of 6 · 5 · 5 = 150. For the right kind of dataset, this technique can be extremely effective. We will see an application of convolutional layers in Section III-D. For more details on CNNs, we refer to [35, Ch. 9].

Footnote 2: In image processing, X is commonly a three-dimensional tensor with the third dimension corresponding to color channels. The filter weights are also three-dimensional and work on each input channel simultaneously.
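The parameter-count comparison above can be reproduced with a few lines of Keras (a sketch for illustration only, not code from the paper; bias terms are disabled so the counts match the 614,656 and 150 figures quoted in the text):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(28, 28, 1))

# Dense connection of the flattened 28x28 image to 784 activations: 784 * 784 weights
dense_out = tf.keras.layers.Dense(784, use_bias=False)(tf.keras.layers.Flatten()(inputs))
dense_model = tf.keras.Model(inputs, dense_out)

# Convolutional feature map with six 5x5 filters: 6 * 5 * 5 weights
conv_out = tf.keras.layers.Conv2D(6, (5, 5), use_bias=False)(inputs)
conv_model = tf.keras.Model(inputs, conv_out)

print(dense_model.count_params())  # 614656
print(conv_model.count_params())   # 150
```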

B. Machine learning libraries

In recent times, numerous tools and algorithms have emerged that make it easy to build and train large NNs. Tools to deploy such training routines from high-level languages to massively parallel GPU architectures have been key enablers. Among these are Caffe [38], MXNet [39], TensorFlow [40], Theano [41], and Torch [42] (just to name a few), which allow for high-level algorithm definition in various programming languages or configuration files, automatic differentiation of training loss functions through arbitrarily large networks, and compilation of the network's forward and backward passes into hardware-optimized concurrent dense matrix algebra kernels. Keras [43] provides an additional layer of NN primitives with Theano and TensorFlow as its back-end. It has a highly customizable interface to quickly experiment with and deploy deep NNs, and has become our primary tool used to generate the numerical results for this manuscript [44].

C. Network dimensions and training

The term "deep" has become common in recent NN literature, referring to the number of sequential layers within a network (but also more generally to the methods commonly used to train such networks). Depth relates directly to the number of iterative operations performed on input data through sequential layers' transfer functions. While deep networks allow for numerous iterative transforms on the data, a minimum-latency network would likely be as shallow as possible. "Width" is used to describe the number of output activations per layer, or for all layers on average, and relates directly to the memory required by each layer.

Best practice training methods have varied over the years, from direct solution techniques over gradient descent to genetic algorithms, each having been favored or considered at one time (see [35, Ch. 1.2] for a short history of DL). Layer-by-layer pre-training [45] was also a recently popular method for scaling training to larger networks where backpropagation once struggled. However, most systems today are able to train networks which are both wide and deep directly using backpropagation and SGD methods with adaptive learning rates (e.g., Adam [46]), regularization methods to prevent overfitting (e.g., Dropout [36]), and activation functions which reduce gradient issues (e.g., ReLU [37]).

III. EXAMPLES OF MACHINE LEARNING APPLICATIONS FOR THE PHYSICAL LAYER

In this section, we will show how to represent an end-to-end communications system as an autoencoder and train it via SGD. This idea is then extended to multiple transmitters and receivers and we study as an example the two-user interference channel. We will then introduce the concept of RTNs to improve performance on fading channels, and demonstrate the application of CNNs to raw radio frequency time-series data for the task of modulation classification.

A. Autoencoders for end-to-end communications systems

Figure 1: A simple communications system consisting of a transmitter and a receiver connected through a channel

In its simplest form, a communications system consists of a transmitter, a channel, and a receiver, as shown in Fig. 1. The transmitter wants to communicate one out of M possible messages s ∈ M = {1, 2, ..., M} to the receiver making n discrete uses of the channel. To this end, it applies the transformation f: M → R^n to the message s to generate the transmitted signal x = f(s) ∈ R^n (Footnote 3). Generally, the hardware of the transmitter imposes certain constraints on x, e.g., an energy constraint ||x||_2^2 ≤ n, an amplitude constraint |x_i| ≤ 1 ∀i, or an average power constraint E[|x_i|^2] ≤ 1 ∀i. The communication rate of this communications system is R = k/n [bit/channel use], where k = log2(M). In the sequel, the notation (n,k) means that a communications system sends one out of M = 2^k messages (i.e., k bits) through n channel uses. The channel is described by the conditional probability density function p(y|x), where y ∈ R^n denotes the received signal. Upon reception of y, the receiver applies the transformation g: R^n → M to produce the estimate ŝ of the transmitted message s.

From a DL point of view, this simple communications system can be seen as a particular type of autoencoder [35, Ch. 14]. Typically, the goal of an autoencoder is to find a low-dimensional representation of its input at some intermediate layer which allows reconstruction at the output with minimal error. In this way, the autoencoder learns to non-linearly compress and reconstruct the input.

Footnote 3: We focus here on real-valued signals only. Extensions to complex-valued signals are discussed in Section IV. Alternatively, one can consider a mapping to R^{2n}, which can be interpreted as a concatenation of the real and imaginary parts of x. This approach is adopted in Sections III-B, III-C, and III-D.

Figure 2: A communications system over an AWGN channel represented as an autoencoder. The input s is encoded as a one-hot vector, the output is a probability distribution over all possible messages from which the most likely is picked as output ŝ.

In our case, the purpose of the autoencoder is different. It seeks to learn representations x of the messages s that are robust with respect to the channel impairments mapping x to y (i.e., noise, fading, distortion, etc.), so that the transmitted message can be recovered with small probability of error. In other words, while most autoencoders remove redundancy from input data for compression, this autoencoder (the "channel autoencoder") often adds redundancy, learning an intermediate representation robust to channel perturbations.

An example of such an autoencoder is shown in Fig. 2. Here, the transmitter consists of a feedforward NN with multiple dense layers followed by a normalization layer that ensures that physical constraints on x are met. Note that the input s to the transmitter is encoded as a one-hot vector 1_s ∈ R^M, i.e., an M-dimensional vector, the sth element of which is equal to one and zero otherwise. The channel is represented by an additive noise layer with a fixed variance β = (2RE_b/N_0)^{-1}, where E_b/N_0 denotes the energy per bit (E_b) to noise power spectral density (N_0) ratio. The receiver is also implemented as a feedforward NN. Its last layer uses a softmax activation whose output p ∈ (0,1)^M is a probability vector over all possible messages. The decoded message ŝ then corresponds to the index of the element of p with the highest probability. The autoencoder can then be trained end-to-end using SGD on the set of all possible messages s ∈ M using the well-suited categorical cross-entropy loss function between 1_s and p (Footnote 4).

Footnote 4: A more memory-efficient approach to implement this architecture is by replacing the one-hot encoded input and the first dense layer by an embedding that turns message indices into vectors. The loss function can then be replaced by the sparse categorical cross-entropy that accepts message indices rather than one-hot vectors as labels. This was done in our experiments [44].

Fig. 3a compares the block error rate (BLER), i.e., Pr(ŝ ≠ s), of a communications system employing binary phase-shift keying (BPSK) modulation and a Hamming (7,4) code with either binary hard-decision decoding or maximum likelihood decoding (MLD) against the BLER achieved by the trained autoencoder (7,4) (with fixed energy constraint ||x||_2^2 = n). Both systems operate at rate R = 4/7. For comparison, we also provide the BLER of uncoded BPSK (4,4). This result shows that the autoencoder has learned without any prior knowledge an encoder and decoder function that together achieve the same performance as the Hamming (7,4) code with MLD. The layout of the autoencoder is provided in Table IV. Although a single layer can represent the same mapping from message index to corresponding transmit vector, our experiments have shown that SGD converges to a better global solution using two transmit layers instead of one. This increased dimension parameter search space may actually help to reduce the likelihood of convergence to sub-optimal minima by making such solutions more likely to emerge as saddle points during optimization [47]. Training was done at a fixed value of E_b/N_0 = 7 dB (cf. Section IV-B) using Adam [46] with learning rate 0.001. We have observed that increasing the batch size during training helps to improve accuracy. For all other implementation details, we refer to the source code [44].

Fig. 3b shows a similar comparison but for an (8,8) and (2,2) communications system, i.e., R = 1. Surprisingly, while the autoencoder achieves the same BLER as uncoded BPSK for (2,2), it outperforms the latter for (8,8) over the full range of E_b/N_0. This implies that it has learned some joint coding and modulation scheme, such that a coding gain is achieved. For a truly fair comparison, this result should be compared to a higher-order modulation scheme using a channel code (or the optimal sphere packing in eight dimensions). A detailed performance comparison for various channel types and parameters (n, k) with different baselines is out of the scope of this paper and left to future investigations.
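As a small illustration of how the channel layer is parameterized (a sketch under the conventions stated above, not code from the released implementation [44]), the fixed noise variance β of the additive noise layer follows directly from the rate R = k/n and the chosen E_b/N_0 value:

```python
# Noise variance of the AWGN "channel layer", beta = (2 * R * Eb/N0)^-1,
# as used for the autoencoder of Fig. 2.
def noise_variance(ebno_db: float, k: int, n: int) -> float:
    ebno = 10.0 ** (ebno_db / 10.0)   # Eb/N0 in dB -> linear scale
    rate = k / n                      # R = k/n [bit/channel use]
    return 1.0 / (2.0 * rate * ebno)

# Example: the (7,4) autoencoder trained at Eb/N0 = 7 dB
print(noise_variance(7.0, k=4, n=7))  # approximately 0.175
```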

Figure 3: BLER versus E_b/N_0 for the autoencoder and several baseline communication schemes. Panel (a) compares the autoencoder (7,4) with uncoded BPSK (4,4) and the Hamming (7,4) code with hard-decision decoding and with MLD; panel (b) compares the autoencoder (2,2) and (8,8) with uncoded BPSK (2,2) and (8,8).

Table IV: Layout of the autoencoder used in Figs. 3a and 3b. It has (2M + 1)(M + n) + 2M trainable parameters, resulting in 62, 791, and 135,944 parameters for the (2,2), (7,4), and (8,8) autoencoder, respectively.

    Layer             Output dimensions
    Input             M
    Dense + ReLU      M
    Dense + linear    n
    Normalization     n
    Noise             n
    Dense + ReLU      M
    Dense + softmax   M

Fig. 4 shows the learned representations x of all messages for different values of (n, k) as complex constellation points, i.e., the x- and y-axes correspond to the first and second transmitted symbols, respectively. In Fig. 4d for (7,4), we depict the seven-dimensional message representations using a two-dimensional t-distributed stochastic neighbor embedding (t-SNE) [48] of the noisy observations y instead. Fig. 4a shows the simple (2,2) system which converges rapidly to a classical quadrature phase shift keying (QPSK) constellation with some arbitrary rotation. Similarly, Fig. 4b shows a (2,4) system which leads to a rotated 16-PSK constellation. The impact of the chosen normalization becomes clear from Fig. 4c for the same parameters but with an average power normalization instead of a fixed energy constraint (that forces the symbols to lie on the unit circle). This results in an interesting mixed pentagonal/hexagonal grid arrangement (with indistinguishable BLER performance from 16-QAM). The constellation has a symbol in the origin surrounded by five equally spaced nearest neighbors, each of which has six almost equally spaced neighbors. Fig. 4d for (7,4) shows that the t-SNE embedding of y leads to a similarly shaped arrangement of clusters.

Figure 4: Constellations produced by autoencoders using parameters (n, k): (a) (2, 2), (b) (2, 4), (c) (2, 4) with average power constraint, (d) (7, 4) 2-dimensional t-SNE embedding of received symbols.

The examples of this section treat the communication task as a classification problem and the representation of s by an M-dimensional vector becomes quickly impractical for large M. To circumvent this problem, it is possible to use more compact representations of s such as a binary vector with log2(M) dimensions. In this case, output activation functions such as sigmoid and loss functions such as MSE or binary cross-entropy are more appropriate. Nevertheless, scaling such an architecture to very large values of M remains challenging due to the size of the training set and model. We recall that a very important property of the autoencoder is also that it can learn to communicate over any channel, even for which no information-theoretically optimal scheme is known.
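For concreteness, the layout of Table IV can be expressed in a few lines of Keras. The sketch below is our own illustration following the description around Fig. 2 and Table IV; variable names, the exact normalization, and the training loop are assumptions rather than the authors' released code [44].

```python
import numpy as np
import tensorflow as tf

k, n = 4, 7                                       # (n,k) = (7,4): 2^k messages, n channel uses
M = 2 ** k
R = k / n
ebno_db = 7.0                                     # training Eb/N0 (cf. Section IV-B)
beta = 1.0 / (2.0 * R * 10 ** (ebno_db / 10.0))   # noise variance of the channel layer

# Transmitter: one-hot input -> dense layers -> energy normalization (||x||^2 = n)
inputs = tf.keras.Input(shape=(M,))
h = tf.keras.layers.Dense(M, activation="relu")(inputs)
h = tf.keras.layers.Dense(n, activation="linear")(h)
x = tf.keras.layers.Lambda(lambda v: np.sqrt(n) * tf.math.l2_normalize(v, axis=1))(h)

# Channel: additive Gaussian noise with fixed variance beta (kept active at all times)
y = tf.keras.layers.GaussianNoise(stddev=float(np.sqrt(beta)))(x, training=True)

# Receiver: dense layers -> softmax probability vector over all M messages
h = tf.keras.layers.Dense(M, activation="relu")(y)
outputs = tf.keras.layers.Dense(M, activation="softmax")(h)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                    loss="categorical_crossentropy")

# Train on (copies of) the set of all possible one-hot messages
msgs = np.tile(np.eye(M, dtype="float32"), (1000, 1))
autoencoder.fit(msgs, msgs, epochs=10, batch_size=256)
```

Evaluating the BLER then amounts to counting argmax errors of the softmax output over a range of E_b/N_0 values.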

B. Autoencoders for multiple transmitters and receivers

Figure 5: The two-user interference channel seen as a combination of two interfering autoencoders that try to reconstruct their respective messages

The autoencoder concept from Section III-A can be readily extended to multiple transmitters and receivers that share a common channel. As an example, we consider here the two-user AWGN interference channel as shown in Fig. 5. Transmitter 1 wants to communicate message s_1 ∈ M to Receiver 1 while Transmitter 2 wants to communicate message s_2 ∈ M to Receiver 2 (Footnote 5). Both transmitter-receiver pairs are implemented as NNs and the only difference with respect to the autoencoder from the last section is that the transmitted messages x_1, x_2 ∈ C^n now interfere at the receivers, resulting in the noisy observations

    y_1 = x_1 + x_2 + n_1        (7)
    y_2 = x_2 + x_1 + n_2        (8)

where n_1, n_2 ∼ CN(0, β I_n) is Gaussian noise. For simplicity, we have adopted the complex-valued notation, rather than considering real-valued vectors of size 2n. That is, the notation (n, k) means that each of the 2^k messages is transmitted over n complex-valued channel uses.

Footnote 5: Extensions to K users with possibly different rates, i.e., s_k ∈ M_k ∀k, as well as to other channel types are straightforward.

Denote by

    l_1 = −log([ŝ_1]_{s_1}),    l_2 = −log([ŝ_2]_{s_2})        (9)

the individual cross-entropy loss functions of the first and second transmitter-receiver pair, respectively, and by L̃_1(θ_t), L̃_2(θ_t) the associated losses for mini-batch t (cf. (5)). In such a context, it is less clear how one should train two coupled autoencoders with conflicting goals. One approach consists of minimizing a weighted sum of both losses, i.e., L̃ = α L̃_1 + (1 − α) L̃_2 for some α ∈ [0, 1]. If one would minimize L̃_1 alone (i.e., α = 1), Transmitter 2 would learn to transmit a constant signal independent of s_2 that Receiver 1 could simply subtract from y_1. The opposite is true for α = 0. However, giving equal weight to both losses (i.e., α = 0.5) does not necessarily result in equal performance. We have observed in our experiments that it generally leads to highly unfair and suboptimal solutions. For this reason, we have adopted dynamic weights α_t for each mini-batch t:

    α_{t+1} = L̃_1(θ_t) / (L̃_1(θ_t) + L̃_2(θ_t)),    t > 0        (10)

where α_0 = 0.5. Thus, the smaller L̃_1(θ_t) is compared to L̃_2(θ_t), the smaller is its weight α_{t+1} for the next mini-batch. There are many other possibilities to train such a system and we do not claim any optimality of our approach. However, it has led in our experiments to the desired result of identical BLERs for both transmitter-receiver pairs.
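A sketch of this dynamic weighting is given below (the update rule follows (10); the losses themselves are random placeholders standing in for the two mini-batch cross-entropy losses):

```python
import numpy as np

def next_weight(l1: float, l2: float) -> float:
    # Eq. (10): alpha_{t+1} = L1(theta_t) / (L1(theta_t) + L2(theta_t))
    return l1 / (l1 + l2)

rng = np.random.default_rng(0)
alpha = 0.5                                    # alpha_0 = 0.5
for t in range(5):
    l1 = float(rng.uniform(0.1, 1.0))          # placeholder for L1(theta_t)
    l2 = float(rng.uniform(0.1, 1.0))          # placeholder for L2(theta_t)
    total = alpha * l1 + (1.0 - alpha) * l2    # weighted sum minimized on this mini-batch
    alpha = next_weight(l1, l2)                # the pair with the larger loss gets more weight
    print(f"t={t}  L1={l1:.2f}  L2={l2:.2f}  weighted loss={total:.2f}  next alpha={alpha:.2f}")
```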
Figure 6: BLER versus E_b/N_0 for the two-user interference channel achieved by the autoencoder (AE) and 2^{2k/n}-QAM time-sharing (TS) for different parameters (n, k)

Fig. 6 shows the BLER of one of the autoencoders (denoted by AE) as a function of E_b/N_0 for the sets of parameters (n, k) = {(1,1), (2,2), (4,4), (4,8)}. The NN layout for both autoencoders is that provided in Table IV, with n replaced by 2n. We have used an average power constraint to be competitive with higher-order modulation schemes (cf. Fig. 4c). As a baseline for comparison, we provide the BLER of uncoded 2^{2k/n}-QAM that has the same rate when used together with time-sharing (TS) between both transmitters (Footnote 6). While the autoencoder and time-sharing have identical BLERs for (1,1) and (2,2), the former achieves substantial gains of around 0.7 dB for (4,4) and 1 dB for (4,8) at a BLER of 10^{-3}. The reasons for this are similar to those explained in Section III-A.

Footnote 6: For (1,1), (2,2), and (4,4), each transmitter sends a 4-QAM (i.e., QPSK) symbol on every other channel use. For (4,8), 16-QAM is used instead.

It is interesting to have a look at the learned message representations which are shown in Fig. 7. For (1,1), the transmitters have learned to use binary phase shift keying (BPSK)-like constellations in orthogonal directions (with an arbitrary rotation around the origin). This achieves the same performance as QPSK with time-sharing. However, for (2,2), the learned constellations are not orthogonal anymore and can be interpreted as some form of super-position coding. For the first symbol, Transmitter 1 uses high power and Transmitter 2 low power. For the second symbol, the roles are changed. For (4,4) and (4,8), the constellations are more difficult to interpret, but we can see that the constellations of both transmitters resemble ellipses with orthogonal major axes and varying focal distances. This effect is more visible for (4,8) than for (4,4) because of the increased number of constellation points. An in-depth study of learned constellations and how they are impacted by the chosen normalization and NN weight initializations is out of the scope of this paper but a very interesting topic of future investigations.

We would like to point out that one can easily consider other types of multi-transmitter/receiver communications systems with this approach. These comprise the general multiple access channel (MAC) and broadcast channel (BC), as well as systems with jammers and eavesdroppers. As soon as some of the transmitters and receivers are non-cooperative, adversarial training strategies could be adopted (see [49], [50]).

Figure 7: Learned constellations for the two-user interference channel with parameters (a) (1, 1), (b) (2, 2), (c) (4, 4), and (d) (4, 8). The constellation points of Transmitter 1 and 2 are represented by red dots and black crosses, respectively.

C. Radio transformer networks for augmented signal processing algorithms

Many of the physical phenomena undergone in a communications channel and in transceiver hardware can be inverted using compact parametric models/transformations. Widely used transformations include re-sampling to estimated symbol/clock timing, mixing with an estimated carrier tone, and convolving with an inverse channel impulse response. The estimation processes for parameters to seed these transformations (e.g., frequency offset, symbol timing, impulse response) are often very involved and specialized based on signal-specific properties and/or information from pilot tones (see, e.g., [3]).

One way of augmenting DL models with expert propagation domain knowledge but not signal-specific assumptions is through the use of an RTN as shown in Fig. 8. An RTN consists of three parts: (i) a learned parameter estimator g_ω: R^n → R^p which computes a parameter vector ω ∈ R^p from its input y, (ii) a parametric transform t: R^n × R^p → R^{n'} that applies a deterministic (and differentiable) function to y which is parametrized by ω and suited to the propagation phenomena, and (iii) a learned discriminative network g: R^{n'} → M which produces the estimate ŝ of the transmitted message (or other label information) from the canonicalized input ȳ ∈ R^{n'}.

Figure 8: A radio receiver represented as an RTN. The input y first runs through a parameter estimation network g_ω(y), has a known transform t(y, ω) applied to generate the canonicalized signal ȳ, and then is classified in the discriminative network g(ȳ) to produce the output ŝ.

By allowing the parameter estimator g_ω to take the form of an NN, we can train the system end-to-end to optimize for a given loss function. Importantly, the training process of such an RTN does not seek to directly improve the parameter estimation itself but rather optimizes the way the parameters are estimated to obtain the best end-to-end performance (e.g., BLER). While the example above describes an RTN for receiver-side processing, it can similarly be used wherever parametric transformations seeded by estimated parameters are needed. RTNs are a form of learned feed-forward attention inspired by Spatial Transformer Networks (STNs) [51] which have worked well for computer vision problems.

The basic functioning of an RTN is best understood from a simple example, such as the problem of phase offset estimation and compensation. Let y_c = e^{jφ} ỹ_c ∈ C^n be a vector of IQ samples that have undergone a phase rotation by the phase offset φ, and let y = [ℜ{y_c}^T, ℑ{y_c}^T]^T ∈ R^{2n}. The goal of g_ω is to estimate a scalar φ̂ = ω = g_ω(y) that is close to the phase offset φ, which is then used by the parametric transform t to compute ȳ_c = e^{−jφ̂} y_c. The canonicalized signal ȳ = [ℜ{ȳ_c}^T, ℑ{ȳ_c}^T]^T is thus given by

    ȳ = t(φ̂, y) = [ cos(φ̂) ℜ{y_c} + sin(φ̂) ℑ{y_c} ;  cos(φ̂) ℑ{y_c} − sin(φ̂) ℜ{y_c} ]        (11)

and then fed into the discriminative network g for further processing, such as classification.

A compelling example demonstrating the advantages of RTNs is shown in Fig. 9, which compares the BLER of an autoencoder (8,4) (Footnote 7) with and without RTN over a multipath fading channel with L = 3 channel taps. That is, the received signal y = [ℜ{y_c}^T, ℑ{y_c}^T]^T ∈ R^{2n} is given as

    y_{c,i} = Σ_{ℓ=1}^{L} h_{c,ℓ} x_{c,i−ℓ+1} + n_{c,i}        (12)

where h_c ∼ CN(0, L^{−1} I_L) are i.i.d. Rayleigh fading channel taps, n_c ∼ CN(0, (RE_b/N_0)^{−1} I_n) is receiver noise, and x_c ∈ C^n is the transmitted signal, where we assume in (12) that x_{c,i} = 0 for i ≤ 0. Here, the goal of the parameter estimator is to predict a complex-valued vector ω_c (represented by 2L real values) that is used in the transformation layer to compute the complex convolution of y_c with ω_c. Thus, the RTN tries to equalize the channel output through inverse filtering in order to simplify the task of the discriminative network. We have implemented the estimator as an NN with two dense layers with tanh activations followed by a dense output layer with linear activations.

Footnote 7: We assume complex-valued channel uses, so that transmitter and receiver have 2n real-valued inputs and outputs.
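The following TensorFlow sketch (our illustration, not the authors' code; layer sizes and shapes are assumptions) shows how the differentiable phase-correction transform of (11) can sit between a small estimator network and a discriminative network, so that gradients flow through all three parts during end-to-end training:

```python
import tensorflow as tf

def phase_transform(phi_hat, y):
    # Eq. (11): rotate the IQ samples in y = [Re; Im] by the estimated offset phi_hat.
    n = y.shape[1] // 2
    re, im = y[:, :n], y[:, n:]
    cos, sin = tf.cos(phi_hat), tf.sin(phi_hat)        # phi_hat has shape [batch, 1]
    return tf.concat([cos * re + sin * im,             # Re of e^{-j phi_hat} y_c
                      cos * im - sin * re], axis=1)    # Im of e^{-j phi_hat} y_c

n, M = 8, 16                                           # assumed block length and message set size

# (i) learned parameter estimator g_omega: 2n real samples -> scalar phase estimate
estimator = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="tanh"),
    tf.keras.layers.Dense(32, activation="tanh"),
    tf.keras.layers.Dense(1, activation="linear"),
])

# (iii) learned discriminative network g: canonicalized signal -> message probabilities
discriminator = tf.keras.Sequential([
    tf.keras.layers.Dense(2 * M, activation="relu"),
    tf.keras.layers.Dense(M, activation="softmax"),
])

# (ii) the parametric transform connects the two in a fully differentiable graph
y_in = tf.keras.Input(shape=(2 * n,))
phi_hat = estimator(y_in)
y_bar = tf.keras.layers.Lambda(lambda args: phase_transform(*args))([phi_hat, y_in])
s_hat = discriminator(y_bar)
rtn_receiver = tf.keras.Model(y_in, s_hat)
```

Because the transform is deterministic and differentiable, training the composite model end-to-end optimizes the phase estimate only insofar as it improves the final classification loss, exactly as described above.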

Figure 9: BLER versus E_b/N_0 for various communication schemes over a channel with L = 3 Rayleigh fading taps

Figure 10: Autoencoder training loss with and without RTN

While the plain autoencoder struggles to meet the performance of differential BPSK (DBPSK) with maximum likelihood sequence estimation (MLE) and a Hamming (7,4) code, the autoencoder with RTN outperforms it. Another advantage of RTNs is faster training convergence, which can be seen from Fig. 10 that compares the validation loss of the autoencoder with and without RTN as a function of the training epochs. We have observed in our experiments that the autoencoder with RTN consistently outperforms the plain autoencoder, independently of the chosen hyper-parameters. However, the performance differences diminish when the encoder and decoder networks are made wider and trained for more iterations. Although there is theoretically nothing an RTN-augmented NN can do that a plain NN cannot, the RTN helps by incorporating domain knowledge to simplify the target manifold, similar to the role of convolutional layers in imparting translation invariance where appropriate. This leads to a simpler search space and improved generalization.

The autoencoder and RTN as presented above can be easily extended to operate directly on IQ samples rather than symbols to effectively deal with problems such as pulse shaping, timing-, frequency- and phase-offset compensation. This is an exciting and promising area of research that we leave to future investigations. Interesting applications of this approach could also arise in optical communications dealing with highly non-linear channel impairments that are notoriously difficult to model and compensate for [52].

D. CNNs for classification tasks

Many signal processing functions within the physical layer can be learned as either regression or classification tasks. Here we look at the well-known problem of modulation classification of single-carrier modulation schemes based on sampled radio frequency time-series data, i.e., IQ samples. This task has been accomplished for years through the approach of expert feature engineering and either analytic decision trees (single trees are widely used in practice) or trained discrimination methods operating on a compact feature space, such as support vector machines, random forests, or small feedforward NNs [53]. Some recent methods take a step beyond this using pattern recognition on expert feature maps, such as the spectral coherence function or α-profile, combined with NN-based classification [54]. However, approaches to this point have not sought to use feature learning on raw time-series data in the radio domain. This is however now the norm in computer vision, which motivates our approach here.

As is widely done for image classification, we leverage a series of narrowing convolutional layers followed by dense/fully-connected layers and terminated with a dense softmax layer for our classifier (similar to a VGG architecture [55]). The layout is provided in Table V and we refer to the source code [44] for further implementation details.

Table V: Layout of the CNN for modulation classification with 324,330 trainable parameters

    Layer                                          Output dimensions
    Input                                          2 × 128
    Convolution (128 filters, size 2 × 8) + ReLU   128 × 121
    Max Pooling (size 2, strides 2)                128 × 60
    Convolution (64 filters, size 1 × 16) + ReLU   64 × 45
    Max Pooling (size 2, strides 2)                64 × 22
    Flatten                                        1408
    Dense + ReLU                                   128
    Dense + ReLU                                   64
    Dense + ReLU                                   32
    Dense + softmax                                10
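A Keras sketch of the Table V layout is given below (our reconstruction for illustration; the pooling orientation along the time axis is an assumption chosen so that the listed output dimensions and the 324,330-parameter count are reproduced, and it is not necessarily identical to the released implementation [44]):

```python
import tensorflow as tf

# Input: 2 x 128 real-valued IQ samples (I and Q as two "rows"), one channel.
inputs = tf.keras.Input(shape=(2, 128, 1))
x = tf.keras.layers.Conv2D(128, (2, 8), activation="relu")(inputs)      # -> (1, 121, 128)
x = tf.keras.layers.MaxPooling2D(pool_size=(1, 2), strides=(1, 2))(x)   # -> (1, 60, 128)
x = tf.keras.layers.Conv2D(64, (1, 16), activation="relu")(x)           # -> (1, 45, 64)
x = tf.keras.layers.MaxPooling2D(pool_size=(1, 2), strides=(1, 2))(x)   # -> (1, 22, 64)
x = tf.keras.layers.Flatten()(x)                                        # -> 1408
x = tf.keras.layers.Dense(128, activation="relu")(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dense(32, activation="relu")(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)            # ten modulation classes

cnn = tf.keras.Model(inputs, outputs)
cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
print(cnn.count_params())  # 324330, matching Table V
```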

The dataset (Footnote 8) for this benchmark consists of 1.2M sequences of 128 complex-valued baseband IQ samples corresponding to ten different digital and analog single-carrier modulation schemes (AM, FM, PSK, QAM, etc.) that have gone through a wireless channel with harsh realistic effects including multipath fading, sample rate and center frequency offset [31]. The samples are taken at 20 different signal-to-noise ratios (SNRs) within the range from −20 dB to 18 dB.

Footnote 8: RML2016.10b: https://radioml.com/datasets/radioml-2016-10-dataset/

In Fig. 11, we compare the classification accuracy of the CNN against that of extreme gradient boosting (Footnote 9) with 1000 estimators, as well as a single scikit-learn tree [56], operating on a mix of 16 analog and cumulant expert features as proposed in [53] and [57]. The short-time nature of the examples places this task on the difficult end of the modulation classification spectrum since we cannot compute expert features with high stability over long periods of time. We can see that the CNN outperforms the boosted feature-based classifier by around 4 dB in the low to medium SNR range while the performance at high SNR is almost identical. Performance in the single tree case is about 6 dB worse than the CNN at medium SNR and 3.5% worse at high SNR. Fig. 12 shows the confusion matrix for the CNN at SNR = 10 dB, revealing that the confusing cases for the CNN are between QAM16 and QAM64 and between the analog modulations wideband FM (WBFM) and double-sideband AM (AM-DSB), even at high SNR. The confusion between AM and FM arises during times when the underlying voice signal is idle or does not carry much information. The distinction between QAM16 and QAM64 is very hard with a short-time observation over only a few symbols which share constellation points. The accuracy of the feature-based classifier saturates at high SNR for the same reasons. In [58], the authors report on a successful application of a similar CNN for the detection of black hole mergers in astrophysics from noisy time-series data.

Footnote 9: At the time of writing of this document, XGB (http://xgboost.readthedocs.io/) was, together with CNNs, the ML model that consistently won competitions on the data-science platform Kaggle (https://www.kaggle.com/).

Figure 11: Classifier performance (correct classification probability) versus SNR for the CNN, boosted tree, single tree, and random guessing

Figure 12: Confusion matrix of the CNN (SNR = 10 dB) over the ten modulation classes (8PSK, AM-DSB, BPSK, CPFSK, GFSK, PAM4, QAM16, QAM64, QPSK, WBFM)

IV. DISCUSSION AND OPEN RESEARCH CHALLENGES

A. Data sets and challenges

In order to compare the performance of ML models and algorithms, it is crucial to have common benchmarks and open datasets. While this is the rule in the computer vision, voice recognition, and natural language processing domains (e.g., MNIST (Footnote 10) or ImageNet (Footnote 11)), nothing comparable exists for communications. This domain is somewhat different because it deals with inherently man-made signals that can be accurately generated synthetically, allowing the possibility of standardizing data generation routines rather than just data in some cases. It would be also desirable to establish a set of common problems and the corresponding datasets (or data-generating software) on which researchers can benchmark and compare their algorithms. One such example task is modulation classification in Section III-D; others could include mapping of impaired received IQ samples or symbols to codewords or bits. Even "autoencoder competitions" could be held for a standardized set of benchmark impairments, taking the form of canonical "impairment layers" that would need to be made available for some of the major DL libraries (see Section II-B).

Footnote 10: http://yann.lecun.com/exdb/mnist/
Footnote 11: http://www.image-net.org/

B. Data representation, loss functions, and training SNR

As DL for communications is a new field, little is known about optimal data representations, loss functions, and training strategies. For example, binary signals can be represented as binary or one-hot vectors, modulated (complex) symbols, or integers, and the optimal representation might depend, among other factors, on the NN architecture, learning objective, and loss function. In decoding problems, for instance, one would have the choice between plain channel observations or (clipped) log-likelihood ratios. In general, it seems that there is a representation which is most suited to solve a particular task via an NN. Similarly, it is not obvious at which SNR(s) DL processing blocks should be trained. It is clearly desirable that a learned system operates at any SNR, regardless at which SNR or SNR-range it was trained. However, we have observed that this is generally not the case. Training at low SNR, for instance, does not allow for the discovery of structure important in higher SNR scenarios. Training across wide ranges of SNR can also severely affect training time. The authors of [58] have observed that starting off the training at high SNR and then gradually lowering it with each epoch led to significant performance improvements for their application.

A related question is the optimal choice of loss function. In Sections III-A to III-C, we have treated communications as a classification problem for which the categorical cross-entropy is a common choice. However, for alternate output data representations, the choice is less obvious. Applying an inappropriate loss function can lead to poor results.

Choosing the right NN architecture and training parameters for SGD (such as mini-batch size and learning rate) are also important practical questions for which no satisfying hard rules exist. Some guidelines can be found in [35, Ch. 11], but methods for how to select such hyper-parameters are currently an active area of research and investigation in the DL world. Examples include architecture search guided by hyper-gradients and differential hyper-parameters [59] as well as genetic algorithm or particle swarm style optimization [60].

C. Complex-valued neural networks

Owing to the widely used complex baseband representation, we typically deal with complex numbers in communications. Most related signal processing algorithms rely on phase rotations, complex conjugates, absolute values, etc. For this reason, it would be desirable to have NNs operate on complex rather than real numbers [61]. However, none of the previously described DL libraries (see Section II-B) currently support this, due to several reasons. First, it is possible to represent all mathematical operations in the complex domain with a purely real-valued NN of twice the size, i.e., each complex number is simply represented by two real values. For example, an NN with a scalar complex input and output connected through a single complex weight, i.e., y = wx, where y, w, x ∈ C, can be represented as a real-valued NN y = Wx, where the vectors y, x ∈ R^2 contain the real and imaginary parts of y and x in each dimension and W ∈ R^{2×2} is a weight matrix. Note that the real-valued version of this NN has four parameters while the complex-valued version has only two. Second, a complication arises in complex-valued NNs since traditional loss and activation functions are generally not holomorphic so that their gradients are not defined. A solution to this problem is the Wirtinger calculus [62]. Although complex-valued NNs might be easier to train and consume less memory, we currently believe that they do not provide any significant advantage in terms of expressive power. Nevertheless, we keep them as an interesting topic for future research.
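As a worked illustration of this equivalence (our addition, spelling out the y = wx example above): writing w = a + jb, the complex multiplication corresponds to a structured real 2 × 2 matrix,

    y = wx,  w = a + jb   ⇔   [ ℜ{y} ; ℑ{y} ] = [ a  −b ; b  a ] [ ℜ{x} ; ℑ{x} ],

so the structured matrix has only the two free parameters a and b, whereas an unconstrained real-valued weight matrix W ∈ R^{2×2} has four. A real-valued NN of twice the size can therefore represent every complex-linear operation, but a general real-valued weight matrix does not correspond to a single complex weight.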
rules exist. Some guidelines can be found in [35, Ch. 11], side information from existing models along with information
but methods for how to select such hyper-parameters are derived from a rich dataset into a DL algorithm to improve
currently an active area of research and investigation in the performance while reducing model and training complexity.
DL world. Examples include architecture search guided by
hyper-gradients and differential hyper-parameters [59] as well
E. System identification for end-to-end learning
as genetic algorithm or particle swarm style optimization [60].
C. Complex-valued neural networks

Owing to the widely used complex baseband representation, we typically deal with complex numbers in communications. Most related signal processing algorithms rely on phase rotations, complex conjugates, absolute values, etc. For this reason, it would be desirable to have NNs operate on complex rather than real numbers [61]. However, none of the previously described DL libraries (see Section II-B) currently support this, for several reasons. First, it is possible to represent all mathematical operations in the complex domain with a purely real-valued NN of twice the size, i.e., each complex number is simply represented by two real values. For example, an NN with a scalar complex input and output connected through a single complex weight, i.e., y = wx, where y, w, x ∈ C, can be represented as a real-valued NN y = Wx, where the vectors y, x ∈ R^2 contain the real and imaginary parts of y and x in each dimension and W ∈ R^{2×2} is a weight matrix. Note that the real-valued version of this NN has four parameters while the complex-valued version has only two.
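The following short numerical check makes this equivalence concrete for a single complex weight (a sketch of the argument, not library code): the complex multiplication corresponds to a 2×2 real matrix with a fixed structure, whereas an unconstrained real matrix of the same shape has four free parameters.

import numpy as np

w = 0.8 - 0.3j          # single complex weight: two real parameters
x = 1.0 + 2.0j          # scalar complex input

# Complex-valued "network": y = w * x
y = w * x

# Equivalent real-valued network: y_real = W @ [Re(x), Im(x)].
# The complex multiplication forces this particular structure on W;
# an unconstrained real-valued W of the same shape has four parameters.
W = np.array([[w.real, -w.imag],
              [w.imag,  w.real]])
y_real = W @ np.array([x.real, x.imag])

assert np.allclose(y_real, [y.real, y.imag])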
Deep unfolding [25] of existing iterative algorithms has recently been applied in the context of channel decoding and MIMO detection [22], [23], [24]. For instance, in [22] it was shown that training with a single codeword is sufficient since the structure of the code is embedded in the NN through the Tanner graph. The concept of RTNs as presented in Section III-C is another way of incorporating both side information from existing models and information derived from a rich dataset into a DL algorithm to improve performance while reducing model and training complexity.

E. System identification for end-to-end learning

In Sections III-A to III-C, we have tacitly assumed that the transfer function of the channel is known so that the backpropagation algorithm can compute its gradient. For example, for a Rayleigh fading channel, the autoencoder needs to know during the training phase the exact realization of the channel coefficients to compute how a slight change in the transmitted signal x impacts the received signal y. While this is easily possible for simulated systems, it poses a major challenge for end-to-end learning over real channels and hardware. In essence, the hardware and channel together form a black box whose input and output can be observed, but for which no exact analytic expression is known a priori. Constructing a model for a black box from data is called system identification [63], which is widely used in control theory. Transfer learning [64], which has worked well in other domains (e.g., computer vision), is one appealing candidate for adapting an end-to-end communications system trained on a statistical model to a real-world implementation. An important related question is that of how one can learn a general model for a wide range of communication scenarios and tasks that would avoid retraining from scratch for every individual setting.
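As a sketch of how this transfer-learning step could look in practice (our own illustration, not the released code [44]), a receiver network trained on a simulated channel model can be adapted by continuing training on recorded pairs of received signals and known transmitted messages; receiver, y_measured, and msgs_onehot below are hypothetical placeholders.

from keras.optimizers import Adam

def adapt_receiver(receiver, y_measured, msgs_onehot):
    # receiver:     Keras model previously trained on a simulated channel model
    # y_measured:   signals recorded at the output of the real hardware/channel
    # msgs_onehot:  one-hot encoding of the messages that were actually sent
    # A small learning rate only nudges the simulation-trained weights
    # towards the measured channel instead of re-learning from scratch.
    receiver.compile(optimizer=Adam(lr=1e-4), loss='categorical_crossentropy')
    receiver.fit(y_measured, msgs_onehot, epochs=10, batch_size=256)
    return receiver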

F. Learning from CSI and beyond

Accurate channel state information (CSI) is a fundamental requirement for multi-user MIMO communications. For this reason, current cellular communication systems invest significant resources (energy and time) in the acquisition of CSI at the base station and user equipment. This information is generally not used for anything apart from precoding/detection or other tasks directly related to the processing of the current data frame. Storing and analyzing large amounts of CSI (or other radio data), possibly enriched with location information, holds significant potential for revealing novel big-data-driven physical-layer understanding algorithms beyond immediate radio environment needs. New applications beyond the traditional scope of communications, such as tracking and identification of humans (through walls) [65] as well as gesture and emotion recognition [66], could be achieved using ML on radio signals.
V. CONCLUSION

We have discussed several promising new applications of DL to the physical layer. Most importantly, we have introduced a new way of thinking about communications as an end-to-end reconstruction optimization task, using autoencoders to jointly learn transmitter and receiver implementations as well as signal encodings without any prior knowledge. Comparisons with traditional baselines in various scenarios reveal extremely competitive BLER performance, although the scalability to long block lengths remains a challenge. Apart from potential performance improvements in terms of reliability or latency, our approach can provide interesting insight into the optimal communication schemes (e.g., constellations) in scenarios where the optimal schemes are unknown (e.g., the interference channel). We believe that this is the beginning of a wide range of studies into DL and ML for communications, and we are excited about the possibilities this could open up for future wireless communications systems as the field matures. For now, there are a great number of open problems to solve and practical gains to be had. We have identified important key areas of future investigation and highlighted the need for benchmark problems and data sets that can be used to compare the performance of different ML models and algorithms.
REFERENCES

[1] T. S. Rappaport, Wireless communications: Principles and practice, 2nd ed. Prentice Hall, 2002.
[2] R. M. Gagliardi and S. Karp, Optical communications, 2nd ed. Wiley, 1995.
[3] H. Meyr, M. Moeneclaey, and S. A. Fechtel, Digital communication receivers: Synchronization, channel estimation, and signal processing. John Wiley & Sons, Inc., 1998.
[4] T. Schenk, RF imperfections in high-rate wireless systems: Impact and digital compensation. Springer Science & Business Media, 2008.
[5] J. Proakis and M. Salehi, Digital Communications, 5th ed. McGraw-Hill Education, 2007.
[6] Y. LeCun et al., “Generalization and network design strategies,” Connectionism in perspective, pp. 143–155, 1989.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 1026–1034.
[8] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proc. IEEE Int. Conf. Computer Vision, 1999, pp. 1150–1157.
[9] Z. S. Harris, “Distributional structure,” Word, vol. 10, no. 2-3, pp. 146–162, 1954.
[10] A. Goldsmith, “Joint source/channel coding for wireless channels,” in Proc. IEEE Vehicular Technol. Conf., vol. 2, 1995, pp. 614–618.
[11] E. Zehavi, “8-PSK trellis codes for a Rayleigh channel,” IEEE Trans. Commun., vol. 40, no. 5, pp. 873–884, 1992.
[12] H. Wymeersch, Iterative receiver design. Cambridge University Press, 2007, vol. 234.
[13] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[14] S. Reed and N. de Freitas, “Neural programmer-interpreters,” arXiv preprint arXiv:1511.06279, 2015.
[15] H. T. Siegelmann and E. D. Sontag, “On the computational power of neural nets,” in Proc. 5th Annu. Workshop Computational Learning Theory. ACM, 1992, pp. 440–449.
[16] V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on CPUs,” in Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.
[17] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[18] R. Raina, A. Madhavan, and A. Y. Ng, “Large-scale deep unsupervised learning using graphics processors,” in Proc. Int. Conf. Mach. Learn. (ICML). ACM, 2009, pp. 873–880.
[19] M. Ibnkahla, “Applications of neural networks to digital communications – A survey,” Elsevier Signal Processing, vol. 80, no. 7, pp. 1185–1215, 2000.
[20] M. Bkassiny, Y. Li, and S. K. Jayaweera, “A survey on machine-learning techniques in cognitive radios,” IEEE Commun. Surveys Tuts., vol. 15, no. 3, pp. 1136–1159, 2013.
[21] J. Qadir, K.-L. A. Yau, M. A. Imran, Q. Ni, and A. V. Vasilakos, “IEEE Access Special Section Editorial: Artificial Intelligence Enabled Networking,” IEEE Access, vol. 3, pp. 3079–3082, 2015.
[22] E. Nachmani, Y. Be’ery, and D. Burshtein, “Learning to decode linear codes using deep learning,” in Proc. IEEE Annu. Allerton Conf. Commun., Control, and Computing (Allerton), 2016, pp. 341–346.
[23] E. Nachmani, E. Marciano, D. Burshtein, and Y. Be’ery, “RNN decoding of linear block codes,” arXiv preprint arXiv:1702.07560, 2017.
[24] N. Samuel, T. Diskin, and A. Wiesel, “Deep MIMO detection,” arXiv preprint arXiv:1706.01151, 2017.
[25] J. R. Hershey, J. L. Roux, and F. Weninger, “Deep unfolding: Model-based inspiration of novel deep architectures,” arXiv preprint arXiv:1409.2574, 2014.
[26] M. Borgerding and P. Schniter, “Onsager-corrected deep learning for sparse linear inverse problems,” arXiv preprint arXiv:1607.05966, 2016.
[27] Y.-S. Jeon, S.-N. Hong, and N. Lee, “Blind detection for MIMO systems with low-resolution ADCs using supervised learning,” arXiv preprint arXiv:1610.07693, 2016.
[28] N. Farsad and A. Goldsmith, “Detection algorithms for communication systems using deep learning,” arXiv preprint arXiv:1705.08044, 2017.
[29] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for wireless resource management,” arXiv preprint arXiv:1705.09412, 2017.
[30] T. J. O’Shea, K. Karra, and T. C. Clancy, “Learning to communicate: Channel auto-encoders, domain specific regularizers, and attention,” in Proc. IEEE Int. Symp. Signal Process. and Inf. Technol. (ISSPIT), 2016, pp. 223–228.
[31] T. J. O’Shea, J. Corgan, and T. C. Clancy, “Convolutional radio modulation recognition networks,” in Proc. Int. Conf. Eng. Applications of Neural Networks. Springer, 2016, pp. 213–226.
[32] T. J. O’Shea, J. Corgan, and T. C. Clancy, “Unsupervised representation learning of structured radio communication signals,” in Proc. IEEE Int. Workshop Sensing, Processing and Learning for Intelligent Machines (SPLINE), 2016, pp. 1–5.
[33] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learning-based channel decoding,” in Proc. IEEE 51st Annu. Conf. Inf. Sciences Syst. (CISS), 2017, pp. 1–6.
[34] S. Cammerer, T. Gruber, J. Hoydis, and S. ten Brink, “Scaling deep learning-based decoding of polar codes via partitioning,” arXiv preprint arXiv:1702.06901, 2017.
[35] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[36] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[37] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. Int. Conf. Mach. Learn. (ICML), 2010, pp. 807–814.
[38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
[39] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
[40] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[41] R. Al-Rfou, G. Alain, A. Almahairi et al., “Theano: A Python framework for fast computation of mathematical expressions,” arXiv preprint arXiv:1605.02688, 2016.
[42] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A matlab-like environment for machine learning,” in BigLearn, NIPS Workshop, 2011.
[43] F. Chollet, “keras,” https://github.com/fchollet/keras, 2015.
[44] T. O’Shea and J. Hoydis, “Source code,” https://github.com/ -available-after-review, 2017.
[45] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[46] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[47] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2933–2941.
[48] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
[49] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[50] M. Abadi and D. G. Andersen, “Learning to protect communications with adversarial neural cryptography,” arXiv preprint arXiv:1610.06918, 2016.
[51] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2017–2025.
[52] J. Estaran et al., “Artificial neural networks for linear and non-linear impairment mitigation in high-baudrate IM/DD systems,” in Proc. 42nd European Conf. Optical Commun. (ECOC). VDE, 2016, pp. 1–3.
[53] A. K. Nandi and E. E. Azzouz, “Algorithms for automatic modulation recognition of communication signals,” IEEE Trans. Commun., vol. 46, no. 4, pp. 431–436, 1998.
[54] A. Fehske, J. Gaeddert, and J. H. Reed, “A new approach to signal classification using spectral correlation and neural networks,” in Proc. IEEE Int. Symp. New Frontiers in Dynamic Spectrum Access Networks (DYSPAN), 2005, pp. 144–150.
[55] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[56] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[57] A. Abdelmutalab, K. Assaleh, and M. El-Tarhuni, “Automatic modulation classification based on high order cumulants and hierarchical polynomial classifiers,” Physical Communication, vol. 21, pp. 10–18, 2016.
[58] D. George and E. Huerta, “Deep neural networks to enable real-time multimessenger astrophysics,” arXiv preprint arXiv:1701.00008, 2016.
[59] D. Maclaurin, D. Duvenaud, and R. P. Adams, “Gradient-based hyperparameter optimization through reversible learning,” in Proc. 32nd Int. Conf. Mach. Learn. (ICML), 2015.
[60] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. Mach. Learn. Res., vol. 13, pp. 281–305, Feb. 2012.
[61] A. Hirose, Complex-valued neural networks. Springer Science & Business Media, 2006.
[62] M. F. Amin, M. I. Amin, A. Al-Nuaimi, and K. Murase, “Wirtinger calculus based gradient descent and Levenberg-Marquardt learning algorithms in complex-valued neural networks,” in Int. Conf. Neural Information Processing. Springer, 2011, pp. 550–559.
[63] G. C. Goodwin and R. L. Payne, Dynamic system identification: Experiment design and data analysis. Academic Press, 1977.
[64] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, 2010.
[65] F. Adib, C.-Y. Hsu, H. Mao, D. Katabi, and F. Durand, “Capturing the human figure through a wall,” ACM Trans. Graphics (TOG), vol. 34, no. 6, p. 219, 2015.
[66] M. Zhao, F. Adib, and D. Katabi, “Emotion recognition using wireless signals,” in Proc. ACM Annu. Int. Conf. Mobile Computing and Networking, 2016, pp. 95–108.
