An Introduction To Deep Learning For The Physical Layer
Abstract—We present and discuss several novel applications of deep learning (DL) for the physical layer. By interpreting a communications system as an autoencoder, we develop a fundamental new way to think about communications system design as an end-to-end reconstruction task that seeks to jointly optimize transmitter and receiver components in a single process. We show how this idea can be extended to networks of multiple transmitters and receivers and present the concept of radio transformer networks (RTNs) as a means to incorporate expert domain knowledge in the machine learning (ML) model. Lastly, we demonstrate the application of convolutional neural networks (CNNs) on raw IQ samples for modulation classification which achieves competitive accuracy with respect to traditional schemes relying on expert features. The paper is concluded with a discussion of open challenges and areas for future investigation.

I. INTRODUCTION

Communications is a field of rich expert knowledge about how to model channels of different types [1], [2], compensate for various hardware imperfections [3], [4], and design optimal signaling and detection schemes that ensure a reliable transfer of data [5]. As such, it is a complex and mature engineering field with many distinct areas of investigation which have all seen diminishing returns with regards to performance improvements, in particular on the physical layer. Because of this, there is a high bar of performance over which any machine learning (ML) or deep learning (DL) based approach must pass in order to provide tangible new benefits.

In domains such as computer vision and natural language processing, DL shines because it is difficult to characterize real world images or language with rigid mathematical models. For example, while it is an almost impossible task to write a robust algorithm for detection of handwritten digits or objects in images, it is almost trivial today to implement DL algorithms that learn to accomplish this task beyond human levels of accuracy [6], [7]. In communications, on the other hand, we can design transmit signals that enable straightforward algorithms for symbol detection for a variety of channel and system models (e.g., detection of a constellation symbol in additive white Gaussian noise (AWGN)). Thus, as long as such models sufficiently capture real effects we do not expect DL to yield significant improvements on the physical layer.

Nevertheless, we believe that the DL applications which we explore in this paper are a useful and insightful way of fundamentally rethinking the communications system design problem, and hold promise for performance improvements in complex communications scenarios that are difficult to describe with tractable mathematical models. Our main contributions are as follows:

• We demonstrate that it is possible to learn full transmitter and receiver implementations for a given channel model which are optimized for a chosen loss function (e.g., minimizing block error rate (BLER)). Interestingly, such "learned" systems can be competitive with respect to the current state-of-the-art. The key idea here is to represent transmitter, channel, and receiver as one deep neural network (NN) that can be trained as an autoencoder. The beauty of this approach is that it can even be applied to channel models and loss functions for which the optimal solutions are unknown.

• We extend this concept to an adversarial network of multiple transmitter-receiver pairs competing for capacity. This leads to the interference channel for which finding the best signaling scheme is a long-standing research problem. We demonstrate that such a setup can also be represented as an NN with multiple inputs and outputs, and that all transmitter and receiver implementations can be jointly optimized with respect to a common or individual performance metric(s).

• We introduce radio transformer networks (RTNs) as a way to integrate expert knowledge into the DL model. RTNs allow, for example, to carry out predefined correction algorithms ("transformers") at the receiver (e.g., multiplication by a complex-valued number, convolution with a vector) which may be fed with parameters learned by another NN. This NN can be integrated into the end-to-end training process of a task performed on the transformed signal (e.g., symbol detection).

• We study the use of NNs on complex-valued IQ samples for the problem of modulation classification and show that convolutional neural networks (CNNs), which are the cornerstone of most DL systems for computer vision, can outperform traditional classification techniques based on expert features. This result mirrors a relentless trend in DL for various domains, where learned features ultimately outperform and displace long-used expert features, such as the scale-invariant feature transform (SIFT) [8] and Bag-of-words [9].

The ideas presented in this paper provide a multitude of interesting avenues for future research that will be discussed in detail. We hope that these will stimulate wide interest within the research community.

T. O'Shea is with the Bradley Department of Electrical and Computer Engineering, Virginia Tech and DeepSig, Arlington, VA, US ([email protected]). J. Hoydis is with Nokia Bell Labs, Route de Villejust, 91620 Nozay, France ([email protected]).
The rest of this article is structured as follows: Section I-A discusses potential benefits of DL for the physical layer. Section I-B presents related work. Background of deep learning is presented in Section II. In Section III, several DL applications for communications are presented. Section IV contains an overview and discussion of open problems and key areas of future investigation. Section V concludes the article.

A. Potential of DL for the physical layer

Apart from the intellectual beauty of a fully "learned" communications system, there are some reasons why DL could provide gains over existing physical layer algorithms.

First, most signal processing algorithms in communications have solid foundations in statistics and information theory and are often provably optimal for tractable mathematical models. These are generally linear, stationary, and have Gaussian statistics. A practical system, however, has many imperfections and non-linearities [4] (e.g., non-linear power amplifiers (PAs), finite resolution quantization) that can only be approximately captured by such models. For this reason, a DL-based communications system (or processing block) that does not require a mathematically tractable model and that can be optimized for a specific hardware configuration and channel might be able to better optimize for such imperfections.

Second, one of the guiding principles of communications systems design is to split the signal processing into a chain of multiple independent blocks, each executing a well-defined and isolated function (e.g., source/channel coding, modulation, channel estimation, equalization). Although this approach has led to the efficient, versatile, and controllable systems we have today, it is not clear that individually optimized processing blocks achieve the best possible end-to-end performance. For example, the separation of source and channel coding for many practical channels and short block lengths (see [10] and references therein) as well as separate coding and modulation [11] are known to be sub-optimal. Attempts to jointly optimize each of these components, e.g., based on factor graphs [12], provide gains but lead to unwieldy and computationally complex systems. A learned end-to-end communications system, on the other hand, is unlikely to have such a rigid modular structure as it is optimized for end-to-end performance.

Third, it has been shown that NNs are universal function approximators [13] and recent work has shown a remarkable capacity for algorithmic learning with recurrent NNs [14] that are known to be Turing-complete [15]. Since the execution of NNs can be highly parallelized on concurrent architectures and easily implemented with low-precision data types [16], there is evidence that "learned" algorithms taking this form could be executed faster and at lower energy cost than their manually "programmed" counterparts.

Fourth, massively parallel processing architectures with distributed memory architectures, such as graphic processing units (GPUs) but also increasingly specialized chips for NN inference (e.g., [17]), have been shown to be very energy efficient and capable of impressive computational throughput when fully utilized by concurrent algorithms [18]. The performance of such architectures, however, has been largely limited by the ability of algorithms and higher-level programming languages to make efficient use of them. The inherently concurrent nature of computation and memory access across wide and deep NNs has demonstrated a surprising ability to readily achieve high resource utilization on these architectures with minimal application-specific tuning or optimization required.

B. Historical context and related work

Applications of ML in communications have a long history covering a wide range of applications. These comprise channel modeling and prediction, localization, equalization, decoding, quantization, compression, demodulation, modulation recognition, and spectrum sensing to name a few [19], [20] (and references therein). However, to the best of our knowledge and due to the reasons mentioned above, few of these applications have been commonly adopted or led to a wide commercial success. It is also interesting that essentially all of these applications focus on individual receiver processing tasks alone, while the consideration of the transmitter or a full end-to-end system is entirely missing in the literature.

The advent of open-source DL libraries (see Section II-B) and readily available specialized hardware along with the astonishing progress of DL in computer vision have stimulated renewed interest in the application of DL for communications and networking [21]. There are currently essentially two different main approaches of applying DL to the physical layer. The goal is to either improve/augment parts of existing algorithms with DL, or to completely replace them.

Among the papers falling into the first category are [22], [23] and [24] that consider improved belief propagation channel decoding and multiple-input multiple-output (MIMO) detection, respectively. These works are inspired by the idea of deep unfolding [25] of existing iterative algorithms by essentially interpreting each iteration as a set of NN layers. In a similar manner, [26] aims at improving the solution of sparse linear inverse problems with DL.

In the second category, papers include [27], dealing with blind detection for MIMO systems with low-resolution quantization, and [28], in which detection for molecular communications for which no mathematical channel model exists is studied. The idea of learning to solve complex optimization tasks for wireless resource allocation, such as power control, is investigated in [29]. Some of us have also demonstrated initial results in the area of learned end-to-end communications systems [30] as well as considered the problems of modulation recognition [31], signal compression [32], and channel decoding [33], [34] with state-of-the-art DL tools.

Notations: We use boldface upper- and lower-case letters to denote matrices and column vectors, respectively. For a vector x, x_i denotes its ith element, ‖x‖ its Euclidean norm, x^T its transpose, and x ⊙ y the element-wise product with y. For a matrix X, X_{ij} or [X]_{ij} denotes the (i, j)-element. R and C denote the sets of real and complex numbers, respectively. N(m, R) and CN(m, R) are the multivariate Gaussian and complex Gaussian distributions with mean vector m and covariance matrix R, respectively. Bern(α) is the Bernoulli distribution with success probability α, and ∇ is the gradient operator.
θ_{t+1} = θ_t − η ∇L̃(θ_t)          (4)
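For readers who prefer code to notation, the following is a minimal NumPy sketch of the update in (4); the grad_loss function standing in for ∇L̃ is a hypothetical placeholder rather than anything defined in this paper.

```python
# Minimal sketch of the SGD update in (4): theta <- theta - eta * gradient of
# the loss estimated on a random mini-batch. `grad_loss(theta, batch)` is a
# hypothetical user-supplied function returning that gradient.
import numpy as np

def sgd_step(theta, grad_loss, batch, eta=0.01):
    return theta - eta * grad_loss(theta, batch)

def train(theta, grad_loss, data, batch_size=64, steps=1000, eta=0.01):
    rng = np.random.default_rng(0)
    for _ in range(steps):
        batch = data[rng.integers(0, len(data), size=batch_size)]  # random mini-batch
        theta = sgd_step(theta, grad_loss, batch, eta)             # update (4)
    return theta
```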
¹ The linear activation function is typically used at the output layer in the context of regression tasks, i.e., estimation of a real-valued vector.
² In image processing, X is commonly a three-dimensional tensor with the third dimension corresponding to color channels. The filter weights are also three-dimensional and work on each input channel simultaneously.
from an input matrix X ∈ R^{n×m} (see footnote 2) according to the following convolution:

Y^f_{i,j} = \sum_{k=0}^{a-1} \sum_{\ell=0}^{b-1} Q^f_{a-k,\,b-\ell} \, X_{1+s(i-1)-k,\;1+s(j-1)-\ell}          (6)

where s ≥ 1 is an integer parameter called stride, n' = 1 + ⌊(n+a−2)/s⌋ and m' = 1 + ⌊(m+b−2)/s⌋, and it is assumed that X is padded with zeros, i.e., X_{i,j} = 0 for all i ∉ [1, n] and j ∉ [1, m]. The output dimensions can be reduced by either increasing the stride s or by adding a pooling layer. The pooling layer partitions Y into p × p regions for each of which it computes a single output value, e.g., maximum or average value, or L2-norm.

For example, taking a vectorized grayscale image input consisting of 28 × 28 pixels and connecting it to a dense layer with the same number of activations results in a single weight matrix with 784 × 784 = 614,656 free parameters. On the other hand, if we use a convolutional feature map containing six filters each sized 5 × 5 pixels, we obtain a much reduced number of free parameters of 6 · 5 · 5 = 150. For the right kind of dataset, this technique can be extremely effective. We will see an application of convolutional layers in Section III-D. For more details on CNNs, we refer to [35, Ch. 9].
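As a sanity check on these counts, the following is a minimal sketch, assuming TensorFlow/Keras (any of the libraries in Section II-B would do equally well), that builds both layers and prints their number of free parameters; biases are disabled so the counts match the weight-only figures quoted above.

```python
# Dense layer on a vectorized 28x28 image vs. a convolutional feature map with
# six 5x5 filters, as in the example above. Bias terms are switched off so that
# count_params() returns only the weight counts (614,656 vs. 150).
import tensorflow as tf

dense = tf.keras.Sequential([
    tf.keras.layers.Dense(28 * 28, use_bias=False, input_shape=(28 * 28,)),
])
conv = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, (5, 5), use_bias=False, input_shape=(28, 28, 1)),
])

print(dense.count_params())  # 614656 = 784 * 784
print(conv.count_params())   # 150 = 6 * 5 * 5 (one input channel)
```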
one time (see [35, Ch. 1.2] for a short history of DL). Layer-by-layer pre-training [45] was also a recently popular method for scaling training to larger networks where backpropagation once struggled. However, most systems today are able to train networks which are both wide and deep directly using backpropagation and SGD methods with adaptive learning rates (e.g., Adam [46]), regularization methods to prevent overfitting (e.g., Dropout [36]), and activation functions which reduce gradient issues (e.g., ReLU [37]).

III. EXAMPLES OF MACHINE LEARNING APPLICATIONS FOR THE PHYSICAL LAYER

In this section, we will show how to represent an end-to-end communications system as an autoencoder and train it via SGD. This idea is then extended to multiple transmitters and receivers and we study as an example the two-user interference channel. We will then introduce the concept of RTNs to improve performance on fading channels, and demonstrate the application of CNNs to raw radio frequency time-series data for the task of modulation classification.

A. Autoencoders for end-to-end communications systems
Figure 2: A communications system over an AWGN channel represented as an autoencoder. The input s is encoded as a one-hot vector; the output is a probability distribution over all possible messages from which the most likely is picked as output ŝ.
non-linearly compress and reconstruct the input. In our case, the purpose of the autoencoder is different. It seeks to learn representations x of the messages s that are robust with respect to the channel impairments mapping x to y (i.e., noise, fading, distortion, etc.), so that the transmitted message can be recovered with small probability of error. In other words, while most autoencoders remove redundancy from input data for compression, this autoencoder (the "channel autoencoder") often adds redundancy, learning an intermediate representation robust to channel perturbations.

An example of such an autoencoder is shown in Fig. 2. Here, the transmitter consists of a feedforward NN with multiple dense layers followed by a normalization layer that ensures that physical constraints on x are met. Note that the input s to the transmitter is encoded as a one-hot vector 1_s ∈ R^M, i.e., an M-dimensional vector, the sth element of which is equal to one and zero otherwise. The channel is represented by an additive noise layer with a fixed variance β = (2R E_b/N_0)^{-1}, where E_b/N_0 denotes the energy per bit (E_b) to noise power spectral density (N_0) ratio. The receiver is also implemented as a feedforward NN. Its last layer uses a softmax activation whose output p ∈ (0, 1)^M is a probability vector over all possible messages. The decoded message ŝ corresponds then to the index of the element of p with the highest probability. The autoencoder can then be trained end-to-end using SGD on the set of all possible messages s ∈ M using the well-suited categorical cross-entropy loss function between 1_s and p.⁴

⁴ A more memory-efficient approach to implement this architecture is by replacing the one-hot encoded input and the first dense layer by an embedding that turns message indices into vectors. The loss function can then be replaced by the sparse categorical cross-entropy that accepts message indices rather than one-hot vectors as labels. This was done in our experiments [44].
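To make this concrete, the following is a minimal sketch of such a channel autoencoder in TensorFlow/Keras for (n, k) = (7, 4). The one-hot input, two transmit layers, energy normalization, fixed-variance noise layer, softmax receiver, categorical cross-entropy loss, and training settings (E_b/N_0 = 7 dB, Adam with learning rate 0.001) follow the description above; the layer widths and the way training batches are drawn are illustrative assumptions and not the authors' exact implementation [44].

```python
import numpy as np
import tensorflow as tf

k, n = 4, 7                                  # (7,4) autoencoder: k bits, n channel uses
M = 2 ** k                                   # number of messages
R = k / n                                    # communication rate
beta = 1.0 / (2 * R * 10 ** (7.0 / 10))      # noise variance for Eb/N0 = 7 dB

s_in = tf.keras.Input(shape=(M,))                             # one-hot message 1_s
x = tf.keras.layers.Dense(M, activation="relu")(s_in)         # transmitter, layer 1
x = tf.keras.layers.Dense(n, activation="linear")(x)          # transmitter, layer 2
x = tf.keras.layers.Lambda(                                   # energy constraint ||x||^2 = n
    lambda v: np.sqrt(n) * tf.math.l2_normalize(v, axis=1))(x)
y = tf.keras.layers.GaussianNoise(np.sqrt(beta))(x)           # AWGN layer (active during training)
r = tf.keras.layers.Dense(M, activation="relu")(y)            # receiver
p = tf.keras.layers.Dense(M, activation="softmax")(r)         # probability vector over messages

autoencoder = tf.keras.Model(s_in, p)
autoencoder.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                    loss="categorical_crossentropy")

# Train on random batches drawn from the set of all M possible one-hot messages;
# the labels are the inputs themselves (end-to-end reconstruction).
msgs = np.eye(M, dtype=np.float32)[np.random.randint(0, M, size=100000)]
autoencoder.fit(msgs, msgs, batch_size=1024, epochs=10, verbose=0)
```

As noted in footnote 4, the one-hot input and the first dense layer can be replaced by an embedding of message indices together with the sparse categorical cross-entropy loss, which is more memory-efficient for large M.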
Fig. 3a compares the block error rate (BLER), i.e., Pr(ŝ ≠ s), of a communications system employing binary phase-shift keying (BPSK) modulation and a Hamming (7,4) code with either binary hard-decision decoding or maximum likelihood decoding (MLD) against the BLER achieved by the trained autoencoder (7,4) (with fixed energy constraint ‖x‖² = n). Both systems operate at rate R = 4/7. For comparison, we also provide the BLER of uncoded BPSK (4,4). This result shows that the autoencoder has learned without any prior knowledge an encoder and decoder function that together achieve the same performance as the Hamming (7,4) code with MLD. The layout of the autoencoder is provided in Table IV. Although a single layer can represent the same mapping from message index to corresponding transmit vector, our experiments have shown that SGD converges to a better global solution using two transmit layers instead of one. This increase in the dimension of the parameter search space may actually help to reduce the likelihood of convergence to sub-optimal minima by making such solutions more likely to emerge as saddle points during optimization [47]. Training was done at a fixed value of E_b/N_0 = 7 dB (cf. Section IV-B) using Adam [46] with learning rate 0.001. We have observed that increasing the batch size during training helps to improve accuracy. For all other implementation details, we refer to the source code [44].

Fig. 3b shows a similar comparison but for an (8,8) and (2,2) communications system, i.e., R = 1. Surprisingly, while the autoencoder achieves the same BLER as uncoded BPSK for (2,2), it outperforms the latter for (8,8) over the full range of E_b/N_0. This implies that it has learned some joint coding and modulation scheme, such that a coding gain is achieved. For a truly fair comparison, this result should be compared to a higher-order modulation scheme using a channel code (or the optimal sphere packing in eight dimensions). A detailed performance comparison for various channel types and parameters (n, k) with different baselines is out of the scope of this paper and left to future investigations.

Fig. 4 shows the learned representations x of all messages for different values of (n, k) as complex constellation points, i.e., the x- and y-axes correspond to the first and second transmitted symbols, respectively. In Fig. 4d for (7, 4), we depict the seven-dimensional message representations using a two-dimensional t-distributed stochastic neighbor embedding (t-SNE) [48] of the noisy observations y instead. Fig. 4a shows the simple (2, 2) system which converges rapidly to a classical quadrature phase shift keying (QPSK) constellation with some arbitrary rotation. Similarly, Fig. 4b shows a (4, 2) system which leads to a rotated 16-PSK constellation. The impact of the chosen normalization becomes clear from Fig. 4c for the same parameters but with an average power normalization
Figure 3: BLER versus E_b/N_0 for the autoencoder and several baseline communication schemes. (a) Autoencoder (7,4) compared with uncoded BPSK (4,4) and the Hamming (7,4) code with hard-decision decoding and MLD; (b) Autoencoder (8,8) and (2,2) compared with uncoded BPSK (8,8) and (2,2).
C. Radio transformer networks for augmented signal processing algorithms

Many of the physical phenomena undergone in a communications channel and in transceiver hardware can be inverted using compact parametric models/transformations. Widely used transformations include re-sampling to estimated symbol/clock timing, mixing with an estimated carrier tone, and convolving with an inverse channel impulse response. The estimation processes for the parameters to seed these transformations (e.g., frequency offset, symbol timing, impulse response) are often very involved and specialized based on signal-specific properties and/or information from pilot tones (see, e.g., [3]).

One way of augmenting DL models with expert propagation domain knowledge but not signal-specific assumptions is through the use of an RTN as shown in Fig. 8. An RTN consists of three parts: (i) a learned parameter estimator g_ω: R^n → R^p which computes a parameter vector ω ∈ R^p from its input y, (ii) a parametric transform t: R^n × R^p → R^{n'} that applies a deterministic (and differentiable) function to y which is parametrized by ω and suited to the propagation phenomena, and (iii) a learned discriminative network g: R^{n'} → M which produces the estimate ŝ of the transmitted message (or other label information) from the canonicalized input ȳ ∈ R^{n'}.

By allowing the parameter estimator g_ω to take the form of an NN, we can train the system end-to-end to optimize for a given loss function. Importantly, the training process of such an RTN does not seek to directly improve the parameter estimation itself but rather optimizes the way the parameters are estimated to obtain the best end-to-end performance (e.g., BLER).

where h_c ∼ CN(0, L^{-1} I_L) are i.i.d. Rayleigh fading channel taps, n_c ∼ CN(0, (R E_b/N_0)^{-1} I_n) is receiver noise, and x_c ∈ C^n is the transmitted signal, where we assume in (12) that x_{c,i} = 0 for i ≤ 0. Here, the goal of the parameter estimator is to predict a complex-valued vector ω_c (represented by 2L real values) that is used in the transformation layer to compute the complex convolution of y_c with ω_c. Thus, the RTN tries to equalize the channel output through inverse filtering in order to simplify the task of the discriminative network. We have implemented the estimator as an NN with two dense layers with tanh activations followed by a dense output layer with linear activations.
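As an illustration of this structure, the following is a minimal sketch, assuming TensorFlow/Keras, of such an RTN receiver: a two-layer tanh estimator with a linear output of 2L real values, a transformation layer that filters the received block with the estimated complex vector, and a softmax discriminative network. The layer widths, message set size, and the packing of the 2n real-valued receiver inputs are illustrative assumptions, not the authors' exact code [44].

```python
import tensorflow as tf

n, L, M = 7, 3, 16        # channel uses, fading taps, number of messages (illustrative)

def complex_filter(args):
    """Transformation layer: linear filtering z[i] = sum_l w[l] * y[i-l] of each
    received block with its own estimated complex filter (inverse filtering)."""
    y, w = args                               # (batch, 2n) and (batch, 2L) real values
    y_c = tf.complex(y[:, :n], y[:, n:])      # assumed packing: first n real, last n imaginary
    w_c = tf.complex(w[:, :L], w[:, L:])
    z = tf.zeros_like(y_c)
    for l in range(L):                        # L is small and fixed, so a Python loop is fine
        shifted = tf.pad(y_c, [[0, 0], [l, 0]])[:, :n]
        z = z + w_c[:, l:l + 1] * shifted
    return tf.concat([tf.math.real(z), tf.math.imag(z)], axis=1)

y_in = tf.keras.Input(shape=(2 * n,))                        # received IQ samples (2n reals)
# (i) parameter estimator g_omega: two tanh layers, linear output of 2L real values
w = tf.keras.layers.Dense(32, activation="tanh")(y_in)
w = tf.keras.layers.Dense(32, activation="tanh")(w)
omega = tf.keras.layers.Dense(2 * L, activation="linear")(w)
# (ii) parametric transform t(y, omega): complex convolution of y_c with omega_c
y_can = tf.keras.layers.Lambda(complex_filter)([y_in, omega])
# (iii) discriminative network g producing the message estimate
h = tf.keras.layers.Dense(M, activation="relu")(y_can)
p = tf.keras.layers.Dense(M, activation="softmax")(h)

rtn_receiver = tf.keras.Model(y_in, p)   # trained end-to-end together with the transmitter
```

Because the transformation layer is differentiable, gradients flow through it into the estimator during end-to-end training, so the estimator is optimized only for the final loss, as described above.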
While the plain autoencoder struggles to meet the performance of differential BPSK (DBPSK) with maximum likelihood sequence estimation (MLE) and a Hamming (7,4) code, the autoencoder with RTN outperforms it. Another advantage of RTNs is faster training convergence, which can be seen from Fig. 10 that compares the validation loss of the autoencoder with and without RTN as a function of the training epochs. We have observed in our experiments that the autoencoder with RTN consistently outperforms the plain autoencoder, independently of the chosen hyper-parameters. However, the performance differences diminish when the encoder and decoder networks are made wider and trained for more iterations. Although there is theoretically nothing an RTN-augmented NN can do that a plain NN cannot, the RTN helps by incorporating domain knowledge to simplify the target manifold, similar

⁷ We assume complex-valued channel uses, so that transmitter and receiver have 2n real-valued inputs and outputs.
Figure 8: A radio receiver represented as an RTN. The input y first runs through a parameter estimation network g_ω(y), has a known transform t(y, ω) applied to generate the canonicalized signal ȳ, and then is classified in the discriminative network g(ȳ) to produce the output ŝ.
Figure 9: BLER versus E_b/N_0 for various communication schemes over a channel with L = 3 Rayleigh fading taps.

Figure 10: Autoencoder training loss (categorical cross-entropy) with and without RTN.
posed in [53] and [57]. The short-time nature of the examples places this task on the difficult end of the modulation classification spectrum, since we cannot compute expert features with high stability over long periods of time. We can see that the CNN outperforms the boosted feature-based classifier by around 4 dB in the low to medium SNR range while the performance at high SNR is almost identical. Performance in the single tree case is about 6 dB worse than the CNN at medium SNR and 3.5% worse at high SNR. Fig. 12 shows the confusion matrix of the CNN (SNR = 10 dB); the main confusion cases for the CNN are between QAM16 and QAM64 and between the analog modulations Wideband FM (WBFM) and double-sideband AM (AM-DSB), even at high SNR. The confusion between AM and FM arises during times when the underlying voice signal is idle or does not carry much information. The distinction between QAM16 and QAM64 is very hard with a short-time observation over only a few symbols which share constellation points. The accuracy of the feature-based classifier saturates at high SNR for the same reasons. In [58], the authors report on a successful application of a similar CNN for the detection of black hole mergers in astrophysics from noisy time-series data.

Figure 12: Confusion matrix of the CNN (SNR = 10 dB). The class labels are 8PSK, AM-DSB, BPSK, CPFSK, GFSK, PAM4, QAM16, QAM64, QPSK, and WBFM; the horizontal axis shows the prediction.

IV. DISCUSSION AND OPEN RESEARCH CHALLENGES

A. Data sets and challenges

In order to compare the performance of ML models and algorithms, it is crucial to have common benchmarks and open datasets. While this is the rule in the computer vision, voice recognition, and natural language processing domains (e.g., MNIST¹⁰ or ImageNet¹¹), nothing comparable exists for communications. This domain is somewhat different because it deals with inherently man-made signals that can be accurately generated synthetically, allowing the possibility of standardizing data generation routines rather than just data in some cases. It would be also desirable to establish a set of common problems and the corresponding datasets (or data-generating software) on which researchers can benchmark and compare their algorithms. One such example task is modulation classification in Section III-D; others could include mapping of impaired received IQ samples or symbols to codewords or bits. Even "autoencoder competitions" could be held for a standardized set of benchmark impairments, taking the form of canonical "impairment layers" that would need to be made available for some of the major DL libraries (see Section II-B).

⁸ RML2016.10b—https://radioml.com/datasets/radioml-2016-10-dataset/
⁹ At the time of writing of this document, XGB (http://xgboost.readthedocs.io/) was together with CNNs the ML model that consistently won competitions on the data-science platform Kaggle (https://www.kaggle.com/).
¹⁰ http://yann.lecun.com/exdb/mnist/
¹¹ http://www.image-net.org/
B. Data representation, loss functions, and training SNR

As DL for communications is a new field, little is known about optimal data representations, loss functions, and training strategies. For example, binary signals can be represented as binary or one-hot vectors, modulated (complex) symbols, or integers, and the optimal representation might depend, among other factors, on the NN architecture, learning objective, and loss function. In decoding problems, for instance, one would have the choice between plain channel observations or (clipped) log-likelihood ratios. In general, it seems that there is a representation which is most suited to solve a particular task via an NN. Similarly, it is not obvious at which SNR(s) DL processing blocks should be trained. It is clearly desirable that a learned system operates at any SNR, regardless of the SNR or SNR-range at which it was trained. However, we have observed that this is generally not the case. Training at low SNR, for instance, does not allow for the discovery of structure important in higher SNR scenarios. Training across wide ranges of SNR can also severely affect training time. The authors of [58] have observed that starting off the training at high SNR and then gradually lowering it with each epoch led to significant performance improvements for their application.

A related question is the optimal choice of loss function. In Sections III-A to III-C, we have treated communications as a classification problem for which the categorical cross-entropy is a common choice. However, for alternate output data representations, the choice is less obvious. Applying an inappropriate loss function can lead to poor results.

Choosing the right NN architecture and training parameters for SGD (such as mini-batch size and learning rate) are also important practical questions for which no satisfying hard rules exist. Some guidelines can be found in [35, Ch. 11], but methods for how to select such hyper-parameters are currently an active area of research and investigation in the DL world. Examples include architecture search guided by hyper-gradients and differential hyper-parameters [59] as well as genetic algorithm or particle swarm style optimization [60].

C. Complex-valued neural networks

Owing to the widely used complex baseband representation, we typically deal with complex numbers in communications. Most related signal processing algorithms rely on phase rotations, complex conjugates, absolute values, etc. For this reason, it would be desirable to have NNs operate on complex rather than real numbers [61]. However, none of the previously described DL libraries (see Section II-B) currently support this, due to several reasons. First, it is possible to represent all mathematical operations in the complex domain with a purely real-valued NN of twice the size, i.e., each complex number is simply represented by two real values. For example, an NN with a scalar complex input and output connected through a single complex weight, i.e., y = wx, where y, w, x ∈ C, can be represented as a real-valued NN y = Wx, where the vectors y, x ∈ R² contain the real and imaginary parts of y and x in each dimension and W ∈ R^{2×2} is a weight matrix. Note that the real-valued version of this NN has four parameters while the complex-valued version has only two.
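To make the parameter count explicit, the worked form of this example is given below: the real-valued network that exactly reproduces the complex multiplication uses a structured matrix whose four entries are tied to only two free parameters a and b, whereas an unconstrained real 2×2 weight matrix has four.

```latex
% Real-valued equivalent of the complex weight w = a + jb acting on x = x_R + j x_I
y = wx
\;\Longleftrightarrow\;
\begin{bmatrix} y_R \\ y_I \end{bmatrix}
=
\underbrace{\begin{bmatrix} a & -b \\ b & \phantom{-}a \end{bmatrix}}_{\mathbf{W} \in \mathbb{R}^{2\times 2}}
\begin{bmatrix} x_R \\ x_I \end{bmatrix}
```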
Second, a complication arises in complex-valued NNs since traditional loss and activation functions are generally not holomorphic, so that their gradients are not defined. A solution to this problem is Wirtinger calculus [62]. Although complex-valued NNs might be easier to train and consume less memory, we currently believe that they do not provide any significant advantage in terms of expressive power. Nevertheless, we keep them as an interesting topic for future research.

D. ML-augmented signal processing

The biggest challenge of learned end-to-end communications systems is that of scalability to large message sets M. Already for k = 100 bits, i.e., M = 2^100 possible messages, the training complexity is prohibitive since the autoencoder must see at least every message once. Also naive neural channel decoders (as studied in [33]) suffer from this "curse of dimensionality" since they need to be trained on all possible codewords. Thus, rather than switching immediately to learned end-to-end communications systems or fully replacing certain algorithms by NNs, one more gradual approach might be that of augmenting only specific sub-tasks with DL. A very interesting approach in this context is deep unfolding of existing iterative algorithms, outlined in [25]. This approach offers the potential to leverage additional side information from training data to improve an existing signal processing algorithm. It has been recently applied in the context of channel decoding and MIMO detection [22], [23], [24]. For instance, in [22], it was shown that training with a single codeword is sufficient since the structure of the code is embedded in the NN through the Tanner graph. The concept of RTNs as presented in Section III-C is another way of incorporating both side information from existing models along with information derived from a rich dataset into a DL algorithm to improve performance while reducing model and training complexity.

E. System identification for end-to-end learning

In Sections III-A to III-C, we have tacitly assumed that the transfer function of the channel is known so that the backpropagation algorithm can compute its gradient. For example, for a Rayleigh fading channel, the autoencoder needs to know during the training phase the exact realization of the channel coefficients to compute how a slight change in the transmitted signal x impacts the received signal y. While this is easily possible for simulated systems, it poses a major challenge for end-to-end learning over real channels and hardware. In essence, the hardware and channel together form a black box whose input and output can be observed, but for which no exact analytic expression is known a priori. Constructing a model for a black box from data is called system identification [63], which is widely used in control theory. Transfer learning [64] is one appealing candidate for adapting an end-to-end communications system trained on a statistical model to a real-world implementation, which has worked well in other domains (e.g., computer vision). An important related question is how one can learn a general model for a wide range of communication scenarios and tasks that would avoid retraining from scratch for every individual setting.
F. Learning from CSI and beyond

Accurate channel state information (CSI) is a fundamental requirement for multi-user MIMO communications. For this reason, current cellular communication systems invest significant resources (energy and time) in the acquisition of CSI at the base station and user equipment. This information is generally not used for anything apart from precoding/detection or other tasks directly related to processing of the current data frame. Storing and analyzing large amounts of CSI (or other radio data)—possibly enriched with location information—poses significant potential for revealing novel big-data-driven physical-layer understanding algorithms beyond immediate radio environment needs. New applications beyond the traditional scope of communications, such as tracking and identification of humans (through walls) [65] as well as gesture and emotion recognition [66], could be achieved using ML on radio signals.

V. CONCLUSION

We have discussed several promising new applications of DL to the physical layer. Most importantly, we have introduced a new way of thinking about communications as an end-to-end reconstruction optimization task using autoencoders to jointly learn transmitter and receiver implementations as well as signal encodings without any prior knowledge. Comparisons with traditional baselines in various scenarios reveal extremely competitive BLER performance, although the scalability to long block lengths remains a challenge. Apart from potential performance improvements in terms of reliability or latency, our approach can provide interesting insight about the optimal communication schemes (e.g., constellations) in scenarios where the optimal schemes are unknown (e.g., interference channel). We believe that this is the beginning of a wide range of studies into DL and ML for communications and are excited at the possibilities this could lend towards future wireless communications systems as the field matures. For now, there are a great number of open problems to solve and practical gains to be had. We have identified important key areas of future investigation and highlighted the need for benchmark problems and data sets that can be used to compare performance of different ML models and algorithms.

REFERENCES

[1] T. S. Rappaport, Wireless communications: Principles and practice, 2nd ed. Prentice Hall, 2002.
[2] R. M. Gagliardi and S. Karp, Optical communications, 2nd ed. Wiley, 1995.
[3] H. Meyr, M. Moeneclaey, and S. A. Fechtel, Digital communication receivers: Synchronization, channel estimation, and signal processing. John Wiley & Sons, Inc., 1998.
[4] T. Schenk, RF imperfections in high-rate wireless systems: Impact and digital compensation. Springer Science & Business Media, 2008.
[5] J. Proakis and M. Salehi, Digital Communications, 5th ed. McGraw-Hill Education, 2007.
[6] Y. LeCun et al., "Generalization and network design strategies," Connectionism in perspective, pp. 143–155, 1989.
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 1026–1034.
[8] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. IEEE Int. Conf. Computer Vision, 1999, pp. 1150–1157.
[9] Z. S. Harris, "Distributional structure," Word, vol. 10, no. 2-3, pp. 146–162, 1954.
[10] A. Goldsmith, "Joint source/channel coding for wireless channels," in Proc. IEEE Vehicular Technol. Conf., vol. 2, 1995, pp. 614–618.
[11] E. Zehavi, "8-PSK trellis codes for a Rayleigh channel," IEEE Trans. Commun., vol. 40, no. 5, pp. 873–884, 1992.
[12] H. Wymeersch, Iterative receiver design. Cambridge University Press, 2007, vol. 234.
[13] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural networks, vol. 2, no. 5, pp. 359–366, 1989.
[14] S. Reed and N. de Freitas, "Neural programmer-interpreters," arXiv preprint arXiv:1511.06279, 2015.
[15] H. T. Siegelmann and E. D. Sontag, "On the computational power of neural nets," in Proc. 5th Annu. Workshop Computational Learning Theory. ACM, 1992, pp. 440–449.
[16] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.
[17] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[18] R. Raina, A. Madhavan, and A. Y. Ng, "Large-scale deep unsupervised learning using graphics processors," in Proc. Int. Conf. Mach. Learn. (ICML). ACM, 2009, pp. 873–880.
[19] M. Ibnkahla, "Applications of neural networks to digital communications–A survey," Elsevier Signal Processing, vol. 80, no. 7, pp. 1185–1215, 2000.
[20] M. Bkassiny, Y. Li, and S. K. Jayaweera, "A survey on machine-learning techniques in cognitive radios," IEEE Commun. Surveys Tuts., vol. 15, no. 3, pp. 1136–1159, 2013.
[21] J. Qadir, K.-L. A. Yau, M. A. Imran, Q. Ni, and A. V. Vasilakos, "IEEE Access Special Section Editorial: Artificial Intelligence Enabled Networking," IEEE Access, vol. 3, pp. 3079–3082, 2015.
[22] E. Nachmani, Y. Be'ery, and D. Burshtein, "Learning to decode linear codes using deep learning," in Proc. IEEE Annu. Allerton Conf. Commun., Control, and Computing (Allerton), 2016, pp. 341–346.
[23] E. Nachmani, E. Marciano, D. Burshtein, and Y. Be'ery, "RNN decoding of linear block codes," arXiv preprint arXiv:1702.07560, 2017.
[24] N. Samuel, T. Diskin, and A. Wiesel, "Deep MIMO detection," arXiv preprint arXiv:1706.01151, 2017.
[25] J. R. Hershey, J. L. Roux, and F. Weninger, "Deep unfolding: Model-based inspiration of novel deep architectures," arXiv preprint arXiv:1409.2574, 2014.
[26] M. Borgerding and P. Schniter, "Onsager-corrected deep learning for sparse linear inverse problems," arXiv preprint arXiv:1607.05966, 2016.
[27] Y.-S. Jeon, S.-N. Hong, and N. Lee, "Blind detection for MIMO systems with low-resolution ADCs using supervised learning," arXiv preprint arXiv:1610.07693, 2016.
[28] N. Farsad and A. Goldsmith, "Detection algorithms for communication systems using deep learning," arXiv preprint arXiv:1705.08044, 2017.
[29] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for wireless resource management," arXiv preprint arXiv:1705.09412, 2017.
[30] T. J. O'Shea, K. Karra, and T. C. Clancy, "Learning to communicate: Channel auto-encoders, domain specific regularizers, and attention," in Proc. IEEE Int. Symp. Signal Process. and Inf. Technol. (ISSPIT), 2016, pp. 223–228.
[31] T. J. O'Shea, J. Corgan, and T. C. Clancy, "Convolutional radio modulation recognition networks," in Proc. Int. Conf. Eng. Applications of Neural Networks. Springer, 2016, pp. 213–226.
[32] T. J. O'Shea, J. Corgan, and T. C. Clancy, "Unsupervised representation learning of structured radio communication signals," in Proc. IEEE Int. Workshop Sensing, Processing and Learning for Intelligent Machines (SPLINE), 2016, pp. 1–5.
[33] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, "On deep learning-based channel decoding," in Proc. IEEE 51st Annu. Conf. Inf. Sciences Syst. (CISS), 2017, pp. 1–6.
[34] S. Cammerer, T. Gruber, J. Hoydis, and S. ten Brink, "Scaling deep learning-based decoding of polar codes via partitioning," arXiv preprint arXiv:1702.06901, 2017.
[35] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[36] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[37] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Mach. Learn. (ICML), 2010, pp. 807–814.
[38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[39] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," arXiv preprint arXiv:1512.01274, 2015.
[40] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[41] R. Al-Rfou, G. Alain, A. Almahairi et al., "Theano: A Python framework for fast computation of mathematical expressions," arXiv preprint arXiv:1605.02688, 2016.
[42] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A matlab-like environment for machine learning," in BigLearn, NIPS Workshop, 2011.
[43] F. Chollet, "Keras," https://github.com/fchollet/keras, 2015.
[44] T. O'Shea and J. Hoydis, "Source code," https://github.com/-available-after-review, 2017.
[45] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[46] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[47] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2933–2941.
[48] L. v. d. Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
[49] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[50] M. Abadi and D. G. Andersen, "Learning to protect communications with adversarial neural cryptography," arXiv preprint arXiv:1610.06918, 2016.
[51] M. Jaderberg, K. Simonyan, A. Zisserman et al., "Spatial transformer networks," in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2017–2025.
[52] J. Estaran et al., "Artificial neural networks for linear and non-linear impairment mitigation in high-baudrate IM/DD systems," in Proc. 42nd European Conf. Optical Commun. (ECOC). VDE, 2016, pp. 1–3.
[53] A. K. Nandi and E. E. Azzouz, "Algorithms for automatic modulation recognition of communication signals," IEEE Trans. Commun., vol. 46, no. 4, pp. 431–436, 1998.
[54] A. Fehske, J. Gaeddert, and J. H. Reed, "A new approach to signal classification using spectral correlation and neural networks," in IEEE Int. Symp. New Frontiers in Dynamic Spectrum Access Networks (DYSPAN), 2005, pp. 144–150.
[55] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[56] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[57] A. Abdelmutalab, K. Assaleh, and M. El-Tarhuni, "Automatic modulation classification based on high order cumulants and hierarchical polynomial classifiers," Physical Communication, vol. 21, pp. 10–18, 2016.
[58] D. George and E. Huerta, "Deep neural networks to enable real-time multimessenger astrophysics," arXiv preprint arXiv:1701.00008, 2016.
[59] D. Maclaurin, D. Duvenaud, and R. P. Adams, "Gradient-based hyperparameter optimization through reversible learning," in Proc. 32nd Int. Conf. Mach. Learn. (ICML), 2015.
[60] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," J. Mach. Learn. Res., vol. 13, pp. 281–305, Feb. 2012.
[61] A. Hirose, Complex-valued neural networks. Springer Science & Business Media, 2006.
[62] M. F. Amin, M. I. Amin, A. Al-Nuaimi, and K. Murase, "Wirtinger calculus based gradient descent and Levenberg-Marquardt learning algorithms in complex-valued neural networks," in Int. Conf. on Neural Information Processing. Springer, 2011, pp. 550–559.
[63] G. C. Goodwin and R. L. Payne, Dynamic system identification: Experiment design and data analysis. Academic Press, 1977.
[64] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, 2010.
[65] F. Adib, C.-Y. Hsu, H. Mao, D. Katabi, and F. Durand, "Capturing the human figure through a wall," ACM Trans. Graphics (TOG), vol. 34, no. 6, p. 219, 2015.
[66] M. Zhao, F. Adib, and D. Katabi, "Emotion recognition using wireless signals," in Proc. ACM Annu. Int. Conf. Mobile Computing and Networking, 2016, pp. 95–108.