An Introduction To Deep Learning For The Physical Layer

Abstract—We present and discuss several novel applications of deep learning for the physical layer. By interpreting a communications system as an autoencoder, we develop a fundamentally new way to think about communications system design as an end-to-end reconstruction task that seeks to jointly optimize transmitter and receiver components in a single process. We show how this idea can be extended to networks of multiple transmitters and receivers and present the concept of radio transformer networks as a means to incorporate expert domain knowledge in the machine learning model. Lastly, we demonstrate the application of convolutional neural networks on raw IQ samples for modulation classification, which achieves competitive accuracy with respect to traditional schemes relying on expert features. The paper concludes with a discussion of open challenges and areas for future investigation.

Index Terms—Machine learning, deep learning, physical layer, digital communications, modulation, radio communication, cognitive radio.

Manuscript received February 24, 2017; revised July 11, 2017 and September 18, 2017; accepted September 21, 2017. Date of publication October 2, 2017; date of current version December 22, 2017. The associate editor coordinating the review of this paper and approving it for publication was M. Zorzi. (Corresponding author: Timothy O'Shea.)
T. O'Shea is with the Bradley Department of Electrical and Computer Engineering, Virginia Tech and DeepSig, Arlington, VA 22203 USA (e-mail: [email protected]).
J. Hoydis is with the Department of Software-Defined Mobile Networks, Nokia Bell Labs, 91620 Nozay, France (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCCN.2017.2758370

I. INTRODUCTION

COMMUNICATIONS is a field of rich expert knowledge about how to model channels of different types [1], [2], compensate for various hardware imperfections [3], [4], and design optimal signaling and detection schemes that ensure a reliable transfer of data [5]. As such, it is a complex and mature engineering field with many distinct areas of investigation which have all seen diminishing returns with regard to performance improvements, in particular on the physical layer. Because of this, there is a high bar of performance over which any machine learning (ML) or deep learning (DL) based approach must pass in order to provide tangible new benefits.

In domains such as computer vision and natural language processing, DL shines because it is difficult to characterize real-world images or language with rigid mathematical models. For example, while it is an almost impossible task to write a robust algorithm for detection of handwritten digits or objects in images, it is straightforward today to implement DL algorithms that learn to accomplish this task beyond human levels of accuracy [6], [7]. In communications, however, we can design transmit signals that enable straightforward analytic algorithms for symbol detection for a variety of channel and system models (e.g., detection of a constellation symbol in additive white Gaussian noise (AWGN)). Thus, as long as such models sufficiently capture real effects, we do not expect DL to yield significant improvements on the physical layer.

Nevertheless, we believe that the DL applications which we explore in this paper are a useful and insightful way of fundamentally rethinking the communications system design problem, and hold promise for performance improvements in complex communications scenarios that are difficult to describe with tractable mathematical models. Our main contributions are as follows:

• We demonstrate that it is possible to learn full transmitter and receiver implementations for a given channel model which are optimized for a chosen loss function (e.g., minimizing block error rate (BLER)). Interestingly, such "learned" systems can be competitive with respect to the current state of the art. The key idea here is to represent transmitter, channel, and receiver as one deep neural network (NN) that can be trained as an autoencoder. The beauty of this approach is that it can even be applied to channel models and loss functions for which the optimal solutions are unknown.
• We extend this concept to an adversarial network of multiple transmitter-receiver pairs competing for capacity. This leads to the interference channel, for which finding the best signaling scheme is a long-standing research problem. We demonstrate that such a setup can also be represented as an NN with multiple inputs and outputs, and that all transmitter and receiver implementations can be jointly optimized with respect to common or individual performance metrics.
• We introduce radio transformer networks (RTNs) as a way to integrate expert knowledge into the DL model. RTNs allow, for example, carrying out predefined correction algorithms ("transformers") at the receiver (e.g., multiplication by a complex-valued number, convolution with a vector) which may be fed with parameters learned by another NN. This NN can be integrated into the end-to-end training process of a task performed on the transformed signal (e.g., symbol detection).
• We study the use of NNs on complex-valued IQ samples for the problem of modulation classification and show that convolutional neural networks (CNNs), which are the cornerstone of most DL systems for computer vision,
can outperform traditional classification techniques based on expert features. This result mirrors a relentless trend in DL for various domains, where learned features ultimately outperform and displace long-used expert features, such as the scale-invariant feature transform (SIFT) [8] and bag-of-words [9].

The ideas presented in this paper provide a multitude of interesting avenues for future research that will be discussed in detail. We hope that these will stimulate wide interest within the research community.

The rest of this article is structured as follows: Section I-A discusses potential benefits of DL for the physical layer. Section I-B presents related work. Background on DL is presented in Section II. In Section III, several DL applications for communications are presented. Section IV contains an overview and discussion of open problems and key areas of future investigation. Section V concludes the article.

A. Potential of DL for the Physical Layer

Apart from the intellectual beauty of a fully "learned" communications system, there are some reasons why DL could provide gains over existing physical layer algorithms.

First, most signal processing algorithms in communications have solid foundations in statistics and information theory and are often provably optimal for mathematically tractable models. These are generally linear, stationary, and have Gaussian statistics. A practical system, however, has many imperfections and non-linearities [4] (e.g., non-linear power amplifiers (PAs), finite-resolution quantization) that can only be approximately captured by such models. For this reason, a DL-based communications system (or processing block) that does not require a mathematically tractable model and that can be optimized for a specific hardware configuration and channel might be able to better compensate for such imperfections.

Second, one of the guiding principles of communications systems design is to split the signal processing into a chain of multiple independent blocks, each executing a well-defined and isolated function (e.g., source/channel coding, modulation, channel estimation, equalization). Although this approach has led to the efficient, versatile, and controllable systems we have today, it is not clear that individually optimized processing blocks achieve the best possible end-to-end performance. For example, the separation of source and channel coding for many practical channels and short block lengths (see [10] and references therein) as well as separate coding and modulation [11] are known to be sub-optimal. Attempts to jointly optimize each of these components, e.g., based on factor graphs [12], provide gains but lead to unwieldy and computationally complex systems. A learned end-to-end communications system, however, does not require such a rigid modular structure, as it is optimized for end-to-end performance.

Third, it has been shown that NNs are universal function approximators [13], and recent work has shown a remarkable capacity for algorithmic learning with recurrent NNs [14], which are known to be Turing-complete [15]. Since the execution of NNs can be highly parallelized on concurrent architectures and implemented with low-precision data types [16], there is evidence that "learned" algorithms taking this form could be executed faster and at lower energy cost than their manually "programmed" counterparts.

Fourth, massively parallel processing architectures with distributed memory, such as graphics processing units (GPUs) but also increasingly specialized chips for NN inference (e.g., [17]), have been shown to be very energy efficient and capable of impressive computational throughput when fully utilized by concurrent algorithms [18]. The performance of such architectures, however, has been largely limited by the ability of algorithms and higher-level programming languages to make efficient use of them. The inherently concurrent nature of computation and memory access across wide and deep NNs has demonstrated a surprising ability to readily achieve high resource utilization on these architectures with minimal application-specific tuning or optimization required.

B. Historical Context and Related Work

Applications of ML in communications have a long history covering a wide range of applications. These comprise channel modeling and prediction, localization, equalization, decoding, quantization, compression, demodulation, modulation recognition, and spectrum sensing, to name a few [19], [20] (and references therein). However, to the best of our knowledge and due to the reasons mentioned above, few of these applications have been commonly adopted or led to wide commercial success. It is also interesting that essentially all of these applications focus on individual receiver processing tasks alone, while the consideration of the transmitter or of a full end-to-end system is entirely missing in the literature.

The advent of open-source DL libraries (see Section II-B) and readily available specialized hardware, along with the astonishing progress of DL in computer vision, have stimulated renewed interest in the application of DL for communications and networking [21] (and other papers in the special issue). There are currently two main approaches to applying DL to the physical layer: either improve/augment parts of existing algorithms with DL, or completely replace them.

Among the papers trying to improve existing algorithms are [22]–[24], which consider belief propagation channel decoding and multiple-input multiple-output (MIMO) detection. These works are inspired by the idea of deep unfolding [25] of existing iterative algorithms, essentially interpreting each iteration as a set of NN layers. In a similar manner, [26] aims at improving the solution of sparse linear inverse problems with DL.

The approach of replacing existing algorithms with DL is adopted in [27], dealing with blind detection for MIMO systems with low-resolution quantization, and in [28], which studies detection for molecular communications, for which no mathematical channel model exists. The idea of learning to solve complex optimization tasks for wireless resource allocation is investigated in [29]. Some of us have obtained initial results in the area of learned end-to-end communications systems [30] and considered the problems
of modulation recognition [31], signal compression [32], and channel decoding [33], [34] with state-of-the-art DL tools.

Notations: We use boldface upper- and lower-case letters to denote matrices and column vectors, respectively. For a vector x, x_i denotes its i-th element, ‖x‖ its Euclidean norm, x^T its transpose, and x ⊙ y the element-wise product with y. For a matrix X, X_ij or [X]_ij denotes the (i, j)-element. R and C denote the sets of real and complex numbers, respectively. N(m, R) and CN(m, R) are the multivariate Gaussian and complex Gaussian distributions with mean vector m and covariance matrix R, respectively. Bern(α) is the Bernoulli distribution with success probability α, and ∇ is the gradient operator.

[TABLE I: List of layer types.]

[TABLE II: List of activation functions.]
Fig. 2. A communications system over an AWGN channel represented as an autoencoder. The input s is encoded as a one-hot vector, the output is a probability distribution over all possible messages, from which the most likely is picked as output ŝ.

…whose output p ∈ (0, 1)^M is a probability vector over all possible messages. The decoded message ŝ then corresponds to the index of the element of p with the highest probability. (In our experiments [44], the one-hot encoded input and the first dense layer were replaced by an embedding that turns message indices into vectors, and the loss function was replaced by the sparse categorical cross-entropy, which accepts message indices rather than one-hot vectors as labels.)
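As a concrete illustration, such an end-to-end autoencoder can be written in a few lines of Keras [43]. The sketch below is a minimal version, not an excerpt of our published source code [44]: the layer sizes, the fixed training Eb/N0 of 7 dB, and the energy normalization are illustrative assumptions.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    k, n = 4, 7                      # a (7,4) system: k bits over n channel uses
    M = 2 ** k                       # number of messages
    R = k / n                        # communication rate
    ebn0 = 10 ** (7.0 / 10)          # training Eb/N0 of 7 dB (illustrative choice)
    noise_std = np.sqrt(1 / (2 * R * ebn0))

    s = layers.Input(shape=(M,))                    # one-hot encoded message
    h = layers.Dense(M, activation="relu")(s)       # transmitter
    x = layers.Dense(n, activation=None)(h)
    x = layers.Lambda(lambda v: np.sqrt(n) * tf.math.l2_normalize(v, axis=1))(x)  # energy constraint
    y = layers.GaussianNoise(noise_std)(x)          # AWGN channel (noise active during training)
    h2 = layers.Dense(M, activation="relu")(y)      # receiver
    p = layers.Dense(M, activation="softmax")(h2)   # probability vector over all messages

    autoencoder = Model(s, p)
    autoencoder.compile(optimizer="adam", loss="categorical_crossentropy")
    msgs = np.eye(M)[np.random.randint(M, size=50000)]  # randomly drawn training messages
    autoencoder.fit(msgs, msgs, epochs=10, batch_size=256, verbose=0)

After training, the layers up to the noise layer serve as the transmitter and the remaining layers as the receiver; ŝ is obtained as the argmax of p.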
Fig. 3. BLER versus Eb/N0 for the autoencoder and several baseline communication schemes.
[TABLE IV: Layout of the autoencoder used in Figs. 3a and 3b. It has (2M + 1)(M + n) + 2M trainable parameters, resulting in 62, 791, and 135,944 parameters for the (2,2), (7,4), and (8,8) autoencoders, respectively.]

Fig. 6. BLER versus Eb/N0 for the two-user interference channel achieved by the autoencoder (AE) and 2^(2k/n)-QAM time-sharing (TS) for different parameters (n, k).
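The parameter counts stated in the caption of Table IV follow directly from the given formula; the short check below is our own arithmetic, not part of the paper's code.

    for n, k in [(2, 2), (7, 4), (8, 8)]:
        M = 2 ** k
        params = (2 * M + 1) * (M + n) + 2 * M
        print(f"({n},{k}): {params} trainable parameters")
    # prints 62, 791, and 135944, matching the caption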
Fig. 8. A radio receiver represented as an RTN. The input y first runs through a parameter estimation network gω(y), has a known transform t(y, ω) applied to generate the canonicalized signal ȳ, and is then classified by the discriminative network g(ȳ) to produce the output ŝ.
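A minimal sketch of such an RTN in Keras is shown below. For illustration it assumes the known transform t(y, ω) is a simple phase de-rotation by a single estimated angle ω; the layer widths are placeholders, not the architecture used to produce Figs. 9 and 10.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    n, M = 8, 16                         # illustrative dimensions
    y_in = layers.Input(shape=(2 * n,))  # received signal, real and imaginary parts stacked

    # parameter estimation network g_omega(y): predicts a phase offset omega
    h = layers.Dense(32, activation="relu")(y_in)
    omega = layers.Dense(1, activation=None)(h)

    # known transform t(y, omega): de-rotate y by exp(-j*omega)
    def derotate(args):
        y, w = args
        re, im = y[:, :n], y[:, n:]
        c, s = tf.cos(w), tf.sin(w)
        return tf.concat([c * re + s * im, c * im - s * re], axis=1)

    y_bar = layers.Lambda(derotate)([y_in, omega])   # canonicalized signal

    # discriminative network g(y_bar): classifies the transmitted message
    h2 = layers.Dense(32, activation="relu")(y_bar)
    s_hat = layers.Dense(M, activation="softmax")(h2)

    rtn = Model(y_in, s_hat)  # trained end-to-end with categorical cross-entropy

Because the transform is differentiable, gradients flow through t(y, ω) into the estimator, so both networks are trained jointly from the end-to-end loss alone.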
Fig. 9. BLER versus Eb/N0 for various communication schemes over a channel with L = 3 Rayleigh fading taps.

Fig. 10. Autoencoder training loss with and without RTN.
[TABLE V: Layout of the CNN for modulation classification with 324,330 trainable parameters.]
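Since the body of Table V did not survive extraction, the sketch below indicates only the general shape of such a network, a small CNN operating on a 2 × 128 window of raw IQ samples; the layer widths are illustrative and do not reproduce the exact 324,330-parameter layout.

    from tensorflow.keras import layers, models

    num_classes = 10  # number of modulation classes (illustrative)
    cnn = models.Sequential([
        layers.Input(shape=(2, 128, 1)),              # I and Q rows of a sample window
        layers.Conv2D(64, (1, 3), padding="same", activation="relu"),
        layers.Conv2D(16, (2, 3), padding="valid", activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                          # regularization, cf. [36]
        layers.Dense(num_classes, activation="softmax"),
    ])
    cnn.compile(optimizer="adam", loss="categorical_crossentropy",
                metrics=["accuracy"])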
recognition, and natural language processing domains (e.g., MNIST [http://yann.lecun.com/exdb/mnist/] or ImageNet [http://www.image-net.org/]), nothing comparable exists for communications. This domain is somewhat different because it deals with inherently man-made signals that can be accurately generated synthetically, allowing the possibility of standardizing data generation routines rather than just datasets in some cases. It would also be desirable to establish a set of common problems and corresponding datasets (or data-generating software) on which researchers can benchmark and compare their algorithms. One such example task is modulation classification in Section III-D; others could include mapping of impaired received IQ samples or symbols to codewords or bits. Even "autoencoder competitions" could be held for a standardized set of benchmark impairments, taking the form of canonical "impairment layers" that would need to be made available for some of the major DL libraries (see Section II-B).

B. Data Representation, Loss Functions, and Training SNR

As DL for communications is a new field, little is known about optimal data representations, loss functions, and training strategies. For example, binary signals can be represented as binary or one-hot vectors, modulated (complex) symbols, or integers, and the optimal representation might depend, among other factors, on the NN architecture, learning objective, and loss function. In decoding problems, for instance, one would have the choice between plain channel observations or (clipped) log-likelihood ratios. In general, it seems that there is a representation which is most suited to solve a particular task via an NN.

Similarly, it is not obvious at which SNR(s) DL processing blocks should be trained. It is clearly desirable that a learned system operates at any SNR, regardless of the SNR or SNR range at which it was trained. However, we have observed that this is generally not the case. Training at low SNR, for instance, does not allow for the discovery of structure that is important in higher-SNR scenarios. Training across wide ranges of SNR can also severely affect training time. George and Huerta [58] have observed that starting off the training at high SNR and then gradually lowering it with each epoch led to significant performance improvements for their application.
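One way to implement such an SNR schedule is sketched below. It assumes a helper build_autoencoder(noise_std), e.g., wrapping the autoencoder construction shown earlier; the schedule endpoints and the one-epoch-per-stage loop are our illustrative choices, not a recipe taken from [58].

    import numpy as np

    k, n = 4, 7
    weights = None
    for ebn0_db in np.linspace(10.0, 0.0, num=10):   # start at high SNR, lower gradually
        noise_std = np.sqrt(1 / (2 * (k / n) * 10 ** (ebn0_db / 10)))
        model = build_autoencoder(noise_std)         # hypothetical helper (see earlier sketch)
        if weights is not None:
            model.set_weights(weights)               # warm-start from the previous stage
        msgs = np.eye(2 ** k)[np.random.randint(2 ** k, size=20000)]
        model.fit(msgs, msgs, epochs=1, batch_size=256, verbose=0)
        weights = model.get_weights()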
A related question is the optimal choice of loss function. In Sections III-A–III-C, we have treated communications as a classification problem for which the categorical cross-entropy is a common choice. However, for alternate output data representations, the choice is less obvious. Applying an inappropriate loss function can lead to poor results.

The choice of NN architecture and of training parameters for SGD (such as mini-batch size and learning rate) raises further important practical questions for which no satisfying hard rules exist. Some guidelines can be found in [35, Ch. 11], but methods for selecting such hyper-parameters are currently an active area of research and investigation in the DL world. Examples include architecture search guided by hyper-gradients and differential hyper-parameters [59] as well as genetic algorithm or particle swarm style optimization [60].

C. Complex-Valued Neural Networks

Owing to the widely used complex baseband representation, we typically deal with complex numbers in communications. Most related signal processing algorithms rely on phase rotations, complex conjugates, absolute values, etc. For this reason, it would be desirable to have NNs operate on complex rather than real numbers [61]. However, none of the previously described DL libraries (see Section II-B) currently supports this, for several reasons. First, it is possible to represent all mathematical operations in the complex domain with a purely real-valued NN of twice the size, i.e., each complex number is simply represented by two real values. For example, an NN with a scalar complex input and output connected through a single complex weight, i.e., y = wx, where y, w, x ∈ C, can be represented as a real-valued NN y′ = W′x′, where the vectors y′, x′ ∈ R^2 contain the real and imaginary parts of y and x, and W′ ∈ R^{2×2} is a weight matrix. Note that the real-valued version of this NN has four parameters while the complex-valued version has only two. Second, a complication arises in complex-valued NNs since traditional loss and activation functions are generally not holomorphic, so that their gradients are not defined. A solution to this problem is Wirtinger calculus [62]. Although complex-valued NNs might be simpler to train and consume less memory, we currently believe that they do not provide any significant advantage in terms of expressive power. Nevertheless, we keep them as an interesting topic for future research.
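The equivalence described above is easy to verify numerically. In the short example below (our own illustration), the complex multiplication y = wx is reproduced by a 2 × 2 real matrix acting on the stacked real and imaginary parts of x.

    import numpy as np

    w = 0.7 - 1.2j
    x = 1.5 + 0.4j

    # real-valued equivalent of multiplication by the complex weight w
    W = np.array([[w.real, -w.imag],
                  [w.imag,  w.real]])
    y = W @ np.array([x.real, x.imag])

    assert np.allclose(y, [(w * x).real, (w * x).imag])
    # W has four entries but only two free parameters (Re(w) and Im(w)),
    # matching the parameter count comparison made in the text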
D. ML-Augmented Signal Processing

The biggest challenge of learned end-to-end communications systems is that of scalability to large message sets M. Already for k = 100 bits, i.e., M = 2^100 possible messages, the training complexity is prohibitive, since the autoencoder must see at least every message once. Naive neural channel decoders (as studied in [33]) also suffer from this "curse of dimensionality", since they need to be trained on all possible codewords. Thus, rather than switching immediately to learned end-to-end communications systems or fully replacing certain algorithms by NNs, a more gradual approach might be that of augmenting only specific sub-tasks with DL. A very interesting approach in this context is the deep unfolding of existing iterative algorithms outlined in [25]. This approach offers the potential to leverage additional side information from training data to improve an existing signal processing algorithm. It has recently been applied in the context of channel decoding and MIMO detection [22]–[24]. For instance, in [22] it was shown that training with a single codeword is sufficient, since the structure of the code is embedded in the NN through the Tanner graph. The concept of RTNs as presented in Section III-C is another way of incorporating both side information from existing models and information derived from a rich dataset into a DL algorithm, to improve performance while reducing model and training complexity.
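As a toy illustration of deep unfolding, the sketch below unrolls a few gradient descent iterations for a linear model y = Ax + n into layers with learnable step sizes. This is a generic example in the spirit of [25], not the decoder or detector architectures of [22]–[24].

    import tensorflow as tf

    class UnfoldedGD(tf.keras.Model):
        """T unrolled iterations of x <- x - alpha_t * A^T (A x - y)."""
        def __init__(self, A, T=5):
            super().__init__()
            self.A = tf.constant(A, dtype=tf.float32)           # known measurement matrix
            self.alphas = [tf.Variable(0.1, name=f"alpha_{t}")  # learnable step sizes
                           for t in range(T)]

        def call(self, y):
            x = tf.zeros_like(tf.matmul(y, self.A))             # start from x = 0
            for alpha in self.alphas:
                residual = tf.matmul(x, self.A, transpose_b=True) - y  # A x - y
                x = x - alpha * tf.matmul(residual, self.A)             # gradient step
            return x

Each iteration becomes a layer, and the step sizes (and, if desired, the matrices themselves) are learned from data while the algorithm's overall structure is retained.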
E. System Identification for End-to-End Learning

In Sections III-A–III-C, we have tacitly assumed that the transfer function of the channel is known so that the
backpropagation algorithm can compute its gradient. For example, for a Rayleigh fading channel, the autoencoder needs to know during the training phase the exact realization of the channel coefficients to compute how a slight change in the transmitted signal x impacts the received signal y. While this is straightforward for simulated systems, it poses a major challenge for end-to-end learning over real channels and hardware. In essence, the hardware and channel together form a black box whose input and output can be observed, but for which no exact analytic expression is known a priori. Constructing a model for a black box from data is called system identification [63], which is widely used in control theory. Transfer learning [64] is one appealing candidate for adapting an end-to-end communications system trained on a statistical model to a real-world implementation, an approach which has worked well in other domains (e.g., computer vision). An important related question is how one can learn a general model for a wide range of communication scenarios and tasks that would avoid retraining from scratch for every individual setting.

F. Feedback, CSI, and Privacy

Feedback is a fundamental enabler in many adaptive communications systems (e.g., adaptive modulation and coding (AMC)), while accurate channel state information (CSI) is needed for multi-user MIMO communications. For this reason, current cellular communication systems invest significant resources (energy and time) in the acquisition and feedback of CSI. In this work, we consider only learned physical layer schemes that can cope with, but do not adapt to, time-varying channels. However, in [50], the concept of learning an information encoder and decoder which adapt to a variable key to provide privacy is introduced. A similar approach could be used to provide privacy at the physical layer and to extend the channel autoencoder to support learning schemes with CSI feedback. Finally, CSI is generally only used for precoding and mode selection, rather than being further analyzed to extract information. Storing and analyzing large amounts of CSI (or other radio data), possibly enriched with location information, poses significant potential for revealing novel big-data-driven physical-layer understanding algorithms beyond immediate radio environment needs. New applications outside the traditional scope of communications, such as tracking and identification of humans (through walls) [65] as well as gesture and emotion recognition [66], could be achieved using ML on radio signals.

V. CONCLUSION

We have discussed several promising new applications of DL to the physical layer. Most importantly, we have introduced a new way of thinking about communications as an end-to-end reconstruction optimization task, using autoencoders to jointly learn transmitter and receiver implementations as well as signal encodings without any prior knowledge. Comparisons with traditional baselines in various scenarios reveal extremely competitive BLER performance, although the scalability to long block lengths remains a challenge. Apart from potential performance improvements in terms of reliability or latency, our approach can provide interesting insight about the optimal communication schemes (e.g., constellations) in scenarios where the optimal schemes are unknown (e.g., the interference channel). We believe that this is the beginning of a wide range of studies into DL and ML for communications, and we are excited at the possibilities this could lend towards future wireless communications systems as the field matures. For now, there are a great number of open problems to solve and practical gains to be had. We have identified important key areas of future investigation and highlighted the need for benchmark problems and data sets that can be used to compare the performance of different ML models and algorithms.

REFERENCES

[1] T. S. Rappaport, Wireless Communications: Principles and Practice, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2002.
[2] R. M. Gagliardi and S. Karp, Optical Communications, 2nd ed. New York, NY, USA: Wiley, 1995.
[3] H. Meyr, M. Moeneclaey, and S. A. Fechtel, Digital Communication Receivers: Synchronization, Channel Estimation, and Signal Processing. New York, NY, USA: Wiley, 1998.
[4] T. Schenk, RF Imperfections in High-Rate Wireless Systems: Impact and Digital Compensation. Dordrecht, The Netherlands: Springer, 2008.
[5] J. Proakis and M. Salehi, Digital Communications, 5th ed. Boston, MA, USA: McGraw-Hill Educ., 2007.
[6] Y. LeCun, "Generalization and network design strategies," in Connectionism in Perspective. Amsterdam, The Netherlands: North-Holland, 1989, pp. 143–155.
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis., Santiago, Chile, 2015, pp. 1026–1034.
[8] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. IEEE Int. Conf. Comput. Vis., 1999, pp. 1150–1157.
[9] Z. S. Harris, "Distributional structure," Word, vol. 10, nos. 2–3, pp. 146–162, 1954.
[10] A. Goldsmith, "Joint source/channel coding for wireless channels," in Proc. IEEE Veh. Technol. Conf., vol. 2, Chicago, IL, USA, 1995, pp. 614–618.
[11] E. Zehavi, "8-PSK trellis codes for a Rayleigh channel," IEEE Trans. Commun., vol. 40, no. 5, pp. 873–884, May 1992.
[12] H. Wymeersch, Iterative Receiver Design, vol. 234. Cambridge, U.K.: Cambridge Univ. Press, 2007.
[13] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Netw., vol. 2, no. 5, pp. 359–366, 1989.
[14] S. Reed and N. de Freitas, "Neural programmer-interpreters," arXiv preprint arXiv:1511.06279, 2015.
[15] H. T. Siegelmann and E. D. Sontag, "On the computational power of neural nets," in Proc. 5th Annu. Workshop Comput. Learn. Theory, Pittsburgh, PA, USA, 1992, pp. 440–449.
[16] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Proc. Deep Learn. Unsupervised Feature Learn. NIPS Workshop, vol. 1, 2011, p. 4.
[17] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[18] R. Raina, A. Madhavan, and A. Y. Ng, "Large-scale deep unsupervised learning using graphics processors," in Proc. Int. Conf. Mach. Learn. (ICML), Montreal, QC, Canada, 2009, pp. 873–880.
[19] M. Ibnkahla, "Applications of neural networks to digital communications—A survey," Elsevier Signal Process., vol. 80, no. 7, pp. 1185–1215, 2000.
[20] M. Bkassiny, Y. Li, and S. K. Jayaweera, "A survey on machine-learning techniques in cognitive radios," IEEE Commun. Surveys Tuts., vol. 15, no. 3, pp. 1136–1159, 3rd Quart., 2013.
[21] M. Zorzi, A. Zanella, A. Testolin, M. D. F. De Grazia, and M. Zorzi, "Cognition-based networks: A new perspective on network optimization using learning and distributed intelligence," IEEE Access, vol. 3, pp. 1512–1530, 2015.
[22] E. Nachmani, Y. Be'ery, and D. Burshtein, "Learning to decode linear codes using deep learning," in Proc. IEEE Annu. Allerton Conf. Commun. Control Comput. (Allerton), Monticello, IL, USA, 2016, pp. 341–346.
[23] E. Nachmani, E. Marciano, D. Burshtein, and Y. Be'ery, "RNN decoding of linear block codes," arXiv preprint arXiv:1702.07560, 2017.
[24] N. Samuel, T. Diskin, and A. Wiesel, "Deep MIMO detection," arXiv preprint arXiv:1706.01151, 2017.
[25] J. R. Hershey, J. L. Roux, and F. Weninger, "Deep unfolding: Model-based inspiration of novel deep architectures," arXiv preprint arXiv:1409.2574, 2014.
[26] M. Borgerding and P. Schniter, "Onsager-corrected deep learning for sparse linear inverse problems," in Proc. IEEE Glob. Conf. Signal Inf. Process. (GlobalSIP), Washington, DC, USA, 2016, pp. 227–231.
[27] Y.-S. Jeon, S.-N. Hong, and N. Lee, "Blind detection for MIMO systems with low-resolution ADCs using supervised learning," in Proc. IEEE Int. Conf. Commun. (ICC), Paris, France, 2017, pp. 1–6.
[28] N. Farsad and A. Goldsmith, "Detection algorithms for communication systems using deep learning," arXiv preprint arXiv:1705.08044, 2017.
[29] H. Sun et al., "Learning to optimize: Training deep neural networks for wireless resource management," arXiv preprint arXiv:1705.09412, 2017.
[30] T. J. O'Shea, K. Karra, and T. C. Clancy, "Learning to communicate: Channel auto-encoders, domain specific regularizers, and attention," in Proc. IEEE Int. Symp. Signal Process. Inf. Technol. (ISSPIT), Limassol, Cyprus, 2016, pp. 223–228.
[31] T. J. O'Shea, J. Corgan, and T. C. Clancy, "Convolutional radio modulation recognition networks," in Proc. Int. Conf. Eng. Appl. Neural Netw., Aberdeen, U.K., 2016, pp. 213–226.
[32] T. J. O'Shea, J. Corgan, and T. C. Clancy, "Unsupervised representation learning of structured radio communication signals," in Proc. IEEE Int. Workshop Sens. Process. Learn. Intell. Mach. (SPLINE), Aalborg, Denmark, 2016, pp. 1–5.
[33] T. Gruber, S. Cammerer, J. Hoydis, and S. T. Brink, "On deep learning-based channel decoding," in Proc. IEEE 51st Annu. Conf. Inf. Sci. Syst. (CISS), Baltimore, MD, USA, 2017, pp. 1–6.
[34] S. Cammerer, T. Gruber, J. Hoydis, and S. T. Brink, "Scaling deep learning-based decoding of polar codes via partitioning," arXiv preprint arXiv:1702.06901, 2017.
[35] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[36] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[37] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Mach. Learn. (ICML), Haifa, Israel, 2010, pp. 807–814.
[38] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, Orlando, FL, USA, 2014, pp. 675–678.
[39] T. Chen et al., "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," arXiv preprint arXiv:1512.01274, 2015.
[40] M. Abadi et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. [Online]. Available: http://tensorflow.org/
[41] R. Al-Rfou et al., "Theano: A Python framework for fast computation of mathematical expressions," arXiv preprint arXiv:1605.02688, 2016.
[42] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A MATLAB-like environment for machine learning," in Proc. BigLearn NIPS Workshop, 2011, pp. 1–6.
[43] F. Chollet. (2015). Keras. [Online]. Available: https://github.com/fchollet/keras
[44] T. O'Shea and J. Hoydis. (2017). Source Code. [Online]. Available: https://github.com/radioml/introdlphy/
[45] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[46] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, 2015, pp. 1–15.
[47] Y. N. Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Montreal, QC, Canada, 2014, pp. 2933–2941.
[48] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
[49] I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, pp. 2672–2680.
[50] M. Abadi and D. G. Andersen, "Learning to protect communications with adversarial neural cryptography," arXiv preprint arXiv:1610.06918, 2016.
[51] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015, pp. 2017–2025.
[52] J. Estaran et al., "Artificial neural networks for linear and non-linear impairment mitigation in high-baudrate IM/DD systems," in Proc. 42nd Eur. Conf. Opt. Commun. (ECOC), Düsseldorf, Germany, 2016, pp. 1–3.
[53] A. K. Nandi and E. E. Azzouz, "Algorithms for automatic modulation recognition of communication signals," IEEE Trans. Commun., vol. 46, no. 4, pp. 431–436, Apr. 1998.
[54] A. Fehske, J. Gaeddert, and J. H. Reed, "A new approach to signal classification using spectral correlation and neural networks," in Proc. IEEE Int. Symp. New Front. Dyn. Spectr. Access Netw. (DySPAN), Baltimore, MD, USA, 2005, pp. 144–150.
[55] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, 2015.
[56] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Feb. 2011.
[57] A. Abdelmutalab, K. Assaleh, and M. El-Tarhuni, "Automatic modulation classification based on high order cumulants and hierarchical polynomial classifiers," Phys. Commun., vol. 21, pp. 10–18, Dec. 2016.
[58] D. George and E. A. Huerta, "Deep neural networks to enable real-time multimessenger astrophysics," arXiv preprint arXiv:1701.00008, 2016.
[59] D. Maclaurin, D. Duvenaud, and R. P. Adams, "Gradient-based hyperparameter optimization through reversible learning," in Proc. 32nd Int. Conf. Mach. Learn. (ICML), Lille, France, 2015, pp. 2113–2122.
[60] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," J. Mach. Learn. Res., vol. 13, pp. 281–305, Jan. 2012.
[61] A. Hirose, Complex-Valued Neural Networks. Heidelberg, Germany: Springer, 2006.
[62] M. F. Amin, M. I. Amin, A. Y. H. Al-Nuaimi, and K. Murase, "Wirtinger calculus based gradient descent and Levenberg–Marquardt learning algorithms in complex-valued neural networks," in Proc. Int. Conf. Neural Inf. Process., Shanghai, China, 2011, pp. 550–559.
[63] G. C. Goodwin and R. L. Payne, Dynamic System Identification: Experiment Design and Data Analysis. New York, NY, USA: Academic Press, 1977.
[64] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[65] F. Adib, C.-Y. Hsu, H. Mao, D. Katabi, and F. Durand, "Capturing the human figure through a wall," ACM Trans. Graph., vol. 34, no. 6, p. 219, 2015.
[66] M. Zhao, F. Adib, and D. Katabi, "Emotion recognition using wireless signals," in Proc. ACM Annu. Int. Conf. Mobile Comput. Netw., New York, NY, USA, 2016, pp. 95–108.

Timothy O'Shea (S'05–M'08–SM'13) received the M.S. degree in electrical engineering from North Carolina State University in 2007. He is currently pursuing the Ph.D. degree in electrical engineering with Virginia Tech, where he is a Research Associate. He is the Founder and CTO of DeepSig Inc., Arlington, VA, USA. He was an Engineering Researcher with the UMIACS affiliated research center at the University of Maryland. He has been a Core Contributor and a Technical Advisor to the GNU Radio Project since 2006. His research interests include the application of machine learning and deep learning to software radio and novel applications in signal processing and synthesis of radio communications and sensing systems.

Jakob Hoydis (S'08–M'12) received the Diploma (Dipl.-Ing.) degree in electrical engineering and information technology from RWTH Aachen University, Germany, in 2008, and the Ph.D. degree from Supélec, Gif-sur-Yvette, France, in 2012. He is a Technical Staff Member with Nokia Bell Labs, France, where he is investigating applications of deep learning for the physical layer. He was the Co-Founder and CTO of the social network SPRAED and worked for Alcatel-Lucent Bell Labs, Stuttgart, Germany. His research interests are in the areas of machine learning, cloud computing, SDR, large random matrix theory, information theory, and signal processing, and their applications to wireless communications. He was a recipient of the 2012 Publication Prize of the Supélec Foundation, the 2013 VDE ITG Förderpreis, the 2015 Leonard G. Abraham Prize of the IEEE ComSoc, and the WCNC 2014 Best Paper Award, and was nominated as an Exemplary Reviewer 2012 for the IEEE COMMUNICATIONS LETTERS.